
Data Regularisation

This tutorial introduces the usage of SaQC methods to obtain regularly sampled data derivatives from given time series input. Regularly sampled time series data is data that exhibits a constant temporal spacing between subsequent data points.

Why

Measurement data often does not come as a regularly sampled time series. The reasons why one usually wants time series data with a constant temporal gap size between subsequent measurements are manifold.

The first and perhaps most important one is that statistics, such as mean and standard deviation, usually presuppose that the set of data points they are computed from is equally weighted.

The second reason is that relating data from different sources to one another is impossible without a mapping that relates the different date time indices to each other. One easy and intuitive way of constructing such a mapping is to resample all data at the same (regular) timestamps.
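
For instance, once two irregularly sampled series are resampled onto the same regular grid, they can be compared index by index. A minimal pandas sketch, using made-up toy data:

import pandas as pd

# two series from different sources with unrelated, irregular timestamps
a = pd.Series([1.0, 2.0], index=pd.to_datetime(["2021-01-01 00:03", "2021-01-01 00:12"]))
b = pd.Series([5.0, 6.0], index=pd.to_datetime(["2021-01-01 00:05", "2021-01-01 00:14"]))

# resampling both onto a common 10 minutes grid makes them comparable
joined = pd.concat({"a": a.resample("10min").mean(), "b": b.resample("10min").mean()}, axis=1)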

Tutorial data

The following dataset of Soil Moisture measurements may serve as an example data set:

Let's import it via:

import pandas as pd

# read the csv; the first column holds the "Date Time" index
data = pd.read_csv(data_path, index_col=0)
data.index = pd.DatetimeIndex(data.index)

Now let's check out the imported data's timestamps:

>>> data

                     SoilMoisture
Date Time                        
2021-01-01 00:09:07     23.429701
2021-01-01 00:18:55     23.431900
2021-01-01 00:28:42     23.343100
2021-01-01 00:38:30     23.476400
2021-01-01 00:48:18     23.343100
                           ...
2021-03-20 07:13:49    152.883102
2021-03-20 07:26:16    156.587906
2021-03-20 07:40:37    166.146194
2021-03-20 07:54:59    164.690598
2021-03-20 08:40:41    155.318893
[10607 rows x 1 columns]

The data series seems to start with a sampling rate of roughly 10 minutes. Somewhere along the way, the sampling rate changes, and towards the end, the series seems to exhibit an intended sampling rate of 15 minutes.

Finding out the proper sampling rate a series should be regularized to is a subject of its own and won't be covered here. Usually, the intended sampling rate of sensor data is known from the specification of the sensor.

If that is not the case, and there seems to be more than one candidate for the regularisation rate, a rough rule of thumb, aiming at minimal data loss and data manipulation, is to go for the smallest rate seemingly present in the data.
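
To get an overview of the rates actually present in the data, one can inspect the distribution of the gaps between subsequent timestamps. A plain pandas sketch:

# gap sizes between subsequent timestamps
gaps = data.index.to_series().diff()

# the most frequent gap sizes hint at the candidate sampling rates
print(gaps.value_counts().head())
print(gaps.median())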

Regularisations

So let's transform the measurement timestamps to have a regular 10 minutes frequency. In order to do so, we have to decide what to do with each timestamp's associated data when we alter the timestamp's value.

Basically, there are three types of :doc:regularisation <function_cats/regularisation> methods:

  1. We could keep the values as they are, and thus just shift them in time to match the equidistant 10 minutes frequency grid we want the data to exhibit.
  2. We could calculate new, synthetic data values for the regular timestamps via an interpolation method.
  3. We could apply some aggregation to up- or downsample the data.

Shift

Let's apply a simple shift via the :py:func:saqc.shift <Functions.saqc.shift> method.
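
The shift method is called on an SaQC object wrapping the data. In case no such object has been constructed yet, a minimal sketch, assuming the SaQC constructor of the installed version accepts the data frame directly:

from saqc import SaQC

# wrap the imported data frame into an SaQC object (assumed constructor signature)
saqc = SaQC(data=data)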

saqc = saqc.shift('SoilMoisture', target='SoilMoisture_bshift', freq='10min', method='bshift')

Target parameter

We selected a new target field to store the shifted data in, so that our original data wouldn't be overwritten.

Freq parameter

We passed the freq keyword the intended sampling frequency in terms of a date alias string. All of the :doc:regularisation <function_cats/regularisation> methods have such a frequency keyword, and it determines the sampling rate the resulting regular time series will exhibit.

Shifting Method

With the method keyword, we determined the direction of the shift. We passed it the string bshift, which applies a backwards shift: data points get shifted backwards in time until they match a timestamp that is a multiple of 10 minutes. (See the :py:func:saqc.shift <Functions.saqc.shift> documentation for more details on the keywords.)
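
Conceptually, a backwards shift maps every timestamp onto the preceding multiple of the target frequency. The floor method of pandas timestamps mirrors that mapping (an illustration, not the SaQC implementation):

import pandas as pd

ts = pd.Timestamp("2021-01-01 00:09:07")
# the preceding multiple of 10 minutes the value gets aligned with
print(ts.floor("10min"))  # 2021-01-01 00:00:00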

Let's see how the data is sampled now. For that, we use the raw output from the :py:meth:saqc.getResult <saqc.core.core.SaQC> method. This prevents the method's output from being merged into a single pandas.DataFrame object, so the changes from the resampling are easier to comprehend at one look:

Shifted data

>>> saqc = saqc.evaluate()
>>> data_result = saqc.getResult(raw=True)[0]
>>> data_result

             SoilMoisture_bshift |                              SoilMoisture | 
================================ | ========================================= | 
Date Time                        | Date Time                                 | 
2021-01-01 00:00:00    23.429701 | 2021-01-01 00:09:07             23.429701 | 
2021-01-01 00:10:00    23.431900 | 2021-01-01 00:18:55             23.431900 | 
2021-01-01 00:20:00    23.343100 | 2021-01-01 00:28:42             23.343100 | 
2021-01-01 00:30:00    23.476400 | 2021-01-01 00:38:30             23.476400 | 
2021-01-01 00:40:00    23.343100 | 2021-01-01 00:48:18             23.343100 | 
2021-01-01 00:50:00    23.298800 | 2021-01-01 00:58:06             23.298800 | 
2021-01-01 01:00:00    23.387400 | 2021-01-01 01:07:54             23.387400 | 
2021-01-01 01:10:00    23.343100 | 2021-01-01 01:17:41             23.343100 | 
2021-01-01 01:20:00    23.298800 | 2021-01-01 01:27:29             23.298800 | 
2021-01-01 01:30:00    23.343100 | 2021-01-01 01:37:17             23.343100 | 
                             ... |                                       ... | 
2021-03-20 07:20:00   156.587906 | 2021-03-20 05:07:02            137.271500 | 
2021-03-20 07:30:00          NaN | 2021-03-20 05:21:35            138.194107 | 
2021-03-20 07:40:00   166.146194 | 2021-03-20 05:41:59            154.116806 | 
2021-03-20 07:50:00   164.690598 | 2021-03-20 06:03:09            150.567505 | 
2021-03-20 08:00:00          NaN | 2021-03-20 06:58:10            145.027496 | 
2021-03-20 08:10:00          NaN | 2021-03-20 07:13:49            152.883102 | 
2021-03-20 08:20:00          NaN | 2021-03-20 07:26:16            156.587906 | 
2021-03-20 08:30:00          NaN | 2021-03-20 07:40:37            166.146194 | 
2021-03-20 08:40:00   155.318893 | 2021-03-20 07:54:59            164.690598 | 
[11286]                            [10607]     

We see the first and last few data points of both the original data time series and the shifted one.

Obviously, the shifted data series now exhibits a regular sampling rate of 10 minutes, with the index ranging from the latest timestamp that is a multiple of 10 minutes and precedes the initial timestamp of the original data, up to the first multiple of 10 minutes that succeeds the last timestamp of the original data. This is the default behavior of all the :doc:regularisations <../Functions/regularisation> provided by saqc.
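
Conceptually, the bounds of the resulting regular index correspond to flooring the first and ceiling the last original timestamp to the target frequency (a plain pandas illustration):

# latest 10 minutes multiple preceding the first original timestamp
start = data.index[0].floor("10min")  # 2021-01-01 00:00:00

# earliest 10 minutes multiple succeeding the last original timestamp
end = data.index[-1].ceil("10min")    # 2021-03-20 08:50:00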

Data Loss and Empty Intervals

The number of data points (displayed at the bottom of the table columns) has changed through the transformation as well. That change mainly stems from two sources:

Empty Intervals

If an interval of the passed frequency contains no valid data point that could be shifted to match a multiple of the frequency, a NaN value gets inserted, representing the fact that data is missing in the interval associated with that date time index.
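
The effect can be mimicked in plain pandas (an illustration, not the SaQC implementation). In the sketch below, the middle 10 minutes interval contains no value and therefore resolves to NaN:

import pandas as pd

s = pd.Series(
    [1.0, 2.0],
    index=pd.to_datetime(["2021-01-01 00:01:00", "2021-01-01 00:21:00"]),
)

# the interval from 00:10:00 to 00:20:00 holds no value and yields NaN
print(s.resample("10min").first())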

Valid Data

Data points are referred to as valid, in the context of a regularisation, if:

  1. the data point's value is not NaN

  2. the flag of that data point has a value lower than the value passed to the method's to_mask keyword - since this keyword defaults to the highest flag level available, by default all data flagged :py:const:~saqc.constants.BAD is considered invalid by that method.

Note that, from point 2 above, it follows that flagging data values before regularisation will effectively exclude them from the regularisation process. See the chapter flagging and resampling for an example of this effect and how it can help control data reduction. A minimal sketch of the pattern is shown below.
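
The sketch flags out-of-range values first, so they do not enter the subsequent shift (assuming a range flagging method such as flagRange is available; the bounds are hypothetical):

# flag values outside a plausible measurement range (hypothetical bounds)
saqc = saqc.flagRange('SoilMoisture', min=0, max=100)

# flagged values count as invalid and get excluded from the shift
saqc = saqc.shift('SoilMoisture', target='SoilMoisture_bshift', freq='10min', method='bshift')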

Data Reduction

If multiple values are present within an interval whose size corresponds to the frequency alias passed to freq, these values get reduced to a single value that is assigned to the timestamp associated with the interval.

This reduction depends on the selected :doc:regularisation <../function_cats/regularisation> method.

For example, above we applied a backwards :py:func:shift <Functions.saqc.shift> with a 10 minutes frequency. As a result, the first value encountered after any multiple of 10 minutes gets shifted backwards to be aligned with the desired frequency, and any other value in that 10 minutes interval gets discarded.

See the chunk of our processed SoilMoisture data set below to get an idea of the effect. There are two measurements within the 10 minutes interval ranging from 2021-01-01 07:30:00 to 2021-01-01 07:40:00 in the original data - and only the first of the two reappears in the shifted data set as the representative of that interval.

>>> data_result['2021-01-01T07:00:00':'2021-01-01T08:00:00']

             SoilMoisture_bshift |                              SoilMoisture |
================================ | ========================================= |
Date Time                        | Date Time                                 |
2021-01-01 07:00:00      23.3431 | 2021-01-01 07:00:41               23.3431 |
2021-01-01 07:10:00      23.3431 | 2021-01-01 07:10:29               23.3431 |
2021-01-01 07:20:00      23.2988 | 2021-01-01 07:20:17               23.2988 |
2021-01-01 07:30:00      23.3874 | 2021-01-01 07:30:05               23.3874 |
2021-01-01 07:40:00      23.3431 | 2021-01-01 07:39:53               23.3853 |
2021-01-01 07:50:00      23.3874 | 2021-01-01 07:49:41               23.3431 |

Minimize Shifting Distance

Notice how, for example, the data point at 2021-01-01 07:49:41 gets shifted all the way back to 2021-01-01 07:40:00 - although shifting it forward to 07:50:00 would be less of a manipulation, since that timestamp is closer to the original one.
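
If such a minimal-distance alignment is desired, the shift direction can be chosen per value. A sketch, analogous to the bshift call above and assuming the method keyword also accepts the nearest-shift value nshift:

# align each value with its nearest 10 minutes grid point
saqc = saqc.shift('SoilMoisture', target='SoilMoisture_nshift', freq='10min', method='nshift')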