Data Regularization
===================
The tutorial aims to introduce the usage of SaQC methods in order to obtain regularly sampled data derivatives
from given time series data input. Regularly sampled time series data is data that exhibits a constant temporal
spacing between subsequent data points.
In the following steps, the tutorial guides you through the usage of the SaQC :doc:`resampling <../funcs/resampling>` library.
- Initially, we introduce and motivate regularization techniques, and we import the tutorial data.

  - :ref:`Why Regularization <cookbooks/DataRegularisation:Why Regularization?>`
  - :ref:`Tutorial Data <cookbooks/DataRegularisation:Tutorial Data>`

- We will get an overview over the main :ref:`Regularization <cookbooks/DataRegularisation:regularization>` methods, starting with the shift.

  - :ref:`Shift <cookbooks/DataRegularisation:shift>`
  - :ref:`Target Parameter <cookbooks/DataRegularisation:target parameter>`
  - :ref:`Freq Parameter <cookbooks/DataRegularisation:freq parameter>`
  - :ref:`Method Parameter <cookbooks/DataRegularisation:shifting method>`

- We introduce the notion of valid data and see how sparse intervals and intervals with multiple values interact with regularization.

  - :ref:`Valid Data <cookbooks/DataRegularisation:Valid Data>`
  - :ref:`Data Loss and Empty Intervals <cookbooks/DataRegularisation:data loss and empty intervals>`
  - :ref:`Empty Intervals <cookbooks/DataRegularisation:empty intervals>`
  - :ref:`Data Reduction <cookbooks/DataRegularisation:data reduction>`
  - :ref:`Minimize Shifting <cookbooks/DataRegularisation:minimize shifting distance>`

- We use the aggregation and the interpolation method.

  - :ref:`Aggregation <cookbooks/DataRegularisation:aggregation>`
  - :ref:`Function Parameter <cookbooks/DataRegularisation:aggregation functions>`
  - :ref:`Method Parameter <cookbooks/DataRegularisation:shifting method>`
  - :ref:`Interpolation <cookbooks/DataRegularisation:interpolation>`
  - :ref:`Representing Data Sparsity <cookbooks/DataRegularisation:interpolation and data sparsity>`

- We see how regularization interacts with flags.

  - :ref:`Flags and Regularization <cookbooks/DataRegularisation:flags and regularization>`
Why Regularization?
-------------------
Often, measurement data does not come as regularly sampled time series. The reasons why one usually would like to have time series data that exhibits a constant temporal gap size between subsequent measurements are manifold.

The first and maybe most important one is that statistics, such as mean and standard deviation, usually presuppose that the set of data points they are computed from is equally weighted.

The second reason is that relating data of different sources to one another is impossible if one does not have a mapping at hand that relates the different date time indices to each other. One easy and intuitive way of constructing such a mapping is to just resample all data at the same (regular) timestamps.
Tutorial Data
-------------
The following dataset of Soil Moisture measurements may serve as an example data set:

Let's import it and check out the first and last lines.

.. doctest:: example

   >>> import pandas as pd
   >>> data_path = './resources/data/SoilMoisture.csv'
   >>> data = pd.read_csv(data_path, index_col=0)
   >>> data.index = pd.DatetimeIndex(data.index)
   >>> data
                        SoilMoisture
   2021-01-01 00:09:07     23.429701
   2021-01-01 00:18:55     23.431900
   2021-01-01 00:28:42     23.343100
   2021-01-01 00:38:30     23.476400
   2021-01-01 00:48:18     23.343100
   ...                           ...
   2021-03-20 07:13:49    152.883102
   2021-03-20 07:26:16    156.587906
   2021-03-20 07:40:37    166.146194
   2021-03-20 07:54:59    164.690598
   2021-03-20 08:40:41    155.318893
   <BLANKLINE>
   [10607 rows x 1 columns]
The data series seems to start with a sampling rate of roughly 10 minutes. Somewhere along the way, the sampling rate changes, and towards the end, the series seems to exhibit an intended sampling rate of 15 minutes.

Finding out the proper sampling rate a series should be regularized to is a subject on its own and won't be covered here. Usually, the intended sampling rate of sensor data is known from the specification of the sensor.

If that is not the case, and if there seems to be more than one candidate for the rate to regularize to, a rough rule of thumb, aiming at the minimization of data loss and data manipulation, may be to go for the smallest rate seemingly present in the data.
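To get a quick impression of the gap sizes actually present in a series, one can inspect the distribution of the index differences. The following is a minimal sketch using plain pandas; it is not part of the regularization pipeline itself, and the rounding granularity is an arbitrary choice:

.. doctest:: example

   >>> # distribution of the temporal gaps between subsequent measurements,
   >>> # rounded to full minutes to group near-identical spacings
   >>> gaps = data.index.to_series().diff().dt.round('1min')
   >>> gaps.value_counts().nlargest(3)  # doctest: +SKIP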
Regularization
--------------
So let's transform the measurement timestamps to a regular 10 minutes frequency. In order to do so, we have to decide what to do with each timestamp's associated data when we alter the timestamp's value.
Basically, there are three types of :doc:`regularization <../funcs/resampling>` methods (a plain-pandas sketch of each follows the list):

- We could keep the values as they are and just :ref:`shift <cookbooks/DataRegularisation:Shift>` them in time to match the equidistant 10 minutes frequency grid we want the data to exhibit.
- We could calculate new, synthetic data values for the regular timestamps via an :ref:`interpolation <cookbooks/DataRegularisation:Interpolation>` method.
- We could apply some :ref:`aggregation <cookbooks/DataRegularisation:Aggregation>` to up- or downsample the data.
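For intuition only, here is a minimal sketch of the three approaches in plain pandas; it uses generic pandas functionality rather than the SaQC API, and the grid construction is a simplifying assumption:

.. doctest:: example

   >>> # a regular 10 minutes grid spanning the observation period
   >>> grid = pd.date_range(data.index[0].floor('10min'), data.index[-1], freq='10min')
   >>> # shift: move each value to the next grid point (a backwards shift)
   >>> shifted = data.reindex(grid, method='bfill', tolerance=pd.Timedelta('10min'))
   >>> # interpolation: synthesize values at the grid points from neighboring ones
   >>> interpolated = data.reindex(data.index.union(grid)).interpolate('time').reindex(grid)
   >>> # aggregation: summarize all values falling into a grid interval
   >>> aggregated = data.resample('10min').mean()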
Shift
^^^^^
Let's apply a simple shift via the :py:meth:`~saqc.SaQC.shift` method.

.. doctest:: example

   >>> import saqc
   >>> qc = saqc.SaQC(data)
   >>> qc = qc.shift('SoilMoisture', target='SoilMoisture_bshift', freq='10min', method='bshift')
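To verify that the shift produced a regular index, we can look at the spacing of the new variable's timestamps. This sketch assumes the ``data`` accessor of the ``SaQC`` object:

.. doctest:: example

   >>> # all index gaps of the shifted variable should now equal 10 minutes
   >>> idx = qc.data['SoilMoisture_bshift'].index
   >>> idx.to_series().diff().dropna().unique()  # doctest: +SKIP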
Target parameter
""""""""""""""""
We selected a new ``target`` field to store the shifted data to, so that our original data wouldn't be overridden.
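If, on the other hand, we are fine with the original variable being replaced, the ``target`` parameter can simply be omitted; it then defaults to the processed field itself. A sketch of that in-place variant (skipped here, so we keep the original data around):

.. doctest:: example

   >>> # without a target, the shifted data would replace 'SoilMoisture' itself
   >>> qc = qc.shift('SoilMoisture', freq='10min', method='bshift')  # doctest: +SKIP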