
Data Regularization
===================

This tutorial introduces the usage of SaQC methods for obtaining regularly sampled data derivatives from given time series input. Regularly sampled time series data is data that exhibits a constant temporal spacing between subsequent data points.

In the following steps, the tutorial guides you through the usage of the SaQC :doc:`resampling <../funcs/resampling>` library.

  1. Initially, we introduce and motivate regularization techniques and import the tutorial data.
    • :ref:`Why Regularization <cookbooks/DataRegularisation:Why Regularization?>`
    • :ref:`Tutorial Data <cookbooks/DataRegularisation:Tutorial Data>`
  2. We get an overview of the main :ref:`Regularization <cookbooks/DataRegularisation:regularization>` methods, starting with the shift.
    • :ref:`Shift <cookbooks/DataRegularisation:shift>`
    • :ref:`Target Parameter <cookbooks/DataRegularisation:target parameter>`
      • :ref:`Freq Parameter <cookbooks/DataRegularisation:freq parameter>`
      • :ref:`Method Parameter <cookbooks/DataRegularisation:shifting method>`
      • :ref:`Valid Data <cookbooks/DataRegularisation:Valid Data>`
  3. We introduce the notion of valid data and see how sparse intervals and those with multiple values interact with regularization.
    • :ref:`Data Loss and Empty Intervals <cookbooks/DataRegularisation:data loss and empty intervals>`
      • :ref:`Empty Intervals <cookbooks/DataRegularisation:empty intervals>`
        • :ref:`Valid Data <cookbooks/DataRegularisation:Valid Data>`
        • :ref:`Data Reduction <cookbooks/DataRegularisation:data reduction>`
        • :ref:`Minimize Shifting <cookbooks/DataRegularisation:minimize shifting distance>`
  4. We use the aggregation and the interpolation methods.
    • :ref:`Aggregation <cookbooks/DataRegularisation:aggregation>`
      • :ref:`Function Parameter <cookbooks/DataRegularisation:aggregation functions>`
      • :ref:`Method Parameter <cookbooks/DataRegularisation:shifting method>`
    • :ref:`Interpolation <cookbooks/DataRegularisation:interpolation>`
    • :ref:`Representing Data Sparsity <cookbooks/DataRegularisation:interpolation and data sparsity>`
  5. We see how regularization interacts with Flags.
    • :ref:`Flags and Regularization <cookbooks/DataRegularisation:flags and regularization>`

Why Regularization?
-------------------

Often, measurement data does not come in regularly sampled time series. Yet the reasons why one usually would like to have time series data that exhibits a constant temporal gap size between subsequent measurements are manifold.

The two foremost important ones may be, first, that statistics such as the mean and the standard deviation usually presuppose that the set of data points they are computed from is equally weighted.

The second reason is that relating data of different sources to one another is impossible if one does not have a mapping at hand that relates the different datetime indices to each other. One easy and intuitive way of constructing such a mapping is to simply resample all data at the same (regular) timestamps.

Tutorial Data
-------------

The following dataset of soil moisture measurements may serve as an example data set.

Let's import it and check out the first and last lines.

.. doctest:: example

   >>> import pandas as pd
   >>> data_path = './resources/data/SoilMoisture.csv'
   >>> data = pd.read_csv(data_path, index_col=0)
   >>> data.index = pd.DatetimeIndex(data.index)
   >>> data
                        SoilMoisture
   2021-01-01 00:09:07     23.429701
   2021-01-01 00:18:55     23.431900
   2021-01-01 00:28:42     23.343100
   2021-01-01 00:38:30     23.476400
   2021-01-01 00:48:18     23.343100
   ...                           ...
   2021-03-20 07:13:49    152.883102
   2021-03-20 07:26:16    156.587906
   2021-03-20 07:40:37    166.146194
   2021-03-20 07:54:59    164.690598
   2021-03-20 08:40:41    155.318893
   <BLANKLINE>
   [10607 rows x 1 columns]

The data series seems to start with a sampling rate of roughly 10 minutes. Somewhere along the way, the sampling rate changes, and towards the end, the series seems to exhibit an intended sampling rate of 15 minutes.

Finding the proper sampling rate a series should be regularized to is a subject of its own and won't be covered here. Usually, the intended sampling rate of sensor data is known from the specification of the sensor.

If that is not the case, and there seems to be more than one candidate rate for regularization, a rough rule of thumb, aiming at the minimization of data loss and data manipulation, is to go for the smallest rate seemingly present in the data.
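
A quick, pandas-only way to get an overview of the rates present in a series is to look at the distribution of the gaps between consecutive timestamps. This small check is our addition, not part of the original tutorial:

.. code-block:: python

   # spacing between consecutive timestamps, as a series of Timedeltas
   gaps = data.index.to_series().diff()

   print(gaps.median())               # a robust guess for the dominant rate
   print(gaps.value_counts().head())  # the most frequent gap sizes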

Regularization
--------------

So let's transform the measurement timestamps to have a regular 10 minutes frequency. In order to do so, we have to decide what to do with each timestamp's associated data when we alter the timestamp's value.

Basically, there are three types of :doc:`regularization <../funcs/resampling>` methods:

  1. We could keep the values as they are, and thus just :ref:`shift <cookbooks/DataRegularisation:Shift>` them in time to match the equidistant 10 minutes frequency grid we want the data to exhibit.
  2. We could calculate new, synthetic data values for the regular timestamps via an :ref:`interpolation <cookbooks/DataRegularisation:Interpolation>` method.
  3. We could apply some :ref:`aggregation <cookbooks/DataRegularisation:Aggregation>` to up- or downsample the data.
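
As a compact preview, the three approaches translate into three similarly shaped SaQC calls. Only the ``shift`` call is demonstrated right below; the ``interpolate`` and ``resample`` signatures shown here are assumptions following the same pattern (they are properly introduced in their respective sections), and the target names are our own placeholders:

.. code-block:: python

   import numpy as np
   import saqc

   qc = saqc.SaQC(data)

   # 1. shift: keep the values, move their timestamps onto the regular grid
   qc = qc.shift('SoilMoisture', target='SM_shift', freq='10min', method='bshift')

   # 2. interpolation: calculate synthetic values at the regular timestamps
   qc = qc.interpolate('SoilMoisture', target='SM_interp', freq='10min', method='time')

   # 3. aggregation: condense the values of every interval into one value
   qc = qc.resample('SoilMoisture', target='SM_agg', freq='10min', func=np.mean)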

Shift
^^^^^

Let's apply a simple shift via the :py:meth:`~saqc.SaQC.shift` method.

.. doctest:: example

   >>> import saqc
   >>> qc = saqc.SaQC(data)
   >>> qc = qc.shift('SoilMoisture', target='SoilMoisture_bshift', freq='10min', method='bshift')
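
To check the effect, we can pull the shifted series back out of the SaQC object and compare indices. This inspection is our addition, assuming the usual ``data`` accessor of the ``SaQC`` object:

.. code-block:: python

   # retrieve the shifted series from the SaQC object
   shifted = qc.data['SoilMoisture_bshift']

   # its index now lies on a regular 10 minutes grid ...
   print(shifted.index[:3])

   # ... while the original, irregularly sampled series is kept untouched
   print(qc.data['SoilMoisture'].index[:3])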

Target parameter
~~~~~~~~~~~~~~~~

We selected a new target field to store the shifted data in, so that our original data wouldn't be overridden.
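
Had we omitted the ``target`` parameter, the result would have been written back to the input field, replacing the original irregular data. A hypothetical sketch:

.. code-block:: python

   # without a target, 'SoilMoisture' itself is overwritten with the shifted data
   qc = qc.shift('SoilMoisture', freq='10min', method='bshift')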