Commit 88bbdd3d authored by Peter Lünenschloß

more text

parent b3bd6101
@@ -33,8 +33,6 @@ clean:
doc:
# generate parent fake module for the functions to be documented
python make_doc_module.py -p "saqc/funcs" -t "$(FUNCTIONS)" -sr ".."
# make rest files from fake module
python make_doc_rst.py -p "$(FUNCTIONS)" -t "sphinx-doc/$(FUNCTIONS)" -sr ".."
# make rest folders from markdown folders
for k in $(MDLIST); do python make_md_to_rst.py -p sphinx-doc/"$$k" -sr ".."; done
# make the html build
# Data Regularisation
This tutorial introduces the usage of `SaQC` methods to obtain regularly sampled data derivatives
from given time series data input. Regularly sampled time series data has constant temporal spacing between subsequent
data points.
## Data
Measurement data usually does not come as a regularly sampled time series. The reasons why one would
nevertheless like to have time series data with a constant temporal gap size
between subsequent measurements, on the other hand, are manifold.
The two foremost ones may be: first, statistics such as mean and standard deviation
usually presuppose the set of data points they are computed from to
be of equal weight; second, relating data from different sources to one another is impossible if one
does not have a mapping at hand that relates the different measurement times to each other.
The following [dataset](../ressources/data/SoilMoisture.csv) of soil moisture measurements serves as
exemplary data:
![](../ressources/images/cbooks_SoilMoisture.png)
Let's import it via:
```python
import pandas as pd

# path of the example dataset (linked above); its first column ("Date Time")
# serves as the index
data_path = "../ressources/data/SoilMoisture.csv"
data = pd.read_csv(data_path, index_col=0)
data.index = pd.DatetimeIndex(data.index)
```
Now let's check out the data's timestamps:
```python
>>> data
SoilMoisture
Date Time
2021-01-01 00:09:07 23.429701
2021-01-01 00:18:55 23.431900
2021-01-01 00:28:42 23.343100
2021-01-01 00:38:30 23.476400
2021-01-01 00:48:18 23.343100
...
2021-03-20 07:13:49 152.883102
2021-03-20 07:26:16 156.587906
2021-03-20 07:40:37 166.146194
2021-03-20 07:54:59 164.690598
2021-03-20 08:40:41 155.318893
[10607 rows x 1 columns]
```
So the data seems to start with an intended sampling rate of about *10* minutes, which, towards the end, appears to
have changed to roughly *15* minutes. Finding out the proper sampling rate a series should be regularised to is
a subject that won't be covered here. Usually the intended sampling rate of sensor data is known from the specification.
If that's not the case, and there seems to be more than one candidate, a rough rule of thumb to mitigate data loss
is to go for the smallest rate.
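Plain pandas can help with such an assessment. A quick sketch (not part of the regularisation itself) that inspects the gaps actually present between consecutive timestamps:
```python
# look at the distribution of time deltas between consecutive timestamps
deltas = data.index.to_series().diff()
print(deltas.median())               # a robust guess of the dominant sampling interval
print(deltas.value_counts().head())  # the most frequent gap sizes
```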
So let's transform the measurement timestamps to a regular *10* minutes frequency. In order to do so,
we have to decide what to do with the associated data points.
Basically, there are three possibilities: we could keep the values as they are and thus
just [shift](#Shift) them in time to match the equidistant *10* minutes frequency grid; we could calculate new,
synthetic measurement values for the regular timestamps via an [interpolation](#Interpolation) method; or we could
apply some [aggregation](#Resampling) to up- or downsample the data.
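All the calls below operate on a `SaQC` object wrapping our data. A minimal construction sketch, assuming the `SaQC` constructor accepts the data via its `data` keyword (check the API documentation of your `saqc` version for the exact signature):
```python
from saqc import SaQC

# wrap the pandas data in a SaQC object; the `data` keyword is an assumption
# about the constructor signature of the saqc version this tutorial targets
saqc = SaQC(data=data)
```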
## Shift
Let's apply a simple shift via the :py:func:`saqc.shift <Functions.saqc.shift>` method.
```python
saqc = saqc.shift('SoilMoisture', target='SoilMoisture_bshift', freq='10min', method='bshift')
```
* We selected a new target field to store the shifted data to, so that we don't override our original data.
* We passed the intended sampling frequency via the `freq` keyword, in terms of a
[date alias](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases) string.
* With the `method` keyword, we determined the direction of the shift. We passed it the string `bshift`,
which applies a *backwards* shift, so measurements get shifted backwards until they match a timestamp
that is a multiple of *10* minutes. (See the :py:func:`saqc.shift <Functions.saqc.shift>` documentation for more
details on the keywords.) A *forwards* shift works analogously, as sketched below.
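For reference, a *forwards* shift presumably only requires a different method string; the name `"fshift"` is assumed here by analogy with `"bshift"` and the `"nshift"` method used further below:
```python
# shift measurements forwards onto the next 10 minutes grid point
# ("fshift" is assumed by analogy with "bshift"/"nshift")
saqc = saqc.shift('SoilMoisture', target='SoilMoisture_fshift', freq='10min', method='fshift')
```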
Let's see how the data is sampled now. For that, we use the `raw` output from the
:py:meth:`saqc.getResult <saqc.core.core.SaQC>` method. This prevents the method's output from
being merged into a `pandas.DataFrame` object, so the changes from the resampling are easier
to comprehend at one look:
```python
>>> saqc = saqc.evaluate()
>>> data_result = saqc.getResult(raw=True)[0]
>>> data_result
SoilMoisture | SoilMoisture_bshift |
================================ | ========================================= |
Date Time | Date Time |
2021-01-01 00:00:00 23.429701 | 2021-01-01 00:09:07 23.429701 |
2021-01-01 00:10:00 23.431900 | 2021-01-01 00:18:55 23.431900 |
2021-01-01 00:20:00 23.343100 | 2021-01-01 00:28:42 23.343100 |
2021-01-01 00:30:00 23.476400 | 2021-01-01 00:38:30 23.476400 |
2021-01-01 00:40:00 23.343100 | 2021-01-01 00:48:18 23.343100 |
2021-01-01 00:50:00 23.298800 | 2021-01-01 00:58:06 23.298800 |
2021-01-01 01:00:00 23.387400 | 2021-01-01 01:07:54 23.387400 |
2021-01-01 01:10:00 23.343100 | 2021-01-01 01:17:41 23.343100 |
2021-01-01 01:20:00 23.298800 | 2021-01-01 01:27:29 23.298800 |
2021-01-01 01:30:00 23.343100 | 2021-01-01 01:37:17 23.343100 |
... | ... ... |
2021-03-20 07:20:00 156.587906 | 2021-03-20 05:07:02 137.271500 |
2021-03-20 07:30:00 NaN | 2021-03-20 05:21:35 138.194107 |
2021-03-20 07:40:00 166.146194 | 2021-03-20 05:41:59 154.116806 |
2021-03-20 07:50:00 164.690598 | 2021-03-20 06:03:09 150.567505 |
2021-03-20 08:00:00 NaN | 2021-03-20 06:58:10 145.027496 |
2021-03-20 08:10:00 NaN | 2021-03-20 07:13:49 152.883102 |
2021-03-20 08:20:00 NaN | 2021-03-20 07:26:16 156.587906 |
2021-03-20 08:30:00 NaN | 2021-03-20 07:40:37 166.146194 |
2021-03-20 08:40:00 155.318893 | 2021-03-20 07:54:59 164.690598 |
[11286] [10607]
```
We see the first and last *10* data points of both the original data time series and the shifted one.
Obviously, the shifted data series now exhibits a regular sampling rate of *10* minutes, with the index
ranging from the latest timestamp that is a multiple of *10* minutes and precedes the initial timestamp
of the original data, up to the first *10* minutes multiple that succeeds the last original data timestamp.
This is the default behavior of all the :doc:`regularisations <../Functions/regularisation>` provided by `saqc`.
The number of data points (displayed at the bottom of the table columns) has changed through the
transformation as well. That change mainly stems from two sources:
First, if there is no data point within an interval of the passed frequency that could be shifted to match a multiple
of the frequency, a `NaN` value gets inserted to represent the fact that data is missing at this position.
Second, if multiple values are present within an interval of the size of the passed `freq`, these values
get reduced to one single value that gets associated with the interval's timestamp.
How this reduction is done depends on the selected :doc:`regularisation <../Functions/regularisation>` method.
We applied a backwards :py:func:`shift <Functions.saqc.shift>` with a *10* minutes frequency,
so the first value encountered after any multiple of *10* minutes gets shifted backwards to be aligned with
the desired frequency, and any other value in that *10* minutes interval just gets discarded.
See the below chunk of our processed *SoilMoisture* data set to get an idea of the effect. There are two measurements
within the *10* minutes interval ranging from `2021-01-01 07:30:00` to `2021-01-01 07:40:00` present
in the original data, and only the first of the two reappears in the shifted data set, as the representation
of that interval.
```python
>>> data_result['2021-01-01T07:00:00':'2021-01-01T08:00:00']
SoilMoisture_bshift | SoilMoisture |
================================ | ========================================= |
Date Time | Date Time |
2021-01-01 07:00:00 23.3431 | 2021-01-01 07:00:41 23.3431 |
2021-01-01 07:10:00 23.3431 | 2021-01-01 07:10:29 23.3431 |
2021-01-01 07:20:00 23.2988 | 2021-01-01 07:20:17 23.2988 |
2021-01-01 07:30:00 23.3874 | 2021-01-01 07:30:05 23.3874 |
2021-01-01 07:40:00 23.3431 | 2021-01-01 07:39:53 23.3853 |
2021-01-01 07:50:00 23.3874 | 2021-01-01 07:49:41 23.3431 |
```
Notice how, for example, the data point for `2021-01-01 07:49:41` gets shifted all the way back to
`2021-01-01 07:40:00`, although shifting it forward to `07:50:00` would be less of a manipulation, since that timestamp
is closer to the original one.
To assign every frequency-aligned timestamp the value that is closest to it, we
can perform a *nearest shift* instead of a simple *back shift* by using the shift method `"nshift"`:
```python
>>> saqc = saqc.shift('SoilMoisture', target='SoilMoisture_nshift', freq='10min', method='nshift')
>>> saqc = saqc.evaluate()
>>> data_result = saqc.getResult(raw=True)[0]
>>> data_result['2021-01-01T07:00:00':'2021-01-01T08:00:00']
SoilMoisture_nshift | SoilMoisture |
================================ | ========================================= |
Date Time | Date Time |
2021-01-01 07:00:00 23.3431 | 2021-01-01 07:00:41 23.3431 |
2021-01-01 07:10:00 23.3431 | 2021-01-01 07:10:29 23.3431 |
2021-01-01 07:20:00 23.2988 | 2021-01-01 07:20:17 23.2988 |
2021-01-01 07:30:00 23.3874 | 2021-01-01 07:30:05 23.3874 |
2021-01-01 07:40:00 23.3853 | 2021-01-01 07:39:53 23.3853 |
2021-01-01 07:50:00 23.3431 | 2021-01-01 07:49:41 23.3431 |
```
Now any regular timestamp got assigned the value that is nearest to it, *if* there is a valid data value available in the
interval surrounding that timestamp with a range of half the frequency. In our example, this means a regular
timestamp gets assigned the nearest of all the values that precede or succeed it by less than *5* minutes.
Check out what happens in the chunk covering the final two hours of our shifted *SoilMoisture* dataset to get an idea.
```python
>>> data_result['2021-03-20 07:00:00':]
SoilMoisture_nshift | SoilMoisture |
================================ | ========================================= |
Date Time | Date Time |
2021-03-20 07:00:00 145.027496 | 2021-03-20 07:13:49 152.883102 |
2021-03-20 07:10:00 152.883102 | 2021-03-20 07:26:16 156.587906 |
2021-03-20 07:20:00 NaN | 2021-03-20 07:40:37 166.146194 |
2021-03-20 07:30:00 156.587906 | 2021-03-20 07:54:59 164.690598 |
2021-03-20 07:40:00 166.146194 | 2021-03-20 08:40:41 155.318893 |
2021-03-20 07:50:00 164.690598 | 2021-03-20 08:40:41 155.318893 |
2021-03-20 08:00:00 NaN | |
2021-03-20 08:10:00 NaN | |
2021-03-20 08:20:00 NaN | |
2021-03-20 08:30:00 NaN | |
2021-03-20 08:40:00 155.318893 | |
2021-03-20 08:50:00 NaN | |
```
Since there is no data available, for example, in the interval from `2021-03-20 07:55:00` to `2021-03-20 08:05:00`, the new value
for the regular timestamp `2021-03-20 08:00:00`, which lies at the center of this interval, is `NaN`.
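As a quick sanity check, we can look that value up directly. This sketch assumes the raw result behaves like a mapping of column names to `pandas.Series`:
```python
import numpy as np

# the regular timestamp falls into an empty half-frequency window, hence NaN
value = data_result['SoilMoisture_nshift'].loc['2021-03-20 08:00:00']
assert np.isnan(value)
```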
## Resampling
If we want to comprise several values by aggregation and assign the result to the new regular timestamp, instead of
selecting a single one, we can do this with the :py:func:`saqc.resample <Functions.saqc.resample>` method.
Let's resample the *SoilMoisture* data to a *20* minutes sample rate by aggregating every *20* minutes interval's
content with the arithmetic mean (as implemented by numpy's `numpy.mean` function, for example).
```python
>>> import numpy as np
>>> saqc = saqc.resample('SoilMoisture', target='SoilMoisture_mean', freq='20min', method='bagg', agg_func=np.mean)
>>> saqc = saqc.evaluate()
>>> saqc.getResult(raw=True)[0]
SoilMoisture | SoilMoisture_mean |
================================ | ===================================== |
Date Time | Date Time |
2021-01-01 00:09:07 23.429701 | 2021-01-01 00:00:00 23.430800 |
2021-01-01 00:18:55 23.431900 | 2021-01-01 00:20:00 23.409750 |
2021-01-01 00:28:42 23.343100 | 2021-01-01 00:40:00 23.320950 |
2021-01-01 00:38:30 23.476400 | 2021-01-01 01:00:00 23.365250 |
2021-01-01 00:48:18 23.343100 | 2021-01-01 01:20:00 23.320950 |
2021-01-01 00:58:06 23.298800 | 2021-01-01 01:40:00 23.343100 |
2021-01-01 01:07:54 23.387400 | 2021-01-01 02:00:00 23.320950 |
2021-01-01 01:17:41 23.343100 | 2021-01-01 02:20:00 23.343100 |
2021-01-01 01:27:29 23.298800 | 2021-01-01 02:40:00 23.343100 |
2021-01-01 01:37:17 23.343100 | 2021-01-01 03:00:00 23.343100 |
... | ... ... |
2021-03-20 05:07:02 137.271500 | 2021-03-20 05:40:00 154.116806 |
2021-03-20 05:21:35 138.194107 | 2021-03-20 06:00:00 150.567505 |
2021-03-20 05:41:59 154.116806 | 2021-03-20 06:20:00 NaN |
2021-03-20 06:03:09 150.567505 | 2021-03-20 06:40:00 145.027496 |
2021-03-20 06:58:10 145.027496 | 2021-03-20 07:00:00 152.883102 |
2021-03-20 07:13:49 152.883102 | 2021-03-20 07:20:00 156.587906 |
2021-03-20 07:26:16 156.587906 | 2021-03-20 07:40:00 165.418396 |
2021-03-20 07:40:37 166.146194 | 2021-03-20 08:00:00 NaN |
2021-03-20 07:54:59 164.690598 | 2021-03-20 08:20:00 NaN |
2021-03-20 08:40:41 155.318893 | 2021-03-20 08:40:00 155.318893 |
[10607] [5643]
```
You can pass arbitrary function objects to the `agg_func` parameter, to be applied to calculate every interval's result,
as long as this function returns a scalar *float* value upon an array-like input. (So `np.median`, as sketched below,
would be proper for calculating the median, `sum` for assigning the value sum, and so on.)
As with the [shift](#Shift) functionality, a `method` keyword controls whether the
aggregation result of the interval in between two regular timestamps gets assigned to the left (`bagg`) or to the
right (`fagg`) boundary timestamp.
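For example, swapping the mean for the median only means passing a different function object. A sketch, reusing the call from above with a hypothetical new target column:
```python
# same resampling, but every 20 minutes interval gets aggregated with the median
saqc = saqc.resample('SoilMoisture', target='SoilMoisture_median', freq='20min',
                     method='bagg', agg_func=np.median)
```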
## Interpolation
Another common way of obtaining regular timestamps is interpolation.
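A sketch of what such a call might look like, assuming :py:func:`saqc.interpolate <Functions.saqc.interpolate>` accepts `freq` and a pandas-style interpolation `method` keyword analogous to the calls above (consult the linked documentation for the actual signature):
```python
# hypothetical usage: time-weighted linear interpolation onto a regular
# 10 minutes grid; the keywords here are assumptions, see the interpolate docs
saqc = saqc.interpolate('SoilMoisture', target='SoilMoisture_linear', freq='10min', method='time')
```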
# Flags and Regularisation
- see problem with data[95] - flag before interpolation
## Back Projection of Flags
- all
- last
## Wrapper
Generic Functions
=================
* :py:func:`flagGeneric <Functions.saqc.flag>`
* :py:func:`processGeneric <Functions.saqc.process>`
Regularisation Functions
========================
* :py:func:`linear <Functions.saqc.linear>`
* :py:func:`shift <Functions.saqc.shift>`
* :py:func:`resample <Functions.saqc.resample>`
* :py:func:`aggregate <Functions.saqc.aggregate>`
* :py:func:`interpolate <Functions.saqc.interpolate>`
@@ -26,7 +26,7 @@ Tutorials and Topics
.. toctree::
   :maxdepth: 1
-   cook_books_md_m2r/dataRegularisation
+   cook_books_md_m2r/DataRegularisation
   cook_books_md_m2r/OutlierDetection
@@ -40,7 +40,7 @@ Flagging Functions
   :glob:
   :titlesonly:
-   Functions/*
+   moduleAPIs/saqcFunctions
Indices and tables
 saqc
 ====
-.. automodapi:: saqc.core.core
+.. automodapi:: saqc