.. testsetup:: default

   datapath = './ressources/data/hydro_data.csv'
   maintpath = './ressources/data/hydro_maint.csv'

.. plot::
   :context:
   :include-source: False

   import matplotlib
   import saqc
   import pandas as pd

   datapath = '../ressources/data/hydro_data.csv'
   maintpath = '../ressources/data/hydro_maint.csv'
   data = pd.read_csv(datapath, index_col=0)
   maint = pd.read_csv(maintpath, index_col=0)
   maint.index = pd.DatetimeIndex(maint.index)
   data.index = pd.DatetimeIndex(data.index)
   qc = saqc.SaQC([data, maint])

Multivariate Flagging
=====================
This tutorial aims to introduce the usage of SaQC in the context of some more complex flagging and processing techniques.
Mainly, we will see how to apply drift corrections to the data and how to perform multivariate flagging.

#. `Data Preparation`_
#. `Drift Correction`_

Data Preparation
----------------
First, import the data (from the repository) and generate an SaQC instance from it. You will need to download the `sensor
data <https://git.ufz.de/rdm-software/saqc/-/blob/develop/sphinxdoc/ressources/data/hydro_config.csv>`_ and the
`maintenance data <https://git.ufz.de/rdm-software/saqc/-/blob/develop/sphinxdoc/ressources/data/hydro_maint.csv>`_
from the `repository <https://git.ufz.de/rdm-software/saqc.git>`_ and set the variables `datapath` and `maintpath`
to the paths pointing at the downloaded files. Note that :py:class:`~saqc.SaQC` digests the loaded data in a list.
This is done to circumvent having to concatenate both datasets into one pandas DataFrame instance, which would introduce
`NaN` values in both datasets wherever their timestamps mismatch. `SaQC` handles such unaligned data
internally without introducing artificial fill values.

.. testcode:: default

   data = pd.read_csv(datapath, index_col=0)
   maint = pd.read_csv(maintpath, index_col=0)
   maint.index = pd.DatetimeIndex(maint.index)
   data.index = pd.DatetimeIndex(data.index)
   qc = saqc.SaQC([data, maint])  # dataframes "data" and "maint" are integrated internally

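To see why passing the frames as a list avoids artificial fill values, here is a minimal pandas sketch (with hypothetical toy timestamps, not the tutorial data) of what concatenating frames with mismatching indices would do:

```python
import pandas as pd

# toy frames with deliberately mismatching timestamps
data = pd.DataFrame(
    {"sac254_raw": [18.45, 18.64]},
    index=pd.DatetimeIndex(["2016-01-01 00:02", "2016-01-01 00:17"]),
)
maint = pd.DataFrame(
    {"maint": ["2016-01-10 12:15:00"]},
    index=pd.DatetimeIndex(["2016-01-10 11:15"]),
)

# concatenation aligns on the union of both indices: every timestamp
# present in only one frame gets NaN in the other frame's columns
merged = pd.concat([data, maint], axis=1)
print(merged.isna().sum().sum())  # → 3
```

Passing `[data, maint]` to :py:class:`~saqc.SaQC` sidesteps this alignment step entirely.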
We can check out the fields the newly generated :py:class:`~saqc.SaQC` object contains as follows:

.. doctest:: default

   >>> qc.data.columns
   Index(['sac254_raw', 'level_raw', 'water_temp_raw', 'maint'], dtype='object', name='columns')

The variables represent measurements of the *water level*, the *specific absorption coefficient* at 254 nm wavelength
and the *water temperature*. Additionally, there is a variable, *maint*, that refers to time periods during which the
*sac254* sensor was maintained. Let's have a look at those:

.. doctest:: default

   >>> qc.data_raw['maint'] # doctest:+SKIP
   Timestamp
   2016-01-10 11:15:00    2016-01-10 12:15:00
   2016-01-12 14:40:00    2016-01-12 15:30:00
   2016-02-10 13:40:00    2016-02-10 14:40:00
   2016-02-24 16:40:00    2016-02-24 17:30:00
                   ...                    ...
   2017-10-17 08:55:00    2017-10-17 10:20:00
   2017-11-14 15:30:00    2017-11-14 16:20:00
   2017-11-27 09:10:00    2017-11-27 10:10:00
   2017-12-12 14:10:00    2017-12-12 14:50:00
   Name: maint, dtype: object

Measurements collected during maintenance are not trustworthy, so any measurement taken within any of the listed
intervals should be flagged right away. This can be achieved with the :py:meth:`~saqc.SaQC.flagManual` method. Also,
we will flag out-of-range values in the data with the :py:meth:`~saqc.SaQC.flagRange` method:

.. doctest:: default

   >>> qc = qc.flagManual('sac254_raw', mdata='maint', method='closed', label='Maintenance')
   >>> qc = qc.flagRange('level_raw', min=0)
   >>> qc = qc.flagRange('water_temp_raw', min=-1, max=40)
   >>> qc = qc.flagRange('sac254_raw', min=0, max=60)

.. plot::
   :context:
   :include-source: False

   qc = qc.flagManual('sac254_raw', mdata='maint', method='closed', label='Maintenance')
   qc = qc.flagRange('level_raw', min=0)
   qc = qc.flagRange('water_temp_raw', min=-1, max=40)
   qc = qc.flagRange('sac254_raw', min=0, max=60)

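What the ``closed`` method does can be sketched in plain pandas (a hypothetical toy series and a single maintenance interval, not SaQC's internal implementation): every timestamp that falls into an interval, endpoints included, gets flagged.

```python
import pandas as pd

# toy sensor series sampled every 15 minutes
idx = pd.date_range("2016-01-10 10:00", "2016-01-10 13:00", freq="15min")
sac = pd.Series(range(len(idx)), index=idx)

# one maintenance interval; 'closed' includes both endpoints
start = pd.Timestamp("2016-01-10 11:15")
end = pd.Timestamp("2016-01-10 12:15")
flagged = (sac.index >= start) & (sac.index <= end)

print(flagged.sum())  # → 5 timestamps fall into the interval
```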
Let's check out the resulting flags for the *sac254* variable with the :py:meth:`~saqc.SaQC.plot` method:

.. plot::
   :context:
   :include-source: True
   :format: doctest

   >>> qc.plot('sac254_raw') #doctest:+SKIP

Now we should figure out what sampling rate the data is intended to have, by accessing the *_raw* variables
constituting the sensor data. (Since :py:attr:`saqc.SaQC.data` yields a common
`pandas.DataFrame <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html>`_ object, we can index it with
the desired variables as column names and have a look at the console output to get a first impression.)

.. doctest:: default

   >>> qc.data[['sac254_raw', 'level_raw', 'water_temp_raw']] # doctest:+NORMALIZE_WHITESPACE
   columns              sac254_raw  level_raw  water_temp_raw
   Timestamp
   2016-01-01 00:02:00     18.4500    103.290            4.84
   2016-01-01 00:17:00     18.6437    103.285            4.82
   2016-01-01 00:32:00     18.9887    103.253            4.81
   2016-01-01 00:47:00     18.8388    103.210            4.80
   2016-01-01 01:02:00     18.7438    103.167            4.78
   ...                         ...        ...             ...
   2017-12-31 22:47:00     43.2275    186.060            5.49
   2017-12-31 23:02:00     43.6937    186.115            5.49
   2017-12-31 23:17:00     43.6012    186.137            5.50
   2017-12-31 23:32:00     43.2237    186.128            5.51
   2017-12-31 23:47:00     43.7438    186.130            5.53
   <BLANKLINE>
   [70199 rows x 3 columns]

At first glance, the data seems to have a fairly regular sampling rate of *15* minutes.
But checking out the values around *2017-10-29*, we notice that the sampling rate is not completely stable:

.. doctest:: default

   >>> qc.data[['sac254_raw', 'level_raw', 'water_temp_raw']]['2017-10-29 07:00:00':'2017-10-29 09:00:00'] # doctest:+NORMALIZE_WHITESPACE
   columns              sac254_raw  level_raw  water_temp_raw
   Timestamp
   2017-10-29 07:02:00     40.3050    112.570           10.91
   2017-10-29 07:17:00     39.6287    112.497           10.90
   2017-10-29 07:32:00     39.5800    112.460           10.88
   2017-10-29 07:32:01     39.9750    111.837           10.70
   2017-10-29 07:47:00     39.1350    112.330           10.84
   2017-10-29 07:47:01     40.6937    111.615           10.68
   2017-10-29 08:02:00     40.4938    112.040           10.77
   2017-10-29 08:02:01     39.3337    111.552           10.68
   2017-10-29 08:17:00     41.5238    111.835           10.72
   2017-10-29 08:17:01     38.6963    111.750           10.69
   2017-10-29 08:32:01     39.4337    112.027           10.66
   2017-10-29 08:47:01     40.4987    112.450           10.64

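A quick way to quantify such instabilities is to look at the distribution of the timestamp differences. The sketch below uses a small hypothetical index mimicking the duplicated readings above; it is a plain pandas diagnostic aid, not part of the SaQC API:

```python
import pandas as pd

# hypothetical index mimicking the instability around 2017-10-29:
# regular 15 minute steps interleaved with readings one second later
idx = pd.DatetimeIndex([
    "2017-10-29 07:32:00", "2017-10-29 07:32:01",
    "2017-10-29 07:47:00", "2017-10-29 07:47:01",
    "2017-10-29 08:02:00",
])

# count how often each step size between consecutive timestamps occurs
steps = pd.Series(idx).diff().value_counts()
print(steps)
```

A perfectly regular index would show a single step size; here, two different step sizes occur.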
Those instabilities bias most statistical evaluations, and it is common practice to apply some
:doc:`resampling functions <../funcSummaries/resampling>` to the data to obtain a regularly spaced timestamp index.
(See also the :ref:`regularisation tutorial <./cook_books/DataRegularisation:data regularisation>` for more information
on that topic.)
We will apply :py:meth:`linear harmonisation <saqc.SaQC.linear>` to linearly interpolate the data at pillar points
that are multiples of *15* minutes.
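The effect of such a linear harmonisation can be sketched in plain pandas (using a hypothetical three-point snippet; SaQC's :py:meth:`~saqc.SaQC.linear` additionally takes care of the flags): interpolate linearly in time on the union of the original and the target index, then keep only the regular grid points.

```python
import pandas as pd

# hypothetical irregular snippet around 2017-10-29
s = pd.Series(
    [39.58, 39.97, 39.13],
    index=pd.DatetimeIndex(
        ["2017-10-29 07:32:00", "2017-10-29 07:32:01", "2017-10-29 07:47:00"]
    ),
)

# target grid: multiples of 15 minutes
grid = pd.date_range("2017-10-29 07:30", "2017-10-29 07:45", freq="15min")

# interpolate on the union of both indices, then keep only the grid
regular = (
    s.reindex(s.index.union(grid))
     .interpolate(method="time")
     .reindex(grid)
)
print(regular)
```

Note that the first grid point lies before the first observation and therefore stays `NaN`; no values are extrapolated.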
* Flagging missing values via :py:func:`flagMissing <Functions.saqc.flagMissing>`.
* Flagging out-of-range values via :py:func:`flagRange <Functions.saqc.flagRange>`.
* Flagging values where the Specific Conductance (\ *K25*\ ) drops down to near zero (via :py:func:`flagGeneric <Functions.saqc.flag>`).
* Resampling the data via linear interpolation (:py:func:`linear <Functions.saqc.linear>`).

Drift Correction
----------------
Exponential Drift
^^^^^^^^^^^^^^^^^
* The variables *SAK254* and *Turbidity* show drifting behavior originating from dirt that accumulates on the light-sensitive sensor surfaces over time.
* The effect the dirt accumulation has on the measurement values is assumed to be properly described by an exponential model.
* The sensors are cleaned periodically, resulting in a periodical reset of the drifting effect.
* The dates and times of the maintenance events are passed to :py:func:`correctDrift <Functions.saqc.correctDrift>`, which will correct the data in between any two such maintenance intervals.
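The correction idea can be sketched with a toy example (synthetic numbers; :py:func:`correctDrift <Functions.saqc.correctDrift>` fits the model for you): estimate the exponential trend within one cleaning-to-cleaning section and remove it.

```python
import numpy as np

# toy section between two cleaning events: the true value is a constant
# 20, dirt accumulation scales it by an exponentially growing factor
t = np.linspace(0, 30, 61)            # days since the last cleaning
observed = 20.0 * np.exp(0.01 * t)    # exponentially drifting readings

# estimate the drift rate with a log-linear least squares fit
slope, intercept = np.polyfit(t, np.log(observed), 1)

# remove the fitted exponential trend, anchored at the cleaning time
corrected = observed * np.exp(-slope * t)
print(np.allclose(corrected, 20.0))  # → True
```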
Linear Long Time Drift
^^^^^^^^^^^^^^^^^^^^^^
* Afterwards, there remains a long-term linear drift in the *SAK254* and *Turbidity* measurements, originating from scratches that accumulate on the sensors' glass lenses over time.
* The lenses are replaced periodically, resulting in a periodical reset of that long-term drifting effect.
* The dates and times of the lens replacements are passed to :py:func:`correctDrift <Functions.saqc.correctDrift>`, which will correct the data in between any two such replacements according to the assumption of a linearly increasing bias.
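Analogously, the linear long-term component can be sketched as a straight-line fit per lens lifetime (again toy numbers, not the SaQC internals):

```python
import numpy as np

# toy lens lifetime: constant true signal plus a linearly growing bias
t = np.arange(100, dtype=float)       # days since the lens replacement
observed = 20.0 + 0.05 * t            # 0.05 units of bias per day

# fit a straight line and subtract the trend, keeping the level at t=0
slope, intercept = np.polyfit(t, observed, 1)
corrected = observed - slope * t
print(np.allclose(corrected, 20.0))  # → True
```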
Maintenance Intervals Flagging
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* The *SAK254* and *Turbidity* values obtained during maintenance are, of course, not trustworthy; thus, all values obtained during maintenance get flagged via the :py:func:`flagManual <Functions.saqc.flagManual>` method.
* When maintaining the *SAK254* and *Turbidity* sensors, the *NO3* sensors get removed from the water as well - thus, they also have to be flagged via the :py:func:`flagManual <Functions.saqc.flagManual>` method.
Apply Multivariate Flagging
---------------------------
We basically follow the *oddWater* procedure, as suggested in *Talagala, P.D. et al. (2019): A Feature-Based Procedure for Detecting Technical Outliers in Water-Quality Data From In Situ Sensors. Water Resources Research, 55(11), 8547-8568.*

* Variables *SAK254*\ , *Turbidity*\ , *Pegel*\ , *NO3N*\ , *WaterTemp* and *pH* get transformed to comparable scales.
* We obtain nearest neighbor scores and assign them to a new variable via :py:func:`assignKNNScore <Functions.saqc.assignKNNScore>`.
* We apply the *STRAY* algorithm to find the cut-off points for the scores, above which values qualify as outliers (:py:func:`flagByStray <Functions.saqc.flagByStray>`).
* We project the calculated flags onto the input variables.
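The scoring step of this procedure can be sketched with plain numpy (a hypothetical two-feature toy set with one planted outlier; :py:func:`assignKNNScore <Functions.saqc.assignKNNScore>` and :py:func:`flagByStray <Functions.saqc.flagByStray>` implement the full machinery): scale the variables, compute each point's distance to its nearest neighbor, and treat unusually large scores as outlier candidates.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy feature matrix: two variables, one planted technical outlier
X = rng.normal(size=(50, 2))
X[0] = [8.0, 8.0]

# transform to comparable scales (z-scoring)
X = (X - X.mean(axis=0)) / X.std(axis=0)

# nearest neighbor distance of every point as its outlier score
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
np.fill_diagonal(D, np.inf)           # ignore self-distances
scores = D.min(axis=1)

print(scores.argmax())  # → 0, the planted outlier has the largest score
```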
Postprocessing
--------------
* (Flags reduction onto subspaces)
* Back projection of calculated flags from the resampled data onto the original data via :py:func:`mapToOriginal <Functions.saqc.mapToOriginal>`