.. testsetup:: exampleMV
import matplotlib.pyplot as plt
datapath = './ressources/data/hydro_data.csv'
maintpath = './ressources/data/hydro_maint.csv'
configpath = './ressources/data/hydro_config.csv'
.. plot::
:context:
:include-source: False
import matplotlib
import saqc
import pandas as pd
datapath = '../ressources/data/hydro_data.csv'
maintpath = '../ressources/data/hydro_maint.csv'
data = pd.read_csv(datapath, index_col=0)
maint = pd.read_csv(maintpath, index_col=0)
maint.index = pd.DatetimeIndex(maint.index)
data.index = pd.DatetimeIndex(data.index)
qc = saqc.SaQC([data, maint])
Multivariate Flagging
=====================
This tutorial introduces the usage of SaQC in the context of some more complex flagging and processing techniques.
Mainly, we will see how to apply drift correction to the data and how to perform multivariate flagging.
#. `Data Preparation`_
#. `Drift Correction`_
#. `Multivariate Flagging Procedure`_
#. `Config`_
Data Preparation
----------------
First, import the data (from the repository) and generate an SaQC instance from it. You will need to download the `sensor
data <https://git.ufz.de/rdm-software/saqc/-/blob/develop/sphinxdoc/ressources/data/hydro_data.csv>`_ and the
`maintenance data <https://git.ufz.de/rdm-software/saqc/-/blob/develop/sphinxdoc/ressources/data/hydro_maint.csv>`_
from the `repository <https://git.ufz.de/rdm-software/saqc.git>`_ and make the variables `datapath` and `maintpath`
point at those downloaded files. Note that :py:class:`~saqc.SaQC` digests the loaded data in a list.
This circumvents having to concatenate both datasets into one pandas DataFrame instance, which would introduce
`NaN` values into both datasets wherever their timestamps mismatch. `SaQC` can handle such unaligned data
internally, without introducing artificial fill values.
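The effect mentioned above can be reproduced with plain pandas (a small illustrative sketch with made-up values, not part of the tutorial data): joining two series on mismatching timestamps fills the gaps with `NaN`.

```python
import pandas as pd

# Two series sampled at mismatching timestamps
a = pd.Series([1.0, 2.0],
              index=pd.to_datetime(["2016-01-01 00:00", "2016-01-01 00:15"]),
              name="a")
b = pd.Series([10.0],
              index=pd.to_datetime(["2016-01-01 00:07"]),
              name="b")

# Outer-joining them into one DataFrame introduces NaN wherever
# a timestamp exists in only one of the series
joined = pd.concat([a, b], axis=1)
print(joined)
#                         a     b
# 2016-01-01 00:00:00   1.0   NaN
# 2016-01-01 00:07:00   NaN  10.0
# 2016-01-01 00:15:00   2.0   NaN
```

Passing the datasets to :py:class:`~saqc.SaQC` as a list avoids exactly these artificial fill values.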
.. testcode:: exampleMV
import saqc
import pandas as pd
data = pd.read_csv(datapath, index_col=0)
maint = pd.read_csv(maintpath, index_col=0)
maint.index = pd.DatetimeIndex(maint.index)
data.index = pd.DatetimeIndex(data.index)
qc = saqc.SaQC([data, maint]) # dataframes "data" and "maint" are integrated internally
We can inspect the fields the newly generated :py:class:`~saqc.SaQC` object contains as follows:
.. doctest:: exampleMV
>>> qc.data.columns
Index(['sac254_raw', 'level_raw', 'water_temp_raw', 'maint'], dtype='object', name='columns')
The variables represent measurements of the *water level*, the *specific absorption coefficient* at 254 nm wavelength and
the *water temperature*. There is also a variable, *maint*, that refers to time periods where the *sac254* sensor
was maintained. Let's have a look at those:
.. doctest:: exampleMV
>>> qc.data_raw['maint'] # doctest:+SKIP
Timestamp
2016-01-10 11:15:00 2016-01-10 12:15:00
2016-01-12 14:40:00 2016-01-12 15:30:00
2016-02-10 13:40:00 2016-02-10 14:40:00
2016-02-24 16:40:00 2016-02-24 17:30:00
.... ....
2017-10-17 08:55:00 2017-10-17 10:20:00
2017-11-14 15:30:00 2017-11-14 16:20:00
2017-11-27 09:10:00 2017-11-27 10:10:00
2017-12-12 14:10:00 2017-12-12 14:50:00
Name: maint, dtype: object
Measurements collected during maintenance are not trustworthy, so any measurement taken in any of the listed
intervals should be flagged right away. This can be achieved with the :py:meth:`~saqc.SaQC.flagManual` method. Also,
we will flag out-of-range values in the data with the :py:meth:`~saqc.SaQC.flagRange` method:
.. doctest:: exampleMV
>>> qc = qc.flagManual('sac254_raw', mdata='maint', method='closed', label='Maintenance')
>>> qc = qc.flagRange('level_raw', min=0)
>>> qc = qc.flagRange('water_temp_raw', min=-1, max=40)
>>> qc = qc.flagRange('sac254_raw', min=0, max=60)
.. plot::
:context:
:include-source: False
qc = qc.flagManual('sac254_raw', mdata='maint', method='closed', label='Maintenance')
qc = qc.flagRange('level_raw', min=0)
qc = qc.flagRange('water_temp_raw', min=-1, max=40)
qc = qc.flagRange('sac254_raw', min=0, max=60)
Let's check out the resulting flags for the *sac254* variable with the :py:meth:`~saqc.SaQC.plot` method:
>>> qc.plot('sac254_raw') #doctest:+SKIP
.. plot::
:context:
:include-source: False
:width: 80 %
:class: center
qc.plot('sac254_raw')
Now we should figure out what sampling rate the data is intended to have, by inspecting the *_raw* variables
constituting the sensor data. Since :py:attr:`saqc.SaQC.data` yields a common
`pandas.DataFrame <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html>`_ object, we can index it with
the desired variables as column names and have a look at the console output to get a first impression.
.. doctest:: exampleMV
>>> qc.data[['sac254_raw', 'level_raw', 'water_temp_raw']] # doctest:+NORMALIZE_WHITESPACE
columns sac254_raw level_raw water_temp_raw
Timestamp
2016-01-01 00:02:00 18.4500 103.290 4.84
2016-01-01 00:17:00 18.6437 103.285 4.82
2016-01-01 00:32:00 18.9887 103.253 4.81
2016-01-01 00:47:00 18.8388 103.210 4.80
2016-01-01 01:02:00 18.7438 103.167 4.78
... ... ...
2017-12-31 22:47:00 43.2275 186.060 5.49
2017-12-31 23:02:00 43.6937 186.115 5.49
2017-12-31 23:17:00 43.6012 186.137 5.50
2017-12-31 23:32:00 43.2237 186.128 5.51
2017-12-31 23:47:00 43.7438 186.130 5.53
<BLANKLINE>
[70199 rows x 3 columns]
At first glance, the data seems to have a fairly regular sampling rate of *15* minutes.
But checking out the values around *2017-10-29*, we notice that the sampling rate is not totally stable:
.. doctest:: exampleMV
>>> qc.data[['sac254_raw', 'level_raw', 'water_temp_raw']]['2017-10-29 07:00:00':'2017-10-29 09:00:00'] # doctest:+NORMALIZE_WHITESPACE
columns sac254_raw level_raw water_temp_raw
Timestamp
2017-10-29 07:02:00 40.3050 112.570 10.91
2017-10-29 07:17:00 39.6287 112.497 10.90
2017-10-29 07:32:00 39.5800 112.460 10.88
2017-10-29 07:32:01 39.9750 111.837 10.70
2017-10-29 07:47:00 39.1350 112.330 10.84
2017-10-29 07:47:01 40.6937 111.615 10.68
2017-10-29 08:02:00 40.4938 112.040 10.77
2017-10-29 08:02:01 39.3337 111.552 10.68
2017-10-29 08:17:00 41.5238 111.835 10.72
2017-10-29 08:17:01 38.6963 111.750 10.69
2017-10-29 08:32:01 39.4337 112.027 10.66
2017-10-29 08:47:01 40.4987 112.450 10.64
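A quick way to spot such irregularities is to look at the value counts of the consecutive index differences. The sketch below does this on a small synthetic index mimicking the pattern above (since the tutorial data is only available after download):

```python
import pandas as pd

# Synthetic index mimicking the pattern above: a 15 min grid with
# some extra samples shifted by one second
idx = pd.DatetimeIndex([
    "2017-10-29 07:02:00", "2017-10-29 07:17:00", "2017-10-29 07:32:00",
    "2017-10-29 07:32:01", "2017-10-29 07:47:00", "2017-10-29 07:47:01",
])

# The distribution of consecutive timestamp differences reveals
# deviations from the nominal 15 min sampling rate
diffs = pd.Series(idx).diff().value_counts()
print(diffs)
```

A perfectly regular index would produce a single entry here; any additional entries indicate an unstable sampling rate.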
Such instabilities bias most statistical evaluations, so it is common practice to apply some
:doc:`resampling functions <../funcSummaries/resampling>` to the data to obtain a regularly spaced timestamp.
(See also the :ref:`harmonization tutorial <cook_books/DataRegularisation:data regularisation>` for more information
on that topic.)
We will apply :py:meth:`linear harmonisation <saqc.SaQC.linear>` to all the sensor data variables,
to linearly interpolate pillar points at multiples of *15* minutes.
.. doctest:: exampleMV
>>> qc = qc.linear(['sac254_raw', 'level_raw', 'water_temp_raw'], freq='15min')
.. plot::
:context: close-figs
:include-source: False
qc = qc.linear(['sac254_raw', 'level_raw', 'water_temp_raw'], freq='15min')
The resulting timeseries now has a regular timestamp.
.. doctest:: exampleMV
>>> qc.data['sac254_raw'] #doctest:+NORMALIZE_WHITESPACE
Timestamp
2016-01-01 00:00:00 NaN
2016-01-01 00:15:00 18.617873
2016-01-01 00:30:00 18.942700
2016-01-01 00:45:00 18.858787
2016-01-01 01:00:00 18.756467
...
2017-12-31 23:00:00 43.631540
2017-12-31 23:15:00 43.613533
2017-12-31 23:30:00 43.274033
2017-12-31 23:45:00 43.674453
2018-01-01 00:00:00 NaN
Name: sac254_raw, Length: 70194, dtype: float64
Since the points that were identified as malicious get excluded before the harmonization, the resulting regularly sampled
timeseries does not include them anymore:
.. doctest:: exampleMV
>>> qc.plot('sac254_raw') # doctest:+SKIP
.. plot::
:context:
:include-source: False
:width: 80 %
:class: center
qc.plot('sac254_raw')
Drift Correction
----------------
The variables *SAK254* and *Turbidity* show drifting behavior originating from dirt that accumulates on the light-sensitive
sensor surfaces over time. The effect the dirt accumulation has on the measurement values is assumed to be
properly described by an exponential model. The sensors are cleaned periodically, resulting in a periodic reset of
the drift effect. The dates and times of the maintenance events are input to the method
:py:meth:`~saqc.SaQC.correctDrift`, which will correct the data in between any two such maintenance intervals.
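The exponential model idea can be sketched independently of SaQC: between two maintenance events, the deviation from the value right after cleaning is modeled as an exponential function of elapsed time, fitted, and subtracted. The following is a minimal, self-contained illustration on synthetic data; the function `exp_drift` is our own naming for this sketch, not a SaQC internal.

```python
import numpy as np
from scipy.optimize import curve_fit

def exp_drift(t, a, b):
    # exponential drift accumulating since the last sensor cleaning;
    # zero at t = 0, growing as exp(b * t)
    return a * (np.exp(b * t) - 1)

# synthetic measurements: a constant true signal of 20.0 plus drift
t = np.linspace(0, 10, 50)   # days since last maintenance
true_signal = 20.0
measured = true_signal + exp_drift(t, 0.5, 0.3)

# fit the drift model to the deviation from the first (clean) value
deviation = measured - measured[0]
params, _ = curve_fit(exp_drift, t, deviation, p0=(1.0, 0.1))

# subtracting the fitted drift recovers the undrifted signal
corrected = measured - exp_drift(t, *params)
print(params)  # fitted (a, b)
```

SaQC's :py:meth:`~saqc.SaQC.correctDrift` performs this kind of fit and correction segment-wise, between each pair of maintenance events.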
.. doctest:: exampleMV
>>> qc = qc.correctDrift('sac254_raw', target='sac254_corrected',maintenance_field='maint', model='exponential')
.. plot::
:context: close-figs
:include-source: False
:width: 80 %
:class: center
qc = qc.correctDrift('sac254_raw', target='sac254_corrected',maintenance_field='maint', model='exponential')
Check out the results for the year *2016*:
.. doctest:: exampleMV
>>> plt.plot(qc.data_raw['sac254_raw']['2016'], alpha=.5, color='black', label='original') # doctest:+SKIP
>>> plt.plot(qc.data_raw['sac254_corrected']['2016'], color='black', label='corrected') # doctest:+SKIP
.. plot::
:context:
:include-source: False
plt.figure(figsize=(16,9))
plt.plot(qc.data_raw['sac254_raw']['2016'], alpha=.5, color='black', label='original')
plt.plot(qc.data_raw['sac254_corrected']['2016'], color='black', label='corrected')
plt.legend()
Multivariate Flagging Procedure
-------------------------------
We basically follow the *oddWater* procedure, as suggested in *Talagala, P.D. et al (2019): A Feature-Based
Procedure for Detecting Technical Outliers in Water-Quality Data From In Situ Sensors. Water Resources Research,
55(11), 8547-8568.*
First, we define a transformation to apply to the variables, to make them equally significant in
their common feature space. We go for the common pick of just *zScoring* the variables.
Therefore, we import *scipy's* `zscore` function and wrap it, so that it can digest *nan* values
without returning *nan*.
.. doctest:: exampleMV
>>> from scipy.stats import zscore
>>> zscore_func = lambda x: zscore(x, nan_policy='omit')
.. plot::
:context: close-figs
:include-source: False
from scipy.stats import zscore
zscore_func = lambda x: zscore(x, nan_policy='omit')
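To see why the wrapper is needed: with scipy's default `nan_policy`, a single `NaN` in the input propagates into every output value, since mean and standard deviation become `NaN` as well. A quick standalone check:

```python
import numpy as np
from scipy.stats import zscore

x = np.array([1.0, 2.0, 3.0, np.nan])

# default policy ('propagate'): the NaN poisons mean/std,
# so *all* scores come out as NaN
print(zscore(x))  # [nan nan nan nan]

# 'omit': NaNs are ignored when computing mean/std, and the
# finite values receive proper scores
print(zscore(x, nan_policy='omit'))
```

With `nan_policy='omit'`, the `NaN` entries stay `NaN` in the output, while all finite values are scored against the statistics of the non-missing data.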
Now we can pass the function to the :py:meth:`~saqc.SaQC.transform` method.
.. doctest:: exampleMV
>>> qc = qc.transform(['sac254_corrected', 'level_raw', 'water_temp_raw'], target=['sac_z', 'level_z', 'water_z'], func=zscore_func, freq='30D')
.. plot::
:context: close-figs
:include-source: False
:width: 80 %
:class: center
qc = qc.transform(['sac254_corrected', 'level_raw', 'water_temp_raw'], target=['sac_z', 'level_z', 'water_z'], func=zscore_func, freq='30D')
The idea of the multivariate flagging approach we are going for is to assign every datapoint a score derived
from the distance this datapoint has to its *k* nearest
neighbors in feature space. We can do this via the :py:meth:`~saqc.SaQC.assignKNNScore` method.
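The scoring idea itself can be sketched in a few lines of plain numpy (a standalone illustration on synthetic data, not SaQC's implementation): score each point by the sum of its distances to the *k* nearest other points, so isolated points receive large scores.

```python
import numpy as np

# a tight 2D cluster plus one isolated point
rng = np.random.default_rng(0)
cluster = rng.normal(0, 0.1, size=(50, 2))
points = np.vstack([cluster, [[5.0, 5.0]]])

# pairwise euclidean distances
diff = points[:, None, :] - points[None, :, :]
dists = np.sqrt((diff ** 2).sum(axis=-1))

k = 5
# for each point: sort its distances, skip the zero self-distance,
# and sum the k smallest remaining ones as its kNN score
scores = np.sort(dists, axis=1)[:, 1:k + 1].sum(axis=1)
print(scores.argmax())  # the isolated point (index 50)
```

:py:meth:`~saqc.SaQC.assignKNNScore` computes scores of this kind window-wise (here, per `freq='30D'` chunk) over the selected feature variables.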
.. doctest:: exampleMV
>>> qc = qc.assignKNNScore(field=['sac_z', 'level_z', 'water_z'], target='kNNscores', freq='30D', n=5)
Let's have a look at the resulting score variable.
.. doctest:: exampleMV
>>> qc.plot('kNNscores') # doctest:+SKIP
.. plot::
:context: close-figs
:include-source: False
:width: 80 %
:class: center
qc = qc.assignKNNScore(field=['sac_z', 'level_z', 'water_z'], target='kNNscores', freq='30D', n=5)
qc.plot('kNNscores')
Those scores roughly correlate with the isolation of the scored points in the feature space. For example, have a look at
the projection of this feature space onto the two-dimensional *sac* - *level* space in November 2016:
.. doctest:: exampleMV
>>> qc.plot('sac_z', phaseplot='level_z', xscope='2016-11') # doctest:+SKIP
.. plot::
:context: close-figs
:include-source: False
:width: 80 %
:class: center
qc.plot('sac_z', phaseplot='level_z', xscope='2016-11')
We can clearly see some outliers that seem to be isolated from the cloud of normal points. Since those outliers are
correlated with relatively high *kNNscores*, we could try to calculate a threshold that determines how extreme a
*kNN* score has to be to qualify as an outlier. For this, we will use the saqc implementation of the
`STRAY <https://arxiv.org/pdf/1908.04000.pdf>`_ algorithm, which is available as the method
:py:meth:`~saqc.SaQC.flagByStray`. This method will mark some samples of the `kNNscores` variable as anomalies.
Subsequently, we project these marks (or *flags*) onto the *sac* variable with a call to
:py:meth:`~saqc.SaQC.transferFlags`. For the sake of demonstration, we also project the flags
onto the normalized *sac* and plot the flagged values in the *sac_z* - *level_z* feature space.
.. doctest:: exampleMV
>>> qc = qc.flagByStray(field='kNNscores', freq='30D', alpha=.3)
>>> qc = qc.transferFlags(field='kNNscores', target='sac254_corrected', label='STRAY')
>>> qc = qc.transferFlags(field='kNNscores', target='sac_z', label='STRAY')
>>> qc.plot('sac254_corrected', xscope='2016-11') # doctest:+SKIP
>>> qc.plot('sac_z', phaseplot='level_z', xscope='2016-11') # doctest:+SKIP
.. plot::
:context: close-figs
:include-source: False
qc = qc.flagByStray(field='kNNscores', freq='30D', alpha=.3)
qc = qc.transferFlags(field='kNNscores', target='sac254_corrected', label='STRAY')
qc = qc.transferFlags(field='kNNscores', target='sac_z', label='STRAY')
.. plot::
:context: close-figs
:include-source: False
:width: 80 %
:class: center
qc.plot('sac254_corrected', xscope='2016-11')
.. plot::
:context: close-figs
:include-source: False
:width: 80 %
:class: center
qc.plot('sac_z', phaseplot='level_z', xscope='2016-11')
Config
------
.. testcode:: exampleMV
:hide:
saqc.fromConfig(configpath, [data, maint])
To configure `saqc` to execute the above data processing and flagging steps, the config file would have to look
as follows:
.. literalinclude:: ../ressources/data/hydro_config.csv