progress

3e20b867 · Peter Lünenschloß · 00bdb3f4 · 3e20b867 · 3e20b867
Commit 3e20b867 authored 3 years ago by Peter Lünenschloß
--- a/sphinxdoc/cook_books/MultivariateFlagging.rst
+++ b/sphinxdoc/cook_books/MultivariateFlagging.rst
@@ -52,7 +52,7 @@ internally without introducing artificially fill values to them.
   data.index = pd.DatetimeIndex(data.index)
   qc = saqc.SaQC([data, maint])  # dataframes "data" and "maint" are integrated internally

-We can check out the filds, the newly generated :py:class:`~saqc.SaQC` object contains as follows:
+We can check out the fields, the newly generated :py:class:`~saqc.SaQC` object contains as follows:

 .. doctest:: default

@@ -78,7 +78,7 @@ was maintained. Lets have a look at those:
   2017-12-12 14:10:00    2017-12-12 14:50:00
   Name: maint, dtype: object

-Measurements collected while maintenance, are not trustworthy, so any measurement taken, in any of the listed
+Measurements collected while maintenance are not trustworthy, so any measurement taken, in any of the listed
 intervals should be flagged right away. This can be achieved, with the :py:meth:`~saqc.SaQC.flagManual` method. Also,
 we will flag out-of-range values in the data with the :py:meth:`~saqc.SaQC.flagRange` method:

@@ -108,9 +108,9 @@ Lets check out the resulting flags for the *sac254* variable with the :py:meth:`
   >>> qc.plot('sac254_raw') #doctest:+SKIP

 Now we should figure out, what sampling rate the data is intended to have, by accessing the *_raw* variables
-constituting the sensor data. (Since :py:attr:`saqc.SaQC.data` yields a common
+constituting the sensor data. Since :py:attr:`saqc.SaQC.data` yields a common
 `pandas.DataFrame <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html>`_ object, we can index it with
-the desired variables as column names and have a look at the console output to get a first impression.)
+the desired variables as column names and have a look at the console output to get a first impression.

 .. doctest:: default

@@ -152,53 +152,142 @@ But checking out values around *2017-10-29*, we notice, that the sampling rate s
   2017-10-29 08:32:01     39.4337    112.027           10.66
   2017-10-29 08:47:01     40.4987    112.450           10.64

-Those instabilities do bias most statistical evaluations and it is common practize to apply some
+Those instabilities do bias most statistical evaluations and it is common practice to apply some
 :doc:`resampling functions <../funcSummaries/resampling>` onto the data, to obtain a regularly spaced timestamp.
-(See also the :ref:`harmonization tutorial <./cook_books/DataRegularisation:data regularisation> for more informations
+(See also the :ref:`harmonization tutorial <cook_books/DataRegularisation:data regularisation>` for more informations
 on that topic.)

-We will apply :py:meth:`linear harmonisation <saqc.SaQC.linear>`, to interpolate pillar points of multiples of *15*
-minutes linearly.
+We will apply :py:meth:`linear harmonisation <saqc.SaQC.linear>` to all the sensor data variables,
+to interpolate pillar points of multiples of *15* minutes linearly.

+.. doctest:: default
+
+   >>> qc = qc.linear(['sac254_raw', 'level_raw', 'water_temp_raw'], freq='15min')
+
+.. plot::
+   :context: close-figs
+   :include-source: False
+
+   qc = qc.linear(['sac254_raw', 'level_raw', 'water_temp_raw'], freq='15min')
+
+
+The resulting timeseries has regular timestamp and includes only values that evaluate to `NaN` or did pass the range
+check and the maintenance data flagging:
+
+
+.. doctest:: default
+
+   >>> qc.data['sac254_raw'] #doctest:+NORMALIZE_WHITESPACE
+   Timestamp
+   2016-01-01 00:00:00          NaN
+   2016-01-01 00:15:00    18.617873
+   2016-01-01 00:30:00    18.942700
+   2016-01-01 00:45:00    18.858787
+   2016-01-01 01:00:00    18.756467
+                            ...
+   2017-12-31 23:00:00    43.631540
+   2017-12-31 23:15:00    43.613533
+   2017-12-31 23:30:00    43.274033
+   2017-12-31 23:45:00    43.674453
+   2018-01-01 00:00:00          NaN
+   Name: sac254_raw, Length: 70194, dtype: float64
+
+.. plot::
+   :context:
+   :include-source: True
+   :format: doctest
+
+   >>> qc.plot('sac254_raw') # doctest:+SKIP

-* Flagging missing values via :py
-* Flagging missing values via :py:func:`flagMissing <Functions.saqc.flagMissing>`.
-* Flagging out of range values via :py:func:`flagRange <Functions.saqc.flagRange>`.
-* Flagging values, where the Specific Conductance (\ *K25*\ ) drops down to near zero. (via :py:func:`flagGeneric <Functions.saqc.flag>`)
-* Resampling the data via linear Interpolation (:py:func:`linear <Functions.saqc.linear>`).

 Drift Correction
 ----------------

-Exponential Drift
-^^^^^^^^^^^^^^^^^
+The variables *SAK254* and *Turbidity* show drifting behavior originating from dirt, that accumulates on the light
+sensitive sensor surfaces over time. The effect, the dirt accumulation has on the measurement values, is assumed to be
+properly described by an exponential model. The Sensors are cleaned periodocally, resulting in a periodical reset of
+the drifting effect. The Dates and Times of the maintenance events are input to the
+:py:meth:`~saqc.SaQC.correctDrift>`, that will correct the data in between any two such maintenance intervals.

+.. doctest:: default

-* The variables *SAK254* and *Turbidity* show drifting behavior originating from dirt, that accumulates on the light sensitive sensor surfaces over time.  
-* The effect, the dirt accumulation has on the measurement values, is assumed to be properly described by an exponential model.
-* The Sensors are cleaned periodocally, resulting in a periodical reset of the drifting effect. 
-* The Dates and Times of the maintenance events are input to the :py:func:`correctDrift <Functions.saqc.correctDrift>`, that will correct the data in between any two such maintenance intervals.
-
-Linear Long Time Drift
-^^^^^^^^^^^^^^^^^^^^^^
+   >>> qc = qc.correctDrift('sac254_raw', target='sac254_corrected',maintenance_field='maint', model=expDriftModel)

+.. plot::
+   :context: close-figs
+   :include-source: False

-* Afterwards, there remains a long time linear Drift in the *SAK254* and *Turbidity* measurements, originating from scratches, that accumule on the sensors glass lenses over time
-* The lenses are replaced periodically, resulting in a periodical reset of that long time drifting effect
-* The Dates and Times of the lenses replacements are input to the :py:func:`correctDrift <Functions.saqc.correctDrift>`, that will correct the data in between any two such maintenance intervals according to the assumption of a linearly increasing bias.
+   qc = qc.correctDrift('sac254_raw', target='sac254_corrected',maintenance_field='maint', model='exponential')

-Maintenance Intervals Flagging
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Check out results

+.. plot::
+   :context:
+   :include-source: True
+   :format: doctest

-* The *SAK254* and *Turbidity* values, obtained while maintenance, are, of course not trustworthy, thus, all the values obtained while maintenance get flagged via the :py:func:`flagManual <Functions.saqc.flagManual>` method.
-* When maintaining the *SAK254* and *Turbidity* sensors, also the *NO3* sensors get removed from the water - thus, they also have to be flagged via the :py:func:`flagManual <Functions.saqc.flagManual>` method.
+   >>> plt.plot(qc.data_raw['sac254_raw'])
+   >>> plt.plot(qc.data_raw['sac254_corrected'])

 Apply Multivariate Flagging
 ---------------------------

-Basically following the *oddWater* procedure, as suggested in *Talagala, P.D. et al (2019): A Feature-Based Procedure for Detecting Technical Outliers in Water-Quality Data From In Situ Sensors. Water Ressources Research, 55(11), 8547-8568.*
+We are basically following the *oddWater* procedure, as suggested in *Talagala, P.D. et al (2019): A Feature-Based
+Procedure for Detecting Technical Outliers in Water-Quality Data From In Situ Sensors. Water Ressources Research,
+55(11), 8547-8568.*
+
+First we define a transformation we want the variables to be normalized with.
+We just import *scipys* `zscore` function and wrap it, so that it will
+be able to digest *nan* values without returning *nan*
+
+.. testcode:: default
+
+   from scipy.stats import zscore
+   zscore_func = lambda x: zscore(x, nan_policy='omit')
+
+.. plot::
+   :context: close-figs
+   :include-source: False
+
+   from scipy.stats import zscore
+   zscore_func = lambda x: zscore(x, nan_policy='omit')
+
+Now we can pass the function to the :py:meth:`saqc.SaQC.transform` method.
+
+.. testcode:: default
+
+   qc = qc.transform(['sac254_raw', 'level_raw', 'water_temp_raw'], target=['sac_z', 'level_z', 'water_z'], func=zscore_func, freq='30D')
+
+.. plot::
+   :context: close-figs
+   :include-source: False
+
+   qc = qc.transform(['sac254_raw', 'level_raw', 'water_temp_raw'], target=['sac_z', 'level_z', 'water_z'], func=zscore_func, freq='30D')
+
+The idea of the *oddWater* algorithm, is, to assign any timestamp a score, derived from the distance of the *k* nearest
+neighbors of the datapoint related to that score. We can do this, via the :py:meth:`~saqc.SaQC.assignKNNScores` method.
+
+.. testsetup:: default
+
+   qc = qc.assignKNNScore(field=['sac254_z', 'level_z', 'water_temp_z'], target='kNNscores', freq='30D', n=5)
+
+.. plot::
+   :context: close-figs
+   :include-source: True
+   :format: doctest
+
+   >>> qc = qc.assignKNNScore(['sac254_z', 'level_z', 'water_temp_z'], target='kNNscores', freq='30D', n=5)
+   >>> qc.plot('kNNscores') # doctest:+SKIP
+
+Those scores roughly correlate with the isolation of the scored points in the phase space. For example, have a look at
+the phase space of *sac* and *level*
+
+.. plot::
+   :context: close-figs
+   :include-source: True
+   :format: doctest

+   >>> qc.plot('sac_z', phaseplot='level_z') # doctest:+SKIP

 * Variables *SAK254*\ , *Turbidity*\ , *Pegel*\ , *NO3N*\ , *WaterTemp* and *pH* get transformed to comparable scales
 * We are obtaining nearest neighbor scores and assigign those to a new variable, via :py:func:`assignKNNScores <Functions.saqc.assignKNNScores>`.

--- a/sphinxdoc/coredoc/core.py
+++ b/sphinxdoc/coredoc/core.py