Commit 1ef08832 authored by David Schäfer

removed obsolete *_md folders

parent 65911f8b
# Multivariate Flagging
This tutorial introduces the usage of SaQC in the context of some more complex flagging and processing techniques.
Mainly, we will see how to apply drift corrections to the data and how to perform multivariate flagging.
1. [Data Preparation](#Data-Preparation)
2. [Drift Correction](#Drift-Correction)
3. [Multivariate Flagging (odd Water)](#Multivariate-Flagging)
## Data Preparation
* Flagging missing values via :py:func:`flagMissing <Functions.saqc.flagMissing>`.
* Flagging out-of-range values via :py:func:`flagRange <Functions.saqc.flagRange>`.
* Flagging values where the Specific Conductance (*K25*) drops down to near zero, via :py:func:`flagGeneric <Functions.saqc.flag>`.
* Resampling the data via linear interpolation (:py:func:`linear <Functions.saqc.linear>`), as sketched below.
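A minimal sketch of how these preparation steps might be chained on an :py:class:`SaQC <saqc.core.core.SaQC>` object. The bounds, the near-zero threshold and the resampling frequency are illustrative assumptions; only the *K25* variable is taken from the description above.
```python
import saqc

qc = saqc.SaQC(data)  # `data`: a pandas.DataFrame with a DatetimeIndex (assumed)

qc = qc.flagMissing(field='K25')                          # flag gaps in the record
qc = qc.flagRange(field='K25', min=0, max=60)             # bounds are placeholders
qc = qc.flagGeneric(field='K25', func=lambda x: x < 0.1)  # near-zero conductance
qc = qc.linear(field='K25', freq='15Min')                 # resample via linear interpolation
```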
## Drift Correction
### Exponential Drift
* The variables *SAK254* and *Turbidity* show drifting behavior originating from dirt that accumulates on the light-sensitive sensor surfaces over time.
* The effect the dirt accumulation has on the measurement values is assumed to be properly described by an exponential model.
* The sensors are cleaned periodically, resulting in a periodical reset of the drifting effect.
* The dates and times of the maintenance events are input to :py:func:`correctDrift <Functions.saqc.correctDrift>`, which will correct the data in between any two such maintenance intervals. (Find a formal description of the process [here](sphinx-doc/misc_md/ExponentialModel.md).)
### Linear Long Time Drift
* Afterwards, there remains a long-term linear drift in the *SAK254* and *Turbidity* measurements, originating from scratches that accumulate on the sensors' glass lenses over time.
* The lenses are replaced periodically, resulting in a periodical reset of that long-term drifting effect.
* The dates and times of the lens replacements are input to :py:func:`correctDrift <Functions.saqc.correctDrift>`, which will correct the data in between any two such maintenance intervals according to the assumption of a linearly increasing bias.
### Maintenance Intervals Flagging
* The *SAK254* and *Turbidity* values obtained during maintenance are, of course, not trustworthy; thus, all the values obtained during maintenance get flagged via the :py:func:`flagManual <Functions.saqc.flagManual>` method.
* When maintaining the *SAK254* and *Turbidity* sensors, the *NO3* sensors also get removed from the water - thus, they have to be flagged via the :py:func:`flagManual <Functions.saqc.flagManual>` method as well (see the sketch below).
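A hedged sketch of the drift handling and maintenance flagging described above. The `maintenance` series and the keyword names (`maintenance_field`, `model`, `mdata`, `method`) are assumptions for illustration - consult the :py:func:`correctDrift <Functions.saqc.correctDrift>` and :py:func:`flagManual <Functions.saqc.flagManual>` documentation for the authoritative signatures.
```python
# `maintenance`: a pandas.Series whose index holds the start and whose values
# hold the end timestamps of the maintenance events (an assumption)
for var in ['SAK254', 'Turbidity']:
    # correct the exponential dirt-accumulation drift between two cleanings
    qc = qc.correctDrift(field=var, maintenance_field='maintenance', model='exponential')
    # flag the values recorded during the maintenance events themselves
    qc = qc.flagManual(field=var, mdata=maintenance, method='closed')

# the NO3 sensor is taken out of the water during those events as well
qc = qc.flagManual(field='NO3N', mdata=maintenance, method='closed')
```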
## Multivariate Flagging
Basically following the *oddWater* procedure, as suggested in *Talagala, P.D. et al (2019): A Feature-Based Procedure for Detecting Technical Outliers in Water-Quality Data From In Situ Sensors. Water Resources Research, 55(11), 8547-8568.*
* The variables *SAK254*, *Turbidity*, *Pegel*, *NO3N*, *WaterTemp* and *pH* get transformed to comparable scales.
* We obtain nearest neighbor scores and assign those to a new variable, via :py:func:`assignKNNScore <Functions.saqc.assignKNNScore>`.
* We apply the *STRAY* algorithm to find the cutoff points for the scores, above which values qualify as outliers (:py:func:`flagByStray <Functions.saqc.flagByStray>`).
* We project the calculated flags back onto the input variables, e.g. via a generic flagging expression (see the sketch below).
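Put together, the scoring and flagging could be sketched like this. The kNN parameter `n`, the z-score transformation and the score field name are illustrative assumptions:
```python
fields = ['SAK254', 'Turbidity', 'Pegel', 'NO3N', 'WaterTemp', 'pH']

# bring the variables onto comparable scales, e.g. by z-scoring each of them
for var in fields:
    qc = qc.transform(field=var, func=lambda x: (x - x.mean()) / x.std())

# condense the variables into one kNN outlier score per timestamp
qc = qc.assignKNNScore(field=fields, target='kNN_scores', n=10)

# let STRAY find the cutoff above which scores qualify as outliers
qc = qc.flagByStray(field='kNN_scores')

# project the resulting flags back onto the input variables
for var in fields:
    qc = qc.flagGeneric(field='kNN_scores', target=var, func=lambda x: isflagged(x))
```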
## Postprocessing
* (Flags reduction onto subspaces)
* Back-projection of calculated flags from the resampled data onto the original data via :py:func:`mapToOriginal <Functions.saqc.mapToOriginal>`.
# Outlier Detection and Flagging
This tutorial introduces the usage of `saqc` methods for detecting outliers in a univariate setup.
The tutorial guides through the following steps:
1. We check out and load the example data set. Subsequently, we initialise a :py:class:`SaQC <saqc.core.core.SaQC>` object.
* [Preparation](#Preparation)
* [Data](#Data)
* [Initialisation](#Initialisation)
2. We will see how to apply different smoothing methods and models to the data in order to obtain useful residue
variables.
* [Modelling](#Modelling)
* [Rolling Mean](#Rolling-Mean)
* [Rolling Median](#Rolling-Median)
* [Polynomial Fit](#Polynomial-Fit)
* [Custom Models](#Custom-Models)
* [Evaluation and Visualisation](#Evaluation-and-Visualisation)
3. We will see how we can obtain residues and scores from the calculated model curves.
* [Residues and Scores](#Residues-and-Scores)
* [Residues](#Residues)
* [Scores](#Scores)
* [Optimization from Decomposition](#Optimization-from-Decomposition)
4. Finally, we will see how to derive flags from the scores themselves and impose additional conditions, functioning as
correctives.
* [Setting and Correcting Flags](#Setting-and-Correcting-Flags)
* [Flagging the Scores](#Flagging-the-Scores)
* [Additional Conditions ("unflagging")](#Additional-Conditions)
* [Including Multiple Conditions](#Including-Multiple-Conditions)
## Preparation
### Data
The example [data set](https://git.ufz.de/rdm-software/saqc/-/blob/cookBux/sphinx-doc/ressources/data/incidentsLKG.csv)
is selected to be small and comprehensible, and its single anomalous outlier
can easily be identified visually:
![](../ressources/images/cbooks_incidents1.png)
The data represents incidents of SARS-CoV-2 infections, on a daily basis, as reported by the
[RKI](https://www.rki.de/DE/Home/homepage_node.html) in 2020.
In June, an extreme spike can be observed. This spike relates to a so-called "superspreading" event in a local
[meat factory](https://www.heise.de/tp/features/Superspreader-bei-Toennies-identifiziert-4852400.html).
For the sake of modelling the spread of Covid, it can be advantageous to filter the data for such extreme events, since
they may not be consistent with underlying distributional assumptions and would thus interfere with the parameter learning
process of the modelling. It can also help to learn about the conditions that severely facilitate infection rates.
To introduce some basic `SaQC` workflows, we will concentrate on classic variance-based outlier detection approaches.
### Initialisation
We initially want to import the data into our workspace. To this end, we import the [pandas](https://pandas.pydata.org/)
library and use its csv file parser [pd.read_csv](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
```python
import pandas as pd
i_data = pd.read_csv(data_path)  # data_path: the path to the downloaded incidentsLKG.csv
```
The resulting `i_data` variable is a pandas [data frame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html)
object. We can generate an SaQC object directly from that. Beforehand, we have to make sure that the index
of `i_data` is of the right type.
```python
i_data.index = pd.DatetimeIndex(i_data.index)
```
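Both steps can also be combined into a single call, letting pandas parse the first column as a datetime index directly:
```python
i_data = pd.read_csv(data_path, index_col=0, parse_dates=True)
```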
Now we load the saqc package into the workspace and generate an instance of the :py:class:`SaQC <saqc.core.core.SaQC>` object
that refers to the loaded data.
```python
import saqc
i_saqc = saqc.SaQC(i_data)
```
By evaluating :py:attr:`i_saqc.fields`, we can check out the variables present in the data.
```python
>>> i_saqc.fields
['incidents']
```
So the only data present is the *incidents* dataset. We can have a look at the data and obtain the above plot through
the method :py:meth:`i_saqc.show <saqc.core.core.SaQC.show>`:
```python
>>> i_saqc.show('incidents')
```
## Modelling
First, we want to model our data in order to obtain a stationary, residue-like variable with zero mean.
### Rolling Mean
The easiest thing to do would be to apply some rolling mean
model via the method :py:func:`saqc.roll <Functions.saqc.roll>`.
```python
>>> import numpy as np
>>> i_saqc = i_saqc.roll(field='incidents', target='incidents_mean', func=np.mean, winsz='13D')
```
The :py:attr:`field` parameter is passed the name of the variable we want to calculate the rolling mean of.
The :py:attr:`target` parameter holds the name we want to store the results of the calculation to.
The :py:attr:`winsz` parameter controls the size of the rolling window. It can be fed any so-called [date alias](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases) string. We chose the rolling window to have a 13-day span.
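For intuition: this date-alias based windowing mirrors what pandas itself offers on datetime-indexed series, so the call above corresponds roughly to the following plain-pandas expression:
```python
# plain-pandas analogue of the rolling mean above (for intuition only)
i_data['incidents'].rolling('13D').mean()
```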
### Rolling Median
You can pass arbitrary function objects to the :py:attr:`func` parameter, to be applied in the calculation of every single window's "score".
For example, you could go for the *median* instead of the *mean*. The numpy library provides a [median](https://numpy.org/doc/stable/reference/generated/numpy.median.html) function
under the name `np.median`. We just calculate another model curve for the `"incidents"` data with the `np.median` function from the `numpy` library.
```python
>>> i_saqc = i_saqc.roll(field='incidents', target='incidents_median', func=np.median, winsz='13D')
```
We chose another :py:attr:`target` value for the rolling *median* calculation, in order not to overwrite our results from
the previous rolling *mean* calculation.
The :py:attr:`target` parameter can be passed to any call of a function from the
saqc functions pool; it determines the field name the function's result is written to in the
data. If there already exists a field with the name passed to `target`,
the data stored in this field will be overwritten.
We will evaluate and visualize the different model curves [later](#Evaluation-and-Visualisation).
Beforehand, we will generate some more model data.
### Polynomial Fit
Another common approach is to fit polynomials of certain degrees to the data.
:py:class:`SaQC <saqc.core.core.SaQC>` provides the polynomial fit function :py:func:`saqc.fitPolynomial <Functions.saqc.fitPolynomial>`:
```python
>>> i_saqc = i_saqc.fitPolynomial(field='incidents', target='incidents_polynomial', polydeg=2, winsz='13D')
```
It also takes a :py:attr:`winsz` parameter, determining the size of the fitting window.
The parameter :py:attr:`polydeg` refers to the degree of the polynomials that get fitted to the data in every window.
### Custom Models
If you want to apply a completely arbitrary function to your data, without pre-chunking it by a rolling window,
you can make use of the more general :py:func:`saqc.processGeneric <Functions.saqc.process>` function.
Let's apply a smoothing filter from the [scipy.signal](https://docs.scipy.org/doc/scipy/reference/signal.html)
module. We wrap the filter generator up into a function first:
```python
from scipy.signal import filtfilt, butter

def butterFilter(x, filter_order, nyq, cutoff, filter_type='lowpass'):
    # design the Butterworth filter and apply it forwards and backwards (zero phase)
    b, a = butter(N=filter_order, Wn=cutoff / nyq, btype=filter_type)
    return filtfilt(b, a, x)
```
We can pass this function object on to the :py:attr:`func` argument of the :py:func:`saqc.processGeneric <Functions.saqc.process>` method.
(Some more information on the generic functions can be found [here](sphinx-doc/getting_started_md/GenericFunctions.md).)
```python
i_saqc = i_saqc.processGeneric(field='incidents', target='incidents_lowPass', func=lambda x: butterFilter(x, cutoff=0.1, nyq=0.5, filter_order=2))
```
## Evaluation and Visualisation
Now, we can evaluate the data processing functions queued to the :py:class:`SaQC <saqc.core.core.SaQC>` object with the
:py:func:`saqc.evaluate <saqc.core.core.SaQC.evaluate>` method.
```python
>>> i_saqc = i_saqc.evaluate()
```
This will give us an updated :py:class:`SaQC <saqc.core.core.SaQC>` object, in which the internal data information
is updated according to the methods we stacked to be applied.
We can obtain this updated information by generating a [pandas dataframe](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html)
representation of it, with the :py:meth:`saqc.getResult <saqc.core.core.SaQC.getResult>` method:
```python
>>> data = i_saqc.getResult()[0]
```
To see all the results obtained so far, plotted in one figure window, we make use of the dataframe's [plot](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html) method.
```python
>>> data.plot()
```
![](../ressources/images/cbooks_incidents2.png)
## Residues and Scores
### Residues
We want to evaluate the residues of one of our models, in order to score the outlierish-ness of every point.
Therefore, we just stick to the initially calculated rolling mean curve.
First, we retrieve the residues via the :py:func:`saqc.processGeneric <Functions.saqc.process>` method.
This method always comes into play when we want to obtain variables resulting from basic algebraic
manipulations of one or more input variables.
For obtaining the model's residues, we just subtract the model data from the original data and assign the result
of this operation to a new variable, called `incidents_residues`. This assignment, we, as usual,
control via the :py:attr:`target` parameter.
```python
i_saqc = i_saqc.processGeneric(field=['incidents', 'incidents_mean'], target='incidents_residues', func=lambda x, y: x - y)
```
### Scores
Next, we score the residues simply by computing their [Z-scores](https://en.wikipedia.org/wiki/Standard_score).
The Z-score of a point $`x`$, relative to its surrounding $`D`$, evaluates to $`Z(x) = \frac{x - \mu(D)}{\sigma(D)}`$.
So, if we would like to roll with a window of a fixed size of *27* periods through the data and calculate the *Z*-score
for the point lying in the center of every window, we would define our function `z_score`:
```python
z_score = lambda D: abs((D[13] - np.mean(D)) / np.std(D))  # index 13 is the center of a 27-value window
```
And subsequently, use the :py:func:`saqc.roll <Functions.saqc.roll>` method to make a rolling window application with the scoring
function:
```python
i_saqc = i_saqc.roll(field='incidents_residues', target='incidents_scores', func=z_score, winsz=27)
```
### Optimization from Decomposition
There are two problems with the attempt presented [above](#Scores).
First, the rolling application of the custom-defined
function might get really slow for large data sets, because our function `z_score` does not get decomposed into optimized building blocks.
Second, and maybe more importantly, it relies heavily on every window having a fixed number of values and a fixed temporal extension.
Otherwise, `D[13]` might not always be the value in the middle of the window, or it might not even exist,
and an error will be thrown.
So the attempt works fine only because our data set is small and strictly regularly sampled,
meaning that it has constant temporal distances between subsequent measurements.
In order to tweak our calculations and make them much more stable, it might be useful to decompose the scoring
into separate calls to the :py:func:`saqc.roll <Functions.saqc.roll>` function, by calculating the series of the
residues' *mean* and *standard deviation* separately:
```python
i_saqc = i_saqc.roll(field='incidents_residues', target='residues_mean',
                     winsz='27D', func=np.mean)
i_saqc = i_saqc.roll(field='incidents_residues', target='residues_std',
                     winsz='27D', func=np.std)
```
With huge datasets, this will be noticeably faster, compared to the method presented [initially](#Scores),
because `saqc` dispatches the rolling with the basic numpy statistics methods to an optimized pandas built-in.
Also, as a result of :py:func:`saqc.roll <Functions.saqc.roll>` assigning its results to the center of every window,
all the values are centered and we don't have to care about window center indices when we are generating
the *Z*-scores from the two series.
We simply combine them via the
:py:func:`saqc.processGeneric <Functions.saqc.generic>` method, in order to obtain the scores:
```python
i_saqc = i_saqc.processGeneric(field=['incidents_residues', 'residues_mean', 'residues_std'], target='incidents_scores', func=lambda x, y, z: abs((x - y) / z))
```
Let's evaluate the residues calculation and have a look at the resulting scores:
```python
i_saqc = i_saqc.evaluate()
i_saqc.show('incidents_scores')
```
![](../ressources/images/cbook_incidents_scoresUnflagged.png)
## Setting and Correcting Flags
### Flagging the Scores
We can now implement the common [rule of thumb](https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule),
that any *Z*-score value above *3* may indicate an outlierish data point,
by applying the :py:func:`saqc.flagRange <Functions.saqc.flagRange>` method with a :py:attr:`max` value of *3*.
```python
i_saqc = i_saqc.flagRange('incidents_scores', max=3).evaluate()
```
Now flags have been calculated for the scores:
```python
>>> i_saqc.show('incidents_scores')
```
![](../ressources/images/cbooks_incidents_scores.png)
### Projecting Flags
We can now project those flags onto our original incidents timeseries:
```python
>>> i_saqc = i_saqc.flagGeneric(field=['incidents_scores'], target='incidents', func=lambda x: isFlagged(x))
```
Note that we could have skipped the [range flagging step](#Flagging-the-Scores), by including the cutoff directly in our
generic expression:
```python
>>> i_saqc = i_saqc.flagGeneric(field=['incidents_scores'], target='incidents', func=lambda x: x > 3)
```
Let's check out the results:
```python
>>> i_saqc = i_saqc.evaluate()
>>> i_saqc.show('incidents')
```
![](../ressources/images/cbooks_incidentsOverflagged.png)
Obviously, some flags are set that, relative to their 13-day surroundings, might relate to minor incident spikes,
but not to the superspreading events we are looking for.
Especially the leftmost flag seems not to relate to an extreme event at all.
This overflagging stems from those values having surroundings with very low data variance, which makes them evaluate to relatively high Z-scores.
There are a lot of possibilities to tackle the issue. In the next section, we will see how we can improve the flagging results
by incorporating additional domain knowledge.
## Additional Conditions
In order to improve our flagging result, we could additionally assume that the events we are interested in
are those with an incident count deviating by a margin of more than
*20* from the two-week average.
This is equivalent to imposing the additional condition, that an outlier must relate to a sufficiently large residue.
### Unflagging
We can do that posterior to the preceding flagging step, by *removing*
some flags based on some condition.
Since we want to *unflag* those values that do not relate to
sufficiently large residues, we assign them the :py:const:`unflagged <saqc.constants.UNFLAGGED>` flag.
For this, we make use of the :py:func:`saqc.flagGeneric <Functions.saqc.flag>` method.
This method usually comes into play when we want to assign flags based on the evaluation of logical expressions.
So we check which residues evaluate to a level below *20*, and assign the
flag value for :py:const:`unflagged <saqc.constants.UNFLAGGED>`. This value defaults
to `-np.inf` in the default translation scheme, which we selected implicitly by not specifying any special scheme in the
generation of the :py:class:`SaQC <saqc.core.core.SaQC>` object in the [beginning](#Initialisation).
```python
>>> i_saqc = i_saqc.flagGeneric(field=['incidents', 'incidents_residues'], func=lambda x, y: isflagged(x) & (y < 20), flag=-np.inf)
```
Notice that we passed the desired flag level to the :py:attr:`flag` keyword in order to perform an
"unflagging" instead of the usual flagging. The :py:attr:`flag` keyword can be passed to all the functions
and defaults to the selected translation scheme's :py:const:`bad <saqc.constants.BAD>` flag level.
Evaluating and showing the results proves that the tweaking did indeed improve the flagging:
```python
>>> i_saqc = i_saqc.evaluate()
>>> i_saqc.show()
```
![](../ressources/images/cbooks_incidents_correctFlagged.png)
### Including Multiple Conditions
If we do not want to first set flags, only to remove the majority of them in the next step, we can also
circumvent the [unflagging](#Unflagging) step by adding the condition that the residues must be above *20* directly to the generic expression:
```python
>>> i_saqc = i_saqc.flagGeneric(field=['incidents_scores', 'incidents_residues'], target='incidents', func=lambda x, y: (x > 3) & (y > 20))
>>> i_saqc = i_saqc.evaluate()
>>> i_saqc.show()
```
![](../ressources/images/cbooks_incidents_correctFlagged.png)
# Configuration Files
The behaviour of SaQC can be completely controlled by a text based configuration file.
## Format
SaQC expects configuration files to be semicolon-separated text files with a
fixed header. Each row of the configuration file lists
one variable and one or several test functions that are applied to the given variable.
### Header names
The header names are basically fixed, but if you really insist on custom
configuration headers, have a look [here](saqc/core/config.py).
| Name | Data Type | Description | Required |
|---------|----------------------------------------------|------------------------|----------|
| varname | string | name of a variable | yes |
| test | [function notation](#test-function-notation) | test function | yes |
| plot | boolean (`True`/`False`) | plot the test's result | no |
### Test function notation
The notation of test functions follows the function call notation of Python and
many other programming languages and looks like this:
```
flagRange(min=0, max=100)
```
Here the function `flagRange` is called and the values `0` and `100` are passed
to the parameters `min` and `max` respectively. As we value readability
of the configuration more than conciseness of the extension language, only
keyword arguments are supported. That means that the notation `flagRange(0, 100)`
is not a valid replacement for the above example.
## Examples
### Single Test
Every row lists one test per variable. If you want to call multiple tests on
a specific variable (and you probably want to), list them in separate rows:
```
varname | test
#-------|----------------------------------
x | flagMissing()
x | flagRange(min=0, max=100)
x | constants_flagBasic(window="3h")
y | flagRange(min=-10, max=40)
```
### Multiple Tests
A row lists multiple tests for a specific variable in separate columns. All test
columns need to share the common prefix `test`:
```
varname ; test_1 ; test_2 ; test_3
#-------;----------------------------;---------------------------;---------------------------------
x ; flagMissing() ; flagRange(min=0, max=100) ; constants_flagBasic(window="3h")
y ; flagRange(min=-10, max=40) ; ;
```
The evaluation of such a configuration is in column-major order, so the given
example is identical to the following:
```
varname ; test_1
#-------;---------------------------------
x ; flagMissing()
y ; flagRange(min=-10, max=40)
x ; flagRange(min=0, max=100)
x ; constants_flagBasic(window="3h")
```
### Plotting
As the process of finding a good quality check setup is somewhat experimental, SaQC
provides a possibility to plot the results of the test functions. To use this feature, add the optional column `plot` and set it
to `True` for all results you want to plot. These plots are
meant to provide a quick and easy visual evaluation of the test.
```
varname ; test ; plot
#-------;----------------------------------;-----
x ; flagMissing() ;
x ; flagRange(min=0, max=100) ; False
x ; constants_flagBasic(window="3h") ; True
y       ; flagRange(min=-10, max=40)       ;
```
### Regular Expressions in `varname` column
Some of the tests (e.g. checks for missing values, range tests or interpolation
functions) are very likely to be used on all or at least several variables of
the processed dataset. As it becomes quite cumbersome to list all these
variables separately, only to call the same functions with the same
parameters, SaQC supports regular expressions
within the `varname` column. Please note that a `varname` needs to be quoted
(with `'` or `"`) in order to be interpreted as a regular expression.
```
varname ; test
#----------;------------------------------
'.*' ; harm_shift2Grid(freq="15Min")
'(x | y)'  ; flagMissing()
```
# Customizations
SaQC comes with a continuously growing number of pre-implemented
[quality check and processing routines](sphinx-doc/getting_started_md/FunctionIndex.md) and
flagging schemes.
For any sufficiently large use case however it is very likely that the
functions provided won't fulfill all your needs and requirements.
Acknowledging the impossibility to address all imaginable use cases, we
designed the system to allow for extensions and customizations. The main extension options, namely
[quality check routines](#custom-quality-check-routines)
and the [flagging scheme](#custom-flagging-schemes),
are described within this document.
## Custom quality check routines
In case you are missing quality check routines, you are of course very
welcome to file a feature request issue on the project's
[gitlab repository](https://git.ufz.de/rdm-software/saqc). However, if
you are more the "no-way-I-get-this-done-by-myself" type of person,
SaQC provides two ways to integrate custom routines into the system:
1. The [extension language](sphinx-doc/getting_started_md/GenericFunctions.md)
2. An [interface](#interface) to the evaluation machinery
### Interface
In order to make a function usable within the evaluation framework of SaQC the following interface is needed:
```python
import dios
import saqc

def yourTestFunction(
    data: dios.DictOfSeries,
    field: str,
    flags: saqc.Flags,
    *args,
    **kwargs
) -> (dios.DictOfSeries, saqc.Flags):
    ...
```
#### Argument Descriptions
| Name | Description |
|-----------|--------------------------------------------------------------------------------------------------|
| `data` | The actual dataset. |
| `field`   | The field/column within `data` that the function is processing.                                   |
| `flags`   | An instance of `saqc.Flags`, holding the quality flags assigned to `data`.                        |
| `args` | Any other arguments needed to parameterize the function. |
| `kwargs` | Any other keyword arguments needed to parameterize the function. |
### Integrate into SaQC
In order to make your function available to the system, it needs to be registered. We provide a decorator
[`register`](saqc/functions/register.py) with saqc, to integrate your
test functions into SaQC. Here is a complete dummy example:
```python
from saqc import register
@register()
def yourTestFunction(data, field, flags, *args, **kwargs):
    return data, flags
```
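Once registered, the function should become callable like any built-in routine. A hedged sketch of what that might look like (the exact dispatch behavior depends on the SaQC version):
```python
import pandas as pd
import saqc

data = pd.DataFrame({'x': [1, 2, 3]}, index=pd.date_range('2021-01-01', periods=3))
qc = saqc.SaQC(data)
qc = qc.yourTestFunction(field='x')  # dispatched to the registered routine
```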
### Example
The function [`flagRange`](saqc/funcs/functions.py) provides a simple, yet complete implementation of
a quality check routine. You might want to look into its implementation as a reference for your own.
## Custom flagging schemes
Sorry for the inconvenience! Coming soon...
# DMP flagging scheme
## Possible flags
The DMP scheme produces the following flag constants:
* "ok"
* "doubtfull"
* "bad"
# Generic Functions
## Generic Flagging Functions
Generic flagging functions provide a way to implement cross-variable quality
constraints, as well as simple quality checks, directly within the configuration.
### Why?
In most real world datasets many errors
can be explained by the dataset itself. Think of an active, fan-cooled
measurement device: no matter how precisely the instrument may work, problems
are to be expected when the fan stops working or the power supply
drops below a certain threshold. While these dependencies are easy to
[formalize](#a-real-world-example) on a per dataset basis, it is quite
challenging to translate them into generic source code.
### Specification
Generic flagging functions are used in the same manner as their
[non-generic counterparts](sphinx-doc/getting_started_md/FunctionIndex.md). The basic
signature looks like this:
```sh
flagGeneric(func=<expression>, flag=<flagging_constant>)
```
where `<expression>` is composed of the [supported constructs](#supported-constructs)
and `<flagging_constant>` is one of the predefined
[flagging constants](ParameterDescriptions.md#flagging-constants) (default: `BAD`).
Generic flagging functions are expected to return a boolean value, i.e. `True` or `False`. All other expressions will
fail during the runtime of `SaQC`.
### Examples
#### Simple comparisons
##### Task
Flag all values of `x` where `y` falls below 0.
##### Configuration file
```
varname ; test
#-------;------------------------
x ; flagGeneric(func=y < 0)
```
#### Calculations
##### Task
Flag all values of `x` that exceed 3 standard deviations of `y`.
##### Configuration file
```
varname ; test
#-------;---------------------------------
x ; flagGeneric(func=x > std(y) * 3)
```
#### Special functions
##### Task
Flag all values of `x` where: `y` is flagged and `z` has missing values.
##### Configuration file
```
varname ; test
#-------;----------------------------------------------
x ; flagGeneric(func=isflagged(y) & ismissing(z))
```
#### A real world example
Let's consider the following dataset:
| date | meas | fan | volt |
|------------------|------|-----|------|
| 2018-06-01 12:00 | 3.56 | 1 | 12.1 |
| 2018-06-01 12:10 | 4.7 | 0 | 12.0 |
| 2018-06-01 12:20 | 0.1 | 1 | 11.5 |
| 2018-06-01 12:30 | 3.62 | 1 | 12.1 |
| ... | | | |
##### Task
Flag `meas` where `fan` equals 0 and `volt`
is lower than `12.0`.
##### Configuration file
There are various options. We can directly implement the condition as follows:
```
varname ; test
#-------;-----------------------------------------------
meas    ; flagGeneric(func=(fan == 0) | (volt < 12.0))
```
But we could also quality check our independent variables first
and then leverage this information later on:
```
varname ; test
#-------;----------------------------------------------------
'.*' ; flagMissing()
fan ; flagGeneric(func=fan == 0)
volt ; flagGeneric(func=volt < 12.0)
meas    ; flagGeneric(func=isflagged(fan) | isflagged(volt))
```
## Generic Processing
Generic processing functions provide a way to evaluate mathematical operations
and functions on the variables of a given dataset.
### Why
In many real-world use cases, quality control is embedded into a larger data
processing pipeline and it is not unusual to even have certain processing
requirements as a part of the quality control itself. Generic processing
functions make it easy to enrich a dataset through the evaluation of a
given expression.
### Specification
The basic signature looks like this:
```sh
procGeneric(func=<expression>)
```
where `<expression>` is composed of the [supported constructs](#supported-constructs).
## Variable References
All variables of the processed dataset are available within generic functions,
so arbitrary cross references are possible. The variable of interest
is furthermore available with the special reference `this`, so the second
[example](#calculations) could be rewritten as:
```
varname ; test
#-------;------------------------------------
x ; flagGeneric(func=this > std(y) * 3)
```
When referencing other variables, their flags will be respected during the evaluation
of the generic expression. So, in the example above, only values of `x` and `y` that
are not already flagged with `BAD` will be used in the evaluation of `x > std(y)*3`.
## Supported constructs
### Operators
#### Comparison
The following comparison operators are available:
| Operator | Description |
|----------|----------------------------------------------------------------------------------------------------|
| `==` | `True` if the values of the operands are equal |
| `!=` | `True` if the values of the operands are not equal |
| `>` | `True` if the values of the left operand are greater than the values of the right operand |
| `<` | `True` if the values of the left operand are smaller than the values of the right operand |
| `>=`     | `True` if the values of the left operand are greater than or equal to the values of the right operand |
| `<=`     | `True` if the values of the left operand are smaller than or equal to the values of the right operand |
#### Arithmetics
The following arithmetic operators are supported:
| Operator | Description |
|----------|----------------|
| `+` | addition |
| `-` | subtraction |
| `*` | multiplication |
| `/` | division |
| `**` | exponentiation |
| `%` | modulus |
#### Bitwise
The bitwise operators can also act as logical operators in comparison chains:
| Operator | Description |
|----------|-------------------|
| `&` | binary and |
| `\|` | binary or |
| `^` | binary xor |
| `~` | binary complement |
### Functions
All functions expect a [variable reference](#variable-references)
as the only non-keyword argument (see [here](#special-functions)).
#### Mathematical Functions
| Name | Description |
|-------------|-----------------------------------|
| `abs` | absolute values of a variable |
| `max` | maximum value of a variable |
| `min` | minimum value of a variable |
| `mean` | mean value of a variable |
| `sum` | sum of a variable |
| `std` | standard deviation of a variable |
| `len` | the number of values for variable |
#### Special Functions
| Name | Description |
|-------------|-----------------------------------|
| `ismissing` | check for missing values |
| `isflagged` | check for flags |
### Constants
Generic functions support the same constants as normal functions, a detailed
list is available [here](ParameterDescriptions.md#constants).
# Getting started with SaQC
Requirements: this tutorial assumes that you have Python version 3.6.1 or newer
installed, and that both your operating system and your Python installation are 64-bit.
## Contents
1. [Set up your environment](#1-set-up-your-environment)
2. [Get SaQC](#2-get-saqc)
3. [Training tour](#3-training-tour)
* [3.1 Get toy data and configuration](#get-toy-data-and-configuration)
* [3.2 Run SaQC](#run-saqc)
* [3.3 Configure SaQC](#configure-saqc)
* [Change test parameters](#change-test-parameters)
* [3.4 Explore the functionality](#explore-the-functionality)
* [Process multiple variables](#process-multiple-variables)
* [Data harmonization and custom functions](#data-harmonization-and-custom-functions)
* [Save outputs to file](#save-outputs-to-file)
## 1. Set up your environment
SaQC is written in Python, so the easiest way to set up your system to use SaQC
for your needs is using the Python Package Index (PyPI). Following good Python
practice, you will first want to create a new virtual environment that you
install SaQC into by typing the following in your console:
##### On Unix/Mac-systems
```sh
# if you have not installed venv yet, do so:
python3 -m pip install --user virtualenv
# move to the directory where you want to create your virtual environment
cd YOURDIR
# create virtual environment called "env_saqc"
python3 -m venv env_saqc
# activate the virtual environment
source env_saqc/bin/activate
```
##### On Windows-systems
```sh
# if you have not installed venv yet, do so:
py -3 -m pip install --user virtualenv
# move to the directory where you want to create your virtual environment
cd YOURDIR
# create virtual environment called "env_saqc"
py -3 -m venv env_saqc
# move to the Scripts directory in "env_saqc"
cd env_saqc/Scripts
# activate the virtual environment
./activate
```
## 2. Get SaQC
### Via PyPI
Type the following:
##### On Unix/Mac-systems
```sh
python3 -m pip install saqc
```
##### On Windows-systems
```sh
py -3 -m pip install saqc
```
### From Gitlab repository
Download SaQC directly from the [GitLab repository](https://git.ufz.de/rdm-software/saqc) to make sure you use the most recent version:
```sh
# clone gitlab - repository
git clone https://git.ufz.de/rdm-software/saqc
# switch to the folder where you installed saqc
cd saqc
# install all required packages
pip install -r requirements.txt
# install all required submodules
git submodule update --init --recursive
```
## 3. Training tour
The following passage guides you through the essentials of the usage of SaQC via
a toy dataset and a toy configuration.
### Get toy data and configuration
If you take a look into the folder `saqc/ressources/data` you will find a toy
dataset `data.csv` which contains the following:
```
Date,Battery,SM1,SM2
2016-04-01 00:05:48,3573,32.685,29.3157
2016-04-01 00:20:42,3572,32.7428,29.3157
2016-04-01 00:35:37,3572,32.6186,29.3679
2016-04-01 00:50:32,3572,32.736999999999995,29.3679
...
```
These are two timeseries of soil moisture (SM1 and SM2) and the battery voltage of the
measuring device over time. Generally, this is how your data should
look to run saqc. Note, however, that you do not necessarily need a series
of dates to reference to and that you are free to use more columns of any name
that you like.
Now create your own configuration file `saqc/ressources/data/myconfig.csv`
and paste the following lines into it:
```
varname;test;plot
SM2;flagRange(min=10, max=60);False
SM2;flagMad(window="30d", z=3.5);True
```
These lines illustrate how different quality control tests can be specified for
different variables by following the pattern:

*varname* ; *testname(testparameters)* ; *plotting option*
In this case, we define a range-test that flags all values outside the range
[10,60] and a test to detect spikes using the MAD-method. You can find an
overview of all available quality control tests in the
[documentation](FunctionIndex.md). Note that the tests are
_executed in the order that you define in the configuration file_. The quality
flags that are set during one test are always passed on to the subsequent one.
### Run SaQC
Remember to have your virtual environment activated:
##### On Unix/Mac-systems
```sh
source env_saqc/bin/activate
```
##### On Windows
```sh
cd env_saqc/Scripts
./activate
```
Via your console, move into the folder you downloaded saqc into:
```sh
cd saqc
```
From here, you can run saqc and tell it to run the tests from the toy
config-file on the toy dataset via the `-c` and `-d` options:
##### On Unix/Mac-systems
```sh
python3 -m saqc -c ressources/data/myconfig.csv -d ressources/data/data.csv
```
##### On Windows
```sh
py -3 -m saqc -c ressources/data/myconfig.csv -d ressources/data/data.csv
```
If you installed saqc via PyPI, you can omit the `python -m` part and call `saqc` directly.
The command will output this plot:
![Toy Plot](../ressources/images/example_plot_1.png "Toy Plot")
So, what do we see here?
* The plot shows the data as well as the quality flags that were set by the
tests for the variable `SM2`, as defined in the config-file
* Following our definition in the config-file, first the `flagRange`-test that flags
all values outside the range [10,60] was executed and after that,
the `flagMad`-test to identify spikes in the data
* In the config, we set the plotting option to `True` for `flagMad` only.
  Thus, the plot aggregates all preceding tests (here: `flagRange`) to black
  points and highlights the flags of the selected test as red points.
#### Save outputs to file
If you want the final results to be saved to a csv-file, you can do so by the
use of the `-o` option:
```sh
saqc -c ressources/data/config.csv -d ressources/data/data.csv -o ressources/data/out.csv
```
This saves a dataframe that contains both the original data and the quality
flags that were assigned by SaQC for each of the variables:
```
Date,SM1,SM1_flags,SM2,SM2_flags
2016-04-01 00:05:48,32.685,OK,29.3157,OK
2016-04-01 00:20:42,32.7428,OK,29.3157,OK
2016-04-01 00:35:37,32.6186,OK,29.3679,OK
2016-04-01 00:50:32,32.736999999999995,OK,29.3679,OK
...
```
### Configure SaQC
#### Change test parameters
Now you can start to change the settings in the config-file and investigate the
effect that has on how many datapoints are flagged as "BAD". When using your
own data, this is your way to configure the tests according to your needs. For
example, you could modify your `myconfig.csv` and change the parameters of the
range-test:
```
varname;test;plot
SM2;flagRange(min=-20, max=60);False
SM2;flagMad(window="30d", z=3.5);True
```
Rerunning SaQC as above produces the following plot:
![Changing the config](../ressources/images/example_plot_2.png "Changing the config")
You can see that the changes that we made to the parameters of the range test
take effect, so that only the values > 60 are flagged by it (black points). This,
in turn, leaves more erroneous data that is then identified by the subsequent
spike-test (red points).
### Explore the functionality
#### Process multiple variables
You can also define multiple tests for multiple variables in your data. These
are then executed sequentially and can be plotted separately. E.g. you could do
something like this:
```
varname;test;plot
SM1;flagRange(min=10, max=60);False
SM2;flagRange(min=10, max=60);False
SM1;flagMad(window="15d", z=3.5);True
SM2;flagMad(window="30d", z=3.5);True
```
which gives you separate plots for each line where the plotting option is set to
`True` as well as one summary "data plot" that depicts the joint flags from all
tests:
| SM1 | SM2 |
|:-------------------------:|:-------------------------:|
| ![](../ressources/images/example_plot_31.png) | ![](../ressources/images/example_plot_32.png) |

![](../ressources/images/example_plot_33.png)
#### Data harmonization and custom functions
SaQC includes functionality to harmonize the timestamps of one or more data
series. Also, you can write your own tests using a python-based
[extension language](sphinx-doc/getting_started_md/GenericFunctions.md). This would look like this:
```
varname;test;plot
SM2;shiftToFreq(freq="15Min");False
SM2;generic(func=(SM2 < 30));True
```
The above executes an internal framework that harmonizes the timestamps of SM2
to a 15min-grid (see data below). Further information about this routine can be
found in the :ref:`Flagging Functions Overview <flaggingFunctions>`.
```
Date,SM1,SM1_flags,SM2,SM2_flags
2016-04-01 00:00:00,,,29.3157,OK
2016-04-01 00:05:48,32.685,OK,,
2016-04-01 00:15:00,,,29.3157,OK
2016-04-01 00:20:42,32.7428,OK,,
...
```
Also, all values where SM2 is below 30 are flagged via the custom function (see
plot below). You can learn more about the syntax of these custom functions
[here](sphinx-doc/getting_started_md/GenericFunctions.md).
![Example custom function](../ressources/images/example_plot_4.png "Example custom function")
## Offset Strings
All the [pandas offset aliases](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases) are supported by SaQC. The following table lists some of the more relevant options:
| Alias | Description |
| ----- | ----------- |
| `"S"`, `"s"` | second |
| `"T"`, `"Min"`, `"min"` | minute |
| `"H"`, `"h"` | hour |
| `"D"`, `"d"` | calendar day |
| `"W"`, `"w"` | week |
| `"M"`, `"m"` | month |
| `"Y"`, `"y"` | year |
Multiples are built by prefixing the alias with the desired multiplier (e.g. `"5Min"`, `"4W"`).
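For example, in plain pandas such aliases can be used to build regular grids or to size time-based windows:
```python
import pandas as pd

# a regular 15 minute grid spanning one day, built from a multiple of the "Min" alias
grid = pd.date_range('2016-04-01', periods=96, freq='15Min')

# offset aliases also size time-based rolling windows on datetime-indexed series
s = pd.Series(range(96), index=grid)
three_hour_mean = s.rolling('3h').mean()
```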
## Constants
### Flagging Constants
The following flag constants are available and can be used to mark the quality of a data point:
| Alias | Description |
| ---- | ---- |
| `GOOD` | The value passed all tests and is therefore considered valid |
| `BAD` | At least one test failed on the value, which is therefore considered invalid |
| `UNFLAGGED` | The value has not been assigned a flag yet; this might mean that all tests passed or that no tests ran |
How these aliases are translated into 'real' flags (output of SaQC) depends on the [flagging scheme](FlaggingSchemes.md)
and might range from numerical values to string constants.
### Numerical Constants
| Alias | Description |
| ---- | ---- |
| `NAN` | Not a number |
# Documentation Guide
We document our code via docstrings in numpy-style.
Features, install and usage instructions, and other more text-intensive material
are written in separate documents.
The documents and the docstrings are then collected and rendered using [sphinx](https://www.sphinx-doc.org/).
## Documentation Strings
- Write docstrings for all public modules, functions, classes, and methods.
Docstrings are not necessary for non-public methods,
but you should have a comment that describes what the method does.
This comment should appear after the def line.
[[PEP8](https://www.python.org/dev/peps/pep-0008/#documentation-strings)]
- Note that most importantly, the `"""` that ends a multiline docstring should be on a line by itself [[PEP8](https://www.python.org/dev/peps/pep-0008/#documentation-strings)] :
```python
"""Return a foobang
Optional plotz says to frobnicate the bizbaz first.
"""
```
- For one liner docstrings, please keep the closing `"""` on the same line.
[[PEP8](https://www.python.org/dev/peps/pep-0008/#documentation-strings)]
### Pandas Style
We use [Pandas-style](https://pandas.pydata.org/pandas-docs/stable/development/contributing_docstring.html) docstrings.
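A short, made-up example of the expected layout:
```python
def clip(data, lower, upper):
    """
    Clip the values of a series to the given bounds.

    Parameters
    ----------
    data : pandas.Series
        The series to clip.
    lower : float
        Lower bound; smaller values are replaced by it.
    upper : float
        Upper bound; larger values are replaced by it.

    Returns
    -------
    pandas.Series
        The clipped series.
    """
    return data.clip(lower, upper)
```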
## Flagger, data, field, etc.
use this:
```py
def foo(data, field, flagger):
"""
data : dios.DictOfSeries
A saqc-data object.
field : str
A field denoting a column in data.
flagger : saqc.flagger.BaseFlagger
A saqc-flagger object.
"""
```
### IDE helper
In PyCharm, one can activate auto-generation of NumPy-style docstrings like so:
1. `File->Settings...`
2. `Tools->Python Integrated Tools`
3. `Docstrings->Docstring format`
4. Choose `NumPy`
### Docstring formatting pitfalls
* Latex is included via
```
:math:`<latex_code>`
```
* Latex commands need to be written with a **double** backslash! (``\\mu`` instead of ``\mu``)
* Nested lists need to be all of the same kind (either numbered or marked - otherwise the result is salad)
* List items covering several lines in the docstring have to be all aligned (so, not only the subsequent ones, but ALL, including the first one - otherwise the result is salad)
* The start of a list has to be separated from the preceding docstring code by *one blank line* (otherwise the list items just get chained in one line and the result is salad)
* Most formatting signifiers are not allowed to start or end with a space. (so no :math:\`1+1 \`, \` var2\`, \`\` a=1 \`\`, ...)
* Do not include lines *only* containing two or more `-` signs, except when it is the underline of a section heading (otherwise the resulting html representation could be messed up)
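A made-up docstring fragment that respects these rules - blank line before the list, one list kind, all continuation lines aligned:
```python
def frobnicate(x):
    """
    Frobnicate the input.

    Notes
    -----
    The following points are checked before frobnication:

    * the input is numeric - non-numeric values raise a ``TypeError``
    * the input is finite; infinite values are replaced by the largest
      representable float before processing
    """
```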
## Hyperlinking Docstrings
* Link code content/modules via python roles.
* Cite/link via the py domain roles. Link content `bar`, that is registered to the API with the address `foo.bar` and
shall be represented by the name `link_name`, via:
```
:py:role:`link_name <foo.bar>`
```
* check out the *_api* folder in the [repository](https://git.ufz.de/rdm-software/saqc/-/tree/develop/sphinx-doc) to get an
overview of already registered paths. Most important may be:
* constants are available via `saqc.constants` - for example:
```
:py:const:`~saqc.constants.BAD`
```
* the ``~`` is a shorthand for hiding the module path and only displaying ``BAD``.
* Functions are available via the "simulated" module `Functions.saqc` - for example:
```
:py:func:`flagRange <Functions.saqc.flagRange>`
```
* Modules are available via the "simulated" package `Functions.` - for example:
```
:py:mod:`generic <Functions.generic>`
```
* The saqc object and/or its content is available via:
```
:py:class:`saqc.SaQC`
:py:meth:`saqc.SaQC.getResults`
```
* The Flags object and/or its content is available via:
```
:py:class:`saqc.Flags`
```
* You can add .rst files containing ``automodapi`` directives to the modulesAPI folder to make more modules available via python roles.
* The Environment table, including the variables available via config files, is available as a rest file located in the environment folder. (Use the include directive to include it, or linking syntax to link it.)
## Adding Markdown Content to the Documentation
- By linking the markdown file "foo/bar.md", or any folder that contains markdown files directly,
you can trigger sphinx-`recommonmark`, which is fine for not-too-complex markdown documents.
- Especially if you have multiple markdown files that are mutually linked and/or contain tables of a certain fanciness (tables with figures),
you will have to take some minor extra steps:
- You will have to gather all markdown files in subfolders of the "sphinx-doc" directory (you can have multiple subfolders).
- To include a folder named `foo` of markdown files in the documentation, or to refer to content in `foo`, you will have
to append the folder name to the MDLIST variable in the Makefile.
- The markdown files must be in one of the subfolders listed in MDLIST - they can't be gathered in nested subfolders.
- You can not link to sections in other markdown files that contain the `-` character (sorry).
- The section structure/ordering must be consistent in the ReST sense (otherwise the sections won't appear - that's also required if you use plain `recommonmark`).
- You can link to resources - like pictures - and include them in the markdown, if the pictures are in a (possibly different) folder in `\sphinx-doc` and the paths to these resources are given relatively!
- You can include a markdown file in a rest document by appending '_m2r' to the folder name when linking it path-wise.
So, to include the markdown file 'foo/bar.md' in a toc tree, for example, you would do something like:
```
.. toctree::
:hidden:
:maxdepth: 1
foo_m2r/bar
```
## Linking ReST Sources in Markdown Documentation
- If you want to hyperlink/include other sources from the sphinx documentation that are rest files (or docstrings),
you will not be able to include them in a way that they appear in your markdown rendering. However, there is
the possibility to just include the respective rest directives (see the directive/link [examples](#hyperlinking-docstrings)).
- This will mess up your markdown code - meaning that you will have
those rest snippets flying around - but when the markdown file gets converted to a rest file and built into the
sphinx html build, the linked sources will be integrated properly. The syntax for linking rest sources is as
follows:
- to include the link to the rest source `functions.rst` in the folder `foo`, under the name `bar`, you would need to insert:
```
:doc:`bar <foo/functions>`
```
- to link to a section with name `foo` in a rest source named `bumm.rst`, under the name `bar`, you would just insert:
```
:ref:`bar <relative/path/from/sphinx/root/bumm:foo>`
```
- In that manner, you might be able to smuggle most rest directives through into the resulting html build. This is especially useful if you want to link to the docstrings of certain (domain-specific) objects. Let's say you want to link to the *function* `saqc.funcs.flagRange` under the name `Ranger` - you just include:
```
:py:func:`Ranger <saqc.funcs.flagRange>`
```
whereas the `:py:func:` part determines the role the object is documented as. See [this page](https://www.sphinx-doc.org/en/master/#ref-role) for an overview of the available roles.
# Exponential Drift Model and Correction
It is assumed that, in between maintenance events, a drift effect shifts the measurements in a way that the resulting value course can be described by the exponential model $`M`$:
$`M(t, a, b, c) = a + b(e^{ct}-1)`$
We consider the timespan in between maintenance events to be scaled to the $`[0,1]`$ interval.
To additionally make sure the modeled curve can be used to calibrate the value course, we impose the following two conditions:
$` M(0, a, b, c) = y_0 `$
$` M(1, a, b, c) = y_1 `$
With $`y_0`$ denoting the mean value obtained from the first 6 measurements directly after the last maintenance event, and $`y_1`$ denoting the mean over the 6 measurements directly preceding the beginning of the next maintenance event.
Solving these equations (the first condition gives $`a = y_0`$; inserting this into the second gives $`b = \frac{y_1 - y_0}{e^c - 1}`$), one obtains the one-parameter model
$` M_{drift}(t, c) = y_0 + ( \frac{y_1 - y_0}{e^c - 1} ) (e^{ct} - 1) `$
for every data chunk in between maintenance events.
After having found the parameter $`c^*`$ that minimizes the squared residues between data and drift model, the correction is performed by bending the fitted curve $`M_{drift}(t, c^*)`$ in a way that it matches $`y_2`$ at $`t=1`$ (with $`y_2`$ being the mean value observed directly after the end of the next maintenance event).
This bent curve is given by:
$` M_{shift}(t, c^{*}) = M(t, y_0, \frac{y_2 - y_0}{e^{c^*} - 1}, c^*) `$
The new values $`y_{shifted}`$ are computed via:
$`y_{shifted} = y + M_{shift} - M_{drift} `$
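A standalone numeric sketch of this procedure, using scipy for the minimization. This is an illustration of the math above, not SaQC's actual implementation; `y`, `y0`, `y1` and `y2` are assumed to be given as described, and the search bounds for $`c`$ are placeholders.
```python
import numpy as np
from scipy.optimize import minimize_scalar

# time axis of one maintenance interval, scaled to [0, 1]
t = np.linspace(0, 1, len(y))

def m_drift(t, c):
    # one-parameter drift model M_drift(t, c)
    return y0 + ((y1 - y0) / (np.exp(c) - 1)) * (np.exp(c * t) - 1)

# find c* minimizing the squared residues between data and drift model
res = minimize_scalar(lambda c: np.sum((y - m_drift(t, c)) ** 2),
                      bounds=(0.01, 10), method='bounded')
c_star = res.x

def m_shift(t, c):
    # the fitted curve, bent so that it matches y2 at t = 1
    return y0 + ((y2 - y0) / (np.exp(c) - 1)) * (np.exp(c * t) - 1)

# shift the measured values by the difference of the two curves
y_shifted = y + m_shift(t, c_star) - m_drift(t, c_star)
```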