Commit c73f22d9 authored by David Schäfer

updated FunctionDescriptions.md

parent f4186b5c

@@ -10,10 +10,10 @@ Main documentation of the implemented functions, their purpose and parameters an
- [seasonalRange](#seasonalrange)
- [clear](#clear)
- [force](#force)
- [spikes_basic](#spikes_basic)
- [spikes_simpleMad](#spikes_simplemad)
- [spikes_slidingZscore](#spikes_slidingzscore)
- [spikes_spektrumBased](#spikes_spektrumbased)
- [constant](#constant)
- [constants_varianceBased](#constants_variancebased)
- [soilMoisture_plateaus](#soilmoisture_plateaus)

@@ -141,52 +141,37 @@ force()
Force flags to a flag-value.
## spikes_basic
```
spikes_basic(thresh, tolerance, window_size)
```
| parameter | data type | default value | description |
| ------ | ------ | ------ | ---- |
| thresh | float | | Minimum jump margin for spikes. See condition (1). |
| tolerance | float | | Range of the area containing all "valid return values". See condition (2). |
| window_size | string | | An offset string, denoting the maximal length of "spikish" value courses. See condition (3). |
A basic outlier test designed to work on harmonized as well as raw (non-harmonized) data.
The values $`x_{n}, x_{n+1}, ..., x_{n+k}`$ of a passed timeseries $`x`$ are considered spikes, if:
1. $`|x_{n-1} - x_{n + s}| > `$ `thresh`, $` s \in \{0,1,2,...,k\} `$
2. $`|x_{n-1} - x_{n+k+1}| < `$ `tolerance`
3. $` |y_{n-1} - y_{n+k+1}| < `$ `window_size`, with $`y`$ denoting the series of timestamps associated with $`x`$.
By this definition, spikes are values that, after a jump of margin `thresh` (1), keep the new value level they jumped to for a timespan smaller than `window_size` (3), and then return to the initial value level within a tolerance margin of `tolerance` (2).
Note that this characterization of a "spike" not only covers one-value outliers, but also plateau-ish value courses.
The implementation is a time-window based version of an outlier test from the UFZ Python library, which can be found [here](https://git.ufz.de/chs/python/blob/master/ufz/level1/spike.py).
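To make conditions (1)-(3) concrete, here is a minimal, illustrative sketch of how they could be checked on a datetime-indexed pandas Series. It is not the function's actual implementation, and the helper name `flag_basic_spikes` is made up for this example.
```python
import pandas as pd


def flag_basic_spikes(x: pd.Series, thresh: float, tolerance: float, window_size: str) -> pd.Series:
    """Illustrative check of conditions (1)-(3) on a datetime-indexed series."""
    flags = pd.Series(False, index=x.index)
    max_len = pd.Timedelta(window_size)
    values, stamps = x.to_numpy(), x.index
    n = 1
    while n < len(x) - 1:
        # condition (1), s=0: the value jumps away from x[n-1] by more than `thresh`
        if abs(values[n - 1] - values[n]) <= thresh:
            n += 1
            continue
        # extend the candidate course x[n], ..., x[n+k] while condition (1) still holds
        k = n
        while k + 1 < len(x) and abs(values[n - 1] - values[k + 1]) > thresh:
            k += 1
        if k + 1 >= len(x):
            break
        returns = abs(values[n - 1] - values[k + 1]) < tolerance  # condition (2)
        short = (stamps[k + 1] - stamps[n - 1]) < max_len         # condition (3)
        if returns and short:
            flags.iloc[n:k + 1] = True
        n = k + 1
    return flags
```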
## spikes_simpleMad

@@ -215,60 +200,70 @@ Note: This function should only be applied to normalised data.
See also:
[1] https://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm
## spikes_slidingZscore
Detect outliers/spikes with a given method in a sliding window.
```
spikes_slidingZscore(winsz="1h", dx="1h", count=1, deg=1, z=3.5, method="modZ")
```
| parameter | data type | default value | description |
| --------- | ----------- | ---- | ----------- |
| winsz | offset-string/integer | `"1h"` | size of the sliding window that the *method* is applied to |
| dx | offset-string/integer | `"1h"` | step size by which the sliding window is advanced after each calculation |
| count | integer | `1` | minimal number of detections a possible outlier needs in order to be flagged |
| deg | integer | `1` | degree of the polynomial fit used to calculate the residual |
| z | float | `3.5` | z-parameter for the *method* (see description) |
| method | string | `"modZ"` | the method outliers are detected with |
Parameter notes:
- `winsz` and `dx` must be of the same type; mixing offset strings and integers is not supported and will fail.
- offset strings only work with datetime-indexed data.
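This mirrors how pandas treats rolling windows; a short illustration (not library code) of why offset windows need a datetime index:
```python
import numpy as np
import pandas as pd

t = pd.date_range("2020-01-01", periods=6, freq="30min")
s = pd.Series(np.arange(6.0), index=t)

s.rolling("1h").mean()  # offset window: requires a datetime-like index
s.rolling(2).mean()     # integer window: counts observations, works on any index

# on a plain integer index only the integer variant is available
s.reset_index(drop=True).rolling(2).mean()
```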
The algorithm works as follows:
1. a window of size `winsz` is cut from the data
2. normalisation: the data is fitted with a polynomial of the given degree `deg`, which is then subtracted from the data
3. the outlier detection `method` is applied to the residual, and possible outliers are marked
4. the window is advanced by `dx` over the data to the next slot
5. start over from 1. until the end of the data is reached
6. all potential outliers that were detected at least `count` times are flagged as outliers
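The loop below sketches steps 1-6 for integer-typed `winsz`/`dx`; `score_outliers` stands for the *method* of step 3 (the *zscore*/*modZ* variants described next), and all names are placeholders chosen for this illustration rather than the library's code.
```python
import numpy as np


def sliding_outlier_counts(values, winsz, dx, deg, score_outliers):
    """Count, for every value, how often it was scored as an outlier (steps 1-5)."""
    counts = np.zeros(len(values), dtype=int)
    for start in range(0, len(values) - winsz + 1, dx):                # steps 1, 4, 5: slide the window
        window = values[start:start + winsz]
        xs = np.arange(winsz)
        trend = np.polyval(np.polyfit(xs, window, deg), xs)            # step 2: polynomial fit ...
        counts[start:start + winsz] += score_outliers(window - trend)  # ... step 3: score the residual
    return counts


# step 6: flag everything that was detected at least `count` times, e.g.
# flags = sliding_outlier_counts(data, winsz=24, dx=6, deg=1, score_outliers=modz_outliers) >= count
```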
The possible outlier detection methods are *zscore* and *modZ*.
In the following description, the residual (calculated from a slice by the sliding window) is referred to as *data*.
The **zscore** (Z-score) [1] marks every value as a possible outlier that fulfills:
```math
|r - m| > s * z
```
with $` r, m, s, z `$: data, data mean, data standard deviation, `z`.
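As a one-liner over the residual *data* (an illustrative numpy sketch, not library code):
```python
import numpy as np

def zscore_outliers(r, z=3.5):
    """Boolean mask of possible outliers: |r - m| > s * z."""
    return np.abs(r - r.mean()) > r.std() * z
```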
The **modZ** (modified Z-score) [1] marks every value as a possible outlier that fulfills:
```math
0.6745 * |r - M| > mad * z > 0
```
with $` r, M, mad, z `$: data, data median, data median absolute deviation, `z`.
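And the corresponding modified Z-score check (again an illustrative sketch):
```python
import numpy as np

def modz_outliers(r, z=3.5):
    """Boolean mask of possible outliers: 0.6745 * |r - M| > mad * z > 0."""
    M = np.median(r)
    mad = np.median(np.abs(r - M))
    return (0.6745 * np.abs(r - M) > mad * z) & (mad * z > 0)
```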
See also:
[1] https://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm
## spikes_spektrumBased
```
spikes_spektrumBased(raise_factor=0.15, dev_cont_factor=0.2,
                     noise_barrier=1, noise_window_size="12h", noise_statistic="CoVar",
                     smooth_poly_order=2, filter_window_size=None)
```
| parameter | data type | default value | description |
| ------ | ------ | ------ | ---- |
| raise_factor | float | `0.15` | Minimum change margin for a datapoint to become a candidate for a spike. See condition (1). |
| dev_cont_factor | float | `0.2` | See condition (2). |
| noise_barrier | float | `1` | Upper bound for the noisiness of the data surrounding potential spikes. See condition (3). |
| noise_window_size | string | `"12h"` | Any offset string. Determines the range of the time window of the "surrounding" data of a potential spike. See condition (3). |
| noise_statistic | string | `"CoVar"` | Operator to calculate the noisiness of the data surrounding a potential spike. Either `"CoVar"` (= coefficient of variation) or `"rvar"` (= relative variance). |
| smooth_poly_order | integer | `2` | Order of the polynomial fit applied for smoothing. |
| filter_window_size | Nonetype or string | `None` | Options: <br/> - `None` <br/> - any offset string <br/><br/> Controls the range of the smoothing window applied with the Savitzky-Golay filter. If `None` is passed (default), the window size will be two times the sampling rate (thus covering 3 values). If you do not know exactly what you are doing, do not change this value - broader window sizes caused unexpected results during the testing phase. |
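To make the smoothing and noise parameters concrete, here is a small numpy/scipy illustration of the building blocks the table refers to. The `rvar` definition (variance divided by the mean) is an assumption, and none of this is the function's actual code.
```python
import numpy as np
from scipy.signal import savgol_filter

x = np.random.default_rng(0).normal(loc=30.0, scale=0.5, size=200)  # stand-in for a sensor series

# Savitzky-Golay smoothing with the defaults described above:
# a window of 3 values and a polynomial of order `smooth_poly_order` = 2
smoothed = savgol_filter(x, window_length=3, polyorder=2)

# noise statistic over the data surrounding a potential spike
window = x[:48]                             # e.g. a "12h" window of 15min data
co_var = window.std() / window.mean()       # "CoVar": coefficient of variation
r_var = window.var() / window.mean()        # "rvar": relative variance (assumed definition)

too_noisy = abs(co_var) > 1                 # compared against `noise_barrier`
```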
The function detects and flags spikes in input data series by evaluating the

@@ -302,7 +297,6 @@ Dorigo, W. et al: Global Automated Quality Control of In Situ Soil Moisture
Data from the international Soil Moisture Network. 2013. Vadoze Zone J.
doi:10.2136/vzj2012.0097.
## constant