Commit d91679f1 authored by David Schäfer's avatar David Schäfer
Update SpikeDetection.md and parameter harmonizations

parent 00c313a2
# Spike Detection
A collection of quality check routines to find spikes.
## Index

- [spikes_basic](#spikes_basic)
- [spikes_simpleMad](#spikes_simplemad)
- [spikes_slidingZscore](#spikes_slidingzscore)
- [spikes_spektrumBased](#spikes_spektrumbased)
## spikes_basic

```
spikes_basic(thresh, tolerance, window)
```
| parameter | data type | default value | description |
|-----------|---------------------------------|---------------|------------------------------------------------------------------------------------------------|
| thresh    | float                           |               | Minimum difference between two values, to consider the latter one as a spike. See condition (1) |
| tolerance | float | | Maximum difference between pre-spike and post-spike values. See condition (2) |
| window | [offset string](offset-strings) | | Maximum length of "spikish" value courses. See condition (3) |
A basic outlier test that works on harmonized as well as raw
(non-harmonized) data.
The values $`x_{n}, x_{n+1}, \dots, x_{n+k}`$ of a timeseries $`x`$
are considered spikes, if:
1. $`|x_{n-1} - x_{n + s}| > `$ `thresh`, $` s \in \{0,1,2,...,k\} `$
By this definition, spikes are values that, after a jump of margin `thresh` (1),
keep that new value level for a timespan smaller than
`window` (3), and then return to the initial value level,
within a tolerance of `tolerance` (2).
NOTE:
This characterization of a "spike" not only includes one-value
outliers, but also plateau-ish value courses.
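Under these definitions, the detection logic can be sketched roughly as follows. This is a simplified illustration with a hypothetical function name, not the actual saqc implementation:

```python
import pandas as pd

def basic_spikes(x: pd.Series, thresh: float, tolerance: float, window: str) -> pd.Series:
    """Flag runs of values that jump by more than `thresh`, stay on the new
    level for less than `window`, and return to the pre-spike level within
    `tolerance` (simplified sketch of the conditions above)."""
    flags = pd.Series(False, index=x.index)
    max_len = pd.Timedelta(window)
    i = 1
    while i < len(x) - 1:
        # condition (1): jump of at least `thresh` relative to the predecessor
        if abs(x.iloc[i] - x.iloc[i - 1]) > thresh:
            # extend the candidate run while conditions (1) and (3) hold
            j = i
            while (j + 1 < len(x)
                   and abs(x.iloc[j + 1] - x.iloc[i - 1]) > thresh
                   and x.index[j + 1] - x.index[i] < max_len):
                j += 1
            # condition (2): the value after the run is back at the pre-spike level
            if j + 1 < len(x) and abs(x.iloc[j + 1] - x.iloc[i - 1]) < tolerance:
                flags.iloc[i:j + 1] = True
                i = j + 1
                continue
        i += 1
    return flags
```

Applied to a series with a single one-value outlier, only that value is flagged; a short plateau of elevated values would be flagged as a whole.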
## spikes_simpleMad

```
spikes_simpleMad(window="1h", z=3.5)
```
| parameter | data type | default value | description |
|-----------|-----------------------------------------|---------------|----------------------------------------------------------------------|
| window    | integer/[offset string](offset-strings) | `"1h"`        | size of the sliding window the modified Z-score is applied to        |
| z | float | `3.5` | z-parameter of the modified Z-score |
This function flags outliers using a simple median absolute deviation test,
based on the *modified Z-score* [1]. Values are flagged if they fulfill the
following condition within a sliding window:
```math
0.6745 * |x - m| > mad * z > 0
```
where $`x`$ denotes the window data, $`m`$ the window median, $`mad`$ the median
absolute deviation and $`z`$ the $`z`$-parameter of the modified Z-Score.
The window is moved by one time stamp at a time.
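The sliding application of this condition can be sketched as follows. This is a simplified, brute-force illustration; the function name and the forward-looking window are assumptions, not the saqc implementation:

```python
import pandas as pd

def mad_outliers(x: pd.Series, window: str = "1h", z: float = 3.5) -> pd.Series:
    """Flag values fulfilling 0.6745 * |x - m| > mad * z > 0
    within a sliding window."""
    flags = pd.Series(False, index=x.index)
    size = pd.Timedelta(window)
    for ts in x.index:  # the window is moved by one time stamp at a time
        chunk = x.loc[ts:ts + size]
        m = chunk.median()
        mad = (chunk - m).abs().median()
        # modified Z-score condition; mad > 0 guards the degenerate case
        hits = (0.6745 * (chunk - m).abs() > mad * z) & (mad > 0)
        flags.loc[hits[hits].index] = True
    return flags
```

Note that on data where more than half of a window's values are identical, `mad` is zero and nothing in that window can be flagged.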
NOTE:
This function should only be applied on normalized data.
References:
[1] https://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm
## spikes_slidingZscore
```
spikes_slidingZscore(window="1h", offset="1h", count=1, polydeg=1, z=3.5, method="modZ")
```
| parameter | data type | default value | description |
|-----------|-----------------------------------------|---------------|-------------------------------------------------------------|
| window | integer/[offset string](offset-strings) | `"1h"` | size of the sliding window |
| offset | integer/[offset string](offset-strings) | `"1h"` | offset between two consecutive windows |
| count     | integer                                 | `1`           | the minimal number of detections a possible outlier needs, to be flagged    |
| polydeg   | integer                                 | `1`           | the degree of the polynomial fit, to calculate the residual                 |
| z | float | `3.5` | z-parameter for the *method* (see description) |
| method | [string](#outlier-detection-methods) | `"modZ"` | the method to detect outliers |
This function flags spikes using the given method within sliding windows.
NOTE:
- `window` and `offset` must be of same type, mixing of offset- and integer-
based windows is not supported and will fail
- offset-strings only work with time-series-like data
The algorithm works as follows:
1. a window of size `window` is cut from the data
2. normalization - the data is fit by a polynomial of the given degree `polydeg`, which is subtracted from the data
3. the outlier detection `method` is applied to the residual, possible outliers are marked
4. the window (on the data) is moved by `offset`
5. start over from 1. until the end of data is reached
6. all potential outliers that are detected `count`-many times are flagged as outliers
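For integer-based windows, the steps above can be sketched as follows. This is a condensed illustration of the modZ variant; the function name and the handling of short trailing windows are assumptions, not the saqc implementation:

```python
import numpy as np
import pandas as pd

def sliding_modz(x: pd.Series, window: int = 100, offset: int = 50,
                 count: int = 1, polydeg: int = 1, z: float = 3.5) -> pd.Series:
    """Detrend each window with a polynomial of degree `polydeg`, apply the
    modified Z-score to the residual, flag values detected `count` times."""
    detections = pd.Series(0, index=x.index)
    for start in range(0, len(x), offset):    # steps 4./5.: move the window
        chunk = x.iloc[start:start + window]  # step 1.: cut a window
        if len(chunk) <= polydeg:
            break
        # step 2.: normalization via polynomial fit
        t = np.arange(len(chunk))
        coef = np.polynomial.polynomial.polyfit(t, chunk.values, polydeg)
        residual = chunk.values - np.polynomial.polynomial.polyval(t, coef)
        # step 3.: modZ condition on the residual (m is the residual mean)
        m = residual.mean()
        mad = np.median(np.abs(residual - m))
        hits = (0.6745 * np.abs(residual - m) > mad * z) & (mad > 0)
        detections.loc[chunk.index[hits]] += 1
    return detections >= count                # step 6.
```

With overlapping windows (`offset < window`), a genuine outlier is seen by several windows, so raising `count` above 1 suppresses detections that do not repeat.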
### Outlier Detection Methods
Currently two outlier detection methods are implemented:
The **zscore** (Z-score) [1] marks every value as possible outlier, which fulfill:
1. **zscore** (Z-score):
Marks every value as a possible outlier which fulfills the following condition:
```math
|r - m| > s * z
```
where $`r`$ denotes the residual, $`m`$ the residual mean, $`s`$ the residual
standard deviation, and $`z`$ the $`z`$-parameter.
2. **modZ** (modified Z-score):
Marks every value as a possible outlier which fulfills the following condition:
```math
0.6745 * |r - m| > mad * z > 0
```
where $`r`$ denotes the residual, $`m`$ the residual mean, $`mad`$ the residual median absolute
deviation, and $`z`$ the $`z`$-parameter.
### References
[1] https://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm
## spikes_spektrumBased
```
spikes_spektrumBased(raise_factor=0.15, deriv_factor=0.2,
noise_thresh=1, noise_window="12h", noise_func="CoVar",
poly_deg=2, filter_window=None)
```
| parameter | data type | default value | description |
|---------------|--------------------------------------|---------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| raise_factor | float | `0.15` | Minimum relative value difference between two values to consider the latter as a spike candidate. See condition (1) |
| deriv_factor | float | `0.2` | See condition (2) |
| noise_thresh  | float                                | `1`           | Upper threshold for the noisiness of the data surrounding potential spikes. See condition (3)                                                                                                                             |
| noise_window  | [offset string](offset-strings)      | `"12h"`       | Determines the range of the time window of the "surrounding" data of a potential spike. See condition (3)                                                                                                                 |
| noise_func    | [string](#noise-detection-functions) | `"CoVar"`     | Function to calculate the noisiness of the data surrounding potential spikes                                                                                                                                              |
| poly_deg      | integer                              | `2`           | Order of the polynomial fit, applied with the Savitzky-Golay filter                                                                                                                                                       |
| filter_window | [offset string](offset-strings)      | `None`        | Controls the range of the smoothing window applied with the Savitzky-Golay filter. If `None` (default), the window size will be two times the sampling rate (thus, covering 3 values). If unsure, do not change this value |
The function flags spikes by evaluating the timeseries' derivatives
and applying various conditions to them.
A datapoint $`x_k`$ of a dataseries $`x`$
is considered a spike, if:
1. The quotient of $`x_k`$ and its predecessor deviates from 1 by more than `raise_factor`:
* $` |\frac{x_k}{x_{k-1}}| > 1 + `$ `raise_factor`, or
* $` |\frac{x_k}{x_{k-1}}| < 1 - `$ `raise_factor`
2. The quotient of the data's second derivative $`x''`$, at the preceding
and subsequent timestamps is close enough to 1:
* $` |\frac{x''_{k-1}}{x''_{k+1}} | > 1 - `$ `deriv_factor`, and
* $` |\frac{x''_{k-1}}{x''_{k+1}} | < 1 + `$ `deriv_factor`
3. The dataset $`X_k`$, surrounding $`x_{k}`$ within `noise_window` range,
but excluding $`x_{k}`$, is not too noisy, where the noisiness is measured
by `noise_func`:
* `noise_func`$`(X_k) <`$ `noise_thresh`
### Noise Detection Functions
Currently two different noise detection functions are implemented:
- `"CoVar"`: Coefficient of Variation
- `"rVar"`: relative Variance
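Assuming the usual definitions of these two statistics, they can be written as follows (a sketch; the function names are illustrative):

```python
import numpy as np
import pandas as pd

def co_var(x: pd.Series) -> float:
    """Coefficient of variation: standard deviation relative to the mean."""
    return float(x.std() / np.abs(x.mean()))

def r_var(x: pd.Series) -> float:
    """Relative variance: variance relative to the mean."""
    return float(x.var() / np.abs(x.mean()))
```

Condition (3) then reads, e.g., `co_var(X_k) < noise_thresh` for the default `noise_func`.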
NOTE:
- The dataset is supposed to be harmonized to a timeseries with an equidistant frequency grid
- The derivative is calculated after applying a Savitzky-Golay filter to $`x`$
This function is a generalization of the Spectrum based Spike flagging
mechanism presented in:
Dorigo, W. et al: Global Automated Quality Control of In Situ Soil Moisture
Data from the international Soil Moisture Network. 2013. Vadoze Zone J.
doi:10.2136/vzj2012.0097.
[offset-strings]: docs/ParameterDescriptions.md#offset-strings
@register("spikes_slidingZscore")
def flagSpikes_slidingZscore(
data, field, flagger, window, offset, count=1, polydeg=1, z=3.5, method="modZ", **kwargs
):
""" An outlier detection in a sliding window. The method for detection can be a simple Z-score or the more robust
modified Z-score, as introduced here [1].
The steps are:
1. a window of size `window` is cut from the data
2. the data is fit by a polynomial of the given degree `polydeg`
3. the outlier `method` detects potential outliers
4. the window is continued by `offset` to the next data-slot.
5. processing continues at 1. until the end of data is reached
6. all potential outliers, that are detected `count`-many times, are promoted to real outliers and flagged by the `flagger`
:param data: pandas dataframe. holding the data
:param field: fieldname in `data`, which holds the relevant information
:param flagger: flagger.
:param window: int or time-offset string (see [2]). The size of the window the outlier detection is run in. default: 1h
:param offset: int or time-offset string (see [2]). Stepsize the window is set further. default: 1h
:param method: str. `modZ` or `zscore`. see [1] at section `Z-Scores and Modified Z-Scores`
:param count: int. The number of times a datapoint needs to be detected in different windows, to be finally
flagged as outlier
:param polydeg: int. The degree of the polynomial fit, used to calculate the residual
:param z: float. the value the (mod.) Z-score is tested against. Defaulting to 3.5 (Recommendation of [1])
Links:
winsz_s = offset2seconds(window)
else:
raise TypeError(
f"`window` and `offset` must both be an offset or both be numeric, {window} and {offset} was passed"
)
# check params
if polydeg < 0:
raise ValueError("polydeg must be positive")
if z < 0:
raise ValueError("z must be positive")
if count <= 0:
pass
elif dx_s >= winsz_s and count > 1:
raise ValueError(
"If stepsize `offset` is bigger than the window-size, every value is seen just once, so use count=1"
)
elif count > winsz_s // dx_s:
raise ValueError(
f"Adjust `offset` or `window`. A single data point is "
f"seen `floor(window / offset) = {winsz_s // dx_s}` times, but count is set to {count}"
)
# prepare the method
continue
# get residual
coef = poly.polyfit(xchunk, ychunk, polydeg)
model = poly.polyval(xchunk, coef)
residual = ychunk - model
field,
flagger,
raise_factor=0.15,
deriv_factor=0.2,
noise_thresh=1,
noise_window="12h",
noise_func="CoVar",
poly_deg=2,
filter_window=None,
**kwargs,
):
(1) the quotient to its preceding datapoint exceeds a certain bound
(controlled by param "raise_factor")
(2) the quotient of the data's second derivative at the preceding and subsequent timestamps is close enough to 1.
(controlled by param "deriv_factor")
(3) the surrounding data is not too noisy. (Coefficient of Variation[+/- noise_window] < 1)
(controlled by param "noise_thresh")
Some things you should be conscious about when applying this test:
but can be switched to the relative variance, by assignment to parameter "noise_func".
:param data: The pandas dataframe holding the data-to-be flagged.
Data must be indexed by a datetime series and be harmonized onto a
time raster with seconds precision.
:param field: Fieldname of the Soil moisture measurements field in data.
:param flagger: A flagger - object. (saqc.flagger.X)
:param filter_window: Offset string. Size of the filter window, used to calculate the derivatives.
(relevant only, if: diff_method='savgol')
:param poly_deg: Integer. Polynomial order, used for smoothing with the Savitzky-Golay filter.
(relevant only, if: diff_method='savgol')
:param raise_factor: A float, determining the bound, the quotient of two consecutive values
has to exceed, to be regarded as a potential spike. A value of 0.x will
trigger the spike test for value y_t, if:
(y_t)/(y_t-1) > 1 + x or:
(y_t)/(y_t-1) < 1 - x.
:param deriv_factor: A float, determining the interval, the quotient of the data's second derivative
around a potential spike has to be part of, to trigger spike flagging for
this value. A datapoint y_t will pass this spike condition if,
for deriv_factor = 0.x, and the second derivative y'' of y, the condition:
1 - x < abs((y''_t-1)/(y''_t+1)) < 1 + x
holds
:param noise_thresh: A float, determining the bound, the data noisiness around a potential spike
must not exceed, in order to guarantee a justified judgement:
For this, the coefficient selected by parameter noise_func (COVA),
of all values within t +/- param "noise_window",
but excluding the point y_t itself, is evaluated and tested
for: COVA < noise_thresh.
:param noise_window: Offset string, determining the size of the window, the coefficient of
variation is calculated of, to determine the data noisiness around a potential
spike.
The potential spike y_t will be centered in a window of expansion:
[y_t - noise_window, y_t + noise_window].
:param noise_func: String. Determines whether to use
"relative variance" or "coefficient of variation" to check against the noise
barrier.
'CoVar' -> "Coefficient of variation"
'rVar' -> "relative Variance"
"""
dataseries, data_rate = retrieveTrustworthyOriginal(data, field, flagger)
# retrieve the noise function
noise_func_map = {
"covar": pd.Series.var,
"rvar": pd.Series.std
}
noise_func = noise_func_map[noise_func.lower()]
if filter_window is None:
filter_window = 3*pd.Timedelta(data_rate)
filter_window_seconds = filter_window.seconds
smoothing_periods = int(np.ceil((filter_window_seconds / data_rate.n)))
lower_dev_bound = 1 - deriv_factor
upper_dev_bound = 1 + deriv_factor
if smoothing_periods % 2 == 0:
smoothing_periods += 1
scnd_derivate = savgol_filter(
dataseries[start_slice:end_slice],
window_length=smoothing_periods,
polyorder=poly_deg,
deriv=2,
)
test_slice = dataseries[start_slice:end_slice].drop(spike)
test_ratio_2 = np.abs(noise_func(test_slice) / test_slice.mean())
# the candidate is not flagged as a spike, if the noise condition is violated:
if test_ratio_2 > noise_thresh:
spikes[spike] = False
# the candidate is not flagged as a spike, if the noise condition is violated