Spike Detection
A collection of quality check routines to find spikes.
Index
spikes_basic
spikes_basic(thresh, tolerance, window_size)
parameter | data type | default value | description |
---|---|---|---|
thresh | float | | Minimum difference between two values for the latter to be considered a spike. See condition (1) |
tolerance | float | | Maximum difference between pre-spike and post-spike values. See condition (2) |
window | offset string | | Maximum length of "spikish" value courses. See condition (3) |
A basic outlier test that is designed to work for harmonized as well as raw (not harmonized) data.
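The values $x_{n}, x_{n+1}, \ldots, x_{n+k}$ of a time series $x$ are considered spikes if:

1. $|x_{n-1} - x_{n+s}| >$ `thresh` for all $s \in \{0, 1, \ldots, k\}$
2. $|x_{n-1} - x_{n+k+1}| <$ `tolerance`
3. $|y_{n-1} - y_{n+k+1}| <$ `window`, with $y$ denoting the series of timestamps associated with $x$.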
By this definition, spikes are values that, after a jump of margin `thresh` (1), keep the new value level for a timespan shorter than `window` (3) and then return to the initial value level within a tolerance of `tolerance` (2).
NOTE: This characterization of a "spike" includes not only one-value outliers but also plateau-ish value courses.
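The three conditions can be illustrated with a minimal sketch in plain Python. It is not the library's implementation: the function name `find_basic_spikes` is hypothetical, and `window` is assumed to be passed as a `datetime.timedelta` rather than as an offset string.

```python
from datetime import datetime, timedelta


def find_basic_spikes(times, values, thresh, tolerance, window):
    """Return the indices of all values that fulfil conditions (1) to (3)."""
    flagged = []
    n = 1
    while n < len(values):
        pre = values[n - 1]  # pre-spike value x_{n-1}
        post = None          # index n+k+1 of the post-spike value, if any
        if abs(pre - values[n]) > thresh:  # condition (1) for s = 0
            for i in range(n + 1, len(values)):
                if times[i] - times[n - 1] >= window:  # condition (3) violated
                    break
                if abs(pre - values[i]) < tolerance:   # condition (2) fulfilled
                    post = i
                    break
                if abs(pre - values[i]) <= thresh:     # condition (1) violated
                    break
        if post is None:
            n += 1
        else:
            flagged.extend(range(n, post))  # x_n, ..., x_{n+k}
            n = post + 1
    return flagged


start = datetime(2020, 1, 1)
times = [start + timedelta(minutes=10 * i) for i in range(7)]
values = [1.0, 1.1, 9.0, 9.2, 1.0, 1.1, 1.0]

spikes = find_basic_spikes(times, values, thresh=5, tolerance=0.5,
                           window=timedelta(hours=1))
print(spikes)  # [2, 3]
```

The two-value plateau is flagged because the series returns to its initial level within one hour. With `window=timedelta(minutes=20)` the same plateau is too long and nothing is flagged.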
spikes_simpleMad
spikes_simpleMad(window="1h", z=3.5)
parameter | data type | default value | description |
---|---|---|---|
window | integer/offset string | "1h" | size of the sliding window to which the modified Z-score is applied |
z | float | 3.5 | z-parameter of the modified Z-score |
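This function flags outliers using the simple median absolute deviation test.
Values are flagged if they fulfill the following condition within a sliding window:

$$ 0.6745 \cdot |x - m| > mad \cdot z > 0 $$

where $x$ denotes the window data, $m$ the window median, $mad$ the median absolute deviation of the window, and $z$ the `z` parameter of the modified Z-score.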
The window is moved by one time stamp at a time.
NOTE: This function should only be applied to normalized data.
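As an illustration, the test can be sketched in plain Python. The name `flag_simple_mad` is hypothetical, and the sketch assumes an integer-based window (a number of values) instead of an offset string; it is not the library's implementation.

```python
from statistics import median


def flag_simple_mad(values, window=5, z=3.5):
    """Flag indices whose modified Z-score exceeds z in at least one window."""
    flagged = set()
    # The window is moved by one value at a time.
    for start in range(len(values) - window + 1):
        chunk = values[start:start + window]
        m = median(chunk)
        mad = median(abs(x - m) for x in chunk)
        for position, x in enumerate(chunk):
            # Division-free form of |0.6745 * (x - m) / mad| > z;
            # windows with mad == 0 never flag anything.
            if 0.6745 * abs(x - m) > z * mad > 0:
                flagged.add(start + position)
    return sorted(flagged)


values = [1.0, 1.2, 0.9, 1.1, 8.0, 1.0, 1.1, 0.9, 1.2]
print(flag_simple_mad(values))  # [4]
```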
References: [1] https://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm
spikes_slidingZscore
spikes_slidingZscore(window="1h", offset="1h", count=1, poly_deg=1, z=3.5, method="modZ")
parameter | data type | default value | description |
---|---|---|---|
window | integer/offset string | "1h" | size of the sliding window |
offset | integer/offset string | "1h" | offset between two consecutive windows |
count | integer | 1 | the minimal number of times a possible outlier must be detected in order to be flagged |
polydeg | integer | 1 | the degree of the polynomial fit used to calculate the residual |
z | float | 3.5 | z-parameter for the method (see description) |
method | string | "modZ" | the method to detect outliers |
This function flags spikes using the given method within sliding windows.
NOTE:
- `window` and `offset` must be of the same type; mixing offset-based and integer-based windows is not supported and will fail
- offset strings only work with time-series-like data
The algorithm works as follows:
1. a window of size `window` is cut from the data
2. normalization: the data is fit by a polynomial of the given degree `polydeg`, which is subtracted from the data
3. the outlier detection `method` is applied to the residual, and possible outliers are marked
4. the window (on the data) is moved by `offset`
5. start over from 1. until the end of the data is reached
6. all potential outliers that are detected `count`-many times are flagged as outliers
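The steps above can be sketched in plain Python under simplifying assumptions: integer-based `window` and `offset`, a straight-line fit only (polynomial degree 1), and the modified Z-score as the only method. All function names are hypothetical, and the sketch is not the library's implementation.

```python
from collections import Counter
from statistics import mean, median


def linear_residuals(chunk):
    """Step 2: subtract a least-squares line (degree 1) from the chunk."""
    xs = range(len(chunk))
    x_mean, y_mean = mean(xs), mean(chunk)
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, chunk))
    slope /= sum((x - x_mean) ** 2 for x in xs)
    return [y - (y_mean + slope * (x - x_mean)) for x, y in zip(xs, chunk)]


def mod_z_outliers(residuals, z):
    """Step 3: positions in the window marked by the modified Z-score."""
    m = mean(residuals)
    mad = median(abs(r - m) for r in residuals)
    return [i for i, r in enumerate(residuals)
            if 0.6745 * abs(r - m) > z * mad > 0]


def flag_sliding_zscore(values, window=10, offset=5, count=1, z=3.5):
    hits = Counter()
    # Steps 1, 4 and 5: cut a window, move it by `offset`, repeat.
    for start in range(0, len(values) - window + 1, offset):
        residuals = linear_residuals(values[start:start + window])
        for position in mod_z_outliers(residuals, z):
            hits[start + position] += 1
    # Step 6: keep values that were marked at least `count` times.
    return sorted(i for i, n in hits.items() if n >= count)


values = [float(i) for i in range(20)]  # a linear trend ...
values[7] += 50.0                       # ... with a single spike
print(flag_sliding_zscore(values, count=2))  # [7]
```

In this example the spike lies in two overlapping windows, so it is still flagged with `count=2` but no longer with `count=3`.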
Outlier Detection Methods
Currently two outlier detection methods are implemented:
- `"zscore"`: The Z-score marks every value as a possible outlier that fulfills the following condition:

  $$ |r - m| > s \cdot z $$

  where $r$ denotes the residual, $m$ the residual mean, $s$ the residual standard deviation, and $z$ the `z` parameter.
- `"modZ"`: The modified Z-score marks every value as a possible outlier that fulfills the following condition:

  $$ 0.6745 \cdot |r - m| > mad \cdot z > 0 $$

  where $r$ denotes the residual, $m$ the residual mean, $mad$ the residual median absolute deviation, and $z$ the `z` parameter.
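To compare the two criteria, here is a small sketch in plain Python. The function names are hypothetical, and the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the text does not say which estimator is meant.

```python
from statistics import mean, median, pstdev


def zscore_mask(residuals, z=3.5):
    """Mark residuals with |r - m| > s * z (plain Z-score)."""
    m, s = mean(residuals), pstdev(residuals)
    return [abs(r - m) > s * z for r in residuals]


def modz_mask(residuals, z=3.5):
    """Mark residuals with 0.6745 * |r - m| > mad * z > 0 (modified Z-score)."""
    m = mean(residuals)
    mad = median(abs(r - m) for r in residuals)
    return [0.6745 * abs(r - m) > mad * z > 0 for r in residuals]


residuals = [0.1, -0.2, 0.0, 0.15, -0.1, 0.05, 6.0, -0.05, 0.1, -0.15]
print(any(zscore_mask(residuals)))      # False
print(modz_mask(residuals).index(True))  # 6
```

With only ten residuals the single large value inflates the standard deviation so much that the plain Z-score stays below 3.5, while the median-based criterion still marks it.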
References
[1] https://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm
spikes_spektrumBased
spikes_spektrumBased(raise_factor=0.15, deriv_factor=0.2,
noise_thresh=1, noise_window="12h", noise_func="CoVar",
ploy_deg=2, filter_window=None)
parameter | data type | default value | description |
---|---|---|---|
raise_factor | float | 0.15 | Minimum relative value difference between two values for the latter to be considered a spike candidate. See condition (1) |
deriv_factor | float | 0.2 | See condition (2) |
noise_thresh | float | 1 | Upper threshold for the noisiness of the data surrounding a potential spike. See condition (3) |
noise_window | offset string | "12h" | Range of the time window that defines the "surrounding" data of a potential spike. See condition (3) |
noise_func | string | "CoVar" | Function used to calculate the noisiness of the data surrounding a potential spike |
ploy_deg | integer | 2 | Order of the polynomial fit applied by the Savitzky-Golay filter |
filter_window | offset string | None | Range of the smoothing window of the Savitzky-Golay filter. If None (default), the window size is two times the sampling rate (thus covering 3 values). If unsure, do not change this value |
The function flags spikes by evaluating the timeseries' derivatives and applying various conditions to them.
A datapoint $x_k$ of a data series $x$ is considered a spike if:

1. The quotient to its preceding datapoint exceeds a certain bound:
   - $|\frac{x_k}{x_{k-1}}| > 1 +$ `raise_factor`, or
   - $|\frac{x_k}{x_{k-1}}| < 1 -$ `raise_factor`
2. The quotient of the data's second derivative $x''$ at the preceding and subsequent timestamps is close enough to 1:
   - $|\frac{x''_{k-1}}{x''_{k+1}}| > 1 -$ `deriv_factor`, and
   - $|\frac{x''_{k-1}}{x''_{k+1}}| < 1 +$ `deriv_factor`
3. The dataset $X_k$ surrounding $x_{k}$ within the `noise_window` range, but excluding $x_{k}$, is not too noisy. The noisiness is measured by `noise_func`:
   - `noise_func`$(X_k) <$ `noise_thresh`
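The conditions can be sketched for a single datapoint in plain Python. This sketch makes several simplifying assumptions that are not taken from the library: the function names are hypothetical, the second derivative is a central finite difference on unsmoothed, equidistant data (no Savitzky-Golay filter), the noise window is given as a number of `neighbours` on each side instead of an offset string, and the noisiness is the coefficient of variation.

```python
from statistics import mean, pstdev


def second_derivative(values, k):
    """Central finite difference on an equidistant grid (unit spacing)."""
    return values[k - 1] - 2 * values[k] + values[k + 1]


def co_var(values):
    """Coefficient of variation: standard deviation relative to the mean."""
    return pstdev(values) / abs(mean(values))


def is_spike(values, k, raise_factor=0.15, deriv_factor=0.2,
             noise_thresh=1, neighbours=6):
    """Check conditions (1) to (3) for values[k]; needs 2 <= k < len - 2."""
    # Condition (1): the quotient to the preceding value leaves [1 - f, 1 + f].
    quotient = abs(values[k] / values[k - 1])
    if 1 - raise_factor <= quotient <= 1 + raise_factor:
        return False
    # Condition (2): second derivatives before and after x_k nearly agree.
    before = second_derivative(values, k - 1)
    after = second_derivative(values, k + 1)
    if not 1 - deriv_factor < abs(before / after) < 1 + deriv_factor:
        return False
    # Condition (3): the surrounding data, excluding x_k, is not too noisy.
    left = values[max(0, k - neighbours):k]
    right = values[k + 1:k + 1 + neighbours]
    return co_var(left + right) < noise_thresh


values = [0.30, 0.31, 0.30, 0.31, 0.45, 0.31, 0.30, 0.31, 0.30]
print(is_spike(values, 4))  # True
print(is_spike(values, 5))  # False
```

The value after the spike (index 5) also shows a large relative change, but it fails condition (2), which is what distinguishes the spike itself from the return to the base level.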
NOTE:
- The dataset is supposed to be harmonized to a time series with an equidistant frequency grid.
- The derivative is calculated after applying a Savitzky-Golay filter to $x$.

This function is a generalization of the spectrum-based spike flagging mechanism presented in [1].
Noise Detection Functions
Currently two different noise detection functions are implemented:
- `"CoVar"`: Coefficient of Variation
- `"rVar"`: relative Variance
References
[1] Dorigo, W. et al.: Global Automated Quality Control of In Situ Soil Moisture Data from the International Soil Moisture Network. 2013. Vadose Zone J. doi:10.2136/vzj2012.0097.