David Schäfer authored af74dfe6
Spike Detection
A collection of quality check routines to find spikes.
Index
spikes_basic
spikes_basic(thresh, tolerance, window_size)
parameter | data type | default value | description |
---|---|---|---|
thresh | float | | Minimum difference between two values for the latter one to be considered a spike. See condition (1) |
tolerance | float | | Maximum difference between pre-spike and post-spike values. See condition (2) |
window | offset string | | Maximum length of "spiky" value courses. See condition (3) |
A basic outlier test designed to work for harmonized as well as raw (not harmonized) data.

The values $x_n, x_{n+1}, ..., x_{n+k}$ of a time series $x$ with timestamps $t$ are considered spikes if:

1. $|x_{n-1} - x_{n+s}| >$ `thresh`, $s \in \{0, 1, 2, ..., k\}$
2. $|x_{n-1} - x_{n+k+1}| <$ `tolerance`
3. $|t_{n-1} - t_{n+k+1}| <$ `window`
By this definition, spikes are values that, after a jump of more than `thresh` (1), stay at the new value level for a time span shorter than `window` (3) and then return to the initial value level, within a tolerance of `tolerance` (2).

NOTE: This characterization of a "spike" includes not only single-value outliers but also plateau-like value courses.
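The three conditions can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: the function name `find_basic_spikes` is hypothetical, timestamps are `datetime` objects, `window` is passed as a `timedelta` instead of an offset string, and each candidate run is extended for as long as condition (1) holds.

```python
from datetime import datetime, timedelta


def find_basic_spikes(times, values, thresh, tolerance, window):
    """Return the indices of values that satisfy conditions (1) to (3).

    Hypothetical helper: `times` holds datetimes, `window` is a timedelta.
    """
    flagged = []
    n = 1
    while n < len(values):
        base = values[n - 1]  # x_{n-1}, the pre-spike level
        # Condition (1): extend the run x_n..x_{n+k} while the jump exceeds thresh.
        end = n
        while end < len(values) and abs(base - values[end]) > thresh:
            end += 1
        # `end` now points at x_{n+k+1}, the first value after the run.
        has_run = end > n
        has_post_value = end < len(values)
        if (
            has_run
            and has_post_value
            and abs(base - values[end]) < tolerance  # condition (2)
            and times[end] - times[n - 1] < window   # condition (3)
        ):
            flagged.extend(range(n, end))
            n = end + 1  # continue after the post-spike value
        else:
            n += 1
    return flagged


start = datetime(2020, 1, 1)
times = [start + timedelta(minutes=10 * i) for i in range(7)]
values = [1.0, 1.1, 10.0, 10.2, 1.0, 0.9, 1.0]
print(find_basic_spikes(times, values, 5, 0.5, timedelta(hours=1)))  # [2, 3]
```

The two-value plateau at indices 2 and 3 is flagged as a whole, which matches the note above. With a `window` of 20 minutes the same plateau spans too much time and nothing is flagged.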
spikes_simpleMad
spikes_simpleMad(window="1h", z=3.5)
parameter | data type | default value | description |
---|---|---|---|
window | integer/offset string | "1h" | size of the sliding window to which the modified Z-score is applied |
z | float | 3.5 | z-parameter of the modified Z-score |
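This function flags outliers using the simple median absolute deviation test.

Values $x$ are flagged if they fulfill the following condition within a sliding window:

$0.6745 * |x - m| > mad * z > 0$

where $m$ denotes the window median, $mad$ the median absolute deviation of the window, and $z$ the `z` parameter.

The window is moved by one timestamp at a time.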
NOTE: This function should only be applied on normalized data.
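A rough sketch of a sliding modified Z-score test, following the definition in the NIST handbook referenced below (deviation from the median, scaled by 0.6745 and compared against the median absolute deviation). It rests on several assumptions: the name `flag_simple_mad` is hypothetical, the window is given as a number of values instead of an offset string, and every value inside a window is tested, which is only one possible reading of the description.

```python
from statistics import median


def flag_simple_mad(values, window=5, z=3.5):
    """Flag values whose modified Z-score exceeds z in any window (sketch)."""
    flags = [False] * len(values)
    # Move the window by one value at a time.
    for start in range(len(values) - window + 1):
        chunk = values[start:start + window]
        m = median(chunk)
        mad = median(abs(x - m) for x in chunk)
        if mad == 0:
            continue  # the condition requires mad * z > 0
        for position, x in enumerate(chunk):
            if 0.6745 * abs(x - m) > mad * z:
                flags[start + position] = True
    return flags


values = [1.0, 1.2, 0.9, 1.1, 9.0, 1.0, 0.8, 1.1, 1.0]
print(flag_simple_mad(values))  # only index 4 (the value 9.0) is True
```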
References: [1] https://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm
spikes_slidingZscore
spikes_slidingZscore(window="1h", offset="1h", count=1, polydeg=1, z=3.5, method="modZ")
parameter | data type | default value | description |
---|---|---|---|
window | integer/offset string | "1h" | size of the sliding window |
offset | integer/offset string | "1h" | offset between two consecutive windows |
count | integer | 1 | the minimal count a possible outlier needs to be flagged |
polydeg | integer | 1 | the degree of the polynomial fit used to calculate the residual |
z | float | 3.5 | z-parameter for the method (see description) |
method | string | "modZ" | the method to detect outliers |
This function flags spikes using the given method within sliding windows.

NOTE:
- `window` and `offset` must be of the same type; mixing offset-based and integer-based windows is not supported and will fail
- offset strings only work with time-series-like data
The algorithm works as follows:
1. a window of size `window` is cut from the data
2. normalization: the data is fit by a polynomial of the given degree `polydeg`, which is subtracted from the data
3. the outlier detection `method` is applied to the residual, and possible outliers are marked
4. the window (on the data) is moved by `offset`
5. start over from 1. until the end of the data is reached
6. all potential outliers that are detected `count`-many times are flagged as outliers
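The steps above can be sketched with NumPy as follows. This is an illustration under assumptions, not the actual implementation: the name `sliding_zscore_flags` is hypothetical, `window` and `offset` are integers (numbers of values), the plain Z-score with the population standard deviation serves as the detection method, and trailing values not covered by a full window are never tested.

```python
import numpy as np


def sliding_zscore_flags(values, window=20, offset=5, count=1, polydeg=1, z=3.5):
    """Flag values marked as possible outliers in at least `count` windows."""
    values = np.asarray(values, dtype=float)
    hits = np.zeros(len(values), dtype=int)
    x = np.arange(window)
    # Steps 1, 4 and 5: cut a window, move it by `offset`, repeat.
    for start in range(0, len(values) - window + 1, offset):
        chunk = values[start:start + window]
        # Step 2: subtract a polynomial fit of degree `polydeg`.
        coeffs = np.polyfit(x, chunk, polydeg)
        residual = chunk - np.polyval(coeffs, x)
        # Step 3: plain Z-score on the residual, |r - m| > s * z.
        m, s = residual.mean(), residual.std()
        possible = np.abs(residual - m) > s * z
        hits[start:start + window] += possible
    # Step 6: keep candidates detected at least `count` times.
    return hits >= count
```

Because consecutive windows overlap whenever `offset` is smaller than `window`, a value can be tested several times, and `count` then controls how many of those tests must agree. Note that a plain Z-score is bounded by the window size, so small windows combined with a large `z` can never flag anything.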
Outlier Detection Methods
Currently two outlier detection methods are implemented:
- `"zscore"`: The Z-score marks every value as a possible outlier that fulfills the following condition:

  $|r - m| > s * z$

  where $r$ denotes the residual, $m$ the residual mean, $s$ the residual standard deviation, and $z$ the `z` parameter.
- `"modZ"`: The modified Z-score marks every value as a possible outlier that fulfills the following condition:

  $0.6745 * |r - m| > mad * z > 0$

  where $r$ denotes the residual, $m$ the residual median, $mad$ the residual median absolute deviation, and $z$ the `z` parameter.
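For comparison, both conditions can be written as small standalone functions. This is a sketch with hypothetical names (`zscore_outliers`, `modz_outliers`). The modified Z-score here is centered on the median, as in the NIST handbook definition, and the use of the population standard deviation for the plain Z-score is an assumption.

```python
from statistics import mean, median, pstdev


def zscore_outliers(residual, z=3.5):
    """Mark r as a possible outlier if |r - m| > s * z (mean m, std s)."""
    m = mean(residual)
    s = pstdev(residual)  # assumption: population standard deviation
    return [abs(r - m) > s * z for r in residual]


def modz_outliers(residual, z=3.5):
    """Mark r as a possible outlier if 0.6745 * |r - m| > mad * z > 0."""
    m = median(residual)
    mad = median(abs(r - m) for r in residual)
    if mad == 0:
        return [False] * len(residual)
    return [0.6745 * abs(r - m) > mad * z for r in residual]


residual = [0.1, -0.2, 0.0, 0.15, -0.1, 8.0, 0.05, -0.05]
print(modz_outliers(residual))        # only index 5 is True
print(zscore_outliers(residual))      # all False with z=3.5
print(zscore_outliers(residual, 2.0)) # only index 5 is True
```

The example shows why the median-based variant is often preferred for short windows: the single large value inflates the mean and standard deviation so much that the plain Z-score with `z=3.5` misses it.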
References
[1] https://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm
spikes_spektrumBased
spikes_spektrumBased(raise_factor=0.15, deriv_factor=0.2,
noise_func="CoVar", noise_window="12h", noise_thresh=1,
smooth_window=None, smooth_poly_deg=2)
parameter | data type | default value | description |
---|---|---|---|
raise_factor | float | 0.15 | Minimum relative value difference between two values to consider the latter a spike candidate. See condition (1) |
deriv_factor | float | 0.2 | See condition (2) |
noise_func | string | "CoVar" | Function to calculate the noisiness of the data surrounding potential spikes |
noise_window | offset string | "12h" | Determines the range of the time window of the "surrounding" data of a potential spike. See condition (3) |
noise_thresh | float | 1 | Upper threshold for the noisiness of the data surrounding potential spikes. See condition (3) |
smooth_window | offset string | None | Size of the smoothing window of the Savitzky-Golay filter. The default value None results in a window of two times the sampling rate (i.e. three values) |
smooth_poly_deg | integer | 2 | Degree of the polynomial used for fitting with the Savitzky-Golay filter |
The function flags spikes by evaluating the time series' derivatives and applying various conditions to them.
The value $x_k$ of a time series $x$ is considered a spike if:

1. The quotient to its preceding data point exceeds a certain bound:
   - $|\frac{x_k}{x_{k-1}}| > 1 +$ `raise_factor`, or
   - $|\frac{x_k}{x_{k-1}}| < 1 -$ `raise_factor`
2. The quotient of the second derivative $x''$ at the preceding and subsequent timestamps is close enough to 1:
   - $|\frac{x''_{k-1}}{x''_{k+1}}| > 1 -$ `deriv_factor`, and
   - $|\frac{x''_{k-1}}{x''_{k+1}}| < 1 +$ `deriv_factor`
3. The dataset $X = x_i, ..., x_{k-1}, x_{k+1}, ..., x_j$, with $|t_{k-1} - t_i| = |t_j - t_{k+1}| =$ `noise_window`, fulfills the following condition: `noise_func`$(X) <$ `noise_thresh`
NOTE:
- The dataset is supposed to be harmonized to a time series with an equidistant frequency grid
- The derivative is calculated after applying a Savitzky-Golay filter to $x$

This function is a generalization of the spectrum-based spike flagging mechanism presented in [1].
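The following sketch shows how the three conditions interact on an equidistant series. It is simplified on purpose and makes several assumptions: the name `flag_spectrum_spikes` is hypothetical, the Savitzky-Golay smoothing step is omitted, the second derivative is approximated by second differences, `noise_window` is a number of values instead of an offset string, and the coefficient of variation (standard deviation divided by the absolute mean) is used as the noise function.

```python
from statistics import mean, pstdev


def coefficient_of_variation(xs):
    """Population standard deviation divided by the absolute mean."""
    return pstdev(xs) / abs(mean(xs))


def flag_spectrum_spikes(x, raise_factor=0.15, deriv_factor=0.2,
                         noise_window=5, noise_thresh=1.0):
    """Flag x[k] if conditions (1) to (3) hold (unsmoothed sketch)."""
    # Second differences stand in for x''; undefined at both ends.
    d2 = [None] + [x[i - 1] - 2 * x[i] + x[i + 1]
                   for i in range(1, len(x) - 1)] + [None]
    flags = [False] * len(x)
    for k in range(2, len(x) - 2):
        # Condition (1): relative jump compared with the preceding value.
        if x[k - 1] == 0:
            continue
        quotient = abs(x[k] / x[k - 1])
        if not (quotient > 1 + raise_factor or quotient < 1 - raise_factor):
            continue
        # Condition (2): second derivatives around k have similar magnitude.
        if d2[k + 1] == 0:
            continue
        deriv_quotient = abs(d2[k - 1] / d2[k + 1])
        if not (1 - deriv_factor < deriv_quotient < 1 + deriv_factor):
            continue
        # Condition (3): the surrounding data, without x[k], is not too noisy.
        surrounding = x[max(0, k - noise_window):k] + x[k + 1:k + 1 + noise_window]
        if coefficient_of_variation(surrounding) < noise_thresh:
            flags[k] = True
    return flags


x = [10.0, 10.1, 9.9, 10.0, 10.2, 15.0, 10.1, 9.9, 10.0, 10.1, 10.0]
print(flag_spectrum_spikes(x))  # only index 5 (the value 15.0) is True
```

Condition (2) is what separates the spike itself from the value right after it: the return from 15.0 to 10.1 also satisfies condition (1), but the second differences on either side of that point differ strongly, so it is not flagged.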
Noise Detection Functions
Currently two different noise detection functions are implemented:
- `"CoVar"`: Coefficient of Variation
- `"rVar"`: relative Variance
References
[1] Dorigo, W. et al.: Global Automated Quality Control of In Situ Soil Moisture Data from the International Soil Moisture Network. 2013. Vadose Zone Journal. doi:10.2136/vzj2012.0097.