From 5500cd47055363db72d9602e4ac3233274c815f2 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Peter=20L=C3=BCnenschlo=C3=9F?= <peter.luenenschloss@ufz.de> Date: Tue, 4 Aug 2020 08:08:56 +0200 Subject: [PATCH] Delete SpikeDetection.md --- docs/funcs/SpikeDetection.md | 237 ----------------------------------- 1 file changed, 237 deletions(-) delete mode 100644 docs/funcs/SpikeDetection.md diff --git a/docs/funcs/SpikeDetection.md b/docs/funcs/SpikeDetection.md deleted file mode 100644 index 006b30b26..000000000 --- a/docs/funcs/SpikeDetection.md +++ /dev/null @@ -1,237 +0,0 @@ -# Spike Detection - -A collection of quality check routines to find spikes. - -## Index - -- [spikes_flagBasic](#spikes_flagbasic) -- [spikes_flagMad](#spikes_flagmad) -- [spikes_flagSlidingZscore](#spikes_flagslidingzscore) -- [spikes_flagSpektrumBased](#spikes_flagspektrumbased) -- [spikes_flagRaise](#spikes_flagraise) - - -## spikes_flagBasic - -``` -spikes_flagBasic(thresh, tolerance, window) -``` - -| parameter | data type | default value | description | -|-----------|---------------------------------------------------------------|---------------|------------------------------------------------------------------------------------------------| -| thresh | float | | Minimum difference between to values, to consider the latter one as a spike. See condition (1) | -| tolerance | float | | Maximum difference between pre-spike and post-spike values. See condition (2) | -| window | [offset string](docs/ParameterDescriptions.md#offset-strings) | | Maximum length of "spiky" value courses. See condition (3) | - -A basic outlier test, that is designed to work for harmonized, as well as raw -(not-harmonized) data. - -The values $`x_{n}, x_{n+1}, .... , x_{n+k} `$ of a time series $`x_t`$ with -timestamps $`t_i`$ are considered spikes, if: - -1. $`|x_{n-1} - x_{n+s}| > `$ `thresh`, $` s \in \{0,1,2,...,k\} `$ - -2. $`|x_{n-1} - x_{n+k+1}| < `$ `tolerance` - -3. $` |t_{n-1} - t_{n+k+1}| < `$ `window` - -By this definition, spikes are values, that, after a jump of margin `thresh`(1), -are keeping that new value level, for a time span smaller than -`window` (3), and then return to the initial value level - -within a tolerance of `tolerance` (2). - -NOTE: -This characterization of a "spike", not only includes one-value -outliers, but also plateau-ish value courses. - - -## spikes_flagMad - -``` -spikes_flagMad(window, z=3.5) -``` - -| parameter | data type | default value | description | -|-----------|-----------------------------------------------------------------------|---------------|----------------------------------------------------------------------| -| window | integer/[offset string](docs/ParameterDescriptions.md#offset-strings) | | size of the sliding window, where the modified Z-score is applied on | -| z | float | `3.5` | z-parameter of the modified Z-score | - -This functions flags outliers using the simple median absolute deviation test. - -Values are flagged if they fulfill the following condition within a sliding window: - -```math - 0.6745 * |x - m| > mad * z > 0 -``` - -where $`x`$ denotes the window data, $`m`$ the window median, $`mad`$ the median -absolute deviation and $`z`$ the $`z`$-parameter of the modified Z-Score. - -The window is moved by one time stamp at a time. - -NOTE: -This function should only be applied on normalized data. - -References: -[1] https://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm - - -## spikes_flagSlidingZscore - -``` -spikes_flagSlidingZscore(window, offset, count=1, polydeg=1, z=3.5, method="modZ") -``` - -| parameter | data type | default value | description | -|-----------|-----------------------------------------------------------------------|---------------|-------------------------------------------------------------| -| window | integer/[offset string](docs/ParameterDescriptions.md#offset-strings) | | size of the sliding window | -| offset | integer/[offset string](docs/ParameterDescriptions.md#offset-strings) | | offset between two consecutive windows | -| count | integer | `1` | the minimal count a possible outlier needs, to be flagged | -| polydeg | integer | `1"` | the degree of the polynomial fit, to calculate the residual | -| z | float | `3.5` | z-parameter for the *method* (see description) | -| method | [string](#outlier-detection-methods) | `"modZ"` | the method to detect outliers | - -This functions flags spikes using the given method within sliding windows. - -NOTE: - - `window` and `offset` must be of same type, mixing of offset- and integer- - based windows is not supported and will fail - - offset-strings only work with time-series-like data - -The algorithm works as follows: - 1. a window of size `window` is cut from the data - 2. normalization - the data is fit by a polynomial of the given degree `polydeg`, which is subtracted from the data - 3. the outlier detection `method` is applied on the residual, possible outlier are marked - 4. the window (on the data) is moved by `offset` - 5. start over from 1. until the end of data is reached - 6. all potential outliers, that are detected `count`-many times, are flagged as outlier - -### Outlier Detection Methods -Currently two outlier detection methods are implemented: - -1. `"zscore"`: The Z-score marks every value as a possible outlier, which fulfills the following condition: - - ```math - |r - m| > s * z - ``` - where $`r`$ denotes the residual, $`m`$ the residual mean, $`s`$ the residual - standard deviation, and $`z`$ the $`z`$-parameter. - -2. `"modZ"`: The modified Z-score Marks every value as a possible outlier, which fulfills the following condition: - - ```math - 0.6745 * |r - m| > mad * z > 0 - ``` - - where $`r`$ denotes the residual, $`m`$ the residual mean, $`mad`$ the residual median absolute - deviation, and $`z`$ the $`z`$-parameter. - -### References -[1] https://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm - - -## spikes_flagSpektrumBased - -``` -spikes_flagSpektrumBased(raise_factor=0.15, deriv_factor=0.2, - noise_func="CoVar", noise_window="12h", noise_thresh=1, - smooth_window=None, smooth_poly_deg=2) -``` - -| parameter | data type | default value | description | -|-----------------|---------------------------------------------------------------|---------------|------------------------------------------------------------------------------------------------------------------------------------------------------------| -| raise_factor | float | `0.15` | Minimum relative value difference between two values to consider the latter as a spike candidate. See condition (1) | -| deriv_factor | float | `0.2` | See condition (2) | -| noise_func | [string](#noise-detection-functions) | `"CoVar"` | Function to calculate noisiness of the data surrounding potential spikes | -| noise_window | [offset string](docs/ParameterDescriptions.md#offset-strings) | `"12h"` | Determines the range of the time window of the "surrounding" data of a potential spike. See condition (3) | -| noise_thresh | float | `1` | Upper threshold for noisiness of data surrounding potential spikes. See condition (3) | -| smooth_window | [offset string](docs/ParameterDescriptions.md#offset-strings) | `None` | Size of the smoothing window of the Savitsky-Golay filter. The default value `None` results in a window of two times the sampling rate (i.e. three values) | -| smooth_poly_deg | integer | `2` | Degree of the polynomial used for fitting with the Savitsky-Golay filter | - - -The function flags spikes by evaluating the time series' derivatives -and applying various conditions to them. - -The value $`x_{k}`$ of a time series $`x_t`$ with -timestamps $`t_i`$ is considered a spikes, if: - - -1. The quotient to its preceding data point exceeds a certain bound: - * $` |\frac{x_k}{x_{k-1}}| > 1 + `$ `raise_factor`, or - * $` |\frac{x_k}{x_{k-1}}| < 1 - `$ `raise_factor` -2. The quotient of the second derivative $`x''`$, at the preceding - and subsequent timestamps is close enough to 1: - * $` |\frac{x''_{k-1}}{x''_{k+1}} | > 1 - `$ `deriv_factor`, and - * $` |\frac{x''_{k-1}}{x''_{k+1}} | < 1 + `$ `deriv_factor` -3. The dataset $`X = x_i, ..., x_{k-1}, x_{k+1}, ..., x_j`$, with - $`|t_{k-1} - t_i| = |t_j - t_{k+1}| =`$ `noise_window` fulfills the - following condition: - `noise_func`$`(X) <`$ `noise_thresh` - -NOTE: -- The dataset is supposed to be harmonized to a time series with an equidistant frequency grid -- The derivative is calculated after applying a Savitsky-Golay filter to $`x`$ - - This function is a generalization of the Spectrum based Spike flagging - mechanism presented in [1] - -### Noise Detection Functions -Currently two different noise detection functions are implemented: -- `"CoVar"`: Coefficient of Variation -- `"rVar"`: relative Variance - - -### References -[1] Dorigo, W. et al: Global Automated Quality Control of In Situ Soil Moisture - Data from the international Soil Moisture Network. 2013. Vadoze Zone J. - doi:10.2136/vzj2012.0097. - -## spikes_flagRaise - - -``` -spikes_flagRaise(thresh, raise_window, intended_freq, average_window=None, - mean_raise_factor=2, min_slope=None, min_slope_weight=0.8, - numba_boost=True) -``` - -| parameter | data type | default value | description | -|-------------------|---------------------------------------------------------------|---------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| thresh | float | | The threshold, for the total rise (`thresh` $` > 0 `$ ), or total drop (`thresh` $` < 0 `$ ), that value courses must not exceed within a timespan of length `raise_window` | -| raise_window | [offset string](docs/ParameterDescriptions.md#offset-strings) | | The timespan, the rise/drop thresholding refers to. Window is inclusively defined. | -| intended_freq | [offset string](docs/ParameterDescriptions.md#offset-strings) | | The frequency, timeseries to-be-flagged is supposed to be sampled at. Window is inclusively defined. | -| average_window | [offset string](docs/ParameterDescriptions.md#offset-strings) | `None` | See condition (2) below. Window is inclusively defined. The window defaults to 1.5 times the size of `raise_window` | -| mean_raise_factor | float | `2` | See condition (2) below. | -| min_slope | float | `None` | See condition (3) | -| min_slope_weight | integer | `0.8` | See condition (3) | - -The function flags rises and drops in value courses, that exceed the threshold -given by `thresh` within a timespan shorter than, or equalling the time window -given by `raise_window`. - -Weather rises or drops are flagged, is controlled by the signum of `thresh`. -(positive->rises, negative->drops) - -The parameter variety of the function is owned to the intriguing -case of values, that "return" from outlierish or anomalious value levels and -thus exceed the threshold, while actually being usual values. - -The value $`x_{k}`$ of a time series $`x`$ with associated -timestamps $`t_i`$, is flagged a rise, if: - -1. There is any value $`x_{s}`$, preceeding $`x_{k}`$ within `raise_window` range, so that: - * $` M = |x_k - x_s | > `$ `thresh` $` > 0`$ -2. The weighted average $`\mu^*`$ of the values, preceeding $`x_{k}`$ within `average_window` range indicates, that $`x_{k}`$ doesnt return from an outliererish value course, meaning that: - * $` x_k > \mu^* + ( M `$ / `mean_raise_factor` $`)`$ -3. Additionally, if `min_slope` is not `None`, $`x_{k}`$ is checked for being sufficiently divergent from its very predecessor $`x_{k-1}`$, meaning that, it is additionally checked if: - * $`x_k - x_{k-1} > `$ `min_slope` - * $`t_k - t_{k-1} > `$ `min_slope_weight`*`intended_freq` - -The weighted average $`\mu^*`$ was calculated with weights $`w_{i}`$, defined by: -* $`w_{i} = (t_i - t_{i-1})`$ / `intended_freq`, if $`(t_i - t_{i-1})`$ < `intended_freq` and $`w_i =1`$ otherwise. - -The application of time gap weights and a slope weights are to account for the case of not harmonized timeseries. - -NOTE: -- The dataset is NOT supposed to be harmonized to a time series with an - equidistant frequency grid -- GitLab