
# Spike Detection

A collection of quality check routines to find spikes.

## Index

- [spikes_basic](#spikes_basic)
- [spikes_simpleMad](#spikes_simplemad)
- [spikes_slidingZscore](#spikes_slidingzscore)
- [spikes_spektrumBased](#spikes_spektrumbased)


## spikes_basic

```
spikes_basic(thresh, tolerance, window)
```

| parameter | data type                                                     | default value | description                                                                                    |
|-----------|---------------------------------------------------------------|---------------|------------------------------------------------------------------------------------------------|
| thresh    | float                                                         |               | Minimum difference between two values to consider the latter one as a spike. See condition (1) |
| tolerance | float                                                         |               | Maximum difference between pre-spike and post-spike values. See condition (2)                  |
| window    | [offset string](docs/ParameterDescriptions.md#offset-strings) |               | Maximum length of "spiky" value courses. See condition (3)                                  |

A basic outlier test that is designed to work for harmonized as well as raw
(not-harmonized) data.

The values $`x_{n}, x_{n+1}, \ldots, x_{n+k}`$ of a time series $`x_t`$ with
timestamps $`t_i`$ are considered spikes if:

1. $`|x_{n-1} - x_{n+s}| > `$ `thresh`, $` s \in \{0,1,2,\ldots,k\} `$

2. $`|x_{n-1} - x_{n+k+1}| < `$ `tolerance`

3. $` |t_{n-1} - t_{n+k+1}| < `$ `window`

By this definition, spikes are values that, after a jump of more than `thresh` (1),
stay at the new value level for a time span shorter than
`window` (3) and then return to the initial value level,
within a tolerance of `tolerance` (2).

NOTE:
This characterization of a "spike" includes not only single-value
outliers, but also plateau-like value courses.
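
The three conditions can be sketched in plain Python. This is only an illustration, not the library's implementation: the function name `find_basic_spikes` is hypothetical, timestamps and `window` are assumed to be plain numbers (seconds) instead of offset strings, and a candidate is assumed to extend as long as condition (1) keeps holding.

```python
def find_basic_spikes(times, values, thresh, tolerance, window):
    """Return the indices flagged by conditions (1)-(3).

    Hypothetical helper: `times` are increasing timestamps in seconds and
    `window` is a duration in seconds (an assumption of this sketch).
    """
    flagged = set()
    n = 1
    while n < len(values):
        base = values[n - 1]  # x_{n-1}, the pre-spike value
        # condition (1) for s = 0: a jump larger than thresh starts a candidate
        if abs(base - values[n]) <= thresh:
            n += 1
            continue
        end = n  # index of x_{n+k}
        # extend the candidate while condition (1) still holds
        while end + 1 < len(values) and abs(base - values[end + 1]) > thresh:
            end += 1
        post = end + 1  # index of x_{n+k+1}, the first post-spike value
        if (
            post < len(values)
            and abs(base - values[post]) < tolerance      # condition (2)
            and abs(times[post] - times[n - 1]) < window  # condition (3)
        ):
            flagged.update(range(n, post))
            n = post + 1
        else:
            n += 1
    return sorted(flagged)


times = [0, 600, 1200, 1800, 2400, 3000]
values = [10.0, 10.1, 25.0, 24.8, 10.2, 10.0]
print(find_basic_spikes(times, values, thresh=5, tolerance=1, window=3600))
# -> [2, 3]
```

With a one-hour `window` the two-value plateau is flagged as a whole. Shrinking `window` to 1500 seconds leaves it unflagged, because the time between the pre-spike and post-spike values (1800 seconds) then violates condition (3).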


## spikes_simpleMad

```
spikes_simpleMad(window, z=3.5)
```

| parameter | data type                                                             | default value | description                                                          |
|-----------|-----------------------------------------------------------------------|---------------|----------------------------------------------------------------------|
| window    | integer/[offset string](docs/ParameterDescriptions.md#offset-strings) |               | size of the sliding window in which the modified Z-score is applied  |
| z         | float                                                                 | `3.5`         | z-parameter of the modified Z-score                                  |

This function flags outliers using the simple median absolute deviation test.

Values are flagged if they fulfill the following condition within a sliding window:

```math
 0.6745 * |x - m| > mad * z > 0
```

where $`x`$ denotes the window data, $`m`$ the window median, $`mad`$ the median
absolute deviation and $`z`$ the $`z`$-parameter of the modified Z-Score.

The window is moved by one time stamp at a time.

NOTE:
This function should only be applied on normalized data.
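
A minimal sketch of the sliding test, under stated assumptions: the window is an integer number of values, it trails the value being tested (the newest value in each window is the one compared against that window's median and MAD), and the name `simple_mad_flags` is hypothetical. How the actual implementation aligns the window is not specified here.

```python
from statistics import median


def simple_mad_flags(values, window, z=3.5):
    """Flag values with 0.6745 * |x - m| > mad * z > 0 in a trailing window."""
    flags = [False] * len(values)
    # move the window by one value at a time
    for end in range(window, len(values) + 1):
        chunk = values[end - window:end]
        m = median(chunk)                        # window median
        mad = median(abs(v - m) for v in chunk)  # median absolute deviation
        x = chunk[-1]                            # value under test (assumption)
        flags[end - 1] = 0.6745 * abs(x - m) > mad * z > 0
    return flags


values = [1.0, 1.1, 0.9, 1.0, 1.2, 8.0, 1.1, 1.0]
flags = simple_mad_flags(values, window=5)
print([i for i, flagged in enumerate(flags) if flagged])
# -> [5]
```

The trailing `> 0` in the condition keeps windows with a MAD of zero (for example, constant data) from flagging anything.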

References:
[1] https://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm


## spikes_slidingZscore

```
spikes_slidingZscore(window, offset, count=1, polydeg=1, z=3.5, method="modZ")
```

| parameter | data type                                                             | default value | description                                                 |
|-----------|-----------------------------------------------------------------------|---------------|-------------------------------------------------------------|
| window    | integer/[offset string](docs/ParameterDescriptions.md#offset-strings) |               | size of the sliding window                                  |
| offset    | integer/[offset string](docs/ParameterDescriptions.md#offset-strings) |               | offset between two consecutive windows                      |
| count     | integer                                                               | `1`           | the minimal count a possible outlier needs, to be flagged   |
| polydeg   | integer                                                               | `1`           | the degree of the polynomial fit, to calculate the residual |
| z         | float                                                                 | `3.5`         | z-parameter for the *method* (see description)              |
| method    | [string](#outlier-detection-methods)                                  | `"modZ"`      | the method to detect outliers                               |

This function flags spikes using the given method within sliding windows.

NOTE:
 - `window` and `offset` must be of the same type; mixing offset-based and
    integer-based windows is not supported and will fail
 - offset-strings only work with time-series-like data

The algorithm works as follows:
  1.  a window of size `window` is cut from the data
  2.  normalization - the data is fit by a polynomial of the given degree `polydeg`, which is subtracted from the data
  3.  the outlier detection `method` is applied on the residual, possible outliers are marked
  4.  the window (on the data) is moved by `offset`
  5.  start over from 1. until the end of data is reached
  6.  all potential outliers that are detected at least `count` times are flagged as outliers
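
The steps above can be sketched with NumPy as follows. This is an illustration under assumptions, not the library's code: `window` and `offset` are integers, the detection method is the modified Z-score (with the median as centre), `count` is read as "at least `count` times", and the function name `sliding_zscore_flags` is hypothetical.

```python
import numpy as np


def sliding_zscore_flags(values, window, offset, count=1, polydeg=1, z=3.5):
    """Integer-window sketch of the six steps, using the modified Z-score."""
    values = np.asarray(values, dtype=float)
    hits = np.zeros(len(values), dtype=int)  # how often each value was marked
    x = np.arange(window)
    # steps 1, 4 and 5: cut a window, move it by `offset`, repeat
    for start in range(0, len(values) - window + 1, offset):
        chunk = values[start:start + window]
        # step 2: subtract a polynomial fit of degree `polydeg`
        coeffs = np.polyfit(x, chunk, polydeg)
        residual = chunk - np.polyval(coeffs, x)
        # step 3: mark possible outliers on the residual
        diff = np.abs(residual - np.median(residual))
        mad = np.median(diff)
        candidates = (mad > 0) & (0.6745 * diff > z * mad)
        hits[start:start + window] += candidates
    # step 6: flag values that were marked at least `count` times
    return hits >= count


# linear trend with small deterministic noise and one spike at index 10
series = 0.5 * np.arange(20) + 0.1 * np.sin(np.arange(20))
series[10] += 10.0
flags = sliding_zscore_flags(series, window=10, offset=5)
print(bool(flags[10]))
```

With `window=10` and `offset=5` every value is covered by at most two windows, so a `count` above 2 can never be reached in this example. In general, `count` should not exceed the number of windows that overlap a value.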

### Outlier Detection Methods
Currently two outlier detection methods are implemented:

1. `"zscore"`: The Z-score marks every value as a possible outlier that fulfills the following condition:

   ```math
    |r - m| > s * z
   ```
   where $`r`$ denotes the residual, $`m`$ the residual mean, $`s`$ the residual
   standard deviation, and $`z`$ the $`z`$-parameter.

2. `"modZ"`: The modified Z-score marks every value as a possible outlier that fulfills the following condition:

   ```math
    0.6745 * |r - m| > mad * z > 0
   ```

   where $`r`$ denotes the residual, $`m`$ the residual median, $`mad`$ the residual median absolute
   deviation, and $`z`$ the $`z`$-parameter.
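
To compare the two conditions, here is a small standard-library sketch. The function names are hypothetical. It assumes the population standard deviation for `"zscore"` and, following the usual definition of the modified Z-score [1], the median as the centre $`m`$ for `"modZ"`.

```python
from statistics import mean, median, pstdev


def zscore_candidates(residual, z=3.5):
    """|r - m| > s * z, with m the mean and s the (population) std."""
    m = mean(residual)
    s = pstdev(residual)
    return [abs(r - m) > s * z for r in residual]


def modz_candidates(residual, z=3.5):
    """0.6745 * |r - m| > mad * z > 0, with m the median (assumption)."""
    m = median(residual)
    mad = median(abs(r - m) for r in residual)
    return [0.6745 * abs(r - m) > mad * z > 0 for r in residual]


residual = [0.1, -0.2, 0.0, 0.2, -0.1, 6.0, 0.1, -0.1, 0.0, 0.1]
print(modz_candidates(residual).index(True))  # -> 5
print(any(zscore_candidates(residual)))       # -> False
```

In this example the plain Z-score marks nothing at `z=3.5`: the single large value inflates the standard deviation so much that it no longer exceeds the bound. The modified Z-score relies on medians, is much less affected, and marks index 5.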

### References
[1] https://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm


## spikes_spektrumBased

```
spikes_spektrumBased(raise_factor=0.15, deriv_factor=0.2,
                     noise_func="CoVar", noise_window="12h", noise_thresh=1, 
                     smooth_window=None, smooth_poly_deg=2)
```

| parameter       | data type                                                     | default value | description                                                                                                                                                |
|-----------------|---------------------------------------------------------------|---------------|------------------------------------------------------------------------------------------------------------------------------------------------------------|
| raise_factor    | float                                                         | `0.15`        | Minimum relative value difference between two values to consider the latter as a spike candidate. See condition (1)                                        |
| deriv_factor    | float                                                         | `0.2`         | See condition (2)                                                                                                                                          |
| noise_func      | [string](#noise-detection-functions)                          | `"CoVar"`     | Function to calculate noisiness of the data surrounding potential spikes                                                                                  |
| noise_window    | [offset string](docs/ParameterDescriptions.md#offset-strings) | `"12h"`       | Determines the range of the time window of the "surrounding" data of a potential spike. See condition (3)                                                  |
| noise_thresh    | float                                                         | `1`           | Upper threshold for noisiness of data surrounding potential spikes. See condition (3)                                                                      |
| smooth_window   | [offset string](docs/ParameterDescriptions.md#offset-strings) | `None`        | Size of the smoothing window of the Savitzky-Golay filter. The default value `None` results in a window of two times the sampling rate (i.e. three values) |
| smooth_poly_deg | integer                                                       | `2`           | Degree of the polynomial used for fitting with the Savitzky-Golay filter                                                                                   |


The function flags spikes by evaluating the time series' derivatives
and applying various conditions to them.

The value $`x_{k}`$ of a time series $`x_t`$ with
timestamps $`t_i`$ is considered a spike if:


1. The quotient to its preceding data point exceeds a certain bound:
    * $` |\frac{x_k}{x_{k-1}}| > 1 + `$ `raise_factor`, or
    * $` |\frac{x_k}{x_{k-1}}| < 1 - `$ `raise_factor`
2. The quotient of the second derivatives $`x''`$ at the preceding
   and subsequent timestamps is close enough to 1:
    * $` |\frac{x''_{k-1}}{x''_{k+1}} | > 1 - `$ `deriv_factor`, and
    * $` |\frac{x''_{k-1}}{x''_{k+1}} | < 1 + `$ `deriv_factor`
3. The dataset $`X = x_i, ..., x_{k-1}, x_{k+1}, ..., x_j`$, with 
   $`|t_{k-1} - t_i| = |t_j - t_{k+1}| =`$ `noise_window` fulfills the 
   following condition: 
   `noise_func`$`(X) <`$ `noise_thresh`
   
NOTE:
- The dataset is supposed to be harmonized to a time series with an equidistant frequency grid
- The derivative is calculated after applying a Savitzky-Golay filter to $`x`$

This function is a generalization of the spectrum-based spike flagging
mechanism presented in [1].
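
Conditions (1) and (2) can be illustrated with a reduced sketch. It is deliberately simplified and not the library's implementation: the function name `raise_and_derivative_candidates` is hypothetical, the second derivative is approximated by central finite differences on the raw values (no Savitzky-Golay smoothing), and condition (3) is left out.

```python
def raise_and_derivative_candidates(values, raise_factor=0.15, deriv_factor=0.2):
    """Indices k fulfilling conditions (1) and (2) on an equidistant grid."""

    def d2(i):
        # central finite-difference approximation of x'' (assumption: no smoothing)
        return values[i - 1] - 2 * values[i] + values[i + 1]

    flagged = []
    for k in range(2, len(values) - 2):
        if values[k - 1] == 0:
            continue  # quotient in condition (1) is undefined
        quotient = abs(values[k] / values[k - 1])
        # condition (1): relative jump with respect to the preceding value
        if not (quotient > 1 + raise_factor or quotient < 1 - raise_factor):
            continue
        d2_next = d2(k + 1)
        if d2_next == 0:
            continue  # quotient in condition (2) is undefined
        # condition (2): second derivatives before and after are of similar size
        d2_quotient = abs(d2(k - 1) / d2_next)
        if 1 - deriv_factor < d2_quotient < 1 + deriv_factor:
            flagged.append(k)
    return flagged


values = [0.30, 0.31, 0.30, 0.45, 0.30, 0.31, 0.30]
print(raise_and_derivative_candidates(values))
# -> [3]
```

Condition (1) alone would also pick index 4, the value right after the spike, because the drop back to the base level is a large relative change as well. Condition (2) rejects it, since the curvature before and after that point differs strongly.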

### Noise Detection Functions
Currently two different noise detection functions are implemented:
- `"CoVar"`: Coefficient of Variation
- `"rVar"`: relative Variance
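
As an illustration of condition (3), the following sketch evaluates the noisiness of the data surrounding a spike candidate. The exact definitions used by the implementation are not given here, so the sketch assumes the textbook coefficient of variation (population standard deviation divided by the absolute mean) and does not cover `"rVar"`. The function names are hypothetical, and timestamps and `noise_window` are plain numbers (seconds) instead of offset strings.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Textbook coefficient of variation: std / |mean| (assumed definition)."""
    m = mean(values)
    if m == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return pstdev(values) / abs(m)


def surroundings_are_quiet(times, values, k, noise_window, noise_thresh=1.0):
    """Condition (3) for the candidate at index k (1 <= k <= len(values) - 2)."""
    lower = times[k - 1] - noise_window
    upper = times[k + 1] + noise_window
    # data around the candidate, excluding the candidate itself
    surrounding = [
        v for i, (t, v) in enumerate(zip(times, values))
        if i != k and lower <= t <= upper
    ]
    return coefficient_of_variation(surrounding) < noise_thresh


times = [i * 3600 for i in range(7)]  # hourly timestamps in seconds
quiet = [0.30, 0.31, 0.30, 0.45, 0.30, 0.31, 0.30]
noisy = [0.1, 2.0, 0.1, 0.45, 2.0, 0.1, 2.0]
print(surroundings_are_quiet(times, quiet, k=3, noise_window=7200))
print(surroundings_are_quiet(times, noisy, k=3, noise_window=7200, noise_thresh=0.5))
```

The first call accepts the candidate because its surroundings barely vary. The second call rejects it, since a jump inside data that is already noisy is not treated as a spike.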


### References
[1] Dorigo, W. et al.: Global Automated Quality Control of In Situ Soil Moisture
    Data from the International Soil Moisture Network. 2013. Vadose Zone J.
    doi:10.2136/vzj2012.0097.