From 3cf8dda6ead653ae9de66fcb119727d027e81844 Mon Sep 17 00:00:00 2001
From: Peter Luenenschloss <peter.luenenschloss@ufz.de>
Date: Wed, 27 Nov 2019 08:52:36 +0100
Subject: [PATCH] Update FunctionDescriptions.md (formula typesetting, parameter tabs)

---
 docs/FunctionDescriptions.md | 66 +++++++++++++++++++++++-------------
 1 file changed, 43 insertions(+), 23 deletions(-)

diff --git a/docs/FunctionDescriptions.md b/docs/FunctionDescriptions.md
index 0bff41158..9aa360503 100644
--- a/docs/FunctionDescriptions.md
+++ b/docs/FunctionDescriptions.md
@@ -19,6 +19,11 @@ associated with "missing" data.
 The missing data indicator (`np.nan` by default) , can be altered to any other
 value by passing this new value to the parameter `nodata`.
 
+| parameter | description |
+| ------ | ------ |
+| nodata | Value. (Default = np.nan). Any value, that shall indicate missing data in the passed dataseries. |
+
+
 ## sesonalRange
 ### Signature
 ```
@@ -61,33 +66,38 @@ mad(length, z=3.5, freq=None)
 ## Spikes_Basic
 ### Signature
 ```
-Spikes_Basic(thresh=7, tol=0, length="15min")
+Spikes_Basic(thresh, tolerance, window_size)
 ```
 ### Description
 A basic outlier test, that is designed to work for harmonized, as well as
 raw (not-harmonized) data.
 
-The values x(n), x(n+1), .... , x(n+k) of a passed timeseries x, are considered
-spikes, if:
+The values $`x_{n}, x_{n+1}, .... , x_{n+k} `$ of a passed timeseries $`x`$,
+are considered spikes, if:
 
-1. |x(n-1) - x(n + s)| > `thresh`, for all integers s in {0,1,2,...,k}
+1. $`|x_{n-1} - x_{n + s}| > `$ `thresh`, for all $` s \in \{0,1,2,...,k\} `$
 
-2. |x(n-1) - x(n+k+1)| < `tol`
+2. $`|x_{n-1} - x_{n+k+1}| < `$ `tolerance`
 
-3. |x(n-1).index - x(n+k+1).index| < `length`
+3. $` |y_{n-1} - y_{n+k+1}| < `$ `window_size`, with $`y `$, denoting the series
+   of timestamps associated with $`x `$.
 
 By this definition, spikes are values, that, after a jump of margin `thresh`(1),
 are keeping that new value level they jumped to, for a timespan smaller than
-`length` (3), and do then return to the initial value level -
-within a tolerance margin of `tol` (2).
+`window_size` (3), and do then return to the initial value level -
+within a tolerance margin of `tolerance` (2).
+
 Note, that this characterization of a "spike", not only includes one-value
 outliers, but also plateau-ish value courses.
 
 The implementation is a time-window based version of an outlier test from the
-UFZ Python library, that can be found here:
-
-https://git.ufz.de/chs/python/blob/master/ufz/level1/spike.py
+UFZ Python library, that can be found [here](https://git.ufz.de/chs/python/blob/master/ufz/level1/spike.py).
 
+| parameter | description |
+| ------ | ------ |
+| thresh | Float. <br/> Minimum jump margin for spikes. See condition (1). |
+| tolerance | Float. <br/> Range of area, containing all "valid return values". See condition (2). |
+| window_size | Offset String. <br/> An offset string, denoting the maximal length of "spikish" value courses. See condition (3). |
 
 ## Spikes_SpektrumBased
 ### Signature
@@ -99,25 +109,25 @@ Spikes_SpektrumBased(filter_window_size="3h", raise_factor=0.15, dev_cont_factor
 ### Description
 
 The function detects and flags spikes in input data series by evaluating the
-the timeseries' derivatives and applying some conditions to it.
+timeseries' derivatives and applying some conditions to them.
 
 NOTE, that the dataseries-to-be flagged is supposed to be harmonized to an
 equadistant frequencie grid.
 
-A datapoint x(k) of a dataseries x, is considered a spike, if:
+A datapoint $`x_k `$ of a dataseries $`x`$,
+is considered a spike, if:
 
 1. The quotient to its preceeding datapoint exceeds a certain bound:
-    * x(k)/x(k-1) > 1 + `raise_factor`, or:
-    * x(k)/x(k-1) < 1 - `raise_factor`
-2. The quotient of the datas second derivate x'', at the preceeding
+    * $`|\frac{x_k}{x_{k-1}}| > 1 +`$ `raise_factor`, or:
+    * $`|\frac{x_k}{x_{k-1}}| < 1 -`$ `raise_factor`
+2. The quotient of the data's second derivative $`x''`$, at the preceding
    and subsequent timestamps is close enough to 1:
-    * (1 - `dev_cont_factor`) < | x''(k-1)/x''(k+1) |, and
-    * (1 + `dev_cont_factor`) > | x''(k-1)/x''(k+1) |
-3. The dataset, surrounding x(k), within `noise_window_size` range, but excluding
-   x(k), is not too noisy. Wheras the noisyness gets measured by
-   `noise_statistic`:
-    * 'noise_statistic'(x.index(k-'noise_window_size'),...,
-      x.index(k+'noise_window') < `noise_barrier`
+    * $`|\frac{x''_{k-1}}{x''_{k+1}} | > 1 -`$ `dev_cont_factor`, and
+    * $`|\frac{x''_{k-1}}{x''_{k+1}} | < 1 +`$ `dev_cont_factor`
+3. The dataset, $`X_k`$, surrounding $`x_{k}`$, within `noise_window_size` range,
+   but excluding $`x_{k}`$, is not too noisy. The noisiness is measured
+   by `noise_statistic`:
+    * `noise_statistic`$`(X_k) <`$ `noise_barrier`
 
 
 This Function is a generalization of the Spectrum based Spike flagging
@@ -127,6 +137,16 @@ Dorigo,W,.... Global Automated Quality Control of In Situ Soil Moisture
 Data from the international Soil Moisture Network. 2013. Vadoze Zone J.
 doi:10.2136/vzj2012.0097.
 
+All parameters default to the values given there.
+
+| parameter | description |
+| ------ | ------ |
+| raise_factor | Float. (Default=0.15). <br/> Minimum change margin for a datapoint to become a candidate for a spike. See condition (1). |
+| dev_cont_factor | Float. (Default=0.2). <br/> See condition (2). |
+| noise_barrier| Float. (Default=1). <br/> Upper bound for noisiness of data surrounding potential spikes. See condition (3).|
+| noise_window_size| Offset String. (Default='12h'). <br/> Size of the timewindow of the "surrounding" data of a potential spike. See condition (3). |
+| noise_statistic| String. (Default="CoVar"). <br/> Operator to calculate noisiness of data, surrounding potential spike. Either "Covar" (=Coefficient of Variation) or "rvar" (=relative Variance).|
+
 ## constant
 ### Signature
 ```
-- 
GitLab
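
Reviewer note (not part of the patch): the three `Spikes_Basic` conditions documented above can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the library's implementation. The function name `flag_basic_spikes` is hypothetical, timestamps are assumed to be `datetime` objects, and `window_size` is assumed to be a `timedelta` rather than an offset string. The sketch takes the longest run of values that jumped away from the base value, which is one possible reading of the definition.

```python
from datetime import datetime, timedelta


def flag_basic_spikes(timestamps, values, thresh, tolerance, window_size):
    """Mark values that satisfy conditions (1)-(3) of the Spikes_Basic description.

    Hypothetical helper for illustration; names and types are assumptions.
    """
    flags = [False] * len(values)
    for start in range(len(values) - 2):  # `start` plays the role of n-1
        base = values[start]
        end = start + 1  # will become the candidate for n+k+1
        # Condition (1): skip over consecutive values that jumped away from base.
        while end < len(values) and abs(base - values[end]) > thresh:
            end += 1
        if end == start + 1 or end == len(values):
            continue  # no jump at all, or the series ended before returning
        returned = abs(base - values[end]) < tolerance  # condition (2)
        short = timestamps[end] - timestamps[start] < window_size  # condition (3)
        if returned and short:
            for i in range(start + 1, end):
                flags[i] = True
    return flags


# Example: a two-value plateau in a 5-minute series.
ts = [datetime(2019, 11, 27, 8, 0) + timedelta(minutes=5 * i) for i in range(6)]
vals = [10.0, 10.0, 25.0, 26.0, 10.0, 10.0]
print(flag_basic_spikes(ts, vals, 7, 1, timedelta(minutes=20)))
# [False, False, True, True, False, False]
```

With a `window_size` of 10 minutes the same plateau is not flagged, because the return to the base level happens 15 minutes after the last value before the jump.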