fixed the constants test · a76297d2
David Schäfer authored 5 years ago

FunctionDescriptions.md 64.34 KiB

Implemented QC functions

Main documentation of the implemented functions, their purpose and parameters and their description.

Index

Miscellaneous
- range
- seasonalRange
- isolated
- missing
- clear
- force
Spike Detection
Constant Detection
- constant
- constants_varianceBased
Break Detection
- breaks_spektrumBased
Time Series Harmonization
Soil Moisture
Machine Learning
- machinelearning

Miscellaneous

range

range(min, max)

parameter	data type	default value	description
min	float		lower bound for valid values
max	float		upper bound for valid values

The function flags all values that lie outside the closed interval [min, max].
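The check can be illustrated with a minimal pure-Python sketch. The helper name `flag_out_of_range` and the parameter names `lower` and `upper` (standing in for min and max) are hypothetical, and leaving NaN values unflagged is an assumption, not documented behaviour.

```python
import math

def flag_out_of_range(values, lower, upper):
    """Return one boolean per value: True where the value lies outside [lower, upper]."""
    # Assumption: NaN values are left unflagged here; flagging them is the job of `missing`.
    return [not math.isnan(v) and not (lower <= v <= upper) for v in values]

# Both interval boundaries are valid values, so 0.0 and 40.0 stay unflagged.
flags = flag_out_of_range([-5.0, 0.0, 12.5, 40.0, 40.1], lower=0.0, upper=40.0)
print(flags)  # [True, False, False, False, True]
```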

seasonalRange

seasonalRange(min, max, startmonth=1, endmonth=12, startday=1, endday=31)

parameter	data type	default value	description
min	float		lower bound for valid values
max	float		upper bound for valid values
startmonth	integer	`1`	interval start month
endmonth	integer	`12`	interval end month
startday	integer	`1`	interval start day
endday	integer	`31`	interval end day

The function does the same as range (it flags all values that lie outside the interval [min, max]), but only if the timestamp of the data point lies within a time interval defined by day and month only. The year is not used in the interval calculation. The left interval boundary is defined by startmonth and startday, the right one by endmonth and endday. Both boundaries are inclusive. If the left boundary occurs later in the year than the right boundary, the interval extends over the change of year (e.g. an interval of [01/12, 01/03] covers December, January, February and, because the boundaries are inclusive, the 1st of March).

Note: This function only works for datetime-indexed data.
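The interval test, including the wrap over the change of year, can be sketched as follows. The helper `in_season` is hypothetical and only mirrors the description above; it is not taken from the implementation.

```python
from datetime import datetime

def in_season(ts, startmonth=1, endmonth=12, startday=1, endday=31):
    """True if the (month, day) of `ts` lies in the inclusive seasonal interval."""
    start = (startmonth, startday)
    end = (endmonth, endday)
    current = (ts.month, ts.day)  # the year is deliberately ignored
    if start <= end:
        return start <= current <= end
    # The left boundary lies later in the year than the right one:
    # the interval wraps around the change of year.
    return current >= start or current <= end

timestamps = [
    datetime(2020, 11, 30),
    datetime(2020, 12, 1),
    datetime(2021, 1, 15),
    datetime(2021, 3, 1),
    datetime(2021, 3, 2),
]
# Interval [01/12, 01/03], written as day/month
season = [in_season(ts, startmonth=12, endmonth=3, startday=1, endday=1) for ts in timestamps]
print(season)  # [False, True, True, True, False]
```

Only values whose timestamp passes this test would then be compared against [min, max].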

isolated

isolated(window, group_size=1, continuation_range='1min')

parameter	data type	default value	description
window	offset string		The range around a value group within which no valid values may occur for the group to be flagged as isolated. See conditions (1) and (2).
group_size	integer	`1`	The upper bound for the size of a value group to be considered an isolated group. See condition (3).
continuation_range	offset string	`"1min"`	The upper bound for the temporal extension of a value group to be considered an isolated group. See condition (4). Only relevant if `group_size` > 1.

The function flags isolated values and value groups. A value or value group is isolated if, within a range of window before and after it, it is surrounded only by already flagged or missing values.

By default, the function flags isolated single values only. The parameters also allow for more complex definitions of isolation, including groups of isolated values.

A continuous group of timeseries values x_{k}, x_{k+1},...,x_{k+n} is considered to be "isolated" if:

1. There are no values preceding x_{k} within window, or all the preceding values within this range are flagged
2. There are no values succeeding x_{k+n} within window, or all the succeeding values within this range are flagged
3. n \leq group_size
4. |y_{k} - y_{k+n}| < continuation_range, with y denoting the series of timestamps associated with x.
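The four conditions can be checked for a single candidate group with a short sketch. The helper `is_isolated`, its `valid` argument (False for flagged or missing values) and the inclusive treatment of the window edges are assumptions made for illustration; the implementation may differ.

```python
from datetime import datetime, timedelta

def is_isolated(times, valid, k, n, window, group_size=1,
                continuation_range=timedelta(minutes=1)):
    """Check conditions (1) to (4) for the group times[k] .. times[k + n]."""
    first, last = times[k], times[k + n]
    # (1) no valid value within `window` before the group
    before_ok = not any(valid[i] and times[i] >= first - window for i in range(k))
    # (2) no valid value within `window` after the group
    after_ok = not any(valid[i] and times[i] <= last + window
                       for i in range(k + n + 1, len(times)))
    # (3) group size bound and (4) temporal extension bound
    return (before_ok and after_ok and n <= group_size
            and (last - first) < continuation_range)

base = datetime(2020, 1, 1)
times = [base + timedelta(minutes=m) for m in (0, 50, 60, 70, 120)]
valid = [True, False, True, False, True]  # values at minute 50 and 70 are flagged or missing

# The value at minute 60 has no valid neighbour within 30 minutes ...
print(is_isolated(times, valid, k=2, n=0, window=timedelta(minutes=30)))  # True
# ... but with a 60 minute window the valid value at minute 0 is in range.
print(is_isolated(times, valid, k=2, n=0, window=timedelta(minutes=60)))  # False
```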

missing

missing(nodata=NaN)

parameter	data type	default value	description
nodata	any	`NaN`	Value indicating missing values in the passed data.

The function flags those values in the passed data series that are associated with "missing" data. The missing data indicator (default: NaN) can be changed to any other value by passing that value to the parameter nodata.
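One detail worth illustrating is that NaN never compares equal to itself, so a NaN indicator needs a dedicated test while any other indicator can be compared directly. The helper `flag_missing` below is a hypothetical sketch of that distinction, not the implementation.

```python
import math

def flag_missing(values, nodata=math.nan):
    """Return True for every value that equals the missing data indicator."""
    if isinstance(nodata, float) and math.isnan(nodata):
        # NaN != NaN, so equality cannot be used for the default indicator
        return [isinstance(v, float) and math.isnan(v) for v in values]
    return [v == nodata for v in values]

data = [1.0, float("nan"), -9999.0]
print(flag_missing(data))                  # [False, True, False]
print(flag_missing(data, nodata=-9999.0))  # [False, False, True]
```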

clear

clear()

Remove all previously set flags.

force

force(flag)

parameter	data type	default value	description
flag	float/GOOD/BAD/UNFLAGGED	GOOD	flag to force

Force flags to the given flag value.

Spike Detection

spikes_basic

spikes_basic(thresh, tolerance, window_size)

parameter	data type	default value	description
thresh	float		Minimum jump margin for spikes. See condition (1).
tolerance	float		Maximum deviation from the initial value level within which a value counts as a "valid return value". See condition (2).
window_size	string		An offset string, denoting the maximum length of "spikish" value courses. See condition (3).

A basic outlier test that is designed to work for harmonized as well as raw (not harmonized) data.

The values x_{n}, x_{n+1}, ..., x_{n+k} of a passed timeseries x are considered spikes if:

1. |x_{n-1} - x_{n+s}| > thresh, s \in \{0,1,2,...,k\}
2. |x_{n-1} - x_{n+k+1}| < tolerance
3. |y_{n-1} - y_{n+k+1}| < window_size, with y denoting the series of timestamps associated with x.

By this definition, spikes are values that, after a jump of more than thresh (1), stay at the new value level for a timespan shorter than window_size (3) and then return to the initial value level, within a margin of tolerance (2).

Note that this characterization of a "spike" includes not only one-value outliers, but also plateau-like value courses.
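A naive sketch of the three conditions, written for readability rather than speed. The helper `flag_basic_spikes` is hypothetical, and never using an already flagged value as the reference level x_{n-1} is an assumption made here to keep the example well behaved.

```python
from datetime import datetime, timedelta

def flag_basic_spikes(times, values, thresh, tolerance, window_size):
    """Flag x_n .. x_{n+k} whenever conditions (1) to (3) hold for some k."""
    flags = [False] * len(values)
    for n in range(1, len(values)):
        if flags[n - 1]:
            continue  # assumption: a spike value is never the reference level
        base = values[n - 1]
        # r plays the role of n + k + 1, the index of the candidate return value
        for r in range(n + 1, len(values)):
            # (3) the return has to happen within window_size
            if times[r] - times[n - 1] >= window_size:
                break
            # (1) every value before the return must deviate by more than thresh
            if abs(base - values[r - 1]) <= thresh:
                break
            # (2) the return value is back within tolerance of the initial level
            if abs(base - values[r]) < tolerance:
                for i in range(n, r):
                    flags[i] = True
                break
    return flags

start = datetime(2020, 1, 1)
times = [start + timedelta(minutes=10 * i) for i in range(6)]
values = [1.0, 1.1, 10.0, 10.2, 1.0, 1.1]  # a two-value plateau at index 2 and 3

spikes = flag_basic_spikes(times, values, thresh=5.0, tolerance=1.0,
                           window_size=timedelta(hours=1))
print(spikes)  # [False, False, True, True, False, False]
```

With a window_size of 25 minutes the plateau lasts too long to satisfy condition (3), and nothing is flagged.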

spikes_simpleMad

Flags outliers using a simple median absolute deviation (MAD) test.

spikes_simpleMad(winsz="1h", z=3.5)

parameter	data type	default value	description
winsz	offset-string or int	`"1h"`	size of the sliding window to which the modified Z-score is applied
z	float	`3.5`	z-parameter of the modified Z-score

The modified Z-score [1] is used to detect outliers. A value is flagged as an outlier if, in any position of the sliding window, it fulfills:

 0.6745 * |x - M| > mad * z > 0

Here x denotes the window data, M the window median, mad the window median absolute deviation and z the parameter z. The window is moved by one frequency step at a time.

Note: This function should only be applied on normalized data.
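The test can be sketched for an integer window size as follows. The helper `flag_simple_mad` is hypothetical, it covers only the case where winsz is an integer number of values, and skipping windows with a MAD of zero follows from the `mad * z > 0` part of the condition under the assumption that z is positive.

```python
from statistics import median

def flag_simple_mad(values, winsz=5, z=3.5):
    """Flag values whose modified Z-score exceeds z in any sliding window."""
    flags = [False] * len(values)
    # Move the window forward by one step at a time
    for start in range(len(values) - winsz + 1):
        window = values[start:start + winsz]
        m = median(window)
        mad = median([abs(x - m) for x in window])
        if mad == 0:
            continue  # the condition requires mad * z > 0
        for offset, x in enumerate(window):
            if 0.6745 * abs(x - m) > mad * z:
                flags[start + offset] = True
    return flags

series = [1.0, 1.2, 0.9, 1.1, 8.0, 1.0, 1.3, 0.8]
outliers = flag_simple_mad(series, winsz=5, z=3.5)
print(outliers)  # [False, False, False, False, True, False, False, False]
```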

See also: [1] https://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm