
Implemented QC functions

Main documentation of the implemented functions: their purpose, their parameters, and a description of each.

Index

Miscellaneous

range

range(min, max)
| parameter | data type | default value | description |
| --------- | --------- | ------------- | ----------- |
| min | float | | lower bound for valid values |
| max | float | | upper bound for valid values |

The function flags all values that lie outside the closed interval [min, max].
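The check can be illustrated with a minimal pandas sketch. It is not the library's implementation: the helper name `flag_range` and its parameter names `lower` and `upper` are hypothetical, and a boolean series stands in for the flags.

```python
import pandas as pd

def flag_range(data: pd.Series, lower: float, upper: float) -> pd.Series:
    """Return True for every value outside the closed interval [lower, upper]."""
    # Values equal to a bound are valid, because the interval is closed.
    return (data < lower) | (data > upper)

values = pd.Series([-5.0, 0.0, 12.5, 40.0, 41.0])
range_flags = flag_range(values, lower=0.0, upper=40.0)
print(range_flags.tolist())  # [True, False, False, False, True]
```

Missing values compare as neither smaller nor larger than a bound, so this sketch leaves them unflagged.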

seasonalRange

seasonalRange(min, max, startmonth=1, endmonth=12, startday=1, endday=31)

| parameter | data type | default value | description |
| --------- | --------- | ------------- | ----------- |
| min | float | | lower bound for valid values |
| max | float | | upper bound for valid values |
| startmonth | integer | 1 | interval start month |
| endmonth | integer | 12 | interval end month |
| startday | integer | 1 | interval start day |
| endday | integer | 31 | interval end day |

The function does the same as range (it flags all data outside the interval [min, max]), but only if the timestamp of the data point lies in a time interval defined by day and month. The year is ignored when evaluating the interval. The left interval boundary is defined by startmonth and startday, the right by endmonth and endday. Both boundaries are inclusive. If the left boundary occurs later in the year than the right boundary, the interval extends over the change of year (e.g. an interval of [01/12, 01/03] will flag values in December, January and February, and, since the boundaries are inclusive, on 1 March).

Note: Only works for datetime-indexed data.
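The year-independent interval, including the wrap over the change of year, can be sketched as follows. This is an illustration under assumptions, not the library's code: the helpers `in_season` and `flag_seasonal_range` and the parameter names `lower` and `upper` are hypothetical.

```python
import pandas as pd

def in_season(index, startmonth=1, endmonth=12, startday=1, endday=31):
    """Boolean mask of timestamps inside the (inclusive) month/day interval."""
    # Encode month and day as one comparable number, e.g. 1 December -> 1201.
    month_day = index.month * 100 + index.day
    start = startmonth * 100 + startday
    end = endmonth * 100 + endday
    if start <= end:
        return (month_day >= start) & (month_day <= end)
    # The start lies later in the year than the end: wrap over the new year.
    return (month_day >= start) | (month_day <= end)

def flag_seasonal_range(data, lower, upper, **season):
    """Range check that is only applied inside the seasonal interval."""
    outside = (data < lower) | (data > upper)
    return outside & in_season(data.index, **season)

stamps = pd.to_datetime(
    ["2020-11-30", "2020-12-01", "2021-01-15", "2021-03-01", "2021-03-02"]
)
series = pd.Series([100.0] * 5, index=stamps)
seasonal_flags = flag_seasonal_range(
    series, lower=0.0, upper=10.0,
    startmonth=12, startday=1, endmonth=3, endday=1,
)
print(seasonal_flags.tolist())  # [False, True, True, True, False]
```

All five values violate the range, but only those between 1 December and 1 March (both inclusive) are flagged.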

isolated

isolated(window, group_size=1, continuation_range='1min')

| parameter | data type | default value | description |
| --------- | --------- | ------------- | ----------- |
| window | offset string | | The range around a value group within which no valid values are allowed for the group to be flagged as isolated. See conditions (1) and (2). |
| group_size | integer | 1 | The upper bound for the size of a value group to be considered an isolated group. See condition (3). |
| continuation_range | offset string | "1min" | The upper bound for the temporal extension of a value group to be considered an isolated group. See condition (4). Only relevant if group_size > 1. |

The function flags isolated values and value groups. A value or value group is isolated if, within a range of window, it is surrounded only by already flagged or missing values.

By default, the function flags isolated single values only. The parameters also allow more complex isolation definitions, including groups of isolated values.

A continuous group of timeseries values x_{k}, x_{k+1}, ..., x_{k+n} is considered to be "isolated" if:

  1. There are no values preceding x_{k} within window, or all the preceding values within this range are flagged.
  2. There are no values succeeding x_{k+n} within window, or all the succeeding values within this range are flagged.
  3. n \leq group_size
  4. |y_{k} - y_{k+n}| < continuation_range, with y denoting the series of timestamps associated with x.
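The four conditions can be illustrated with a simplified pandas sketch. It rests on assumptions that the text above does not confirm: the helper `flag_isolated` is hypothetical, groups are formed by splitting the valid values wherever the gap to the previous valid value exceeds window, condition (3) is read as "the group holds at most group_size values", and timestamps are unique.

```python
import pandas as pd

def flag_isolated(data, flagged, window, group_size=1, continuation_range="1min"):
    """Return True for values that belong to an isolated group (sketch)."""
    window = pd.Timedelta(window)
    max_span = pd.Timedelta(continuation_range)
    # Only values that are neither missing nor already flagged count as valid.
    valid = (data.notna() & ~flagged).to_numpy()
    stamps = data.index[valid].to_series()
    result = pd.Series(False, index=data.index)
    if stamps.empty:
        return result
    # Conditions (1) and (2): a new group starts wherever the previous valid
    # value is further away than `window`.
    group_id = (stamps.diff() > window).cumsum()
    for _, group in stamps.groupby(group_id):
        small = len(group) <= group_size                          # condition (3)
        short = (group.iloc[-1] - group.iloc[0]) < max_span       # condition (4)
        if small and short:
            result.loc[group.index] = True
    return result

idx = pd.to_datetime([
    "2021-01-01 00:00", "2021-01-01 00:10", "2021-01-01 00:20",
    "2021-01-01 01:30", "2021-01-01 03:00", "2021-01-01 03:10",
])
obs = pd.Series([1.0, 1.1, 1.2, 5.0, 1.0, 1.1], index=idx)
no_flags = pd.Series(False, index=idx)
isolated_flags = flag_isolated(obs, no_flags, window="30min")
print(isolated_flags.tolist())  # [False, False, False, True, False, False]
```

With the default group_size of 1, only the single value at 01:30 is flagged. Calling the sketch with group_size=2 and continuation_range="15min" would also flag the two-value group at 03:00 and 03:10.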

missing

missing(nodata=NaN)

| parameter | data type | default value | description |
| --------- | --------- | ------------- | ----------- |
| nodata | any | NaN | Value indicating missing values in the passed data. |

The function flags those values in the passed data series that are associated with "missing" data. The missing data indicator (default: NaN) can be changed to any other value by passing that value to the parameter nodata.
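A minimal sketch of the idea, assuming a pandas series and a hypothetical helper `flag_missing`. NaN needs special treatment because it never compares equal to itself.

```python
import numpy as np
import pandas as pd

def flag_missing(data: pd.Series, nodata=np.nan) -> pd.Series:
    """Return True where a value equals the missing-data indicator."""
    if pd.isna(nodata):
        # NaN != NaN, so an equality test would never match.
        return data.isna()
    return data == nodata

readings = pd.Series([1.0, np.nan, -9999.0, 3.0])
default_flags = flag_missing(readings)
custom_flags = flag_missing(readings, nodata=-9999.0)
print(default_flags.tolist())  # [False, True, False, False]
print(custom_flags.tolist())   # [False, False, True, False]
```

In this sketch a custom indicator replaces the NaN check rather than extending it. Whether the library also treats NaN as missing in that case is not stated above.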

clear

clear()

Remove all previously set flags.

force

force(flag)
| parameter | data type | default value | description |
| --------- | --------- | ------------- | ----------- |
| flag | float/GOOD/BAD/UNFLAGGED | GOOD | flag to force |

Force flags to the given flag value.

Spike Detection

spikes_basic

spikes_basic(thresh, tolerance, window_size)
| parameter | data type | default value | description |
| --------- | --------- | ------------- | ----------- |
| thresh | float | | Minimum jump margin for spikes. See condition (1). |
| tolerance | float | | Width of the range containing all "valid return values". See condition (2). |
| window_size | string | | An offset string denoting the maximum length of "spikish" value courses. See condition (3). |

A basic outlier test that is designed to work for harmonized as well as raw (not harmonized) data.

The values x_{n}, x_{n+1}, ..., x_{n+k} of a passed timeseries x are considered spikes if:

  1. |x_{n-1} - x_{n+s}| > thresh, s \in \{0,1,2,...,k\}

  2. |x_{n-1} - x_{n+k+1}| < tolerance

  3. |y_{n-1} - y_{n+k+1}| < window_size, with y denoting the series of timestamps associated with x.

By this definition, spikes are values that, after a jump of margin thresh (1), keep the new value level for a timespan smaller than window_size (3) and then return to the initial value level, within a margin of tolerance (2).

Note that this characterization of a "spike" includes not only one-value outliers but also plateau-like value courses.
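The three conditions can be checked directly with a brute-force scan. The sketch below is an illustration, not the library's implementation: the helper `flag_spikes_basic` is hypothetical, and the data are assumed to contain no missing values and to have increasing timestamps.

```python
import pandas as pd

def flag_spikes_basic(data, thresh, tolerance, window_size):
    """Return True for values x_n..x_{n+k} that satisfy conditions (1)-(3)."""
    window = pd.Timedelta(window_size)
    x = data.to_numpy()
    y = data.index
    flags = pd.Series(False, index=data.index)
    for start in range(1, len(x) - 1):          # start = n, x[start - 1] = x_{n-1}
        base = x[start - 1]
        for end in range(start, len(x) - 1):    # end = n + k
            # (1) every candidate value must be more than thresh away from the base
            if abs(base - x[end]) <= thresh:
                break
            # (3) the return point must lie within window_size of the base point
            if y[end + 1] - y[start - 1] >= window:
                break
            # (2) the value after the candidate returns to the base level
            if abs(base - x[end + 1]) < tolerance:
                flags.iloc[start:end + 1] = True
                break
    return flags

times = pd.date_range("2021-01-01", periods=7, freq="10min")
course = pd.Series([10.0, 10.2, 25.0, 26.0, 10.1, 10.0, 10.3], index=times)
spike_flags = flag_spikes_basic(course, thresh=5, tolerance=1, window_size="1h")
print(spike_flags.tolist())  # [False, False, True, True, False, False, False]
```

The two-value plateau at 25.0 and 26.0 is flagged as a whole, which matches the note that plateau-like courses count as spikes.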

spikes_simpleMad

Flag outliers by a simple median absolute deviation test.

spikes_simpleMad(winsz="1h", z=3.5)
| parameter | data type | default value | description |
| --------- | --------- | ------------- | ----------- |
| winsz | offset-string or int | "1h" | size of the sliding window to which the modified Z-score is applied |
| z | float | 3.5 | z-parameter of the modified Z-score |

The modified Z-score [1] is used to detect outliers. A value is flagged as an outlier if, in any slice of the sliding window, it fulfills:

 0.6745 * |x - M| > mad * z > 0

where x is the window data, M the window median, mad the window median absolute deviation, and z the parameter z. The window is moved by one frequency step.

Note: This function should only be applied on normalized data.
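The sliding-window test can be sketched with NumPy as follows. This is a simplified illustration under assumptions: the helper `flag_simple_mad` is hypothetical, the window size is given as a number of values (the integer form of winsz) instead of an offset string, and the data contain no missing values.

```python
import numpy as np
import pandas as pd

def flag_simple_mad(data: pd.Series, winsz: int = 5, z: float = 3.5) -> pd.Series:
    """Flag values that fulfil 0.6745 * |x - M| > mad * z > 0 in any window."""
    x = data.to_numpy(dtype=float)
    flags = np.zeros(len(x), dtype=bool)
    # Move a window of `winsz` values over the data, one step at a time.
    for start in range(len(x) - winsz + 1):
        window = x[start:start + winsz]
        median = np.median(window)
        mad = np.median(np.abs(window - median))
        hits = (0.6745 * np.abs(window - median) > mad * z) & (mad * z > 0)
        # A value stays flagged once any window has marked it.
        flags[start:start + winsz] |= hits
    return pd.Series(flags, index=data.index)

signal = pd.Series([1.0, 1.1, 0.9, 1.2, 8.0, 1.0, 0.8, 1.1, 1.0])
mad_flags = flag_simple_mad(signal, winsz=5, z=3.5)
print(mad_flags.tolist())
# [False, False, False, False, True, False, False, False, False]
```

The condition mad * z > 0 keeps windows with a zero median absolute deviation (for example, constant data) from flagging anything.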

See also: [1] https://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm