Peter Lünenschloß authored d0cdc73a
Implemented QC functions
range
Signature
range(min, max)
Parameters
parameter | data type | default value | description |
---|---|---|---|
min | float | | |
max | float | | |
Description
missing
Signature
missing(nodata=NaN)
Parameters
parameter | data type | default value | description |
---|---|---|---|
nodata | any | NaN | Value indicating missing values in the passed data. |
Description
The function flags those values in the passed data series that are associated with "missing" data. The missing data indicator (default: `NaN`) can be altered to any other value by passing this new value to the parameter `nodata`.
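The behaviour described above can be illustrated with a minimal sketch in plain Python. The helper name `flag_missing` and the list-based input are assumptions made for illustration only; they are not the library's actual interface.

```python
import math

def flag_missing(values, nodata=float("nan")):
    """Return one boolean per value, True where the value equals the nodata marker."""
    def is_nan(v):
        return isinstance(v, float) and math.isnan(v)

    if is_nan(nodata):
        # NaN never compares equal to itself, so it needs an explicit check
        return [is_nan(v) for v in values]
    return [v == nodata for v in values]

series = [1.2, float("nan"), 3.4, -9999.0]
print(flag_missing(series))                  # default indicator: NaN
print(flag_missing(series, nodata=-9999.0))  # custom indicator
```

With the default indicator only the `NaN` entry is marked; passing `nodata=-9999.0` marks only the last entry instead.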
seasonalRange
Signature
seasonalRange(min, max, startmonth=1, endmonth=12, startday=1, endday=31)
Parameters
parameter | data type | default value | description |
---|---|---|---|
min | float | | |
max | float | | |
startmonth | integer | 1 | |
endmonth | integer | 12 | |
startday | integer | 1 | |
endday | integer | 31 | |
Description
clear
Signature
clear()
Parameters
parameter | data type | default value | description |
---|---|---|---|
Description
Remove all previously set flags.
force
Signature
force()
Parameters
parameter | data type | default value | description |
---|---|---|---|
Description
sliding_outlier
Signature
sliding_outlier(winsz="1h", dx="1h", count=1, deg=1, z=3.5, method="modZ")
Parameters
parameter | data type | default value | description |
---|---|---|---|
winsz | string | "1h" | |
dx | string | "1h" | |
count | integer | 1 | |
deg | integer | 1 | |
z | float | 3.5 | |
method | string | "modZ" | |
Description
mad
Signature
mad(length, z=3.5, freq=None)
Parameters
parameter | data type | default value | description |
---|---|---|---|
length | | | |
z | float | 3.5 | |
freq | | None | |
Description
Spikes_Basic
Signature
Spikes_Basic(thresh, tolerance, window_size)
Parameters
parameter | data type | default value | description |
---|---|---|---|
thresh | float | | Minimum jump margin for spikes. See condition (1). |
tolerance | float | | Range of the area containing all "valid return values". See condition (2). |
window_size | string | | An offset string denoting the maximal length of "spikish" value courses. See condition (3). |
Description
A basic outlier test that is designed to work for harmonized as well as raw (not harmonized) data.

The values $x_n, x_{n+1}, \dots, x_{n+k}$ of a time series $x$ are considered spikes if:

1. $|x_{n-1} - x_{n+s}| >$ `thresh`, for all $s \in \{0, 1, \dots, k\}$
2. $|x_{n-1} - x_{n+k+1}| <$ `tolerance`
3. $|y_{n-1} - y_{n+k+1}| <$ `window_size`, with $y$ denoting the series of timestamps associated with $x$.

By this definition, spikes are values that, after a jump of margin `thresh` (1), keep the new value level they jumped to for a timespan shorter than `window_size` (3), and then return to the initial value level within a tolerance margin of `tolerance` (2).

Note that this characterization of a "spike" includes not only one-value outliers but also plateau-like value courses.

The implementation is a time-window-based version of an outlier test from the UFZ Python library.
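The three conditions can be sketched in plain Python as follows. This is an illustrative sketch, not the library's implementation: the function name `flag_spikes_basic` is hypothetical, timestamps are `datetime` objects, and `window_size` is passed as a `timedelta` instead of an offset string.

```python
from datetime import datetime, timedelta

def flag_spikes_basic(times, values, thresh, tolerance, window_size):
    """Flag runs of values that jump away from the preceding value and return soon."""
    flags = [False] * len(values)
    n = 1
    while n < len(values):
        ref = values[n - 1]
        end = n
        # condition (1): collect the run of values that jumped away from ref
        while end < len(values) and abs(ref - values[end]) > thresh:
            end += 1
        if n < end < len(values):
            returned = abs(ref - values[end]) < tolerance     # condition (2)
            short = times[end] - times[n - 1] < window_size   # condition (3)
            if returned and short:
                for i in range(n, end):
                    flags[i] = True
                n = end + 1  # continue with the returned value as reference
                continue
        n += 1
    return flags

start = datetime(2020, 1, 1)
times = [start + timedelta(minutes=10 * i) for i in range(10)]
values = [10.0, 10.1, 25.0, 25.2, 10.2, 10.1, 30.0, 30.0, 30.0, 30.0]
print(flag_spikes_basic(times, values, thresh=5, tolerance=1,
                        window_size=timedelta(hours=1)))
```

In this example the two-value plateau at 25 is flagged because the series returns to its initial level within the hour, while the final shift to 30 is left unflagged because it never returns.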
Spikes_SpektrumBased
Signature
Spikes_SpektrumBased(raise_factor=0.15, dev_cont_factor=0.2,
noise_barrier=1, noise_window_size="12h", noise_statistic="CoVar",
smooth_poly_order=2, filter_window_size=None)
Parameters
parameter | data type | default value | description |
---|---|---|---|
raise_factor | float | 0.15 | Minimum change margin for a datapoint to become a candidate for a spike. See condition (1). |
dev_cont_factor | float | 0.2 | See condition (2). |
noise_barrier | float | 1 | Upper bound for the noisiness of the data surrounding potential spikes. See condition (3). |
noise_window_range | string | "12h" | Any offset string. Determines the range of the time window of the "surrounding" data of a potential spike. See condition (3). |
noise_statistic | string | "CoVar" | Operator to calculate the noisiness of the data surrounding a potential spike. Either "CoVar" (coefficient of variation) or "rvar" (relative variance). |
smooth_poly_order | integer | 2 | Order of the polynomial fit applied for smoothing. |
filter_window_size | Nonetype or string | None | Options: None or any offset string. Controls the range of the smoothing window applied with the Savitzky-Golay filter. If None is passed (default), the window size will be two times the sampling rate (thus covering 3 values). Only change this value if you know exactly what you are doing: broader window sizes caused unexpected results during the testing phase. |
Description
The function detects and flags spikes in the input data series by evaluating the time series' derivatives and applying some conditions to them.

NOTE that the data series to be flagged is supposed to be harmonized to an equidistant frequency grid.

A datapoint $x_k$ of a data series $x$ is considered a spike if:

1. The quotient to its preceding datapoint exceeds a certain bound:
    - $|\frac{x_k}{x_{k-1}}| > 1 +$ `raise_factor`, or
    - $|\frac{x_k}{x_{k-1}}| < 1 -$ `raise_factor`
2. The quotient of the data's second derivative $x''$ at the preceding and subsequent timestamps is close enough to 1:
    - $|\frac{x''_{k-1}}{x''_{k+1}}| > 1 -$ `dev_cont_factor`, and
    - $|\frac{x''_{k-1}}{x''_{k+1}}| < 1 +$ `dev_cont_factor`
3. The dataset $X_k$ surrounding $x_k$ within `noise_window_range` range, but excluding $x_k$, is not too noisy, where the noisiness is measured by `noise_statistic`:
    - `noise_statistic`$(X_k) <$ `noise_barrier`

NOTE that the derivative is calculated after applying a Savitzky-Golay filter to $x$.

This function is a generalization of the spectrum-based spike flagging mechanism as presented in:

Dorigo, W. et al.: Global Automated Quality Control of In Situ Soil Moisture Data from the International Soil Moisture Network. 2013. Vadose Zone J. doi:10.2136/vzj2012.0097.
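As a rough illustration of how the three conditions combine, the following sketch checks a single datapoint of an equidistant series. It rests on several assumptions: the helper names are hypothetical, the second derivative is approximated by raw second differences (no Savitzky-Golay smoothing), the noise window is given as a number of samples on either side (`half_window`) instead of an offset string, and the noise statistic is the coefficient of variation.

```python
import statistics

def second_diff(x, i):
    """Raw second difference, used here in place of a smoothed second derivative."""
    return x[i + 1] - 2 * x[i] + x[i - 1]

def is_spectrum_spike(x, k, raise_factor=0.15, dev_cont_factor=0.2,
                      noise_barrier=1.0, half_window=3):
    # second differences at k-1 and k+1 need x[k-2] .. x[k+2]
    if k < 2 or k > len(x) - 3 or x[k - 1] == 0:
        return False
    # condition (1): sufficient relative change to the preceding value
    q = abs(x[k] / x[k - 1])
    if not (q > 1 + raise_factor or q < 1 - raise_factor):
        return False
    # condition (2): second derivatives before and after are of similar size
    d_prev, d_next = second_diff(x, k - 1), second_diff(x, k + 1)
    if d_next == 0:
        return False
    r = abs(d_prev / d_next)
    if not (1 - dev_cont_factor < r < 1 + dev_cont_factor):
        return False
    # condition (3): the surrounding data, excluding x[k], is not too noisy
    lo, hi = max(0, k - half_window), min(len(x), k + half_window + 1)
    surrounding = x[lo:k] + x[k + 1:hi]
    if len(surrounding) < 2:
        return False
    mean = statistics.mean(surrounding)
    if mean == 0:
        return False
    return statistics.stdev(surrounding) / abs(mean) < noise_barrier

x = [0.30, 0.30, 0.30, 0.45, 0.30, 0.30, 0.30]
print([k for k in range(len(x)) if is_spectrum_spike(x, k)])
```

For this series only the isolated peak at index 3 passes all three checks; the return to the base level at index 4 fails condition (2).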
constant
Signature
constant(eps, length, thmin=None)
Parameters
parameter | data type | default value | description |
---|---|---|---|
eps | | | |
length | | | |
thmin | | None | |
Description
constants_varianceBased
Signature
constants_varianceBased(plateau_window_min="12h", plateau_var_limit=0.0005,
var_total_nans=Inf, var_consec_nans=Inf)
Parameters
parameter | data type | default value | description |
---|---|---|---|
plateau_window_min | string | "12h" | Options: any offset string. Minimum duration over which values have to be continuous to be plateau candidates. See condition (1). |
plateau_var_limit | float | 0.0005 | Upper bound that the variance of a group of values must not exceed for the group to be flagged as a plateau. See condition (2). |
var_total_nans | integer | Inf | Maximum number of NaN values allowed for a calculated variance to be valid. (Default skips the condition.) |
var_consec_nans | integer | Inf | Maximum number of consecutive NaN values allowed for a calculated variance to be valid. (Default skips the condition.) |
Description
The function flags plateaus, i.e. series of constant values. Any set of consecutive values $x_k, \dots, x_{k+n}$ of a time series $x$ is flagged if:

1. The values cover a duration of at least `plateau_window_min`.
2. $\sigma(x_k, \dots, x_{k+n}) <$ `plateau_var_limit`, with $\sigma$ denoting the variance.

NOTE that the data series to be flagged is supposed to be harmonized to an equidistant frequency grid.

NOTE that when `var_total_nans` or `var_consec_nans` is set to a value < `Inf`, plateaus whose variance cannot be calculated due to missing values will never be flagged ("test not applicable" rule).
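A minimal sketch of the two plateau conditions on an equidistant series is given below. The function name `flag_plateaus` is hypothetical, the minimum duration is expressed as a number of samples (`min_periods`, an assumed stand-in for `plateau_window_min` divided by the sampling interval), the population variance is used, and NaN handling is left out.

```python
import statistics

def flag_plateaus(values, min_periods, var_limit=0.0005):
    """Flag every value that lies in a window of min_periods low-variance values."""
    flags = [False] * len(values)
    for start in range(len(values) - min_periods + 1):
        window = values[start:start + min_periods]
        # condition (2): variance of the window stays below the limit
        if statistics.pvariance(window) < var_limit:
            # condition (1) holds by construction: the window spans min_periods values
            for i in range(start, start + min_periods):
                flags[i] = True
    return flags

values = [0.20, 0.25, 0.31, 0.31, 0.31, 0.31, 0.40, 0.52]
print(flag_plateaus(values, min_periods=4))
```

Longer plateaus are covered by overlapping windows, so every member of a sufficiently long constant run ends up flagged.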
soilMoisture_plateaus
Signature
soilMoisture_plateaus(plateau_window_min="12h", plateau_var_limit=0.0005,
rainfall_window_range="12h", var_total_nans=np.inf,
var_consec_nans=np.inf, derivative_max_lb=0.0025,
derivative_min_ub=0, data_max_tolerance=0.95,
filter_window_size=None, smooth_poly_order=2, **kwargs)
Parameters
parameter | data type | default value | description |
---|---|---|---|
plateau_window_min | string | "12h" | Options: any offset string. Minimum duration over which values have to be continuous to be plateau candidates. See condition (1). |
plateau_var_limit | float | 0.0005 | Upper bound that the variance of a group of values must not exceed for the group to be flagged as a plateau. See condition (2). |
rainfall_range | string | "12h" | An offset string. See conditions (3) and (4). |
var_total_nans | int or 'inf' | np.inf | Maximum number of NaN values allowed for a calculated variance to be valid. (Default skips the condition.) |
var_consec_nans | int or 'inf' | np.inf | Maximum number of consecutive NaN values allowed for a calculated variance to be valid. (Default skips the condition.) |
derivative_max_lb | float | 0.0025 | Lower bound for the second derivative's maximum in the rainfall_range range. See condition (3). |
derivative_min_ub | float | 0 | Upper bound for the second derivative's minimum in the rainfall_range range. See condition (4). |
data_max_tolerance | float | 0.95 | Factor for the data max barrier of condition (5). |
filter_window_size | Nonetype or string | None | Options: None or any offset string. Controls the range of the smoothing window applied with the Savitzky-Golay filter. If None is passed (default), the window size will be two times the sampling rate (thus covering 3 values). Only change this value if you know exactly what you are doing: broader window sizes caused unexpected results during the testing phase. |
smooth_poly_order | int | 2 | Order of the polynomial used for fitting while smoothing. |
Description
NOTE that the data series to be flagged is supposed to be harmonized to an equidistant frequency grid.

The function represents a stricter version of the `constants_varianceBased` test from the constants detection library. The added constraints for values to be flagged, (3)-(5), are designed to match the special case of constant value courses of soil moisture measurements. They check whether the derivative is determined by preceding rainfall events ((3) and (4)) and whether the plateau is sufficiently high in value (5).

Any set of consecutive values $x_k, \dots, x_{k+n}$ of a time series $x$ is flagged if:

1. The values cover a duration of at least `plateau_window_min`.
2. Their variance is below `plateau_var_limit`.
3. The maximum of the second derivative is not lower than `derivative_max_lb`, evaluated over $s$ values, with $s$ denoting periods per `rainfall_range`.
4. The minimum of the second derivative is not higher than `derivative_min_ub`, evaluated over $s$ values, with $s$ denoting periods per `rainfall_range`.
5. The plateau is sufficiently high in value relative to the maximum of the data, scaled by `data_max_tolerance`.

This function is an implementation of the plateau flagging for soil moisture data, as presented in:

Dorigo, W. et al.: Global Automated Quality Control of In Situ Soil Moisture Data from the International Soil Moisture Network. 2013. Vadose Zone J. doi:10.2136/vzj2012.0097.

All parameters default to the values suggested in this publication.
SoilMoistureSpikes
Signature
SoilMoistureSpikes(filter_window_size="3h", raise_factor=0.15, dev_cont_factor=0.2,
noise_barrier=1, noise_window_size="12h", noise_statistic="CoVar")
Parameters
parameter | data type | default value | description |
---|---|---|---|
filter_window_size | string | "3h" | |
raise_factor | float | 0.15 | |
dev_cont_factor | float | 0.2 | |
noise_barrier | integer | 1 | |
noise_window_size | string | "12h" | |
noise_statistic | string | "CoVar" | |
Description
The function is just a wrapper around `flagSpikes_spektrumBased` from the spike detection library. It calls this function with the parameter set suggested in:

Dorigo, W. et al.: Global Automated Quality Control of In Situ Soil Moisture Data from the International Soil Moisture Network. 2013. Vadose Zone J. doi:10.2136/vzj2012.0097.
SoilMoistureBreaks
Signature
SoilMoistureBreaks(diff_method="raw", filter_window_size="3h",
rel_change_rate_min=0.1, abs_change_min=0.01, first_der_factor=10,
first_der_window_size="12h", scnd_der_ratio_margin_1=0.05,
scnd_der_ratio_margin_2=10, smooth_poly_order=2)
Parameters
parameter | data type | default value | description |
---|---|---|---|
diff_method | string | "raw" | |
filter_window_size | string | "3h" | |
rel_change_rate_min | float | 0.1 | |
abs_change_min | float | 0.01 | |
first_der_factor | integer | 10 | |
first_der_window_size | string | "12h" | |
scnd_der_ratio_margin_1 | float | 0.05 | |
scnd_der_ratio_margin_2 | float | 10.0 | |
smooth_poly_order | integer | 2 | |
Description
The function is just a wrapper around `flagBreaks_spektrumBased` from the breaks detection library. It calls this function with the parameter set suggested in:

Dorigo, W. et al.: Global Automated Quality Control of In Situ Soil Moisture Data from the International Soil Moisture Network. 2013. Vadose Zone J. doi:10.2136/vzj2012.0097.
SoilMoistureByFrost
Signature
SoilMoistureByFrost(soil_temp_reference, tolerated_deviation="1h", frost_level=0)
Parameters
parameter | data type | default value | description |
---|---|---|---|
soil_temp_reference | string | | A string denoting the name of the field in data that holds the series of soil temperature values the to-be-flagged values shall be checked against. |
tolerated_deviation | string | "1h" | An offset string denoting the maximal temporal deviation the soil frost state's timestamp is allowed to have relative to the data point to be flagged. |
frost_level | integer | 0 | Value level the flagger shall check against when evaluating the soil frost level. |
Description
The function flags soil moisture measurements by evaluating the soil frost level at the moment of measurement (+/- `tolerated_deviation`). Soil temperatures below `frost_level` are regarded as denoting a frozen soil state and result in the checked soil moisture value being flagged.

This function is an implementation of the soil temperature based soil moisture flagging, as presented in:

Dorigo, W. et al.: Global Automated Quality Control of In Situ Soil Moisture Data from the International Soil Moisture Network. 2013. Vadose Zone J. doi:10.2136/vzj2012.0097.

All parameters default to the values suggested in this publication.
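The frost check can be sketched in plain Python as follows. The function name `flag_by_frost`, the list-based inputs and the use of `timedelta` instead of an offset string are assumptions for illustration. The sketch also assumes that the temperature reading nearest in time is used and that a moisture value without any reference reading inside the tolerated deviation stays unflagged.

```python
from datetime import datetime, timedelta

def flag_by_frost(moisture_times, temp_times, temp_values,
                  tolerated_deviation=timedelta(hours=1), frost_level=0):
    """Flag moisture readings whose nearest soil temperature lies below frost_level."""
    flags = []
    for t in moisture_times:
        # temperature readings close enough in time to the moisture reading
        candidates = [(abs(tt - t), v) for tt, v in zip(temp_times, temp_values)
                      if abs(tt - t) <= tolerated_deviation]
        if not candidates:
            flags.append(False)  # no reference available: test not applicable
            continue
        _, nearest_temp = min(candidates, key=lambda c: c[0])
        flags.append(nearest_temp < frost_level)
    return flags

day = datetime(2020, 1, 1)
moisture_times = [day, day + timedelta(hours=6), day + timedelta(hours=12)]
temp_times = [day + timedelta(minutes=30),
              day + timedelta(hours=5, minutes=50),
              day + timedelta(hours=14)]
temp_values = [-1.5, 0.5, -3.0]
print(flag_by_frost(moisture_times, temp_times, temp_values))
```

Only the first moisture reading is flagged: the second has a reference above the frost level, and the third has no temperature reading within one hour.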
SoilMoistureByPrecipitation
Signature
SoilMoistureByPrecipitation(prec_reference, sensor_meas_depth=0,
sensor_accuracy=0, soil_porosity=0,
std_factor=2, std_factor_range="24h",
ignore_missing=False)
Parameters
parameter | data type | default value | description |
---|---|---|---|
prec_reference | string | | A string denoting the name of the field in data that holds the series of precipitation values the to-be-flagged values shall be checked against. |
sensor_meas_depth | integer | 0 | Depth of the soil moisture sensor in meters. |
sensor_accuracy | integer | 0 | Accuracy of the soil moisture sensor. |
soil_porosity | integer | 0 | Porosity of the soil surrounding the soil moisture sensor. |
std_factor | integer | 2 | See condition (2). |
std_factor_range | string | "24h" | See condition (2). |
ignore_missing | bool | False | If True, the variance of condition (2) will also be calculated if there is a value missing in the time window. Selecting False (default) results in values that succeed a time window containing a missing value never being flagged ("test not applicable" rule). |
Description
The function flags soil moisture measurements by flagging moisture rises that do not follow a sufficient precipitation event. If the measurement depth and accuracy of the soil moisture sensor and the porosity of the surrounding soil are passed to the function, a lower bound for the precipitation that has to precede a significant moisture rise within 24 hours can be estimated. If those values are not passed, this lower bound is set to zero. In that case, any non-zero precipitation count will justify any soil moisture rise.

Thus, a data point $x_k$ of a series with sampling interval $f$ is flagged as an invalid soil moisture rise if:

1. The value to be flagged has to signify a rise. This means, for the quotient $s =$ (`raise_reference` / $f$):
    - $x_k > x_{k-s}$
2. The rise must be sufficient, measured in terms of the standard deviation $V$ of the values in the preceding `std_factor_range` window. This means, with $h =$ `std_factor_range` / $f$:
    - $x_k - x_{k-s} >$ `std_factor` $\times V(x_{k-h}, \dots, x_k)$
3. Depending on some sensor specifications, a bound can be calculated that the rainfall has to exceed to justify the eventual soil moisture rise. For the series of precipitation measurements $y$ and the quotient $j =$ "24h" / $f$, this means:
    - $y_{k-j} + y_{k-j+1} + \dots + y_k \le$ `sensor_meas_depth` $\times$ `sensor_accuracy` $\times$ `soil_porosity`
This function is an implementation of the precipitation-based soil moisture flagging, as presented in:
Dorigo, W. et al.: Global Automated Quality Control of In Situ Soil Moisture Data from the International Soil Moisture Network. 2013. Vadose Zone J. doi:10.2136/vzj2012.0097.
All parameters default to the values suggested in this publication.
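The interplay of the three checks (rise, sufficient rise, insufficient rainfall) can be sketched as below for equidistant series. All names are hypothetical, the windows are given as sample counts (`raise_periods`, `std_periods`, `prec_periods`) instead of offset strings, the population standard deviation is used, and missing values are not handled. A rise is treated as unjustified when the summed precipitation does not exceed the bound, which matches the statement that with a zero bound any non-zero precipitation justifies a rise.

```python
import statistics

def flag_by_precipitation(moisture, precipitation, raise_periods, std_periods,
                          prec_periods, sensor_meas_depth=0, sensor_accuracy=0,
                          soil_porosity=0, std_factor=2):
    """Flag moisture rises that are not preceded by sufficient precipitation."""
    bound = sensor_meas_depth * sensor_accuracy * soil_porosity
    flags = [False] * len(moisture)
    first = max(raise_periods, std_periods, prec_periods)
    for k in range(first, len(moisture)):
        rise = moisture[k] - moisture[k - raise_periods]
        if rise <= 0:                                  # (1) no rise at all
            continue
        spread = statistics.pstdev(moisture[k - std_periods:k + 1])
        if rise <= std_factor * spread:                # (2) rise not sufficient
            continue
        rain = sum(precipitation[k - prec_periods:k + 1])
        if rain <= bound:                              # (3) rainfall does not exceed bound
            flags[k] = True
    return flags

moisture = [0.20, 0.20, 0.20, 0.20, 0.30, 0.30, 0.30, 0.30, 0.30, 0.40]
precipitation = [0, 0, 0, 0, 0, 0, 0, 0, 5.0, 0]
print(flag_by_precipitation(moisture, precipitation,
                            raise_periods=1, std_periods=4, prec_periods=2))
```

Here the rise at index 4 is flagged because no rain was recorded beforehand, whereas the rise at index 9 follows a precipitation event and stays unflagged.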
Breaks_SpektrumBased
Signature
Breaks_SpektrumBased(rel_change_min=0.1, abs_change_min=0.01, first_der_factor=10,
first_der_window_size="12h", scnd_der_ratio_margin_1=0.05,
scnd_der_ratio_margin_2=10, smooth_poly_order=2,
diff_method="raw", filter_window_size="3h")
Parameters
parameter | data type | default value | description |
---|---|---|---|
rel_change_rate_min | float | 0.1 | Lower bound for the relative difference a value must have to its preceding value to be a candidate for being break-flagged. See condition (2). |
abs_change_min | float | 0.01 | Lower bound for the absolute difference a value must have to its preceding value to be a candidate for being break-flagged. See condition (1). |
first_der_factor | float | 10 | Factor of the second derivative's "arithmetic mean bound". See condition (3). |
first_der_window_size | string | "12h" | Options: any offset string. Determines the size of the window covering all the values included in the arithmetic mean calculation of condition (3). |
scnd_der_ratio_margin_1 | float | 0.05 | Range of the area covering all the values of the second derivative's quotient that are regarded as "sufficiently close to 1" for signifying a break. See condition (4). |
scnd_der_ratio_margin_2 | float | 10.0 | Lower bound for the quotient of the second derivatives succeeding the break. See condition (5). |
smooth_poly_order | integer | 2 | When calculating derivatives from a smoothed time series (diff_method="savgol"), this value gives the order of the fitting polynomial calculated in the smoothing process. |
diff_method | string | "savgol" | Options: "savgol" or "raw". Select "raw" to skip smoothing before differentiation. |
filter_window_size | Nonetype or string | None | Options: None or any offset string. Controls the range of the smoothing window applied with the Savitzky-Golay filter. If None is passed (default), the window size will be two times the sampling rate (thus covering 3 values). Only change this value if you know exactly what you are doing: broader window sizes caused unexpected results during the testing phase. |
Description
The function flags breaks (jumps/drops) in the input measurement series by evaluating its derivatives.

NOTE that the data series to be flagged is supposed to be harmonized to an equidistant frequency grid.

NOTE that the derivatives are calculated after applying a Savitzky-Golay filter to $x$.

A value $x_k$ of a data series $x$ is flagged as a break if:

1. $x_k$ represents a sufficient absolute jump in the course of data values:
    - $|x_k - x_{k-1}| >$ `abs_change_min`
2. $x_k$ represents a sufficient relative jump in the course of data values:
    - $|\frac{x_k - x_{k-1}}{x_k}| >$ `rel_change_min`
3. Let $X_k$ be the set of all values that lie within a `first_der_window_size` range around $x_k$. Then, for its arithmetic mean $\bar{X_k}$, the following inequality has to hold:
    - $|x'_k| >$ `first_der_factor` $\times \bar{X_k}$
4. The quotient of the second derivatives is "sufficiently close to 1":
    - $1 -$ `scnd_der_ratio_margin_1` $< |\frac{x''_{k-1}}{x''_k}| < 1 +$ `scnd_der_ratio_margin_1`
5. The quotient of the succeeding second derivative values has to be sufficiently high:
    - $|\frac{x''_k}{x''_{k+1}}| >$ `scnd_der_ratio_margin_2`

This function is a generalization of the spectrum-based break flagging mechanism as presented in:

Dorigo, W. et al.: Global Automated Quality Control of In Situ Soil Moisture Data from the International Soil Moisture Network. 2013. Vadose Zone J. doi:10.2136/vzj2012.0097.
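The five conditions can be chained as in the sketch below, written for an equidistant series. It is a sketch under several assumptions and not the library's code: the function name `flag_breaks` is hypothetical, derivatives are raw differences (comparable to skipping the smoothing step), the window is given as `half_window` samples on either side, a zero denominator in a second-derivative quotient is treated as an infinitely large quotient, and condition (3) is evaluated against the mean absolute first difference inside the window, which is an interpretation made for this sketch.

```python
import math

def flag_breaks(x, rel_change_min=0.1, abs_change_min=0.01, first_der_factor=10,
                half_window=6, scnd_der_ratio_margin_1=0.05,
                scnd_der_ratio_margin_2=10):
    """Flag values that look like persistent jumps or drops in an equidistant series."""
    n = len(x)
    d1 = [0.0] + [x[i] - x[i - 1] for i in range(1, n)]
    d2 = [0.0] + [x[i + 1] - 2 * x[i] + x[i - 1] for i in range(1, n - 1)] + [0.0]

    def ratio(a, b):
        if b == 0:
            return math.inf if a != 0 else math.nan
        return abs(a / b)

    flags = [False] * n
    for k in range(1, n - 1):
        if abs(x[k] - x[k - 1]) <= abs_change_min:                       # (1)
            continue
        if x[k] == 0 or abs((x[k] - x[k - 1]) / x[k]) <= rel_change_min:  # (2)
            continue
        lo, hi = max(1, k - half_window), min(n, k + half_window + 1)
        mean_abs_d1 = sum(abs(v) for v in d1[lo:hi]) / (hi - lo)
        if abs(d1[k]) <= first_der_factor * mean_abs_d1:                  # (3)
            continue
        r1 = ratio(d2[k - 1], d2[k])
        if not (1 - scnd_der_ratio_margin_1 < r1 < 1 + scnd_der_ratio_margin_1):  # (4)
            continue
        if not ratio(d2[k], d2[k + 1]) > scnd_der_ratio_margin_2:         # (5)
            continue
        flags[k] = True
    return flags

x = [0.30] * 8 + [0.40] * 8
print([i for i, f in enumerate(flag_breaks(x)) if f])
```

In this example only the first value after the level shift (index 8) satisfies all five checks.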
machinelearning
Signature
machinelearning(references, window_values, window_flags, path)
Parameters
parameter | data type | default value | description |
---|---|---|---|
references | string or list of strings | | The field names of the data series that should be used as reference variables. |
window_values | integer | | Window size that is used to derive the gradients of both the field and reference series inside the moving window. |
window_flags | integer | | Window size that is used to count the surrounding automatic flags that have been set before. |
path | string | | Path to the respective model object, i.e. its name and the respective value of the grouping variable, e.g. "models/model_0.2.pkl". |
Description
This function uses pre-trained machine-learning model objects for flagging. This requires training a model with the training script provided. For flagging, the inputs to the model are the data of the variable of interest, the data of the reference variables and the automatic flags that were assigned by other tests inside SaQC. Internally, context information for each point is gathered in the form of moving windows. The sizes of the moving windows for counting the surrounding automatic flags and for calculating gradients in the data are specified by the user during model training. For the model to work, the parameters `references`, `window_values` and `window_flags` have to be set to the same values as during training. For a more detailed description of the modeling approach, see the training script.
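To give an idea of what "counting the surrounding automatic flags" in a moving window can look like, here is a small sketch in plain Python. The helper name and the choice of a centred window that excludes the point itself are assumptions; the actual feature construction is defined by the training script.

```python
def count_surrounding_flags(flags, window_flags):
    """For each position, count the flags set within window_flags steps on either side."""
    counts = []
    for i in range(len(flags)):
        lo = max(0, i - window_flags)
        hi = min(len(flags), i + window_flags + 1)
        # the point's own flag is excluded from its context
        counts.append(sum(flags[lo:hi]) - int(flags[i]))
    return counts

flags = [False, True, False, False, True, True, False]
print(count_surrounding_flags(flags, window_flags=1))
```

Such counts would form one context feature per data point, alongside gradients computed over a window of `window_values` samples.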