Generic Functions provide a way to leverage cross-variable conditions
and to implement simple quality checks directly within the configuration.
## Why?
The underlying idea is that in most real-world datasets many errors
can be explained by the dataset itself. Think of an active, fan-cooled
measurement device: no matter how precisely the instrument works, problems
are to be expected when the fan stops working or the battery voltage
drops below a certain threshold. While these dependencies are easy to
[formalize](#a-real-world-example) on a per-dataset basis, it is quite
challenging to translate them into general-purpose source code.
Generic functions are used in the same manner as their
[non-generic counterparts](docs/FunctionDescriptions.md). The basic
signature looks like this:
```sh
flagGeneric(func=<expression>, flag=<flagging_constant>)
```
where `<expression>` is composed of the [supported constructs](#supported-constructs)
and `<flagging_constant>` is one of the predefined
[flagging constants](docs/ParameterDescriptions.md#flagging-constants) (default: `BAD`).
## Examples
### Simple comparisons
#### Task
Flag all values of variable `x` when variable `y` falls below a certain threshold (here `0`)
#### Configuration file
| varname | test |
|---------|-------------------------|
| x | flagGeneric(func=y < 0) |
### Calculations
#### Task
Flag all values of variable `x` that exceed 3 standard deviations of variable `y`
#### Configuration file
| varname | test |
|---------|-------------------------------------|
| x | flagGeneric(func=this > std(y) * 3) |
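To make the arithmetic behind `this > std(y) * 3` concrete, here is a minimal plain-Python sketch. It is not SaQC code: the helper name `exceeds_std` and the sample values are made up for illustration, and it assumes the sample standard deviation, which may differ from what `std` computes internally.

```python
import statistics


def exceeds_std(x, y, factor=3.0):
    """Return one boolean per value of x: True where x > factor * std(y).

    Hypothetical helper for illustration only, not part of SaQC.
    """
    # Assumption: sample standard deviation (n - 1 in the denominator).
    threshold = statistics.stdev(y) * factor
    return [value > threshold for value in x]


x = [0.5, 2.0, 7.5, 1.0]
y = [1.0, 2.0, 3.0, 4.0, 5.0]

mask = exceeds_std(x, y)
print(mask)  # [False, False, True, False]
```

The threshold is a single number derived from all of `y`, while the comparison is applied to every value of `x` separately, which is why the result has the same length as `x`.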
### Special functions
#### Task
Flag variable `x` where variable `y` is flagged and variable `z` has missing values
#### Configuration file
| varname | test                                          |
|---------|-----------------------------------------------|
| x       | flagGeneric(func=isflagged(y) & ismissing(z)) |
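The following plain-Python sketch illustrates how a flag check and a missing-value check can be combined element by element. It is only an illustration under stated assumptions: `is_flagged` and `is_missing` are hypothetical stand-ins for the special functions, flags are modelled as `None` (unflagged) or a string such as `"BAD"`, and missing values are modelled as `NaN` or `None`.

```python
import math


def is_missing(values):
    """Hypothetical stand-in: True where a value is None or NaN."""
    return [v is None or math.isnan(v) for v in values]


def is_flagged(flags):
    """Hypothetical stand-in: True where any flag is set (assumed: not None)."""
    return [f is not None for f in flags]


y_flags = [None, "BAD", "BAD", None]
z = [1.0, float("nan"), 2.5, float("nan")]

# Element-wise "and": flag only where y is flagged AND z is missing.
mask = [a and b for a, b in zip(is_flagged(y_flags), is_missing(z))]
print(mask)  # [False, True, False, False]
```

Only the second position satisfies both conditions, so only that value of the target variable would be flagged.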
### A real world example
Let's consider a dataset like the following:
| date | meas | fan | volt |
|------------------|------|-----|------|
| 2018-06-01 12:00 | 3.56 | 1 | 12.1 |
| 2018-06-01 12:10 | 4.7 | 0 | 12.0 |
| 2018-06-01 12:20 | 0.1 | 1 | 11.5 |
| 2018-06-01 12:30 | 3.62 | 1 | 12.1 |
| ... | | | |
#### Task
Flag variable `meas` where variable `fan` equals 0 or variable `volt`
is lower than `12.0`.
#### Configuration file
We can directly implement the condition as follows:
| varname | test                                          |
|---------|-----------------------------------------------|
| meas    | flagGeneric(func=(fan == 0) \| (volt < 12.0)) |
But we could also quality check our independent variables first
and then leverage this information later on:
| varname | test |
|---------|---------------------------------------------------------|
| * | missing() |
| fan | flagGeneric(func=this == 0) |
| volt | flagGeneric(func=this < 12.0) |
| meas | flagGeneric(func=isflagged(fan) \| isflagged(volt)) |
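As a rough cross-check of the configuration above, the same logic can be sketched in plain Python on the four sample rows. This is not how SaQC evaluates the configuration; the function name `flag_meas` and the `min_volt` parameter are hypothetical and only serve the illustration.

```python
rows = [
    {"date": "2018-06-01 12:00", "meas": 3.56, "fan": 1, "volt": 12.1},
    {"date": "2018-06-01 12:10", "meas": 4.7, "fan": 0, "volt": 12.0},
    {"date": "2018-06-01 12:20", "meas": 0.1, "fan": 1, "volt": 11.5},
    {"date": "2018-06-01 12:30", "meas": 3.62, "fan": 1, "volt": 12.1},
]


def flag_meas(rows, min_volt=12.0):
    """Hypothetical helper: True where the fan is off or the voltage is too low."""
    return [row["fan"] == 0 or row["volt"] < min_volt for row in rows]


mask = flag_meas(rows)
flagged_dates = [row["date"] for row, bad in zip(rows, mask) if bad]
print(flagged_dates)  # ['2018-06-01 12:10', '2018-06-01 12:20']
```

The 12:10 reading is caught by the fan condition and the 12:20 reading by the voltage condition, so combining both checks with "or" flags exactly these two measurements.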
## Variable references
All variables of the processed dataset are available within generic functions,
so arbitrary cross references are possible. The variable of interest
is furthermore available with the special reference `this`, so the second
[example](#calculations) could be rewritten as:
| varname | test |
|---------|----------------------------------|
| x | flagGeneric(func=x > std(y) * 3) |
When referencing other variables, their flags will be respected during evaluation
of the generic expression. So, in the example above only previously
unflagged values of `x` and `y` are used within the expression `x > std(y)*3`.
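The effect of respecting flags can be sketched in plain Python. The snippet below is an illustration only: it assumes flags are stored as `None` (unflagged) or a string such as `"BAD"`, uses a hypothetical helper `unflagged`, and shows the masking for `y` alone.

```python
import statistics


def unflagged(values, flags):
    """Hypothetical helper: keep only values whose flag is None (assumed unflagged)."""
    return [v for v, f in zip(values, flags) if f is None]


y = [1.0, 2.0, 3.0, 100.0]
y_flags = [None, None, None, "BAD"]  # the outlier was flagged by an earlier test

std_all = statistics.stdev(y)                        # includes the flagged outlier
std_clean = statistics.stdev(unflagged(y, y_flags))  # uses 1.0, 2.0, 3.0 only

print(std_clean)            # 1.0
print(std_all > std_clean)  # True
```

Excluding the flagged outlier keeps the standard deviation, and therefore the threshold `std(y) * 3`, from being inflated by a value already known to be bad.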
## Supported constructs
### Operators
The following comparison operators are available:
| Operator | Description                                                                                           |
|----------|-------------------------------------------------------------------------------------------------------|
| `==`     | `True` if the values of the operands are equal                                                        |
| `!=`     | `True` if the values of the operands are not equal                                                    |
| `>`      | `True` if the values of the left operand are greater than the values of the right operand             |
| `<`      | `True` if the values of the left operand are smaller than the values of the right operand             |
| `>=`     | `True` if the values of the left operand are greater than or equal to the values of the right operand |
| `<=`     | `True` if the values of the left operand are smaller than or equal to the values of the right operand |
The following arithmetic operators are supported:
| Operator | Description |
|----------|----------------|
| `+` | addition |
| `*` | multiplication |
| `/` | division |
| `%` | modulus |
#### Bitwise
The bitwise operators also act as logical operators in comparison chains.
| Operator | Description |
|----------|-------------------|
| `&` | binary and |
| `\|` | binary or |
| `^` | binary xor |
| `~` | binary complement |
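When bitwise operators combine comparisons, parentheses around each comparison matter. The sketch below assumes that generic expressions follow Python's operator precedence (an assumption, not something this page states), where `|` and `&` bind more tightly than `==` or `<`. The function names are hypothetical, and integer voltages are used because `|` is not defined for Python floats.

```python
def with_parentheses(fan, volt):
    # Each comparison is evaluated first, then the two results are combined.
    return (fan == 0) | (volt < 12)


def without_parentheses(fan, volt):
    # Parsed as the chained comparison: fan == (0 | volt) < 12
    return fan == 0 | volt < 12


# Fan is running (1) but the voltage is too low (11): the value should be flagged.
print(with_parentheses(1, 11))     # True
print(without_parentheses(1, 11))  # False
```

Without parentheses the low-voltage case is silently missed, so writing `(fan == 0) | (volt < 12)` is the safer habit.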
### Functions
All functions expect a [variable reference](#variable-references)
as the only non-keyword argument (see the [special functions example](#special-functions)).
| Name | Description |
|-------------|-----------------------------------|
| `abs` | absolute values of a variable |
| `max` | maximum value of a variable |
| `min` | minimum value of a variable |
| `mean` | mean value of a variable |
| `sum` | sum of a variable |
| `std` | standard deviation of a variable |
| `len` | number of values of a variable |
| `ismissing` | check for missing values |
| `isflagged` | check for flags |
### Constants
Generic functions support the same constants as normal functions. A detailed
list is available [here](docs/ParameterDescriptions.md#constants).