Dependencies

  • numpy
  • pandas
  • numba
  • pyyaml

Test Syntax

Specification

  • Test specifications are written in YAML and contain:
    • A test name, either one of the predefined tests or 'generic'
    • Optionally, a set of parameters. These should be given in JSON-object or YAML/Python-dictionary style (i.e. {key: value})
    • The test name and the parameter object/dictionary need to be separated by a comma
  • Example: limits, {min: 0, max: 100}
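
A minimal sketch of how such a specification line could be split into a test name and its parameters, assuming PyYAML's yaml.safe_load is used for the dictionary part. The helper name parse_test_spec is hypothetical and not part of the project:

```python
import yaml


def parse_test_spec(spec):
    """Split 'name, {key: value}' into (name, parameter dict)."""
    # Only the first comma separates name and parameters;
    # further commas belong to the parameter dictionary.
    name, _, raw_params = spec.partition(",")
    params = yaml.safe_load(raw_params) if raw_params.strip() else None
    return name.strip(), params or {}


name, params = parse_test_spec("limits, {min: 0, max: 100}")
print(name, params)
```

A specification without parameters, such as a bare test name, yields an empty dictionary in this sketch.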

Optional Test Parameters

  • flag: The value to set (more precisely, the value to pass to the flagging component) if the test does not pass
  • flag_period:
    • if a value is flagged, so is the given time period following the timestamp of that value

    • Number followed by a frequency specification, e.g. '5min', '6D'. A comprehensive list of the supported frequencies can be found in the table 'Offset Aliases' in the Pandas Docs. The (probably) most common options are also listed below:

      frequency string   description
      D                  one day
      H                  one hour
      T or min           one minute
      S                  one second
  • flag_values:
    • Number
    • if a value is flagged, so are the next n previously unflagged values
  • assign:
    • boolean
    • Assign the test result to a new column
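
The effect of flag_period and flag_values can be illustrated with a small sketch. It is not the project's implementation: the helper names extend_by_period and extend_by_count are hypothetical, flags are simplified to a boolean Series, and the end of the time window is assumed to be inclusive:

```python
import pandas as pd


def extend_by_period(flagged: pd.Series, period: str) -> pd.Series:
    """Also flag every value within `period` after a flagged timestamp."""
    window = pd.Timedelta(period)  # accepts strings such as '5min' or '6D'
    out = flagged.copy()
    for ts in flagged.index[flagged.to_numpy()]:
        # Assumption: the end of the window is inclusive.
        in_window = (flagged.index >= ts) & (flagged.index <= ts + window)
        out.loc[in_window] = True
    return out


def extend_by_count(flagged: pd.Series, n: int) -> pd.Series:
    """Also flag the next n previously unflagged values after a flagged one."""
    values = flagged.to_numpy()
    out = values.copy()
    remaining = 0
    for i, is_flagged in enumerate(values):
        if is_flagged:
            remaining = n  # a flagged value restarts the countdown
        elif remaining > 0:
            out[i] = True
            remaining -= 1
    return pd.Series(out, index=flagged.index)


index = pd.date_range("2020-01-01", periods=6, freq="2min")
flagged = pd.Series([False, True, False, False, False, False], index=index)

by_period = extend_by_period(flagged, "5min")  # flags 00:02, 00:04 and 00:06
by_count = extend_by_count(flagged, 1)         # flags 00:02 and 00:04
```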

Predefined Tests

name   required parameters   optional parameters   description
mad    z, length             deriv = 1             mean absolute deviation with measure of central tendency z and a rolling window of size length. Optionally, the deriv-th derivative of the dataset is calculated first.

User Defined Test

User-defined tests allow you to specify simple quality checks directly within the configuration.

Specification

  • Test name: generic
  • The parameter 'func' followed by an expression needs to be given
  • Example: generic, {func: (thisvar > 0) & ismissing(othervar)}

Restrictions

  • only the operators and functions listed below are available
  • all checks need to be conditional expressions and have to return an array of boolean values; all other expressions are rejected. This limitation narrows the scope of the system, and with it the potential to mess things up. It might be removed in the future.

Syntax

  • standard Python syntax
  • all variables within the configuration file can be used

Supported Operators

  • all arithmetic operators
  • all comparison operators
  • bitwise operators: and, or, xor, complement (&, |, ^, ~)

Supported functions

function name   description
abs             absolute values of a variable
max             maximum value of a variable
min             minimum value of a variable
mean            mean value of a variable
sum             sum of a variable
std             standard deviation of a variable
len             the number of values of a variable
ismissing       check for missing values (nan and a possibly user-defined value)
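
The restrictions, operators and functions above could be enforced roughly as in the following sketch. It is an illustration only, not the project's parser: the names FUNCS, ALLOWED_NODES and evaluate_generic are hypothetical, the functions are assumed to map onto their NumPy counterparts, and ismissing only checks for nan here (no user-defined missing value):

```python
import ast

import numpy as np

# Assumed mapping of the supported function names onto NumPy.
FUNCS = {
    "abs": np.abs, "max": np.max, "min": np.min, "mean": np.mean,
    "sum": np.sum, "std": np.std, "len": len, "ismissing": np.isnan,
}

# Only arithmetic, comparison and bitwise expressions plus function calls.
ALLOWED_NODES = (
    ast.Expression, ast.BinOp, ast.UnaryOp, ast.Compare, ast.Call,
    ast.Name, ast.Load, ast.Constant, ast.operator, ast.unaryop, ast.cmpop,
)


def evaluate_generic(expr, variables):
    """Evaluate a restricted expression and return a boolean array."""
    tree = ast.parse(expr, mode="eval")
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            raise ValueError(f"unsupported syntax: {type(node).__name__}")
        if isinstance(node, ast.Call) and not (
            isinstance(node.func, ast.Name) and node.func.id in FUNCS
        ):
            raise ValueError("unsupported function call")
    code = compile(tree, "<generic>", "eval")
    result = np.asarray(eval(code, {"__builtins__": {}}, {**FUNCS, **variables}))
    # Reject everything that is not an array of boolean values.
    if result.dtype != bool or result.ndim == 0:
        raise TypeError("expression must return an array of boolean values")
    return result


variables = {
    "thisvar": np.array([1.0, -1.0, 2.0]),
    "othervar": np.array([np.nan, np.nan, 3.0]),
}
mask = evaluate_generic("(thisvar > 0) & ismissing(othervar)", variables)
print(mask)  # [ True False False]
```

Walking the syntax tree before evaluation rejects anything outside the whitelist, such as attribute access or calls to unlisted functions, and the final dtype check rejects expressions that are not conditional.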

Referencing Semantics

If another variable is referenced within a generic test, the flags from that variable are propagated to the checked variable.

For example: Let var1 and var2 be two variables of a given dataset and func: var1 > mean(var1) the condition whether to flag var2. The result of the check can be described as isflagged(var1) & istrue(func()).
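
The expression above can be traced with a few made-up numbers. This is only a sketch of the stated formula: the values and the existing flags of var1 are invented, and flags are simplified to a boolean array:

```python
import numpy as np

var1 = np.array([1.0, 5.0, 9.0, 7.0])
var1_flagged = np.array([False, True, True, False])  # assumed existing flags of var1

condition = var1 > np.mean(var1)   # func: var1 > mean(var1), the mean is 5.5
result = var1_flagged & condition  # isflagged(var1) & istrue(func())

print(result)  # [False False  True False]
```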

Contributing

Testing

Please run the tests before you commit!

python -m pytest test

can save us a lot of time...

New QC-Algorithms

Currently all test algorithms are collected within the module funcs.functions. In order to make your test available to the system you need to:

  • Place your code into the file funcs/functions.py
  • Register your function by adding it to the dictionary func_map within the function body of funcs.functions.flagDispatch. Your function will be available to the system by its key.
  • Implement the common interface:
    • Function input: Your function needs to accept the following arguments:
      • data: pd.DataFrame: A dataframe holding the entire dataset (i.e. not only the variable the current test is performed on)
      • flags: pd.DataFrame: A dataframe holding the flags for the entire dataset
      • field: String: The name of the variable the current test is performed on. The data and flags for this variable are available via data[field] and flags[field] respectively
      • flagger: flagger.BaseFlagger: An instance of the BaseFlagger class (more likely one of its subclasses). To initialize, create or check against existing flags you should use the respective flagger methods (flagger.emptyFlags, flagger.isFlagged and flagger.setFlag)
      • **kwargs: Any: All the parameters given in the configuration file are passed to your function; you are of course free to make some of them required by the signature. kwargs should be passed on to the flagger.setFlag method in order to allow configuration-based fine-tuning of the flagging
    • Function output: Your function needs to return two DataFrames/ndarrays, data and flags. As the names suggest, the first holds the data, the second the possibly modified flags
    • Note: The chosen interface allows you to manipulate not only the flags, but also the data of the entire dataset within your function body. This freedom might come in handy, but also requires a certain amount of care to not mess things up!
    • Example: The function flagRange in funcs/functions.py may serve as a simple example of the general scheme
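
The interface described above can be sketched as follows. Everything here is illustrative: flagNegative is a made-up test, and DemoFlagger is a hypothetical stand-in for a BaseFlagger subclass whose method signatures are assumptions and may differ from the real ones:

```python
import pandas as pd


class DemoFlagger:
    """Hypothetical stand-in for a BaseFlagger subclass (signatures assumed)."""

    def __init__(self, bad=1):
        self.bad = bad

    def isFlagged(self, flags):
        return flags > 0

    def setFlag(self, flags, flag=None, **kwargs):
        # Return a copy with every entry set to `flag` (default: self.bad).
        out = flags.copy()
        out[:] = self.bad if flag is None else flag
        return out


def flagNegative(data, flags, field, flagger, **kwargs):
    """Flag all negative values of the variable `field`."""
    flags = flags.copy()
    mask = data[field] < 0
    # kwargs (e.g. flag) are passed on to allow configuration-based tuning.
    flags.loc[mask, field] = flagger.setFlag(flags.loc[mask, field], **kwargs)
    return data, flags


data = pd.DataFrame({"temp": [1.0, -2.0, 3.0]})
flags = pd.DataFrame({"temp": [0, 0, 0]})
data, new_flags = flagNegative(data, flags, "temp", DemoFlagger(), flag=9)
print(new_flags["temp"].tolist())  # [0, 9, 0]
```

To make such a function available, it would still need to be placed in funcs/functions.py and added to func_map inside flagDispatch, as described above.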