
System for automated Quality Control (SaQC)

Quality control of numerical data is a profoundly knowledge- and experience-based activity. Finding a robust setup is usually a time-consuming and dynamic endeavor, even for an experienced data expert.

SaQC addresses the iterative and explorative characteristics of quality control with its extensive setup and configuration possibilities and a Python-based extension language. To make the system flexible, many aspects of the quality checking process, like

  • test parametrization
  • test evaluation and
  • test exploration

are easily configurable with plain text files.

Below its user interface, SaQC is highly customizable and extensible. Well-defined interfaces allow it to be extended with new quality check routines. Additionally, core components like the flagging scheme are replaceable.


Dependencies

  • numpy
  • pandas
  • numba
  • pyyaml

Test Syntax

Specification

  • Test specifications are written in YAML and contain:
    • A test name, either one of the predefined tests or 'generic'
    • Optionally a set of parameters. These should be given in JSON-object or YAML/Python-dictionary style (i.e. {key: value})
    • The test name and the parameter object/dictionary need to be separated by a comma
  • Example: limits, {min: 0, max: 100}
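
As a rough illustration of this syntax, the sketch below splits a specification into its test name and parameters. It is not SaQC's parser: `parse_test_spec` and `_parse_scalar` are hypothetical helpers, and the hand-written reader only covers flat `{key: value}` objects with numeric or string values. In practice a YAML parser such as pyyaml would be the natural choice.

```python
def _parse_scalar(text):
    """Convert a scalar to int or float if possible, else keep it as a string."""
    text = text.strip()
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text.strip("'\"")


def parse_test_spec(spec):
    """Split 'name, {key: value, ...}' into (name, params). Hypothetical helper."""
    name, _, rest = spec.partition(",")
    name, rest = name.strip(), rest.strip()
    if not name:
        raise ValueError("a test specification needs a test name")
    if not rest:
        return name, {}  # parameters are optional
    if not (rest.startswith("{") and rest.endswith("}")):
        raise ValueError(f"parameters must be given as {{key: value}}: {rest!r}")
    params = {}
    body = rest[1:-1].strip()
    if body:
        # Assumption: values contain no commas or nested objects.
        for item in body.split(","):
            key, sep, value = item.partition(":")
            if not sep:
                raise ValueError(f"expected 'key: value', got {item!r}")
            params[key.strip()] = _parse_scalar(value)
    return name, params


print(parse_test_spec("limits, {min: 0, max: 100}"))
# → ('limits', {'min': 0, 'max': 100})
```

Splitting on the first comma only keeps the commas inside the parameter object intact, which is why the test name has to come first.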

Optional Test Parameters

  • flag: The value to set (more precisely, the value to pass to the flagging component) if the test does not pass
  • flag_period:
    • if a value is flagged, so is the given time period following the timestamp of that value

    • Number followed by a frequency specification, e.g. '5min', '6D'. A comprehensive list of the supported frequencies can be found in the table 'Offset Aliases' in the pandas documentation. The (probably) most common options are also listed below:

      | frequency string | description |
      |------------------|-------------|
      | D                | one day     |
      | H                | one hour    |
      | T or min         | one minute  |
      | S                | one second  |
  • flag_values:
    • Number
    • if a value is flagged, so are the next n previously unflagged values
  • assign:
    • boolean
    • Assign the test result to a new column
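
To make the `flag_period` semantics more concrete, here is a minimal sketch using only the standard library. It is not taken from SaQC: `parse_period` and `apply_flag_period` are hypothetical names, only the four frequency strings listed above are handled, and treating the end of the period as inclusive is an assumption.

```python
import re
from datetime import datetime, timedelta

# Frequency strings listed above, mapped to timedelta keyword arguments.
_UNITS = {"D": "days", "H": "hours", "T": "minutes", "min": "minutes", "S": "seconds"}


def parse_period(period):
    """Turn a string such as '5min' or '6D' into a timedelta."""
    match = re.fullmatch(r"(\d+)\s*(D|H|T|min|S)", period.strip())
    if match is None:
        raise ValueError(f"unsupported period: {period!r}")
    count, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(count)})


def apply_flag_period(timestamps, flagged, period):
    """Return new flags in which every originally flagged value also flags
    all later timestamps within `period` of it (end inclusive, by assumption)."""
    delta = parse_period(period)
    starts = [t for t, f in zip(timestamps, flagged) if f]
    return [
        f or any(s < t <= s + delta for s in starts)
        for t, f in zip(timestamps, flagged)
    ]


start = datetime(2020, 1, 1)
timestamps = [start + timedelta(minutes=2 * i) for i in range(6)]
flagged = [False, True, False, False, False, False]
print(apply_flag_period(timestamps, flagged, "5min"))
# → [False, True, True, True, False, False]
```

Only the originally flagged values start a period here, so flags added by the propagation do not cascade further along the series.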

Predefined Tests

| name | required parameters | optional parameters | description |
|------|---------------------|---------------------|-------------|
| mad  | z, length           | deriv = 1           | mean absolute deviation with measure of central tendency z and a rolling window of size length. Optionally, the deriv-th derivative of the dataset is calculated first. |

User Defined Test

User-defined tests allow simple quality checks to be specified directly within the configuration.

Specification

  • Test name: generic
  • The parameter 'func' followed by an expression needs to be given
  • Example: generic, {func: (thisvar > 0) & ismissing(othervar)}
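
One way such an expression could be evaluated is sketched below. This is an illustration under assumptions, not SaQC's implementation: `Column` is a small stand-in for a real data column (SaQC itself presumably works on pandas or numpy data), `evaluate_generic` is a hypothetical helper, and `ismissing` is assumed to mark None and NaN entries. The sketch only computes the boolean result per value; how that result maps to flags is left open.

```python
import math


class Column:
    """Minimal stand-in for a data column with element-wise operators."""

    def __init__(self, values):
        self.values = list(values)

    def __gt__(self, other):
        # None and NaN never satisfy '> other'.
        return Column(v is not None and v > other for v in self.values)

    def __and__(self, other):
        return Column(bool(a) and bool(b) for a, b in zip(self.values, other.values))


def ismissing(column):
    """Mark entries that are None or NaN (assumed meaning of 'ismissing')."""
    return Column(
        v is None or (isinstance(v, float) and math.isnan(v)) for v in column.values
    )


def evaluate_generic(func, data):
    """Evaluate `func` with the column names and helpers as the only names."""
    namespace = {name: Column(values) for name, values in data.items()}
    namespace["ismissing"] = ismissing
    # Caution: eval is not a sandbox; use it on trusted configuration only.
    return eval(func, {"__builtins__": {}}, namespace).values


data = {
    "thisvar": [1.0, -2.0, 3.0],
    "othervar": [float("nan"), float("nan"), 5.0],
}
print(evaluate_generic("(thisvar > 0) & ismissing(othervar)", data))
# → [True, False, False]
```

The parentheses around `thisvar > 0` matter: in Python, `&` binds more tightly than `>`, so the comparison has to be grouped explicitly.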