System for automated Quality Control (SaQC)

Quality control of numerical data is a profoundly knowledge- and experience-based activity. Finding a robust setup is usually a time-consuming and dynamic endeavor, even for an experienced data expert.

SaQC addresses the iterative and explorative characteristics of quality control with its extensive setup and configuration possibilities and a Python-based extension language. To make the system flexible, many aspects of the quality checking process, like

  • test parametrization
  • test evaluation and
  • test exploration

are easily configurable with plain text files.

Below its user interface, SaQC is thus highly customizable and extensible. Well-defined interfaces allow the extension with new quality check routines. Additionally, many core components, like the flagging scheme, are replaceable.


Why?

When it comes to the implementation of data workflows in the environmental sciences, our experience in (research) data management revealed a significant knowledge gap between the people collecting often large amounts of (environmental) data and the people responsible for the processing and quality assurance of these datasets. While the former usually have a good understanding of the underlying measurement principles, the potential noise sources overlaying the actual signal and the expected characteristics of the dataset, the latter are mostly software developers with a good knowledge of how to implement data flows.

The main objective of SaQC is therefore to bridge this gap by allowing both parties to concentrate on their strengths: the data collector/owner should be able to express their ideas in an easy and succinct way, while the actual implementation of the data processing and quality checking is left to the respective experts.

How?

The most important aspect of SaQC, the general configuration of the system, is text-based. All the magic takes place in a semicolon-separated table file listing the variables within the dataset to inspect, quality control and/or modify.

While a good (but still growing) number of predefined and highly configurable functions are included and ready to use, SaQC additionally ships with a Python-based extension language. This domain-specific language (DSL), to use a slightly exaggerated term, allows (more or less simple) tests to be written directly within the configuration. The idea is that many more complex datasets carry inherent physical and technical relationships (like "if the variable indicating the health of an active cooling solution drops, the values of variable 'y' are useless") that are far easier to express in text than in code.

For a more specific tour of some of SaQC's possibilities, please refer to our HowTo.

Installation

pip

SaQC is available on the Python Package Index (PyPI) and can be installed using pip:

python -m pip install saqc

Manual installation

The latest development version is directly available from the GitLab server of the Helmholtz Center for Environmental Research.

Usage

Command line interface (CLI)

SaQC provides a basic CLI

Integration into the

User Defined Test

User-defined tests allow simple quality checks to be specified directly within the configuration.

Specification

  • Test name: generic
  • The parameter 'func', followed by an expression, must be given
  • Example: generic, {func: (thisvar > 0) & ismissing(othervar)}

Restrictions

  • only the operators and functions listed below are available
  • all checks need to be conditional expressions and have to return an array of boolean values; all other expressions are rejected. This limitation narrows the scope of the system, and with it the potential for mistakes, and may be removed in the future.
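
The two restrictions above can be illustrated with Python's standard `ast` module. The following is only a sketch of one possible way to enforce them, not SaQC's actual implementation; the names `ALLOWED_FUNCTIONS`, `ALLOWED_NODES`, `is_conditional` and `validate` are hypothetical.

```python
import ast

# Hypothetical whitelist mirroring the function table in this README.
ALLOWED_FUNCTIONS = {"abs", "max", "min", "mean", "sum", "std", "len", "ismissing"}

# Syntax elements accepted in an expression: names, constants, calls,
# arithmetic, comparison and bitwise operators. Everything else is rejected.
ALLOWED_NODES = (
    ast.Expression, ast.Name, ast.Load, ast.Constant, ast.Call,
    ast.BinOp, ast.UnaryOp, ast.Compare,
    ast.Add, ast.Sub, ast.Mult, ast.Div, ast.FloorDiv, ast.Mod, ast.Pow,
    ast.UAdd, ast.USub, ast.Invert,
    ast.BitAnd, ast.BitOr, ast.BitXor,
    ast.Eq, ast.NotEq, ast.Lt, ast.LtE, ast.Gt, ast.GtE,
)


def is_conditional(node):
    """Return True if the node is expected to evaluate to boolean values."""
    if isinstance(node, ast.Compare):
        return True
    if isinstance(node, ast.Call):
        return isinstance(node.func, ast.Name) and node.func.id == "ismissing"
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.Invert):
        return is_conditional(node.operand)
    if isinstance(node, ast.BinOp) and isinstance(
        node.op, (ast.BitAnd, ast.BitOr, ast.BitXor)
    ):
        return is_conditional(node.left) and is_conditional(node.right)
    return False


def validate(expr, variables):
    """Return True for an allowed conditional expression, else raise ValueError."""
    tree = ast.parse(expr, mode="eval")
    # Remember which Name nodes are function names rather than variables.
    callees = {id(n.func) for n in ast.walk(tree) if isinstance(n, ast.Call)}
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            raise ValueError(f"unsupported syntax: {type(node).__name__}")
        if isinstance(node, ast.Call):
            if not isinstance(node.func, ast.Name) or node.func.id not in ALLOWED_FUNCTIONS:
                raise ValueError("unsupported function call")
        elif isinstance(node, ast.Name) and id(node) not in callees:
            if node.id not in variables:
                raise ValueError(f"unknown variable: {node.id}")
    if not is_conditional(tree.body):
        raise ValueError("expression is not a conditional")
    return True


print(validate("(thisvar > 0) & ismissing(othervar)", {"thisvar", "othervar"}))  # True
```

Under these assumptions, an expression such as `thisvar + 1` is rejected because it is not a conditional, and `__import__('os')` is rejected because the called function is not on the whitelist.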

Syntax

  • standard Python syntax
  • all variables within the configuration file can be used

Supported Operators

  • all arithmetic operators
  • all comparison operators
  • bitwise operators: and, or, xor, complement (&, |, ^, ~)

Supported functions

function name   description
abs             absolute values of a variable
max             maximum value of a variable
min             minimum value of a variable
mean            mean value of a variable
sum             sum of a variable
std             standard deviation of a variable
len             number of values of a variable
ismissing       check for missing values (nan and a possibly user-defined value)
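
To make the description of `ismissing` concrete, here is a minimal pure-Python sketch of the semantics described in the table. It is an illustration only: the `nodata` parameter name is an assumption, and SaQC itself operates on arrays rather than plain lists.

```python
import math


def ismissing(values, nodata=None):
    """Mark entries that are NaN or equal to an optional user-defined value.

    `nodata` is a hypothetical parameter name used for this sketch.
    """
    return [
        (isinstance(v, float) and math.isnan(v))
        or (nodata is not None and v == nodata)
        for v in values
    ]


# Element-wise evaluation of the example check
#   (thisvar > 0) & ismissing(othervar)
thisvar = [2.0, -1.0, 3.0]
othervar = [float("nan"), float("nan"), 5.0]
flags = [(t > 0) & m for t, m in zip(thisvar, ismissing(othervar))]
print(flags)  # [True, False, False]
```

Only the first entry is flagged: it is the only position where `thisvar` is positive and `othervar` is missing.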

Referencing Semantics

If another variable is referenced within a generic test, the flags from that variable are propagated to the checked variable.

For example: Let var1 and var2 be two variables of a given dataset and func: var1 > mean(var1) the condition that decides whether to flag var2. The result of the check can be described as isflagged(var1) & istrue(func()).
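
The formula above can be spelled out element by element. The sketch below follows the expression `isflagged(var1) & istrue(func())` literally, using plain lists and made-up values; the helper name `combine` and the sample flags are assumptions for illustration, not part of SaQC.

```python
from statistics import mean


def combine(flagged, condition):
    """Element-wise `isflagged(var1) & istrue(func())` on plain lists."""
    return [f & c for f, c in zip(flagged, condition)]


# Made-up sample data for var1 and its already existing flags.
var1 = [1.0, 5.0, 2.0, 8.0]
var1_flagged = [True, True, False, False]

# func: var1 > mean(var1), with mean(var1) == 4.0
threshold = mean(var1)
func_result = [v > threshold for v in var1]  # [False, True, False, True]

# Resulting flags for var2 according to the formula above.
var2_flags = combine(var1_flagged, func_result)
print(var2_flags)  # [False, True, False, False]
```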

License

Copyright (c) 2019, Helmholtz-Zentrum fuer Umweltforschung GmbH - UFZ. All rights reserved.

The "System for Automated Quality Control" is free software. You can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. See the license for details.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.