System for automated Quality Control (SaQC)
Quality control of numerical data is a profoundly knowledge- and experience-based activity. Finding a robust setup is usually a time-consuming and dynamic endeavor, even for an experienced data expert.
SaQC addresses the iterative and explorative characteristics of quality control with its extensive setup and configuration possibilities and a Python-based extension language. To make the system flexible, many aspects of the quality checking process, like
- test parametrization
- test evaluation and
- test exploration
are easily configurable with plain text files.
Below its user interface, SaQC is thus highly customizable and extensible. Well-defined interfaces allow the system to be extended with new quality check routines. Additionally, many core components, like the flagging scheme, are replaceable.
Why?
When it comes to implementing data workflows in the environmental sciences, our experience in (research) data management has revealed a significant knowledge gap between the people collecting often large amounts of (environmental) data and the people responsible for the processing and quality assurance of these datasets. While the former usually have a good understanding of the underlying measurement principles, the potential noise sources overlaying the actual signal, and the expected characteristics of the dataset, the latter are mostly software developers with a good knowledge of how to implement data flows.
The main objective of SaQC is therefore to bridge this gap by allowing both parties to concentrate on their strengths: the data collector/owner should be able to express her ideas in an easy and succinct way, while the actual implementation of the data processing and quality checking is left to the respective experts.
How?
The most important aspect of SaQC, the general configuration of the system, is text-based. All the magic takes place in a semicolon-separated table file listing the variables within the dataset to inspect, quality control and/or modify.
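To make the idea of a semicolon-separated configuration table concrete, here is a minimal sketch that reads such a table with Python's standard `csv` module. The column names `varname` and `test` and the two example rows are assumptions made for illustration; they are not taken from SaQC's documented file format.

```python
import csv
import io

# Hypothetical configuration text: the column names and rows are assumptions
# for this sketch, not SaQC's documented format.
CONFIG_TEXT = """varname;test
temperature;generic, {func: temperature > 0}
humidity;generic, {func: ismissing(humidity)}
"""


def read_config(text):
    """Parse a semicolon-separated table into a list of row dictionaries."""
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    return [{key: value.strip() for key, value in row.items()} for row in reader]


rows = read_config(CONFIG_TEXT)
for row in rows:
    print(row["varname"], "->", row["test"])
```

Each row pairs one variable with one test definition, so a data owner can change a check by editing a single line of text.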
While a good (but still growing) number of predefined and highly configurable functions are included and ready to use, SaQC additionally ships with a Python-based extension language. This domain-specific language (DSL), to use a slightly exaggerated term, allows (more or less simple) tests to be written directly within the configuration. The idea is that many more complex datasets carry inherent physical and technical relationships (like "if the variable indicating the health of an active cooling solution drops, the values of variable 'y' are useless") that are much easier to express in text than in code.
For a more specific tour of some of SaQC's possibilities, please refer to our HowTo.
Installation
pip
SaQC is available on the Python Package Index (PyPI) and can be installed using pip:
python -m pip install saqc
Manual installation
The latest development version is available directly from the GitLab server of the Helmholtz Center for Environmental Research.
Usage
Command line interface (CLI)
SaQC provides a basic CLI
Integration into the
User Defined Test
User-defined tests allow simple quality checks to be specified directly within the configuration.
Specification
- Test name: generic
- The parameter 'func' followed by an expression needs to be given
- Example: generic, {func: (thisvar > 0) & ismissing(othervar)}
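The example expression can be read elementwise: a value is selected where `thisvar` is positive and `othervar` is missing at the same position. The following sketch spells this out with plain Python lists. It illustrates the semantics only and is not SaQC's implementation; the sample data, the `missing` parameter and the sentinel value `-9999.0` are assumptions.

```python
import math


def ismissing(values, missing=None):
    """Elementwise check for NaN or a (hypothetical) user-defined missing value."""
    return [
        (isinstance(v, float) and math.isnan(v))
        or (missing is not None and v == missing)
        for v in values
    ]


# Hypothetical sample data for the two variables of the example.
thisvar = [1.0, -2.0, 3.0, 4.0]
othervar = [float("nan"), float("nan"), 7.0, -9999.0]

# (thisvar > 0) & ismissing(othervar), written out elementwise
positive = [v > 0 for v in thisvar]
missing = ismissing(othervar, missing=-9999.0)
result = [p & m for p, m in zip(positive, missing)]

print(result)  # [True, False, False, True]
```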
Restrictions
- only the operators and functions listed below are available
- all checks need to be conditional expressions and have to return an array of boolean values; all other expressions are rejected. This limitation narrows the scope of the system, and with it the potential to mess things up, and might be removed in the future.
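The boolean-array requirement could be enforced along the following lines. This is a sketch under the assumption that results arrive as ordinary Python sequences; the helper name `require_boolean_array` is hypothetical and not part of SaQC.

```python
def require_boolean_array(result):
    """Return the result as a list if it is a sequence of booleans, else raise."""
    if isinstance(result, (str, bytes)) or not hasattr(result, "__iter__"):
        raise TypeError("expression must return an array of boolean values")
    values = list(result)
    if not all(isinstance(v, bool) for v in values):
        raise TypeError("expression must return an array of boolean values")
    return values


values = [3.0, -1.0, 5.0]

# A conditional expression yields booleans and is accepted.
accepted = require_boolean_array([v > 0 for v in values])

# A purely arithmetic expression yields numbers and is rejected.
try:
    require_boolean_array([v + 1 for v in values])
    rejected = False
except TypeError:
    rejected = True
```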
Syntax
- standard Python syntax
- all variables within the configuration file can be used
Supported Operators
- all arithmetic operators
- all comparison operators
- bitwise operators: and, or, xor, complement (&, |, ^, ~)
Supported functions
function name | description |
---|---|
abs | absolute values of a variable |
max | maximum value of a variable |
min | minimum value of a variable |
mean | mean value of a variable |
sum | sum of a variable |
std | standard deviation of a variable |
len | the number of values of a variable |
ismissing | check for missing values (nan and a possibly user-defined value) |
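One way to restrict expressions to the operators and functions listed in this section is to inspect their syntax tree before evaluation. The sketch below uses Python's standard `ast` module with a whitelist. It is an assumption about how such a restriction could be built, not a description of SaQC's internals, and it reads "comparison operators" as the six ordering and equality operators.

```python
import ast

# Function names taken from the table of supported functions.
ALLOWED_FUNCTIONS = {"abs", "max", "min", "mean", "sum", "std", "len", "ismissing"}

ALLOWED_NODES = (
    ast.Expression, ast.Name, ast.Load, ast.Constant, ast.Call,
    ast.BinOp, ast.UnaryOp, ast.Compare,
    # arithmetic operators
    ast.Add, ast.Sub, ast.Mult, ast.Div, ast.FloorDiv, ast.Mod, ast.Pow,
    ast.UAdd, ast.USub,
    # comparison operators (assumption: ordering and equality only)
    ast.Eq, ast.NotEq, ast.Lt, ast.LtE, ast.Gt, ast.GtE,
    # bitwise operators: &, |, ^, ~
    ast.BitAnd, ast.BitOr, ast.BitXor, ast.Invert,
)


def validate(expression):
    """Raise ValueError if the expression uses anything outside the whitelist."""
    tree = ast.parse(expression, mode="eval")
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            raise ValueError(f"unsupported syntax: {type(node).__name__}")
        if isinstance(node, ast.Call):
            # Only plain calls to whitelisted function names are permitted.
            if not isinstance(node.func, ast.Name) or node.func.id not in ALLOWED_FUNCTIONS:
                raise ValueError("unsupported function call")
    return True


print(validate("(thisvar > 0) & ismissing(othervar)"))
```

With this whitelist, attribute access, keyword arguments, the boolean keywords `and`/`or` and calls to any other function are all rejected before anything is evaluated.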
Referencing Semantics
If another variable is referenced within a generic test, the flags from that variable are propagated to the checked variable.
For example:
Let var1 and var2 be two variables of a given dataset and func: var1 > mean(var1) the condition whether to flag var2. The result of the check can be described as isflagged(var1) & istrue(func()).
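Taking the expression `isflagged(var1) & istrue(func())` literally, the combination can be sketched with plain lists as follows. The sample values and the existing flags of `var1` are hypothetical, and the code only illustrates the stated formula rather than SaQC's actual flag handling.

```python
from statistics import mean

# Hypothetical data and existing flags for var1.
var1 = [1.0, 5.0, 9.0, 7.0]
var1_flagged = [False, True, False, True]


def func():
    """The condition var1 > mean(var1), evaluated elementwise."""
    threshold = mean(var1)
    return [value > threshold for value in var1]


# isflagged(var1) & istrue(func()), combined position by position
var2_flagged = [
    flagged & (outcome is True)
    for flagged, outcome in zip(var1_flagged, func())
]

print(var2_flagged)  # [False, False, False, True]
```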
License
Copyright(c) 2019, Helmholtz-Zentrum fuer Umweltforschung GmbH - UFZ. All rights reserved.
The "System for Automated Quality Control" is free software. You can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. See the license for details.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.