# System for automated Quality Control (SaQC)

Quality Control of numerical data requires a significant amount of
domain knowledge and practical experience. Finding a robust setup of
quality tests that identifies as many suspicious values as possible, without
removing valid data, is usually a time-consuming and iterative endeavor,
even for experts.

SaQC is both a Python framework and a command line application. It
addresses the exploratory nature of quality control by offering a
continuously growing number of quality check routines through a flexible
and simple configuration system.

Below its user interface, SaQC is highly customizable and extensible.
A modular structure and well-defined interfaces make it easy to extend
the system with custom quality checks. Furthermore, even core components like
the flagging scheme are exchangeable.

![SaQC Workflow](ressources/images/readme_image.png "SaQC Workflow")

## Why?
In our experience implementing data workflows in environmental sciences,
there is a significant knowledge gap between the people collecting data
and those responsible for the processing and quality control of these
datasets. While the former usually have a solid understanding of the
underlying physical properties, measurement principles and the resulting
errors, the latter are mostly software developers with expertise in
data processing.

The main objective of SaQC is to bridge this gap by allowing both
parties to focus on their strengths: the data collector/owner should be
able to express their ideas in an easy and succinct way, while the actual
implementation of the algorithms is left to the respective developers.


## How?
`SaQC` is both a command line application controlled by a text-based
configuration file and a Python module with a simple API.

While a good (but still growing) number of predefined and highly configurable
[functions](docs/FunctionIndex.md) are included and ready to use, SaQC
additionally ships with a Python-based
[extension language](docs/GenericFunctions.md) for quality control and
general-purpose data processing.

For a more detailed tour of some of SaQC's capabilities, see our
[GettingStarted](docs/GettingStarted.md) guide.

### SaQC as a command line application
Most of the magic is controlled by a
[semicolon-separated text file](saqc/docs/ConfigurationFiles.md) listing the variables of the
dataset and the routines to inspect, quality control and/or process them.
The content of such a configuration could look like this:

```
varname    ; test                                
#----------;------------------------------------
SM2        ; harm_shift2Grid(freq="15Min")       
SM2        ; flagMissing(nodata=NAN)             
'SM(1|2)+' ; flagRange(min=10, max=60)           
SM2        ; spikes_flagMad(window="30d", z=3.5)
```
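
To make the layout of such a file more concrete, here is a minimal sketch of how it could be read using only the Python standard library. This is not SaQC's own parser: the helper names `parse_config` and `select_columns` are hypothetical, and the assumption that a quoted variable name must match a column name completely is ours, made for illustration only.

```python
import re

CONFIG = """\
varname    ; test
#----------;------------------------------------
SM2        ; harm_shift2Grid(freq="15Min")
SM2        ; flagMissing(nodata=NAN)
'SM(1|2)+' ; flagRange(min=10, max=60)
SM2        ; spikes_flagMad(window="30d", z=3.5)
"""


def parse_config(text):
    """Return a list of (varname, is_regex, test) tuples, one per test line."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines such as the '#---' separator.
        if not line or line.startswith("#"):
            continue
        # Split on the first semicolon only, so the test expression stays whole.
        varname, test = (part.strip() for part in line.split(";", 1))
        if (varname, test) == ("varname", "test"):
            continue  # header row
        # A quoted variable name is treated as a regular expression.
        is_regex = len(varname) >= 2 and varname[0] == varname[-1] == "'"
        rows.append((varname.strip("'"), is_regex, test))
    return rows


def select_columns(varname, is_regex, columns):
    """Pick the columns a row applies to (full match is an assumption)."""
    if not is_regex:
        return [c for c in columns if c == varname]
    return [c for c in columns if re.fullmatch(varname, c)]


rows = parse_config(CONFIG)
for varname, is_regex, test in rows:
    print(select_columns(varname, is_regex, ["SM1", "SM2", "SM3"]), "<-", test)
```

Under these assumptions, the quoted pattern `'SM(1|2)+'` selects both `SM1` and `SM2`, which is why a single `flagRange` line can cover several variables.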

As soon as the basic inputs, a dataset and the configuration file, are
prepared, running SaQC is as simple as:
```sh
saqc \
    --config path_to_configuration.txt \
    --data path_to_data.csv \
    --outfile path_to_output.csv
```

### SaQC as a python module

The following snippet implements the same configuration given above through
the Python API:

```python
import numpy as np

from saqc import SaQC, SimpleFlagger

# `data` is the dataset to be checked; it must be loaded beforehand.
saqc = (SaQC(SimpleFlagger(), data)
        .harm_shift2Grid("SM2", freq="15Min")
        .flagMissing("SM2", nodata=np.nan)
        .flagRange("SM(1|2)+", regex=True, min=10, max=60)
        .spikes_flagMad("SM2", window="30d", z=3.5))

# Retrieve the processed data and the corresponding flagger.
data, flagger = saqc.getResult()
```
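
For readers new to this kind of check, the following standard-library sketch illustrates the idea behind a range test and a MAD-based spike test. It is not SaQC's implementation: `flag_range` and `flag_mad` are hypothetical helpers, the time-based `window` is ignored, and reading `z` as a threshold on the modified z-score (with the usual 0.6745 scaling) is an assumption on our part.

```python
from statistics import median


def flag_range(values, lower, upper):
    """Flag values outside the closed interval [lower, upper]."""
    return [not (lower <= v <= upper) for v in values]


def flag_mad(values, z=3.5):
    """Flag values whose modified z-score exceeds z (no rolling window)."""
    med = median(values)
    # Median absolute deviation: a robust measure of spread.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return [False] * len(values)
    # 0.6745 scales the MAD so it is comparable to a standard deviation
    # for normally distributed data.
    return [0.6745 * abs(v - med) / mad > z for v in values]


# Made-up example values, for illustration only.
soil_moisture = [21.0, 21.5, 22.0, 21.8, 75.0, 22.1, 21.9, 5.0]
range_flags = flag_range(soil_moisture, lower=10, upper=60)
mad_flags = flag_mad(soil_moisture, z=3.5)
print(range_flags)
print(mad_flags)
```

In this toy series both checks mark the same two values, `75.0` and `5.0`; on real data they usually complement each other, since a spike can lie well inside the valid range.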

## Installation

### Python Package Index
SaQC is available on the Python Package Index ([PyPI](https://pypi.org/)) and
can be installed using [pip](https://pip.pypa.io/en/stable/):
```sh
python -m pip install saqc
```

### Anaconda
Currently we don't provide pre-built conda packages, but installing `SaQC`
using the [conda package manager](https://docs.conda.io/en/latest/) is
straightforward:
1. Create an anaconda environment including all the necessary dependencies with:
   ```sh
   conda env create -f environment.yml
   ```
2. Load the freshly created environment with:
   ```sh
   conda activate saqc
   ```

### Manual installation

The latest development version is directly available from the
[GitLab](https://git.ufz.de/rdm-software/saqc) server of the
[Helmholtz Centre for Environmental Research](https://www.ufz.de/index.php?en=33573).
More details on how to set up a suitable environment are available
[here](CONTRIBUTING.md#development-environment).

### Python version
The minimum Python version required is 3.6.

## Usage
### Command line interface (CLI)
SaQC provides a basic CLI to get you started. As soon as the basic inputs,
a dataset and the [configuration file](saqc/docs/ConfigurationFiles.md), are
prepared, running SaQC is as simple as:
```sh
saqc \
    --config path_to_configuration.txt \
    --data path_to_data.csv \
    --outfile path_to_output.csv
```
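
For scripted or scheduled runs, the same command can be launched from Python. The sketch below uses only the three options shown above and the standard `subprocess` module; the helper names `build_saqc_command` and `run_saqc` are hypothetical, and the sketch assumes the `saqc` executable is on your `PATH`.

```python
import subprocess


def build_saqc_command(config, data, outfile):
    """Assemble the argument list for the saqc command shown above."""
    return [
        "saqc",
        "--config", str(config),
        "--data", str(data),
        "--outfile", str(outfile),
    ]


def run_saqc(config, data, outfile):
    """Run saqc; raises CalledProcessError on a non-zero exit status."""
    return subprocess.run(build_saqc_command(config, data, outfile), check=True)


command = build_saqc_command(
    "path_to_configuration.txt", "path_to_data.csv", "path_to_output.csv"
)
print(" ".join(command))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces, and `check=True` makes a failed run surface as an exception instead of passing silently.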


### Integration into larger workflows
The main function is [exposed](saqc/core/core.py#L79) and can be used within
your own programs.


## License
Copyright(c) 2019,
Helmholtz Centre for Environmental Research - UFZ.
All rights reserved.

The "System for Automated Quality Control" is free software. You can
redistribute it and/or modify it under the terms of the GNU General
Public License as published by the Free Software Foundation, either
version 3 of the License, or (at your option) any later version. See the
[license](LICENSE.txt) for details.

This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
See the GNU General Public License for more details.