David Schäfer authored 2ed4d91a
Getting started with SaQC
This guide assumes that you have Python 3.6 or 3.7 installed.
Contents
1. Set up your environment
SaQC is written in Python, so the easiest way to install it is from the Python Package Index (PyPI). Following good Python practice, first create a new virtual environment to install SaQC into by typing the following in your console:
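# venv is part of the Python 3 standard library, so this step is usually unnecessary;
# the third-party virtualenv package is an optional alternative:
python3 -m pip install --user virtualenv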
# move to the directory where you want to create your virtual environment
cd YOURDIR
# create virtual environment called "env_saqc"
python3 -m venv env_saqc
# activate the virtual environment
source env_saqc/bin/activate
Note that these instructions are for Unix/macOS systems; the commands differ slightly on Windows, where the environment is activated with env_saqc\Scripts\activate.
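If you are unsure whether the activation worked, the following minimal Python sketch checks it from inside the interpreter. It relies only on the standard library; the function name in_virtualenv is chosen here for illustration and is not part of SaQC.

```python
import sys


def in_virtualenv():
    """Return True if the interpreter runs inside a virtual environment."""
    # Inside an environment created with "python3 -m venv", sys.prefix points
    # at the environment, while sys.base_prefix still points at the base
    # Python installation.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return sys.prefix != base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("interpreter prefix:", sys.prefix)
```

Run it with the same python command you intend to use for SaQC; if it reports False, activate env_saqc again.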
2. Get SaQC
Now get saqc via PyPI as well:
python -m pip install saqc
or download it directly from the GitLab-repository.
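To confirm that the installation ended up in the active environment, you can ask Python whether the package can be found. This is a small sketch using the standard library only; it assumes that the importable module is called saqc, like the PyPI package, and the helper name is_installed is made up for this example.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the import system can locate a top-level module."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Assumption: the module name matches the package name "saqc".
    if is_installed("saqc"):
        print("saqc is available in this environment")
    else:
        print("saqc not found; is the virtual environment activated?")
```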
3. Training tour
The following passage walks you through the essentials of using SaQC with a toy dataset and a toy configuration.
Get toy data and configuration
If you take a look into the folder saqc/ressources/data you will find a toy dataset data.csv which contains the following:
Date,Battery,SM1,SM2
2016-04-01 00:05:48,3573,32.685,29.3157
2016-04-01 00:20:42,3572,32.7428,29.3157
2016-04-01 00:35:37,3572,32.6186,29.3679
2016-04-01 00:50:32,3572,32.736999999999995,29.3679
...
These are two time series of soil moisture (SM1 and SM2) and the battery voltage of the measuring device over time. Generally, this is how your data should look in order to run saqc. Note, however, that the index does not necessarily have to be a series of dates and that you are free to use more columns with any names you like.
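To make the layout concrete (one index column followed by one column per variable), here is a minimal sketch that parses the rows shown above with Python's standard library. It is only an illustration of the file structure; SaQC does its own reading, and the function name read_timeseries is hypothetical.

```python
import csv
import io
from datetime import datetime

# The first rows of the toy dataset, copied from above.
SAMPLE = """Date,Battery,SM1,SM2
2016-04-01 00:05:48,3573,32.685,29.3157
2016-04-01 00:20:42,3572,32.7428,29.3157
2016-04-01 00:35:37,3572,32.6186,29.3679
2016-04-01 00:50:32,3572,32.736999999999995,29.3679
"""


def read_timeseries(text):
    """Parse CSV text into a list of (timestamp, {column: value}) pairs."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        # The "Date" column is the index; all other columns are variables.
        timestamp = datetime.strptime(record.pop("Date"), "%Y-%m-%d %H:%M:%S")
        values = {name: float(value) for name, value in record.items()}
        rows.append((timestamp, values))
    return rows


rows = read_timeseries(SAMPLE)
for timestamp, values in rows:
    print(timestamp.isoformat(), values["SM2"])
```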
Now create your own configuration file saqc/ressources/data/myconfig.csv and paste the following lines into it:
varname;test;plot
SM2;range(min=10, max=60);False
SM2;spikes_simpleMad(window="30d", z=3.5);True
These lines illustrate how different quality control tests can be specified for different variables by following the pattern:
varname ; testname(testparameters) ; plotting option
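The pattern can be illustrated with a few lines of Python that split each configuration line at the semicolons. This is a sketch of the file layout only, not SaQC's actual configuration reader, and read_config is a name invented for this example.

```python
CONFIG = """varname;test;plot
SM2;range(min=10, max=60);False
SM2;spikes_simpleMad(window="30d", z=3.5);True
"""


def read_config(text):
    """Split a semicolon-separated config into one dict per test line."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = [name.strip() for name in lines[0].split(";")]
    entries = []
    for line in lines[1:]:
        fields = [field.strip() for field in line.split(";")]
        entry = dict(zip(header, fields))
        # The plotting option is stored as the text "True" or "False".
        entry["plot"] = entry["plot"] == "True"
        entries.append(entry)
    return entries


entries = read_config(CONFIG)
for entry in entries:
    print(entry["varname"], "|", entry["test"], "|", entry["plot"])
```

The entries keep the order of the file, which matters because the tests run in that order.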
In this case, we define a range-test that flags all values outside the range [10,60] and a test to detect spikes using the MAD-method. You can find an overview of all available quality control tests in the documentation. Note that the tests are executed in the order that you define in the configuration file. The quality flags that are set during one test are always passed on to the subsequent one.
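To give an intuition for what the two tests do and how flags carry over from one test to the next, here is a simplified sketch in plain Python. It is not SaQC's implementation: the spike test below uses a single global median and MAD instead of a moving window such as "30d", the constant 0.6745 is the one commonly used for the modified z-score, skipping already flagged values when computing the statistics is an assumption, and the label "OK" as well as the sample values are made up for illustration.

```python
from statistics import median


def flag_range(values, flags, lower, upper):
    """Mark every value outside the closed interval [lower, upper] as "BAD"."""
    return [
        flag if lower <= value <= upper else "BAD"
        for value, flag in zip(values, flags)
    ]


def flag_mad_spikes(values, flags, z=3.5):
    """Mark values whose modified z-score exceeds z as "BAD".

    Simplification: one global median/MAD instead of a moving window.
    Assumption: values that are already "BAD" are left out of the statistics.
    """
    good = [value for value, flag in zip(values, flags) if flag != "BAD"]
    if not good:
        return list(flags)
    center = median(good)
    mad = median([abs(value - center) for value in good])
    if mad == 0:
        return list(flags)  # no spread, so no score can be computed
    result = []
    for value, flag in zip(values, flags):
        score = 0.6745 * abs(value - center) / mad
        result.append("BAD" if flag == "BAD" or score > z else flag)
    return result


# Hypothetical soil-moisture readings, not taken from data.csv.
values = [29.3, 29.4, 5.0, 29.5, 29.3, 45.0, 29.4, 29.6]
flags = ["OK"] * len(values)  # "OK" is a placeholder label
flags = flag_range(values, flags, lower=10, upper=60)  # first test
flags = flag_mad_spikes(values, flags, z=3.5)  # second test sees earlier flags
print(list(zip(values, flags)))
```

With these values, the range test flags 5.0 and the spike test then flags 45.0, while the flag set by the first test is kept.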
Run SaQC
Remember to have your virtual environment activated:
source env_saqc/bin/activate
Via your console, move into the folder you downloaded saqc into:
cd saqc
From here, you can run saqc and tell it to run the tests from the toy config-file on the toy dataset via the -c and -d options:
saqc -c ressources/data/myconfig.csv -d ressources/data/data.csv
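If you prefer to start the run from a Python script, for example to rerun it after each configuration change, the same command can be assembled and launched with the standard library. This is a sketch that uses only the -c and -d options shown above; the function names are hypothetical, and it assumes that the saqc command is on your PATH because the virtual environment is active.

```python
import shlex
import subprocess


def build_saqc_command(config_path, data_path):
    """Return the saqc command line as a list of arguments."""
    return ["saqc", "-c", config_path, "-d", data_path]


def run_saqc(config_path, data_path):
    """Run saqc and raise CalledProcessError on a non-zero exit code."""
    command = build_saqc_command(config_path, data_path)
    print("running:", " ".join(shlex.quote(part) for part in command))
    return subprocess.run(command, check=True)
```

Calling run_saqc("ressources/data/myconfig.csv", "ressources/data/data.csv") from within the saqc folder is then equivalent to typing the command above.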
This produces the following plot:
So, what do we see here?
- The plot shows the data as well as the quality flags that were set by the tests for the variable SM2, as defined in the config-file.
- Following our definition in the config-file, first the range-test that flags all values outside the range [10,60] was executed and after that, the spikes_simpleMad-test to identify spikes in the data.
- In the config, we set the plotting option to True for spikes_simpleMad only. Thus, the plot aggregates all preceding tests (here: range) to black points and highlights the flags of the selected test as red points.
Configure SaQC
Change test parameters
Now you can start to change the settings in the config-file and investigate how this affects the number of data points that are flagged as "BAD". When using your own data, this is how you configure the tests according to your needs. For example, you could modify your myconfig.csv and change the parameters of the range-test:
varname;test;plot
SM2;range(min=-20, max=60);False
SM2;spikes_simpleMad(window="30d", z=3.5);True
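The effect of widening the range can be estimated with a short sketch that counts how many values fall outside each interval. The readings below are hypothetical and chosen only to show the difference between the two settings; count_out_of_range is not a SaQC function.

```python
def count_out_of_range(values, lower, upper):
    """Count the values outside the closed interval [lower, upper]."""
    return sum(1 for value in values if not (lower <= value <= upper))


# Hypothetical readings, not taken from data.csv.
readings = [29.3, 8.5, 29.4, -5.0, 61.2, 29.6]

original = count_out_of_range(readings, lower=10, upper=60)
widened = count_out_of_range(readings, lower=-20, upper=60)
print("flagged with range [10, 60]: ", original)
print("flagged with range [-20, 60]:", widened)
```

Lowering the minimum from 10 to -20 means that the two low readings are no longer flagged by the range test, so only the value above 60 remains out of range.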
Rerunning SaQC as above produces the following plot: