Commit 00bdb3f4 authored by Peter Lünenschloß

merged develop

parents 3da74aeb 1aa24094
5 merge requests: !685 Release 2.4, !684 Release 2.4, !567 Release 2.2.1, !566 Release 2.2, !501 Release 2.1
Pipeline #56137 passed with stage
in 1 minute and 50 seconds
......@@ -74,7 +74,7 @@ coverage:
pages:
stage: deploy
only:
- cookBux
- develop
except:
- schedules
script:
......
# 1.1
## Features
- register is importable from the top level module
- flagIsolated now respects time gaps in addition to value numbers
- Make the comparator argument to isflagged available from the config
## Bugfixes
- Fixed missing constant lookup in the evaluator
- Preserve untouched/checked variables and don't remove them from the data input
## Refactorings
--
## Breaking Changes
--
# 1.2
## Features
- Python 3.8 support
- exe: added the dmp flagger option
- exe: use nodata argument as nodata-representation in output
- flagging functions: implemented flagging function aiming to flag invalid value raises in a given time range
- anaconda support
## Bugfixes
- pass the harmonization function names to the flagger
- variables not listed in the varname column of the configuration file
were not available in generic tests
- Harmonization by interpolation no longer inserts a BAD-flagged but properly interpolated value between two frequency-aligned measurements that are separated by exactly two times the frequency (instead, a BAD-flagged NaN gets inserted, as expected)
- Fixed the "not a frequency" bug, occurring when trying to aggregate values to a 1-unit frequency (1 Day, 1 Hour, ...)
## Refactorings
- configuration reader rework
## Breaking Changes
--
# 1.3
## Features
- spike detection test `spikes_flagRaise`
- spike detection test `spikes_oddWater`
- generic processing function `procGeneric`
## Bugfixes
- configuration: certain whitespace patterns broke the configuration parsing
- configuration: multiple tests in one configuration row were not parsed correctly
- reader: variables only available within the flagger were not transformed correctly
## Refactorings
- Improved logging
## Breaking Changes
- configuration: quoted variable names are handled as regular expressions
- functions: renamed many test functions to a uniform naming scheme
# 1.4
## Features
- added the data processing module `proc_functions`
- `flagCrossValidation` implemented
- CLI: added support for parquet files
## Bugfixes
- `spikes_flagRaise` - overestimation of value courses average fixed
- `spikes_flagRaise` - raise check window now closed on both sides
## Refactorings
- renamed `spikes_oddWater` to `spikes_flagMultivarScores`
- added STRAY auto-thresholding algorithm to `spikes_flagMultivarScores`
- added "unflagging" - postprocess to `spikes_flagMultivarScores`
- improved and extended masking
## Breaking Changes
- register is now a decorator instead of a wrapper
# 1.5
coming soon ...
## Features
## Bugfixes
## Refactorings
## Breaking Changes
# Changelog
This changelog starts with version 2.0.0. Basically all parts of the system, including the format of this changelog, have been reworked between the releases 1.4 and 2.0. Preceding the major breaking release 2.0, the maintenance of this file was rather sloppy, so we won't provide a detailed change history for early versions.
## [Unreleased]
### Added
- The CLI now accepts remote configuration files given by a URL
### Changed
### Removed
### Fixed
- RDM/UFZ logos:
  - use the English versions of the respective images
  - use full URLs instead of the repo-local URLs in README.md
- Fix the README.md code snippets
- Fix version confusion
- `copyField`: fix misleading error message
- `flagGeneric`: fix failure on empty data
## [2.0.0] - 2021-11-25
This release marks the beginning of a new release cycle. Basically the entire system got reworked between versions 1.4 and 2.0; a detailed changelog is not recoverable and/or useful.
......@@ -59,7 +59,7 @@ It is not a shame to name a parameter just `n` or `alpha` etc. if for example th
- testnames: [testmodule_]flagTestName
## Formatting
We use (black)[https://black.readthedocs.io/en/stable/] in its default settings.
We use [black](https://black.readthedocs.io/en/stable/) in its default settings.
Within the `saqc` root directory run `black .`.
## Imports
......
<a href="https://www.ufz.de/index.php?en=33573">
<<<<<<< HEAD
<img src="sphinx-doc/ressources/images/Representative/UFZLogo.png" width="400"/>
</a>
<a href="https://www.ufz.de/index.php?en=45348">
<img src="sphinx-doc/ressources/images/Representative/RDMLogo.png" align="right" width="220"/>
=======
<img src="https://git.ufz.de/rdm-software/saqc/raw/develop/sphinxdoc/ressources/images/Representative/UFZLogo.png" width="400"/>
</a>
<a href="https://www.ufz.de/index.php?en=45348">
<img src="https://git.ufz.de/rdm-software/saqc/raw/develop/sphinxdoc/ressources/images/Representative/RDMLogo.png" align="right" width="220"/>
>>>>>>> develop
</a>
# System for automated Quality Control (SaQC)
......@@ -42,24 +50,34 @@ and a python module with a simple API.
The command line application is controlled by a semicolon-separated text
file listing the variables in the dataset and the routines to inspect,
quality control and/or process them. The content of such a configuration
could look like this:
could look like [this](https://git.ufz.de/rdm-software/saqc/raw/develop/ressources/data/config.csv):
```
varname ; test
#----------;------------------------------------
SM2 ; shiftToFreq(freq="15Min")
SM2 ; flagMissing()
'SM(1|2)+' ; flagRange(min=10, max=60)
SM2 ; flagMad(window="30d", z=3.5)
#----------; -----------------------------------------------------
SM2 ; shift(freq="15Min")
'SM(1|2)+' ; flagMissing()
SM1 ; flagRange(min=10, max=60)
SM2 ; flagRange(min=10, max=40)
SM2 ; flagMAD(window="30d", z=3.5)
Dummy ; flagGeneric(field=["SM1", "SM2"], func=(isflagged(x) | isflagged(y)))
```
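To illustrate the layout of such a configuration, here is a minimal sketch that splits the semicolon-separated rows into variable/test pairs and resolves quoted variable names as regular expressions. This is not SaQC's configuration reader: `parse_config` and `select_columns` are hypothetical helpers, and the use of a full match for quoted names is an assumption made for the example.

```python
import re

# Sample rows in the layout shown above; the content is illustrative only.
CONFIG = """\
varname    ; test
#----------; ------------------------------
SM2        ; shift(freq="15Min")
'SM(1|2)+' ; flagMissing()
SM1        ; flagRange(min=10, max=60)
"""


def parse_config(text):
    """Return (varname, test, is_regex) tuples, skipping header and comment rows.

    Hypothetical helper, not part of SaQC.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment/separator row
        # Split on the first semicolon only, so the test call stays intact.
        varname, _, test = line.partition(";")
        varname, test = varname.strip(), test.strip()
        if varname == "varname":
            continue  # header row
        # Assumption: a name wrapped in matching quotes is a regular expression.
        is_regex = (
            len(varname) >= 2
            and varname[0] == varname[-1]
            and varname[0] in "'\""
        )
        if is_regex:
            varname = varname[1:-1]
        rows.append((varname, test, is_regex))
    return rows


def select_columns(varname, is_regex, columns):
    """Pick the columns a row applies to (full-match semantics are assumed)."""
    if is_regex:
        return [c for c in columns if re.fullmatch(varname, c)]
    return [c for c in columns if c == varname]


rows = parse_config(CONFIG)
for varname, test, is_regex in rows:
    print(select_columns(varname, is_regex, ["SM1", "SM2", "Dummy"]), "->", test)
```

Splitting on the first semicolon keeps the sketch short; a real reader would also have to handle quoting inside the test expression.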
As soon as the basic inputs, dataset and configuration file, are
prepared, `SaQC` is run with:
prepared, run `SaQC`:
```sh
saqc \
--config PATH_TO_CONFIGURATION \
--data PATH_TO_DATA \
--outfile PATH_TO_OUTPUT
```
A full `SaQC` run against provided example data can be invoked with:
```sh
saqc \
--config path_to_configuration.txt \
--data path_to_data.csv \
--outfile path_to_output.csv
--config https://git.ufz.de/rdm-software/saqc/raw/develop/ressources/data/config.csv \
--data https://git.ufz.de/rdm-software/saqc/raw/develop/ressources/data/data.csv \
--outfile saqc_test.csv
```
### SaQC as a python module
......@@ -68,16 +86,22 @@ The following snippet implements the same configuration given above through
the Python-API:
```python
import numpy as np
import pandas as pd
from saqc import SaQC
saqc = (SaQC(data)
.shiftToFreq("SM2", freq="15Min")
.flagMissing("SM2")
.flagRange("SM(1|2)+", regex=True, min=10, max=60)
.flagMad("SM2", window="30d", z=3.5))
data, flags = saqc.getResult()
data = pd.read_csv(
"https://git.ufz.de/rdm-software/saqc/raw/develop/ressources/data/data.csv",
index_col=0, parse_dates=True,
)
saqc = SaQC(data=data)
saqc = (saqc
.shift("SM2", freq="15Min")
.flagMissing("SM(1|2)+", regex=True)
.flagRange("SM1", min=10, max=60)
.flagRange("SM2", min=10, max=40)
.flagMAD("SM2", window="30d", z=3.5)
.flagGeneric(field=["SM1", "SM2"], target="Dummy", func=lambda x, y: (isflagged(x) | isflagged(y))))
```
A more detailed description of the Python API is available in the
......
numpy==1.21.2
pandas==1.3.3
python-dateutil==2.8.1
pytz==2021.1
pandas==1.3.4
python-dateutil==2.8.2
pytz==2021.3
six==1.16.0
Click==8.0.1
Click==8.0.3
dtw==1.4.0
hypothesis==6.23.1
matplotlib==3.4.3
numba==0.54.0
hypothesis==6.29.0
matplotlib==3.5.0
numba==0.54.1
numpy==1.20.3
outlier-utils==0.0.3
pyarrow==4.0.1
pandas==1.3.3
pyarrow==6.0.1
pandas==1.3.4
pytest==6.2.5
pytest-lazy-fixture==0.6.3
scikit-learn==1.0
scipy==1.7.1
typing_extensions==3.10.0.2
scikit-learn==1.0.1
scipy==1.7.3
typing_extensions==4.0.0
seaborn==0.11.2
from setuptools import setup, find_packages
from distutils.util import convert_path
# read the version string from saqc without importing it. See the
# link for a more detailed description of the problem and the solution
# https://stackoverflow.com/questions/2058802/how-can-i-get-the-version-defined-in-setup-py-setuptools-in-my-package
vdict = {}
version_fpath = convert_path("saqc/version.py")
with open(version_fpath) as f:
exec(f.read(), vdict)
version = vdict["__version__"]
with open("README.md", "r") as fh:
long_description = fh.read()
setup(
name="saqc",
version="2.0.0",
version=version,
author="Bert Palm, David Schaefer, Peter Luenenschloss, Lennard Schmidt",
author_email="david.schaefer@ufz.de",
description="Data quality checking and processing tool/framework",
......@@ -20,10 +30,10 @@ setup(
"scipy==1.7.*",
"scikit-learn==1.0.*",
"numba==0.54.*",
"matplotlib==3.4.*",
"matplotlib>=3.4,<3.6",
"Click==8.0.*",
"pyarrow==4.0.*",
"typing_extensions==3.10.*",
"pyarrow==6.0.*",
"typing_extensions==4.*",
"outlier-utils==0.0.3",
"dtw==1.4.*",
"seaborn==0.11.*",
......
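The `setup.py` hunk above reads the version string by executing `saqc/version.py` into a dictionary instead of importing the package. The following sketch shows that technique in isolation. It assumes a version file containing a single `__version__ = "..."` assignment; `read_version` and the temporary file are hypothetical stand-ins for the real layout.

```python
import tempfile
from pathlib import Path


def read_version(path):
    """Execute a version file in an isolated namespace and return __version__.

    Hypothetical helper; it mirrors the exec-based lookup without importing
    the package (and therefore without needing its dependencies installed).
    """
    namespace = {}
    exec(Path(path).read_text(), namespace)
    return namespace["__version__"]


# Stand-in for saqc/version.py, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    version_file = Path(tmp) / "version.py"
    version_file.write_text('__version__ = "2.0.0"\n')
    version = read_version(version_file)

print(version)
```

Because the file is executed rather than imported, this only works as long as the version file itself has no imports that depend on the package being installed.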
......@@ -8,6 +8,7 @@ from dios.dios.dios import DictOfSeries
from saqc.constants import BAD, UNFLAGGED, FILTER_ALL
from saqc.core.flags import Flags
from saqc import SaQC
from saqc.core.register import _isflagged
from saqc.lib.tools import toSequence
from tests.common import initData
......@@ -18,6 +19,19 @@ def data():
return initData()
def test_emptyData():
# test that things do not break with empty data sets
saqc = SaQC(data=pd.DataFrame({"x": [], "y": []}))
saqc.flagGeneric("x", func=lambda x: x < 0)
assert saqc.data.empty
assert saqc.flags.empty
saqc = saqc.processGeneric(field="x", target="y", func=lambda x: x + 2)
assert saqc.data.empty
assert saqc.flags.empty
def test_writeTargetFlagGeneric(data):
params = [
(["tmp"], lambda x, y: pd.Series(True, index=x.index.union(y.index))),
......