merged develop

Commit 00bdb3f4 authored 3 years ago by Peter Lünenschloß
--- a/.gitlab-ci.yml
+++ b/.gitlab-ci.yml
@@ -74,7 +74,7 @@ coverage:
 pages:
  stage: deploy
  only:
-    - cookBux
+    - develop
  except:
    - schedules
  script:

--- a/CHANGELOG.md
+++ b/CHANGELOG.md
-# 1.1
-
-## Features
-- register is importable from the top level module
-- flagIsolated now respects time gaps in addition to value numbers
-- Make the comparator argument to isflagged available from the config
-
-
-## Bugfixes
-- Fixed missing constant lookup in the evaluator
-- Preserve untouched/checked variables and don't remove them from the data input
-
- 
-## Refactorings
--
-
-## Breaking Changes
-- 
-
-# 1.2
-
-## Features
-- Python 3.8 support
-- exe: added the dmp flagger option
-- exe: use nodata argument as nodata-representation in output
-- flagging functions: implemented flagging function aiming to flag invalid value raises in a given time range
-- anaconda support
-
-## Bugfixes
-- pass the harmonization function names to the flagger
-- variables not listed in the varname column of the configuration file
-  were not available in generic tests
-- Harmonization by interpolation will no longer insert a BAD-flagged but properly interpolated value between two frequency-aligned measurements that are separated by exactly two times the frequency (instead, a BAD-flagged NaN gets inserted, as expected)
-- Fixed the "not a frequency" bug occurring when trying to aggregate values to a 1-unit frequency (1 Day, 1 Hour, ...)
-
-## Refactorings
-- configuration reader rework
-
-## Breaking Changes
-- 
-
-# 1.3
-
-## Features
-- spike detection test `spikes_flagRaise`
-- spike detection test `spikes_oddWater`
-- generic processing function `procGeneric`
-
-## Bugfixes
-- configuration: certain whitespace patterns broke the configuration parsing
-- configuration: multiple tests in one configuration row were not parsed correctly
-- reader: variables only available within the flagger were not transformed correctly
-
-## Refactorings
-- Improved logging
-
-## Breaking Changes
-- configuration: quoted variable names are handled as regular expressions
-- functions: renamed many test functions to a uniform naming scheme
-
-
-# 1.4
-
-## Features
-- added the data processing module `proc_functions`
-- `flagCrossValidation` implemented
-- CLI: added support for parquet files
-
-## Bugfixes
-- `spikes_flagRaise` - overestimation of value courses average fixed
-- `spikes_flagRaise` - raise check window now closed on both sides
-
-## Refactorings
-- renamed `spikes_oddWater` to `spikes_flagMultivarScores`
-- added STRAY auto thresholding algorithm to `spikes_flagMultivarScores`
-- added "unflagging" - postprocess to `spikes_flagMultivarScores`
-- improved and extended masking
-
-## Breaking Changes
-- register is now a decorator instead of a wrapper
-
-# 1.5
-
-coming soon ...
-
-## Features
-
-## Bugfixes
-
-## Refactorings
-
-## Breaking Changes
+# Changelog
+
+This changelog starts with version 2.0.0. Basically all parts of the system, including the format of this changelog, have been reworked between the releases 1.4 and 2.0. Preceding the major breaking release 2.0, the maintenance of this file was rather sloppy, so we won't provide a detailed change history for early versions.
+
+
+## [Unreleased]
+### Added
+- The CLI now accepts remote configuration files given by a URL
+### Changed
+### Removed
+### Fixed
+- RDM/UFZ logos:
+  - use the English versions of the respective images
+  - use full urls instead of the repo local urls in README.md
+- Fix the README.md code snippets
+- Fix version confusion
+- `copyField`: fix misleading error message
+- `flagGeneric`: fix failure on empty data
+
+## [2.0.0] - 2021-11-25
+This release marks the beginning of a new release cycle. Basically the entire system got reworked between versions 1.4 and 2.0, so a detailed changelog is not recoverable and/or useful.
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -59,7 +59,7 @@ It is not a shame to name a parameter just `n` or `alpha` etc. if for example th
 - testnames: [testmodule_]flagTestName
 
 ## Formatting
-We use (black)[https://black.readthedocs.io/en/stable/] in its default settings.
+We use [black](https://black.readthedocs.io/en/stable/) in its default settings.
 Within the `saqc` root directory run `black .`.

 ## Imports

--- a/README.md
+++ b/README.md
 <a href="https://www.ufz.de/index.php?en=33573">
+<<<<<<< HEAD
    <img src="sphinx-doc/ressources/images/Representative/UFZLogo.png" width="400"/>
 </a>

 <a href="https://www.ufz.de/index.php?en=45348">
    <img src="sphinx-doc/ressources/images/Representative/RDMLogo.png" align="right" width="220"/>
+=======
+    <img src="https://git.ufz.de/rdm-software/saqc/raw/develop/sphinxdoc/ressources/images/Representative/UFZLogo.png" width="400"/>
+</a>
+
+<a href="https://www.ufz.de/index.php?en=45348">
+    <img src="https://git.ufz.de/rdm-software/saqc/raw/develop/sphinxdoc/ressources/images/Representative/RDMLogo.png" align="right" width="220"/>
+>>>>>>> develop
 </a>

 # System for automated Quality Control (SaQC)
@@ -42,24 +50,34 @@ and a python module with a simple API.
 The command line application is controlled by a semicolon-separated text
 file listing the variables in the dataset and the routines to inspect,
 quality control and/or process them. The content of such a configuration
-could look like this:
+could look like [this](https://git.ufz.de/rdm-software/saqc/raw/develop/ressources/data/config.csv):

 ```
 varname    ; test
-#----------;------------------------------------
-SM2        ; shiftToFreq(freq="15Min")
-SM2        ; flagMissing()
-'SM(1|2)+' ; flagRange(min=10, max=60)
-SM2        ; flagMad(window="30d", z=3.5)
+#----------; -----------------------------------------------------
+SM2        ; shift(freq="15Min")
+'SM(1|2)+' ; flagMissing()
+SM1        ; flagRange(min=10, max=60)
+SM2        ; flagRange(min=10, max=40)
+SM2        ; flagMAD(window="30d", z=3.5)
+Dummy      ; flagGeneric(field=["SM1", "SM2"], func=(isflagged(x) | isflagged(y)))
 ```
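
To make the layout of the configuration above concrete, here is a minimal sketch that splits such a file into `(varname, test)` pairs with plain Python. It is not SaQC's configuration reader: `parse_config` is a hypothetical helper, and treating lines that start with `#` as comments and the first remaining row as the header are assumptions read off the example, not documented behaviour.

```python
def parse_config(text):
    """Split a semicolon-separated configuration into (varname, test) pairs.

    Hypothetical helper for illustration only; it is not part of SaQC.
    """
    rows = []
    header_seen = False
    for raw in text.splitlines():
        line = raw.strip()
        # Assumption: blank lines and lines starting with '#' carry no tests.
        if not line or line.startswith("#"):
            continue
        # Split on the first ';' only, so the test expression stays intact.
        varname, test = (part.strip() for part in line.split(";", 1))
        if not header_seen:
            # Assumption: the first remaining row is the 'varname ; test' header.
            header_seen = True
            continue
        rows.append((varname, test))
    return rows


example = """
varname    ; test
#----------; -----------------------------------------------------
SM2        ; shift(freq="15Min")
'SM(1|2)+' ; flagMissing()
SM1        ; flagRange(min=10, max=60)
"""

for varname, test in parse_config(example):
    print(varname, "->", test)
```

A row without a `;` raises a `ValueError` in this sketch; a real reader would need friendlier error handling.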

 As soon as the basic inputs, dataset and configuration file, are
-prepared, `SaQC` is run with:
+prepared, run `SaQC`:
+```sh
+saqc \
+    --config PATH_TO_CONFIGURATION \
+    --data PATH_TO_DATA \
+    --outfile PATH_TO_OUTPUT
+```
+
+A full `SaQC` run against provided example data can be invoked with:
 ```sh
 saqc \
-    --config path_to_configuration.txt \
-    --data path_to_data.csv \
-    --outfile path_to_output.csv
+    --config https://git.ufz.de/rdm-software/saqc/raw/develop/ressources/data/config.csv \
+    --data https://git.ufz.de/rdm-software/saqc/raw/develop/ressources/data/data.csv \
+    --outfile saqc_test.csv
 ```

 ### SaQC as a python module
@@ -68,16 +86,22 @@ The following snippet implements the same configuration given above through
 the Python-API:

 ```python
-import numpy as np
+import pandas as pd
 from saqc import SaQC

-saqc = (SaQC(data)
-        .shiftToFreq("SM2", freq="15Min")
-        .flagMissing("SM2")
-        .flagRange("SM(1|2)+", regex=True, min=10, max=60)
-        .flagMad("SM2", window="30d", z=3.5))
-
-data, flags = saqc.getResult()
+data = pd.read_csv(
+    "https://git.ufz.de/rdm-software/saqc/raw/develop/ressources/data/data.csv",
+    index_col=0, parse_dates=True,
+)
+
+saqc = SaQC(data=data)
+saqc = (saqc
+        .shift("SM2", freq="15Min")
+        .flagMissing("SM(1|2)+", regex=True)
+        .flagRange("SM1", min=10, max=60)
+        .flagRange("SM2", min=10, max=40)
+        .flagMAD("SM2", window="30d", z=3.5)
+        .flagGeneric(field=["SM1", "SM2"], target="Dummy", func=lambda x, y: (isflagged(x) | isflagged(y))))
 ```

 A more detailed description of the Python API is available in the 

--- a/dios/requirements.txt
+++ b/dios/requirements.txt
 numpy==1.21.2
-pandas==1.3.3
-python-dateutil==2.8.1
-pytz==2021.1
+pandas==1.3.4
+python-dateutil==2.8.2
+pytz==2021.3
 six==1.16.0
--- a/requirements.txt
+++ b/requirements.txt
-Click==8.0.1
+Click==8.0.3
 dtw==1.4.0
-hypothesis==6.23.1
-matplotlib==3.4.3
-numba==0.54.0
+hypothesis==6.29.0
+matplotlib==3.5.0
+numba==0.54.1
 numpy==1.20.3
 outlier-utils==0.0.3
-pyarrow==4.0.1
-pandas==1.3.3
+pyarrow==6.0.1
+pandas==1.3.4
 pytest==6.2.5
 pytest-lazy-fixture==0.6.3
-scikit-learn==1.0
-scipy==1.7.1
-typing_extensions==3.10.0.2
+scikit-learn==1.0.1
+scipy==1.7.3
+typing_extensions==4.0.0
 seaborn==0.11.2
--- a/setup.py
+++ b/setup.py
 from setuptools import setup, find_packages
+from distutils.util import convert_path
+
+# read the version string from saqc without importing it. See the
+# link for a more detailed description of the problem and the solution
+# https://stackoverflow.com/questions/2058802/how-can-i-get-the-version-defined-in-setup-py-setuptools-in-my-package
+vdict = {}
+version_fpath = convert_path("saqc/version.py")
+with open(version_fpath) as f:
+    exec(f.read(), vdict)
+version = vdict["__version__"]

 with open("README.md", "r") as fh:
    long_description = fh.read()

 setup(
    name="saqc",
-    version="2.0.0",
+    version=version,
    author="Bert Palm, David Schaefer, Peter Luenenschloss, Lennard Schmidt",
    author_email="david.schaefer@ufz.de",
    description="Data quality checking and processing tool/framework",
@@ -20,10 +30,10 @@ setup(
        "scipy==1.7.*",
        "scikit-learn==1.0.*",
        "numba==0.54.*",
-        "matplotlib==3.4.*",
+        "matplotlib>=3.4,<3.6",
        "Click==8.0.*",
-        "pyarrow==4.0.*",
-        "typing_extensions==3.10.*",
+        "pyarrow==6.0.*",
+        "typing_extensions==4.*",
        "outlier-utils==0.0.3",
        "dtw==1.4.*",
        "seaborn==0.11.*",
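
The version lookup added to `setup.py` above executes the version file and reads `__version__` from the resulting namespace, so the package itself never has to be imported. A minimal, self-contained sketch of that pattern follows; the `read_version` helper and the temporary stand-in file with its `"2.0.0"` string are assumptions made for illustration, not the contents of `saqc/version.py`.

```python
import tempfile
from pathlib import Path


def read_version(path):
    """Return __version__ from a Python file without importing its package."""
    namespace = {}
    # Execute the file's source in an isolated dict instead of importing it.
    exec(Path(path).read_text(), namespace)
    return namespace["__version__"]


# Stand-in for a package's version file (hypothetical content).
with tempfile.TemporaryDirectory() as tmp:
    version_file = Path(tmp) / "version.py"
    version_file.write_text('__version__ = "2.0.0"\n')
    print(read_version(version_file))
```

Because the file is executed rather than imported, this only works as long as the version file has no imports or side effects of its own.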

--- a/tests/funcs/test_generic_api_functions.py
+++ b/tests/funcs/test_generic_api_functions.py
@@ -8,6 +8,7 @@ from dios.dios.dios import DictOfSeries
 from saqc.constants import BAD, UNFLAGGED, FILTER_ALL
 from saqc.core.flags import Flags
 from saqc import SaQC
+from saqc.core.register import _isflagged
 from saqc.lib.tools import toSequence

 from tests.common import initData
@@ -18,6 +19,19 @@ def data():
    return initData()


+def test_emptyData():
+    # test that things do not break with empty data sets
+    saqc = SaQC(data=pd.DataFrame({"x": [], "y": []}))
+
+    saqc.flagGeneric("x", func=lambda x: x < 0)
+    assert saqc.data.empty
+    assert saqc.flags.empty
+
+    saqc = saqc.processGeneric(field="x", target="y", func=lambda x: x + 2)
+    assert saqc.data.empty
+    assert saqc.flags.empty
+
+
 def test_writeTargetFlagGeneric(data):
    params = [
        (["tmp"], lambda x, y: pd.Series(True, index=x.index.union(y.index))),