Compare revisions

Changes are shown as if the source revision was being merged into the target revision.

Commits on Source (12)
Showing with 194 additions and 76 deletions
......@@ -9,13 +9,18 @@ SPDX-License-Identifier: GPL-3.0-or-later
This changelog starts with version 2.0.0. Basically all parts of the system, including the format of this changelog, have been reworked between the releases 1.4 and 2.0. Preceding the major breaking release 2.0, the maintenance of this file was rather sloppy, so we won't provide a detailed change history for early versions.
## [Unreleased]
## Unreleased
[List of commits](https://git.ufz.de/rdm-software/saqc/-/compare/v2.0.1...develop)
### Added
### Changed
- `flagOffset` parameters `thresh` and `thresh_relative` are now both optional
### Removed
### Fixed
- `flagOffset` bug with zero-valued threshold
## [2.0.1] - 2021-12-20
## [2.0.1](https://git.ufz.de/rdm-software/saqc/-/tags/v2.0.1) - 2021-12-20
[List of commits](https://git.ufz.de/rdm-software/saqc/-/compare/v2.0.0...v2.0.1)
### Added
- CLI now accepts remote configuration and data files as URLs
- new function `transferFlags`
......@@ -41,5 +46,7 @@ This changelog starts with version 2.0.0. Basically all parts of the system, inc
- `field` was not masked for resampling functions
- allow custom registered functions to overwrite built-ins.
## [2.0.0] - 2021-11-25
## [2.0.0](https://git.ufz.de/rdm-software/saqc/-/tags/v2.0.0) - 2021-11-25
[List of commits](https://git.ufz.de/rdm-software/saqc/-/compare/v1.5.0...v2.0.0)
This release marks the beginning of a new release cycle. Basically the entire system was reworked between versions 1.4 and 2.0; a detailed changelog is neither recoverable nor useful.
......@@ -3,7 +3,7 @@ title: SaQC - System for automated Quality Control
message: "Please cite this software using these metadata."
type: software
version: 2.0.0
doi:
doi: https://doi.org/10.5281/zenodo.5888547
date-released: "2021-11-25"
license: "GPL-3.0"
repository-code: "https://git.ufz.de/rdm-software/saqc"
......@@ -24,7 +24,7 @@ authors:
affiliation: >-
Helmholtz Centre for Environmental Research -
UFZ
orcid: 'https://orcid.org/0000-0000-0000-0000'
orcid: 'https://orcid.org/0000-0001-5106-9057'
- given-names: Peter
family-names: Lünenschloß
email: peter.luenenschloss@ufz.de
......
......@@ -24,41 +24,38 @@ We implement the following naming conventions:
### Argument names in public function signatures
first, its not necessary to have *talking* arg-names, in contrast to variable names in
code. This is, because one always must read the documentation. To use and parameterize a function,
just by guessing the meaning of the argument names and not read the docs,
will almost never work. thats why, we dont have the obligation to make names (very)
First, in contrast to variable names in code, it is not necessary to have *talking* function argument names.
A user is always expected to have read the documentation. Using and parameterizing a function
just by guessing the meaning of the argument names, without having read the documentation,
will almost never work. That is why we are not obliged to make names (very)
talkative.
second, because of the nature of a function (to have a *simple* way to use complex code),
its common to use simple and short names. This means, to omit any *irrelevant* information.
Second, since the purpose of a function is to provide a *simple* way of using complex code, simple and short names are preferable. This means omitting the encoding of *irrelevant* information in names.
For example if we have a function that fit a polynomial on some data with three arguments.
For example, take a function of three arguments that fits a polynomial to some data.
Let's say we have:
- the data input,
- a threshold that defines a cutoff point for a calculation on a polynomial and
- a third argument.
one could name the args `data, poly_cutoff_threshold, ...`, but much better names would
be `data, thresh, ...`, because a caller dont need the extra information,
stuffed in the name.
One could name the corresponding arguments `data, poly_cutoff_threshold, ...`. However, much better names would
be `data, thresh, ...`, because a caller who is aware of a function's documentation doesn't need the extra information
encoded in the name.
If the third argument is also some kind of threshold,
one can use `data, cutoff, thresh`, because the *thresh-* information of the `cutoff`
parameter is not crucial and the caller knows that this is a threshold from the docstring.
parameter is not crucial, and the caller knows that this is a threshold from having studied the docstring anyway.
third, underscores give a nice feedback if one doing wrong or over complex.
No underscore is fine, one underscore is ok, if the information is *really necessary* (see above),
but if one use two or more underscores, one should think of a better naming,
or omit some information.
Sure, seldom but sometimes it is necessary to use 2 underscores, but we consider it as bad style.
Using 3 or more underscores, is not allowed unless have write an reasoning and get it
signed by at least as many core developers as underscores one want to use.
Third, underscores give nice implicit feedback on whether a name is going wrong or getting overly complex.
Having no underscore is just fine. One underscore is OK, if the information appended through it is *really necessary* (see above).
If one uses two or more underscores, one should think of a better name or omit some information.
Sure, it might occasionally be necessary to use two underscores, but this is still considered bad style.
Using three or more underscores is not allowed, unless an exhaustive reasoning has been written down and accepted by at least one core developer per underscore.
In short the naming should *give a very, very rough idea* of the purpose of the argument,
In short, the naming should *give a very, very rough idea* of the purpose of the argument,
but not *explain* the usage or the purpose.
It is not a shame to name a parameter just `n` or `alpha` etc. if for example the algorithm
(from the paper etc.) name it alike.
It is not a shame to name a parameter just `n` or `alpha` etc., if, for example, the algorithm
(from the paper etc.) names it alike.
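To make this concrete, here is a minimal, purely illustrative sketch. The signatures are invented for this example (only `fitPolynomial`, `flagJumps` and `min_periods` appear in the code base; the remaining names are made up):

```python
# Discouraged: the argument names encode information the docstring already provides.
def fitPolynomial(data, poly_cutoff_threshold, residues_outlier_threshold):
    ...


# Preferred: short names, fully explained in the docstring.
def fitPolynomial(data, cutoff, thresh):
    ...


# One underscore is OK, if the appended information is really necessary.
def flagJumps(field, window, min_periods=1):
    ...
```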
### Test Functions
......
......@@ -112,9 +112,14 @@ of the documentation.
## Changelog
All notable changes to this project will be documented in [CHANGELOG.md](CHANGELOG.md).
## Contributing
## Get involved
### Contributing
Found a bug, or want to suggest a cool new feature? Please refer to our [contributing guidelines](CONTRIBUTING.md) to see how you can contribute to SaQC.
### User support
If you need help or have a question, you can use the SaQC user support mailing list: [saqc-support@ufz.de](mailto:saqc-support@ufz.de)
## Copyright and License
Copyright(c) 2021, [Helmholtz-Zentrum für Umweltforschung GmbH -- UFZ](https://www.ufz.de). All rights reserved.
......@@ -127,8 +132,9 @@ For full details, see [LICENSE](LICENSE.md).
...
## Publications
...
coming soon...
## How to cite SaQC
...
If SaQC is advancing your research, please cite as:
> Schäfer, David; Palm, Bert; Lünenschloß, Peter. (2021). System for automated Quality Control - SaQC. Zenodo. https://doi.org/10.5281/zenodo.5888547
......@@ -121,6 +121,6 @@ if __name__ == "__main__":
# t1 = time.time()
# print(t1-t0)
rr = [10 ** r for r in range(1, 6)]
rr = [10**r for r in range(1, 6)]
c = range(10, 60, 10)
gen_all(rr, c)
......@@ -36,7 +36,7 @@ if __name__ == "__main__":
)
def f(s):
sec = 10 ** 9
sec = 10**9
s.index = pd.to_datetime(s.index * sec)
return s
......
......@@ -72,7 +72,7 @@ def df_unaligned__():
def dios_fuzzy__(nr_cols=None, mincol=0, maxcol=10, itype=None):
nr_of_cols = nr_cols if nr_cols else randint(mincol, maxcol + 1)
ns = 10 ** 9
ns = 10**9
sec_per_year = 31536000
ITYPES = [IntItype, FloatItype, DtItype, ObjItype]
......
......@@ -23,7 +23,7 @@ class Breaks:
gap_window: str,
group_window: str,
flag: float = BAD,
**kwargs
**kwargs,
) -> saqc.SaQC:
return self._defer("flagIsolated", locals())
......@@ -34,6 +34,6 @@ class Breaks:
window: str,
min_periods: int = 1,
flag: float = BAD,
**kwargs
**kwargs,
) -> saqc.SaQC:
return self._defer("flagJumps", locals())
......@@ -20,7 +20,7 @@ class Constants:
maxna: int = None,
maxna_group: int = None,
flag: float = BAD,
**kwargs
**kwargs,
) -> saqc.SaQC:
return self._defer("flagByVariance", locals())
......
......@@ -22,6 +22,6 @@ class Curvefit:
window: Union[int, str],
order: int,
min_periods: int = 0,
**kwargs
**kwargs,
) -> saqc.SaQC:
return self._defer("fitPolynomial", locals())
......@@ -32,7 +32,7 @@ class Drift:
/ len(x),
method: LinkageString = "single",
flag: float = BAD,
**kwargs
**kwargs,
) -> saqc.SaQC:
return self._defer("flagDriftFromNorm", locals())
......@@ -47,7 +47,7 @@ class Drift:
)
/ len(x),
flag: float = BAD,
**kwargs
**kwargs,
) -> saqc.SaQC:
return self._defer("flagDriftFromReference", locals())
......@@ -57,7 +57,7 @@ class Drift:
maintenance_field: str,
model: Callable[..., float] | Literal["linear", "exponential"],
cal_range: int = 5,
**kwargs
**kwargs,
) -> saqc.SaQC:
return self._defer("correctDrift", locals())
......@@ -68,7 +68,7 @@ class Drift:
model: CurveFitter,
tolerance: Optional[str] = None,
epoch: bool = False,
**kwargs
**kwargs,
) -> saqc.SaQC:
return self._defer("correctRegimeAnomaly", locals())
......@@ -80,6 +80,6 @@ class Drift:
window: str,
min_periods: int,
tolerance: Optional[str] = None,
**kwargs
**kwargs,
) -> saqc.SaQC:
return self._defer("correctOffset", locals())
......@@ -26,6 +26,6 @@ class Noise:
sub_thresh: float = None,
min_periods: int = None,
flag: float = BAD,
**kwargs
**kwargs,
) -> saqc.SaQC:
return self._defer("flagByStatLowPass", locals())
......@@ -77,9 +77,9 @@ class Outliers:
def flagOffset(
self,
field: str,
thresh: float,
tolerance: float,
window: Union[int, str],
thresh: Optional[float] = None,
thresh_relative: Optional[float] = None,
flag: float = BAD,
**kwargs,
......
......@@ -20,6 +20,6 @@ class Pattern:
normalize=True,
plot=False,
flag=BAD,
**kwargs
**kwargs,
) -> saqc.SaQC:
return self._defer("flagPatternByDTW", locals())
......@@ -24,7 +24,7 @@ class Residues:
window: Union[str, int],
order: int,
min_periods: Optional[int] = 0,
**kwargs
**kwargs,
) -> saqc.SaQC:
return self._defer("calculatePolynomialResidues", locals())
......@@ -35,6 +35,6 @@ class Residues:
func: Callable[[pd.Series], np.ndarray] = np.mean,
min_periods: Optional[int] = 0,
center: bool = True,
**kwargs
**kwargs,
) -> saqc.SaQC:
return self._defer("calculateRollingResidues", locals())
......@@ -20,6 +20,6 @@ class Transformation:
field: str,
func: Callable[[pd.Series], pd.Series],
freq: Optional[Union[float, str]] = None,
**kwargs
**kwargs,
) -> saqc.SaQC:
return self._defer("transform", locals())
......@@ -102,7 +102,7 @@ class PositionalScheme(TranslationScheme):
thist = flags.history[field].hist.replace(self._BACKWARD).astype(int)
# concatenate the single flag values
ncols = thist.shape[-1]
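# each test's flag maps to one decimal digit; e.g. with ncols = 3, a row of
# per-test flags [2, 0, 1] becomes 9 * 10**3 + 2*100 + 0*10 + 1*1 = 9201
# (the leading 9 keeps zero-valued leading digits from being lost)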
init = 9 * 10 ** ncols
init = 9 * 10**ncols
bases = 10 ** np.arange(ncols - 1, -1, -1)
tflags = init + (thist * bases).sum(axis=1)
......
......@@ -33,7 +33,7 @@ def fitPolynomial(
window: int | str,
order: int,
min_periods: int = 0,
**kwargs
**kwargs,
) -> Tuple[DictOfSeries, Flags]:
"""
Fits a polynomial model to the data.
......@@ -117,7 +117,7 @@ def _fitPolynomial(
set_flags: bool = True,
min_periods: int = 0,
return_residues: bool = False,
**kwargs
**kwargs,
) -> Tuple[DictOfSeries, Flags]:
# TODO: some (rather large) parts are functional similar to saqc.funcs.rolling.roll
......
......@@ -243,12 +243,12 @@ def _evalStrayLabels(
x = test_slice.index.values.astype(float)
x_0 = x[0]
x = (x - x_0) / 10 ** 12
x = (x - x_0) / 10**12
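# note: datetime64[ns] index values cast to float are nanosecond timestamps;
# shifting by x_0 and dividing by 10**12 rescales them, presumably to keep
# the polynomial fit numerically well conditioned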
polyfitted = poly.polyfit(y=test_slice.values, x=x, deg=polydeg)
testval = poly.polyval(
(float(index[1].to_numpy()) - x_0) / 10 ** 12, polyfitted
(float(index[1].to_numpy()) - x_0) / 10**12, polyfitted
)
testval = val_frame[var][index[1]] - testval
......@@ -878,27 +878,29 @@ def flagOffset(
data: DictOfSeries,
field: str,
flags: Flags,
thresh: float,
tolerance: float,
window: Union[int, str],
thresh: Optional[float] = None,
thresh_relative: Optional[float] = None,
flag: float = BAD,
**kwargs,
) -> Tuple[DictOfSeries, Flags]:
"""
A basic outlier test that work on regular and irregular sampled data
A basic outlier test that works on regularly and irregularly sampled data.
The test classifies values/value courses as outliers by detecting not only a rise
in value, but also, checking for a return to the initial value level.
in value, but also by checking for a return to the initial value level.
Values :math:`x_n, x_{n+1}, .... , x_{n+k}` of a timeseries :math:`x` with
associated timestamps :math:`t_n, t_{n+1}, .... , t_{n+k}` are considered spikes, if
1. :math:`|x_{n-1} - x_{n + s}| >` `thresh`, for all :math:`s \\in [0,1,2,...,k]`
2. :math:`|x_{n-1} - x_{n+k+1}| <` `tolerance`
2. :math:`(x_{n + s} - x_{n - 1}) / x_{n - 1} >` `thresh_relative`
3. :math:`|x_{n-1} - x_{n+k+1}| <` `tolerance`
3. :math:`|t_{n-1} - t_{n+k+1}| <` `window`
4. :math:`|t_{n-1} - t_{n+k+1}| <` `window`
Note, that this definition of a "spike" not only includes one-value outliers, but
also plateau-ish value courses.
......@@ -911,15 +913,19 @@ def flagOffset(
The field in data.
flags : saqc.Flags
Container to store flags of the data.
thresh : float
Minimum difference between to values, to consider the latter one as a spike. See condition (1)
tolerance : float
Maximum difference between pre-spike and post-spike values. See condition (2)
Maximum difference allowed between the value directly preceding an offset and the value directly succeeding it,
for the values forming the offset to be flagged.
See condition (3).
window : {str, int}, default '15min'
Maximum length of "spiky" value courses. See condition (3). Integer defined window length are only allowed for
regularly sampled timeseries.
Maximum length allowed for offset value courses, for the values forming the offset to be flagged.
See condition (4). Integer-defined window lengths are only allowed for regularly sampled timeseries.
thresh : {float, None}, default None
Minimum difference between a value and its successors, to consider the successors an anomalous offset group.
See condition (1). If None is passed, condition (1) is not tested.
thresh_relative : {float, None}, default None
Relative threshold.
Minimum relative change between a value and its successors, to consider the successors an anomalous offset group.
See condition (2). If None is passed, condition (2) is not tested.
flag : float, default BAD
flag to set.
......@@ -931,6 +937,99 @@ def flagOffset(
The quality flags of data
Flags values may have changed, relative to the flags input.
Examples
--------
.. plot::
:context:
:include-source: False
import matplotlib
import saqc
import pandas as pd
import numpy as np
data = pd.DataFrame({'data':np.array([5,5,8,16,17,7,4,4,4,1,1,4])}, index=pd.date_range('2000',freq='1H', periods=12))
Let's generate a simple, regularly sampled timeseries with an hourly sampling rate and create an
:py:class:`saqc.SaQC` instance from it.
.. doctest:: flagOffsetExample
>>> data = pd.DataFrame({'data':np.array([5,5,8,16,17,7,4,4,4,1,1,4])}, index=pd.date_range('2000',freq='1H', periods=12))
>>> data
data
2000-01-01 00:00:00 5
2000-01-01 01:00:00 5
2000-01-01 02:00:00 8
2000-01-01 03:00:00 16
2000-01-01 04:00:00 17
2000-01-01 05:00:00 7
2000-01-01 06:00:00 4
2000-01-01 07:00:00 4
2000-01-01 08:00:00 4
2000-01-01 09:00:00 1
2000-01-01 10:00:00 1
2000-01-01 11:00:00 4
>>> qc = saqc.SaQC(data)
Now we apply :py:meth:`~saqc.SaQC.flagOffset`, trying to flag offset courses that do not extend longer than
*6 hours* in time (``window``), that have an initial value jump higher than *2* (``thresh``), and that return
to the initial value level within a tolerance of *1.5* (``tolerance``).
.. doctest:: flagOffsetExample
>>> qc = qc.flagOffset("data", thresh=2, tolerance=1.5, window='6H')
>>> qc.plot('data') # doctest:+SKIP
.. plot::
:context: close-figs
:include-source: False
>>> qc = saqc.SaQC(data)
>>> qc = qc.flagOffset("data", thresh=2, tolerance=1.5, window='6H')
>>> qc.plot('data')
Note that both negative and positive jumps are considered starting points of negative or positive offsets.
If you want to impose the additional condition that the initial value jump must exceed *+90%* of the value level,
you can additionally set the ``thresh_relative`` parameter:
.. doctest:: flagOffsetExample
>>> qc = qc.flagOffset("data", thresh=2, thresh_relative=.9, tolerance=1.5, window='6H')
>>> qc.plot('data') # doctest:+SKIP
.. plot::
:context: close-figs
:include-source: False
>>> qc = saqc.SaQC(data)
>>> qc = qc.flagOffset("data", thresh=2, thresh_relative=.9, tolerance=1.5, window='6H')
>>> qc.plot('data')
Now only positive jumps that exceed a value gain of *+90%* are considered starting points of offsets.
In the same way, you can target only negative offsets by setting a negative relative threshold. The example below
only flags offsets that fall off by at least *50%* in value, with an absolute value drop of at least *2*.
.. doctest:: flagOffsetExample
>>> qc = qc.flagOffset("data", thresh=2, thresh_relative=-.5, tolerance=1.5, window='6H')
>>> qc.plot('data') # doctest:+SKIP
.. plot::
:context: close-figs
:include-source: False
>>> qc = saqc.SaQC(data)
>>> qc = qc.flagOffset("data", thresh=2, thresh_relative=-.5, tolerance=1.5, window='6H')
>>> qc.plot('data')
References
----------
The implementation is a time-window based version of an outlier test from the UFZ Python library,
......@@ -939,6 +1038,12 @@ def flagOffset(
https://git.ufz.de/chs/python/blob/master/ufz/level1/spike.py
"""
if (thresh is None) and (thresh_relative is None):
raise ValueError(
"At least one of parameters 'thresh' and 'thresh_relative' has to be given. Got 'thresh'=None, "
"'thresh_relative'=None instead."
)
dataseries = data[field].dropna()
if dataseries.empty:
return data, flags
......@@ -954,19 +1059,19 @@ def flagOffset(
window = delta * window
if not delta:
raise TypeError(
"Only offset string defined window sizes allowed for irrgegularily sampled timeseries"
"Only offset string defined window sizes allowed for timeseries not sampled regularly."
)
# get all the entries preceding a significant jump
if thresh:
if thresh is not None:
post_jumps = dataseries.diff().abs() > thresh
if thresh_relative:
if thresh_relative is not None:
s = np.sign(thresh_relative)
rel_jumps = s * (dataseries.shift(1) - dataseries).div(dataseries.abs()) > abs(
thresh_relative
)
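# when both thresholds are given, a jump has to exceed the absolute *and* the relative threshold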
if thresh:
if thresh is not None:
post_jumps = rel_jumps & post_jumps
else:
post_jumps = rel_jumps
......@@ -982,11 +1087,13 @@ def flagOffset(
).dropna()
to_roll = dataseries[to_roll]
if thresh_relative:
if thresh_relative is not None:
def spikeTester(chunk, thresh=abs(thresh_relative), tol=tolerance):
def spikeTester(
chunk, thresh_r=abs(thresh_relative), thresh_a=thresh or 0, tol=tolerance
):
jump = chunk[-2] - chunk[-1]
thresh = thresh * abs(jump)
thresh = max(thresh_r * abs(chunk[-1]), thresh_a)
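# the max() above picks the stricter of the two criteria: the relative
# threshold rescaled by the current value level, and the absolute threshold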
chunk_stair = (np.sign(jump) * (chunk - chunk[-1]) < thresh)[::-1].cumsum()
initial = np.searchsorted(chunk_stair, 2)
if initial == len(chunk):
......
......@@ -5,8 +5,9 @@
# SPDX-License-Identifier: GPL-3.0-or-later
# -*- coding: utf-8 -*-
from __future__ import annotations
from typing import Optional, Tuple, Union
from typing import Optional, Tuple
from typing_extensions import Literal
import numpy as np
......@@ -264,7 +265,7 @@ def plot(
flags: Flags,
path: Optional[str] = None,
max_gap: Optional[str] = None,
history: Optional[Literal["valid", "complete", "clear"]] = "valid",
history: Optional[Literal["valid", "complete"] | list] = "valid",
xscope: Optional[slice] = None,
phaseplot: Optional[str] = None,
store_kwargs: Optional[dict] = None,
......@@ -304,14 +305,14 @@ def plot(
before plotting. If an offset string is passed, only points that have a distance
below `max_gap` get connected via the plotting line.
history : {"valid", "complete", None}, default "valid"
history : {"valid", "complete", None, list of strings}, default "valid"
Discriminate the plotted flags with respect to the tests they originate from.
* "valid" - Only plot those flags, that do not get altered or "unflagged" by subsequent tests. Only list tests
in the legend, that actually contributed flags to the overall resault.
* "complete" - plot all the flags set and list all the tests ran on a variable. Suitable for debugging/tracking.
* "clear" - clear plot from all the flagged values
* None - just plot the resulting flags for one variable, without any historical meta information.
* list of strings - plot only flags set by those tests listed.
xscope : slice or Offset, default None
Parameter that determines a chunk of the data to be plotted
......@@ -328,7 +329,7 @@ def plot(
ax_kwargs : dict, default {}
Axis keywords. Change the axis labeling defaults. Most important keywords:
'x_label', 'y_label', 'title', 'fontsize'.
'x_label', 'y_label', 'title', 'fontsize', 'cycleskip'.
"""
......