diff --git a/docs/cookbooks/DriftDetection.rst b/docs/cookbooks/DriftDetection.rst
deleted file mode 100644
index 145247e74529845c9224c94b61a98f18a5558cbe..0000000000000000000000000000000000000000
--- a/docs/cookbooks/DriftDetection.rst
+++ /dev/null
@@ -1,233 +0,0 @@
-.. SPDX-FileCopyrightText: 2021 Helmholtz-Zentrum für Umweltforschung GmbH - UFZ
-..
-.. SPDX-License-Identifier: GPL-3.0-or-later
-
-
-Drift Detection
-===============
-
-
-
-Overview
---------
-
-The guide briefly introduces the usage of the :py:meth:`~saqc.SaQC.flagDriftFromNorm` method.
-The method detects sections in timeseries that deviate from the majority in a group of variables
-
-
-* :ref:`Parameters <cookbooks/DriftDetection:Parameters>`
-* :ref:`Algorithm <cookbooks/DriftDetection:Algorithm>`
-* :ref:`Example Data import <cookbooks/DriftDetection:Example Data import>`
-* :ref:`Example Algorithm Application <cookbooks/DriftDetection:Example Algorithm Application>`
-
-
-
-
-Parameters
-----------
-
-Although there seems to be a lot of user input to parametrize, most of it is easy to be interpreted and can be selected
-defaultly.
-
-window
-^^^^^^
-
-Length of the partitions the target group of data series` is divided into.
-For example, if selected ``1D`` (one day), the group to check will be divided into one day chunks and every chunk is be checked for time series deviating from the normal group.
-
-frac
-^^^^
-
-The percentage of data, needed to define the "normal" group expressed in a number out of :math:`[0,1]`.
-This, of course must be something over 50 percent (math:`0.5`), and can be
-selected according to the number of drifting variables one expects the data to have at max.
-
-method
-^^^^^^
-
-The linkage method can have some impact on the clustering, but sticking to the default value `single` might be
-sufficient for most the tasks.
-
-spread
-^^^^^^
-
-The main parameter to control the algorithm's behavior. It has to be selected carefully.
-It determines the maximum spread of a normal group by limiting the costs, a cluster agglomeration must not exceed in
-every linkage step.
-
-For singleton clusters, that costs equals half the distance, the timeseries in the clusters have to each other. So, only timeseries with a distance of less than two times the spreading norm can be clustered.
-
-When timeseries get clustered together, this new clusters distance to all the other timeseries/clusters is calculated
-according to the linkage method specified. By default, it is the minimum distance, the members of the clusters have to
-each other.
-
-Having that in mind, it is advisable to choose a distance function as metric, that can be well interpreted in the units
-dimension of the measurement, and where the interpretation is invariant over the length of the timeseries.
-
-metric
-^^^^^^
-
-The default *averaged manhatten metric* roughly represents the averaged value distance of two timeseries (as opposed to *euclidean*, which scales non linearly with the
-compared timeseries' length). For the selection of the :py:attr:`spread` parameter the default metric is helpful, since it allows to interpret the spreading in the dimension of the measurements.
-
-
-Algorithm
----------
-
-The aim of the algorithm is to flag sections in timeseries, that significantly deviate from a normal group of timeseries running in parallel within a given section.
-
-"Normality" is determined in terms of a maximum spreading distance, that members of a normal group must not exceed.
-In addition, a group is only considered to be "normal", if it contains more then a certain percentage of the timeseries to be clustered into "normal" ones and "abnormal" ones.
-
-The steps of the algorithm are the following:
-
-* Calculate the distances :math:`d(x_i,x_j)` for all timeseries :math:`x_i` that are to be clustered with a metric specified by the user
-* Calculate a dendogram using a hierarchical linkage algorithm, specified by the user.
-* Flatten the dendogram at the level, the agglomeration costs exceed the value given by a spreading norm, specified by the user
-* check if there is a cluster containing more than a certain percentage of variables as specified by the user.
-
-  * if yes: flag all the variables that are not in that cluster
-  * if no: flag nothing
-
-Example Data Import
--------------------
-
-.. plot::
-    :context: reset
-    :include-source: False
-
-    import matplotlib
-    import saqc
-    import pandas as pd
-    data = pd.read_csv('../resources/data/tempSensorGroup.csv', index_col=0)
-    data.index = pd.DatetimeIndex(data.index)
-    qc = saqc.SaQC(data)
-
-We load the example `data set <https://git.ufz.de/rdm-software/saqc/-/blob/develop/docs/resources/data/tempsenorGroup.csv>`_
-from the *saqc* repository using the `pandas <https://pandas.pydata.org/>`_ csv
-file reader. Subsequently, we cast the index of the imported data to `DatetimeIndex`
-and use the dataframe's `plot` method, to inspect the imported data:
-
-.. doctest:: flagDriftFromNorm
-
-    >>> data = pd.read_csv('./resources/data/tempSensorGroup.csv', index_col=0)
-    >>> data.index = pd.DatetimeIndex(data.index)
-    >>> data.plot() # doctest: +SKIP
-
-
-.. plot::
-    :context: close-figs
-    :include-source: False
-    :class: center
-
-    data.plot()
-
-
-Example Algorithm Application
------------------------------
-
-Looking at our example data set more closely, we see that 2 of the 5 variables start to drift away.
-
-.. plot::
-    :context: close-figs
-    :include-source: False
-    :class: center
-    :caption: 2 variables start departing the majority group of variables (the group containing more than ``frac`` variables) around july.
-
-    data['2017-05':'2017-11'].plot()
-
-
-.. plot::
-    :context: close-figs
-    :include-source: False
-    :class: center
-    :caption: 2 variables are departed from the majority group of variables (the group containing more than ``frac`` variables) by the end of the year.
-
-    data['2017-09':'2018-01'].plot()
-
-Lets try to detect those drifts via saqc. There for we import the *saqc* package and instantiate a :py:class:`saqc.SaQC`
-object with the data:
-
-.. doctest:: flagDriftFromNorm
-
-    >>> import saqc
-    >>> qc = saqc.SaQC(data)
-
-The changes we observe in the data seem to develop significantly only in temporal spans over a month,
-so we go for ``"1M"`` as value for the
-``window`` parameter. We identified the majority group as a group containing three variables, whereby two variables
-seem to be scattered away, so that we can leave the ``frac`` value at its default ``.5`` level.
-The majority group seems on average not to be spread out more than 3 or 4 degrees. So, for the ``spread`` value
-we go for ``3``. This can be interpreted as follows, for every member of a group, there is another member that
-is not distanted more than ``3`` degrees from that one (on average in one month) - this should be sufficient to bundle
-the majority group and to discriminate against the drifting variables, that seem to deviate more than 3 degrees on
-average in a month from any member of the majority group.
-
-.. doctest:: flagDriftFromNorm
-
-    >>> variables = ['temp1 [degC]', 'temp2 [degC]', 'temp3 [degC]', 'temp4 [degC]', 'temp5 [degC]']
-    >>> qc = qc.flagDriftFromNorm(variables, window='1M', spread=3)
-
-.. plot::
-    :context: close-figs
-    :include-source: False
-    :class: center
-
-    >>> variables = ['temp1 [degC]', 'temp2 [degC]', 'temp3 [degC]', 'temp4 [degC]', 'temp5 [degC]']
-    >>> qc = qc.flagDriftFromNorm(variables, window='1M', spread=3)
-
-Lets check the results:
-
-.. doctest:: flagDriftFromNorm
-
-    >>> qc.plot('temp1 [degC]') # doctest: +SKIP
-
-.. plot::
-    :context: close-figs
-    :include-source: False
-    :class: center
-
-    qc.plot('temp1 [degC]')
-
-.. doctest:: flagDriftFromNorm
-
-    >>> qc.plot('temp2 [degC]') # doctest: +SKIP
-
-.. plot::
-    :context: close-figs
-    :include-source: False
-    :class: center
-
-    qc.plot('temp2 [degC]')
-
-.. doctest:: flagDriftFromNorm
-
-    >>> qc.plot('temp3 [degC]') # doctest: +SKIP
-
-.. plot::
-    :context: close-figs
-    :include-source: False
-    :class: center
-
-    qc.plot('temp3 [degC]')
-
-.. doctest:: flagDriftFromNorm
-
-    >>> qc.plot('temp4 [degC]') # doctest: +SKIP
-
-.. plot::
-    :context: close-figs
-    :include-source: False
-    :class: center
-
-    qc.plot('temp4 [degC]')
-
-.. doctest:: flagDriftFromNorm
-
-    >>> qc.plot('temp5 [degC]') # doctest: +SKIP
-
-.. plot::
-    :context: close-figs
-    :include-source: False
-    :class: center
-
-    qc.plot('temp5 [degC]')
\ No newline at end of file
diff --git a/requirements.txt b/requirements.txt
index 731bee3213b6feef434b87216ac661cdecfb8bdf..ee9cb04b8285b5dd377dafebf6f8fa93cc0f7a83 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -6,7 +6,7 @@ Click==8.1.3
 docstring_parser==0.15
 dtw==1.4.0
 matplotlib==3.7.1
-numpy==1.23.5
+numpy==1.24.3
 outlier-utils==0.0.3
 pyarrow==11.0.0
 pandas==2.0.1
diff --git a/saqc/funcs/interpolation.py b/saqc/funcs/interpolation.py
index aa70c0e419d605d8cf597bee24e596119b1c9195..6682ea5921849a38d2edfb1fe5baf66d5ee040a0 100644
--- a/saqc/funcs/interpolation.py
+++ b/saqc/funcs/interpolation.py
@@ -255,19 +255,19 @@ class InterpolationMixin:
         <BLANKLINE>
         """
-        if "window" in kwargs:
+        if "freq" in kwargs:
             # the old interpolate version
             warnings.warn(
                 f"""
                 The method `intepolate` is deprecated and will be removed in version 3.0 of saqc.
                 To achieve the same behaviour please use:
-                `qc.align(field={field}, window={kwargs["window"]}, method={method}, order={order}, flag={flag})`
+                `qc.align(field={field}, freq={kwargs["freq"]}, method={method}, order={order}, flag={flag})`
                 """,
                 DeprecationWarning,
             )
             return self.align(
                 field=field,
-                freq=kwargs.pop("window", method),
+                freq=kwargs.pop("freq", method),
                 method=method,
                 order=order,
                 flag=flag,
@@ -370,7 +370,7 @@ class InterpolationMixin:
             "func": "align",
             "args": (field,),
             "kwargs": {
-                "window": freq,
+                "freq": freq,
                 "method": method,
                 "order": order,
                 "extrapolate": extrapolate,
@@ -437,7 +437,7 @@ class InterpolationMixin:
         The method `interpolateIndex` is deprecated and will be removed in verion 3.0 of saqc.
         To achieve the same behavior use:
         """
-        call = "qc.align(field={field}, window={window}, method={method}, order={order}, extrapolate={extrapolate})"
+        call = "qc.align(field={field}, freq={freq}, method={method}, order={order}, extrapolate={extrapolate})"
 
         if limit != 2:
             call = f"{call}.interpolate(field={field}, method={method}, order={order}, limit={limit}, extrapolate={extrapolate})"
@@ -524,7 +524,7 @@ def _shift(
     method :
         Method to propagate values:
 
-        * 'nshift' : shift grid points to the nearest time stamp in the range = +/- 0.5 * ``window``
+        * 'nshift' : shift grid points to the nearest time stamp in the range = +/- 0.5 * ``freq``
        * 'bshift' : shift grid points to the first succeeding time stamp (if any)
        * 'fshift' : shift grid points to the last preceeding time stamp (if any)
@@ -580,7 +580,7 @@ def _interpolate(
     datcol = datcol[~flagged].dropna()
 
     # account for annoying case of subsequent frequency aligned values,
-    # that differ exactly by the margin of 2*window
+    # that differ exactly by the margin of 2*freq
     gaps = datcol.index[1:] - datcol.index[:-1] == 2 * pd.Timedelta(freq)
     gaps = datcol.index[1:][gaps]
     gaps = gaps.intersection(grid_index).shift(-1, freq)
diff --git a/saqc/funcs/resampling.py b/saqc/funcs/resampling.py
index b72d0f13b40c0d827df5f5753b727cc1a63fb043..3f397133b24b3075f8687ff3d89fec49c3c07c55 100644
--- a/saqc/funcs/resampling.py
+++ b/saqc/funcs/resampling.py
@@ -97,7 +97,7 @@ class ResamplingMixin:
         method :
             Method to propagate values:
 
-            * 'nshift' : shift grid points to the nearest time stamp in the range = +/- 0.5 * ``window``
+            * 'nshift' : shift grid points to the nearest time stamp in the range = +/- 0.5 * ``freq``
            * 'bshift' : shift grid points to the first succeeding time stamp (if any)
            * 'fshift' : shift grid points to the last preceeding time stamp (if any)
        """
@@ -105,7 +105,7 @@ class ResamplingMixin:
             f"""
             The method `shift` is deprecated and will be removed with version 2.6 of saqc.
             To achieve the same behavior please use:
-            `qc.align(field={field}, window={freq}. method={method})`
+            `qc.align(field={field}, freq={freq}, method={method})`
             """,
             DeprecationWarning,
         )
@@ -131,7 +131,7 @@ class ResamplingMixin:
         ``func``, the result is projected to the new timestamps using
         ``method``. The following methods are available:
 
-        * ``'nagg'``: all values in the range (+/- `window`/2) of a grid point get
+        * ``'nagg'``: all values in the range (+/- `freq`/2) of a grid point get
           aggregated with func and assigned to it.
         * ``'bagg'``: all values in a sampling interval get aggregated with func
           and the result gets assigned to the last grid point.
@@ -200,7 +200,7 @@ class ResamplingMixin:
             "func": "resample",
             "args": (),
             "kwargs": {
-                "window": freq,
+                "freq": freq,
                 "func": func,
                 "method": method,
                 "maxna": maxna,
@@ -306,7 +306,7 @@ class ResamplingMixin:
         if freq is None and not method == "match":
             raise ValueError(
                 'To project irregularly sampled data, either use method="match", or '
-                "pass custom projection range to window parameter."
+                "pass a custom projection range to the freq parameter."
             )
 
         if method == "auto":
@@ -373,7 +373,7 @@ class ResamplingMixin:
             "field": field,
             "target": target,
             "method": method,
-            "window": freq,
+            "freq": freq,
             "drop": drop,
             "squeeze": squeeze,
             "overwrite": overwrite,
diff --git a/saqc/funcs/tools.py b/saqc/funcs/tools.py
index 42cdb6e9acd49aa880a7791520793b5e6b9d50c0..858c09a72172190e8936350af68fe8cb8a8a0a4a 100644
--- a/saqc/funcs/tools.py
+++ b/saqc/funcs/tools.py
@@ -198,14 +198,14 @@ class ToolsMixin:
     def plot(
         self: "SaQC",
         field: str,
-        path: Optional[str] = None,
-        max_gap: Optional[str] = None,
-        history: Optional[Literal["valid", "complete"] | list] = "valid",
-        xscope: Optional[slice] = None,
-        phaseplot: Optional[str] = None,
-        store_kwargs: Optional[dict] = None,
+        path: str | None = None,
+        max_gap: str | None = None,
+        history: Literal["valid", "complete"] | list[str] | None = "valid",
+        xscope: slice | None = None,
+        phaseplot: str | None = None,
+        store_kwargs: dict | None = None,
         ax: mpl.axes.Axes | None = None,
-        ax_kwargs: Optional[dict] = None,
+        ax_kwargs: dict | None = None,
         dfilter: float = FILTER_NONE,
         **kwargs,
     ) -> "SaQC":
@@ -227,36 +227,45 @@
             the plot is stored unter the passed location.
 
         max_gap :
-            If None, all the points in the data will be connected, resulting in long linear
-            lines, where continous chunks of data is missing. Nans in the data get dropped
-            before plotting. If an offset string is passed, only points that have a distance
-            below `max_gap` get connected via the plotting line.
+            If ``None``, all data points will be connected, resulting in long linear
+            lines, in case of large data gaps. ``NaN`` values will be removed before
+            plotting. If an offset string is passed, only points that have a distance
+            below ``max_gap`` are connected via the plotting line.
 
         history :
             Discriminate the plotted flags with respect to the tests they originate from.
 
-            * "valid" - Only plot those flags, that do not get altered or "unflagged" by subsequent tests. Only list tests
-              in the legend, that actually contributed flags to the overall resault.
-            * "complete" - plot all the flags set and list all the tests ran on a variable. Suitable for debugging/tracking.
-            * None - just plot the resulting flags for one variable, without any historical meta information.
-            * list of strings - plot only flags set by those tests listed.
+            * ``"valid"``: Only plot flags that are not overwritten by subsequent tests.
+              Only list tests in the legend that actually contributed flags to the overall
+              result.
+            * ``"complete"``: Plot all flags set and list all the tests executed on a variable.
+              Suitable for debugging/tracking.
+            * ``None``: Just plot the resulting flags for one variable, without any historical
+              and/or meta information.
+            * list of strings: Plot only flags set by the listed tests.
 
         xscope :
-            Parameter, that determines a chunk of the data to be plotted
-            processed. `xscope` can be anything, that is a valid argument to the ``pandas.Series.__getitem__`` method.
+            Determine a chunk of the data to be plotted and processed. ``xscope`` can be anything
+            that is a valid argument to the ``pandas.Series.__getitem__`` method.
 
         phaseplot :
-            If a string is passed, plot ``field`` in the phase space it forms together with the Variable ``phaseplot``.
+            If a string is passed, plot ``field`` in the phase space it forms together with the
+            variable ``phaseplot``.
+
+        ax :
+            If not ``None``, plot into the given ``matplotlib.Axes`` instance, instead of a
+            newly created ``matplotlib.Figure``. This option offers a possibility to integrate
+            ``SaQC`` plots into custom figure layouts.
 
         store_kwargs :
             Keywords to be passed on to the ``matplotlib.pyplot.savefig`` method,
             handling the figure storing. To store an pickle object of the figure, use the option
-            ``{'pickle': True}``, but note that all other store_kwargs are ignored then.
-            Reopen with: ``pickle.load(open(savepath,'w')).show()``
+            ``{"pickle": True}``, but note that all other ``store_kwargs`` are ignored then.
+            To reopen a pickled figure execute: ``pickle.load(open(savepath, "rb")).show()``
 
         ax_kwargs :
             Axis keywords. Change the axis labeling defaults. Most important keywords:
-            'x_label', 'y_label', 'title', 'fontsize', 'cycleskip'.
+            ``"xlabel"``, ``"ylabel"``, ``"title"``, ``"fontsize"``, ``"cycleskip"``.
         """
 
         data, flags = self._data.copy(), self._flags.copy()
diff --git a/saqc/lib/exceptions.py b/saqc/lib/exceptions.py
deleted file mode 100644
index 1b8748fd3c285826a84cbb986539f92700676ba3..0000000000000000000000000000000000000000
--- a/saqc/lib/exceptions.py
+++ /dev/null
@@ -1,50 +0,0 @@
-#! /usr/bin/env python
-
-# SPDX-FileCopyrightText: 2021 Helmholtz-Zentrum für Umweltforschung GmbH - UFZ
-#
-# SPDX-License-Identifier: GPL-3.0-or-later
-
-# -*- coding: utf-8 -*-
-from __future__ import annotations
-
-from typing import (
-    Any,
-    Callable,
-    Collection,
-    List,
-    Literal,
-    Sequence,
-    Tuple,
-    TypeVar,
-    Union,
-)
-
-CLOSURE_TO_NOTION = {
-    None: "interval ({}, {})",
-    "left": "right-open interval [{}, {})]",
-    "right": "left-open interval ({}, {}]",
-    "both": "closed interval [{}, {}]",
-}
-
-
-class ParameterOutOfBounds(Exception):
-    def __init__(
-        self,
-        value: int | float,
-        para_name: str,
-        bounds: Tuple[str],
-        closed: Literal["right", "left", "both"] = None,
-    ):
-        Exception.__init__(self)
-        self.value = value
-        self.para_name = para_name
-        self.bounds = bounds
-        self.closed = closed
-        self.msg = "Parameter '{}' has to be in the {}, but {} was passed."
-
-    def __str__(self):
-        return self.msg.format(
-            self.para_name,
-            CLOSURE_TO_NOTION[self.closed].format(self.bounds[0], self.bounds[1]),
-            self.value,
-        )
diff --git a/saqc/lib/tools.py b/saqc/lib/tools.py
index 5c719aa535c72433753f973c6f15720fb601931f..4097f36dca35c1b5bbd7aedd06925617b7e701ea 100644
--- a/saqc/lib/tools.py
+++ b/saqc/lib/tools.py
@@ -10,20 +10,9 @@ from __future__ import annotations
 import collections
 import functools
 import itertools
-import operator as op
 import re
 import warnings
-from typing import (
-    Any,
-    Callable,
-    Collection,
-    List,
-    Literal,
-    Sequence,
-    Tuple,
-    TypeVar,
-    Union,
-)
+from typing import Any, Callable, Collection, List, Sequence, TypeVar, Union
 
 import numpy as np
 import pandas as pd
@@ -33,40 +22,6 @@ from scipy.cluster.hierarchy import fcluster, linkage
 from saqc.lib.types import CompT
 
 T = TypeVar("T", str, float, int)
-BOUND_OPERATORS = {
-    None: (op.le, op.ge),
-    "both": (op.lt, op.gt),
-    "right": (op.le, op.gt),
-    "left": (op.gt, op.le),
-}
-
-
-def isInBounds(
-    val: int | float,
-    bounds: Tuple[int | float],
-    closed: Literal["left", "right", "both"] = None,
-):
-    """
-    check if val falls into the interval [left, right] and return boolean accordingly
-
-    val :
-        value to check
-
-    bounds :
-        Tuple containing left and right interval bounds. Pass `(a, b)` to define the interval [`a`, `b`].
-        Set `a=-inf` or `b=+inf` to set one sided restriction.
-
-    closed :
-        Enclosure includes the interval bounds into the constraint interval. By default, the bounds
-        are not included. Pass:
-        * `"both"`: to include both sides of the interval
-        * `"left"`: to include left bound
-        * `"right"`: to include right bound
-    """
-    ops = BOUND_OPERATORS[closed]
-    if ops[0](val, bounds[0]) or ops[1](val, bounds[1]):
-        return False
-    return True
 
 
 def assertScalar(name, value, optional=False):
@@ -298,6 +253,7 @@ def estimateFrequency(
     len_f = len(delta_f) * 2
     min_energy = delta_f[0] * min_energy
+    # calc/assign low/high freq cut offs (makes life easier):
     min_rate_i = int(
         len_f / (pd.Timedelta(min_rate).total_seconds() * (10**delta_precision))
     )
@@ -427,7 +383,7 @@ def getFreqDelta(index: pd.Index) -> None | pd.Timedelta:
     (``None`` will also be returned for pd.RangeIndex type.)
     """
-    delta = getattr(index, "window", None)
+    delta = getattr(index, "freq", None)
     if delta is None and not index.empty:
         i = pd.date_range(index[0], index[-1], len(index))
         if i.equals(index):
diff --git a/tests/common.py b/tests/common.py
index cdf418c2327c8f3255f42cbd9f5507cec464069a..e9375c73e77a10c4add8616e80f6c37f494939cf 100644
--- a/tests/common.py
+++ b/tests/common.py
@@ -88,7 +88,7 @@ def checkInvariants(data, flags, field, identical=True):
     assert flags[field].dtype == float
 
-    # `pd.Index.identical` also check index attributes like `window`
+    # `pd.Index.identical` also checks index attributes like `freq`
     if identical:
         assert data[field].index.identical(flags[field].index)
     else:
diff --git a/tests/funcs/test_outlier_detection.py b/tests/funcs/test_outlier_detection.py
index 10055b9a58982976cbf67968c8240b34eeac8494..a036c21d2292472196bff1ddb5a7f97b376ff08f 100644
--- a/tests/funcs/test_outlier_detection.py
+++ b/tests/funcs/test_outlier_detection.py
@@ -184,10 +184,7 @@ def test_flagUniLOF(spiky_data, n, p, thresh):
     qc = SaQC(data).flagUniLOF(field, n=n, p=p, thresh=thresh)
     flag_result = qc.flags[field]
     test_sum = (flag_result[spiky_data[1]] == BAD).sum()
-    try:
-        assert test_sum == len(spiky_data[1])
-    except AssertionError:
-        print("stop")
+    assert test_sum == len(spiky_data[1])
 
 
 @pytest.mark.parametrize("vars", [1, 2, 3])
diff --git a/tests/requirements.txt b/tests/requirements.txt
index 13e6d60aec97ca9136a92d97d32cc2f642c72440..54cb45f04fd7b043e0bd48efe0e2e0225d6c6c93 100644
--- a/tests/requirements.txt
+++ b/tests/requirements.txt
@@ -7,4 +7,4 @@ hypothesis==6.72.2
 Markdown==3.4.3
 pytest==7.3.1
 pytest-lazy-fixture==0.6.3
-requests==2.28.2
+requests==2.29.0