Compare revisions
Commits on Source (31)
Showing with 52877 additions and 91 deletions
......@@ -134,12 +134,39 @@ doctest:
# Building stage
# ===========================================================
# check if we are able to build a wheel
wheel:
# and if the import works
wheel38:
stage: build
image: python:3.8
script:
- pip install wheel
- pip wheel .
- pip install .
- python -c 'import saqc; print(f"{saqc.__version__=}")'
wheel39:
stage: build
image: python:3.9
script:
- pip install wheel
- pip wheel .
- pip install .
- python -c 'import saqc; print(f"{saqc.__version__=}")'
wheel310:
stage: build
image: python:3.10
script:
- pip install wheel
- pip wheel .
- pip install .
- python -c 'import saqc; print(f"{saqc.__version__=}")'
wheel311:
stage: build
image: python:3.11
script:
- pip install wheel
- pip wheel .
- pip install .
- python -c 'import saqc; print(f"{saqc.__version__=}")'
docs:
stage: build
......
......@@ -10,8 +10,21 @@ SPDX-License-Identifier: GPL-3.0-or-later
[List of commits](https://git.ufz.de/rdm-software/saqc/-/compare/v2.4.1...develop)
### Added
### Changed
### Removed
### Fixed
### Deprecated
## [2.4.1](https://git.ufz.de/rdm-software/saqc/-/tags/v2.4.1) - 2023-06-22
[List of commits](https://git.ufz.de/rdm-software/saqc/-/compare/v2.4.0...v2.4.1)
### Added
### Changed
- pin pandas to versions >= 2.0
### Removed
- removed deprecated `DictOfSeries.to_df`
### Fixed
### Deprecated
## [2.4.0](https://git.ufz.de/rdm-software/saqc/-/tags/v2.4.0) - 2023-04-25
[List of commits](https://git.ufz.de/rdm-software/saqc/-/compare/v2.3.0...v2.4.0)
......@@ -21,11 +34,9 @@ SPDX-License-Identifier: GPL-3.0-or-later
- Expose the `History` via `SaQC._history`
- Config function `cv` (coefficient of variation)
### Changed
- Deprecate `interpolate`, `linear` and `shift` in favor of `align`
- Deprecate `roll` in favor of `rolling`
- Rename `interplateInvalid` to `interpolate`
- Rename `interpolateIndex` to `align`
- Deprecate `flagMVScore` parameters: `partition_min` in favor of `window`, `partition_min` in favor of `min_periods`, `min_periods` in favor of `min_periods_r`
- Rewrite of `dios.DictOfSeries`
### Removed
- Parameter `limit` from `align`
- Parameter `max_na_group_flags`, `max_na_flags`, `flag_func`, `freq_check` from `resample`
......@@ -35,6 +46,11 @@ SPDX-License-Identifier: GPL-3.0-or-later
- `resample` was not writing meta entries
- `flagByStatLowPass` was overwriting existing flags
- `flagUniLOF` and `flagLOF` were overwriting existing flags
### Deprecated
- Deprecate `flagMVScore` parameters: `partition` in favor of `window`, `partition_min` in favor of `min_periods`, `min_periods` in favor of `min_periods_r`
- Deprecate `interpolate`, `linear` and `shift` in favor of `align`
- Deprecate `roll` in favor of `rolling`
- Deprecate `DictOfSeries.to_df` in favor of `DictOfSeries.to_pandas`
## [2.3.0](https://git.ufz.de/rdm-software/saqc/-/tags/v2.3.0) - 2023-01-17
[List of commits](https://git.ufz.de/rdm-software/saqc/-/compare/v2.2.1...v2.3.0)
......
......@@ -57,7 +57,7 @@ SM2 ; shift(freq="15Min")
'SM(1|2)+' ; flagMissing()
SM1 ; flagRange(min=10, max=60)
SM2 ; flagRange(min=10, max=40)
SM2 ; flagMAD(window="30d", z=3.5)
SM2 ; flagZScore(window="30d", thresh=3.5, method='modified', center=False)
Dummy ; flagGeneric(field=["SM1", "SM2"], func=(isflagged(x) | isflagged(y)))
```
......@@ -98,7 +98,7 @@ saqc = (saqc
.flagMissing("SM(1|2)+", regex=True)
.flagRange("SM1", min=10, max=60)
.flagRange("SM2", min=10, max=40)
.flagMAD("SM2", window="30d", z=3.5)
.flagZScore("SM2", window="30d", thresh=3.5, method='modified', center=False)
.flagGeneric(field=["SM1", "SM2"], target="Dummy", func=lambda x, y: (isflagged(x) | isflagged(y))))
```
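For orientation, `method='modified'` selects the modified z-score. A rough sketch of that statistic under its common definition via median and median absolute deviation (MAD, after Iglewicz and Hoaglin); the helper name is ours, not part of the saqc API:

```python
import numpy as np

def modified_zscore(x: np.ndarray) -> np.ndarray:
    # modified z-score: 0.6745 * (x - median) / MAD
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    return 0.6745 * (x - med) / mad
```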
......
......@@ -15,6 +15,7 @@ Cook Books
DataRegularisation
OutlierDetection
ResidualOutlierDetection
DriftDetection
MultivariateFlagging
../documentation/GenericFunctions
......@@ -63,6 +64,16 @@ Cook Books
+++
*Wrap your custom logical and arithmetic expressions with the generic functions*
.. grid-item-card:: Drift Detection
:link: DriftDetection
:link-type: doc
* define metrics to measure distance between data series
* automatically determine majority and anomalous data groups
+++
*Detecting data chunks drifting apart from a reference group*
.. grid-item-card:: Modelling, Residuals and Arithmetics
:link: ResidualOutlierDetection
:link-type: doc
......
.. SPDX-FileCopyrightText: 2021 Helmholtz-Zentrum für Umweltforschung GmbH - UFZ
..
.. SPDX-License-Identifier: GPL-3.0-or-later
Drift Detection
===============
Overview
--------
This guide briefly introduces the usage of the :py:meth:`~saqc.SaQC.flagDriftFromNorm` method.
The method detects sections in time series that deviate from the majority in a group of variables.
* :ref:`Parameters <cookbooks/DriftDetection:Parameters>`
* :ref:`Algorithm <cookbooks/DriftDetection:Algorithm>`
* :ref:`Example Data Import <cookbooks/DriftDetection:Example Data Import>`
* :ref:`Example Algorithm Application <cookbooks/DriftDetection:Example Algorithm Application>`
Parameters
----------
Although there seems to be a lot of user input to parametrize, most of the parameters are easy to interpret and can
be left at their default values.
window
^^^^^^
Length of the partitions the group of data series to check is divided into.
For example, if ``1D`` (one day) is selected, the group is divided into one-day chunks, and every chunk is checked for time series deviating from the normal group; a partitioning sketch follows below.
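As a rough illustration (not saqc's internal code), this is how a ``1D`` window partitions a time-indexed DataFrame into the chunks that are checked independently:

.. code-block:: python

   import pandas as pd

   # illustrative only; saqc performs the partitioning internally
   for day, chunk in data.groupby(pd.Grouper(freq="1D")):
       ...  # each one-day chunk is clustered and checked on its own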
frac
^^^^
The fraction of variables needed to define the "normal" group, expressed as a number in :math:`[0,1]`.
This, of course, must be more than 50 percent (:math:`0.5`), and can be
chosen according to the maximum number of drifting variables one expects in the data.
For example, with five variables and the default of :math:`0.5`, the majority group must comprise at least three of them.
method
^^^^^^
The linkage method can have some impact on the clustering, but sticking to the default value `single` should be
sufficient for most tasks.
spread
^^^^^^
The main parameter to control the algorithm's behavior. It has to be selected carefully.
It determines the maximum spread of a normal group by limiting the costs a cluster agglomeration must not exceed in
any linkage step.
For singleton clusters, those costs equal half the distance the time series in the clusters have to each other. So, only time series with a distance of less than two times the spread norm can be clustered.
When time series get clustered together, the new cluster's distance to all the other time series/clusters is calculated
according to the linkage method specified. By default, it is the minimum distance the members of the clusters have to
each other.
Having that in mind, it is advisable to choose as metric a distance function that can be well interpreted in the unit
dimension of the measurement, and whose interpretation is invariant to the length of the time series.
metric
^^^^^^
The default *averaged manhattan metric* roughly represents the average value distance of two time series (as opposed to the *euclidean* metric, which scales non-linearly with the
compared time series' length). For the selection of the :py:attr:`spread` parameter the default metric is helpful, since it allows interpreting the spread in the dimension of the measurements.
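A minimal sketch of such an averaged manhattan distance between two equally long series (illustrative, not saqc's internal implementation):

.. code-block:: python

   import numpy as np

   def averaged_manhattan(x: np.ndarray, y: np.ndarray) -> float:
       # average absolute value difference per timestamp
       return float(np.abs(x - y).mean())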
Algorithm
---------
The aim of the algorithm is to flag sections in time series that significantly deviate from a normal group of time series running in parallel within a given section.
"Normality" is determined in terms of a maximum spread distance that members of a normal group must not exceed.
In addition, a group is only considered "normal" if it contains more than a certain fraction of the time series to be clustered into "normal" and "abnormal" ones.
The steps of the algorithm are the following (a code sketch follows the list):

* Calculate the distances :math:`d(x_i,x_j)` for all time series :math:`x_i` to be clustered, using a metric specified by the user.
* Calculate a dendrogram using a hierarchical linkage algorithm specified by the user.
* Flatten the dendrogram at the level where the agglomeration costs exceed the value given by a spread norm specified by the user.
* Check if there is a cluster containing more than a certain percentage of variables, as specified by the user.

  * If yes: flag all the variables that are not in that cluster.
  * If no: flag nothing.
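The following is a minimal, self-contained sketch of these steps (illustrative only, not saqc's internal implementation; function and variable names are ours), assuming all series share one index and contain no NaNs:

.. code-block:: python

   import numpy as np
   import pandas as pd
   from scipy.cluster.hierarchy import fcluster, linkage
   from scipy.spatial.distance import pdist

   def find_drifters(df: pd.DataFrame, spread: float, frac: float = 0.5) -> list:
       # pairwise "averaged manhattan" distances between all columns
       dists = pdist(df.T.values, metric=lambda x, y: np.abs(x - y).mean())
       # dendrogram via single linkage
       tree = linkage(dists, method="single")
       # flatten the dendrogram where agglomeration costs exceed the spread norm
       labels = fcluster(tree, t=spread, criterion="distance")
       counts = pd.Series(labels).value_counts()
       if counts.iloc[0] <= frac * df.shape[1]:
           return []  # no cluster is large enough: flag nothing
       majority = counts.index[0]
       return [col for col, lbl in zip(df.columns, labels) if lbl != majority]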
Example Data Import
-------------------
.. plot::
:context: reset
:include-source: False
import matplotlib
import saqc
import pandas as pd
data = pd.read_csv('../resources/data/tempSensorGroup.csv', index_col=0)
data.index = pd.DatetimeIndex(data.index)
qc = saqc.SaQC(data)
We load the example `data set <https://git.ufz.de/rdm-software/saqc/-/blob/develop/docs/resources/data/tempSensorGroup.csv>`_
from the *saqc* repository using the `pandas <https://pandas.pydata.org/>`_ csv
file reader. Subsequently, we cast the index of the imported data to `DatetimeIndex`
and use the dataframe's `plot` method to inspect the imported data:
.. doctest:: flagDriftFromNorm
>>> data = pd.read_csv('./resources/data/tempSensorGroup.csv', index_col=0)
>>> data.index = pd.DatetimeIndex(data.index)
>>> data.plot() # doctest: +SKIP
.. plot::
:context: close-figs
:include-source: False
:class: center
data.plot()
Example Algorithm Application
-----------------------------
Looking at our example data set more closely, we see that 2 of the 5 variables start to drift away.
.. plot::
:context: close-figs
:include-source: False
:class: center
:caption: 2 variables start departing from the majority group of variables (the group containing more than ``frac`` of the variables) around July.
data['2017-05':'2017-11'].plot()
.. plot::
:context: close-figs
:include-source: False
:class: center
:caption: 2 variables have departed from the majority group of variables (the group containing more than ``frac`` of the variables) by the end of the year.
data['2017-09':'2018-01'].plot()
Let's try to detect those drifts with saqc. Therefore, we import the *saqc* package and instantiate a :py:class:`saqc.SaQC`
object with the data:
.. doctest:: flagDriftFromNorm
>>> import saqc
>>> qc = saqc.SaQC(data)
The changes we observe in the data only develop significantly over temporal spans of about a month,
so we go with ``"1M"`` as the value for the
``window`` parameter. We identified the majority group as a group containing three variables, with two variables
scattering away, so we can leave the ``frac`` value at its default level of ``0.5``.
The majority group seems, on average, not to be spread out by more than 3 or 4 degrees. So, for the ``spread`` value
we go with ``3``. This can be interpreted as follows: for every member of a group, there is another member that
is not distanced more than ``3`` degrees from it (on average over one month). This should be sufficient to bundle
the majority group and to discriminate against the drifting variables, which seem to deviate by more than 3 degrees on
average per month from any member of the majority group.
.. doctest:: flagDriftFromNorm
>>> variables = ['temp1 [degC]', 'temp2 [degC]', 'temp3 [degC]', 'temp4 [degC]', 'temp5 [degC]']
>>> qc = qc.flagDriftFromNorm(variables, window='1M', spread=3)
.. plot::
:context: close-figs
:include-source: False
:class: center
variables = ['temp1 [degC]', 'temp2 [degC]', 'temp3 [degC]', 'temp4 [degC]', 'temp5 [degC]']
qc = qc.flagDriftFromNorm(variables, window='1M', spread=3)
Let's check the results:
.. doctest:: flagDriftFromNorm
>>> qc.plot('temp1 [degC]') # doctest: +SKIP
.. plot::
:context: close-figs
:include-source: False
:class: center
qc.plot('temp1 [degC]')
.. doctest:: flagDriftFromNorm
>>> qc.plot('temp2 [degC]') # doctest: +SKIP
.. plot::
:context: close-figs
:include-source: False
:class: center
qc.plot('temp2 [degC]')
.. doctest:: flagDriftFromNorm
>>> qc.plot('temp3 [degC]') # doctest: +SKIP
.. plot::
:context: close-figs
:include-source: False
:class: center
qc.plot('temp3 [degC]')
.. doctest:: flagDriftFromNorm
>>> qc.plot('temp4 [degC]') # doctest: +SKIP
.. plot::
:context: close-figs
:include-source: False
:class: center
qc.plot('temp4 [degC]')
.. doctest:: flagDriftFromNorm
>>> qc.plot('temp5 [degC]') # doctest: +SKIP
.. plot::
:context: close-figs
:include-source: False
:class: center
qc.plot('temp5 [degC]')
\ No newline at end of file
......@@ -272,7 +272,7 @@ To see all the results obtained so far, plotted in one figure window, we make us
.. doctest:: exampleOD
>>> data.to_df().plot()
>>> data.to_pandas().plot()
<Axes...>
.. plot::
......@@ -281,7 +281,7 @@ To see all the results obtained so far, plotted in one figure window, we make us
:width: 80 %
:class: center
data.to_df().plot()
data.to_pandas().plot()
Residuals and Scores
......
......@@ -62,6 +62,7 @@ Documentation
* outlier detection
* frequency alignment
* drift detection
* data modelling
* wrapping generic or custom functionality
+++
......
......@@ -3,4 +3,4 @@ varname ; test
SM2 ; align(freq="15Min", method="nshift")
SM2 ; flagMissing()
'SM(1|2)+' ; flagRange(min=10, max=60)
SM2 ; flagMAD(window="30d", z=3.5)
SM2 ; flagZScore(window="30d", thresh=3.5, method='modified', center=False)
......@@ -3,5 +3,5 @@ SM2;align(freq="15Min", method="nshift");False
'.*';flagRange(min=10, max=60);False
SM2;flagMissing();False
SM2;flagRange(min=10, max=60);False
SM2;flagMAD(window="30d", z=3.5);False
SM2;flagZScore(window="30d", thresh=3.5, method='modified', center=False);False
Dummy;flag(func=(isflagged(SM1) | isflagged(SM2)))
varname;test
#------;--------------------------
SM2 ;flagRange(min=10, max=60)
SM2 ;flagMAD(window="30d", z=3.5)
SM2 ;flagZScore(window="30d", thresh=3.5, method="modified", center=False)
SM2 ;plot()
\ No newline at end of file
varname;test
#------;--------------------------
SM2 ;flagRange(min=-20, max=60)
SM2 ;flagMAD(window="30d", z=3.5)
SM2 ;flagZScore(window="30d", thresh=3.5, method='modified', center=False)
SM2 ;plot()
\ No newline at end of file
......@@ -2,8 +2,8 @@ varname;test
#------;--------------------------
SM1;flagRange(min=10, max=60)
SM2;flagRange(min=10, max=60)
SM1;flagMAD(window="15d", z=3.5)
SM2;flagMAD(window="30d", z=3.5)
SM1;flagZScore(window="15d", thresh=3.5, method='modified')
SM2;flagZScore(window="30d", thresh=3.5, method='modified')
SM1;plot(path='../resources/temp/SM1processingResults')
SM2;plot(path='../resources/temp/SM2processingResults')
SPDX-FileCopyrightText: 2021 Helmholtz-Zentrum für Umweltforschung GmbH - UFZ
SPDX-License-Identifier: GPL-3.0-or-later
\ No newline at end of file
docs/resources/images/ZscorePopulation.png (793 KiB)
SPDX-FileCopyrightText: 2021 Helmholtz-Zentrum für Umweltforschung GmbH - UFZ
SPDX-License-Identifier: GPL-3.0-or-later
\ No newline at end of file
......@@ -6,7 +6,6 @@ Click==8.1.3
docstring_parser==0.15
dtw==1.4.0
matplotlib==3.7.1
numba==0.57.0
numpy==1.24.3
outlier-utils==0.0.3
pyarrow==11.0.0
......@@ -14,4 +13,4 @@ pandas==2.0.1
scikit-learn==1.2.2
scipy==1.10.1
typing_extensions==4.5.0
fancy-collections==0.1.3
fancy-collections==0.2.1
\ No newline at end of file
......@@ -479,7 +479,6 @@ class Flags:
"""
Transform the flags container to a ``DictOfSeries``.
.. deprecated:: 2.4
use `saqc.DictOfSeries(obj)` instead.
......
......@@ -4,10 +4,8 @@
# -*- coding: utf-8 -*-
from __future__ import annotations
import warnings
from typing import Any, Hashable, Mapping
import numpy as np
import pandas as pd
from fancy_collections import DictOfPandas
......@@ -37,19 +35,6 @@ class DictOfSeries(DictOfPandas):
def attrs(self, value: Mapping[Hashable, Any]) -> None:
self._attrs = dict(value)
def to_df(self, how="outer") -> pd.DataFrame:
"""
Transform DictOfSeries to a pandas.DataFrame.
.. deprecated:: 2.4
use `DictOfSeries.to_pandas()` instead.
"""
warnings.warn(
f"`to_df()` is deprecated use `to_pandas()` instead.",
category=DeprecationWarning,
)
return self.to_pandas(how)
def flatten(self, promote_index: bool = False) -> DictOfSeries:
"""
Return a copy.
......@@ -57,16 +42,6 @@ class DictOfSeries(DictOfPandas):
"""
return self.copy()
def to_pandas(self, how="outer"):
# This is a future feature from fancy_collections.DictOfPandas
# wich probably will come in the next version 0.1.4.
# We adopt this early, to prevent a second refactoring.
# The docstring will be different, so we keep the
# dynamic docstring allocation, down below.
# If the feature is present we just need to delete the
# entire method here.
return self.to_dataframe(how)
def index_of(self, method="union") -> pd.Index:
"""Return an index with indices from all columns.
......@@ -194,8 +169,3 @@ or is dropped if `how='inner'`
a b c
1 11.0 22.0 33.0
"""
DictOfSeries.to_dataframe.__doc__ = DictOfSeries.to_pandas.__doc__.replace(
"to_pandas", "to_dataframe"
)
......@@ -8,9 +8,8 @@
from __future__ import annotations
import typing
from typing import TYPE_CHECKING, Callable, Tuple
from typing import TYPE_CHECKING, Callable, Literal, Tuple
import numba
import numpy as np
import pandas as pd
......@@ -44,9 +43,11 @@ class ChangepointsMixin:
Parameters
----------
stat_func :
A function that assigns a value to every twin window. The backward-facing
window content will be passed as the first array, the forward-facing window
content as the second.
* If callable: A function that assigns a scalar value to every twin window. The backward-facing
window content will be passed as the first array, the forward-facing window
content as the second.
* If string: The respective statistic will be calculated for both windows, and the absolute difference of
the results will be returned.
thresh_func :
A function that determines the value level, exceeding which qualifies a
......@@ -245,31 +246,8 @@ def _getChangePoints(
check_len = len(fwd_end)
data_arr = data.values
# Please keep this as I sometimes need to disable jitting manually
# to make it work with my debugger :/
# --palmb
try_to_jit = True
if try_to_jit:
jit_sf = numba.jit(stat_func, nopython=True)
jit_tf = numba.jit(thresh_func, nopython=True)
try:
jit_sf(
data_arr[bwd_start[0] : bwd_end[0]], data_arr[fwd_start[0] : fwd_end[0]]
)
jit_tf(
data_arr[bwd_start[0] : bwd_end[0]], data_arr[fwd_start[0] : fwd_end[0]]
)
stat_func = jit_sf
thresh_func = jit_tf
except (numba.TypingError, numba.UnsupportedError, IndexError):
try_to_jit = False
args = data_arr, bwd_start, fwd_end, split, stat_func, thresh_func, check_len
if try_to_jit:
stat_arr, thresh_arr = _slidingWindowSearchNumba(*args)
else:
stat_arr, thresh_arr = _slidingWindowSearch(*args)
stat_arr, thresh_arr = _slidingWindowSearch(*args)
result_arr = stat_arr > thresh_arr
......@@ -324,20 +302,6 @@ def _getChangePoints(
)
@numba.jit(parallel=True, nopython=True)
def _slidingWindowSearchNumba(
data_arr, bwd_start, fwd_end, split, stat_func, thresh_func, num_val
):
stat_arr = np.zeros(num_val)
thresh_arr = np.zeros(num_val)
for win_i in numba.prange(0, num_val - 1):
x = data_arr[bwd_start[win_i] : split[win_i]]
y = data_arr[split[win_i] : fwd_end[win_i]]
stat_arr[win_i] = stat_func(x, y)
thresh_arr[win_i] = thresh_func(x, y)
return stat_arr, thresh_arr
def _slidingWindowSearch(
data_arr, bwd_start, fwd_end, split, stat_func, thresh_func, num_val
):
......@@ -353,7 +317,7 @@ def _slidingWindowSearch(
def _reduceCPCluster(stat_arr, thresh_arr, start, end, obj_func, num_val):
out_arr = np.zeros(shape=num_val, dtype=bool)
for win_i in numba.prange(0, num_val):
for win_i in range(num_val):
s, e = start[win_i], end[win_i]
x = stat_arr[s:e]
y = thresh_arr[s:e]
......