saqc.py

    field : str
        Name of the column, holding the data-to-be-interpolated.
    
    method : {"linear", "time", "nearest", "zero", "slinear", "quadratic", "cubic", "spline", "barycentric",
        "polynomial", "krogh", "piecewise_polynomial", "spline", "pchip", "akima"}
        The interpolation method to use.
    
    inter_order : int, default 2
        If there your selected interpolation method can be performed at different 'orders' - here you pass the desired
        order.
    
    inter_limit : int, default 2
        Maximum number of consecutive 'nan' values allowed for a gap to be interpolated. This really restricts the
        interpolation to chunks, containing not more than `inter_limit` successive nan entries.
    
    flag : float or None, default UNFLAGGED
        Flag that is set for interpolated values. If ``None``, no flags are set at all.
    
    downgrade_interpolation : bool, default False
        If `True` and the interpolation can not be performed at current order, retry with a lower order.
        This can happen, because the chosen ``method`` does not support the passed ``inter_order``, or
        simply because not enough values are present in a interval.
    """
    pass


def interpolateIndex(field, freq, method, inter_order, inter_limit, downgrade_interpolation):
    """
    Function to interpolate the data at regular (equidistant) timestamps (or Grid points).
    
    Note, that the interpolation will only be calculated, for grid timestamps that have a preceding AND a succeeding
    valid data value within "freq" range.
    
    Parameters
    ----------
    field : str
        Name of the column, holding the data-to-be-interpolated.
    
    freq : str
        An Offset String, interpreted as the frequency of
        the grid you want to interpolate your data at.
    
    method : {"linear", "time", "nearest", "zero", "slinear", "quadratic", "cubic", "spline", "barycentric",
        "polynomial", "krogh", "piecewise_polynomial", "spline", "pchip", "akima"}: string
        The interpolation method you want to apply.
    
    inter_order : int, default 2
        If there your selected interpolation method can be performed at different 'orders' - here you pass the desired
        order.
    
    inter_limit : int, default 2
        Maximum number of consecutive 'nan' values allowed for a gap to be interpolated. This really restricts the
        interpolation to chunks, containing not more than `inter_limit` successive nan entries.
    
    downgrade_interpolation : bool, default False
        If `True` and the interpolation can not be performed at current order, retry with a lower order.
        This can happen, because the chosen ``method`` does not support the passed ``inter_order``, or
        simply because not enough values are present in a interval.
    
    """
    pass


def flagByStatLowPass(field):
    """
    Flag *chunks* of length, `winsz`:
    
    1. If they excexceed `thresh` with regard to `stat`:
    2. If all (maybe overlapping) *sub-chunks* of *chunk*, with length `sub_winsz`,
       `excexceed `sub_thresh` with regard to `stat`:
    
    Parameters
    ----------
    field : str
        The fieldname of the column, holding the data-to-be-flagged.
    """
    pass


def flagByStray(field, partition_freq, partition_min, iter_start, alpha, flag):
    """
    Flag outliers in 1-dimensional (score) data with the STRAY Algorithm.
    
    Find more information on the algorithm in References [1].
    
    Parameters
    ----------
    field : str
        The fieldname of the column, holding the data-to-be-flagged.
    partition_freq : str, int, or None, default None
        Determines the segmentation of the data into partitions, the kNN algorithm is
        applied onto individually.
    
        * ``np.inf``: Apply Scoring on whole data set at once
        * ``x`` > 0 : Apply scoring on successive data chunks of periods length ``x``
        * Offset String : Apply scoring on successive partitions of temporal extension matching the passed offset
          string
    
    partition_min : int, default 11
        Minimum number of periods per partition that have to be present for a valid outlier dettection to be made in
        this partition. (Only of effect, if `partition_freq` is an integer.) Partition min value must always be
        greater then the nn_neighbors value.
    
    iter_start : float, default 0.5
        Float in [0,1] that determines which percentage of data is considered "normal". 0.5 results in the stray
        algorithm to search only the upper 50 % of the scores for the cut off point. (See reference section for more
        information)
    
    alpha : float, default 0.05
        Level of significance by which it is tested, if a score might be drawn from another distribution, than the
        majority of the data.
    
    flag : float, default BAD
        flag to set.
    
    References
    ----------
    [1] Talagala, P. D., Hyndman, R. J., & Smith-Miles, K. (2019). Anomaly detection in high dimensional data.
        arXiv preprint arXiv:1908.04000.
    """
    pass


def flagMVScores(field, fields, trafo, alpha, n_neighbors, scoring_func, iter_start, stray_partition, stray_partition_min, trafo_on_partition, reduction_range, reduction_drop_flagged, reduction_thresh, reduction_min_periods, flag):
    """
    The algorithm implements a 3-step outlier detection procedure for simultaneously flagging of higher dimensional
    data (dimensions > 3).
    
    In references [1], the procedure is introduced and exemplified with an application on hydrological data.
    
    See the notes section for an overview over the algorithms basic steps.
    
    Parameters
    ----------
    field : str
        The fieldname of the column, holding the data-to-be-flagged. (Here a dummy, for structural reasons)
    fields : List[str]
        List of fieldnames, corresponding to the variables that are to be included into the flagging process.
    trafo : callable, default lambda x:x
        Transformation to be applied onto every column before scoring. Will likely get deprecated soon. Its better
        to transform the data in a processing step, preceeeding the call to ``flagMVScores``.
    alpha : float, default 0.05
        Level of significance by which it is tested, if an observations score might be drawn from another distribution
        than the majority of the observation.
    n_neighbors : int, default 10
        Number of neighbors included in the scoring process for every datapoint.
    scoring_func : Callable[numpy.array, float], default np.sum
        The function that maps the set of every points k-nearest neighbor distances onto a certain scoring.
    iter_start : float, default 0.5
        Float in [0,1] that determines which percentage of data is considered "normal". 0.5 results in the threshing
        algorithm to search only the upper 50 % of the scores for the cut off point. (See reference section for more
        information)
    stray_partition : {None, str, int}, default None
        Only effective when `threshing` = 'stray'.
        Determines the size of the data partitions, the data is decomposed into. Each partition is checked seperately
        for outliers. If a String is passed, it has to be an offset string and it results in partitioning the data into
        parts of according temporal length. If an integer is passed, the data is simply split up into continous chunks
        of `partition_freq` periods. if ``None`` is passed (default), all the data will be tested in one run.
    stray_partition_min : int, default 11
        Only effective when `threshing` = 'stray'.
        Minimum number of periods per partition that have to be present for a valid outlier detection to be made in
        this partition. (Only of effect, if `stray_partition` is an integer.)
    trafo_on_partition : bool, default True
        Whether or not to apply the passed transformation on every partition the algorithm is applied on, separately.
    reduction_range : {None, str}, default None
        If not None, it is tried to reduce the stray result onto single outlier components of the input fields.
        An offset string, denoting the range of the temporal surrounding to include into the MAD testing while trying
        to reduce flags.
    reduction_drop_flagged : bool, default False
        Only effective when `reduction_range` is not ``None``.
        Whether or not to drop flagged values other than the value under test from the temporal surrounding
        before checking the value with MAD.
    reduction_thresh : float, default 3.5
        Only effective when `reduction_range` is not ``None``.
        The `critical` value, controlling wheather the MAD score is considered referring to an outlier or not.
        Higher values result in less rigid flagging. The default value is widely considered apropriate in the
        literature.
    reduction_min_periods : int, 1
        Only effective when `reduction_range` is not ``None``.
        Minimum number of meassurements necessarily present in a reduction interval for reduction actually to be
        performed.
    flag : float, default BAD
        flag to set.
    
    Notes
    -----
    The basic steps are:
    
    1. transforming
    
    The different data columns are transformed via timeseries transformations to
    (a) make them comparable and
    (b) make outliers more stand out.
    
    This step is usually subject to a phase of research/try and error. See [1] for more details.
    
    Note, that the data transformation as an built-in step of the algorithm, will likely get deprecated soon. Its better
    to transform the data in a processing step, preceeding the multivariate flagging process. Also, by doing so, one
    gets mutch more control and variety in the transformation applied, since the `trafo` parameter only allows for
    application of the same transformation to all of the variables involved.
    
    2. scoring
    
    Every observation gets assigned a score depending on its k nearest neighbors. See the `scoring_method` parameter
    description for details on the different scoring methods. Furthermore [1], [2] may give some insight in the
    pro and cons of the different methods.
    
    3. threshing
    
    The gaps between the (greatest) scores are tested for beeing drawn from the same
    distribution as the majority of the scores. If a gap is encountered, that, with sufficient significance, can be
    said to not be drawn from the same distribution as the one all the smaller gaps are drawn from, than
    the observation belonging to this gap, and all the observations belonging to gaps larger then this gap, get flagged
    outliers. See description of the `threshing` parameter for more details. Although [2] gives a fully detailed
    overview over the `stray` algorithm.
    """
    pass


def flagRaise(field, thresh, raise_window, intended_freq, average_window, mean_raise_factor, min_slope, min_slope_weight, numba_boost, flag):
    """
    The function flags raises and drops in value courses, that exceed a certain threshold
    within a certain timespan.
    
    The parameter variety of the function is owned to the intriguing
    case of values, that "return" from outlierish or anomalious value levels and
    thus exceed the threshold, while actually being usual values.
    
    NOTE, the dataset is NOT supposed to be harmonized to a time series with an
    equidistant frequency grid.
    
    Parameters
    ----------
    field : str
        The fieldname of the column, holding the data-to-be-flagged.
    thresh : float
        The threshold, for the total rise (thresh > 0), or total drop (thresh < 0), value courses must
        not exceed within a timespan of length `raise_window`.
    raise_window : str
        An offset string, determining the timespan, the rise/drop thresholding refers to. Window is inclusively defined.
    intended_freq : str
        An offset string, determining The frequency, the timeseries to-be-flagged is supposed to be sampled at.
        The window is inclusively defined.
    average_window : {None, str}, default None
        See condition (2) of the description linked in the references. Window is inclusively defined.
        The window defaults to 1.5 times the size of `raise_window`
    mean_raise_factor : float, default 2
        See second condition listed in the notes below.
    min_slope : {None, float}, default None
        See third condition listed in the notes below.
    min_slope_weight : float, default 0.8
        See third condition listed in the notes below.
    numba_boost : bool, default True
        deprecated ?
    flag : float, default BAD
        flag to set.
    
    Notes
    -----
    The value :math:`x_{k}` of a time series :math:`x` with associated
    timestamps :math:`t_i`, is flagged a raise, if:
    
    * There is any value :math:`x_{s}`, preceeding :math:`x_{k}` within `raise_window` range, so that:
    
      * :math:`M = |x_k - x_s | >`  `thresh` :math:`> 0`
    
    * The weighted average :math:`\mu^{*}` of the values, preceding :math:`x_{k}` within `average_window`
      range indicates, that :math:`x_{k}` does not return from an "outlierish" value course, meaning that:
    
      * :math:`x_k > \mu^* + ( M` / `mean_raise_factor` :math:`)`
    
    * Additionally, if `min_slope` is not `None`, :math:`x_{k}` is checked for being sufficiently divergent from its
      very predecessor :max:`x_{k-1}`$, meaning that, it is additionally checked if:
    
      * :math:`x_k - x_{k-1} >` `min_slope`
      * :math:`t_k - t_{k-1} >` `min_slope_weight` :math:`\times` `intended_freq`
    """
    pass


def flagMAD(field, window, flag):
    """
    The function represents an implementation of the modyfied Z-score outlier detection method.
    
    See references [1] for more details on the algorithm.
    
    Note, that the test needs the input data to be sampled regularly (fixed sampling rate).
    
    Parameters
    ----------
    field : str
        The fieldname of the column, holding the data-to-be-flagged. (Here a dummy, for structural reasons)
    window : str
       Offset string. Denoting the windows size that the "Z-scored" values have to lie in.
    z: float, default 3.5
        The value the Z-score is tested against. Defaulting to 3.5 (Recommendation of [1])
    flag : float, default BAD
        flag to set.
    
    References
    ----------
    [1] https://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm
    """
    pass


def flagOffset(field, thresh, tolerance, window, rel_thresh, numba_kickin, flag):
    """
    A basic outlier test that is designed to work for harmonized and not harmonized data.
    
    The test classifies values/value courses as outliers by detecting not only a rise in value, but also,
    checking for a return to the initial value level.
    
    Values :math:`x_n, x_{n+1}, .... , x_{n+k}` of a timeseries :math:`x` with associated timestamps
    :math:`t_n, t_{n+1}, .... , t_{n+k}` are considered spikes, if
    
    1. :math:`|x_{n-1} - x_{n + s}| >` `thresh`, for all :math:`s \in [0,1,2,...,k]`
    
    2. :math:`|x_{n-1} - x_{n+k+1}| <` `tolerance`
    
    3. :math:`|t_{n-1} - t_{n+k+1}| <` `window`
    
    Note, that this definition of a "spike" not only includes one-value outliers, but also plateau-ish value courses.
    
    
    Parameters
    ----------
    field : str
        The fieldname of the column, holding the data-to-be-flagged. (Here a dummy, for structural reasons)
    thresh : float
        Minimum difference between to values, to consider the latter one as a spike. See condition (1)
    tolerance : float
        Maximum difference between pre-spike and post-spike values. See condition (2)
    window : {str, int}, default '15min'
        Maximum length of "spiky" value courses. See condition (3). Integer defined window length are only allowed for
        regularly sampled timeseries.
    rel_thresh : {float, None}, default None
        Relative threshold.
    numba_kickin : int, default 200000
        When there are detected more than `numba_kickin` incidents of potential spikes,
        the pandas.rolling - part of computation gets "jitted" with numba.
        Default value hast proven to be around the break even point between "jit-boost" and "jit-costs".
    flag : float, default BAD
        flag to set.
    
    References
    ----------
    The implementation is a time-window based version of an outlier test from the UFZ Python library,
    that can be found here:
    
    https://git.ufz.de/chs/python/blob/master/ufz/level1/spike.py
    """
    pass


def flagByGrubbs(field, winsz, alpha, min_periods, flag):
    """
    The function flags values that are regarded outliers due to the grubbs test.
    
    See reference [1] for more information on the grubbs tests definition.
    
    The (two-sided) test gets applied onto data chunks of size "winsz". The tests application  will
    be iterated on each data-chunk under test, till no more outliers are detected in that chunk.
    
    Note, that the test performs poorely for small data chunks (resulting in heavy overflagging).
    Therefor you should select "winsz" so that every window contains at least > 8 values and also
    adjust the min_periods values accordingly.
    
    Note, that the data to be tested by the grubbs test are expected to be distributed "normalish".
    
    Parameters
    ----------
    field : str
        The fieldname of the column, holding the data-to-be-flagged.
    winsz : {int, str}
        The size of the window you want to use for outlier testing. If an integer is passed, the size
        refers to the number of periods of every testing window. If a string is passed, it has to be an offset string,
        and will denote the total temporal extension of every window.
    alpha : float, default 0.05
        The level of significance, the grubbs test is to be performed at. (between 0 and 1)
    min_periods : int, default 8
        The minimum number of values that have to be present in an interval under test, for a grubbs test result to be
        accepted. Only makes sence in case `winsz` is an offset string.
    check_lagged: boolean, default False
        If True, every value gets checked twice for being an outlier. Ones in the initial rolling window and one more
        time in a rolling window that is lagged by half the windows delimeter (winsz/2). Recommended for avoiding false
        positives at the window edges. Only available when rolling with integer defined window size.
    flag : float, default BAD
        flag to set.
    
    References
    ----------
    introduction to the grubbs test:
    
    [1] https://en.wikipedia.org/wiki/Grubbs%27s_test_for_outliers
    """
    pass


def flagRange(field, min, max, flag):
    """
    Function flags values not covered by the closed interval [`min`, `max`].
    
    Parameters
    ----------
    field : str
        The fieldname of the column, holding the data-to-be-flagged.
    min : float
        Lower bound for valid data.
    max : float
        Upper bound for valid data.
    flag : float, default BAD
        flag to set.
    """
    pass


def flagCrossStatistic(field, fields, thresh, cross_stat, flag):
    """
    Function checks for outliers relatively to the "horizontal" input data axis.
    
    For `fields` :math:`=[f_1,f_2,...,f_N]` and timestamps :math:`[t_1,t_2,...,t_K]`, the following steps are taken
    for outlier detection:
    
    1. All timestamps :math:`t_i`, where there is one :math:`f_k`, with :math:`data[f_K]` having no entry at
       :math:`t_i`, are excluded from the following process (inner join of the :math:`f_i` fields.)
    2. for every :math:`0 <= i <= K`, the value
       :math:`m_j = median(\{data[f_1][t_i], data[f_2][t_i], ..., data[f_N][t_i]\})` is calculated
    2. for every :math:`0 <= i <= K`, the set
       :math:`\{data[f_1][t_i] - m_j, data[f_2][t_i] - m_j, ..., data[f_N][t_i] - m_j\}` is tested for outliers with the
       specified method (`cross_stat` parameter).
    
    Parameters
    ----------
    field : str
        A dummy parameter.
    fields : str
        List of fieldnames in data, determining wich variables are to be included into the flagging process.
    thresh : float
        Threshold which the outlier score of an value must exceed, for being flagged an outlier.
    cross_stat : {'modZscore', 'Zscore'}, default 'modZscore'
        Method used for calculating the outlier scores.
    
        * ``'modZscore'``: Median based "sigma"-ish approach. See Referenecs [1].
        * ``'Zscore'``: Score values by how many times the standard deviation they differ from the median.
          See References [1]
    
    flag : float, default BAD
        flag to set.
    
    References
    ----------
    [1] https://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm
    """
    pass


def flagPatternByDTW(field, flag):
    """
    Pattern recognition via wavelets.
    
    The steps are:
     1. work on chunks returned by a moving window
     2. each chunk is compared to the given pattern, using the wavelet algorithm as presented in [1]
     3. if the compared chunk is equal to the given pattern it gets flagged
    
    Parameters
    ----------
    
    field : str
        The fieldname of the data column, you want to correct.
    flag : float, default BAD
        flag to set.
    
    kwargs
    
    References
    ----------
    
    The underlying pattern recognition algorithm using wavelets is documented here:
    [1] Maharaj, E.A. (2002): Pattern Recognition of Time Series using Wavelets. In: Härdle W., Rönz B. (eds) Compstat. Physica, Heidelberg, 978-3-7908-1517-7.
    
    The documentation of the python package used for the wavelt decomposition can be found here:
    [2] https://pywavelets.readthedocs.io/en/latest/ref/cwt.html#continuous-wavelet-families
    """
    pass


def flagPatternByWavelet(field, flag):
    """
    Pattern Recognition via Dynamic Time Warping.
    
    The steps are:
     1. work on chunks returned by a moving window
     2. each chunk is compared to the given pattern, using the dynamic time warping algorithm as presented in [1]
     3. if the compared chunk is equal to the given pattern it gets flagged
    
    Parameters
    ----------
    
    field : str
        The fieldname of the data column, you want to correct.
    flag : float, default BAD
        flag to set.
    
    References
    ----------
    Find a nice description of underlying the Dynamic Time Warping Algorithm here:
    
    [1] https://cran.r-project.org/web/packages/dtw/dtw.pdf
    """
    pass


def aggregate(field, freq, value_func, flag_func, method, flag):
    """
    A method to "regularize" data by aggregating (resampling) data at a regular timestamp.
    
    A series of data is considered "regular", if it is sampled regularly (= having uniform sampling rate).
    
    The data will therefor get aggregated with a function, specified by the `value_func` parameter and
    the result gets projected onto the new timestamps with a method, specified by "method".
    
    The following method (keywords) are available:
    
    * ``'nagg'``: (aggreagtion to nearest) - all values in the range (+/- freq/2) of a grid point get aggregated with
      `agg_func`. and assigned to it. Flags get aggregated by `flag_func` and assigned the same way.
    * ``'bagg'``: (backwards aggregation) - all values in a sampling interval get aggregated with agg_func and the
      result gets assigned to the last regular timestamp. Flags get aggregated by `flag_func` and assigned the same way.
    * ``'fagg'``: (forward aggregation) - all values in a sampling interval get aggregated with agg_func and the result
      gets assigned to the next regular timestamp. Flags get aggregated by `flag_func` and assigned the same way.
    
    Note, that, if there is no valid data (exisitng and not-na) available in a sampling interval assigned to a regular
    timestamp by the selected method, nan gets assigned to this timestamp. The associated flag will be of value
    ``UNFLAGGED``.
    
    Note: the method will likely and significantly alter values and shape of ``data[field]``. The original data is kept
    in the data dios and assigned to the fieldname ``field + '_original'``.
    
    Parameters
    ----------
    field : str
        The fieldname of the column, holding the data-to-be-regularized.
    
    freq : str
        The sampling frequency the data is to be aggregated (resampled) at.
    
    value_func : Callable
        The function you want to use for aggregation.
    
    flag_func : Callable
        The function you want to aggregate the flags with. It should be capable of operating on the flags dtype
        (usually ordered categorical).
    
    method : {'fagg', 'bagg', 'nagg'}, default 'nagg'
        Specifies which intervals to be aggregated for a certain timestamp. (preceeding, succeeding or
        "surrounding" interval). See description above for more details.
    
    flag : float, default BAD
        flag to set.
    
    """
    pass


def linear(field, freq):
    """
    A method to "regularize" data by interpolating linearly the data at regular timestamp.
    
    A series of data is considered "regular", if it is sampled regularly (= having uniform sampling rate).
    
    Interpolated values will get assigned the worst flag within freq-range.
    
    Note: the method will likely and significantly alter values and shape of ``data[field]``. The original data is kept
    in the data dios and assigned to the fieldname ``field + '_original'``.
    
    Note, that the data only gets interpolated at those (regular) timestamps, that have a valid (existing and
    not-na) datapoint preceeding them and one succeeding them within freq range.
    Regular timestamp that do not suffice this condition get nan assigned AND The associated flag will be of value
    ``UNFLAGGED``.
    
    Parameters
    ----------
    field : str
        The fieldname of the column, holding the data-to-be-regularized.
    
    freq : str
        An offset string. The frequency of the grid you want to interpolate your data at.
    """
    pass


def interpolate(field, freq, method, order):
    """
    A method to "regularize" data by interpolating the data at regular timestamp.
    
    A series of data is considered "regular", if it is sampled regularly (= having uniform sampling rate).
    
    Interpolated values will get assigned the worst flag within freq-range.
    
    There are available all the interpolations from the pandas.Series.interpolate method and they are called by
    the very same keywords.
    
    Note, that, to perform a timestamp aware, linear interpolation, you have to pass ``'time'`` as `method`,
    and NOT ``'linear'``.
    
    Note: the `method` will likely and significantly alter values and shape of ``data[field]``. The original data is
    kept in the data dios and assigned to the fieldname ``field + '_original'``.
    
    Note, that the data only gets interpolated at those (regular) timestamps, that have a valid (existing and
    not-na) datapoint preceeding them and one succeeding them within freq range.
    Regular timestamp that do not suffice this condition get nan assigned AND The associated flag will be of value
    ``UNFLAGGED``.
    
    Parameters
    ----------
    field : str
        The fieldname of the column, holding the data-to-be-regularized.
    
    freq : str
        An offset string. The frequency of the grid you want to interpolate your data at.
    
    method : {"linear", "time", "nearest", "zero", "slinear", "quadratic", "cubic", "spline", "barycentric",
        "polynomial", "krogh", "piecewise_polynomial", "spline", "pchip", "akima"}
        The interpolation method you want to apply.
    
    order : int, default 1
        If your selected interpolation method can be performed at different *orders* - here you pass the desired
        order.
    """
    pass


def mapToOriginal(field, method):
    """
    The Function function "undoes" regularization, by regaining the original data and projecting the
    flags calculated for the regularized data onto the original ones.
    
    Afterwards the regularized data is removed from the data dios and ``'field'`` will be associated
    with the original data "again".
    
    Wherever the flags in the original data are "better" then the regularized flags projected on them,
    they get overridden with this regularized flags value.
    
    Which regularized flags are to be projected on which original flags, is controlled by the "method" parameters.
    
    Generally, if you regularized with the method "X", you should pass the method "inverse_X" to the deharmonization.
    If you regularized with an interpolation, the method "inverse_interpolation" would be the appropriate choice.
    Also you should pass the same drop flags keyword.
    
    The deharm methods in detail:
    ("original_flags" are associated with the original data that is to be regained,
    "regularized_flags" are associated with the regularized data that is to be "deharmonized",
    "freq" refers to the regularized datas sampling frequencie)
    
    * ``'inverse_nagg'``: all original_flags within the range *+/- freq/2* of a regularized_flag, get assigned this
      regularized flags value. (if regularized_flags > original_flag)
    * ``'inverse_bagg'``: all original_flags succeeding a regularized_flag within the range of "freq", get assigned this
      regularized flags value. (if regularized_flag > original_flag)
    * ``'inverse_fagg'``: all original_flags preceeding a regularized_flag within the range of "freq", get assigned this
      regularized flags value. (if regularized_flag > original_flag)
    
    * ``'inverse_interpolation'``: all original_flags within the range *+/- freq* of a regularized_flag, get assigned this
      regularized flags value (if regularized_flag > original_flag).
    
    * ``'inverse_nshift'``: That original_flag within the range +/- *freq/2*, that is nearest to a regularized_flag,
      gets the regularized flags value. (if regularized_flag > original_flag)
    * ``'inverse_bshift'``: That original_flag succeeding a source flag within the range freq, that is nearest to a
      regularized_flag, gets assigned this regularized flags value. (if regularized_flag > original_flag)
    * ``'inverse_nshift'``: That original_flag preceeding a regularized flag within the range freq, that is nearest to a
      regularized_flag, gets assigned this regularized flags value. (if source_flag > original_flag)
    
    Parameters
    ----------
    field : str
        The fieldname of the column, holding the data-to-be-deharmonized.
    
    method : {'inverse_fagg', 'inverse_bagg', 'inverse_nagg', 'inverse_fshift', 'inverse_bshift', 'inverse_nshift',
            'inverse_interpolation'}
        The method used for projection of regularized flags onto original flags. See description above for more
        details.
    """
    pass


def shift(field, freq, method, freq_check):
    """
    Function to shift data and flags to a regular (equidistant) timestamp grid, according to ``method``.
    
    Parameters
    ----------
    field : str
        The fieldname of the column, holding the data-to-be-shifted.
    
    freq : str
        An frequency Offset String that will be interpreted as the sampling rate you want the data to be shifted to.
    
    method : {'fshift', 'bshift', 'nshift'}, default 'nshift'
        Specifies how misaligned data-points get propagated to a grid timestamp.
        Following choices are available:
    
        * 'nshift' : every grid point gets assigned the nearest value in its range. (range = +/- 0.5 * `freq`)
        * 'bshift' : every grid point gets assigned its first succeeding value, if one is available in
          the succeeding sampling interval.
        * 'fshift' : every grid point gets assigned its ultimately preceding value, if one is available in
          the preceeding sampling interval.
    
    freq_check : {None, 'check', 'auto'}, default None
    
        * ``None`` : do not validate frequency-string passed to `freq`
        * 'check' : estimate frequency and log a warning if estimate miss matches frequency string passed to `freq`,
          or if no uniform sampling rate could be estimated
        * 'auto' : estimate frequency and use estimate. (Ignores `freq` parameter.)
    """
    pass


def resample(field, freq, agg_func, max_invalid_total_d, max_invalid_consec_d, max_invalid_total_f, max_invalid_consec_f, flag_agg_func, freq_check):
    """
    Function to resample the data. Afterwards the data will be sampled at regular (equidistant) timestamps
    (or Grid points). Sampling intervals therefor get aggregated with a function, specifyed by 'agg_func' parameter and
    the result gets projected onto the new timestamps with a method, specified by "method". The following method
    (keywords) are available:
    
    * ``'nagg'``: all values in the range (+/- `freq`/2) of a grid point get aggregated with agg_func and assigned to it.
    * ``'bagg'``: all values in a sampling interval get aggregated with agg_func and the result gets assigned to the last
      grid point.
    * ``'fagg'``: all values in a sampling interval get aggregated with agg_func and the result gets assigned to the next
      grid point.
    
    
    Note, that. if possible, functions passed to agg_func will get projected internally onto pandas.resample methods,
    wich results in some reasonable performance boost - however, for this to work, you should pass functions that have
    the __name__ attribute initialised and the according methods name assigned to it.
    Furthermore, you shouldnt pass numpys nan-functions
    (``nansum``, ``nanmean``,...) because those for example, have ``__name__ == 'nansum'`` and they will thus not
    trigger ``resample.func()``, but the slower ``resample.apply(nanfunc)``. Also, internally, no nans get passed to
    the functions anyway, so that there is no point in passing the nan functions.
    
    Parameters
    ----------
    field : str
        The fieldname of the column, holding the data-to-be-resampled.
    
    freq : str
        An Offset String, that will be interpreted as the frequency you want to resample your data with.
    
    agg_func : Callable
        The function you want to use for aggregation.
    
    method: {'fagg', 'bagg', 'nagg'}, default 'bagg'
        Specifies which intervals to be aggregated for a certain timestamp. (preceding, succeeding or
        "surrounding" interval). See description above for more details.
    
    max_invalid_total_d : {None, int}, default None
        Maximum number of invalid (nan) datapoints, allowed per resampling interval. If max_invalid_total_d is
        exceeded, the interval gets resampled to nan. By default (``np.inf``), there is no bound to the number of nan
        values in an interval and only intervals containing ONLY nan values or those, containing no values at all,
        get projected onto nan
    
    max_invalid_consec_d : {None, int}, default None
        Maximum number of consecutive invalid (nan) data points, allowed per resampling interval.
        If max_invalid_consec_d is exceeded, the interval gets resampled to nan. By default (np.inf),
        there is no bound to the number of consecutive nan values in an interval and only intervals
        containing ONLY nan values, or those containing no values at all, get projected onto nan.
    
    max_invalid_total_f : {None, int}, default None
        Same as `max_invalid_total_d`, only applying for the flags. The flag regarded as "invalid" value,
        is the one passed to empty_intervals_flag (default=``BAD``).
        Also this is the flag assigned to invalid/empty intervals.
    
    max_invalid_consec_f : {None, int}, default None
        Same as `max_invalid_total_f`, only applying onto flags. The flag regarded as "invalid" value, is the one passed
        to empty_intervals_flag. Also this is the flag assigned to invalid/empty intervals.
    
    flag_agg_func : Callable, default: max
        The function you want to aggregate the flags with. It should be capable of operating on the flags dtype
        (usually ordered categorical).
    
    freq_check : {None, 'check', 'auto'}, default None
    
        * ``None``: do not validate frequency-string passed to `freq`
        * ``'check'``: estimate frequency and log a warning if estimate miss matchs frequency string passed to 'freq', or
          if no uniform sampling rate could be estimated
        * ``'auto'``: estimate frequency and use estimate. (Ignores `freq` parameter.)
    """
    pass


def reindexFlags(field, method, source, freq):
    """
    The Function projects flags of "source" onto flags of "field". Wherever the "field" flags are "better" then the
    source flags projected on them, they get overridden with this associated source flag value.
    
    Which "field"-flags are to be projected on which source flags, is controlled by the "method" and "freq"
    parameters.
    
    method: (field_flag in associated with "field", source_flags associated with "source")
    
    'inverse_nagg' - all field_flags within the range +/- freq/2 of a source_flag, get assigned this source flags value.
        (if source_flag > field_flag)
    'inverse_bagg' - all field_flags succeeding a source_flag within the range of "freq", get assigned this source flags
        value. (if source_flag > field_flag)
    'inverse_fagg' - all field_flags preceeding a source_flag within the range of "freq", get assigned this source flags
        value. (if source_flag > field_flag)
    
    'inverse_interpolation' - all field_flags within the range +/- freq of a source_flag, get assigned this source flags value.
        (if source_flag > field_flag)
    
    'inverse_nshift' - That field_flag within the range +/- freq/2, that is nearest to a source_flag, gets the source
        flags value. (if source_flag > field_flag)
    'inverse_bshift' - That field_flag succeeding a source flag within the range freq, that is nearest to a
        source_flag, gets assigned this source flags value. (if source_flag > field_flag)
    'inverse_nshift' - That field_flag preceeding a source flag within the range freq, that is nearest to a
        source_flag, gets assigned this source flags value. (if source_flag > field_flag)
    
    'match' - any field_flag with a timestamp matching a source_flags timestamp gets this source_flags value
    (if source_flag > field_flag)
    
    Note, to undo or backtrack a resampling/shifting/interpolation that has been performed with a certain method,
    you can just pass the associated "inverse" method. Also you should pass the same drop flags keyword.
    
    Parameters
    ----------
    field : str
        The fieldname of the data column, you want to project the source-flags onto.
    
    method : {'inverse_fagg', 'inverse_bagg', 'inverse_nagg', 'inverse_fshift', 'inverse_bshift', 'inverse_nshift'}
        The method used for projection of source flags onto field flags. See description above for more details.
    
    source : str
        The source source of flags projection.
    
    freq : {None, str},default None
        The freq determines the projection range for the projection method. See above description for more details.
        Defaultly (None), the sampling frequency of source is used.
    """
    pass


def calculatePolynomialResidues(field, winsz, polydeg, numba, eval_flags, min_periods, flag):
    """
    Function fits a polynomial model to the data and returns the residues.
    
    The residue for value x is calculated by fitting a polynomial of degree "polydeg" to a data slice
    of size "winsz", wich has x at its center.
    
    Note, that the residues will be stored to the `field` field of the input data, so that the original data, the
    polynomial is fitted to, gets overridden.
    
    Note, that, if data[field] is not alligned to an equidistant frequency grid, the window size passed,
    has to be an offset string. Also numba boost options don`t apply for irregularly sampled
    timeseries.
    
    Note, that calculating the residues tends to be quite costy, because a function fitting is perfomed for every
    sample. To improve performance, consider the following possibillities:
    
    In case your data is sampled at an equidistant frequency grid:
    
    (1) If you know your data to have no significant number of missing values, or if you do not want to
        calculate residues for windows containing missing values any way, performance can be increased by setting
        min_periods=winsz.
    
    (2) If your data consists of more then around 200000 samples, setting numba=True, will boost the
        calculations up to a factor of 5 (for samplesize > 300000) - however for lower sample sizes,
        numba will slow down the calculations, also, up to a factor of 5, for sample_size < 50000.
        By default (numba='auto'), numba is set to true, if the data sample size exceeds 200000.
    
    in case your data is not sampled at an equidistant frequency grid:
    
    (1) Harmonization/resampling of your data will have a noticable impact on polyfittings performance - since
        numba_boost doesnt apply for irregularly sampled data in the current implementation.
    
    Note, that in the current implementation, the initial and final winsz/2 values do not get fitted.
    
    Parameters
    ----------
    field : str
        The fieldname of the column, holding the data-to-be-modelled.
    winsz : {str, int}
        The size of the window you want to use for fitting. If an integer is passed, the size
        refers to the number of periods for every fitting window. If an offset string is passed,
        the size refers to the total temporal extension. The window will be centered around the vaule-to-be-fitted.
        For regularly sampled timeseries the period number will be casted down to an odd number if
        even.
    polydeg : int
        The degree of the polynomial used for fitting
    numba : {True, False, "auto"}, default "auto"
        Wheather or not to apply numbas just-in-time compilation onto the poly fit function. This will noticably
        increase the speed of calculation, if the sample size is sufficiently high.
        If "auto" is selected, numba compatible fit functions get applied for data consisiting of > 200000 samples.
    eval_flags : bool, default True
        Wheather or not to assign new flags to the calculated residuals. If True, a residual gets assigned the worst
        flag present in the interval, the data for its calculation was obtained from.
    min_periods : {int, None}, default 0
        The minimum number of periods, that has to be available in every values fitting surrounding for the polynomial
        fit to be performed. If there are not enough values, np.nan gets assigned. Default (0) results in fitting
        regardless of the number of values present (results in overfitting for too sparse intervals). To automatically
        set the minimum number of periods to the number of values in an offset defined window size, pass np.nan.
    flag : float, default BAD
        flag to set.
    """
    pass


def calculateRollingResidues():
    """
    TODO: docstring needed
    """
    pass


def roll(field, winsz, func, eval_flags, min_periods, center, flag):
    """
    Models the data with the rolling mean and returns the residues.
    
    Note, that the residues will be stored to the `field` field of the input data, so that the data that is modelled
    gets overridden.
    
    Parameters
    ----------
    field : str
        The fieldname of the column, holding the data-to-be-modelled.
    winsz : {int, str}
        The size of the window you want to roll with. If an integer is passed, the size
        refers to the number of periods for every fitting window. If an offset string is passed,
        the size refers to the total temporal extension.
        For regularly sampled timeseries, the period number will be casted down to an odd number if
        center = True.
    func : Callable[np.array, float], default np.mean
        Function to apply on the rolling window and obtain the curve fit value.
    eval_flags : bool, default True
        Wheather or not to assign new flags to the calculated residuals. If True, a residual gets assigned the worst
        flag present in the interval, the data for its calculation was obtained from.
        Currently not implemented in combination with not-harmonized timeseries.
    min_periods : int, default 0
        The minimum number of periods, that has to be available in every values fitting surrounding for the mean
        fitting to be performed. If there are not enough values, np.nan gets assigned. Default (0) results in fitting
        regardless of the number of values present.
    center : bool, default True
        Wheather or not to center the window the mean is calculated of around the reference value. If False,
        the reference value is placed to the right of the window (classic rolling mean with lag.)
    flag : float, default BAD
        flag to set.
    """
    pass


def assignKNNScore(field, n_neighbors, trafo, trafo_on_partition, scoring_func, target_field, partition_freq, partition_min, kNN_algorithm, metric, p, radius):
    """
    TODO: docstring need a rework
    Score datapoints by an aggregation of the dictances to their k nearest neighbors.
    
    The function is a wrapper around the NearestNeighbors method from pythons sklearn library (See reference [1]).
    
    The steps taken to calculate the scores are as follows:
    
    1. All the timeseries, named fields, are combined to one feature space by an *inner* join on their date time indexes.
       thus, only samples, that share timestamps across all fields will be included in the feature space
    2. Any datapoint/sample, where one ore more of the features is invalid (=np.nan) will get excluded.
    3. For every data point, the distance to its `n_neighbors` nearest neighbors is calculated by applying the
       metric `metric` at grade `p` onto the feature space. The defaults lead to the euclidian to be applied.
       If `radius` is not None, it sets the upper bound of distance for a neighbor to be considered one of the
       `n_neigbors` nearest neighbors. Furthermore, the `partition_freq` argument determines wich samples can be
       included into a datapoints nearest neighbors list, by segmenting the data into chunks of specified temporal
       extension and feeding that chunks to the kNN algorithm seperatly.
    4. For every datapoint, the calculated nearest neighbors distances get aggregated to a score, by the function
       passed to `scoring_func`. The default, ``sum`` obviously just sums up the distances.
    5. The resulting timeseries of scores gets assigned to the field target_field.
    
    Parameters
    ----------
    field : str
        The reference variable, the deviation from wich determines the flagging.
    n_neighbors : int, default 10
        The number of nearest neighbors to which the distance is comprised in every datapoints scoring calculation.
    trafo : Callable[np.array, np.array], default lambda x: x
        Transformation to apply on the variables before kNN scoring
    trafo_on_partition : bool, default True
        Weather or not to apply the transformation `trafo` onto the whole variable or onto each partition seperatly.
    scoring_func : Callable[numpy.array, float], default np.sum
        A function that assigns a score to every one dimensional array, containing the distances
        to every datapoints `n_neighbors` nearest neighbors.
    target_field : str, default 'kNN_scores'
        Name of the field, where the resulting scores should be written to.
    partition_freq : {np.inf, float, str}, default np.inf
        Determines the segmentation of the data into partitions, the kNN algorithm is
        applied onto individually.
    
        * ``np.inf``: Apply Scoring on whole data set at once
        * ``x`` > 0 : Apply scoring on successive data chunks of periods length ``x``
        * Offset String : Apply scoring on successive partitions of temporal extension matching the passed offset
          string
    
    partition_min : int, default 2
        The minimum number of periods that have to be present in a partition for the kNN scoring
        to be applied. If the number of periods present is below `partition_min`, the score for the
        datapoints in that partition will be np.nan.
    kNN_algorithm : {'ball_tree', 'kd_tree', 'brute', 'auto'}, default 'ball_tree'
        The search algorithm to find each datapoints k nearest neighbors.