saqc.py 103 KiB
        The keyword just gets passed on to the underlying sklearn method.
        See reference [1] for more information on the algorithm.
    metric : str, default 'minkowski'
        The metric used to compute the distances to any datapoint's neighbors. The default of `metric`
        together with the default of `p` results in the Euclidean metric being applied.
        The keyword just gets passed on to the underlying sklearn method.
        See reference [1] for more information on the algorithm.
    p : int, default 2
        The degree of the metric specified by parameter `metric`.
        The keyword just gets passed on to the underlying sklearn method.
        See reference [1] for more information on the algorithm.
    radius : {None, float}, default None
        If `radius` is not None, only the distances to neighbors that lie within the range specified by `radius`
        are included in the scoring aggregation.
        The scoring method passed must be capable of handling np.nan values, since for every neighbor missing
        within the `radius` range, one np.nan value gets appended to the distance list passed to the scoring
        method, completing the list of distances to the `n_neighbors` nearest neighbors.
        The keyword just gets passed on to the underlying sklearn method.
        See reference [1] for more information on the algorithm.
    
    References
    ----------
    [1] https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html
    """
    pass
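The keywords documented above are handed straight to scikit-learn's ``NearestNeighbors``. A minimal sketch of that backend (toy data, not saqc's internal wiring), showing that ``metric='minkowski'`` with ``p=2`` is exactly the Euclidean distance:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Four points on a line; metric='minkowski' with p=2 is the Euclidean distance.
X = np.array([[0.0], [1.0], [2.0], [10.0]])
nn = NearestNeighbors(n_neighbors=2, metric="minkowski", p=2).fit(X)

# kneighbors on the training data returns each point itself in column 0
# (distance 0.0); column 1 holds the distance to the nearest other point.
dist, _ = nn.kneighbors(X)
print(dist[:, 1].tolist())  # [1.0, 1.0, 1.0, 8.0]
```

With a `radius` restriction (``radius_neighbors``), the outlying point at 10 would find no neighbor within, e.g., radius 3, which is where the np.nan padding described above comes in.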


def copy(field):
    """
    The function generates a copy of the data column `field` and inserts it under the name ``field + suffix`` into the
    existing data.
    
    Parameters
    ----------
    field : str
        The fieldname of the data column you want to fork (copy).
    """
    pass


def drop(field):
    """
    The function drops field from the data dios and the flags.
    
    Parameters
    ----------
    field : str
        The fieldname of the data column you want to drop.
    """
    pass


def rename(field, new_name):
    """
    The function renames `field` to `new_name` (in both the flags and the data).
    
    Parameters
    ----------
    field : str
        The fieldname of the data column you want to rename.
    new_name : str
        The new fieldname that `field` is to be replaced with.
    """
    pass


def mask(field, mode, mask_var, period_start, period_end, include_bounds):
    """
    This function realizes masking within saqc.
    
    Due to some inner saqc mechanics, it is not straightforwardly possible to exclude
    values or data chunks from flagging routines. This function replaces flags with the UNFLAGGED
    value wherever values are to be masked. Furthermore, the masked values get replaced by
    np.nan, so that they don't affect calculations.
    
    Here is a recipe on how to apply a flagging function only to a masked chunk of the variable `field`:

    1. duplicate "field" in the input data (copy)
    2. mask the duplicated data (mask)
    3. apply the tests you only want to be applied to the masked data chunks (saqc_tests)
    4. project the flags, calculated on the duplicated and masked data, onto the original field data
        (projectFlags or flagGeneric)
    5. drop the duplicated data (drop)

    To see an implemented example, check out flagSeasonalRange in the saqc.functions module
    
    Parameters
    ----------
    field : str
        The fieldname of the column, holding the data-to-be-masked.
    mode : {"periodic", "mask_var"}
        The masking mode.
        - "periodic": parameters "period_start", "period_end" are evaluated to generate a periodical mask
        - "mask_var": data[mask_var] is expected to be a boolean valued timeseries and is used as mask.
    mask_var : {None, str}, default None
        Only effective if mode == "mask_var"
        Fieldname of the column holding the data that is to be used as mask. (Must be a boolean series.)
        Neither the series' length nor its labels have to match data[field]'s index and length. An inner join of the
        indices will be calculated, and values get masked where the values of the inner join are "True".
    period_start : {None, str}, default None
        Only effective if mode == "periodic"
        String denoting starting point of every period. Formally, it has to be a truncated instance of "mm-ddTHH:MM:SS".
        Has to be of same length as `period_end` parameter.
        See examples section below for some examples.
    period_end : {None, str}, default None
        Only effective if mode == "periodic"
        String denoting the end point of every period. Formally, it has to be a truncated instance of "mm-ddTHH:MM:SS".
        Has to be of same length as the `period_start` parameter.
        See examples section below for some examples.
    include_bounds : boolean
        Whether or not to include the period-defining bounds in the mask.
    
    Examples
    --------
    The `period_start` and `period_end` parameters provide a convenient way to generate seasonal / date-periodic masks.
    They have to be strings of the forms "mm-ddTHH:MM:SS", "ddTHH:MM:SS", "HH:MM:SS", "MM:SS" or "SS"
    (mm=month, dd=day, HH=hour, MM=minute, SS=second).
    Single-digit specifications have to be given with leading zeros.
    `period_start` and `period_end` strings have to be of the same length (refer to the same periodicity).
    The highest date unit gives the period.
    For example:
    
    >>> period_start = "01T15:00:00"
    >>> period_end = "13T17:30:00"
    
    Will result in all values sampled between 15:00 on the first and 17:30 on the 13th of every month being masked.
    
    >>> period_start = "01:00"
    >>> period_end = "04:00"
    
    All the values between the first and 4th minute of every hour get masked.
    
    >>> period_start = "01-01T00:00:00"
    >>> period_end = "03-01T00:00:00"
    
    Mask January and February of every year. Masking is always inclusive, so in this case the mask will
    include 00:00:00 on the first of March. To exclude this one, pass:
    
    >>> period_start = "01-01T00:00:00"
    >>> period_end = "02-28T23:59:59"
    
    To mask intervals that overlap a period's frame, like nights or winter, swap the sequence of period start and
    period end. For example, to mask the night hours between 22:00:00 in the evening and 06:00:00 in the morning, pass:
    
    >>> period_start = "22:00:00"
    >>> period_end = "06:00:00"
    
    When inclusive_selection="season", all of the above examples work the same way, only that you now
    determine which values NOT to mask (i.e. which values are to constitute the "seasons").
    """
    pass
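The periodic masks above can be reproduced with plain pandas. A hedged sketch (toy data, not the saqc internals) of the minute-wise example ``period_start="01:00"``, ``period_end="04:00"``, i.e. masking minutes 01 through 04 (inclusive bounds) of every hour:

```python
import numpy as np
import pandas as pd

# Three hours of minute-resolution data.
idx = pd.date_range("2021-01-01 00:00:00", periods=180, freq="min")
data = pd.Series(np.arange(180.0), index=idx)

# Periodic mask: minute 01 up to and including minute 04 of every hour.
in_period = (idx.minute >= 1) & (idx.minute <= 4)
masked = data.where(~in_period)  # masked values become np.nan

print(int(masked.isna().sum()))  # 4 masked minutes per hour over 3 hours -> 12
```

Swapping the bounds (start "04:00", end "01:00") corresponds to negating `in_period` in this sketch, matching the overlapping-interval behavior described above.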


def plot(field, save_path, max_gap, stats, plot_kwargs, fig_kwargs, save_kwargs):
    """
    Stores or shows a figure object, containing data graph with flag marks for field.
    
    Parameters
    ----------
    field : str
        Name of the variable-to-plot
    save_path : str, default ''
        Path to the location where the figure shall be stored. If '' is passed, interactive mode is entered instead
        of storing the figure.
    max_gap : {None, str}, default None
        If None, all the points in the data will be connected, resulting in long linear lines where continuous chunks
        of data are missing. (NaNs in the data get dropped before plotting.)
        If an offset string is passed, only points whose distance is below `max_gap` get connected via the plotting
        line.
    stats : bool, default False
        Whether to include statistics table in plot.
    plot_kwargs : dict, default {}
        Keyword arguments controlling plot generation. Will be passed on to the ``Matplotlib.axes.Axes.set()`` property
        batch setter for the axes showing the data plot. The most relevant of those properties might be "ylabel",
        "title" and "ylim".
        In addition, the following options are available:
    
        * {'slice': s} property, that determines a chunk of the data to be plotted / processed. `s` can be anything,
          that is a valid argument to the ``pandas.Series.__getitem__`` method.
        * {'history': str}
            * str="all": All the flags are plotted with colored dots, referring to the tests they originate from
            * str="valid": same as 'all', but only plots those flags that are not removed by later tests
    
    fig_kwargs : dict, default {"figsize": (16, 9)}
        Keyword arguments controlling figure generation.
    save_kwargs : dict, default {}
        Keywords to be passed on to the ``matplotlib.pyplot.savefig`` method, handling the figure storing.
        NOTE: To store a pickle that can be used to regain an interactive figure window, use the option
        {'pickle': True}. This will result in all the other save_kwargs being ignored.
        To enter interactive mode for a pickled figure, simply do: pickle.load(open(save_path, 'rb')).show()
    stats_dict : Optional[dict], default {}
        Dictionary of additional statistics to write to the statistics table accompanying the data plot.
        (Only relevant if `stats`=True). An entry to the stats_dict has to be of the form:
    
        * {"stat_name": lambda x, y, z: func(x, y, z)}
    
        The lambda args ``x``,``y``,``z`` will be fed by:
    
        * ``x``: the data (``data[field]``).
        * ``y``: the flags (``flags[field]``).
        * ``z``: The passed flags level (``kwargs[flag]``)
    
        See the examples section below.
    
    Examples
    --------
    Summary statistic function examples:
    
    >>> func = lambda x, y, z: len(x)
    
    Total number of nan-values:
    
    >>> func = lambda x, y, z: x.isna().sum()
    
    Percentage of values flagged at or above the passed flag level (always round float results to avoid table cell overflow):
    
    >>> func = lambda x, y, z: round((y >= z).sum() / len(y), 2)
    """
    pass
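The ``(x, y, z)`` calling convention for `stats_dict` entries can be exercised by hand. A sketch with made-up toy series standing in for ``data[field]``, ``flags[field]`` and the flag level (the flag values here are illustrative, not saqc's actual constants):

```python
import numpy as np
import pandas as pd

data = pd.Series([1.0, np.nan, 3.0, np.nan])       # stands in for data[field]
flags = pd.Series([255.0, 255.0, -np.inf, 255.0])  # stands in for flags[field]
flag_level = 255.0                                 # stands in for kwargs["flag"]

# Each entry receives (data, flags, flag level), as described above.
stats_dict = {
    "sample size": lambda x, y, z: len(x),
    "nan fraction": lambda x, y, z: round(x.isna().sum() / len(x), 2),
    "flagged fraction": lambda x, y, z: round((y >= z).sum() / len(y), 2),
}

for name, func in stats_dict.items():
    print(name, func(data, flags, flag_level))
# sample size 4, nan fraction 0.5, flagged fraction 0.75
```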


def transform(field, func, partition_freq):
    """
    Function to transform data columns with a transformation that maps series onto series of the same length.
    
    Note that flags get preserved.
    
    Parameters
    ----------
    field : str
        The fieldname of the column, holding the data-to-be-transformed.
    func : Callable[{pd.Series, np.array}, np.array]
        Function to transform data[field] with.
    partition_freq : {None, float, str}, default None
        Determines the segmentation of the data into partitions on which the transformation is applied individually.
    
        * ``np.inf``: Apply the transformation to the whole data set at once
        * ``x`` > 0 : Apply the transformation to successive data chunks of ``x`` periods length
        * Offset string : Apply the transformation to successive partitions of temporal extension matching the passed
          offset string
    """
    pass
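The offset-string partitioning can be sketched with plain pandas (this is an illustration of the semantics, not saqc's implementation), assuming ``partition_freq="1D"`` and a per-chunk z-score as the length-preserving `func`:

```python
import numpy as np
import pandas as pd

# Two days of hourly data.
idx = pd.date_range("2021-01-01", periods=48, freq="h")
series = pd.Series(np.arange(48.0), index=idx)

# A transformation mapping a series onto a series of the same length.
func = lambda s: (s - s.mean()) / s.std()

# Split into daily partitions and apply func to each partition independently.
transformed = series.groupby(pd.Grouper(freq="1D")).transform(func)

print(len(transformed) == len(series))  # length is preserved -> True
```

Each daily chunk is centered independently, so the per-day means of the result are zero, which is the point of partition-wise (rather than global) transformation.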