The keyword just gets passed on to the underlying sklearn method.
See reference [1] for more information on the algorithm.
metric : str, default 'minkowski'
The metric the distances to any datapoint's neighbors are computed with. The defaults of `metric`
and `p` together result in the Euclidean metric being applied.
The keyword just gets passed on to the underlying sklearn method.
See reference [1] for more information on the algorithm.
p : int, default 2
The degree of the metric specified by the parameter `metric`.
The keyword just gets passed on to the underlying sklearn method.
See reference [1] for more information on the algorithm.
radius : {None, float}, default None
If `radius` is not None, only the distances to neighbors that lie within the range specified by `radius`
are included in the scoring aggregation.
The scoring method passed must be capable of handling np.nan values, since for every neighbor missing
within the `radius` range, one np.nan value is appended to the list of distances to the `n_neighbors`
nearest neighbors that gets passed to the scoring method.
The keyword just gets passed on to the underlying sklearn method.
See reference [1] for more information on the algorithm.
References
----------
[1] https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html
"""
pass
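The NaN-padding behaviour described for `radius` can be sketched without saqc or sklearn: compute the distances to the `n_neighbors` nearest points, replace those beyond `radius` with np.nan, and aggregate with a nan-aware scoring method. The helper below is a hypothetical 1-d illustration of that semantics, not the saqc implementation.

```python
import numpy as np

def radius_limited_distances(values, n_neighbors, radius=None):
    """For each 1-d sample, distances to its n nearest neighbors;
    neighbors farther away than `radius` are replaced by np.nan."""
    values = np.asarray(values, dtype=float)
    # brute-force pairwise distances (in 1-d, Euclidean == |a - b|)
    dist = np.abs(values[:, None] - values[None, :])
    np.fill_diagonal(dist, np.inf)   # a point is not its own neighbor
    dist.sort(axis=1)                # ascending per row; inf sorts last
    nearest = dist[:, :n_neighbors]
    if radius is not None:
        nearest = np.where(nearest <= radius, nearest, np.nan)
    return nearest

# nan-aware aggregation: the isolated point (10.0) scores nan,
# because both of its padded neighbor distances exceed the radius
scores = np.nanmean(radius_limited_distances([0.0, 1.0, 1.5, 10.0], 2, radius=2.0), axis=1)
```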
def copy(field):
"""
The function generates a copy of the data column "field" and inserts it into the existing data under the
name ``field + suffix``.
Parameters
----------
field : str
The fieldname of the data column you want to fork (copy).
"""
pass
def drop(field):
"""
The function drops field from the data dios and the flags.
Parameters
----------
field : str
The fieldname of the data column you want to drop.
"""
pass
def rename(field, new_name):
"""
The function renames field to new_name (in both the flags and the data).
Parameters
----------
field : str
The fieldname of the data column you want to rename.
new_name : str
String that field is to be replaced with.
"""
pass
def mask(field, mode, mask_var, period_start, period_end, include_bounds):
"""
This function realizes masking within saqc.
Due to some inner saqc mechanics, it is not straightforwardly possible to exclude
values or data chunks from flagging routines. This function replaces flags with the UNFLAGGED
value wherever values are to be masked. Furthermore, the masked values get replaced by
np.nan, so that they don't affect calculations.
Here is a recipe on how to apply a flagging function only to a masked chunk of the variable field:
1. duplicate "field" in the input data (copy)
2. mask the duplicated data (mask)
3. apply the tests you only want to be applied to the masked data chunks (saqc_tests)
4. project the flags calculated on the duplicated and masked data onto the original field data
(projectFlags or flagGeneric)
5. drop the duplicated data (drop)
To see an implemented example, check out flagSeasonalRange in the saqc.functions module
Parameters
----------
field : str
The fieldname of the column, holding the data-to-be-masked.
mode : {"periodic", "mask_var"}
The masking mode.
- "periodic": parameters "period_start", "period_end" are evaluated to generate a periodical mask
- "mask_var": data[mask_var] is expected to be a boolean valued timeseries and is used as mask.
mask_var : {None, str}, default None
Only effective if mode == "mask_var"
Fieldname of the column holding the data that is to be used as mask. (Must be a boolean series.)
Neither the series' length nor its labels have to match data[field]'s index and length. An inner join of the
indices will be calculated, and values get masked where the values of the inner join are True.
period_start : {None, str}, default None
Only effective if mode == "periodic".
String denoting the starting point of every period. Formally, it has to be a truncated instance of "mm-ddTHH:MM:SS".
Has to be of the same length as the `period_end` parameter.
See the examples section below for some examples.
period_end : {None, str}, default None
Only effective if mode == "periodic".
String denoting the end point of every period. Formally, it has to be a truncated instance of "mm-ddTHH:MM:SS".
Has to be of the same length as the `period_start` parameter.
See the examples section below for some examples.
include_bounds : boolean
Whether or not to include the period-defining bounds in the mask.
Examples
--------
The `period_start` and `period_end` parameters provide a convenient way to generate seasonal / date-periodic masks.
They have to be strings of the form "mm-ddTHH:MM:SS", "ddTHH:MM:SS", "HH:MM:SS", "MM:SS" or "SS"
(mm=month, dd=day, HH=hour, MM=minute, SS=second).
Single digit specifications have to be given with leading zeros.
`period_start` and `period_end` strings have to be of the same length (refer to the same periodicity).
The highest date unit gives the period.
For example:
>>> period_start = "01T15:00:00"
>>> period_end = "13T17:30:00"
Will result in all values sampled between 15:00 at the first and 17:30 at the 13th of every month getting masked.
>>> period_start = "01:00"
>>> period_end = "04:00"
All the values between the first and 4th minute of every hour get masked.
>>> period_start = "01-01T00:00:00"
>>> period_end = "01-03T00:00:00"
Mask January and February of every year. Masking is always inclusive, so in this case the mask will
include 00:00:00 at the first of March. To exclude this one, pass:
>>> period_start = "01-01T00:00:00"
>>> period_end = "02-28T23:59:59"
To mask intervals that overlap a period's frame, like nights or winter, exchange the sequence of period start and
period end. For example, to mask night hours between 22:00:00 in the evening and 06:00:00 in the morning, pass:
>>> period_start = "22:00:00"
>>> period_end = "06:00:00"
When inclusive_selection="season", all the above examples work the same way, only that you now
determine which values NOT to mask (i.e. which values are to constitute the "seasons").
"""
pass
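The periodic-mask semantics sketched in the examples above can be reproduced with plain pandas: for "HH:MM:SS"-style bounds, ``Series.between_time`` selects the interval inclusively and handles the wrap-around case where the start lies after the end. A simplified illustration of the documented behaviour (hypothetical example data), not saqc's implementation:

```python
import pandas as pd

# hourly data over two days (made-up example series)
idx = pd.date_range("2021-01-01", "2021-01-02 23:00", freq="h")
data = pd.Series(range(len(idx)), index=idx)

# "HH:MM:SS"-style periodic bounds; start > end wraps around midnight,
# matching the night-hours example above (22:00:00 .. 06:00:00, inclusive)
mask = pd.Series(False, index=idx)
mask.loc[data.between_time("22:00:00", "06:00:00").index] = True

# as in `mask`: masked values are replaced by NaN, so they
# no longer affect downstream calculations
masked = data.where(~mask)
```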
def plot(field, save_path, max_gap, stats, plot_kwargs, fig_kwargs, save_kwargs):
"""
Stores or shows a figure object containing the data graph with flag marks for field.
Parameters
----------
field : str
Name of the variable-to-plot
save_path : str, default ''
Path to the location where the figure shall be stored to. If '' is passed, interactive mode is accessed instead
of figure storage.
max_gap : {None, str}, default None:
If None, all the points in the data will be connected, resulting in long linear lines where continuous chunks
of data are missing. (NaNs in the data get dropped before plotting.)
If an Offset string is passed, only points that have a distance below `max_gap` get connected via the plotting
line.
stats : bool, default False
Whether to include statistics table in plot.
plot_kwargs : dict, default {}
Keyword arguments controlling plot generation. Will be passed on to the ``Matplotlib.axes.Axes.set()`` property
batch setter for the axes showing the data plot. The most relevant of those properties might be "ylabel",
"title" and "ylim".
In addition, the following options are available:
* {'slice': s} property, that determines a chunk of the data to be plotted / processed. `s` can be anything,
that is a valid argument to the ``pandas.Series.__getitem__`` method.
* {'history': str}
* str="all": All the flags are plotted with colored dots, referring to the tests they originate from
* str="valid": same as 'all', but only plots those flags that are not removed by later tests
fig_kwargs : dict, default {"figsize": (16, 9)}
Keyword arguments controlling figure generation.
save_kwargs : dict, default {}
Keywords to be passed on to the ``matplotlib.pyplot.savefig`` method, handling the figure storing.
NOTE: To store a pickle that can be used to regain an interactive figure window, use the option
{'pickle': True}. This will result in all the other save_kwargs being ignored.
To enter interactive mode for a pickled figure, simply do: pickle.load(open(save_path, 'rb')).show()
stats_dict : Optional[dict] = {}
Dictionary of additional statistics to write to the statistics table accompanying the data plot.
(Only relevant if `stats`=True.) An entry to the stats_dict has to be of the form:
* {"stat_name": lambda x, y, z: func(x, y, z)}
The lambda args ``x``,``y``,``z`` will be fed by:
* ``x``: the data (``data[field]``).
* ``y``: the flags (``flags[field]``).
* ``z``: The passed flags level (``kwargs[flag]``)
See the examples section below.
Examples
--------
Summary statistic function examples:
>>> func = lambda x, y, z: len(x)
Total number of nan-values:
>>> func = lambda x, y, z: x.isna().sum()
Percentage of values flagged greater than or equal to the passed flag level (always round float results to avoid table cell overflow):
>>> func = lambda x, y, z: round((y >= z).sum() / len(y), 2)
"""
pass
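The ``stats_dict`` signature can be exercised outside of plotting by feeding each lambda the data, the flags and a flag level, as documented above. The series and the flag level ``255.0`` below are made-up illustrations, not values prescribed by saqc:

```python
import numpy as np
import pandas as pd

# hypothetical data/flags pair; 255.0 plays the role of a "BAD" flag level
x = pd.Series([1.0, np.nan, 3.0, np.nan])   # data[field]
y = pd.Series([255.0, 0.0, 255.0, 0.0])     # flags[field]
z = 255.0                                   # kwargs["flag"]

stats_dict = {
    "values total": lambda x, y, z: len(x),
    "nan total": lambda x, y, z: x.isna().sum(),
    # share of positions flagged at or above the passed level,
    # rounded so the table cell does not overflow
    "flagged share": lambda x, y, z: round((y >= z).sum() / len(y), 2),
}

# evaluate every entry, as the statistics table would
table = {name: func(x, y, z) for name, func in stats_dict.items()}
```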
def transform(field, func, partition_freq):
"""
Function to transform data columns with a transformation that maps series onto series of the same length.
Note that flags get preserved.
Parameters
----------
field : str
The fieldname of the column, holding the data-to-be-transformed.
func : Callable[{pd.Series, np.array}, np.array]
Function to transform data[field] with.
partition_freq : {None, float, str}, default None
Determines the segmentation of the data into partitions on which the transformation is applied individually.
* ``None``: Apply the transformation to the whole data set at once
* ``x`` > 0: Apply the transformation to successive data chunks of ``x`` periods in length
* Offset string: Apply the transformation to successive partitions of temporal extension matching the passed offset
string
"""
pass
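The partition-wise application described for ``partition_freq`` can be sketched with pandas: an offset string maps naturally to ``pd.Grouper``, and each temporal partition is transformed independently. A hypothetical sketch of the offset-string case only, not the saqc machinery:

```python
import pandas as pd

def transform_partitioned(series, func, partition_freq=None):
    """Apply `func` to `series` as a whole, or per temporal partition
    when `partition_freq` is an offset string (e.g. "1D")."""
    if partition_freq is None:
        return func(series)
    # each calendar partition is transformed on its own
    return series.groupby(pd.Grouper(freq=partition_freq)).transform(func)

idx = pd.date_range("2021-01-01", periods=6, freq="12h")
s = pd.Series([1.0, 3.0, 2.0, 4.0, 5.0, 7.0], index=idx)

# de-mean each day separately; flags would be left untouched
demeaned = transform_partitioned(s, lambda x: x - x.mean(), partition_freq="1D")
```

Note that the result has the same index and length as the input, matching the "series onto series of the same length" contract stated above.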