    field : str
        Name of the column, holding the data-to-be-interpolated.
    
    method : {"linear", "time", "nearest", "zero", "slinear", "quadratic", "cubic", "spline", "barycentric",
        "polynomial", "krogh", "piecewise_polynomial", "pchip", "akima"}
        The interpolation method to use.
    
    inter_order : int, default 2
        If your selected interpolation method can be performed at different 'orders', pass the desired
        order here.
    
    inter_limit : int, default 2
        Maximum number of consecutive 'nan' values allowed for a gap to be interpolated. This restricts the
        interpolation to chunks containing no more than `inter_limit` successive nan entries.
    
    flag : float or None, default UNFLAGGED
        Flag that is set for interpolated values. If ``None``, no flags are set at all.
    
    downgrade_interpolation : bool, default False
        If `True` and the interpolation cannot be performed at the current order, retry with a lower order.
        This can happen because the chosen ``method`` does not support the passed ``inter_order``, or
        simply because not enough values are present in an interval.
    """
    pass


def interpolateIndex(field, freq, method, inter_order, inter_limit, downgrade_interpolation):
    """
    Function to interpolate the data at regular (equidistant) timestamps (or Grid points).
    
    Note that the interpolation will only be calculated for grid timestamps that have a preceding AND a succeeding
    valid data value within `freq` range.
    
    Parameters
    ----------
    field : str
        Name of the column, holding the data-to-be-interpolated.
    
    freq : str
        An Offset String, interpreted as the frequency of
        the grid you want to interpolate your data at.
    
    method : {"linear", "time", "nearest", "zero", "slinear", "quadratic", "cubic", "spline", "barycentric",
        "polynomial", "krogh", "piecewise_polynomial", "pchip", "akima"}
        The interpolation method you want to apply.
    
    inter_order : int, default 2
        If your selected interpolation method can be performed at different 'orders', pass the desired
        order here.
    
    inter_limit : int, default 2
        Maximum number of consecutive 'nan' values allowed for a gap to be interpolated. This restricts the
        interpolation to chunks containing no more than `inter_limit` successive nan entries.
    
    downgrade_interpolation : bool, default False
        If `True` and the interpolation cannot be performed at the current order, retry with a lower order.
        This can happen because the chosen ``method`` does not support the passed ``inter_order``, or
        simply because not enough values are present in an interval.
    
    """
    pass


def flagByStatLowPass(field):
    """
    Flag *chunks* of length `winsz`:
    
    1. If they exceed `thresh` with regard to `stat`, and
    2. If all (possibly overlapping) *sub-chunks* of the *chunk*, with length `sub_winsz`,
       exceed `sub_thresh` with regard to `stat`.
    
    Parameters
    ----------
    field : str
        The fieldname of the column, holding the data-to-be-flagged.
    """
    pass


def flagByStray(field, partition_freq, partition_min, iter_start, alpha, flag):
    """
    Flag outliers in 1-dimensional (score) data with the STRAY Algorithm.
    
    Find more information on the algorithm in References [1].
    
    Parameters
    ----------
    field : str
        The fieldname of the column, holding the data-to-be-flagged.
    partition_freq : str, int, or None, default None
        Determines the segmentation of the data into partitions, onto which the kNN algorithm is
        applied individually.
    
        * ``np.inf``: Apply Scoring on whole data set at once
        * ``x`` > 0 : Apply scoring on successive data chunks of periods length ``x``
        * Offset String : Apply scoring on successive partitions of temporal extension matching the passed offset
          string
    
    partition_min : int, default 11
        Minimum number of periods per partition that have to be present for a valid outlier detection to be made in
        this partition. (Only in effect, if `partition_freq` is an integer.) The partition_min value must always be
        greater than the nn_neighbors value.
    
    iter_start : float, default 0.5
        Float in [0,1] that determines which percentage of data is considered "normal". 0.5 results in the stray
        algorithm searching only the upper 50 % of the scores for the cut-off point. (See the reference section for
        more information.)
    
    alpha : float, default 0.05
        Level of significance at which it is tested whether a score might be drawn from a different distribution
        than the majority of the data.
    
    flag : float, default BAD
        flag to set.
    
    References
    ----------
    [1] Talagala, P. D., Hyndman, R. J., & Smith-Miles, K. (2019). Anomaly detection in high dimensional data.
        arXiv preprint arXiv:1908.04000.
    """
    pass
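

# A much simplified sketch of the gap-based "threshing" idea behind flagByStray:
# scores are sorted, only the upper (1 - iter_start) fraction is searched, and
# everything above the largest gap found there gets flagged. This is NOT the full
# EVT-based significance test from Talagala et al. [1]; the helper and its name
# are illustrative only.
import numpy as np


def _stray_gap_cutoff_sketch(scores: np.ndarray, iter_start: float = 0.5) -> np.ndarray:
    order = np.argsort(scores)
    sorted_scores = scores[order]
    start = int(len(scores) * iter_start)
    gaps = np.diff(sorted_scores[start:])
    mask = np.zeros(len(scores), dtype=bool)
    if len(gaps) == 0:
        return mask
    # cut at the largest gap within the searched upper portion of the scores
    cut = start + int(np.argmax(gaps)) + 1
    mask[order[cut:]] = True
    return mask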


def flagMVScores(field, fields, trafo, alpha, n_neighbors, scoring_func, iter_start, stray_partition, stray_partition_min, trafo_on_partition, reduction_range, reduction_drop_flagged, reduction_thresh, reduction_min_periods, flag):
    """
    The algorithm implements a 3-step outlier detection procedure for the simultaneous flagging of higher dimensional
    data (dimensions > 3).
    
    In references [1], the procedure is introduced and exemplified with an application on hydrological data.
    
    See the Notes section for an overview of the algorithm's basic steps.
    
    Parameters
    ----------
    field : str
        The fieldname of the column, holding the data-to-be-flagged. (Here a dummy, for structural reasons)
    fields : List[str]
        List of fieldnames, corresponding to the variables that are to be included into the flagging process.
    trafo : callable, default lambda x:x
        Transformation to be applied onto every column before scoring. Will likely get deprecated soon. It is better
        to transform the data in a processing step preceding the call to ``flagMVScores``.
    alpha : float, default 0.05
        Level of significance at which it is tested whether an observation's score might be drawn from a different
        distribution than the majority of the observations.
    n_neighbors : int, default 10
        Number of neighbors included in the scoring process for every datapoint.
    scoring_func : Callable[numpy.array, float], default np.sum
        The function that maps the set of every point's k-nearest-neighbor distances onto a score.
    iter_start : float, default 0.5
        Float in [0,1] that determines which percentage of data is considered "normal". 0.5 results in the threshing
        algorithm searching only the upper 50 % of the scores for the cut-off point. (See the reference section for
        more information.)
    stray_partition : {None, str, int}, default None
        Only effective when `threshing` = 'stray'.
        Determines the size of the data partitions the data is decomposed into. Each partition is checked separately
        for outliers. If a string is passed, it has to be an offset string, and it results in partitioning the data
        into parts of according temporal length. If an integer is passed, the data is simply split up into continuous
        chunks of `partition_freq` periods. If ``None`` is passed (default), all the data will be tested in one run.
    stray_partition_min : int, default 11
        Only effective when `threshing` = 'stray'.
        Minimum number of periods per partition that have to be present for a valid outlier detection to be made in
        this partition. (Only in effect, if `stray_partition` is an integer.)
    trafo_on_partition : bool, default True
        Whether or not to apply the passed transformation on every partition the algorithm is applied on, separately.
    reduction_range : {None, str}, default None
        If not None, an attempt is made to reduce the stray result to single outlier components of the input fields.
        An offset string, denoting the range of the temporal surrounding to include into the MAD testing while trying
        to reduce flags.
    reduction_drop_flagged : bool, default False
        Only effective when `reduction_range` is not ``None``.
        Whether or not to drop flagged values other than the value under test from the temporal surrounding
        before checking the value with MAD.
    reduction_thresh : float, default 3.5
        Only effective when `reduction_range` is not ``None``.
        The `critical` value, controlling whether the MAD score is considered as referring to an outlier or not.
        Higher values result in less rigid flagging. The default value is widely considered appropriate in the
        literature.
    reduction_min_periods : int, default 1
        Only effective when `reduction_range` is not ``None``.
        Minimum number of measurements that must be present in a reduction interval for the reduction actually to be
        performed.
    flag : float, default BAD
        flag to set.
    
    Notes
    -----
    The basic steps are:
    
    1. transforming
    
    The different data columns are transformed via timeseries transformations to
    (a) make them comparable and
    (b) make outliers stand out more.
    
    This step is usually subject to a phase of research/trial and error. See [1] for more details.
    
    Note that the data transformation, as a built-in step of the algorithm, will likely get deprecated soon. It is
    better to transform the data in a processing step preceding the multivariate flagging process. Also, by doing so,
    one gets much more control and variety in the transformations applied, since the `trafo` parameter only allows for
    application of the same transformation to all of the variables involved.
    
    2. scoring
    
    Every observation gets assigned a score depending on its k nearest neighbors. See the `scoring_method` parameter
    description for details on the different scoring methods. Furthermore, [1] and [2] may give some insight into the
    pros and cons of the different methods.
    
    3. threshing
    
    The gaps between the (greatest) scores are tested for being drawn from the same
    distribution as the majority of the scores. If a gap is encountered that, with sufficient significance, can be
    said not to be drawn from the same distribution as the one all the smaller gaps are drawn from, then
    the observation belonging to this gap, and all the observations belonging to gaps larger than this gap, get flagged
    as outliers. See the description of the `threshing` parameter for more details; [2] gives a fully detailed
    overview of the `stray` algorithm.
    """
    pass
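

# A condensed sketch of steps 1-3 above on a DataFrame with one column per
# variable: build the joint feature space, score each sample by the sum of its
# k-nearest-neighbor distances (sklearn), and cut at the largest gap in the
# upper half of the sorted scores. A simplified stand-in for the full procedure
# (no partitioning, no EVT-based threshing, no flag reduction); names are
# illustrative only.
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors


def _mv_scores_sketch(df: pd.DataFrame, n_neighbors: int = 10, iter_start: float = 0.5) -> pd.Series:
    joined = df.dropna()                                       # step 1: inner join / drop invalid samples
    nn = NearestNeighbors(n_neighbors=n_neighbors).fit(joined.to_numpy())
    dist, _ = nn.kneighbors(joined.to_numpy())
    scores = pd.Series(dist.sum(axis=1), index=joined.index)   # step 2: aggregate kNN distances
    order = np.argsort(scores.to_numpy())
    start = int(len(scores) * iter_start)
    gaps = np.diff(scores.to_numpy()[order][start:])           # step 3: largest-gap cutoff (simplified)
    mask = pd.Series(False, index=joined.index)
    if len(gaps):
        cut = start + int(np.argmax(gaps)) + 1
        mask.iloc[order[cut:]] = True
    return mask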


def flagRaise(field, thresh, raise_window, intended_freq, average_window, mean_raise_factor, min_slope, min_slope_weight, numba_boost, flag):
    """
    The function flags raises and drops in value courses that exceed a certain threshold
    within a certain timespan.
    
    The wide parameter variety of the function is owed to the intriguing
    case of values that "return" from outlierish or anomalous value levels and
    thus exceed the threshold while actually being usual values.
    
    NOTE, the dataset is NOT supposed to be harmonized to a time series with an
    equidistant frequency grid.
    
    Parameters
    ----------
    field : str
        The fieldname of the column, holding the data-to-be-flagged.
    thresh : float
        The threshold, for the total rise (thresh > 0), or total drop (thresh < 0), value courses must
        not exceed within a timespan of length `raise_window`.
    raise_window : str
        An offset string, determining the timespan, the rise/drop thresholding refers to. Window is inclusively defined.
    intended_freq : str
        An offset string, determining the frequency the timeseries to-be-flagged is supposed to be sampled at.
        The window is inclusively defined.
    average_window : {None, str}, default None
        See condition (2) of the description linked in the references. Window is inclusively defined.
        The window defaults to 1.5 times the size of `raise_window`
    mean_raise_factor : float, default 2
        See second condition listed in the notes below.
    min_slope : {None, float}, default None
        See third condition listed in the notes below.
    min_slope_weight : float, default 0.8
        See third condition listed in the notes below.
    numba_boost : bool, default True
        deprecated ?
    flag : float, default BAD
        flag to set.
    
    Notes
    -----
    The value :math:`x_{k}` of a time series :math:`x` with associated
    timestamps :math:`t_i`, is flagged a raise, if:
    
    * There is any value :math:`x_{s}`, preceding :math:`x_{k}` within `raise_window` range, so that:
    
      * :math:`M = |x_k - x_s | >`  `thresh` :math:`> 0`
    
    * The weighted average :math:`\mu^{*}` of the values, preceding :math:`x_{k}` within `average_window`
      range indicates, that :math:`x_{k}` does not return from an "outlierish" value course, meaning that:
    
      * :math:`x_k > \mu^* + ( M` / `mean_raise_factor` :math:`)`
    
    * Additionally, if `min_slope` is not `None`, :math:`x_{k}` is checked for being sufficiently divergent from its
      very predecessor :math:`x_{k-1}`, meaning that it is additionally checked if:
    
      * :math:`x_k - x_{k-1} >` `min_slope`
      * :math:`t_k - t_{k-1} >` `min_slope_weight` :math:`\times` `intended_freq`
    """
    pass
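

# A minimal sketch of condition (1) above for the "raise" case (thresh > 0): a
# value is a raise candidate if some preceding value within `raise_window` lies
# more than `thresh` below it. Conditions (2) and (3) (weighted average and
# slope checks) are omitted; assumes a Series with a DatetimeIndex and is
# illustrative only.
import pandas as pd


def _raise_candidates_sketch(x: pd.Series, thresh: float, raise_window: str) -> pd.Series:
    # rolling minimum over the preceding `raise_window` (includes the current
    # value, which can only lower the computed rise, never inflate it)
    rolling_min = x.rolling(raise_window).min()
    return (x - rolling_min) > thresh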


def flagMAD(field, window, flag):
    """
    The function represents an implementation of the modified Z-score outlier detection method.
    
    See references [1] for more details on the algorithm.
    
    Note, that the test needs the input data to be sampled regularly (fixed sampling rate).
    
    Parameters
    ----------
    field : str
        The fieldname of the column, holding the data-to-be-flagged. (Here a dummy, for structural reasons)
    window : str
        Offset string. Denoting the window size that the "Z-scored" values have to lie in.
    z: float, default 3.5
        The value the Z-score is tested against. Defaulting to 3.5 (recommendation of [1]).
    flag : float, default BAD
        flag to set.
    
    References
    ----------
    [1] https://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm
    """
    pass
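

# A minimal, non-windowed sketch of the modified Z-score test from [1]: values
# whose modified Z-score (0.6745 * deviation from the median / MAD) exceeds `z`
# are flagged. flagMAD applies this per rolling window of size `window`; the
# helper below scores the whole series at once and is illustrative only.
import pandas as pd


def _modified_zscore_mask_sketch(x: pd.Series, z: float = 3.5) -> pd.Series:
    med = x.median()
    mad = (x - med).abs().median()          # median absolute deviation
    scores = 0.6745 * (x - med) / mad
    return scores.abs() > z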


def flagOffset(field, thresh, tolerance, window, rel_thresh, numba_kickin, flag):
    """
    A basic outlier test that is designed to work for harmonized as well as non-harmonized data.
    
    The test classifies values/value courses as outliers not only by detecting a rise in value, but also by
    checking for a return to the initial value level.
    
    Values :math:`x_n, x_{n+1}, .... , x_{n+k}` of a timeseries :math:`x` with associated timestamps
    :math:`t_n, t_{n+1}, .... , t_{n+k}` are considered spikes, if
    
    1. :math:`|x_{n-1} - x_{n + s}| >` `thresh`, for all :math:`s \in [0,1,2,...,k]`
    
    2. :math:`|x_{n-1} - x_{n+k+1}| <` `tolerance`
    
    3. :math:`|t_{n-1} - t_{n+k+1}| <` `window`
    
    Note, that this definition of a "spike" not only includes one-value outliers, but also plateau-ish value courses.
    
    
    Parameters
    ----------
    field : str
        The fieldname of the column, holding the data-to-be-flagged. (Here a dummy, for structural reasons)
    thresh : float
        Minimum difference between two values for the latter one to be considered a spike. See condition (1).
    tolerance : float
        Maximum difference between pre-spike and post-spike values. See condition (2).
    window : {str, int}, default '15min'
        Maximum length of "spiky" value courses. See condition (3). Integer-defined window lengths are only allowed
        for regularly sampled timeseries.
    rel_thresh : {float, None}, default None
        Relative threshold.
    numba_kickin : int, default 200000
        When more than `numba_kickin` incidents of potential spikes are detected,
        the pandas.rolling part of the computation gets "jitted" with numba.
        The default value has proven to be around the break-even point between "jit boost" and "jit costs".
    flag : float, default BAD
        flag to set.
    
    References
    ----------
    The implementation is a time-window based version of an outlier test from the UFZ Python library,
    that can be found here:
    
    https://git.ufz.de/chs/python/blob/master/ufz/level1/spike.py
    """
    pass
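

# A naive sketch of conditions (1)-(3) above: starting at x_n, the chunk is
# extended while it stays more than `thresh` away from x_{n-1}; if a value
# within `tolerance` of x_{n-1} is reached before `window` is exceeded, the
# chunk gets flagged. Assumes a DatetimeIndex; O(n*k) and illustrative only.
import numpy as np
import pandas as pd


def _offset_mask_sketch(x: pd.Series, thresh: float, tolerance: float, window: str) -> pd.Series:
    win = pd.Timedelta(window)
    vals, idx = x.to_numpy(dtype=float), x.index
    mask = np.zeros(len(x), dtype=bool)
    for n in range(1, len(x) - 1):
        anchor = vals[n - 1]
        k = n
        # condition (1): the chunk stays offset by more than `thresh`
        # condition (3): the total time span stays below `window`
        while k < len(x) - 1 and abs(anchor - vals[k]) > thresh and idx[k + 1] - idx[n - 1] < win:
            if abs(anchor - vals[k + 1]) < tolerance:   # condition (2): return to start level
                mask[n:k + 1] = True
                break
            k += 1
    return pd.Series(mask, index=x.index)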


def flagByGrubbs(field, winsz, alpha, min_periods, flag):
    """
    The function flags values that are regarded as outliers according to the Grubbs test.
    
    See reference [1] for more information on the Grubbs test's definition.
    
    The (two-sided) test gets applied onto data chunks of size "winsz". The test's application will
    be iterated on each data chunk under test, until no more outliers are detected in that chunk.
    
    Note that the test performs poorly for small data chunks (resulting in heavy overflagging).
    Therefore you should select "winsz" so that every window contains at least 8 values and also
    adjust the min_periods value accordingly.
    
    Note that the data to be tested by the Grubbs test is expected to be distributed approximately normally.
    
    Parameters
    ----------
    field : str
        The fieldname of the column, holding the data-to-be-flagged.
    winsz : {int, str}
        The size of the window you want to use for outlier testing. If an integer is passed, the size
        refers to the number of periods of every testing window. If a string is passed, it has to be an offset string,
        and will denote the total temporal extension of every window.
    alpha : float, default 0.05
        The level of significance, the grubbs test is to be performed at. (between 0 and 1)
    min_periods : int, default 8
        The minimum number of values that have to be present in an interval under test, for a grubbs test result to be
        accepted. Only makes sense in case `winsz` is an offset string.
    check_lagged: boolean, default False
        If True, every value gets checked twice for being an outlier: once in the initial rolling window and once more
        in a rolling window that is lagged by half the window size (winsz/2). Recommended for avoiding false
        positives at the window edges. Only available when rolling with integer-defined window size.
    flag : float, default BAD
        flag to set.
    
    References
    ----------
    introduction to the grubbs test:
    
    [1] https://en.wikipedia.org/wiki/Grubbs%27s_test_for_outliers
    """
    pass
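

# A sketch of the iterated two-sided Grubbs test described above, using the
# textbook critical value based on the t-distribution (see [1]). flagByGrubbs
# applies this per window; the helper below works on a single chunk and is
# illustrative only.
import numpy as np
from scipy import stats


def _iterative_grubbs_sketch(values: np.ndarray, alpha: float = 0.05) -> np.ndarray:
    vals = np.asarray(values, dtype=float)
    mask = np.zeros(len(vals), dtype=bool)
    active = ~np.isnan(vals)
    while active.sum() > 2:
        x = vals[active]
        n = len(x)
        dev = np.abs(x - x.mean())
        g = dev.max() / x.std(ddof=1)
        t_crit = stats.t.ppf(1 - alpha / (2 * n), n - 2)
        g_crit = ((n - 1) / np.sqrt(n)) * np.sqrt(t_crit ** 2 / (n - 2 + t_crit ** 2))
        if g <= g_crit:
            break
        # flag the most deviating value and re-test the remainder
        worst = np.flatnonzero(active)[int(np.argmax(dev))]
        mask[worst] = True
        active[worst] = False
    return mask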


def flagRange(field, min, max, flag):
    """
    Function flags values not covered by the closed interval [`min`, `max`].
    
    Parameters
    ----------
    field : str
        The fieldname of the column, holding the data-to-be-flagged.
    min : float
        Lower bound for valid data.
    max : float
        Upper bound for valid data.
    flag : float, default BAD
        flag to set.
    """
    pass


def flagCrossStatistic(field, fields, thresh, cross_stat, flag):
    """
    Function checks for outliers relative to the "horizontal" input data axis.
    
    For `fields` :math:`=[f_1,f_2,...,f_N]` and timestamps :math:`[t_1,t_2,...,t_K]`, the following steps are taken
    for outlier detection:
    
    1. All timestamps :math:`t_i`, for which there is one :math:`f_k`, with :math:`data[f_k]` having no entry at
       :math:`t_i`, are excluded from the following process (inner join of the :math:`f_i` fields).
    2. For every :math:`0 <= i <= K`, the value
       :math:`m_i = median(\{data[f_1][t_i], data[f_2][t_i], ..., data[f_N][t_i]\})` is calculated.
    3. For every :math:`0 <= i <= K`, the set
       :math:`\{data[f_1][t_i] - m_i, data[f_2][t_i] - m_i, ..., data[f_N][t_i] - m_i\}` is tested for outliers with
       the specified method (`cross_stat` parameter).
    
    Parameters
    ----------
    field : str
        A dummy parameter.
    fields : List[str]
        List of fieldnames in data, determining which variables are to be included into the flagging process.
    thresh : float
        Threshold which the outlier score of a value must exceed in order to be flagged an outlier.
    cross_stat : {'modZscore', 'Zscore'}, default 'modZscore'
        Method used for calculating the outlier scores.
    
        * ``'modZscore'``: Median based "sigma"-ish approach. See References [1].
        * ``'Zscore'``: Score values by how many times the standard deviation they differ from the median.
          See References [1].
    
    flag : float, default BAD
        flag to set.
    
    References
    ----------
    [1] https://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm
    """
    pass
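

# A sketch of steps 1-2 above: restrict to timestamps present in every field
# (inner join), compute the row-wise median m_i, and return the per-variable
# deviations that the `cross_stat` scoring is then applied to. The dataframe
# layout (one column per field) is an assumption of this illustration.
import pandas as pd


def _cross_deviation_sketch(df: pd.DataFrame) -> pd.DataFrame:
    joined = df.dropna()                 # step 1: inner join / drop incomplete timestamps
    med = joined.median(axis=1)          # step 2: row-wise median m_i
    return joined.sub(med, axis=0)       # deviations to be scored (e.g. by modified Z-score)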


def flagPatternByDTW(field, flag):
    """
    Pattern recognition via Dynamic Time Warping.
    
    The steps are:
     1. work on chunks returned by a moving window
     2. each chunk is compared to the given pattern, using the Dynamic Time Warping algorithm as presented in [1]
     3. if the compared chunk is equal to the given pattern, it gets flagged
    
    Parameters
    ----------
    field : str
        The fieldname of the data column you want to flag.
    flag : float, default BAD
        flag to set.
    
    References
    ----------
    A thorough description of the underlying Dynamic Time Warping algorithm can be found here:
    
    [1] https://cran.r-project.org/web/packages/dtw/dtw.pdf
    """
    pass


def flagPatternByWavelet(field, flag):
    """
    Pattern recognition via wavelets.
    
    The steps are:
     1. work on chunks returned by a moving window
     2. each chunk is compared to the given pattern, using the wavelet algorithm as presented in [1]
     3. if the compared chunk is equal to the given pattern, it gets flagged
    
    Parameters
    ----------
    field : str
        The fieldname of the data column you want to flag.
    flag : float, default BAD
        flag to set.
    
    References
    ----------
    The underlying pattern recognition algorithm using wavelets is documented here:
    
    [1] Maharaj, E. A. (2002): Pattern Recognition of Time Series using Wavelets. In: Härdle W., Rönz B. (eds) Compstat. Physica, Heidelberg, 978-3-7908-1517-7.
    
    The documentation of the Python package used for the wavelet decomposition can be found here:
    
    [2] https://pywavelets.readthedocs.io/en/latest/ref/cwt.html#continuous-wavelet-families
    """
    pass


def aggregate(field, freq, value_func, flag_func, method, flag):
    """
    A method to "regularize" data by aggregating (resampling) data at a regular timestamp.
    
    A series of data is considered "regular", if it is sampled regularly (= having uniform sampling rate).
    
    The data will therefore get aggregated with a function, specified by the `value_func` parameter, and
    the result gets projected onto the new timestamps with a method, specified by "method".
    
    The following methods (keywords) are available:
    
    * ``'nagg'``: (aggregation to nearest) - all values in the range (+/- freq/2) of a grid point get aggregated with
      `agg_func` and assigned to it. Flags get aggregated by `flag_func` and assigned the same way.
    * ``'bagg'``: (backwards aggregation) - all values in a sampling interval get aggregated with agg_func and the
      result gets assigned to the last regular timestamp. Flags get aggregated by `flag_func` and assigned the same way.
    * ``'fagg'``: (forward aggregation) - all values in a sampling interval get aggregated with agg_func and the result
      gets assigned to the next regular timestamp. Flags get aggregated by `flag_func` and assigned the same way.
    
    Note that, if there is no valid data (existing and not-na) available in a sampling interval assigned to a regular
    timestamp by the selected method, nan gets assigned to this timestamp. The associated flag will be of value
    ``UNFLAGGED``.
    
    Note: the method will likely and significantly alter values and shape of ``data[field]``. The original data is kept
    in the data dios and assigned to the fieldname ``field + '_original'``.
    
    Parameters
    ----------
    field : str
        The fieldname of the column, holding the data-to-be-regularized.
    
    freq : str
        The sampling frequency the data is to be aggregated (resampled) at.
    
    value_func : Callable
        The function you want to use for aggregation.
    
    flag_func : Callable
        The function you want to aggregate the flags with. It should be capable of operating on the flags dtype
        (usually ordered categorical).
    
    method : {'fagg', 'bagg', 'nagg'}, default 'nagg'
        Specifies which intervals are to be aggregated for a certain timestamp (preceding, succeeding or
        "surrounding" interval). See description above for more details.
    
    flag : float, default BAD
        flag to set.
    
    """
    pass


def linear(field, freq):
    """
    A method to "regularize" data by linearly interpolating the data at regular timestamps.
    
    A series of data is considered "regular", if it is sampled regularly (= having uniform sampling rate).
    
    Interpolated values will get assigned the worst flag within freq-range.
    
    Note: the method will likely and significantly alter values and shape of ``data[field]``. The original data is kept
    in the data dios and assigned to the fieldname ``field + '_original'``.
    
    Note that the data only gets interpolated at those (regular) timestamps that have a valid (existing and
    not-na) datapoint preceding them and one succeeding them within freq range.
    Regular timestamps that do not meet this condition get nan assigned, AND the associated flag will be of value
    ``UNFLAGGED``.
    
    Parameters
    ----------
    field : str
        The fieldname of the column, holding the data-to-be-regularized.
    
    freq : str
        An offset string. The frequency of the grid you want to interpolate your data at.
    """
    pass
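

# A rough sketch of the linear regularization described above: union the
# original timestamps with the target grid, interpolate time-aware, and keep
# only the grid points. The "valid neighbor on both sides within freq"
# restriction is not enforced here (only interior interpolation is); the helper
# is illustrative only.
import pandas as pd


def _linear_regularize_sketch(s: pd.Series, freq: str) -> pd.Series:
    grid = pd.date_range(s.index.min().floor(freq), s.index.max().ceil(freq), freq=freq)
    combined = s.reindex(s.index.union(grid))
    interpolated = combined.interpolate(method="time", limit_area="inside")
    return interpolated.reindex(grid)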


def interpolate(field, freq, method, order):
    """
    A method to "regularize" data by interpolating the data at regular timestamps.
    
    A series of data is considered "regular", if it is sampled regularly (= having uniform sampling rate).
    
    Interpolated values will get assigned the worst flag within freq-range.
    
    All the interpolation methods from the pandas.Series.interpolate method are available and they are called by
    the very same keywords.
    
    Note, that, to perform a timestamp aware, linear interpolation, you have to pass ``'time'`` as `method`,
    and NOT ``'linear'``.
    
    Note: the `method` will likely and significantly alter values and shape of ``data[field]``. The original data is
    kept in the data dios and assigned to the fieldname ``field + '_original'``.
    
    Note that the data only gets interpolated at those (regular) timestamps that have a valid (existing and
    not-na) datapoint preceding them and one succeeding them within freq range.
    Regular timestamps that do not meet this condition get nan assigned, AND the associated flag will be of value
    ``UNFLAGGED``.
    
    Parameters
    ----------
    field : str
        The fieldname of the column, holding the data-to-be-regularized.
    
    freq : str
        An offset string. The frequency of the grid you want to interpolate your data at.
    
    method : {"linear", "time", "nearest", "zero", "slinear", "quadratic", "cubic", "spline", "barycentric",
        "polynomial", "krogh", "piecewise_polynomial", "pchip", "akima"}
        The interpolation method you want to apply.
    
    order : int, default 1
        If your selected interpolation method can be performed at different *orders*, pass the desired
        order here.
    """
    pass
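

# A small illustration of the note above that timestamp-aware linear
# interpolation needs method='time', not method='linear': on an irregular
# index, 'linear' treats the points as equally spaced, while 'time' weights by
# the actual time distances. The toy data is made up for illustration.
import numpy as np
import pandas as pd


def _time_vs_linear_interpolation_example() -> pd.DataFrame:
    idx = pd.to_datetime(["2021-01-01 00:00", "2021-01-01 00:10", "2021-01-01 01:00"])
    s = pd.Series([0.0, np.nan, 6.0], index=idx)
    return pd.DataFrame({
        "linear": s.interpolate(method="linear"),  # 3.0 at 00:10 (ignores the uneven spacing)
        "time": s.interpolate(method="time"),      # 1.0 at 00:10 (weights by the time distance)
    })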


def mapToOriginal(field, method):
    """
    The function "undoes" regularization by regaining the original data and projecting the
    flags calculated for the regularized data onto the original ones.
    
    Afterwards the regularized data is removed from the data dios and ``'field'`` will be associated
    with the original data "again".
    
    Wherever the flags in the original data are "better" than the regularized flags projected onto them,
    they get overridden with the regularized flag's value.
    
    Which regularized flags are to be projected on which original flags is controlled by the "method" parameter.
    
    Generally, if you regularized with the method "X", you should pass the method "inverse_X" to the deharmonization.
    If you regularized with an interpolation, the method "inverse_interpolation" would be the appropriate choice.
    Also you should pass the same drop flags keyword.
    
    The deharm methods in detail:
    ("original_flags" are associated with the original data that is to be regained,
    "regularized_flags" are associated with the regularized data that is to be "deharmonized",
    "freq" refers to the regularized data's sampling frequency)
    
    * ``'inverse_nagg'``: all original_flags within the range *+/- freq/2* of a regularized_flag get assigned this
      regularized flag's value. (if regularized_flag > original_flag)
    * ``'inverse_bagg'``: all original_flags succeeding a regularized_flag within the range of "freq" get assigned this
      regularized flag's value. (if regularized_flag > original_flag)
    * ``'inverse_fagg'``: all original_flags preceding a regularized_flag within the range of "freq" get assigned this
      regularized flag's value. (if regularized_flag > original_flag)
    
    * ``'inverse_interpolation'``: all original_flags within the range *+/- freq* of a regularized_flag get assigned
      this regularized flag's value. (if regularized_flag > original_flag)
    
    * ``'inverse_nshift'``: the original_flag within the range +/- *freq/2* that is nearest to a regularized_flag
      gets the regularized flag's value. (if regularized_flag > original_flag)
    * ``'inverse_bshift'``: the original_flag succeeding a regularized_flag within the range freq that is nearest to it
      gets assigned this regularized flag's value. (if regularized_flag > original_flag)
    * ``'inverse_fshift'``: the original_flag preceding a regularized_flag within the range freq that is nearest to it
      gets assigned this regularized flag's value. (if regularized_flag > original_flag)
    
    Parameters
    ----------
    field : str
        The fieldname of the column, holding the data-to-be-deharmonized.
    
    method : {'inverse_fagg', 'inverse_bagg', 'inverse_nagg', 'inverse_fshift', 'inverse_bshift', 'inverse_nshift',
            'inverse_interpolation'}
        The method used for projection of regularized flags onto original flags. See description above for more
        details.
    """
    pass


def shift(field, freq, method, freq_check):
    """
    Function to shift data and flags to a regular (equidistant) timestamp grid, according to ``method``.
    
    Parameters
    ----------
    field : str
        The fieldname of the column, holding the data-to-be-shifted.
    
    freq : str
        A frequency offset string that will be interpreted as the sampling rate you want the data to be shifted to.
    
    method : {'fshift', 'bshift', 'nshift'}, default 'nshift'
        Specifies how misaligned data-points get propagated to a grid timestamp.
        Following choices are available:
    
        * 'nshift' : every grid point gets assigned the nearest value in its range. (range = +/- 0.5 * `freq`)
        * 'bshift' : every grid point gets assigned its first succeeding value, if one is available in
          the succeeding sampling interval.
        * 'fshift' : every grid point gets assigned its last preceding value, if one is available in
          the preceding sampling interval.
    
    freq_check : {None, 'check', 'auto'}, default None
    
        * ``None`` : do not validate frequency-string passed to `freq`
        * 'check' : estimate the frequency and log a warning if the estimate mismatches the frequency string passed to
          `freq`, or if no uniform sampling rate could be estimated
        * 'auto' : estimate frequency and use estimate. (Ignores `freq` parameter.)
    """
    pass
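

# A minimal sketch of the three shift methods above using pandas.Series.reindex
# with a tolerance: 'nshift' takes the nearest value within +/- freq/2,
# 'fshift' the last preceding value within freq, 'bshift' the first succeeding
# value within freq. Assumes a sorted, duplicate-free DatetimeIndex; the helper
# is illustrative only.
import pandas as pd


def _shift_to_grid_sketch(s: pd.Series, freq: str, method: str = "nshift") -> pd.Series:
    grid = pd.date_range(s.index.min().floor(freq), s.index.max().ceil(freq), freq=freq)
    how = {"nshift": "nearest", "fshift": "ffill", "bshift": "bfill"}[method]
    tol = pd.Timedelta(freq) / 2 if method == "nshift" else pd.Timedelta(freq)
    return s.reindex(grid, method=how, tolerance=tol)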


def resample(field, freq, agg_func, max_invalid_total_d, max_invalid_consec_d, max_invalid_total_f, max_invalid_consec_f, flag_agg_func, freq_check):
    """
    Function to resample the data. Afterwards the data will be sampled at regular (equidistant) timestamps
    (or grid points). Sampling intervals therefore get aggregated with a function, specified by the 'agg_func'
    parameter, and the result gets projected onto the new timestamps with a method, specified by "method". The
    following methods (keywords) are available:
    
    * ``'nagg'``: all values in the range (+/- `freq`/2) of a grid point get aggregated with agg_func and assigned to it.
    * ``'bagg'``: all values in a sampling interval get aggregated with agg_func and the result gets assigned to the last
      grid point.
    * ``'fagg'``: all values in a sampling interval get aggregated with agg_func and the result gets assigned to the next
      grid point.
    
    
    Note that, if possible, functions passed to agg_func will get projected internally onto pandas.resample methods,
    which results in a reasonable performance boost - however, for this to work, you should pass functions that have
    the __name__ attribute initialised and the according method's name assigned to it.
    Furthermore, you should not pass numpy's nan-functions
    (``nansum``, ``nanmean``, ...) because those, for example, have ``__name__ == 'nansum'`` and will thus not
    trigger ``resample.func()``, but the slower ``resample.apply(nanfunc)``. Also, internally, no nans get passed to
    the functions anyway, so there is no point in passing the nan functions.
    
    Parameters
    ----------
    field : str
        The fieldname of the column, holding the data-to-be-resampled.
    
    freq : str
        An Offset String, that will be interpreted as the frequency you want to resample your data with.
    
    agg_func : Callable
        The function you want to use for aggregation.
    
    method: {'fagg', 'bagg', 'nagg'}, default 'bagg'
        Specifies which intervals to be aggregated for a certain timestamp. (preceding, succeeding or
        "surrounding" interval). See description above for more details.
    
    max_invalid_total_d : {None, int}, default None
        Maximum number of invalid (nan) datapoints allowed per resampling interval. If max_invalid_total_d is
        exceeded, the interval gets resampled to nan. By default (``np.inf``), there is no bound to the number of nan
        values in an interval, and only intervals containing ONLY nan values, or those containing no values at all,
        get projected onto nan.
    
    max_invalid_consec_d : {None, int}, default None
        Maximum number of consecutive invalid (nan) data points, allowed per resampling interval.
        If max_invalid_consec_d is exceeded, the interval gets resampled to nan. By default (np.inf),
        there is no bound to the number of consecutive nan values in an interval and only intervals
        containing ONLY nan values, or those containing no values at all, get projected onto nan.
    
    max_invalid_total_f : {None, int}, default None
        Same as `max_invalid_total_d`, only applying for the flags. The flag regarded as "invalid" value,
        is the one passed to empty_intervals_flag (default=``BAD``).
        Also this is the flag assigned to invalid/empty intervals.
    
    max_invalid_consec_f : {None, int}, default None
        Same as `max_invalid_total_f`, only applying onto flags. The flag regarded as "invalid" value, is the one passed
        to empty_intervals_flag. Also this is the flag assigned to invalid/empty intervals.
    
    flag_agg_func : Callable, default: max
        The function you want to aggregate the flags with. It should be capable of operating on the flags dtype
        (usually ordered categorical).
    
    freq_check : {None, 'check', 'auto'}, default None
    
        * ``None``: do not validate frequency-string passed to `freq`
        * ``'check'``: estimate the frequency and log a warning if the estimate mismatches the frequency string passed
          to 'freq', or if no uniform sampling rate could be estimated
        * ``'auto'``: estimate frequency and use estimate. (Ignores `freq` parameter.)
    """
    pass
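

# A sketch of the dispatch idea described above: aggregation functions whose
# __name__ matches a built-in pandas resampling method can take the fast path
# (e.g. resampler.mean()), everything else falls back to the slower
# resampler.apply. The helper and its name are illustrative only.
import numpy as np
import pandas as pd


def _resample_with_fast_path_sketch(s: pd.Series, freq: str, agg_func=np.mean) -> pd.Series:
    resampler = s.resample(freq)
    name = getattr(agg_func, "__name__", "")
    if name and hasattr(resampler, name):
        # e.g. np.mean.__name__ == "mean" -> resampler.mean()
        return getattr(resampler, name)()
    # e.g. np.nanmean.__name__ == "nanmean" -> no such resampler method, slow path
    return resampler.apply(agg_func)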


def reindexFlags(field, method, source, freq):
    """
    The function projects flags of "source" onto flags of "field". Wherever the "field" flags are "better" than the
    source flags projected on them, they get overridden with the associated source flag value.
    
    Which "field"-flags are to be projected on which source flags is controlled by the "method" and "freq"
    parameters.
    
    method: (field_flags are associated with "field", source_flags are associated with "source")
    
    'inverse_nagg' - all field_flags within the range +/- freq/2 of a source_flag get assigned this source flag's value.
        (if source_flag > field_flag)
    'inverse_bagg' - all field_flags succeeding a source_flag within the range of "freq" get assigned this source
        flag's value. (if source_flag > field_flag)
    'inverse_fagg' - all field_flags preceding a source_flag within the range of "freq" get assigned this source
        flag's value. (if source_flag > field_flag)
    
    'inverse_interpolation' - all field_flags within the range +/- freq of a source_flag get assigned this source
        flag's value. (if source_flag > field_flag)
    
    'inverse_nshift' - the field_flag within the range +/- freq/2 that is nearest to a source_flag gets the source
        flag's value. (if source_flag > field_flag)
    'inverse_bshift' - the field_flag succeeding a source_flag within the range freq that is nearest to the
        source_flag gets assigned this source flag's value. (if source_flag > field_flag)
    'inverse_fshift' - the field_flag preceding a source_flag within the range freq that is nearest to the
        source_flag gets assigned this source flag's value. (if source_flag > field_flag)
    
    'match' - any field_flag with a timestamp matching a source_flag's timestamp gets this source_flag's value
    (if source_flag > field_flag)
    
    Note, to undo or backtrack a resampling/shifting/interpolation that has been performed with a certain method,
    you can just pass the associated "inverse" method. Also you should pass the same drop flags keyword.
    
    Parameters
    ----------
    field : str
        The fieldname of the data column, you want to project the source-flags onto.
    
    method : {'inverse_fagg', 'inverse_bagg', 'inverse_nagg', 'inverse_fshift', 'inverse_bshift', 'inverse_nshift'}
        The method used for projection of source flags onto field flags. See description above for more details.
    
    source : str
        The source variable of the flags projection.
    
    freq : {None, str}, default None
        The freq determines the projection range for the projection method. See above description for more details.
        By default (None), the sampling frequency of source is used.
    """
    pass


def calculatePolynomialResidues(field, winsz, polydeg, numba, eval_flags, min_periods, flag):
    """
    Function fits a polynomial model to the data and returns the residues.
    
    The residue for value x is calculated by fitting a polynomial of degree "polydeg" to a data slice
    of size "winsz", which has x at its center.
    
    Note, that the residues will be stored to the `field` field of the input data, so that the original data, the
    polynomial is fitted to, gets overridden.
    
    Note that, if data[field] is not aligned to an equidistant frequency grid, the window size passed
    has to be an offset string. Also numba boost options don't apply for irregularly sampled
    timeseries.
    
    Note that calculating the residues tends to be quite costly, because a function fitting is performed for every
    sample. To improve performance, consider the following possibilities:
    
    In case your data is sampled at an equidistant frequency grid:
    
    (1) If you know your data to have no significant number of missing values, or if you do not want to
        calculate residues for windows containing missing values any way, performance can be increased by setting
        min_periods=winsz.
    
    (2) If your data consists of more than around 200000 samples, setting numba=True will boost the
        calculations up to a factor of 5 (for sample sizes > 300000) - however, for lower sample sizes,
        numba will slow down the calculations, also up to a factor of 5, for sample sizes < 50000.
        By default (numba='auto'), numba is set to True if the data sample size exceeds 200000.
    
    In case your data is not sampled at an equidistant frequency grid:
    
    (1) Harmonization/resampling of your data will have a noticeable impact on the polyfitting's performance - since
        the numba boost doesn't apply for irregularly sampled data in the current implementation.
    
    Note, that in the current implementation, the initial and final winsz/2 values do not get fitted.
    
    Parameters
    ----------
    field : str
        The fieldname of the column, holding the data-to-be-modelled.
    winsz : {str, int}
        The size of the window you want to use for fitting. If an integer is passed, the size
        refers to the number of periods for every fitting window. If an offset string is passed,
        the size refers to the total temporal extension. The window will be centered around the value-to-be-fitted.
        For regularly sampled timeseries the period number will be cast down to an odd number if
        even.
    polydeg : int
        The degree of the polynomial used for fitting
    numba : {True, False, "auto"}, default "auto"
        Whether or not to apply numba's just-in-time compilation onto the poly fit function. This will noticeably
        increase the speed of calculation, if the sample size is sufficiently high.
        If "auto" is selected, numba-compatible fit functions get applied for data consisting of > 200000 samples.
    eval_flags : bool, default True
        Whether or not to assign new flags to the calculated residuals. If True, a residual gets assigned the worst
        flag present in the interval the data for its calculation was obtained from.
    min_periods : {int, None}, default 0
        The minimum number of periods that has to be available in every value's fitting surrounding for the polynomial
        fit to be performed. If there are not enough values, np.nan gets assigned. Default (0) results in fitting
        regardless of the number of values present (results in overfitting for too sparse intervals). To automatically
        set the minimum number of periods to the number of values in an offset-defined window size, pass np.nan.
    flag : float, default BAD
        flag to set.
    """
    pass
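

# A plain-Python sketch of the centered polynomial residues described above for
# an odd, integer `winsz` on regularly sampled data: fit a polynomial of degree
# `polydeg` to each full window and subtract the fitted value at the window
# center. No numba, no min_periods handling; illustrative only.
import numpy as np
import pandas as pd


def _poly_residues_sketch(s: pd.Series, winsz: int, polydeg: int) -> pd.Series:
    half = winsz // 2
    x = np.arange(winsz)
    vals = s.to_numpy(dtype=float)
    out = pd.Series(np.nan, index=s.index)
    for i in range(half, len(vals) - half):
        chunk = vals[i - half:i + half + 1]
        if np.isnan(chunk).any():
            continue
        coeffs = np.polyfit(x, chunk, polydeg)
        # residue = observed value minus the fitted value at the window center
        out.iloc[i] = vals[i] - np.polyval(coeffs, x[half])
    return out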


def calculateRollingResidues():
    """
    TODO: docstring needed
    """
    pass


def roll(field, winsz, func, eval_flags, min_periods, center, flag):
    """
    Models the data with the rolling mean and returns the residues.
    
    Note, that the residues will be stored to the `field` field of the input data, so that the data that is modelled
    gets overridden.
    
    Parameters
    ----------
    field : str
        The fieldname of the column, holding the data-to-be-modelled.
    winsz : {int, str}
        The size of the window you want to roll with. If an integer is passed, the size
        refers to the number of periods for every fitting window. If an offset string is passed,
        the size refers to the total temporal extension.
        For regularly sampled timeseries, the period number will be cast down to an odd number if
        center = True.
    func : Callable[np.array, float], default np.mean
        Function to apply on the rolling window and obtain the curve fit value.
    eval_flags : bool, default True
        Whether or not to assign new flags to the calculated residuals. If True, a residual gets assigned the worst
        flag present in the interval the data for its calculation was obtained from.
        Currently not implemented in combination with not-harmonized timeseries.
    min_periods : int, default 0
        The minimum number of periods that has to be available in every value's fitting surrounding for the mean
        fitting to be performed. If there are not enough values, np.nan gets assigned. Default (0) results in fitting
        regardless of the number of values present.
    center : bool, default True
        Whether or not to center the window, that the mean is calculated over, around the reference value. If False,
        the reference value is placed at the right edge of the window (classic rolling mean with lag).
    flag : float, default BAD
        flag to set.
    """
    pass
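

# A minimal sketch of the rolling residues described above (integer window size
# assumed here for simplicity): model the data with a centered rolling
# aggregation and return the difference to it.
import numpy as np
import pandas as pd


def _rolling_residues_sketch(s: pd.Series, winsz: int, func=np.mean, center: bool = True,
                             min_periods: int = 1) -> pd.Series:
    fitted = s.rolling(winsz, center=center, min_periods=min_periods).apply(func, raw=True)
    return s - fitted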


def assignKNNScore(field, n_neighbors, trafo, trafo_on_partition, scoring_func, target_field, partition_freq, partition_min, kNN_algorithm, metric, p, radius):
    """
    TODO: docstring needs a rework
    Score datapoints by an aggregation of the distances to their k nearest neighbors.
    
    The function is a wrapper around the NearestNeighbors method from Python's sklearn library (see reference [1]).
    
    The steps taken to calculate the scores are as follows:
    
    1. All the timeseries, named in `fields`, are combined to one feature space by an *inner* join on their datetime
       indexes. Thus, only samples that share timestamps across all fields will be included in the feature space.
    2. Any datapoint/sample where one or more of the features is invalid (=np.nan) will get excluded.
    3. For every data point, the distance to its `n_neighbors` nearest neighbors is calculated by applying the
       metric `metric` at grade `p` onto the feature space. The defaults lead to the euclidean metric being applied.
       If `radius` is not None, it sets the upper bound of distance for a neighbor to be considered one of the
       `n_neighbors` nearest neighbors. Furthermore, the `partition_freq` argument determines which samples can be
       included into a datapoint's nearest neighbors list, by segmenting the data into chunks of specified temporal
       extension and feeding those chunks to the kNN algorithm separately.
    4. For every datapoint, the calculated nearest neighbor distances get aggregated to a score by the function
       passed to `scoring_func`. The default, ``sum``, just sums up the distances.
    5. The resulting timeseries of scores gets assigned to the field target_field.
    
    Parameters
    ----------
    field : str
        The reference variable, the deviation from which determines the flagging.
    n_neighbors : int, default 10
        The number of nearest neighbors whose distances are included in every datapoint's scoring calculation.
    trafo : Callable[np.array, np.array], default lambda x: x
        Transformation to apply on the variables before kNN scoring
    trafo_on_partition : bool, default True
        Whether to apply the transformation `trafo` onto the whole variable or onto each partition separately.
    scoring_func : Callable[numpy.array, float], default np.sum
        A function that assigns a score to every one-dimensional array containing the distances
        to every datapoint's `n_neighbors` nearest neighbors.
    target_field : str, default 'kNN_scores'
        Name of the field, where the resulting scores should be written to.
    partition_freq : {np.inf, float, str}, default np.inf
        Determines the segmentation of the data into partitions, onto which the kNN algorithm is
        applied individually.
    
        * ``np.inf``: Apply Scoring on whole data set at once
        * ``x`` > 0 : Apply scoring on successive data chunks of periods length ``x``
        * Offset String : Apply scoring on successive partitions of temporal extension matching the passed offset
          string
    
    partition_min : int, default 2
        The minimum number of periods that have to be present in a partition for the kNN scoring
        to be applied. If the number of periods present is below `partition_min`, the score for the
        datapoints in that partition will be np.nan.
    kNN_algorithm : {'ball_tree', 'kd_tree', 'brute', 'auto'}, default 'ball_tree'
        The search algorithm to find each datapoint's k nearest neighbors.