From b367bf710bbb9d351827c83e4ac93ca9bb2a7ef0 Mon Sep 17 00:00:00 2001 From: Bert Palm <bert.palm@ufz.de> Date: Wed, 25 Mar 2020 01:46:15 +0100 Subject: [PATCH] docdocdoc, with nicy linkies --- Readme.md | 192 +++---------------------- docs/aloc_usage.md | 256 +++++++++++++++++++++++++++++++-- docs/cookbook.md | 7 + docs/methods_and_properties.md | 8 +- 4 files changed, 273 insertions(+), 190 deletions(-) diff --git a/Readme.md b/Readme.md index 02e1bac..41fce8a 100644 --- a/Readme.md +++ b/Readme.md @@ -22,10 +22,9 @@ Features * additional align locator (`.aloc[]`) -Indexing --------- -**pandas-like indexing** +Pandas-like indexing +-------------------- `[]` and `.loc[]`, `.iloc[]` and `.at[]`, `.iat[]` - should behave exactly like their counter-parts from pd.DataFrame. They can take as indexer @@ -69,185 +68,34 @@ fill np.nans at missing locations and therefore also fill-up, whole missing colu Setting values with `[]` and `.loc[]`, `.iloc[]` and `.at[]`, `.iat[]` works like in pandas. With `.at`/`.iat` only single items can be set, for the other the right hand side values can be: -- *scalars*: these are broadcasted to the selected positions -- *nested lists*: the length of the outer list must match the number of indexed columns, the lengths of the inner lists must match the number of selected rows. -- *dios*: the length of the columns must match the number of indexed columns - columns does *not* align, - they are just iterated. - Rows do align. Rows that are present on the right but not on the left are ignored. - Rows that are present on the left (bear in mind: these rows was explicitly chosen for write!), but not present - on the right, are filled with `NaN`s, like in pandas. -- *normal lists* : column keys must be a scalar(!), the list is passed down, and set with `loc`, `iloc` or `[]` by pandas Series. 
-- *pd.Series*: column indexer must be a scalar(!), the series is passed down, and set with `loc`, `iloc` or `[]` - by pandas Series, where it maybe align, depending on the method. - -Examples: + - *scalars*: these are broadcast to the selected positions + - *nested lists*: the length of the outer list must match the number of indexed columns, + the lengths of the inner lists must match the number of selected rows. + - *dios*: the length of the columns must match the number of indexed columns - columns do *not* align, + they are just iterated. + Rows do align. Rows that are present on the right but not on the left are ignored. + Rows that are present on the left (bear in mind: these rows were explicitly chosen for writing!), but not present + on the right, are filled with `NaN`s, like in pandas. + - *normal lists* : column keys must be a scalar(!), the list is passed down, and set with `loc`, `iloc` or `[]` by pandas Series. + - *pd.Series*: column indexer must be a scalar(!), the series is passed down, and set with `loc`, `iloc` or `[]` + by pandas Series, where it may align, depending on the method. + +**Examples:** - `dios.loc[2:5, 'a'] = [1,2,3]` is the same as `a=dios['a']; a.loc[2:5]=[1,2,3]; dios['a']=a` - `dios.loc[2:5, :] = 99` : set 99 on rows 2 to 5 on all columns -**the special indexer `.aloc`** +The special indexer `.aloc` +----------------------------- Additional to the pandas like indexers we have a `.aloc[..]` (align locator) indexing method. Unlike `.iloc` and `.loc` indexers fully align if possible and 1D-array-likes can be broadcast to multiple columns at once. This method also handle missing indexer-items gracefully. It is used like `.loc`, so a single indexer (`.aloc[indexer]`) or a tuple of row-indexer and -column-indexer (`.aloc[row-indexer, column-indexer]`) can be given. 
-Unlike the other indexer methods, it is not possible to get a single item returned; the return type -is either a pandas Series, iff the column-indexer is a single key (eg. `'a'`) or a dios, iff not. - -2D-indexer (like dios or df), only can passed as a single key, like `.aloc[2D-indexer]` or -with a ellipsis, as column indexer, like `.aloc[2D-indexer, ...]`. The behavior may differ between these -methods, as explained later below. - -If a normal (non 2D-dimensional) row indexer is given, but no column indexer, the latter defaults to `:` aka. -`slice(None)`, so `.aloc[row-indexer]` becomes `.aloc[row-indexer, :]`, which means, that all columns are used. -In general, a normal row-indexer is applied to every column, that was chosen by the column indexer, but for -each column separately. - -Example: -``` ->> d - a | b | c | d | -===== | ===== | ===== | ===== | -0 66 | 2 77 | 0 88 | 1 99 | -1 66 | 3 77 | 1 88 | 2 99 | - - ->> d.aloc[[1,2], ['a', 'b', 'd']] - a | b | d | -===== | ===== | ===== | -1 66 | 2 77 | 1 99 | - | | 2 99 | -``` - -Following the `.aloc` specific indexer are listed. Any indexer that is not listed (slice, boolean lists, ...) -are treated similar, as they would passed to `.loc` (actually they are really passed to `'loc` under the hood). - -*special **Column** indexer* are : -- *list / array-like* (or any iterable object): Only labels that are present in the columns are used, others are - ignored. A dios is returned. -- *pd.Series* : `.values` are taken from series and handled like a *list*. A dios is returned. -- *scalar* (or any hashable obj) : Select a single column, if label is present, otherwise nothing. [1] - -*special **Row** indexer* are : -- *list / array-like* (or any iterable object): Only rows, which indices are present in the index of the column are - used, others are ignored. A dios is returned. 
-- *scalar* (or any hashable obj) : Select a single row from a column, if the value is present in the index of - the column, otherwise nothing is selected. [1] -- *pd.Series* : align the index from the given Series with the column, what means only common indices are used. The - actual values of the series are ignored(!). -- *boolean pd.Series* : like *pd.Series* but only True values are evaluated. - False values are equivalent to missing indices. To treat a boolean series as a *normal* indexer series, as decribed - above, one can use `.aloc(usebool=False)[boolean pd.Series]`. - - -*special **2D**-indexer* are : -- `.aloc[boolean dios-like]` : work same like `di[boolean dios-like]` (see there). - Brief: full align, select items, where the index is present and the value is True. -- `.aloc[dios-like, ...]` (with Ellipsis) : Align in columns and rows, ignore its values. Per common column, - the common indices are selected. The ellipsis forces `aloc`, to ignore the values, so a boolean dios could be - treated as a non-boolean. Alternatively `.aloc(usebool=False)[boolean dios-like]` could be used.[2] -- `.aloc[nested list-like]` : The inner lists are used as `aloc`-*list*-row-indexer (see there) on all columns. - One list for one column, which implies, that the outer list has the same length as the number of columns. - -*special handling of 1D-**values*** - -Values that are list- or array-like, which includes pd.Series, are set on all selected columns. pd.Series align -like `s1.loc[:] = s2` do. 
- -Examples: - -``` ->>> d - a | b | -======== | ===== | -0 0.0 | 1 50 | -1 70.0 | 2 60 | -2 140.0 | 3 70 | - - ->>> d.aloc[[1,2]] - a | b | -======== | ===== | -1 70.0 | 1 50 | -2 140.0 | 2 60 | - - ->>> d.aloc[d>60] - a | b | -======== | ===== | -1 70.0 | 3 70 | -2 140.0 | | - - ->>> d2 = d.copy() ->>> d2.aloc[d>60] = 10 ->>> d2 - a | b | -======= | ===== | -0 0.0 | 1 50 | -1 10.0 | 2 60 | -2 10.0 | 3 10 | - - ->>> d.aloc[[2,12,0,'foo'], ['a', 'x', 99, None, 99]] - a | -======== | -0 0.0 | -2 140.0 | - - ->>> s=pd.Series(index=[1,11,111,1111]) ->>> s -1 NaN -11 NaN -111 NaN -1111 NaN -dtype: float64 - - ->>> d.aloc[s] - a | b | -======= | ===== | -1 70.0 | 1 50 | - - ->>> d.aloc['foobar'] -Empty DictOfSeries -Columns: ['a', 'b'] - - ->>> d.aloc[d,...] # (equal to use) d.aloc(usebool=False)[d] - a | b | -======== | ===== | -0 0.0 | 1 50 | -1 70.0 | 2 60 | -2 140.0 | 3 70 | - - ->>> d.aloc[d] -Traceback (most recent call last): - File ...bad..stuff... -ValueError: Must pass dios-like key with boolean values only if passed as single indexer - - ->>> b = d.astype(bool) ->>> b['b'] = False ->>> b - a | b | -======== | ======== | -0 False | 1 False | -1 True | 2 False | -2 True | 3 False | - - ->>> d.aloc[b] # (equal to use) d[b] - a | b | -======== | ======= | -1 70.0 | no data | -2 140.0 | | +column-indexer (`.aloc[row-indexer, column-indexer]`) can be given. It can also handle boolean and *non-boolean* +2D-indexers. -``` +For more information and examples see the [aloc usage](/docs/aloc_usage.md) and the [cookbook](docs/cookbook.md). 
Properties ---------- diff --git a/docs/aloc_usage.md b/docs/aloc_usage.md index e537686..816ec14 100644 --- a/docs/aloc_usage.md +++ b/docs/aloc_usage.md @@ -2,10 +2,114 @@ ========= Purpose -- select gracefully, so rows or columns, that was given as indexer, but doesn't exist, don't raise an error +-------- +- select gracefully, so rows or columns that were given as indexer but don't exist do not raise an error - align series/dios-indexer - setting multiple columns at once with a list-like value +Overview +-------- +`aloc` is *called* like `loc`, with a single key that acts as row indexer `aloc[rowkey]` or with a tuple of +row indexer and column indexer `aloc[rowkey, columnkey]`. A 2D-indexer (like dios or df) can also be given, but +only as a single key, like `.aloc[2D-indexer]` or with the special column key `...`, +the ellipsis (`.aloc[2D-indexer, ...]`). The ellipsis may change how the 2D-indexer is +interpreted, but this will be explained [later](#the-power-of-2d-indexer) in detail. + +If a normal (non-2D) row indexer is given, but no column indexer, the latter defaults to `:` aka. +`slice(None)`, so `.aloc[row-indexer]` becomes `.aloc[row-indexer, :]`, which means that all columns are used. +In general, a normal row-indexer is applied to every column that was chosen by the column indexer, but for +each column separately. + +A first example gives a rough idea: +``` +>>> d + a | b | c | d | +===== | ===== | ===== | ===== | +0 66 | 2 77 | 0 88 | 1 99 | +1 66 | 3 77 | 1 88 | 2 99 | + + +>>> d.aloc[[1,2], ['a', 'b', 'd']] + a | b | d | +===== | ===== | ===== | +1 66 | 2 77 | 1 99 | + | | 2 99 | +``` + +The return type +---------------- + +Unlike with the other two indexer methods `loc` and `iloc`, it is not possible to get a single item returned; +the return type is either a pandas.Series, iff the column-indexer is a single key (e.g. `'a'`) or a dios, iff not. +The row-indexer does not play any role in the return type choice. 
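A minimal plain-pandas sketch of why a scalar row key cannot yield a single item: a label slice on a Series always returns a Series, whether one label matches or none. It uses only `pandas.Series.loc` and does not call dios itself; the series `s` and its values are made up for illustration.

```python
import pandas as pd

# Illustrative data only: one column with a sorted integer index.
s = pd.Series([0, 7, 14], index=[0, 1, 2])

item = s.loc[1]          # plain label lookup: a scalar
row = s.loc[1:1]         # label slice: a one-element Series
missing = s.loc[99:99]   # label slice, absent label, sorted index: empty Series

print(type(item).__name__, type(row).__name__, len(missing))
```

Note that the slice form does not raise for the absent label here only because the index is sorted; pandas raises a `KeyError` when slicing an unsorted index by a missing label.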
+ +*Note for the curious: This is because a scalar (`.aloc[key]`) is translated to `.loc[key:key]` under the hood.* + +Indexer types +------------- +The `.aloc`-specific indexers are listed below. Any indexer that is not listed (slice, boolean lists, ...), +but is known to work with `.loc`, is treated as if it were passed to `.loc`, which is what actually happens under the hood. + +Some indexers are linked to later sections, where a more detailed explanation and examples are given. + +*special [Column indexer](#select-columns-gracefully) are :* +- *list / array-like* (or any iterable object): Only labels that are present in the columns are used, others are + ignored. +- *pd.Series* : `.values` are taken from series and handled like a *list*. +- *scalar* (or any hashable obj) : Select a single column, if label is present, otherwise nothing. + + +*special [Row indexer](#selecting-rows-a-smart-way) are :* +- *list / array-like* (or any iterable object): Only rows whose indices are present in the index of the column are + used, others are ignored. A dios is returned. +- *scalar* (or any hashable obj) : Select a single row from a column, if the value is present in the index of + the column, otherwise nothing is selected. [1] +- *pd.Series* : align the index from the given Series with the column, which means only common indices are used. The + actual values of the series are ignored(!). +- *boolean pd.Series* : like *pd.Series* but only True values are evaluated. + False values are equivalent to missing indices. To treat a boolean series as a *normal* indexer series, as described + above, one can use `.aloc(usebool=False)[boolean pd.Series]`. + + +*special [2D-indexer](#the-power-of-2d-indexer) are :* +- `.aloc[boolean dios-like]` : works the same as `di[boolean dios-like]` (see there). + Brief: full align, select items, where the index is present and the value is True. +- `.aloc[dios-like, ...]` (with Ellipsis) : Align in columns and rows, ignore its values. 
Per common column, + the common indices are selected. The ellipsis forces `aloc` to ignore the values, so a boolean dios could be + treated as a non-boolean one. Alternatively `.aloc(usebool=False)[boolean dios-like]` could be used.[2] +- `.aloc[nested list-like]` : The inner lists are used as `aloc`-*list*-row-indexer (see there) on all columns. + One list for one column, which implies that the outer list has the same length as the number of columns. + +*special handling of 1D-**values*** + +Values that are list- or array-like, which includes pd.Series, are set on all selected columns. pd.Series align +like `s1.loc[:] = s2` does. See also the [cookbook](/docs/cookbook.md#broadcast-array-likes-to-multiple-columns). + +*Indexer Table* + +| example | type | on | handling | +| ------ | ------ | ------ |------ | +|**column indexer**| +| `.aloc[any, 'a']` | scalar | columns | graceful | +| `.aloc[any, ['a','c']]` | list-like | columns | graceful | +| `.aloc[any, [True,False]]` | bool list-like | columns | take `True`'s, length must match (!) | +| `.aloc[any, s]` | pandas.Series | columns | like list, only values | +| `.aloc[any, bs]` | bool pandas.Series | columns | like bool-list | +| `.aloc[any, 'b':'z']` | slice | columns | filter | +|**row indexer**| +| `.aloc[7, any]` | scalar | rows | translate to `.loc[key:key]` | +| `.aloc[[1,2,24], any]` | list-like | rows | handle gracefully | +| `.aloc[[True,False], any]` | bool list-like | rows | take `True`'s, length must match nr of (all selected) columns (!) | +| `.aloc[s, any]` | pandas.Series | rows | like `.loc[s.index]` | +| `.aloc[bs, any]` | bool pandas.Series | rows | align + just take `True`'s, [1] | +|**2D indexer**| +| `.aloc[[[s],[1,2,3]], any]` | nested list-like | both | one row-indexer per column, outer length must match nr of columns(!) 
| +| `.aloc[di]` | dios-like | both | full align | +| `.aloc[di, ...]` | dios-like | both | full align, ellipsis has no effect | +| `.aloc[di>5]` | bool dios-like | both | full align + take `True`'s [1] | +| `.aloc[di>5, ...]` | (bool) dios-like | both | full align, disable bool evaluation | +[1] evaluate `usebool`-keyword + Example dios --------- @@ -51,7 +155,7 @@ Just like selecting *single columns gracefully*, but with a array-like indexer. A dios is returned, with a subset of the existing columns. If no key is present a empty dios is returned. -If the key is a pandas Series, its *values* are used for indexing, especially the Series's index is ignored. +If the key is a pandas.Series, its *values* are used for indexing, especially the Series's index is ignored. To select all columns simply use `.aloc[:,:]` or even simpler `.aloc[:]`, just like one would do with `loc` or `iloc`. @@ -83,21 +187,145 @@ d.aloc[:, s] Selecting Rows a smart way -------------------------- -Overview: +For scalar and array-like indexer with label values, the keys are handled gracefully, just like with +array-like column indexers. -| | | -| ------ | ------ | -| `.aloc[s]` | like `.loc[s.index]` | -| `.aloc[list]` | handle graceful | -| `.aloc[bool list]` | no merci, length must match all (selected) columns | -| `.aloc[bool series]` | align index and just take `True`'s -- [1] | -| `.aloc[key]` | translate to `.loc[key:key]` | -[1] evaluate `usebool`-keyword +``` +>>> d.aloc[1] + a | b | c | d | +==== | ======= | ======= | ======= | +1 7 | no data | no data | no data | -Note for the curios: *Because of `.aloc[key]` translates to `.loc[key:key]`, dios never return a single item, -nor a columns-indexed Series*. 
+>>> d.aloc[99] +Empty DictOfSeries +Columns: ['a', 'b', 'c', 'd'] + +>>> d.aloc[[3,6,7,18]] + a | b | c | d | +===== | ==== | ===== | ==== | +3 21 | 3 6 | 6 27 | 6 0 | + | 6 9 | 7 37 | 7 1 | +``` + +The length of columns can differ: +``` +>>> d.aloc[[3,6,7,18]].aloc[[3,6]] + a | b | c | d | +===== | ==== | ===== | ==== | +3 21 | 3 6 | 6 27 | 6 0 | + | 6 9 | | | +``` + +Boolean array-likes as row indexer +--------------------------------- + +For array-like indexers that hold boolean values, the length of the indexer and +the length of all column(s) to index must match. +``` +>>> d.aloc[[True,False,False,True,False]] + a | b | c | d | +===== | ==== | ===== | ==== | +0 0 | 2 5 | 4 7 | 6 0 | +3 21 | 5 8 | 7 37 | 9 3 | +``` +If the length does not match an `IndexError` is raised: +``` +>>> d.aloc[[True,False,False]] +Traceback (most recent call last): + ... + f"Boolean index has wrong length: " +IndexError: failed for column a: Boolean index has wrong length: 3 instead of 5 +``` + +This can be tricky, especially if columns have different lengths: +``` +>>> difflen + a | b | c | d | +===== | ==== | ===== | ==== | +0 0 | 2 5 | 4 7 | 6 0 | +1 7 | 3 6 | 6 27 | 7 1 | +2 14 | 4 7 | | 8 2 | + +>>> difflen.aloc[[False,True,False]] +Traceback (most recent call last): + ... + f"Boolean index has wrong length: " +IndexError: Boolean index has wrong length: 3 instead of 2 +``` + +pandas.Series and boolean pandas.Series as row indexer +------------------------------------------------------ + +When using a pandas.Series as row indexer with `aloc`, all its magic comes to light. +The index of the given series aligns itself with the index of each column separately and is used as a filter this way. 
+ +``` +>>> s = d['b'] + 100 +>>> s +2 105 +3 106 +4 107 +5 108 +6 109 +Name: b, dtype: int64 + +>>> d.aloc[s] + a | b | c | d | +===== | ==== | ===== | ==== | +2 14 | 2 5 | 4 7 | 6 0 | +3 21 | 3 6 | 5 17 | | +4 28 | 4 7 | 6 27 | | + | 5 8 | | | + | 6 9 | | | +``` + +As seen in the example above the series' values are ignored completely. The functionality +is similar to `s1.loc[s2.index]`, where `s1` and `s2` are pandas.Series, `s2` is the indexer and `s1` is one column +after the other. + +If the indexer series holds boolean values they are not ignored. +The series aligns the same way as explained above, but additionally only the `True` values are evaluated. +Thus `False`-values are treated like missing indices. The behavior here is analogous to `s1.loc[s2[s2].index]`. + +``` +>>> boolseries = d['b'] > 6 +>>> boolseries +2 False +3 False +4 True +5 True +6 True +Name: b, dtype: bool + +>>> d.aloc[boolseries] + a | b | c | d | +===== | ==== | ===== | ==== | +4 28 | 4 7 | 4 7 | 6 0 | + | 5 8 | 5 17 | | + | 6 9 | 6 27 | | +``` + +Evaluating boolean values is a very handy feature, as it can easily be used with multiple conditions and also fits +nicely with writing those as one-liners: + +``` +>>> d.aloc[d['b'] > 6] + a | b | c | d | +===== | ==== | ===== | ==== | +4 28 | 4 7 | 4 7 | 6 0 | + | 5 8 | 5 17 | | + | 6 9 | 6 27 | | + +>>> d.aloc[(d['a'] > 6) & (d['b'] > 6)] + a | b | c | d | +===== | ==== | ==== | ======= | +4 28 | 4 7 | 4 7 | no data | +``` + +Nevertheless, something like `d.aloc[d['a'] > d['b']]` does not work, because the comparison fails +as long as the two series objects do not have the same index. But one may want to check out +[DictOfSeries.index_of()](/docs/methods_and_properties.md#diosdictofseriesindex_of). 
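The failing comparison mentioned above can be reproduced with plain pandas, without dios. The sketch below assumes two series shaped like columns `a` and `b` of the example dios (values and indices copied from the tables above). It shows that `>` raises for differently labeled Series, and that `pandas.Series.align` with `join="inner"` is one possible workaround. The workaround is an assumption for illustration, not the approach the dios docs recommend.

```python
import pandas as pd

# Stand-ins for d['a'] and d['b']; the indices overlap only on 2, 3, 4.
a = pd.Series([0, 7, 14, 21, 28], index=[0, 1, 2, 3, 4])
b = pd.Series([5, 6, 7, 8, 9], index=[2, 3, 4, 5, 6])

# Comparing Series with different labels raises a ValueError in pandas.
try:
    _ = a > b
    raised = False
except ValueError:
    raised = True

# Possible workaround: restrict both series to their common index first.
a_common, b_common = a.align(b, join="inner")
mask = a_common > b_common   # boolean Series on the common index only

print(raised)
print(mask)
```

A boolean series built this way carries only the common index, so it could then serve as a boolean row indexer of the kind described in this section.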
-**T_O_D_O** The power of 2D-indexer ----------------------- diff --git a/docs/cookbook.md b/docs/cookbook.md index b26c959..ab0ce4f 100644 --- a/docs/cookbook.md +++ b/docs/cookbook.md @@ -7,6 +7,8 @@ Recipes - align dios with dios - get/set values by condition - apply a value to multiple columns +- [Broadcast array-likes to multiple columns](#broadcast-array-likes-to-multiple-columns) +- apply an array-like value to multiple columns - nan-policy - mask vs. drop values, when nan's are inserted (mv to Readme ??) - itype - when to use, pitfalls and best-practise - changing the index of series' in dios (one, some, all) @@ -14,3 +16,8 @@ Recipes - changing properties of series' in dios (one, some, all) **T_O_D_O** + + +Broadcast array-likes to multiple columns +----------------------------------------- +**T_O_D_O** diff --git a/docs/methods_and_properties.md b/docs/methods_and_properties.md index f8231b1..0927be4 100644 --- a/docs/methods_and_properties.md +++ b/docs/methods_and_properties.md @@ -7,22 +7,22 @@ Methods Brief - `copy(deep=True)` : Return a copy. See also [pandas.DataFrame.copy]( https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.copy.html) - [copy_empty()](#diosdictofseriescopy_empty) : Return a new DictOfSeries object, with same properties than the original. + - [`copy_empty()`](#diosdictofseriescopy_empty) : Return a new DictOfSeries object, with same properties than the original. - `all(axis=0)` : Return whether all elements are True, potentially over an axis. See also [pandas.DataFrame.all]( https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.all.html) - `any(axis=0)` : Return whether any element is True, potentially over an axis. See also [pandas.DataFrame.any]( https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.any.html) - `squeeze(axis=None)` : Squeeze a 1-dimensional axis objects into scalars. 
See also [pandas.DataFrame.squeeze](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.squeeze.html) - - [to_df()](#diosdictofseriesto_df) : Transform the Dios to a pandas.DataFrame + - [`to_df()`](#diosdictofseriesto_df) : Transform the Dios to a pandas.DataFrame - `to_string(kwargs)` : Return a string representation of the Dios. - - [apply()](#diosdictofseriesapply) : apply the given function to every column in the dios eg. + - [`apply()`](#diosdictofseriesapply) : apply the given function to every column in the dios eg. - `astype()` : Cast the data to the given data type. - `isin()` : return a boolean dios, that indicates if the corresponding value is in the given array-like - `isna()` : Return a bolean array that is `True` if the value is a Nan-value - `notna()` : inverse of `isnan()` - `dropna()` : drop all Nan-values - - [index_of()](#diosdictofseriesindex_of): Return a single(!) Index that is constructed from all the indexes of the columns. + - [`index_of()`](#diosdictofseriesindex_of): Return a single(!) Index that is constructed from all the indexes of the columns. - `len(Dios)` : return the number of columns the dios has. -- GitLab