DictOfSeries (soon renamed to SoS?)
A DictOfSeries is a pd.Series of pd.Series objects that aims to behave as much like a pd.DataFrame as possible.
Nomenclature
- pd: pandas
- series/ser: instance of pd.Series
- dios: instance of DictOfSeries
- df: instance of pd.DataFrame
- dios-like: a dios or a df
- alignable object: a dios, df or a series
Features
- every column has its own index
- uses far less memory than a pd.DataFrame holding the same non-aligned data
- acts much like a pd.DataFrame
- additional align locator (`.aloc[]`)
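To make the memory claim concrete, here is a minimal sketch that uses plain pandas only. It does not use the dios API; the plain dict `columns` merely stands in for a container where every column keeps its own index, and the series sizes are arbitrary.

```python
import pandas as pd

# Two columns whose indexes do not overlap at all.
a = pd.Series(range(1000), index=range(0, 1000), name="a")
b = pd.Series(range(1000), index=range(1000, 2000), name="b")

# A DataFrame has to align both columns on the union of the indexes,
# so every row holds one real value and one NaN.
df = pd.concat([a, b], axis=1)

# Stand-in for a per-column-index container: nothing is padded.
columns = {"a": a, "b": b}

cells_df = df.size                                  # stored cells in the DataFrame
cells_sep = sum(len(s) for s in columns.values())   # stored cells when kept separate
nan_cells = int(df.isna().sum().sum())              # padding introduced by alignment

print(cells_df, cells_sep, nan_cells)
```

Half of the cells in the aligned DataFrame are padding, which is the overhead a per-column index avoids.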
Indexing
pandas-like indexing
`dios[]`, `.loc[]`, `.iloc[]`, `.at[]` and `.iat[]` should behave exactly like their counterparts from pd.DataFrame. They can take as indexer:
- lists, array-like, in general iterables
- boolean lists and iterables
- slices
- scalars or any hashable obj
Most indexers are passed directly to the underlying column series or row series, depending on the position of the indexer and the complexity of the operation. For `.loc`, `.iloc`, `.at` and `.iat` the first position is the row indexer and the second the column indexer. The second can be omitted and defaults to `slice(None)`. Examples:
- `di.loc[[1,2,3], ['a']]`: select labels 1, 2, 3 from column a
- `di.iloc[[1,2,3], [0,3]]`: select positions 1, 2, 3 from the columns at positions 0 and 3
- `di.loc[:, 'a':'c']`: select all rows from columns a to c
- `di.at[4,'c']`: select the element at label 4 in column c
- `di.loc[:]` -> `di.loc[:,:]`: select everything
Scalar indexing always returns a Series if the other indexer is a non-scalar. If both indexers are scalars, the stored element itself is returned. In all other cases a dios is returned. For more pandas-like indexing magic and the differences between the indexers, see the pandas documentation.
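The return-type rules above mirror those of pd.DataFrame. The sketch below shows them on a plain DataFrame, the counterpart a dios is meant to imitate; where pandas returns a DataFrame, a dios is documented to return a dios. The frame and its values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [10, 20, 30], "b": [1.5, 2.5, 3.5]}, index=[1, 2, 3])

row = df.loc[1, ["a", "b"]]     # scalar row, list of columns -> Series
col = df.loc[[1, 2], "a"]       # list of rows, scalar column -> Series
item = df.at[1, "a"]            # both scalar -> the stored element
frame = df.loc[[1, 2], ["a"]]   # both non-scalar -> 2D container

print(type(row).__name__, type(col).__name__, item, type(frame).__name__)
```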
multi-dimensional indexer
`dios[boolean dios-like]` (as single key): a dios accepts a boolean multi-indexer (a boolean pd.DataFrame or a boolean dios). Columns and rows of the multi-indexer are aligned with the dios, so only matching columns are selected or written, and the same applies to rows.
Rows or whole columns that are missing in the indexer but present in the dios are dropped. Empty columns are preserved, however, so the resulting dios always has the same column dimension as the initial dios.
This is similar to how pd.DataFrame handles a multi-indexer, except that pd.DataFrame fills missing locations and columns with np.nan.
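The difference to pd.DataFrame can be illustrated with plain pandas. The first part shows how a DataFrame pads a boolean 2D selection with NaN; the second part is only an approximation of the per-column dropping described above, built from a plain dict of series rather than the dios API.

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 70, 140], "b": [50, 60, 70]})
mask = df > 60

# pd.DataFrame keeps the full shape and writes NaN where the mask is False.
masked = df[mask]

# Approximation of the dropping behaviour: each column keeps only the rows
# selected for it, and every column name survives even if it ends up empty.
dropped = {name: df[name][mask[name]] for name in df.columns}

print(masked)
print({name: ser.tolist() for name, ser in dropped.items()})
```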
setting values
Setting values with `di[]`, `.loc[]`, `.iloc[]`, `.at[]` and `.iat[]` works like in pandas. With `.at`/`.iat` only single items can be set; for the others the values can be:
- scalars: these are broadcast to the selected positions
- nested lists: the length of the outer list must match the number of selected columns, the lengths of the inner lists must match the number of selected rows
- normal lists: the column key must be a scalar(!); the list is passed down and set on the underlying series
- pd.Series: the column key must be a scalar(!); the series is passed down and set on the underlying series in the dios, where both are aligned
Examples:
- `dios.loc[2:5, 'a'] = [1,2,3]` is the same as `a=dios['a']; a.loc[2:5]=[1,2,3]`
- `dios.loc[2:5, :] = 99`: set 99 on rows 2 to 5 in all columns
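What these two assignments amount to on the underlying series can be sketched with plain pandas. The dict `columns` is a stand-in for a dios, and the index labels are chosen (as an assumption) so that the label slice `2:5` covers exactly three rows of column a.

```python
import pandas as pd

columns = {
    "a": pd.Series([0.0, 0.0, 0.0, 0.0, 0.0], index=[0, 2, 3, 5, 7]),
    "b": pd.Series([0, 0, 0], index=[1, 4, 9]),
}

# List value with a scalar column key: the list is set on that one series.
# The label slice 2:5 selects the labels 2, 3 and 5, so three values fit.
columns["a"].loc[2:5] = [1, 2, 3]
after_list = columns["a"].copy()

# Scalar value: broadcast to the selected rows of every column.
for ser in columns.values():
    ser.loc[2:5] = 99

print(after_list.tolist())
print({name: ser.tolist() for name, ser in columns.items()})
```

Note that the slice selects a different number of rows per column (three in a, one in b), which is why only a scalar can be broadcast across all columns here.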
special indexer .aloc
In addition to the pandas-like indexers there is an `.aloc[..]` (align locator) indexing method. Unlike with `.iloc` and `.loc`, indexers and/or values are fully aligned if possible, and 1D array-likes can be broadcast to multiple columns at once. This method also handles missing indexer items gracefully.
It is used like `.loc`, so either a single row indexer (`.aloc[row-indexer]`) or a tuple of row indexer and column indexer (`.aloc[row-indexer, column-indexer]`) can be given.
Alignable indexers are:
- `.aloc[pd.Series]`: only common indices are used in each column
- `.aloc[boolean dios-like]` (as single key): works like `di[boolean dios-like]` (see above). Only matching columns and matching indices are used, and only where the value is `True` (values that are `False` are dropped and handled as if they were missing). In contrast to *normal* indexing with `di[boolean dios-like]`, missing rows are **not** filled with NaNs; instead they are dropped on selection operations and ignored on setting operations. Empty columns are nevertheless preserved.
- `.aloc[dios-like, ...]` (dios-like, **Ellipsis**): here "`...`" is not a placeholder, it refers to the Ellipsis object. Full alignment: only matching columns and indices are used. Alternatively, `.aloc(booldios=False)[dios]` can be used.
Indexers that are handled gracefully:
- `.aloc[list]` (lists or any iterable object): only present labels/positions are used
- `.aloc[scalars]` (or any hashable object): returns the underlying item if present, or an empty pd.Series if not
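The "only present labels are used" rule can be approximated per column with plain pandas. This is a sketch of the idea, not the library's implementation; `select_present` is a hypothetical helper and the dict `columns` stands in for a dios.

```python
import pandas as pd

columns = {
    "a": pd.Series([0.0, 70.0, 140.0], index=[0, 1, 2]),
    "b": pd.Series([50, 60, 70], index=[1, 2, 3]),
}

def select_present(columns, labels):
    """Hypothetical helper: per column, keep only the labels that exist."""
    return {name: ser[ser.index.isin(labels)] for name, ser in columns.items()}

# List indexer: labels missing from a column are silently skipped.
by_list = select_present(columns, [1, 2])

# Series indexer: only its index matters, and only common labels survive.
indexer = pd.Series(dtype=float, index=[1, 11, 111, 1111])
by_series = select_present(columns, indexer.index)

print({name: ser.to_dict() for name, ser in by_list.items()})
print({name: ser.to_dict() for name, ser in by_series.items()})
```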
Alignable values are:
- `.aloc[any] = pd.Series`: per column, only common indices are used and the corresponding value is set
- `.aloc[any] = dios` (dios-like): only matching columns and indices are used and the corresponding value is set
For all other indexers and/or values `.loc` is automatically used as fallback.
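Setting an alignable value can be pictured the same way. The sketch below writes a pd.Series into every column, but only at labels that both sides share. `set_aligned` is a hypothetical helper operating on a plain dict of series, not part of the dios API.

```python
import pandas as pd

columns = {
    "a": pd.Series([0.0, 70.0, 140.0], index=[0, 1, 2]),
    "b": pd.Series([50, 60, 70], index=[1, 2, 3]),
}

def set_aligned(columns, value):
    """Hypothetical helper: write `value` only where its labels match."""
    for ser in columns.values():
        common = ser.index.intersection(value.index)
        ser.loc[common] = value.loc[common]

# Label 2 exists in both columns; label 99 exists in neither and is ignored.
set_aligned(columns, pd.Series([-1, -2], index=[2, 99]))

print({name: ser.to_dict() for name, ser in columns.items()})
```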
Examples:
>>> d
a | b |
======== | ===== |
0 0.0 | 1 50 |
1 70.0 | 2 60 |
2 140.0 | 3 70 |
>>> d.aloc[[1,2]]
a | b |
======== | ===== |
1 70.0 | 1 50 |
2 140.0 | 2 60 |
>>> d.aloc[d>60]
a | b |
======== | ===== |
1 70.0 | 3 70 |
2 140.0 | |
>>> d.aloc[d>60] = 10
>>> d
a | b |
======= | ===== |
0 0.0 | 1 50 |
1 10.0 | 2 60 |
2 10.0 | 3 10 |
>>> d.aloc[[2,12,0,'foo'], ['a', 'x', 99, None, 99]]
a |
======= |
0 0.0 |
2 10.0 |
>>> s=pd.Series(index=[1,11,111,1111])
>>> s
1 NaN
11 NaN
111 NaN
1111 NaN
dtype: float64
>>> d.aloc[s]
a | b |
======= | ===== |
1 10.0 | 1 50 |
>>> d.aloc['foobar']
Empty DictOfSeries
Columns: ['a', 'b']
>>> d.aloc[d,...] # (or use) d.aloc(booldios=False)[d]
a | b |
====== | ===== |
0 0 | 1 50 |
1 70 | 2 60 |
2 140 | 3 70 |
>>> d.aloc[d]
Traceback (most recent call last):
File ...bad..stuff...
ValueError: Must pass dios-like key with boolean values only if passed as single indexer
>>> b = d.astype(bool)
>>> b
a | b |
======== | ======= |
0 False | 1 True |
1 True | 2 True |
2 True | 3 True |
>>> d.aloc[b]
a | b |
====== | ===== |
1 70 | 1 50 |
2 140 | 2 60 |
| 3 70 |
Properties
- columns
- indexes (series of the indexes of all series)
- lengths (series of the lengths of all series)
- values (not fully pd-like: np.array of the values of each series)
- dtypes
- itype (see section Itype)
- empty
- size
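As a rough picture of what per-column properties such as `lengths` and `indexes` describe, the following sketch derives comparable information from a plain dict of series. It is an assumption about their shape based on the descriptions above, not the library's code.

```python
import pandas as pd

columns = {
    "a": pd.Series([0.0, 70.0, 140.0], index=[0, 1, 2]),
    "b": pd.Series([50, 60], index=[1, 2]),
}

# One length per column, keyed by column name.
lengths = pd.Series({name: len(ser) for name, ser in columns.items()})

# One index per column; a plain dict is used here to keep the sketch simple.
indexes = {name: ser.index for name, ser in columns.items()}

print(lengths.to_dict())
print({name: idx.tolist() for name, idx in indexes.items()})
```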
Methods and implied features
These mostly work like the analogous methods of pd.DataFrame.
- copy()
- copy_empty()
- all()
- any()
- squeeze()
- to_df()
- to_string()
- apply()
- astype()
- isna()
- notna()
- dropna()
- memory_usage()
- index_of()
- `in`
- `is`
- `len(Dios)`
Operators and Comparators
- arithmetical: `+ - * ** // / %` and `abs()`
- boolean: `& ^ | ~`
- comparators: `== != > >= < <=`
Itype
A DictOfSeries holds multiple series, and each series can have a different index length and index type. Differing index lengths are either handled by aligning, or the operation simply fails if aligning makes no sense (e.g. assigning the very same list to series of different lengths; see `.aloc`).
The bigger problem is the type of the index. If one series has an alphabetical index and another a numeric index, selecting along columns can fail in every scenario. To keep track of the index types, and to prohibit inserting an index of a non-fitting type, we introduce an `itype`. It can be set on creation of a dios and changed during usage.
When the itype changes, the indexes of all series in the dios are cast to a new fitting type, if possible. Different cast mechanisms are available.
If an itype prohibits certain index types but a series with a non-fitting index type is inserted, an implicit cast is done, with or without a warning, or an error is raised. The warning/error policy can be adjusted via global options.
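The idea of tracking and casting index types can be sketched with plain pandas. `index_kind` is a hypothetical helper with made-up category names; the actual itype classes and cast mechanisms of the library are not shown here.

```python
import pandas as pd

def index_kind(index):
    """Hypothetical helper: coarse classification of an index type."""
    if isinstance(index, pd.DatetimeIndex):
        return "datetime"
    if pd.api.types.is_numeric_dtype(index.dtype):
        return "numeric"
    return "other"

columns = {
    "a": pd.Series([1, 2], index=[0, 1]),
    "b": pd.Series([3, 4], index=["x", "y"]),
    "c": pd.Series([5, 6], index=pd.to_datetime(["2020-01-01", "2020-01-02"])),
}

# Mixed index kinds are what makes selecting across columns fragile.
kinds = {name: index_kind(ser.index) for name, ser in columns.items()}

# A cast to a fitting type, here from an integer index to a float index.
cast = columns["a"].copy()
cast.index = cast.index.astype(float)

print(kinds, cast.index.dtype)
```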
Have fun :)