DictOfSeries (soon renamed to SoS?)
A pd.Series of pd.Series objects that aims to behave as much like a pandas DataFrame as possible.
Nomenclature
- pd: pandas
- series/ser: instance of pd.Series
- dios: instance of DictOfSeries
- df: instance of pd.DataFrame
- dios-like: a dios or a df
- alignable object: a dios, df or a series
Features
- every column has its own index
- uses much less memory than a misaligned pd.DataFrame
- behaves quite like a pd.DataFrame
- additional align locator (.aloc[])
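To make the memory claim concrete, here is a rough plain-Python count of stored cells. It models each column as a dict from index label to value and is only an illustration, not a measurement of dios or pandas internals; the helper names are made up for this sketch.

```python
# Two columns whose indexes do not overlap at all.
columns = {
    "a": {i: 1.0 for i in range(0, 1000)},
    "b": {i: 2.0 for i in range(1000, 2000)},
}


def cells_per_column_index(cols):
    """Cells stored when every column keeps its own index."""
    return sum(len(col) for col in cols.values())


def cells_shared_index(cols):
    """Cells stored when all columns share the union index (gaps become NaN)."""
    union = set()
    for col in cols.values():
        union.update(col)
    return len(union) * len(cols)


print(cells_per_column_index(columns))  # 2000
print(cells_shared_index(columns))      # 4000
```

The more the indexes differ, the larger the share of NaN padding in the shared-index layout.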
Indexing
pandas-like indexing
[] and .loc[], .iloc[] and .at[], .iat[] should behave exactly like their counterparts from pd.DataFrame. They accept the following indexers:
- lists, array-like objects and in general all iterables
- boolean lists and iterables
- slices
- scalars and any hashable object
Most indexers are passed directly to the underlying columns-series or row-series, depending on the position of the indexer and the complexity of the operation. For .loc, .iloc, .at and .iat the first position is the row indexer and the second the column indexer. The second can be omitted and defaults to slice(None). Examples:
- di.loc[[1,2,3], ['a']]: select labels 1,2,3 from column a
- di.iloc[[1,2,3], [0,3]]: select positions 1,2,3 from the columns 0 and 3
- di.loc[:, 'a':'c']: select all rows from columns a to c
- di.at[4,'c']: select the element with label 4 in column c
- di.loc[:] -> di.loc[:,:]: select everything
Scalar indexing always returns a pandas Series if the other indexer is a non-scalar. If both indexers are scalars, the element itself is returned. In all other cases a dios is returned. For more pandas-like indexing magic and the differences between the indexers, see the pandas documentation.
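The return-type rule above can be written down as a small decision function. This is a plain-Python sketch of the rule as stated in the text, not code from dios; result_kind is a hypothetical name, and only slices and lists are treated as non-scalar here to keep it short.

```python
def result_kind(row_indexer, col_indexer=slice(None)):
    """Name the kind of object the rule above says is returned."""

    def is_scalar(indexer):
        # Simplification for this sketch: slices and lists are non-scalar,
        # everything else (ints, strings, other hashables) counts as scalar.
        return not isinstance(indexer, (slice, list))

    row_scalar = is_scalar(row_indexer)
    col_scalar = is_scalar(col_indexer)
    if row_scalar and col_scalar:
        return "element"
    if row_scalar or col_scalar:
        return "pd.Series"
    return "dios"


print(result_kind(4, "c"))          # element
print(result_kind([1, 2, 3], "a"))  # pd.Series
print(result_kind(slice(None)))     # dios
```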
2D-indexer
dios[boolean dios-like] (as single key): a dios accepts a boolean 2D-indexer (a boolean pd.DataFrame or a boolean dios).
Columns and rows from the indexer align with the dios. This means that only matching columns are selected, and within these columns only rows where i) the index matches and ii) the value is True in the boolean indexer. Missing indices are treated the same as present indices with False values.
Values from unselected rows and columns are dropped, but empty columns are preserved, so the resulting dios always has the same columns as the initial dios. This matches pd.DataFrame's handling of 2D-indexers, except that pd.DataFrame fills np.nan at missing locations and therefore also fills entirely missing columns with NaNs.
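The selection rule for a boolean 2D-indexer can be modeled in plain Python, with each column represented as a dict from index label to value. This is a sketch of the behavior described above, not the dios implementation; select_2d is a hypothetical helper.

```python
def select_2d(data, mask):
    """Keep data[col][idx] only where mask[col][idx] is True.

    Columns of `data` always survive, even if they end up empty.
    Columns or labels that exist only in `mask` are ignored.
    """
    result = {}
    for col, series in data.items():
        col_mask = mask.get(col, {})
        result[col] = {
            idx: val
            for idx, val in series.items()
            if col_mask.get(idx, False) is True  # missing label == False
        }
    return result


d = {"a": {0: 0.0, 1: 70.0, 2: 140.0}, "b": {1: 50, 2: 60, 3: 70}}
mask = {"a": {1: True, 2: True, 5: True}, "b": {1: False}, "z": {0: True}}
print(select_2d(d, mask))  # {'a': {1: 70.0, 2: 140.0}, 'b': {}}
```

Column b is kept although nothing in it was selected, which is the difference from a NaN-filled DataFrame result.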
setting values
Setting values with [] and .loc[], .iloc[] and .at[], .iat[] works like in pandas.
With .at/.iat only single items can be set; for the others, the right-hand side values can be:
- scalars: these are broadcast to the selected positions
- nested lists: the length of the outer list must match the number of indexed columns, the lengths of the inner lists must match the number of selected rows.
- dios: the number of columns must match the number of indexed columns. Columns do not align, they are just iterated. Rows do align. Rows that are present on the right but not on the left are ignored. Rows that are present on the left (bear in mind: these rows were explicitly chosen for writing!) but not present on the right are filled with NaNs, like in pandas.
- normal lists: the column key must be a scalar(!); the list is passed down and set with loc, iloc or [] by the pandas Series.
- pd.Series: the column indexer must be a scalar(!); the series is passed down and set with loc, iloc or [] by the pandas Series, where it may align, depending on the method.
Examples:
- dios.loc[2:5, 'a'] = [1,2,3] is the same as a=dios['a']; a.loc[2:5]=[1,2,3]; dios['a']=a
- dios.loc[2:5, :] = 99: set 99 on rows 2 to 5 on all columns
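The row-alignment rule for a dios on the right-hand side can be modeled for a single column in plain Python. This sketch follows the description above (right-only rows ignored, selected left rows without a partner become NaN); set_rows_from is a hypothetical helper, not part of dios.

```python
import math


def set_rows_from(left, selected_rows, right):
    """Write `right` into the selected rows of `left`, aligning on labels."""
    for idx in selected_rows:
        if idx in left:
            # A selected row with no partner on the right becomes NaN.
            left[idx] = right.get(idx, math.nan)
    return left


left = {0: 1.0, 1: 2.0, 2: 3.0}
right = {1: 10.0, 7: 99.0}  # label 7 does not exist on the left
set_rows_from(left, [1, 2], right)
print(left)
```

Row 0 was not selected and stays untouched, row 1 takes the value from the right, row 2 becomes NaN, and label 7 is never added.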
the special indexer .aloc
In addition to the pandas-like indexers, there is an .aloc[..] (align locator) indexing method.
Unlike with .iloc and .loc, indexers fully align if possible, and 1D-array-likes can be broadcast to multiple columns at once. This method also handles missing indexer items gracefully.
It is used like .loc, so a single indexer (.aloc[indexer]) or a tuple of row indexer and column indexer (.aloc[row-indexer, column-indexer]) can be given.
Unlike with the other indexer methods, it is not possible to get a single item returned; the return type is a pandas Series if the column indexer is a single key (e.g. 'a') and a dios otherwise.
A 2D-indexer (like a dios or df) can only be passed as a single key, like .aloc[2D-indexer], or with an ellipsis as column indexer, like .aloc[2D-indexer, ...]. The behavior may differ between these two forms, as explained below.
If a normal (non-2D) row indexer is given but no column indexer, the latter defaults to : (aka slice(None)), so .aloc[row-indexer] becomes .aloc[row-indexer, :], which means that all columns are used.
In general, a normal row indexer is applied to every column chosen by the column indexer, but to each column separately.
Example:
>> d
a | b | c | d |
===== | ===== | ===== | ===== |
0 66 | 2 77 | 0 88 | 1 99 |
1 66 | 3 77 | 1 88 | 2 99 |
>> d.aloc[[1,2], ['a', 'b', 'd']]
a | b | d |
===== | ===== | ===== |
1 66 | 2 77 | 1 99 |
| | 2 99 |
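The example above can be reproduced with a plain-Python model in which every column is a dict from index label to value. This is a sketch of the described semantics only; aloc_lists is a hypothetical name and the real .aloc handles many more indexer types.

```python
def aloc_lists(data, rows, cols):
    """Per selected column, keep only the requested row labels that exist."""
    wanted_rows = set(rows)
    return {
        col: {idx: val for idx, val in data[col].items() if idx in wanted_rows}
        for col in data  # keep the original column order
        if col in cols   # unknown column labels are silently ignored
    }


d = {
    "a": {0: 66, 1: 66},
    "b": {2: 77, 3: 77},
    "c": {0: 88, 1: 88},
    "d": {1: 99, 2: 99},
}
print(aloc_lists(d, [1, 2], ["a", "b", "d"]))
# {'a': {1: 66}, 'b': {2: 77}, 'd': {1: 99, 2: 99}}
```

Each column is filtered on its own index, which is why the three result columns have different lengths.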
The .aloc-specific indexers are listed below. Any indexer that is not listed (slices, boolean lists, ...) is treated as if it were passed to .loc (in fact, it really is passed to .loc under the hood).
Special column indexers are:
- list / array-like (or any iterable object): Only labels that are present in the columns are used, others are ignored. A dios is returned.
- pd.Series: .values are taken from the series and handled like a list. A dios is returned.
- scalar (or any hashable obj): Select a single column if the label is present, otherwise nothing. [1]
Special row indexers are:
- list / array-like (or any iterable object): Only rows whose indices are present in the index of the column are used, others are ignored. A dios is returned.
- scalar (or any hashable obj): Select a single row from a column if the value is present in the index of the column, otherwise nothing is selected. [1]
- pd.Series: align the index of the given series with the column, which means only common indices are used. The actual values of the series are ignored(!).
- boolean pd.Series: like pd.Series, but only True values are evaluated. False values are equivalent to missing indices. To treat a boolean series as a normal indexer series, as described above, one can use .aloc(usebool=False)[boolean pd.Series].
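The difference between a plain series indexer and a boolean series indexer can be sketched for one column in plain Python. The indexer is modeled as a dict from label to value; aloc_series_rows is a hypothetical helper, and its usebool parameter only mirrors the option named above.

```python
def aloc_series_rows(column, indexer, usebool=True):
    """Select rows of one column by aligning with an indexer series."""
    is_boolean = all(isinstance(v, bool) for v in indexer.values())
    if usebool and is_boolean:
        # Boolean series: False behaves like a missing label.
        keep = {idx for idx, flag in indexer.items() if flag}
    else:
        # Plain series: values are ignored, only the index counts.
        keep = set(indexer)
    return {idx: val for idx, val in column.items() if idx in keep}


column = {1: 50, 2: 60, 3: 70}
indexer = {1: True, 2: False, 9: True}
print(aloc_series_rows(column, indexer))                 # {1: 50}
print(aloc_series_rows(column, indexer, usebool=False))  # {1: 50, 2: 60}
```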
Special 2D-indexers are:
- .aloc[boolean dios-like]: works the same as di[boolean dios-like] (see there). In brief: full align, select items where the index is present and the value is True.
- .aloc[dios-like, ...] (with Ellipsis): Align in columns and rows, ignore the values. Per common column, the common indices are selected. The ellipsis forces aloc to ignore the values, so a boolean dios can be treated as a non-boolean one. Alternatively, .aloc(usebool=False)[boolean dios-like] could be used. [2]
- .aloc[nested list-like]: The inner lists are used as aloc list row indexers (see there) on all columns, one list per column, which implies that the outer list has the same length as the number of columns.
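The nested list-like indexer can be modeled in plain Python as one row list per column. This is only a sketch of the description above; aloc_nested is a hypothetical helper, and raising ValueError on a length mismatch is an assumption of this sketch, not documented dios behavior.

```python
def aloc_nested(data, nested_rows):
    """Apply one list of row labels to each column, in column order."""
    if len(nested_rows) != len(data):
        # Assumption: the real library may report this differently.
        raise ValueError("need exactly one inner list per column")
    result = {}
    for (col, series), rows in zip(data.items(), nested_rows):
        wanted = set(rows)
        # Labels that are not in the column's index are ignored.
        result[col] = {idx: val for idx, val in series.items() if idx in wanted}
    return result


d = {"a": {0: 0.0, 1: 70.0, 2: 140.0}, "b": {1: 50, 2: 60, 3: 70}}
print(aloc_nested(d, [[0, 2], [3, 99]]))
# {'a': {0: 0.0, 2: 140.0}, 'b': {3: 70}}
```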
special handling of 1D-values
Values that are list- or array-like, which includes pd.Series, are set on all selected columns. A pd.Series aligns as in s1.loc[:] = s2.
Examples:
>>> d
a | b |
======== | ===== |
0 0.0 | 1 50 |
1 70.0 | 2 60 |
2 140.0 | 3 70 |
>>> d.aloc[[1,2]]
a | b |
======== | ===== |
1 70.0 | 1 50 |
2 140.0 | 2 60 |
>>> d.aloc[d>60]
a | b |
======== | ===== |
1 70.0 | 3 70 |
2 140.0 | |
>>> d2 = d.copy()
>>> d2.aloc[d>60] = 10
>>> d2
a | b |
======= | ===== |
0 0.0 | 1 50 |
1 10.0 | 2 60 |
2 10.0 | 3 10 |
>>> d.aloc[[2,12,0,'foo'], ['a', 'x', 99, None, 99]]
a |
======== |
0 0.0 |
2 140.0 |
>>> s=pd.Series(index=[1,11,111,1111])
>>> s
1 NaN
11 NaN
111 NaN
1111 NaN
dtype: float64
>>> d.aloc[s]
a | b |
======= | ===== |
1 70.0 | 1 50 |
>>> d.aloc['foobar']
Empty DictOfSeries
Columns: ['a', 'b']
>>> d.aloc[d,...] # (equal to use) d.aloc(usebool=False)[d]
a | b |
======== | ===== |
0 0.0 | 1 50 |
1 70.0 | 2 60 |
2 140.0 | 3 70 |
>>> d.aloc[d]
Traceback (most recent call last):
File ...bad..stuff...
ValueError: Must pass dios-like key with boolean values only if passed as single indexer
>>> b = d.astype(bool)
>>> b['b'] = False
>>> b
a | b |
======== | ======== |
0 False | 1 False |
1 True | 2 False |
2 True | 3 False |
>>> d.aloc[b] # (equal to use) d[b]
a | b |
======== | ======= |
1 70.0 | no data |
2 140.0 | |
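The assignment example d2.aloc[d>60] = 10 from above can be reproduced with a plain-Python model, with each column as a dict from index label to value. This is a sketch of the semantics shown in the example, not the dios implementation; set_where is a hypothetical helper.

```python
def set_where(data, mask, value):
    """Return a copy of `data` with `value` written wherever `mask` is True."""
    out = {col: dict(series) for col, series in data.items()}
    for col, series in out.items():
        col_mask = mask.get(col, {})
        for idx in series:
            if col_mask.get(idx, False):
                series[idx] = value  # scalar is broadcast to every hit
    return out


d = {"a": {0: 0.0, 1: 70.0, 2: 140.0}, "b": {1: 50, 2: 60, 3: 70}}
mask = {col: {idx: val > 60 for idx, val in s.items()} for col, s in d.items()}
d2 = set_where(d, mask, 10)
print(d2)  # {'a': {0: 0.0, 1: 10, 2: 10}, 'b': {1: 50, 2: 60, 3: 10}}
```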
Properties
- columns
- indexes (series of the indexes of all series)
- lengths (series of the lengths of all series)
- values (not fully pd-like: np.array of the values of all series)
- dtypes
- itype (see section Itype)
- empty
- size
Methods and implied features
These work mostly like the analogous methods of pd.DataFrame.
- copy()
- copy_empty()
- all()
- any()
- squeeze()
- to_df()
- to_string()
- apply()
- astype()
- isna()
- notna()
- dropna()
- memory_usage()
- index_of()
- in
- is
- len(Dios)
Operators and Comparators
- arithmetical: + - * ** // / % and abs()
- boolean: & ^ | ~
- comparators: == != > >= < <=
Itype
DictOfSeries holds multiple series, and each series can have a different index length and index type. Differing index lengths are either resolved by some aligning magic, or simply fail if aligning makes no sense (e.g. assigning the very same list to series of different lengths, see .aloc).
A bigger challenge is the type of the index. If one series has an alphabetical index and another one a numeric index, selecting along columns can fail in every scenario. To keep track of the index types, or to prohibit inserting an index of a non-fitting type, we introduce the itype. It can be set when a dios is created and changed later during usage. When the itype changes, all indexes of all series in the dios are cast to a new fitting type, if possible. Different cast mechanisms are available.
If an itype prohibits certain index types and a series with a non-fitting index type is inserted, an implicit type cast is done (with or without a warning) or an error is raised. The warning/error policy can be adjusted via global options.
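The cast-or-fail behavior can be sketched in plain Python for a single index. Everything here is hypothetical: cast_index, the policy values "ignore", "warn" and "err", and the use of a plain Python type as stand-in for an itype are assumptions of this sketch, not the dios API.

```python
import warnings


def cast_index(labels, target, policy="warn"):
    """Make all labels fit `target` (e.g. int or str), following a policy."""
    if all(isinstance(label, target) for label in labels):
        return list(labels)  # already fits, nothing to do
    if policy == "err":
        raise TypeError(f"index does not fit itype {target.__name__}")
    try:
        cast = [target(label) for label in labels]
    except (TypeError, ValueError) as exc:
        # The implicit cast is impossible, so fail regardless of policy.
        raise TypeError(f"cannot cast index to {target.__name__}") from exc
    if policy == "warn":
        warnings.warn(f"index was implicitly cast to {target.__name__}")
    return cast


print(cast_index([1, 2, 3], int))                     # [1, 2, 3]
print(cast_index(["1", "2"], int, policy="ignore"))   # [1, 2]
```

With policy="err" the second call would raise instead of casting, which corresponds to the strictest setting described above.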
Have fun :)