DictOfSeries (soon renamed to SoS?)

A DictOfSeries is a pd.Series of pd.Series objects that aims to behave as much like a pandas DataFrame as possible.

Nomenclature

  • pd: pandas
  • series/ser: instance of pd.Series
  • dios: instance of DictOfSeries
  • df: instance of pd.DataFrame
  • dios-like: a dios or a df
  • alignable object: a dios, df or a series

Features

  • every column has its own index
  • uses much less memory than a misaligned pd.DataFrame
  • behaves quite like a pd.DataFrame
  • additional align locator (.aloc[])
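
To illustrate the memory point, here is a minimal sketch that uses plain pandas only (no dios calls). Building a pd.DataFrame from series with disjoint indexes pads every column to the union index with NaN, whereas a plain dict of series, used here as a stand-in for per-column indexes, keeps only the values that exist. All variable names are illustrative.

```python
import pandas as pd

# Two series that share no index labels.
a = pd.Series([66.0, 66.0], index=[0, 1])
b = pd.Series([77.0, 77.0], index=[2, 3])

# A DataFrame aligns both columns to the union of the indexes
# and pads the gaps with NaN.
df = pd.DataFrame({"a": a, "b": b})
n_cells = df.shape[0] * df.shape[1]
n_padding = int(df.isna().sum().sum())

# A plain dict of series stores each column with its own index.
columns = {"a": a, "b": b}
n_stored = sum(len(ser) for ser in columns.values())

print(n_cells, n_padding, n_stored)
```

Half of the cells in the misaligned frame are padding, and the share grows with every additional column that brings its own index.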

Indexing

pandas-like indexing

[], .loc[], .iloc[], .at[] and .iat[] should behave exactly like their counterparts from pd.DataFrame. They accept the following indexers:

  • lists, array-like objects and in general all iterables
  • boolean lists and iterables
  • slices
  • scalars and any hashable object

Most indexers are passed directly to the underlying column series or row series, depending on the position of the indexer and the complexity of the operation. For .loc, .iloc, .at and .iat, the first position is the row indexer and the second is the column indexer. The second can be omitted and then defaults to slice(None). Examples:

  • di.loc[[1,2,3], ['a']] : select labels 1,2,3 from column a
  • di.iloc[[1,2,3], [0,3]] : select positions 1,2,3 from the columns 0 and 3
  • di.loc[:, 'a':'c'] : select all rows from columns a to c (label slices include both endpoints)
  • di.at[4,'c'] : select the element with label 4 in column c
  • di.loc[:] -> di.loc[:,:] : select everything

Scalar indexing always returns a pandas Series if the other indexer is a non-scalar. If both indexers are scalars, the element itself is returned. In all other cases a dios is returned. For more pandas-like indexing magic and the differences between the indexers, see the pandas documentation.
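
Since these indexers are meant to mirror pd.DataFrame, the calls from the examples can be tried on a plain DataFrame first. The sketch below uses pandas only; the frame and its values are made up for illustration, and whether a dios matches every detail is not verified here.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": range(10), "b": range(10, 20), "c": range(20, 30), "d": range(30, 40)}
)

by_label = df.loc[[1, 2, 3], ["a"]]       # labels 1, 2, 3 from column a
by_position = df.iloc[[1, 2, 3], [0, 3]]  # positions 1, 2, 3 from columns 0 and 3
label_slice = df.loc[:, "a":"c"]          # label slices include both endpoints
element = df.at[4, "c"]                   # two scalars: the element itself
column_part = df.loc[[1, 2, 3], "a"]      # scalar plus non-scalar: a pd.Series
```

The last two lines show the return-type rule: two scalars give the element, a scalar combined with a list gives a pd.Series.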

2D-indexer

dios[boolean dios-like] (as single key) - a dios accepts a boolean 2D-indexer (a boolean pd.DataFrame or a boolean dios).

Columns and rows of the indexer are aligned with the dios. This means that only matching columns are selected, and within these columns only rows where i) the index labels match and ii) the value in the boolean indexer is True. An index label that is missing from the indexer is treated the same as a label that is present with the value False.

Values from unselected rows and columns are dropped, but empty columns are preserved, so the resulting dios always has the same columns as the initial dios. This is the same behavior as pd.DataFrame's handling of a 2D-indexer, except that pd.DataFrame fills np.nan at the missing locations and therefore also fills entirely unselected columns with NaNs.
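
The difference to pandas can be sketched with a toy model. The helper select_where_true below is hypothetical and not part of dios; it only imitates the selection rule described above on a plain dict of pd.Series, next to the NaN-padding result of pd.DataFrame.

```python
import pandas as pd

def select_where_true(columns, mask):
    """Toy model of dios[boolean dios-like] on a dict of pd.Series."""
    out = {}
    for name, ser in columns.items():
        if name in mask:
            # Labels missing from the mask count as False.
            keep = mask[name].reindex(ser.index, fill_value=False).astype(bool)
            out[name] = ser[keep]
        else:
            # Unmatched columns are kept, but empty.
            out[name] = ser.iloc[:0]
    return out

cols = {
    "a": pd.Series([0.0, 70.0, 140.0], index=[0, 1, 2]),
    "b": pd.Series([50.0, 60.0, 70.0], index=[1, 2, 3]),
}
mask = {"a": cols["a"] > 60, "b": cols["b"] > 100}
picked = select_where_true(cols, mask)

# pandas, for comparison: same shape as the input, NaN where not selected.
df = pd.DataFrame(cols)
df_mask = pd.DataFrame({"a": df["a"] > 60, "b": df["b"] > 100})
padded = df[df_mask]
```

In the toy model column b survives as an empty column, while the DataFrame keeps all four rows and fills column b entirely with NaN.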

setting values

Setting values with [], .loc[], .iloc[], .at[] and .iat[] works like in pandas. With .at/.iat only single items can be set; for the others the right-hand side value can be:

  • scalars: these are broadcast to the selected positions
  • nested lists: the length of the outer list must match the number of indexed columns, and the lengths of the inner lists must match the number of selected rows.
  • dios: the number of columns must match the number of indexed columns. Columns do not align, they are just iterated. Rows do align. Rows that are present on the right but not on the left are ignored. Rows that are present on the left (bear in mind: these rows were explicitly chosen for writing!) but not on the right are filled with NaNs, like in pandas.
  • normal lists: the column key must be a scalar(!); the list is passed down and set with loc, iloc or [] by the pandas Series.
  • pd.Series: the column indexer must be a scalar(!); the series is passed down and set with loc, iloc or [] by the pandas Series, where it may be aligned, depending on the method.

Examples:

  • dios.loc[2:5, 'a'] = [1,2,3] is the same as a=dios['a']; a.loc[2:5]=[1,2,3]; dios['a']=a
  • dios.loc[2:5, :] = 99 : set 99 on rows 2 to 5 on all columns
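
Both examples can be spelled out on the column series themselves. The following is a sketch with plain pandas, where a dict of pd.Series stands in for the dios; the index values are chosen so that the label slice 2:5 selects exactly three rows of column a.

```python
import pandas as pd

columns = {
    "a": pd.Series(0.0, index=[0, 2, 4, 5, 7]),
    "b": pd.Series(0.0, index=[1, 2, 3]),
}

# dios.loc[2:5, 'a'] = [1, 2, 3], spelled out on the column series.
# The label slice 2:5 includes both endpoints; here it selects labels 2, 4, 5.
a = columns["a"]
a.loc[2:5] = [1.0, 2.0, 3.0]
columns["a"] = a
after_list = columns["a"].tolist()

# dios.loc[2:5, :] = 99, spelled out: the scalar is broadcast per column.
for ser in columns.values():
    ser.loc[2:5] = 99.0
```

Note that the list must have as many items as the slice selects rows in that column, which is why a list can only be set on a single column.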

the special indexer .aloc

In addition to the pandas-like indexers there is .aloc[..] (align locator). Unlike with .iloc and .loc, indexers fully align if possible, and 1D-array-likes can be broadcast to multiple columns at once. This method also handles missing indexer items gracefully. It is used like .loc, so either a single indexer (.aloc[indexer]) or a tuple of row indexer and column indexer (.aloc[row-indexer, column-indexer]) can be given. Unlike the other indexer methods, .aloc always returns a dios!

A 2D-indexer (like a dios or df) can only be passed as a single key, like .aloc[2D-indexer], or with an Ellipsis as column indexer, like .aloc[2D-indexer, ...]. The behavior may differ between these two forms, as explained further below.

If a normal (non-2D) row indexer is given but no column indexer, the latter defaults to : (aka slice(None)), so .aloc[row-indexer] becomes .aloc[row-indexer, :], which means that all columns are used. In general, a normal row indexer is applied to every column chosen by the column indexer, but to each column separately.

Example:

>>> d
    a |     b |     c |     d | 
===== | ===== | ===== | ===== | 
0  66 | 2  77 | 0  88 | 1  99 | 
1  66 | 3  77 | 1  88 | 2  99 | 


>>> d.aloc[[1,2], ['a', 'b', 'd']]
    a |     b |     d | 
===== | ===== | ===== | 
1  66 | 2  77 | 1  99 | 
      |       | 2  99 | 
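
The per-column behavior of the example above can be imitated with a small toy model. Here nested dicts stand in for the columns and their indexes, and aloc_rows_cols is a hypothetical helper, not the dios implementation.

```python
def aloc_rows_cols(columns, rows, cols):
    """Toy model of .aloc[list, list]: keep only labels that exist."""
    wanted_rows = set(rows)
    out = {}
    for name, col in columns.items():  # keeps the original column order
        if name not in cols:
            continue  # unselected columns are skipped
        # The row list is applied to this column on its own;
        # labels that the column does not have are ignored.
        out[name] = {k: v for k, v in col.items() if k in wanted_rows}
    return out

d = {
    "a": {0: 66, 1: 66},
    "b": {2: 77, 3: 77},
    "c": {0: 88, 1: 88},
    "d": {1: 99, 2: 99},
}

result = aloc_rows_cols(d, [1, 2], ["a", "b", "d"])
with_unknown = aloc_rows_cols(d, [2, 12, 0, "foo"], ["a", "x", 99, None, 99])
```

Unknown row and column labels simply drop out instead of raising a KeyError, which is the main difference to .loc.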

The .aloc-specific indexers are listed below. Any indexer that is not listed (slices, boolean lists, ...) is treated as if it were passed to .loc (in fact, it really is passed to .loc under the hood).

Special column indexers are:

  • list / array-like (or any iterable object): Only labels that are present in the columns are used, others are ignored. A dios is returned.
  • pd.Series : .values are taken from series and handled like a list. A dios is returned.
  • scalar (or any hashable obj) : Select a single column, if label is present, otherwise nothing. [1]

Special row indexers are:

  • list / array-like (or any iterable object): Only rows whose index labels are present in the index of the column are used, others are ignored. A dios is returned.
  • scalar (or any hashable obj) : Select a single row from a column if the label is present in the index of the column, otherwise nothing is selected. [1]
  • pd.Series : Align the index of the given series with the column, which means only common index labels are used. The actual values of the series are ignored(!).

Special 2D-indexers are:

  • .aloc[boolean dios-like] : works the same as di[boolean dios-like] (see there). In brief: full align, select where the index is present and the value is True.
  • .aloc[dios-like, ...] (with Ellipsis) : Align in columns and rows, ignore values. Per common column, the common indices are selected. The Ellipsis forces the values to be ignored, so even a boolean dios can be given without its values being taken into account. [2]
  • .aloc[nested list-like] : The inner lists are used as aloc list row indexers (see there) on all columns, one list per column, which implies that the outer list must have the same length as the number of columns.
  • .aloc(booldios=True)[boolean dios-like] : alias for .aloc[boolean dios-like]
  • .aloc(booldios=False)[dios-like] : alias for .aloc[dios-like, ...]
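
The two ways of passing a dios-like can be contrasted in a toy model. Nested dicts again stand in for columns and indexes, and both helpers are hypothetical. Keeping columns without a match as empty columns in the Ellipsis case is an assumption made here by analogy with the boolean case.

```python
def aloc_bool(columns, mask):
    """Toy model of .aloc[boolean dios-like]: align, keep where the mask is True."""
    return {
        name: {k: v for k, v in col.items() if mask.get(name, {}).get(k, False)}
        for name, col in columns.items()
    }

def aloc_ellipsis(columns, other):
    """Toy model of .aloc[dios-like, ...]: align only, ignore the values."""
    return {
        name: {k: v for k, v in col.items() if k in other.get(name, {})}
        for name, col in columns.items()
    }

d = {"a": {0: 0.0, 1: 70.0, 2: 140.0}, "b": {1: 50, 2: 60, 3: 70}}
b = {
    "a": {0: False, 1: True, 2: True},
    "b": {1: False, 2: False, 3: False},
}

by_value = aloc_bool(d, b)         # only rows where b is True
by_labels = aloc_ellipsis(d, b)    # every row whose label occurs in b
```

With the same indexer b, the boolean form selects two rows of column a and nothing from column b, while the Ellipsis form returns everything, because every label of d also occurs in b.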

special handling of 1D-values

Values that are list- or array-like, which includes pd.Series, are set on all selected columns. A pd.Series aligns like s1.loc[:] = s2 does.
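
For reference, this is what the s1.loc[:] = s2 alignment looks like in plain pandas. The sketch uses float series so that the NaN fits the dtype; the values are made up.

```python
import pandas as pd

s1 = pd.Series([0.0, 70.0, 140.0], index=[0, 1, 2])
s2 = pd.Series([1.0, 2.0, 3.0], index=[1, 2, 3])

# The right-hand side is aligned to the index of s1 before assignment:
# labels 1 and 2 are taken from s2, label 0 has no partner and becomes NaN,
# and label 3 of s2 is dropped because s1 has no such row.
s1.loc[:] = s2
```

The index of s1 is unchanged afterwards; only its values are replaced.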

Examples:

>>> d
       a |     b | 
======== | ===== | 
0    0.0 | 1  50 | 
1   70.0 | 2  60 | 
2  140.0 | 3  70 | 


>>> d.aloc[[1,2]]
       a |     b | 
======== | ===== | 
1   70.0 | 1  50 | 
2  140.0 | 2  60 |  


>>> d.aloc[d>60]
       a |     b | 
======== | ===== | 
1   70.0 | 3  70 | 
2  140.0 |       | 


>>> d2 = d.copy()
>>> d2.aloc[d>60] = 10
>>> d2
      a |     b | 
======= | ===== | 
0   0.0 | 1  50 | 
1  10.0 | 2  60 | 
2  10.0 | 3  10 | 


>>> d.aloc[[2,12,0,'foo'], ['a', 'x', 99, None, 99]]
       a | 
======== | 
0    0.0 | 
2  140.0 | 


>>> s = pd.Series(index=[1,11,111,1111], dtype=float)
>>> s
1     NaN
11    NaN
111   NaN
1111  NaN
dtype: float64


>>> d.aloc[s]
      a |     b | 
======= | ===== | 
1  70.0 | 1  50 | 


>>> d.aloc['foobar']
Empty DictOfSeries
Columns: ['a', 'b']


>>> d.aloc[d,...]   # equivalent to d.aloc(booldios=False)[d]
       a |     b | 
======== | ===== | 
0    0.0 | 1  50 | 
1   70.0 | 2  60 | 
2  140.0 | 3  70 | 


>>> d.aloc[d]
Traceback (most recent call last):
  File ...bad..stuff...
ValueError: Must pass dios-like key with boolean values only if passed as single indexer


>>> b = d.astype(bool)
>>> b['b'] = False
>>> b
       a |        b | 
======== | ======== | 
0  False | 1  False | 
1   True | 2  False | 
2   True | 3  False | 


>>> d.aloc[b]   # equivalent to d[b]
       a |       b | 
======== | ======= | 
1   70.0 | no data | 
2  140.0 |         | 

Properties

  • columns
  • indexes (series of the indexes of all series)
  • lengths (series of the lengths of all series)
  • values (not fully pd-like - np.array of the values of each series)
  • dtypes
  • itype (see section Itype)
  • empty
  • size

Methods and implied features

These mostly work like the analogous methods of pd.DataFrame.

  • copy()
  • copy_empty()
  • all()
  • any()
  • squeeze()
  • to_df()
  • to_string()
  • apply()
  • astype()
  • isna()
  • notna()
  • dropna()
  • memory_usage()
  • index_of()
  • in
  • is
  • len(Dios)

Operators and Comparators

  • arithmetical: + - * ** // / % and abs()
  • boolean: &^|~
  • comparators: == != > >= < <=

Itype

DictOfSeries holds multiple series, and each series can have a different index length and index type. Differing index lengths are either resolved by some aligning magic or simply fail if aligning makes no sense (e.g. assigning the very same list to series of different lengths, see .aloc).

A bigger challenge is the type of the index. If one series has an alphabetical index and another one a numeric index, selecting along columns can fail in every scenario. To keep track of the index types, or to prohibit inserting a non-fitting index type, we introduce the itype. It can be set on creation of a dios and also changed during usage. When the itype changes, all indexes of all series in the dios are cast to a new fitting type, if possible. Different cast mechanisms are available.

If an itype prohibits certain index types and a series with a non-fitting index type is inserted, either an implicit type cast is done (with or without a warning) or an error is raised. The warning/error policy can be adjusted via global options.
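
The cast-or-fail idea can be sketched in a few lines of plain Python. Everything here is hypothetical: cast_labels and the policy names "ignore", "warn" and "err" are made up for illustration and are not the dios options or its cast mechanisms.

```python
import warnings

def cast_labels(labels, wanted_type, policy="warn"):
    """Hypothetical helper: make all index labels instances of wanted_type."""
    if all(isinstance(k, wanted_type) for k in labels):
        return list(labels)  # already fitting, nothing to do
    if policy == "err":
        raise TypeError(f"index labels do not fit {wanted_type.__name__}")
    if policy == "warn":
        warnings.warn(f"casting index labels to {wanted_type.__name__}")
    # Implicit cast; this itself raises if a label cannot be converted.
    return [wanted_type(k) for k in labels]

numeric = cast_labels([1, 2, 3], float, policy="ignore")
from_text = cast_labels(["1", "2"], int, policy="ignore")
```

With policy="err" the same call refuses the non-fitting labels instead of converting them, which corresponds to the error branch described above.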

Have fun :)