mask() vs select()
I just stumbled upon the saqc function mask(). It works the opposite way to what the word "mask" suggests: it selects data instead of masking it. See also pd.Series.mask(), which replaces the data with NaN where the condition is True, and pd.Series.where(), which does the opposite (it keeps the data where the condition is True).
So I suggest renaming mask() to select() and writing a mask function that does the actual masking.
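To make the proposed naming concrete, here is a minimal sketch on plain pandas Series. The helpers `select` and `mask` below are hypothetical illustrations, not the saqc API; they only show which behaviour each name would suggest.

```python
import pandas as pd


def select(data: pd.Series, cond: pd.Series) -> pd.Series:
    """Hypothetical helper: keep entries where cond is True, set the rest to NaN."""
    return data.where(cond)


def mask(data: pd.Series, cond: pd.Series) -> pd.Series:
    """Hypothetical helper: set entries where cond is True to NaN, keep the rest."""
    return data.mask(cond)


s = pd.Series([1, 2, 3])
cond = s == 2

selected = select(s, cond)  # only index 1 keeps its value
masked = mask(s, cond)      # only index 1 is replaced by NaN
```

With these names, the condition means the same thing in both calls, and the function name alone says what happens to the matching entries.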
I guess the confusion comes from the following workflow:
mask = df > 42
df[mask] = 'new_value'
But this is just a common convention for __getitem__ and __setitem__, because it is easier to handle (and to understand) to write something like df[df>42] = 0 instead of df[~(df>42)] = 0. In fact, [] acts on the entries where the mask is True, that is, it inverts the mask before applying it; in other words it does a where() and not a mask(). It gets clearer with an example:
>>> import pandas as pd
>>> s = pd.Series([1,2,3])
>>> cond = s == 2
>>> cond
0    False
1     True
2    False
dtype: bool
>>> s.mask(cond) # as expected
0    1.0
1    NaN
2    3.0
dtype: float64
>>> s.where(cond)
0    NaN
1    2.0
2    NaN
dtype: float64
>>> s[cond]
1    2
dtype: int64
>>> s[cond] = 99
>>> s
0     1
1    99
2     3
dtype: int64
But if we call cond a mask, it gets confusing, because s[mask] actually does a selection, or an inverted masking.
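The relation between indexing, where() and mask() can be checked directly. This is a small sketch that uses only the pandas calls from the example above plus dropna() and astype(); the variable names are chosen for illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3])
cond = s == 2

# Indexing with a boolean Series keeps the True entries, like where() ...
via_getitem = s[cond]
via_where = s.where(cond).dropna().astype(s.dtype)

# ... while mask() keeps exactly the entries that s[cond] drops.
via_inverted = s[~cond]
via_mask = s.mask(cond).dropna().astype(s.dtype)

print(via_getitem.equals(via_where))
print(via_inverted.equals(via_mask))
```

Both comparisons print True: to reproduce mask() with [] the condition has to be negated first.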