  •    ff = int(n * fill_factor)
       s.iloc[:ff] = list(range(ff))

    Is this an adequate representation of our flags? Even with i = 4 and fill_factor = 0.01 this produces (10 ** 4) * 0.01 = 100 unique flag values. That is already a lot, considering that we usually only use BAD and UNFLAGGED.
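    For reference, a minimal sketch of how such a test series could be built (make_series is a hypothetical helper; the float64 dtype and the -inf fill value standing in for UNFLAGGED are assumptions):

    import pandas as pd

    def make_series(n, fill_factor):
        # constant series, mimicking an all-UNFLAGGED flags column
        s = pd.Series(-float("inf"), index=range(n))
        # overwrite the first ff entries with unique values to control cardinality
        ff = int(n * fill_factor)
        s.iloc[:ff] = list(range(ff))
        return s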

    import pickle

    def get_size(obj):
        # size of the pickled byte stream as a proxy for the memory footprint
        return len(pickle.dumps(obj))

    There is also Series.memory_usage, which produces:

    n \ fill_factor    0.00    0.01    0.10    0.25    0.50    1.00
    100              0.3700  0.3650  0.2000 -0.0350 -1.1275 -2.6675
    1000             0.4308  0.4137  0.1270 -0.2685 -0.9095 -2.1915
    10000            0.4368  0.4064  0.1183 -0.1630 -0.7008 -1.7764
    100000           0.4374  0.3493  0.1599 -0.0803 -0.6605 -1.5710
    1000000          0.4375  0.3535  0.0679 -0.4034 -1.0568 -2.3635
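    Assuming the table entries are the relative size difference between the plain float series and its categorical counterpart (positive values meaning the categorical is smaller), such a grid could be produced roughly like this, reusing make_series and get_size from above (the exact metric is an assumption):

    import pandas as pd

    sizes = {}
    for n in (100, 1000, 10_000, 100_000, 1_000_000):
        row = {}
        for fill_factor in (0.0, 0.01, 0.1, 0.25, 0.5, 1.0):
            s = make_series(n, fill_factor)   # float64 baseline
            c = s.astype("category")          # categorical counterpart
            # relative saving: positive means the categorical is smaller
            row[fill_factor] = (get_size(s) - get_size(c)) / get_size(s)
        sizes[n] = row

    print(pd.DataFrame(sizes).T)  # rows: n, columns: fill_factor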
  • Is this an adequate representation of our flags?

    This is more of a general estimate of the gain from using categoricals. I guess what comes closest to our Flags is the first column (2 uniques (BAD/UNFLAGGED) is close enough to 0 uniques; the difference will be just a few bytes).
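    For a quick check of that two-uniques case, something like this would do (the concrete flag values are placeholders):

    import numpy as np
    import pandas as pd

    UNFLAGGED, BAD = -np.inf, 255.0  # placeholder flag values
    s = pd.Series(np.where(np.arange(10_000) % 2, BAD, UNFLAGGED))
    print(get_size(s), get_size(s.astype("category")))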

    There is also Series.memory_usage

    Yeah, I know, but I had some problems with that in the past. If I remember correctly, this was mostly with nested data, so maybe memory_usage would be the more accurate way, IDK. Counting the bytes of a pickled byte stream is for sure an upper bound.
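    As a sanity check, both measures could be printed side by side; memory_usage(deep=True) also counts the contents of object-dtype values, while the pickle length additionally includes serialization overhead (a sketch reusing the helpers from above):

    s = make_series(10_000, 0.01)
    c = s.astype("category")
    for name, obj in (("float", s), ("categorical", c)):
        print(name, get_size(obj), obj.memory_usage(deep=True, index=False))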

    To me it looks a bit odd to see a >200% increase in memory (see column 1.00) when using categories instead of floats, but IDK. Pandas states [1] that this might happen, though.

    [1] https://pandas.pydata.org/docs/user_guide/categorical.html#categorical-memory
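    The all-unique case is the worst case for a categorical: besides an integer code per row, it stores every unique value once more in the categories index, so it carries both buffers. A minimal sketch illustrating that:

    import numpy as np
    import pandas as pd

    n = 1_000
    s = pd.Series(np.arange(n, dtype=float))  # all values unique
    c = s.astype("category")
    # float64 buffer: n * 8 bytes; categorical: n integer codes plus n float64 categories
    print(s.memory_usage(index=False), c.memory_usage(index=False))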
