  •    ff = int(n * fill_factor)
       s.iloc[:ff] = list(range(ff))

    Is this an adequate representation of our flags? Even with i = 4 and fill_factor = 0.01 this produces (10 ** 4) * 0.01 = 100 unique flag values. That is already a lot, considering that we usually only use BAD and UNFLAGGED.
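    For reference, a minimal sketch of how such a test series could be built (make_series is a hypothetical helper; the float64 dtype and the -inf fill value standing in for UNFLAGGED are assumptions):

    import pandas as pd

    def make_series(n, fill_factor):
        # constant series, mimicking an all-UNFLAGGED flags column
        s = pd.Series(-float("inf"), index=range(n))
        # overwrite the first ff entries with unique values to control cardinality
        ff = int(n * fill_factor)
        s.iloc[:ff] = list(range(ff))
        return s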

    import pickle

    def get_size(obj):
        # size of the pickled byte stream as a proxy for the memory footprint
        return len(pickle.dumps(obj))

    There is also Series.memory_usage, which produces:

    n \ fill_factor    0.00    0.01    0.10    0.25    0.50    1.00
    100              0.3700  0.3650  0.2000 -0.0350 -1.1275 -2.6675
    1000             0.4308  0.4137  0.1270 -0.2685 -0.9095 -2.1915
    10000            0.4368  0.4064  0.1183 -0.1630 -0.7008 -1.7764
    100000           0.4374  0.3493  0.1599 -0.0803 -0.6605 -1.5710
    1000000          0.4375  0.3535  0.0679 -0.4034 -1.0568 -2.3635
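    Assuming the table entries are the relative size difference between the plain float series and its categorical counterpart (positive values meaning the categorical is smaller), such a grid could be produced roughly like this, reusing make_series and get_size from above (the exact metric is an assumption):

    import pandas as pd

    sizes = {}
    for n in (100, 1000, 10_000, 100_000, 1_000_000):
        row = {}
        for fill_factor in (0.0, 0.01, 0.1, 0.25, 0.5, 1.0):
            s = make_series(n, fill_factor)   # float64 baseline
            c = s.astype("category")          # categorical counterpart
            # relative saving: positive means the categorical is smaller
            row[fill_factor] = (get_size(s) - get_size(c)) / get_size(s)
        sizes[n] = row

    print(pd.DataFrame(sizes).T)  # rows: n, columns: fill_factor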
  • Is this an adequate representation of our flags?

    This is more of a general estimate of the gain from using categoricals. I guess what comes closest to our Flags is the first column (2 uniques (BAD/UNFLAGGED) is close enough to 0 uniques; the difference will be just a few bytes).
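    For a quick check of that two-uniques case, something like this would do (the concrete flag values are placeholders):

    import numpy as np
    import pandas as pd

    UNFLAGGED, BAD = -np.inf, 255.0  # placeholder flag values
    s = pd.Series(np.where(np.arange(10_000) % 2, BAD, UNFLAGGED))
    print(get_size(s), get_size(s.astype("category")))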

    There is also Series.memory_usage

    Yeah, I know, but I had some problems with that in the past. If I remember correctly, this was mostly with nested data, so maybe memory_usage would be the more accurate way, IDK. Counting the bytes of a pickled byte stream is for sure an upper bound.
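    As a sanity check, both measures could be printed side by side; memory_usage(deep=True) also counts the contents of object-dtype values, while the pickle length additionally includes serialization overhead (a sketch reusing the helpers from above):

    s = make_series(10_000, 0.01)
    c = s.astype("category")
    for name, obj in (("float", s), ("categorical", c)):
        print(name, get_size(obj), obj.memory_usage(deep=True, index=False))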

    To me it looks a bit odd to see a >200% increase in memory (see column 1.00) when using categories instead of floats, but IDK. Pandas states [1] that this might happen, though.

    [1] https://pandas.pydata.org/docs/user_guide/categorical.html#categorical-memory
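    The all-unique case is the worst case for a categorical: besides an integer code per row, it stores every unique value once more in the categories index, so it carries both buffers. A minimal sketch illustrating that:

    import numpy as np
    import pandas as pd

    n = 1_000
    s = pd.Series(np.arange(n, dtype=float))  # all values unique
    c = s.astype("category")
    # float64 buffer: n * 8 bytes; categorical: n integer codes plus n float64 categories
    print(s.memory_usage(index=False), c.memory_usage(index=False))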
