```python
ff = int(n * fill_factor)
s.iloc[:ff] = list(range(ff))
```
Is this an adequate representation of our flags? Even with `i = 4` and `fill_factor = 0.01`, this produces `(10 ** 4) * 0.01 = 100` unique flag values. That is already a lot, considering that we usually only use `BAD` and `UNFLAGGED`.

```python
import pickle

def get_size(obj):
    return len(pickle.dumps(obj))
```
There is also `Series.memory_usage`, which produces:

| n \ fill_factor | 0.00 | 0.01 | 0.10 | 0.25 | 0.50 | 1.00 |
|---:|---:|---:|---:|---:|---:|---:|
| 100 | 0.3700 | 0.3650 | 0.2000 | -0.0350 | -1.1275 | -2.6675 |
| 1000 | 0.4308 | 0.4137 | 0.1270 | -0.2685 | -0.9095 | -2.1915 |
| 10000 | 0.4368 | 0.4064 | 0.1183 | -0.1630 | -0.7008 | -1.7764 |
| 100000 | 0.4374 | 0.3493 | 0.1599 | -0.0803 | -0.6605 | -1.5710 |
| 1000000 | 0.4375 | 0.3535 | 0.0679 | -0.4034 | -1.0568 | -2.3635 |
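A sketch of how such numbers can be produced with `Series.memory_usage` (the exact values depend on the index and on whether sizes come from `memory_usage` or a pickle stream; the helper name `gain` and the reading "gain = 1 − categorical/float, positive means the categorical is smaller" are assumptions, not taken from the thread):

```python
import numpy as np
import pandas as pd

def gain(n, fill_factor):
    # Series with n * fill_factor unique float "flag" values, rest NaN.
    s = pd.Series(np.nan, index=range(n))
    ff = int(n * fill_factor)
    s.iloc[:ff] = list(range(ff))
    as_float = s.memory_usage(deep=True)
    as_cat = s.astype("category").memory_usage(deep=True)
    # Assumed reading of the table: positive = categorical is smaller.
    return 1 - as_cat / as_float

for f in (0.0, 0.01, 0.10, 1.0):
    print(f"fill_factor={f}: {gain(10_000, f):+.4f}")
```

The sign pattern matches the table: few uniques favor categoricals, many uniques penalize them.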
Edited by David Schäfer
@palmb (Author, Owner)

> Is this an adequate representation of our flags?
This is more a general estimation of the gain from using categoricals. I guess what comes closest to our Flags is the first column: 2 uniques (`BAD`/`UNFLAGGED`) is close enough to 0 uniques, the difference will be just some bytes.
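The "close enough to 0 uniques" claim is easy to sanity-check: with `memory_usage(deep=True)`, a 2-category Series costs only its category table more than an all-NaN, 0-category one. A minimal sketch (variable names are illustrative):

```python
import numpy as np
import pandas as pd

n = 10_000
zero_cats = pd.Series(np.nan, index=range(n)).astype("category")          # 0 categories
two_cats = pd.Series(["BAD", "UNFLAGGED"] * (n // 2)).astype("category")  # 2 categories

# Both use int8 codes; the only extra cost is the two category labels.
diff = two_cats.memory_usage(deep=True) - zero_cats.memory_usage(deep=True)
print(diff)
```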
> There is also `Series.memory_usage`
Yeah, I know, but I had some problems with that in the past. If I remember correctly, those were mostly with nested data, so maybe `memory_usage` would be the more accurate way, IDK. Counting the bytes of a pickled byte stream is for sure an upper bound.

For me it looks a bit odd to see a >200 % increase in memory (see column 1.00) when using categories instead of floats, but IDK. Even though pandas states [1] that this might happen.
[1] https://pandas.pydata.org/docs/user_guide/categorical.html#categorical-memory
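The behaviour pandas warns about in [1] is reproducible: when (almost) every value is unique, the category table duplicates the data and the codes come on top, so the categorical copy is strictly larger. Whether the overhead reaches the >200 % seen in the table depends on how sizes are measured; this sketch uses `memory_usage(deep=True)`:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(100_000, dtype=float))  # all values unique
as_float = s.memory_usage(deep=True)
as_cat = s.astype("category").memory_usage(deep=True)

# 100_000 unique categories need int32 codes *plus* the full float64
# category table, so the categorical version costs more, not less.
print(as_cat / as_float)
```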
Edited by Bert Palm