Using sparse histories to save memory
As @palmb pointed out, we are appending a lot of almost-all-NaN columns to our histories and thereby blow up memory usage. This is bad and will likely limit usability.
But since the histories in practice really do consist mostly of NaN values, this is also good news: they qualify as sparse matrices.
Of course, reworking the whole history/flags mechanics is not realistic in the near future, since we would have to switch everything to scipy.sparse usage.
BUT!
We do not really need a sophisticated sparsity framework, since we do not apply complex matrix operations to the histories. We only append columns and retrieve row-maxima series.
Also, pandas supports sparse dataframes.
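For reference, the built-in `pd.SparseDtype` is all this needs; a minimal sketch (sizes and the random fill pattern are made up):

```python
import numpy as np
import pandas as pd

# Build a mostly-NaN, history-like frame with a few random float entries.
rng = np.random.default_rng(0)
dense = pd.DataFrame(np.nan, index=range(10_000), columns=range(10))
for r, c in zip(rng.integers(0, 10_000, 50), rng.integers(0, 10, 50)):
    dense.iloc[r, c] = rng.random()

# Cast to pandas' sparse extension dtype; NaN is the fill value,
# so only the non-NaN entries are actually stored.
sparse = dense.astype(pd.SparseDtype("float", np.nan))

print(dense.memory_usage(deep=True).sum())   # dense footprint in bytes
print(sparse.memory_usage(deep=True).sum())  # much smaller
print(sparse.sparse.density)                 # fraction of stored values
```

The cast is per-column and needs no change to the surrounding code: consumers still see a normal `DataFrame`.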
I did some benchmarking with this script: sparse_bench.py
Here are some result numbers (the script is easy to use; feel free to change the starting conditions):
```
size of dataframe: 100000 x 20 (2000000 values)
number of not-na values: 5000 (0.25 percent)
casting from dense to sparse: 0.024259119429998463 seconds per cast
dense_memory : 16800.00 bytes
sparse_memory: 858.48 bytes
sparse to dense column selection ratio: 1.0399656264302082
sparse to dense row selection ratio: 2.8399492566071878
sparse to dense max calculation ratio: 1.2873744690000033
sparse to dense append ratio: 0.04224142526339283
```
The results relate to a matrix with mostly NaN entries and random float entries at random positions (see the numbers above for the exact setup). The ratios are (time for sparse matrix / time for dense matrix). So most operations get slower for sparse matrices, except for column appending; I guess the sparse format reduces column appending to something really basic on the fly.
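The kind of timing behind these ratios can be sketched roughly like this (a toy benchmark, not the original sparse_bench.py; sizes and the measured operations are illustrative):

```python
import time
import numpy as np
import pandas as pd

def avg_seconds(fn, repeats=5):
    """Average wall-clock time of fn over a few runs."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats

dense = pd.DataFrame(np.nan, index=range(20_000), columns=range(10))
dense.iloc[::400, 3] = 1.0  # a few non-NaN entries
sparse = dense.astype(pd.SparseDtype("float", np.nan))

new_dense = pd.Series(np.nan, index=dense.index)
new_sparse = new_dense.astype(pd.SparseDtype("float", np.nan))

# ratio = (time for sparse / time for dense); > 1 means sparse is slower
max_ratio = avg_seconds(lambda: sparse.max(axis=1)) / avg_seconds(lambda: dense.max(axis=1))
append_ratio = avg_seconds(lambda: sparse.assign(x=new_sparse)) / avg_seconds(lambda: dense.assign(x=new_dense))
print("row-max ratio (sparse/dense):", max_ratio)
print("append ratio  (sparse/dense):", append_ratio)
```

The exact numbers depend heavily on frame size and density, which is why the script exposes the starting conditions.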
Most of the speed loss comes from row selection, but row-wise max calculation with the built-in max is only about 30 percent slower. So that's not really a problem.
The numbers turn further in favor of the sparse matrix as the column/row numbers increase.
Since casting from dense to sparse is not too expensive, the performance loss is not too big even for the harmonization methods that apply rolling/resampling to the columns (which loses the sparse dtype).
The results of column appending and column selection keep the sparse data type.
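A small sketch of that rolling round trip (the index and window are made up; the explicit `sparse.to_dense()` just makes the densification visible):

```python
import numpy as np
import pandas as pd

# A mostly-NaN sparse "history" column on a datetime index.
idx = pd.date_range("2021-01-01", periods=1_000, freq="min")
col = pd.Series(np.nan, index=idx)
col.iloc[::50] = 1.0
col = col.astype(pd.SparseDtype("float", np.nan))

# rolling/resampling operate on dense data, so the sparse dtype gets lost...
rolled = col.sparse.to_dense().rolling("10min").max()

# ...but casting the result back to sparse is the cheap part.
resparsed = rolled.astype(pd.SparseDtype("float", np.nan))
print(resparsed.dtype)
```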
I guess even a minimal implementation, where we cast the history of any variable to sparse whenever it gets touched, would vastly improve memory usage. We would only pay a performance cost of (function calls * cast costs). Maybe a switch could turn this behavior on and off.
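Such a cast-on-touch variant could look roughly like this (a sketch; `append_to_history` and the `SPARSE_HISTORIES` switch are hypothetical names, not the actual API):

```python
import numpy as np
import pandas as pd

SPARSE_HISTORIES = True  # hypothetical on/off switch

def append_to_history(history: pd.DataFrame, col: pd.Series) -> pd.DataFrame:
    """Append one result column to a history frame, casting it to the
    sparse dtype if the switch is on. Illustrative sketch only."""
    if SPARSE_HISTORIES:
        col = col.astype(pd.SparseDtype("float", np.nan))
    history[len(history.columns)] = col
    return history

# usage: start from an empty history and "touch" it a few times
hist = pd.DataFrame(index=range(1_000))
for _ in range(5):
    new = pd.Series(np.nan, index=hist.index)
    new.iloc[::100] = 255.0
    hist = append_to_history(hist, new)
print(hist.memory_usage(deep=True).sum())
```

Since appending keeps the sparse dtype, the cast cost is paid once per touched column, not on every access.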
The only thing that may be implementation-wise complex is getting rid of the initial -np.inf column, since this fully dense column imposes a lower bound of 1/function_calls on the fraction of stored values.