numba riddance
Getting rid of the numba dependency seems desirable because of its potential for conflicts.
I went through our uses of numba and checked the performance gain from numba against alternatives/workarounds. I think removal is doable without significant performance loss (in the central use cases):
`flagzScore`:
The numba engine is required when rolling with multicolumn windows to get multicolumn sample statistics/scores (axis=1):
- I replaced the calls to `.rolling` with the numba engine in !700 (merged) with a custom numpy-based roller that is at least as efficient (no jitting overhead) for harmonized data. For non-harmonized data with sample sizes > 500.000, the performance loss is about a factor of 20.
- Since the use case with axis=1 and non-harmonized data is extremely rare, if it is used at all, I think the performance loss is acceptable.
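
To illustrate the idea of a numpy-based multicolumn roller, here is a minimal sketch. It is not the implementation from !700; the function name `rolling_multicolumn_zscore`, the trailing-window convention, and the assumption of regularly sampled (harmonized) data in a 2D array are all hypothetical choices made for the example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def rolling_multicolumn_zscore(values, window):
    """Score each row against mean/std pooled over `window` rows and all columns.

    Hypothetical helper: assumes harmonized data, i.e. a fixed number of
    rows per window, passed as a 2D array (rows = timestamps).
    """
    values = np.asarray(values, dtype=float)
    scores = np.full_like(values, np.nan)
    if window > values.shape[0]:
        return scores
    # Read-only view of shape (n_rows - window + 1, n_cols, window).
    windows = sliding_window_view(values, window_shape=window, axis=0)
    # Pool the statistics over the window rows AND the columns.
    mean = np.nanmean(windows, axis=(1, 2))
    std = np.nanstd(windows, axis=(1, 2))
    # Trailing windows: the window ending at row i scores row i.
    scores[window - 1:] = (values[window - 1:] - mean[:, None]) / std[:, None]
    return scores
```

The first `window - 1` rows stay NaN because no complete window ends there. For non-harmonized data the number of rows per window varies, so a fixed-shape view like this no longer applies, which is presumably where the slower fallback comes in.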
`fitPolynomial`:
The performance loss from not using numba grows with the sample size:
series length | ~factor | total_time |
---|---|---|
100.000 | 1 | |
1.000.000 | ~5 | 16s/90s |
10.000.000 | ~11 | 67s/918s |
- The numba boost kicks in only at around 200.000 samples and grows slowly (in the factor).
- For modelling data, the spectral-based fit (`lowPassFilter`) is easier to parametrize and much faster, so we could just make that the main modelling function and no longer support optimized polynomial fitting.
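
For readers unfamiliar with spectral smoothing, the following is a generic FFT-based low-pass sketch. It is not the implementation of `lowPassFilter`; the function name `fft_low_pass` and its parameters are hypothetical, and it assumes regularly sampled data without NaNs.

```python
import numpy as np


def fft_low_pass(values, cutoff, sampling_rate=1.0):
    """Remove all frequency components above `cutoff` (hypothetical helper).

    `cutoff` is given in the same unit as `sampling_rate` (cycles per sample
    if `sampling_rate` is 1.0).
    """
    values = np.asarray(values, dtype=float)
    spectrum = np.fft.rfft(values)
    freqs = np.fft.rfftfreq(values.size, d=1.0 / sampling_rate)
    # Hard cutoff: zero every component above the cutoff frequency.
    spectrum[freqs > cutoff] = 0.0
    return np.fft.irfft(spectrum, n=values.size)
```

There is only one parameter to tune (the cutoff), compared to window size and polynomial order for a rolling polynomial fit, and the cost is dominated by two FFTs.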
`flagChangepoints`:
I checked the detection of jumps:
series length | ~factor | total_time |
---|---|---|
100.000 | 1 | |
1.000.000 | 10 | 2s/12s |
10.000.000 | 10 | 15s/113s |
- The numba boost seems to be capped at a factor of 10 and kicks in at about 200.000 samples.
- Since flagging jumps (and other basic changepoint tasks) just compares means (or other basic statistics), those calls could be dispatched to built-in numpy functions.
- A performance loss remains only for changepoint tasks that compare statistics which are not basic enough to be built in, but not too complex/exotic to be jit-able.
- So the loss from removing numba is negligible.
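
As a rough sketch of what dispatching the jump case to built-in numpy functions could look like: the example below compares the means of the backward and forward window around each sample. The function name `mean_jump_scores` and the fixed-size windows (i.e. regularly sampled data) are assumptions made for illustration, not the current `flagChangepoints` code.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def mean_jump_scores(values, window):
    """Absolute difference between forward and backward window means.

    Hypothetical helper: for index i the backward window is
    values[i - window:i] and the forward window is values[i:i + window].
    Indices without two complete windows get NaN.
    """
    values = np.asarray(values, dtype=float)
    scores = np.full(values.shape, np.nan)
    if values.size < 2 * window:
        return scores
    # means[i] is the mean of values[i:i + window].
    means = sliding_window_view(values, window).mean(axis=1)
    backward = means[:-window]  # windows ending right before index i
    forward = means[window:]    # windows starting at index i
    scores[window:values.size - window + 1] = np.abs(forward - backward)
    return scores
```

A threshold on these scores then yields the jump flags. Other built-in reductions (`median`, `std`, ...) could be swapped in the same way.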
`flagRaise`:
I didn't explicitly check the performance loss for `flagRaise`.
- The function is never used. I implemented it for a demand that came up in an early GCEF flagging session.
- `flagUniLOF` is more reliable in achieving what `flagRaise` tries to do, and is also much easier to parametrize, so I would be OK with removing numba from `flagRaise` or even deprecating the function altogether (to keep the toolbox slim).
`_exceedConsecutiveNanLimit`:
A helper function containing a loop that counts consecutive NaNs. The jitted loop can be replaced by an equally performant call to `np.lib.stride_tricks.sliding_window_view`:
`np.isnan(np.lib.stride_tricks.sliding_window_view(arr, window_shape=max_consec)).all(axis=1).any()`
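
Wrapped into a self-contained function, the expression above could look like the sketch below. The function name and the guard for short inputs are my additions; note that, as written, the expression fires for runs of at least `max_consec` NaNs, so whether the limit is meant to be inclusive is an assumption that should be checked against the existing loop.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def exceeds_consecutive_nan_limit(arr, max_consec):
    """Return True if `arr` contains a run of at least `max_consec` NaNs."""
    arr = np.asarray(arr, dtype=float)
    # sliding_window_view raises if the window is larger than the array,
    # and such an array cannot contain a long enough run anyway.
    if max_consec < 1 or max_consec > arr.size:
        return False
    windows = sliding_window_view(arr, window_shape=max_consec)
    # A window that is all-NaN corresponds to a run of max_consec NaNs.
    return bool(np.isnan(windows).all(axis=1).any())
```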
Edited by Peter Lünenschloß