numba riddance

Getting rid of the numba dependency seems desirable because of its potential for conflicts.

I went through our uses of numba and checked the performance gain numba provides, as well as alternatives/workarounds. I think removal is doable without significant performance loss (in the central use cases):

  1. flagzScore:

The numba engine is required when rolling with multicolumn windows to get multicolumn sample statistics/scores (axis=1):

  • In !700 (merged), I replaced the calls to .rolling that use the numba engine with a custom numpy-based roller that is at least as efficient (no jitting overhead) for harmonized data.
  • For non-harmonized data and sample sizes > 500.000, the performance loss is about a factor of 20.
  • Since the use case with axis=1 and non-harmonized data is extremely rare, if it is used at all, I think this performance loss is acceptable.
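To illustrate the idea of a numpy-based roller for harmonized data, here is a minimal sketch. It is not the roller from !700. The function name `rolling_multicolumn_zscore`, the centered window and the NaN handling are assumptions made for illustration only. It pools mean and standard deviation over a fixed number of rows and all columns, then scores the center row of each window.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def rolling_multicolumn_zscore(data: np.ndarray, window: int) -> np.ndarray:
    """Score each row against statistics pooled over a window of rows and all columns.

    Hypothetical helper: `data` has shape (n_samples, n_columns) and is assumed
    to be regularly sampled (harmonized), so a fixed number of rows covers a
    fixed time span. Rows without a full window are returned as NaN.
    """
    n_rows = data.shape[0]
    scores = np.full(data.shape, np.nan, dtype=float)
    if window > n_rows:
        return scores
    # Shape (n_rows - window + 1, n_columns, window); these are views, not copies.
    windows = sliding_window_view(data, window_shape=window, axis=0)
    # Pool the statistics over the window rows and over all columns.
    mean = np.nanmean(windows, axis=(1, 2))
    std = np.nanstd(windows, axis=(1, 2))
    half = window // 2
    center = slice(half, half + mean.size)
    with np.errstate(divide="ignore", invalid="ignore"):
        scores[center] = (data[center] - mean[:, None]) / std[:, None]
    return scores
```

Because the window is a fixed row count, this only works when every row spacing is the same. Non-harmonized data needs per-row window bounds, which is presumably where the slowdown mentioned above comes from.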
  2. fitPolynomial

The performance loss from not using numba grows with the sample size:

series length   ~factor   total_time
100.000         1
1.000.000       ~5        16s/90s
10.000.000      ~11       67s/918s
  • The numba boost only kicks in at around 200.000 samples and grows slowly (in terms of the factor).
  • For modelling data, the spectral-based fit (lowPassFilter) is easier to parametrize and much faster, so we could just make it the main modelling function and no longer support optimized polynomial fitting.
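To make the "easier to parametrize" point concrete, here is a minimal numpy-only sketch of a spectral low-pass fit. It is not the implementation behind lowPassFilter. The function name `lowpass_fit`, the hard frequency cutoff and the parameter names are assumptions for illustration, and the data is assumed to be regularly sampled and free of NaNs.

```python
import numpy as np


def lowpass_fit(values: np.ndarray, sampling_interval: float, cutoff: float) -> np.ndarray:
    """Model a regularly sampled series by dropping all frequencies above `cutoff`.

    Hypothetical helper: `sampling_interval` is the spacing between samples and
    `cutoff` is given in cycles per unit of `sampling_interval`.
    """
    spectrum = np.fft.rfft(values)
    freqs = np.fft.rfftfreq(values.size, d=sampling_interval)
    # Hard cutoff: keep the slow components, zero the fast ones.
    spectrum[freqs > cutoff] = 0.0
    return np.fft.irfft(spectrum, n=values.size)
```

The only tuning knob is the cutoff frequency, whereas a rolling polynomial fit needs both a window size and a polynomial order. The cost is dominated by two FFTs, so no jitted loop is involved.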
  3. flagChangepoints

I checked the detection of jumps:

series length   ~factor   total_time
100.000         1
1.000.000       10        2s/12s
10.000.000      10        15s/113s
  • The numba boost seems to be capped at a factor of 10 and kicks in at about 200.000 samples.
  • Since flagging jumps (and other basic changepoint tasks) just compares means (or other basic statistics), those calls could be dispatched to built-in numpy functions.
  • A performance loss only remains for changepoint tasks that compare statistics which are not basic enough to be built in, yet not too complex/exotic to be jit-able.
  • So the loss from removing numba is negligible.
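As an illustration of dispatching a basic changepoint task to built-in numpy functions, here is a minimal sketch of a mean-based jump check. It is not the flagChangepoints implementation. The function name `jump_candidates` and its parameters are hypothetical, and the data is assumed to be regularly sampled and free of NaNs.

```python
import numpy as np


def jump_candidates(values: np.ndarray, window: int, thresh: float) -> np.ndarray:
    """Mark samples where the mean of the next `window` values differs from
    the mean of the previous `window` values by more than `thresh`.
    """
    n = values.size
    mask = np.zeros(n, dtype=bool)
    if n < 2 * window:
        return mask
    # csum[i] is the sum of values[:i], so every window sum is one subtraction.
    csum = np.concatenate(([0.0], np.cumsum(values, dtype=float)))
    idx = np.arange(window, n - window + 1)
    backward = (csum[idx] - csum[idx - window]) / window  # mean of values[i-window:i]
    forward = (csum[idx + window] - csum[idx]) / window   # mean of values[i:i+window]
    mask[idx] = np.abs(forward - backward) > thresh
    return mask
```

The cumulative sum replaces the per-window loop, so the runtime is linear in the series length without any jitting. For very long series with large values, the cumulative sum can lose precision, which a real implementation would have to consider.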
  4. flagRaise

I didn't explicitly check the performance loss for flagRaise.

  • The function is never used. I implemented it for a demand that occurred in an early GCEF-flagging session.
  • flagUniLOF is more reliable in achieving what flagRaise tries to do, and also much easier to parametrize.
  • So I would be OK with removing numba from flagRaise, or even deprecating the function altogether (to keep the toolbox slim).
  5. _exceedConsecutiveNanLimit

A helper function containing a loop to count consecutive NaNs. The jitted loop can be replaced by an equally performant call to np.lib.stride_tricks.sliding_window_view:

np.isnan(np.lib.stride_tricks.sliding_window_view(arr, window_shape=max_consec)).all(axis=1).any()
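The one-liner can be wrapped into a small function, sketched here under some assumptions. The name `has_nan_run` is hypothetical, and whether the real helper treats `max_consec` as an inclusive or exclusive limit is not stated above. This sketch reports a run of at least `max_consec` NaNs, matching the window size used in the one-liner.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def has_nan_run(arr: np.ndarray, max_consec: int) -> bool:
    """Return True if the 1D array `arr` contains at least `max_consec` consecutive NaNs."""
    if max_consec < 1:
        raise ValueError("max_consec must be a positive integer")
    if arr.size < max_consec:
        # sliding_window_view raises if the window is longer than the array.
        return False
    # Taking isnan first keeps the temporary boolean array at the size of `arr`;
    # the windows themselves are views.
    windows = sliding_window_view(np.isnan(arr), window_shape=max_consec)
    return bool(windows.all(axis=1).any())
```

The explicit length check matters because `sliding_window_view` raises a `ValueError` when `window_shape` exceeds the array length, which the bare one-liner does not guard against.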
Edited by Peter Lünenschloß