Memory & GC
Problem
We bought flexibility regarding time steps, time interpolation, etc. at the cost of memory consumption and the instantiation of many arrays.
Currently, outputs accept data arrays only as "owned" (in Rust terms), i.e. components or adapters must not modify pushed data afterwards. The entire data forwarding is based on passing these arrays through the chain, and on creating new arrays for calculations (e.g. time interpolation).
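To make the pattern concrete, here is a minimal sketch of such a push/pull chain. The class names (`Output`, `LinearTimeInterpolation`) are invented for illustration and are not the framework's actual API:

```python
import numpy as np

class Output:
    """Illustrative output: stores pushed arrays as 'owned' data."""

    def __init__(self):
        self._buffer = []  # (time, array) pairs; no copies are made

    def push(self, time, data):
        # The caller must not modify `data` after pushing it.
        self._buffer.append((time, data))

class LinearTimeInterpolation:
    """Illustrative time adapter: allocates a new array per pull."""

    def __init__(self, output):
        self._output = output

    def pull(self, time):
        (t0, a0), (t1, a1) = self._output._buffer[-2:]
        w = (time - t0) / (t1 - t0)
        # A fresh array is allocated for every interpolated pull.
        return (1.0 - w) * a0 + w * a1

out = Output()
out.push(0.0, np.zeros((860, 630)))
out.push(1.0, np.ones((860, 630)))
mid = LinearTimeInterpolation(out).pull(0.5)  # another ~4.3 MB array
```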
Also, depending on the pull interval, outputs or time adapters may hold a considerable number of pushed arrays.
This can lead to high memory consumption and creates work for the garbage collector.
The new scheduling algorithm makes the issue worse, as it is no longer guaranteed that a pull falls within the range of the last two pushes.
Is this really a problem?
Examples
- Germany, 1 km resolution -> 860x630 cells -> 4.4 MB with numpy.float64
  - 30 grids -> 130 MB
  - 365 grids -> 1.6 GB
- EU, 4 km resolution -> 1875x1375 cells -> 20.6 MB with numpy.float64
  - 30 grids -> 620 MB
  - 365 grids -> 7.5 GB
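These figures are just the cell count times the 8-byte itemsize of numpy.float64; a quick check:

```python
import numpy as np

def grid_mb(rows, cols, dtype=np.float64):
    """Size of one grid in MB (10**6 bytes)."""
    return rows * cols * np.dtype(dtype).itemsize / 1e6

for name, rows, cols in [("Germany", 860, 630), ("EU", 1875, 1375)]:
    mb = grid_mb(rows, cols)
    print(f"{name}: {mb:.1f} MB per grid, "
          f"30 grids: {30 * mb:.0f} MB, "
          f"365 grids: {365 * mb / 1e3:.1f} GB")
```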
Possible solutions
- Ignore it until it really becomes a problem
- Re-think the entire timing/step size stuff and be much more restrictive (unlikely)
- Store large stacks of rasters in files instead of RAM (using Dask or np.memmap?) (see !238 (merged) for a draft implementation, and the memmap sketch after this list)
- Some kinds of coupling, e.g. with an equal time step and no adapters, could use a more memory-conserving method that writes to and reads from a shared array (see the shared-memory sketch after this list)
- Allow linkage over MPI and distribute memory over multiple nodes
- Inform time adapters about the next expected pull -- this would allow aggregating data in place and keeping only data after the next pull (which may happen due to dependency scheduling)
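For the file-backed option, a minimal sketch using np.memmap; the file name and stack shape are illustrative assumptions, and the draft in !238 may differ:

```python
import numpy as np

# One year of daily Germany-sized grids: ~1.6 GB on disk instead of in RAM.
# File name and shape are illustrative, not taken from !238.
stack = np.memmap("raster_stack.dat", dtype=np.float64,
                  mode="w+", shape=(365, 860, 630))

stack[0] = 1.0              # writes go through the OS page cache to disk
day_mean = stack[0].mean()  # only the pages actually touched are loaded
stack.flush()               # persist dirty pages
```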
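And a rough sketch of the shared-array idea for the equal-time-step, no-adapter case, using Python's multiprocessing.shared_memory; again purely illustrative, not an existing framework feature:

```python
import numpy as np
from multiprocessing import shared_memory

shape, dtype = (860, 630), np.dtype(np.float64)
nbytes = int(np.prod(shape)) * dtype.itemsize

# Writer side: allocate one buffer; the component overwrites it each step
# instead of pushing a new array.
shm = shared_memory.SharedMemory(create=True, size=nbytes)
src = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
src[:] = 0.0

# Reader side (possibly another process): attach by name, no copy is made,
# so there is nothing for the garbage collector to reclaim per step.
shm_r = shared_memory.SharedMemory(name=shm.name)
dst = np.ndarray(shape, dtype=dtype, buffer=shm_r.buf)
assert dst[0, 0] == 0.0

shm_r.close()
shm.close()
shm.unlink()
```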