reduce memory consumption by relaxing immutability considerations

David Schäfer requested to merge memsave into develop

This MR changes our model of data copying. As the title indicates, the general immutability of data is relaxed by removing the data copies made between function calls. This significantly reduces main memory consumption (from >15 GB to ~5 GB for a CHS-pipeline use case with 1 minute resolution data), but it also means that changes to the data become visible to the caller when saqc functions are called directly.

Consider the following dummy code:

from saqc.funcs import dataProcessingFunction

data = ...
data_saqc, flags = dataProcessingFunction(data, field, flags) 

Here, the returned value data_saqc is a reference to data, so any changes made in dataProcessingFunction will be visible in data as well.
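The aliasing described above can be illustrated with a minimal, self-contained sketch. It uses a plain dict instead of the real saqc data structures, and `processing_function` is a hypothetical stand-in, not an actual saqc function:

```python
# Hypothetical stand-in for a saqc processing function (not part of the saqc API).
def processing_function(data, field, flags):
    # Modify the passed object in place; no copy of `data` is made.
    data[field] = [v * 2 for v in data[field]]
    return data, flags


data = {"x": [1, 2, 3]}
flags = {"x": [0, 0, 0]}

data_out, flags_out = processing_function(data, "x", flags)

print(data_out is data)  # True: both names refer to the same object
print(data["x"])         # [2, 4, 6]: the caller's data has changed
```

Callers who need the original values after such a direct call would have to copy the data themselves before passing it in.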

On the API level, however, we circumvent this issue by copying the passed data first. Consider the following example:

from saqc import SaQC

data = ...
saqc = SaQC(data).dataProcessingFunction(field)
data_saqc, flags = saqc.getResult()

Here, data_saqc is a copy of data, i.e. changes made in dataProcessingFunction are not visible in data.
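A rough sketch of this copy-once-at-the-boundary idea, again with plain dicts: the `Pipeline` class, its `apply` and `get_result` methods, and `processing_function` are all hypothetical names chosen for illustration and do not reflect how SaQC is actually implemented.

```python
import copy


# Hypothetical in-place processing function (not part of the saqc API).
def processing_function(data, field, flags):
    data[field] = [v * 2 for v in data[field]]
    return data, flags


class Pipeline:
    """Hypothetical stand-in for an API object that copies its input once."""

    def __init__(self, data, flags=None):
        # The only copy: taken once, when the data enters the API.
        self._data = copy.deepcopy(data)
        if flags is None:
            flags = {key: [0] * len(values) for key, values in data.items()}
        self._flags = copy.deepcopy(flags)

    def apply(self, func, field):
        # No copies between function calls; work on the internal state.
        self._data, self._flags = func(self._data, field, self._flags)
        return self

    def get_result(self):
        return self._data, self._flags


data = {"x": [1, 2, 3]}
pipeline = Pipeline(data)
pipeline.apply(processing_function, "x").apply(processing_function, "x")
data_out, flags_out = pipeline.get_result()

print(data_out["x"])  # [4, 8, 12]
print(data["x"])      # [1, 2, 3]: the caller's data is untouched
```

With this design the memory cost is one copy per pipeline rather than one copy per function call, while the caller's object is still protected.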

I would love not to make this change, in order to keep our current immutability promises (something I really appreciate). However, the excessive memory consumption we currently have significantly reduces the usability of saqc for larger datasets. That's why I think this change (or something similar) is necessary.