reduce memory consumption by relaxing immutability considerations
This MR introduces changes to our model of data copying. As the title indicates, the general model of data immutability is relaxed by removing the data copies between function calls. This reduces main memory consumption significantly (from >15 GB to ~5 GB for a CHS-pipeline use case with 1 minute resolution data), but it also makes changes to data visible when calling saqc functions directly.
Consider the following dummy code:
```python
from saqc.funcs import dataProcessingFunction

data = ...
data_saqc, flags = dataProcessingFunction(data, field, flags)
```
Here, the returned value `data_saqc` is a reference to `data`, so any changes made in `dataProcessingFunction` will be visible in `data` as well.
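To make the consequence concrete, here is a small self-contained sketch. `dummyProcessingFunction` is a hypothetical stand-in, not part of saqc; it merely mimics a function that now works on its input in place. The sketch also shows how a caller can restore the old copy semantics by passing an explicit copy:

```python
import pandas as pd

def dummyProcessingFunction(data, field, flags):
    # Hypothetical stand-in for a saqc function that, with this MR,
    # modifies its input in place instead of working on a copy.
    data[field] = data[field].interpolate()
    return data, flags

data = pd.DataFrame({"sensor_1": [1.0, None, 3.0]})
flags = None  # placeholder, the flags object is irrelevant here

# New behavior: the returned object is the very same one that was passed in,
# so the interpolation is visible in `data` as well.
data_out, flags = dummyProcessingFunction(data, "sensor_1", flags)
assert data_out is data

# Callers who still need the old semantics can opt in with an explicit copy:
untouched = pd.DataFrame({"sensor_1": [1.0, None, 3.0]})
data_out, flags = dummyProcessingFunction(untouched.copy(), "sensor_1", flags)
assert untouched["sensor_1"].isna().any()  # the original is left untouched
```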
On the API level, however, we circumvent this issue by copying the passed data first. Consider the following example:
```python
from saqc import SaQC

data = ...
saqc = SaQC(data).dataProcessingFunction(field)
data_saqc, flags = saqc.getResult()
```
Here, `data_saqc` is a copy of `data`, i.e. changes made in `dataProcessingFunction` are not visible in `data`.
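For illustration only, a minimal sketch of the defensive-copy pattern described above. `MiniSaQC` is a made-up class, not the actual SaQC implementation; the only assumption is that the data is copied once when it enters the API boundary:

```python
import pandas as pd

class MiniSaQC:
    # Hypothetical, stripped-down illustration of the API-level behavior:
    # one defensive copy at the boundary shields the caller's data from the
    # in-place work done by the processing functions.
    def __init__(self, data):
        self._data = data.copy()
        self._flags = None

    def dataProcessingFunction(self, field):
        # operates in place, but only on the internal (already copied) data
        self._data[field] = self._data[field].interpolate()
        return self

    def getResult(self):
        return self._data, self._flags

data = pd.DataFrame({"sensor_1": [1.0, None, 3.0]})
data_saqc, flags = MiniSaQC(data).dataProcessingFunction("sensor_1").getResult()

assert data_saqc is not data          # the result is a copy ...
assert data["sensor_1"].isna().any()  # ... and the caller's data is untouched
```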
I would love to not make this change in order to keep our current immutability promises (something I really appreciate). However, the excessive memory consumption we currently have significantly reduces the usability of saqc for larger datasets. That's why I think this change (or something similar) is necessary.