reduce memory consumption by relaxing immutability considerations
This MR introduces changes to our model of data copying. As the title indicates, the general model of data immutability is relaxed by removing the data copies in between function calls. This reduces the main memory consumption significantly (from >15 GB to ~5 GB for a CHS-pipeline use case with 1 minute resolution data), but it also makes changes to `data` visible when calling `saqc` functions directly.
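To make the trade-off concrete, here is a rough, saqc-independent sketch of the two calling conventions; the driver loop and the `step` function are made up for illustration and are not the actual saqc internals:

```python
import pandas as pd

# Stand-in for an arbitrary processing function that modifies its input in place.
def step(df: pd.DataFrame) -> pd.DataFrame:
    df["x"] = df["x"] * 2.0
    return df

data = pd.DataFrame({"x": [1.0, 2.0, 3.0]})

# old model: a defensive copy before every call keeps `data` untouched,
# at the cost of one extra DataFrame allocation per pipeline step
out_old = data
for _ in range(3):
    out_old = step(out_old.copy())

# relaxed model: one shared object runs through the whole pipeline,
# no intermediate copies, but in-place changes now reach `data`
out_new = data
for _ in range(3):
    out_new = step(out_new)
```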
Consider the following dummy code:

```python
from saqc.funcs import dataProcessingFunction

data = ...
data_saqc, flags = dataProcessingFunction(data, field, flags)
```
Here, the returned value `data_saqc` is a reference to `data`, so any changes made in `dataProcessingFunction` will be visible in `data` as well.
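For readers less familiar with these reference semantics, the following minimal, saqc-independent sketch shows the same effect with plain pandas (the function name is made up):

```python
import pandas as pd

# The returned frame is the very same object that was passed in,
# so in-place changes are shared between both names.
def dummyProcessingFunction(df: pd.DataFrame) -> pd.DataFrame:
    df["x"] = df["x"].fillna(0.0)  # modifies the passed object
    return df

data = pd.DataFrame({"x": [1.0, None, 3.0]})
result = dummyProcessingFunction(data)

assert result is data           # no copy: the same object is returned
assert data.loc[1, "x"] == 0.0  # the modification is visible in `data`
```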
On the API level, however, we circumvent this issue by copying the passed `data` first. Consider the following example:
```python
from saqc import SaQC

data = ...
saqc = SaQC(data).dataProcessingFunction(field)
data_saqc, flags = saqc.getResult()
```
Here, `data_saqc` is a copy of `data`, i.e. changes made in `dataProcessingFunction` are not visible in `data`.
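The idea behind this is a single copy at the API boundary. The following sketch illustrates it with a hypothetical wrapper class (not the actual SaQC implementation): the constructor copies the input once, so any later in-place processing never touches the caller's data.

```python
import pandas as pd

class PipelineSketch:
    def __init__(self, data: pd.DataFrame):
        self._data = data.copy()  # single copy at the API boundary

    def fillValues(self) -> "PipelineSketch":
        self._data["x"] = self._data["x"].fillna(0.0)  # in-place from here on
        return self

    def getResult(self) -> pd.DataFrame:
        return self._data

data = pd.DataFrame({"x": [1.0, None, 3.0]})
result = PipelineSketch(data).fillValues().getResult()

assert result is not data         # the caller gets a different object
assert pd.isna(data.loc[1, "x"])  # the original data is unchanged
```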
I would love to not make this change, in order to keep our current immutability promises (something I really appreciate). However, the excessive memory consumption we currently have significantly reduces the usability of `saqc` for larger datasets. That's why I think this change (or something similar) is necessary.