Proper name mangling for system generated temporary variables needed
Currently we generate intermediate/temporary variables during harmonization. They store the original/unchanged data in order to allow a later deharmonization. The naming scheme we use is rather simplistic, as we only append _original to a given field name. There are at least two problems with that:
- Trying to harmonize a variable multiple times fails (see minimal example below)
- Using regular expressions as field selectors might result in unexpected behavior, as a pattern like ".*" would also match all intermediate variables.
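The second problem can be reproduced without SaQC at all. A minimal sketch, assuming that field selection boils down to matching a regular expression against the column names (the select_fields helper is hypothetical and not SaQC API):

```python
import re

def select_fields(pattern, columns):
    """Hypothetical selector: return all column names fully matched by pattern."""
    regex = re.compile(pattern)
    return [col for col in columns if regex.fullmatch(col)]

# column names as they look after harmonizing "x" with the current scheme
columns = ["x", "y", "x_original"]

selected = select_fields(".*", columns)
print(selected)  # the intermediate variable "x_original" is selected as well
```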
So, we need to fix this soon! Either we come up with a proper name mangling scheme, or we try to get the intermediate variables out of SaQC._data. Both have their specific advantages and disadvantages, so I think we need a dedicated discussion for that.
Minimal failing example for problem 1:
import pandas as pd
import numpy as np
from saqc import SaQC, SimpleFlagger
data = pd.DataFrame({"x": np.arange(10)}, index=pd.date_range("2012-01-01", periods=10, freq="20Min"))
saqc = SaQC(data=data, flagger=SimpleFlagger())
saqc = (saqc
        .resampling.linear("x", freq="10Min")
        .resampling.aggregate("x", freq="20Min", value_func=np.mean))
saqc.getResult()
Error:
ValueError: x: field already exist
The problem is that all resampling functions call tools.copy to copy field to field + "_original", but tools.copy fails if field + "_original" already exists (and rightly so). I think we should either support multiple harmonization on one variable (+1) or give a more meaningful error message (-1).
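One way to support repeated harmonization could be to number the backups instead of reusing a single name. The following is only a sketch on a plain pandas DataFrame; backup_field and the numbering convention are assumptions made for illustration and are not part of SaQC:

```python
import numpy as np
import pandas as pd

SUFFIX = "_original"  # the suffix currently appended during harmonization

def backup_field(data, field):
    """Copy data[field] to the first unused backup name and return that name.

    Hypothetical helper: the first backup is field + "_original", later
    ones get a running number appended ("x_original_1", "x_original_2", ...).
    """
    target = field + SUFFIX
    counter = 0
    while target in data.columns:
        counter += 1
        target = f"{field}{SUFFIX}_{counter}"
    data[target] = data[field].copy()
    return target

data = pd.DataFrame(
    {"x": np.arange(10)},
    index=pd.date_range("2012-01-01", periods=10, freq="20min"),
)
first = backup_field(data, "x")   # first backup, no name clash
second = backup_field(data, "x")  # second backup gets a numbered name
print(first, second)
```

With such a scheme, deharmonization would presumably have to restore the backups in reverse order, which is one of the points the discussion would need to settle.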
Maybe we could also use this opportunity to think about a more elaborate renaming scheme than appending _original. I could imagine datasets where columns are named *_original, and then we don't have any harmonization at all. Maybe we can find a scheme that creates names that are random enough that no sane input dataset uses them as variable names, but that are also detectable and reversible.
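To make the idea concrete, here is a rough sketch of such a scheme. The reserved prefix and the function names (mangle, is_mangled, demangle, user_fields) are assumptions chosen for illustration, not an existing SaQC interface:

```python
import re

# Assumed reserved prefix; any marker that is unlikely to show up in real
# datasets would do.
PREFIX = "__saqc_tmp__"

def mangle(field):
    """Build the name of the intermediate variable belonging to field."""
    return PREFIX + field

def is_mangled(name):
    """Detect system generated names."""
    return name.startswith(PREFIX)

def demangle(name):
    """Recover the original field name from a mangled one."""
    if not is_mangled(name):
        raise ValueError(f"{name!r} is not a mangled name")
    return name[len(PREFIX):]

def user_fields(pattern, columns):
    """Regex selection that skips intermediate variables."""
    regex = re.compile(pattern)
    return [c for c in columns if not is_mangled(c) and regex.fullmatch(c)]

columns = ["x", "x_original", mangle("x")]
print(user_fields(".*", columns))  # intermediate variables are left out
print(demangle(mangle("x_original")))  # user columns ending in _original survive
```

A fixed prefix keeps the names detectable and reversible, but it is not random; whether that is unlikely enough to clash with real column names is open for discussion.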