Proper name mangling for system-generated temporary variables needed

Currently we generate intermediate/temporary variables during harmonization. They store the original/unchanged data in order to allow a later deharmonization. The naming scheme we use is rather simplistic, as we only append _original to a given field name. There are at least two problems with that:

1. Trying to harmonize a variable multiple times fails (see minimal example below).
2. Using regular expressions as field selectors might result in unexpected behavior, as a pattern like ".*" would also match all intermediate variables.
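
To illustrate the regular expression problem, here is a minimal sketch that does not use SaQC at all. The helper select_fields and the column names are hypothetical, and the actual field selection in SaQC may work differently; the point is only that a catch-all pattern cannot tell user columns from intermediate ones.

```python
import re

# Hypothetical column names: "x" and "y" are user data, "x_original" is the
# intermediate copy created during harmonization.
columns = ["x", "y", "x_original"]


def select_fields(pattern, names):
    """Return all names that are fully matched by the regular expression."""
    regex = re.compile(pattern)
    return [name for name in names if regex.fullmatch(name)]


# The catch-all pattern also selects the intermediate variable.
selected = select_fields(".*", columns)
print(selected)
```

With the current naming scheme, a user who writes ".*" to address "all my variables" would therefore also apply the function to x_original.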

So, we need to fix this soon! Either we come up with a proper name mangling scheme, or we try to get the intermediate variables out of SaQC._data. Both have their specific advantages and disadvantages, so I think we need a dedicated discussion for that.

Minimal failing example for 1.:

import pandas as pd
import numpy as np
from saqc import SaQC, SimpleFlagger

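# ten values on a regular 20-minute grid
data = pd.DataFrame(
    {"x": np.arange(10)},
    index=pd.date_range("2012-01-01", periods=10, freq="20Min"),
)
saqc = SaQC(data=data, flagger=SimpleFlagger())
saqc = (saqc
        # first harmonization: copies "x" to "x_original"
        .resampling.linear("x", freq="10Min")
        # second harmonization: fails, because "x_original" already exists
        .resampling.aggregate("x", freq="20Min", value_func=np.mean))
saqc.getResult()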

Error:

ValueError: x: field already exist

The problem is that all resampling functions call tools.copy to copy field to field + "_original", but tools.copy fails if field + "_original" already exists (and rightly so). I think we should either support multiple harmonizations of one variable (+1) or give a more meaningful error message (-1).
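
The mechanism can be reproduced without SaQC. The following is a toy model only: copy_field and harmonize are hypothetical stand-ins working on a plain dict, not the actual tools.copy or resampling code, and the error text is made up for the sketch.

```python
def copy_field(data, field, new_field):
    """Toy stand-in for the copy step: refuse to overwrite an existing column."""
    if new_field in data:
        raise ValueError(f"{new_field!r} already exists")
    data[new_field] = list(data[field])


def harmonize(data, field):
    """Toy harmonization: back up the original values, then modify the field."""
    copy_field(data, field, field + "_original")
    # placeholder for the actual resampling step
    data[field] = [value * 2 for value in data[field]]


data = {"x": [0, 1, 2]}
harmonize(data, "x")  # first call creates "x_original"

try:
    harmonize(data, "x")  # second call collides with the existing backup
    error = None
except ValueError as exc:
    error = str(exc)

print(error)
```

Because the backup name depends only on the field name, every second harmonization of the same field is bound to collide.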

Maybe we could also use this opportunity to think about a more elaborate renaming scheme than appending _original. I could imagine datasets where columns are already named *_original, and then we don't have any harmonization at all. Maybe we can find a scheme that creates names that are random enough that no sane input dataset uses them as variable names, but that are also detectable and reversible.
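
As a starting point for the discussion, here is one possible sketch of such a scheme. It uses a reserved prefix plus a counter instead of random characters, so the names stay detectable and reversible and a field can be backed up more than once. The prefix "__saqc_tmp__" and the functions mangle, is_mangled and demangle are assumptions made up for this example, nothing of this exists in SaQC.

```python
import re

# Assumption: prefix and layout are purely illustrative, not an agreed convention.
PREFIX = "__saqc_tmp__"
MANGLED = re.compile(r"^__saqc_tmp__(?P<field>.+)__(?P<level>\d+)$")


def mangle(field, existing):
    """Return a temporary name for `field` that is not yet in `existing`."""
    level = 0
    while f"{PREFIX}{field}__{level}" in existing:
        level += 1
    return f"{PREFIX}{field}__{level}"


def is_mangled(name):
    """Tell whether `name` follows the temporary naming layout."""
    return MANGLED.match(name) is not None


def demangle(name):
    """Recover the original field name and the backup level."""
    match = MANGLED.match(name)
    if match is None:
        raise ValueError(f"{name!r} is not a mangled name")
    return match.group("field"), int(match.group("level"))


existing = {"x", "x_original"}
first = mangle("x", existing)
existing.add(first)
second = mangle("x", existing)  # a second backup of the same field gets a new level
existing.add(second)

# Field selectors could skip temporaries before applying any user pattern.
user_fields = sorted(name for name in existing if not is_mangled(name))
print(first, second, user_fields)
```

A user column that happens to be called x_original is left alone here, and filtering with is_mangled before pattern matching would also address the ".*" problem. A dataset that uses the reserved prefix itself would still clash, so the prefix would have to be documented as reserved.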

Edited Mar 25, 2021 by David Schäfer