Behaviour of concatFlags when merging interpolated and original time series
@schmidle brought up the following topic in a discussion together with @palmb:
When using the common resampling - flagging - backprojecting workflow like in the following snippet:
qc = qc.shift("data", target="data_shift", method="fshift", freq="10Min")
qc = qc.flagRange("data_shift", max=1)
qc = qc.concatFlags("data_shift", target="data", method="inverse_fshift")
concatFlags only projects the given flags onto the nearest observation in the target time series. If SaQC.shift reduced several values within the freq window to a single value, concatFlags leaves the remaining values unflagged. Consider the following code:
import pandas as pd
from saqc import SaQC

df = pd.DataFrame(
    data=[0, 3, 5],
    columns=["data"],
    index=pd.to_datetime(
        [
            "2020-01-01 08:00",
            "2020-01-01 08:07",
            "2020-01-01 08:08",
        ]
    ),
)
qc = SaQC(df)
# shift onto a 10-minute grid: 08:07 and 08:08 fall into the same window
qc = qc.shift("data", freq="10Min", target="data_shift", method="fshift")
qc = qc.flagRange("data_shift", max=1)
# project the flags back onto the original timestamps
qc = qc.concatFlags("data_shift", target="data", method="inverse_fshift")
which produces the following output:
>>> qc.data
data | data_shift |
======================== | ============================== |
2020-01-01 08:00:00 0 | 2020-01-01 08:00:00 0 |
2020-01-01 08:07:00 3 | 2020-01-01 08:10:00 5 |
2020-01-01 08:08:00 5 | |
>>> qc.flags
data | data_shift |
========================== | ============================== |
2020-01-01 08:00:00 -inf | 2020-01-01 08:00:00 -inf |
2020-01-01 08:07:00 -inf | 2020-01-01 08:10:00 255.0 |
2020-01-01 08:08:00 255.0 | |
We can see that only the observation 2020-01-01 08:08:00 with value 5 of "data" got flagged, although the observation 2020-01-01 08:07:00 would also have met the flagRange criterion. This is unfortunate, but AFAICT not really solvable in a consistent manner. If we filled the entire projection interval, we would of course also flag 2020-01-01 08:07:00, but we can't be sure that this is the right thing to do either (if 2020-01-01 08:07:00 had the value 0, flagging it would not be correct). So we are basically stuck with something that works correctly in the sense that it only makes decidable decisions, but is nonetheless counterintuitive :-( To delegate the decision on what to do, we could add a switch to concatFlags that controls whether the entire interval or only the closest value receives the flag.
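The two options for such a switch can be illustrated in plain pandas. The following is only a sketch: project_flags and its fill_interval parameter are hypothetical names, not part of the SaQC API, and the backward-looking window (t - freq, t] is an assumption about how the forward shift assigns observations to grid points.

```python
import pandas as pd

UNFLAGGED = float("-inf")

# Original timestamps and the flags computed on the 10-minute grid
# (values copied from the example above: 255.0 = flagged, -inf = unflagged).
obs_index = pd.to_datetime(
    ["2020-01-01 08:00", "2020-01-01 08:07", "2020-01-01 08:08"]
)
grid_flags = pd.Series(
    [UNFLAGGED, 255.0],
    index=pd.to_datetime(["2020-01-01 08:00", "2020-01-01 08:10"]),
)
freq = pd.Timedelta("10min")


def project_flags(obs_index, grid_flags, freq, fill_interval=False):
    """Hypothetical helper: project grid flags back onto obs_index.

    fill_interval=False: only the last observation in (t - freq, t]
        receives the flag of grid point t.
    fill_interval=True: every observation in that window receives it.
    """
    out = pd.Series(UNFLAGGED, index=obs_index)
    for t, flag in grid_flags.items():
        # Assumed window: observations in (t - freq, t] belong to grid point t.
        in_window = (obs_index > t - freq) & (obs_index <= t)
        candidates = obs_index[in_window]
        if candidates.empty:
            continue
        targets = candidates if fill_interval else candidates[-1:]
        out.loc[targets] = flag
    return out


closest_only = project_flags(obs_index, grid_flags, freq)
whole_interval = project_flags(obs_index, grid_flags, freq, fill_interval=True)
```

With fill_interval=False the sketch flags only 2020-01-01 08:08:00, as in the output above; with fill_interval=True the observation at 2020-01-01 08:07:00 is flagged as well, whether or not its own value violates the range.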
One workaround I already mentioned is to use a finer shift frequency (a smaller freq, here 1Min instead of 10Min) so that no interval contains more than one value. This solves the issue, as the following snippet shows, but might lead to runtime or memory issues.
qc = SaQC(df)
qc = qc.shift("data", freq="1Min", target="data_shift", method="fshift")
qc = qc.flagRange("data_shift", max=1)
qc = qc.concatFlags("data_shift", target="data", method="inverse_fshift")
>>> qc.data
data | data_shift |
======================== | ============================== |
2020-01-01 08:00:00 0 | 2020-01-01 08:00:00 0.0 |
2020-01-01 08:07:00 3 | 2020-01-01 08:01:00 0.0 |
2020-01-01 08:08:00 5 | 2020-01-01 08:02:00 NaN |
| 2020-01-01 08:03:00 NaN |
| 2020-01-01 08:04:00 NaN |
| 2020-01-01 08:05:00 NaN |
| 2020-01-01 08:06:00 NaN |
| 2020-01-01 08:07:00 3.0 |
| 2020-01-01 08:08:00 5.0 |
>>> qc.flags
data | data_shift |
========================== | ============================== |
2020-01-01 08:00:00 -inf | 2020-01-01 08:00:00 -inf |
2020-01-01 08:07:00 255.0 | 2020-01-01 08:01:00 -inf |
2020-01-01 08:08:00 255.0 | 2020-01-01 08:02:00 -inf |
| 2020-01-01 08:03:00 -inf |
| 2020-01-01 08:04:00 -inf |
| 2020-01-01 08:05:00 -inf |
| 2020-01-01 08:06:00 -inf |
| 2020-01-01 08:07:00 255.0 |
| 2020-01-01 08:08:00 255.0 |
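Whether a given freq collapses several observations onto one grid point can be checked before shifting. A minimal sketch in plain pandas, assuming that each observation is assigned to the next grid point at or after its timestamp; max_obs_per_window is a hypothetical helper, not a SaQC function:

```python
import pandas as pd

index = pd.to_datetime(
    ["2020-01-01 08:00", "2020-01-01 08:07", "2020-01-01 08:08"]
)


def max_obs_per_window(index, freq):
    """Hypothetical helper: largest number of observations sharing one grid point."""
    # ceil maps each timestamp to the next grid point
    # (or to itself if it already lies on the grid).
    grid_points = index.ceil(freq)
    return int(pd.Series(grid_points).value_counts().max())


coarse = max_obs_per_window(index, "10min")  # 08:07 and 08:08 share 08:10
fine = max_obs_per_window(index, "1min")

# The smallest gap between consecutive observations is an upper bound for a
# freq that keeps every observation in its own window.
smallest_gap = pd.Series(index).diff().min()
```

A result greater than 1 means that back-projection to the closest observation will leave some values in that window unflagged; any freq no larger than smallest_gap avoids this under the stated assumption.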