Data Masking (@chs)
For the CHS pipeline we want to implement a masking tool that is easy to comprehend and to handle. There is currently a workflow by which custom masking for custom functions can be achieved in saqc. The problem with that workflow is that it is quite technical and verbose. So the point of this issue is to find a user-friendly way to wrap the workflow up.
Current workflow (assume we want to apply flagFunc onto a masked chunk of field):
1. proc_fork (make a copy of data[field], to be able to regain the masked flags later)
2. modelling_mask (in field: replace the flags of the values to mask with NaN)
3. flagFunc (apply one or a series of flagging functions to the masked data)
4. proc_projectFlags (project the flagging results for the masked data onto the original data, thereby regaining the original flags for the NaN flags)
5. drop (drop the masked version of data[field] from the data)
So that is quite a lot of saqc jargon to master in order to perform a task as common and incidental as masking is supposed to be.
Although points 1, 2, 4 and 5 can easily be wrapped up, point 3 cannot, because the names and arguments of the functions to be called on the masked data would need to be parameters of that wrapper.
The first and easiest solution I could think of would be to distribute two wrappers, mask and unmask. The workflow would be as follows:
1. mask (internally performs steps 1 and 2)
2. flagFunc
3. unmask (internally performs steps 4 and 5)
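To make the two-wrapper idea concrete, here is a minimal sketch in plain Python. It is not saqc's API: the flag values, the "_masked" naming scheme, the in_season helper, the flag_above stand-in for flagFunc and the function signatures are all assumptions made up for illustration. It also assumes that the forked copy is the one that gets masked.

```python
import math
from datetime import datetime, time

UNFLAGGED, BAD = 0.0, 255.0  # hypothetical flag values, not saqc's


def in_season(t, start, end):
    """True if time-of-day t lies in [start, end]; the window may wrap past midnight."""
    if start <= end:
        return start <= t <= end
    return t >= start or t <= end


def mask(index, flags, field, season_start, season_end):
    """Steps 1 and 2: fork flags[field], then set the in-season flags of the fork to NaN."""
    masked = field + "_masked"  # hypothetical naming scheme
    flags[masked] = [
        math.nan if in_season(ts.time(), season_start, season_end) else f
        for ts, f in zip(index, flags[field])
    ]
    return masked


def unmask(flags, field):
    """Steps 4 and 5: project the fork's flags back onto field, then drop the fork."""
    new_flags = flags.pop(field + "_masked")  # step 5: drop the masked copy
    flags[field] = [
        old if math.isnan(new) else new  # step 4: masked entries keep their original flag
        for old, new in zip(flags[field], new_flags)
    ]


def flag_above(values, flags, field, limit):
    """Toy stand-in for flagFunc: flag values above `limit`, skipping masked (NaN) flags."""
    flags[field] = [
        BAD if v > limit and not math.isnan(f) else f
        for v, f in zip(values, flags[field])
    ]


index = [datetime(2020, 1, 1, h) for h in (0, 6, 12, 18)]
values = [9.0, 1.0, 9.0, 1.0]
flags = {"x": [UNFLAGGED, BAD, UNFLAGGED, UNFLAGGED]}

masked = mask(index, flags, "x", time(0, 0), time(6, 0))
flag_above(values, flags, masked, limit=5.0)
unmask(flags, "x")
print(flags)
```

In this toy run the value at 00:00 exceeds the limit but stays unflagged because it was masked, the pre-existing flag at 06:00 survives, and only the value at 12:00 is newly flagged.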
Still, it is a little bit peculiar, since you would always have to explain to users why they must never forget to unmask after masking if they do not want their original (masked) flags to get lost.
The solution I prefer over this would be to introduce a new keyword, mask, to the **kwargs flow through the funcs, which would automatically trigger a preceding call of mask and a succeeding call of unmask for any function it is passed to.
I do not have enough core insight right now to assess whether such a thing would be straightforward to implement, but the idea would be that calling:
flagFunc(field, bla, bli, mask={"season_start": "24:00", "season_end": "06:00"})
results in
mask(field, season=mask)
flagFunc(field)
unmask(field, season=mask)
actually being called.
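One way this could be wired up, as a rough sketch only: a decorator that pops the mask keyword out of **kwargs and brackets the wrapped function with mask and unmask. The decorator name maskable, the call-recording list and the stub bodies of mask, unmask and flagFunc are hypothetical and only trace the call order. How saqc's core actually dispatches its functions is not considered here.

```python
import functools

calls = []  # records the call order, for illustration only


def mask(field, season):
    calls.append(("mask", field, season))  # stub for the mask wrapper


def unmask(field, season):
    calls.append(("unmask", field, season))  # stub for the unmask wrapper


def maskable(func):
    """Hypothetical decorator: let a flagging function accept an optional `mask` keyword."""

    @functools.wraps(func)
    def wrapper(field, *args, **kwargs):
        season = kwargs.pop("mask", None)  # the proposed keyword; absent means no masking
        if season is None:
            return func(field, *args, **kwargs)
        mask(field, season=season)
        try:
            return func(field, *args, **kwargs)
        finally:
            unmask(field, season=season)  # runs even if func raises

    return wrapper


@maskable
def flagFunc(field, bla=None, bli=None):
    calls.append(("flagFunc", field))  # stub for an arbitrary flagging function


# Season values copied from the call above and treated as opaque strings here.
season = {"season_start": "24:00", "season_end": "06:00"}
flagFunc("x", bla=1, bli=2, mask=season)
print([name for name, *_ in calls])
```

Putting unmask into a finally block means the user can no longer forget it, which is the main weakness of the two-wrapper variant.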