non-scalar fields implementation
A summary of the previous discussions (see #155 (closed), #196 (closed)) and a discussion of the implementation strategy.
Rationale
Currently we only allow scalar field values in our saqc functions, e.g.:
saqc.flagFoo(field="x")
This has two undesirable consequences:
- It is inconvenient if we want to apply the same test and parameters to several variables, as it leads to:
saqc.flagFoo(field="x").flagFoo(field="y").flagFoo(field="z")
- We cannot support multivariate functions in a straightforward way, as there is no way to specify multiple variables through the field parameter. Currently we work around this by ignoring field altogether and adding a separate function parameter, usually called fields, e.g.:
saqc.flagBar(field="dummy", fields=["x", "y", "z"])
Goal
The overarching goal of this issue and all related MRs is to allow non-scalar field values. In general, we need to distinguish two kinds of functions:
- univariate functions: expect scalar field and target
- multivariate functions: expect field and/or target to be non-scalar
Depending on the type of function and the arity of field and target, we end up with different semantics, which boil down to the following basic strategies:
- univariate functions: in case of non-scalar field/target values, the respective function will be called iteratively for every field-target pair.
- multivariate functions: get field/target directly.
The following table shows the consequences of this behavior in more detail:
| id | field | target | restrictions | univariate | multivariate |
|---|---|---|---|---|---|
| 1. | scalar | scalar | type(field) == type(target) == str | func(field, target): use field for the computation, write the result to target | func(field, target): use field for the computation, write the result to target (not a real multivariate use case, but a special case of 4.) |
| 2. | list | list | len(field) == len(target) | func(f, t) for f, t in zip(field, target): call func for every field-target pair | func(field, target): use all fields for the computation and write to all targets |
| 3. | scalar | list | - | func(field, t) for t in target: write the same results to different targets | func(field, target): write the same results to different targets (not a real multivariate use case, but a vectorized version of the univariate idea) |
| 4. | list | scalar | - | func(f, target) for f in field: compute all fields individually, always write to the same target (sort of nonsense, but not illegal) | func(field, target): use all fields for the computation, write the result to a single target |
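The four cases in the table can be sketched as a small helper that turns field and target into the list of calls to make. This is only an illustration under assumptions: the name expand_calls, the Names alias and the use of ValueError are hypothetical and not part of saqc.

```python
from typing import List, Tuple, Union

Names = Union[str, List[str]]


def expand_calls(field: Names, target: Names, multivariate: bool) -> List[Tuple[Names, Names]]:
    """Return the (field, target) arguments of every call to make.

    Hypothetical helper, not part of saqc: it only mirrors the table rows.
    """
    # Restriction of case 2: two lists must have the same length.
    if isinstance(field, list) and isinstance(target, list) and len(field) != len(target):
        raise ValueError("len(field) must equal len(target)")

    # Multivariate functions get field/target directly: exactly one call.
    if multivariate:
        return [(field, target)]

    # Univariate functions: broadcast a scalar against a list, then pair up.
    fields = field if isinstance(field, list) else [field]
    targets = target if isinstance(target, list) else [target]
    if len(fields) == 1:
        fields = fields * len(targets)
    if len(targets) == 1:
        targets = targets * len(fields)
    return list(zip(fields, targets))


print(expand_calls("x", "a", multivariate=False))                # case 1
print(expand_calls(["x", "y"], ["a", "b"], multivariate=False))  # case 2
print(expand_calls("x", ["a", "b"], multivariate=False))         # case 3
print(expand_calls(["x", "y"], "a", multivariate=False))         # case 4
print(expand_calls(["x", "y"], "a", multivariate=True))          # case 4, multivariate
```

Returning the planned calls instead of executing them keeps the sketch independent of any concrete test function; the real expansion would presumably invoke the function directly.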
Implementation
To implement the specified behavior, many parts of the system have to be touched and changed. That's why I'd like to break things down into several distinct work packages, most of them as separate MRs:
0. Optional: Generic doc strings
Rationale
The proposed implementation sequence implies repetitive changes to all doc strings (add a description for target, change the description for field), which is annoying and error-prone. To simplify the process and make the resulting docs more consistent, we could add generic doc strings to programmatically add descriptions to the common function parameters data, flags and field (see also #109 (closed)). This is not mandatory, though!
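One possible shape for such generic doc strings is a decorator that appends shared parameter descriptions to a function's own doc string. The following is a minimal sketch, not an existing saqc facility: the decorator name with_common_docs, the COMMON_PARAMS table and the description texts are all assumptions made for illustration.

```python
import inspect

# Hypothetical shared descriptions; the wording is an assumption, not saqc's.
COMMON_PARAMS = {
    "data": "The data container holding all variables.",
    "flags": "The flags container belonging to data.",
    "field": "Name(s) of the variable(s) to process.",
}


def with_common_docs(func):
    """Append descriptions of the common parameters found in func's signature."""
    names = inspect.signature(func).parameters
    lines = [f"{name} : {text}" for name, text in COMMON_PARAMS.items() if name in names]
    func.__doc__ = (func.__doc__ or "").rstrip() + "\n\nCommon parameters\n" + "\n".join(lines)
    return func


@with_common_docs
def flagFoo(data, field, flags, thresh=1.0):
    """Flag values above thresh."""
    return data, flags


print(flagFoo.__doc__)
```

With this approach, adding a description for target later would mean a single new entry in the shared table rather than an edit to every doc string.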
1. Make the target parameter explicit
Rationale
Currently we 'hide' the target parameter from the saqc functions, i.e. there is no target in the function signatures. The field/target semantics are implemented within the core, through a mechanism like:
if target != field:
    copy(field, target)
    field = target
saqc.flagFoo(field)
This is, however, only possible if we can generate a 1:1 mapping from field to target, which is not the case for field: List[str] and target: str (it is not possible to decide which of the fields we need to copy).
Implementation
- Add the target parameter to all functions and their doc strings
- Change all functions to write the processing result to data[target] and/or flags[target]
- Change the core target handling mechanism (as sketched above) to only ensure the target field is present in data and flags
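The split of responsibilities described in these three points could look roughly like the sketch below. It rests on several assumptions: plain dicts of lists stand in for the real data and flags containers, the names ensure_target and flag_foo are hypothetical, the NaN/False placeholders are an arbitrary choice for this sketch, and the parameter order follows the signature proposed in work package 4.

```python
from typing import Dict, List

# Plain dicts of lists stand in for saqc's data and flags containers (assumption).
Data = Dict[str, List[float]]
Flags = Dict[str, List[bool]]


def ensure_target(data: Data, flags: Flags, target: str, length: int) -> None:
    """Core side: only make sure target exists; nothing is copied from field."""
    # Filling with NaN / False is a placeholder choice made for this sketch.
    data.setdefault(target, [float("nan")] * length)
    flags.setdefault(target, [False] * length)


def flag_foo(data: Data, flags: Flags, field: str, target: str, thresh: float = 10.0):
    """Function side: read from field, write the result to flags[target]."""
    flags[target] = [value > thresh for value in data[field]]
    return data, flags


data = {"x": [1.0, 20.0, 3.0]}
flags = {"x": [False, False, False]}
ensure_target(data, flags, target="x_flagged", length=len(data["x"]))
data, flags = flag_foo(data, flags, field="x", target="x_flagged")
print(flags)
```

Because the function itself decides what to read and where to write, the core no longer has to guess which field to copy when field is a list and target is a scalar.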
2. Make field and target non-scalar
Rationale
Most of our functions are inherently univariate, i.e. they are implemented to work on a single pd.Series that is usually sliced out of the passed data: DictOfSeries right at the beginning of the function body. As we don't want to rewrite all functions (yet?), we need to implement a mechanism that converts something like:
saqc.flagFoo(field=["x", "y", "z"])
into three separate calls to flagFoo, one for each value in field.
There are however some multivariate functions, where a call like
saqc.flagBar(field=["x", "y", "z"])
should not be expanded like above. In order to separate both cases, we add another boolean parameter to @register that will be evaluated within the field expansion.
Implementation
- Change field: str and target: str to field: Union[str, List[str]] and target: Union[str, List[str]]
- Add a new @register parameter multivariate: bool = False
- Add the necessary field expansion logic to core.core and core.register
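How the multivariate switch might drive the expansion is sketched below. This is a simplified, hypothetical stand-in for @register, not the actual decorator: the wrapper signature, the default of target falling back to field, and the omission of length validation are all assumptions made to keep the example short.

```python
import functools
from typing import Callable, List, Optional, Union

Names = Union[str, List[str]]


def register(multivariate: bool = False) -> Callable:
    """Hypothetical stand-in for @register; the real decorator does more."""

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(data, flags, field: Names, target: Optional[Names] = None, **kwargs):
            target = field if target is None else target  # assumed default
            if multivariate:
                # Multivariate functions receive field/target unchanged.
                return func(data, flags, field, target, **kwargs)
            fields = field if isinstance(field, list) else [field]
            targets = target if isinstance(target, list) else [target]
            # Broadcast a single name against the longer list (no validation here).
            n = max(len(fields), len(targets))
            fields = fields * n if len(fields) == 1 else fields
            targets = targets * n if len(targets) == 1 else targets
            # Univariate functions are called once per field-target pair.
            for f, t in zip(fields, targets):
                data, flags = func(data, flags, f, t, **kwargs)
            return data, flags

        return wrapper

    return decorator


calls = []


@register()
def flagFoo(data, flags, field, target):
    calls.append(("flagFoo", field, target))
    return data, flags


@register(multivariate=True)
def flagBar(data, flags, field, target):
    calls.append(("flagBar", field, target))
    return data, flags


flagFoo({}, {}, field=["x", "y", "z"])
flagBar({}, {}, field=["x", "y", "z"])
print(calls)
```

Keeping the switch in the decorator means the existing univariate function bodies stay untouched; only multivariate functions have to handle list-valued arguments themselves.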
3. Make the generic functions consistent
Rationale
Currently the generic functions behave quite differently from the 'real' test functions, as they break the usual relationship between field and target. Here we treat field as the actual target: we write the results of the generic expression to field without necessarily using field in the computation, and instead infer the actual field (already in a non-scalar fashion) from the parameters of the function passed as func. To make things consistent, we came up with the following solution: the parameters of func will be mapped to the values of field, and the result of func will be written to target.
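The proposed mapping can be illustrated with a short sketch. It assumes that the parameters of func are bound positionally to the entries of field, that a plain dict of lists stands in for the data container, and that the name process_generic is hypothetical; none of this is confirmed by the text above.

```python
from typing import Callable, Dict, List


def process_generic(data: Dict[str, List[float]], field: List[str], target: str, func: Callable):
    """Hypothetical sketch: bind func's parameters to field in order, write to target."""
    # Each parameter of func receives the whole variable named in field.
    inputs = [data[name] for name in field]
    data[target] = func(*inputs)
    return data


def add(a: List[float], b: List[float]) -> List[float]:
    # Parameter names are free; only their position relative to field matters here.
    return [u + v for u, v in zip(a, b)]


data = {"x": [1.0, 2.0], "y": [10.0, 20.0]}
data = process_generic(data, field=["x", "y"], target="z", func=add)
print(data["z"])
```

With this convention the generic functions follow the same rule as every other function: field names the inputs and target names the output.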
4. Optional: Change the function signatures
Rationale
This is not a strictly necessary change and would probably end up as part of work package 1. The idea is to change the test function interface from
flagFoo(data, field, flags, *args, **kwargs)
to
flagFoo(data, flags, field, *args, **kwargs)
The only reason to tackle this long-standing annoyance of mine now is to avoid ending up with a function interface like
flagFoo(data, field, flags, target, *args, **kwargs)
and instead have the cleaner (both in terms of the API and the implementation) solution:
flagFoo(data, flags, field, target, *args, **kwargs)