non-scalar fields implementation
A summary of the previous discussions (see #155 (closed), #196 (closed)) and a proposal for the implementation strategy.
Rationale
Currently our saqc functions only accept scalar field values, e.g.:
saqc.flagFoo(field="x")
This has two undesirable consequences:
- It is inconvenient if we want to apply the same test and parameters to several variables, as it leads to:
  saqc.flagFoo(field="x").flagFoo(field="y").flagFoo(field="z")
- We cannot support multivariate functions in a straightforward way, as there is no possibility to specify multiple variables through the field parameter. Currently we solve this issue by ignoring field altogether and adding a separate function parameter, usually called fields, e.g.:
  saqc.flagBar(field="dummy", fields=["x", "y", "z"])
Goal
The overarching goal of this issue and all related MRs is to allow non-scalar field values. In general we need to distinguish two kinds of functions:
- univariate functions: expect scalar field and target
- multivariate functions: expect field and/or target to be non-scalar
Depending on the type of function and the arity of field and target we end up with different semantics, which boil down to the following basic strategies:
- univariate functions: in case of non-scalar field/target values, the respective function will be called iteratively for every field-target pair.
- multivariate functions: get field/target directly.
The following table shows the consequences of this behavior in more detail
| id | field | target | restrictions | univariate | multivariate |
|---|---|---|---|---|---|
| 1. | scalar | scalar | type(field) == type(target) == str | func(field, target): use field for the computation, write the result to target | func(field, target): use field for the computation, write the result to target (not a real multivariate use case, but a special case of 4.) |
| 2. | list | list | len(field) == len(target) | func(f, t) for f, t in zip(field, target): call func for every field-target pair | func(field, target): use all fields for the computation and write to all targets |
| 3. | scalar | list | - | func(field, t) for t in target: write the same result to different targets | func(field, target): write the same result to different targets (not a real multivariate use case, but a vectorized version of the univariate idea) |
| 4. | list | scalar | - | func(f, target) for f in field: compute all fields individually, always write to the same target (sort of nonsense, but not illegal) | func(field, target): use all fields for the computation, write the result to a single target |
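The univariate column of the table can be read as a pairing rule. The following is only a minimal sketch of that rule, not the saqc implementation: expand_pairs and FieldLike are hypothetical names, and raising a ValueError for lists of unequal length is an assumption about how the restriction in row 2 might be enforced.

```python
from typing import List, Tuple, Union

# Hypothetical alias: a single variable name or a list of names.
FieldLike = Union[str, List[str]]


def expand_pairs(field: FieldLike, target: FieldLike) -> List[Tuple[str, str]]:
    """Return the (field, target) pairs a univariate function is called with."""
    fields = [field] if isinstance(field, str) else list(field)
    targets = [target] if isinstance(target, str) else list(target)

    if isinstance(field, str) or isinstance(target, str):
        # Rows 1, 3 and 4: at least one side is scalar, so it is combined
        # with every value of the other side.
        return [(f, t) for f in fields for t in targets]

    # Row 2: both sides are lists and are paired element-wise.
    if len(fields) != len(targets):
        raise ValueError("field and target must have the same length")
    return list(zip(fields, targets))
```

A multivariate function would skip this step entirely and receive field and target as passed.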
Implementation
Implementing the specified behavior touches many parts of the system. That's why I'd like to break things down into several distinct work packages, most of them as separate MRs:
0. Optional: Generic doc strings
Rationale
The proposed implementation sequence implies repetitive changes to all doc strings (add a description for target, change the description for field), which is annoying and error-prone. To simplify the process and make the resulting docs more consistent, we could use generic doc strings that programmatically add the descriptions of the common function parameters data, flags and field (see also #109 (closed)). This is not mandatory, though!
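One possible shape for such generic doc strings is a decorator that appends shared parameter descriptions. This is a rough sketch under assumptions: add_common_docs and COMMON_PARAMETERS are hypothetical names, the description texts are placeholders, and the layout of the appended section is arbitrary.

```python
from typing import Callable

# Hypothetical shared descriptions; the wording is illustrative only.
COMMON_PARAMETERS = {
    "data": "The data container holding all variables.",
    "flags": "The flags container matching data.",
    "field": "Name of the variable to process.",
}


def add_common_docs(func: Callable) -> Callable:
    """Append the shared parameter descriptions to the doc string of func."""
    entries = "\n".join(
        f"{name}\n    {text}" for name, text in COMMON_PARAMETERS.items()
    )
    original = (func.__doc__ or "").rstrip()
    func.__doc__ = original + "\n\nCommon parameters\n" + entries + "\n"
    return func


@add_common_docs
def flagFoo(data, field, flags, **kwargs):
    """Flag all foo values."""
    return data, flags
```

With this approach, a later change to the description of field (or the addition of target) would only need to happen in one place.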
1. Make the target parameter explicit
Rationale
Currently we 'hide' the target parameter from the saqc functions, i.e. there is no target in the function signatures. The field/target semantics are implemented within the core, through a mechanism like:
if target != field:
    copy(field, target)
    field = target
saqc.flagFoo(field)
This is, however, only possible if we can generate a 1:1 mapping from field to target, which is not the case for field: List[str] and target: str (it is not possible to decide which of the fields we need to copy).
Implementation
- Add the target parameter to all functions and their doc strings
- Change all functions to write the processing result to data[target] and/or flags[target]
- Change the core target handling mechanism (as sketched above) to only ensure the target field is present in data and flags
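The split of responsibilities described in these steps could look roughly like the sketch below. It uses plain dictionaries of lists instead of the real data and flags containers, and ensure_target and flag_above are hypothetical names. Initialising a missing target as a copy of field is an assumption, as the steps above only require that target exists.

```python
from typing import Dict, List, Tuple

# Simplified stand-ins for the real data and flags containers.
Data = Dict[str, List[float]]
Flags = Dict[str, List[bool]]


def ensure_target(data: Data, flags: Flags, field: str, target: str) -> None:
    """Core side: only make sure that target exists in data and flags."""
    if target not in data:
        # Assumption: a missing target starts out as a copy of field.
        data[target] = list(data[field])
        flags[target] = list(flags[field])


def flag_above(
    data: Data, field: str, flags: Flags, target: str, limit: float
) -> Tuple[Data, Flags]:
    """Function side: read from field, write the result to target explicitly."""
    flags[target] = [value > limit for value in data[field]]
    return data, flags


data = {"x": [1.0, 5.0, 2.0]}
flags = {"x": [False, False, False]}
ensure_target(data, flags, field="x", target="x_flagged")
data, flags = flag_above(data, "x", flags, target="x_flagged", limit=3.0)
```

The core no longer rebinds field to target; the function itself decides where the result goes, which also works when field is a list and target is a single name.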
2. Make field and target non-scalar
Rationale
Most of our functions are inherently univariate, i.e. they are implemented to work on a single pd.Series that is usually sliced out of the passed data: DictOfSeries right at the beginning of the function body. As we don't want to rewrite all functions (yet?), we need to implement a mechanism that converts something like:
saqc.flagFoo(field=["x", "y", "z"])
into three separate calls to flagFoo, one for each value in field.
There are, however, some multivariate functions where a call like
saqc.flagBar(field=["x", "y", "z"])
should not be expanded like above. In order to separate both cases, we add another boolean parameter to @register, that will be evaluated within the field expansion.
Implementation
- Change field: str and target: str to field: Union[str, List[str]] and target: Union[str, List[str]]
- Add a new @register parameter multivariate: bool = False
- Add the necessary field expansion logic to core.core and core.register
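How the multivariate switch might steer the field expansion is sketched below. This is a simplified stand-in, not the actual @register from saqc: it leaves out target handling and every other responsibility of the real decorator, and the function bodies only record their calls.

```python
from functools import wraps
from typing import Callable, List, Union

FieldLike = Union[str, List[str]]


def register(multivariate: bool = False) -> Callable:
    """Hypothetical stand-in for @register, reduced to the expansion logic."""

    def decorator(func: Callable) -> Callable:
        @wraps(func)
        def wrapper(data, field: FieldLike, flags, **kwargs):
            if multivariate or isinstance(field, str):
                # Multivariate functions get field as passed;
                # scalar fields need no expansion.
                return func(data, field, flags, **kwargs)
            # Univariate function with a list: one call per value in field.
            for f in field:
                data, flags = func(data, f, flags, **kwargs)
            return data, flags

        return wrapper

    return decorator


calls = []  # records (function name, field) for illustration


@register()
def flagFoo(data, field, flags, **kwargs):
    calls.append(("flagFoo", field))
    return data, flags


@register(multivariate=True)
def flagBar(data, field, flags, **kwargs):
    calls.append(("flagBar", field))
    return data, flags


flagFoo({}, ["x", "y", "z"], {})  # expanded into three calls
flagBar({}, ["x", "y", "z"], {})  # called once with the full list
```

Defaulting multivariate to False means existing univariate functions pick up the expansion without any change to their bodies.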
3. Make the generic functions consistent
Rationale
Currently the generic functions behave quite differently from the 'real' test functions, as they break the usual relationship between field and target. Here we treat field as the actual target: we write the result of the generic expression to field without necessarily using field in the computation, and instead infer the input fields (already in a non-scalar fashion) from the parameters of the func argument. To make things consistent, we came up with the following solution: the parameters of func will be mapped to the values of field, and the result of func will be written to target.
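The proposed semantics for generic functions can be illustrated with a small sketch. apply_generic is a hypothetical helper, plain lists stand in for the real series, and mapping the values of field to the parameters of func by position is an assumption, since the text above does not state how the mapping is done.

```python
from typing import Callable, Dict, List

# Simplified stand-in for the real data container.
Data = Dict[str, List[float]]


def apply_generic(data: Data, field: List[str], target: str, func: Callable) -> Data:
    """Pass data[field[i]] as the i-th argument of func, write the result to target."""
    args = [data[name] for name in field]
    data[target] = func(*args)
    return data


data = {"x": [1.0, 2.0], "y": [10.0, 20.0]}
data = apply_generic(
    data,
    field=["x", "y"],
    target="z",
    func=lambda a, b: [u + v for u, v in zip(a, b)],
)
```

Here field names the inputs and target names the output, which matches the relationship used by the regular test functions.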
4. Optional: Change the function signatures
Rationale
This is not a strictly necessary change and would probably end up as part of work package 1. The idea is to change the test function interface from
flagFoo(data, field, flags, *args, **kwargs)
to
flagFoo(data, flags, field, *args, **kwargs)
The only reason to attack this long-standing annoyance of mine now is to avoid ending up with a function interface like
flagFoo(data, field, flags, target, *args, **kwargs)
and instead have the cleaner (both in terms of the API and the implementation) solution:
flagFoo(data, flags, field, target, *args, **kwargs)