non-scalar fields implementation

A summary of the previous discussions (see #155 (closed) , #196 (closed)) and a discussion of the implementation strategy.

Rational

Currently we only allow scalar field values to our saqc-functions, e.g.:

saqc.flagFoo(field="x")

This has two undesirable consequences:

  1. It is inconvenient, if we want to apply the same test and parameters to several variables, as it leads to:
    saqc.flagFoo(field="x").flagFoo(field="y").flagFoo(field="z")
  2. We cannot support multivariate functions in a straight-forward way, as there is no possibility to specify multiple variables through the field parameter. Currently we solve this issue, by ignoring field altogether and by adding a separate function parameter, usually called fields. E.g.:
    saqc.flagBar(field="dummy", fields=["x", "y", "z"])

Goal

The overarching goal of this issue and all related MR is to allow non-scalar field values. In general we need to distinct two different kind of functions:

  • univariate functions: expect scalar field and target
  • multivariate functions: expect field and/or target to be non-scalar

Depending on the type of function and the arity of field and target we end up with different semantics, which boil down two the following basic strategies:

  • univariate functions: in case of non-scalar field/target values, the respective function will be called iteratively for every field-target pair.
  • multivariate functions: get field/target directly.

The following table shows the consequences of this behavior in more detail

id field target restrictions univariate mutivariate
1. scalar scalar type(field) == type(target) == str func(field, target): use field for computation write result to target func(field, target): use field for computation write result to target (not a real multivariate use case but a special case of 4.)
2. list list len(field) == len(target) func(f, t) for zip(field, target): call func for every target-field pair func(field, target): use all fields for the computation and write to all targets
3. scalar list - func(field, t) for t in target: write the same results to different targets func(field, target): write the same results to different targets (not a real multivariate use case, but a vectorized version the univariate idea)
4. list scalar - func(f, target) for f in field: compute all fields individually, write to always the same target (sort of nonsense, but not illegal) func(field, target): use all fields for computation, write result to a single target

Implementation

To implement the specified behavior, many parts of the system have to be touched and changed, that's why I'd like to break things down into several distinct work packages, most of them as separate MR:

0. Optional: Generic doc strings

Rational

The proposed implementation sequence implies repetitive changes to all doc strings (add description for target, change description for field), which is annoying and error prone. To simplify the process and make the resulting docs more consistent, we could add generic doc strings to programmatically add descriptions to the common function parameters data, flags and field (see also #109 (closed)). This is not mandatory, though!

1. Make the target parameter explicit

Rational

Currently we 'hide' the target parameter from the saqc functions, i.e. there is no target in the function signatures. The field/target semantics are implemented within the core, through a mechanism like:

if target != field:
    copy(field, target)
    field = target
saqc.flagFoo(field)

This is however only possible, if we can generate a 1:1 mappings from field to target, which is not the case for field: List[str]and target: str (it is not possible o decide which of the fields we need to copy).

Implementation

  • Add the target parameter to all function and their doc strings
  • Change all functions to write the processing result to data[target] and or flags[target]
  • Change the core target handling mechanism (as sketched above) to only ensure the target field is present in data and flags

2. Make field and target non-scalar

Rational

Most of our functions are inherently uni-variate, i.e. they are implemented to work on a single pd.Series that is usally sliced out of the passed data: DictOfSeries right at the beginning of the function body. As we don't want to rewrite all functions (yet?), we need to implement a mechanism, that converts something like:

saqc.flagFoo(field=["x", "y", "z"])

into three separate calls to flagFoo, one for each value in field

There are however some multivariate functions, where a call like

saqc.flagBar(field=["x", "y", "z"])

should not be expanded like above. In order to separate both cases, we add another boolean parameter to @register, that will be evaluated within the field expansion.

Implementation

  • Change field: str and target: str to field: Union[str, List[str]] and target: Union[str, List[str]]
  • Add a new @register parameter multivariate: bool = False
  • Add the necessary field expansion logic to core.core and core.register

3. Make the generic functions consistent

Rational

Currently the generic functions behave quite different from the 'real' test functions, as they break the usual relationship between field and target. Here we treat field as the actual target, as we write the results of the generic expression to field without necessarily using the field in the computation and instead infer field (already in a non-scalar fashion) from the parameters of the func parameter. To make things consistent, we came up with the following solution: Parameters of func will be mapped to the values of field, the result of func will be written to target.

4. Optional: Change the function signatures

Rational

This is not a strictly necessary change and would probably end up as part of work package 1. The idea is, to change the test function interface from

flagFoo(data, field, flags, *args, **kwargs)

to

flagFoo(data, flags, field, *args, **kwargs)

The only reason to attack this long standing annoyance of mine now, is to not end up with a function interface like

flagFoo(data, field, flags, target, *args, **kwargs)

and instead have the cleaner (both in terms of the API and the implementation) solution:

flagFoo(data, flags, field, target, *args, **kwargs)
Edited by David Schäfer