Dataprocessing features

Merged Peter Lünenschloß requested to merge dataprocessing_features into develop

Major changes:

HARMONIZATION

  • no more heap
  • The harmonization module is now a pure wrapper module: its functions deliver the classic harmonization look and feel. The harm_something2grid operators are available under their old names and signatures. Additionally, there is the harm_deharmonize wrapper.
  • The wrappers are composed as I recently suggested in the small module review session. Although the composed elements mainly do bookkeeping tasks, I hope the improved transparency and intuitiveness are already visible.
  • All config fields set up with the old harmonization functions should just continue to work.

DATAPROCESSING MODULE / TS OPERATORS

  • The dataprocessing module now gathers functions used to "process" data. This includes interpolation, resampling, transformation, projection, dropping, ...
  • One major improvement/change is that every kind of aggregation and resampling (which includes harmonization), of both data and flags series, is now delegated to one function (aggregate2freq) in the ts_operators module. Both data and flags are resampled/aggregated in that function by the very same mechanism. So there is exactly one central point where, for example, changes or parameters controlling validation and aggregation behavior have to take effect.
  • The same holds for all kinds of interpolation: all interpolation (mainly of inserted frequency grids and of NaN values already present in the data) is done by the interpolateNANs method in ts_operators.
  • To tackle the upcoming tasks of counting NaNs and/or BAD flagged values per aggregation/resampling interval, and of making resampling/aggregation depend on the results of that validation, there are now parameters max_total_valid and max_consec_valid in proc_resample to control this behavior. Passing a numpy nanfunc or a pandas func will no longer give differing results, because all funcs get passed only valid values, and the whole masking/validation is done in a separate processing step inside aggregate2freq, by a call to a small helper function living in ts_operators. So validation granularity is increased and ambiguity is reduced.
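
A minimal sketch of the "validate first, then aggregate only valid values" idea from the last bullet. This is not the actual aggregate2freq code: the function names and the parameters max_invalid_total and max_invalid_consec are hypothetical stand-ins chosen for illustration, and only NaN is treated as invalid here (no flags).

```python
import numpy as np
import pandas as pd


def max_consecutive_true(mask: np.ndarray) -> int:
    """Length of the longest run of True values in a boolean array."""
    longest = current = 0
    for is_set in mask:
        current = current + 1 if is_set else 0
        longest = max(longest, current)
    return longest


def aggregate_valid_only(data, freq, func, max_invalid_total=None, max_invalid_consec=None):
    """Hypothetical sketch: validate each interval, then aggregate its valid values."""

    def agg(chunk):
        bad = chunk.isna().to_numpy()
        # Validation step: reject the whole interval if it holds too many invalid values.
        if max_invalid_total is not None and bad.sum() > max_invalid_total:
            return np.nan
        if max_invalid_consec is not None and max_consecutive_true(bad) > max_invalid_consec:
            return np.nan
        # Aggregation step: func only ever sees valid values.
        valid = chunk[~bad]
        return func(valid) if len(valid) else np.nan

    return data.resample(freq).apply(agg)


index = pd.date_range("2020-01-01", periods=8, freq="10min")
data = pd.Series([1.0, np.nan, 3.0, 4.0, np.nan, np.nan, 7.0, 8.0], index=index)

with_numpy = aggregate_valid_only(data, "40min", np.nanmean)
with_pandas = aggregate_valid_only(data, "40min", lambda s: s.mean())
strict = aggregate_valid_only(data, "40min", np.mean, max_invalid_consec=1)
```

Because the masking happens before func is called, the NaN-aware and the plain aggregation function give the same result in this sketch, and the second interval is rejected in the strict call because it contains two consecutive NaNs.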

TESTING / DOCUMENTATION

  • I rewrote the old harmonization tests so that they would apply to the wrapper. As a result, the new harmonization mechanism is tested at the same level as the old.
  • dataprocessing is tested.
  • I added documentation to the data processing functions, in a syntax I copied from the dios project.

QUESTIONS / DISCUSSION

  • @schaefed - I wanted to make the ts_operators last, first and count available in the config file. They are mainly dummies to trigger resample.count(), resample.first(), etc. when passed to proc_resample, so I added them to the visitor's environment. That is the right place to do that, isn't it? Somehow I wasn't sure.
Edited by Peter Lünenschloß

Merge request reports

Pipeline #5303 passed

Pipeline passed for ddc6d65a on dataprocessing_features

Merged by Peter Lünenschloß 4 years ago (Jul 7, 2020 10:48am UTC)

Merge details

  • Changes merged into develop with 07532ac6 (commits were squashed).
  • Deleted the source branch.
  • Auto-merge enabled

Pipeline #5304 passed

Pipeline passed for 07532ac6 on develop

Activity

for f in drop_flags:
    drop_mask |= flagger.isFlagged(field, flag=f)
return drop_mask
    • Author Owner

      ah yes, thought so too.

      But there is only pd.Index.rename, which renames the name of an index, not the index entries. There is also pd.DataFrame.rename, which renames an entry in a DataFrame's columns, but we have dios columns, which are of type Index. An Index is not directly mutable, and I found no built-in method doing the job.
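
One possible workaround, as a sketch only: since a pd.Index cannot be changed in place, build a new Index with the entry swapped. The helper name rename_index_entry and the example labels are hypothetical and not part of pandas, dios or saqc.

```python
import pandas as pd


def rename_index_entry(index: pd.Index, old, new) -> pd.Index:
    """Return a new Index in which every entry equal to `old` is replaced by `new`."""
    return pd.Index([new if entry == old else entry for entry in index], name=index.name)


columns = pd.Index(["temp", "humidity", "pressure"], name="fields")
renamed = rename_index_entry(columns, "humidity", "rel_humidity")
```

The original Index stays untouched, so the result has to be assigned back to whatever object holds the columns.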

  • added 2 commits

    • 7c314c6b - documentation clean up
    • 3e58c7b9 - added 'match to proc_resample - going home now'

  • Bert Palm @palmb started a thread on the diff
  • flagscol.drop(flagscol[drop_mask].index, inplace=True)

    # hack ahead! Resampling with overlapping intervals:
    # 1. -> no rolling over categories allowed in pandas, so we translate manually:
    cats = pd.CategoricalIndex(flagger.dtype.categories, ordered=True)
    cats_dict = {cats[i]: i for i in range(0, len(cats))}
    flagscol = flagscol.replace(cats_dict)
    # 3. -> combine resample+rolling to resample with overlapping intervals:
    flagscol = flagscol.resample(freq).max()
    initial = flagscol[0]
    flagscol = flagscol.rolling(2, center=True, closed="neither").max()
    flagscol[0] = initial
    cats_dict = {num: key for (key, num) in cats_dict.items()}
    flagscol = flagscol.astype(int, errors='ignore').replace(cats_dict)
    flagscol[flagscol.isna()] = empty_intervals_flag
    # ...hack done
    • Maybe a bit less hackish, and I guess less expensive (if that plays a role):

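      # translate the flags to integer codes
      codes, cats = flagscol.factorize()
      intflags = pd.Series(codes, index=flagscol.index)

      # do the rolling stuff
      # ...

      # translate the integer codes back to flags, keeping the index
      remap = pd.Categorical.from_codes(intflags, cats)
      flagscol = pd.Series(remap, index=intflags.index)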
      Edited by Bert Palm
    • codes, cats = flagscol.factorize()

      ah yes! never heard of that one. Will implement.

      less expensive (if this plays a role)

      hm - yes i think it does

    • I didn't change to .factorize() because, on the one hand, it is not noticeably faster, and on the other hand, it also gets hacky with .factorize(): factorization only covers the values present in flagscol. So, when translating the codes back, one would have to add codes for flags that were not present in flagscol when factorizing but are part of the flags dtype.

      So, if it is OK, I would just leave the snippet as it is. Objections?
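
To illustrate the point about factorization, here is a small sketch. The flag names, their order and the resampling frequency are assumptions made for the example (the real categories come from flagger.dtype), and it uses the categorical codes of the full dtype as a possible alternative, not what the merge request implements.

```python
import pandas as pd

# Hypothetical ordered flag categories; "BAD" never occurs in the data below.
flag_dtype = pd.CategoricalDtype(["UNFLAGGED", "GOOD", "BAD"], ordered=True)
index = pd.date_range("2020-01-01", periods=4, freq="10min")
flagscol = pd.Series(
    ["UNFLAGGED", "GOOD", "UNFLAGGED", "UNFLAGGED"], index=index
).astype(flag_dtype)

# factorize() only numbers the values that actually occur in the series.
codes, uniques = flagscol.factorize()

# .cat.codes numbers against the full dtype, so absent flags keep their code.
intflags = flagscol.cat.codes

# Aggregate on the integers, e.g. the worst flag per 20 minute interval.
worst = intflags.resample("20min").max()

# Translate back to flags; all categories of the dtype are still available.
restored = pd.Series(
    pd.Categorical.from_codes(worst.to_numpy(), dtype=flag_dtype),
    index=worst.index,
)
```

Empty intervals are not covered by this sketch: they would produce NaN after resampling and need a separate fill step, as with empty_intervals_flag in the diff above.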

  • added 14 commits

    • 45baa07c - implemented polynomial modelling function/initialised modelling module/added...
    • 0a813ccf - bfx@modelling_polyFit
    • ed8fed6d - poly fitting function implemented and documented
    • ab0c62ca - Grubbs outlier test implemented
    • ba2a9fd1 - grubs test documented
    • dc26a553 - added outlier utils to the requirements
    • c80b919e - test module for data modelling function under construction
    • 206cefc9 - data modelling test module implemented/ bugfixes in polynomial fit
    • 6c5a9425 - test for grubbs test implemented
    • 49fff707 - Update requirements.txt
    • 2072113b - Merge branch 'modellingModule' of https://git.ufz.de/rdm-software/saqc into modellingModule
    • 89e68c3a - solved outlier package requiremnts confusion
    • 0ebca7df - added modelling module to the init
    • e59bf348 - still requirements confusion

  • added 1 commit

    • 3035e92c - drift correcture implemented

  • added 1 commit

  • The harmonization - deharmonization concept isn't very clear to me.

    Currently I try to undo a harm_linear2Grid(). There I do not need to specify a method, and it uses method='time' under the hood, but I don't get the deharmonization to run properly, as I don't know the correct value to pass to method=... 'inverse_time'? Do I miss something 😕?

  • This is a very big merge request. I suggest to not add more here (except fixes), and do further work on a new branch forked from this, because it gets harder and harder to review.

    Edited by Bert Palm
    • 'inverse_time' ? do i miss something

      The keyword to pass would be "inverse_nagg" - that is a special case for interpolations. I will add an "inverse interpolation" keyword that maps onto "inverse_nagg". For the other harmonization methods, I think the inversion method to choose is quite clear, since their method keywords can just be prefixed with "inverse_" to get the inversion method.
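
The naming rule described here can be sketched as a tiny lookup. The helper inverse_method and its interpolation_methods argument are hypothetical and not part of saqc; the only facts taken from this thread are that method='time' is an interpolation, that interpolations invert with "inverse_nagg", and that other methods get the "inverse_" prefix.

```python
def inverse_method(method: str, interpolation_methods=frozenset({"time"})) -> str:
    """Hypothetical helper: map a harmonization method keyword to its inversion keyword."""
    # Interpolations are the special case: they are all inverted via "inverse_nagg".
    if method in interpolation_methods:
        return "inverse_nagg"
    # Every other method keyword is simply prefixed with "inverse_".
    return f"inverse_{method}"


linear_inverse = inverse_method("time")
nagg_inverse = inverse_method("nagg")
```

Both calls return "inverse_nagg" here, which is why 'inverse_time' is not the value to pass when undoing harm_linear2Grid().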

      The harmonization - deharmonization concept isn't very clear to me.

      Anything else unclear? Maybe we can have a video call on that, or talk in person.

      i suggest to not add more here (except fixes), and do further work on a new branch forked from this

      Okidoki - but it makes me nervous to work on a somewhat unauthorized branch. Also, having to merge develop repeatedly and solve merge conflicts again and again gets kind of annoying in the long run. So maybe we could meet and go through the request together, to bring this thing to an end soon?

      Edited by Peter Lünenschloß
    • I'll try to do the review today; if I get stuck, I'll call for a meeting :D
