Dataprocessing features
Major changes:
- no more heap
- The harmonization module is now a pure wrapper, composed of module functions, that delivers the classic harmonization look and feel. The harm_something2grid operators are available under their old names and signatures. Additionally, there is the harm_deharmonize wrapper.
- The wrappers are composed as I recently suggested in the small moduling review session. Although the composed elements mainly do bookkeeping tasks, I hope the better level of transparency and intuitiveness is already visible.
- All config fields, set up with the old harmonization functions, should just continue to work
DATAPROCESSING MODULE / TS OPERATORS
- The dataprocessing module now gathers functions used to "process" data. This includes interpolation, resampling, transformation, projection, dropping,...
- One major improvement/change is that every kind of aggregation and resampling (which includes harmonization), of both data and flags series, is now delegated to one function (aggregate2freq) in the ts_operators module, and both data and flags are resampled/aggregated in that function by the very same mechanism. So there is one and only one central point where, for example, changes to the parameters controlling the validation/aggregation behavior would have to be made.
- The same holds for all kinds of interpolation: all interpolation (mainly interpolation of inserted frequency grids and of nan values already present in the data) is done by the interpolateNANs method in ts_operators.
- To tackle the upcoming tasks of counting nans and/or BAD flagged values per aggregation/resampling interval, and to make resampling/aggregation depend on the results of that validation, there are now the parameters max_total_valid and max_consec_valid in proc_resample to control this behavior. Passing a numpy nanfunc or a pandas func will now give no differing results, because all funcs get passed only valid values and the whole masking/validation is done in a separate processing step inside aggregate2freq, by a call to a small helper living in ts_operators. So validation granularity is increased and ambiguity is mitigated.
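The "validate first, then aggregate only valid values" idea can be sketched roughly as follows. This is not the aggregate2freq implementation. The function name `aggregate_valid_only` and the parameter `max_invalid_total` are hypothetical stand-ins, and the assumed semantics (an interval is set to NaN when it holds more invalid values than the threshold) may differ from what max_total_valid actually does.

```python
import numpy as np
import pandas as pd


def aggregate_valid_only(data, freq, agg_func, max_invalid_total=None):
    """Aggregate `data` to `freq`, handing only valid (non-NaN) values to `agg_func`.

    `max_invalid_total` is a hypothetical threshold: intervals containing more
    invalid values than this are set to NaN instead of being aggregated.
    """

    def _aggregate(chunk):
        valid = chunk.dropna()
        n_invalid = len(chunk) - len(valid)
        too_many_invalid = max_invalid_total is not None and n_invalid > max_invalid_total
        if valid.empty or too_many_invalid:
            return np.nan
        # agg_func never sees a NaN, so nan-aware and plain functions agree
        return agg_func(valid.to_numpy())

    return data.resample(freq).apply(_aggregate)


index = pd.date_range("2020-01-01", periods=6, freq="5min")
data = pd.Series([1.0, np.nan, 3.0, 4.0, np.nan, np.nan], index=index)

with_mean = aggregate_valid_only(data, "10min", np.mean)
with_nanmean = aggregate_valid_only(data, "10min", np.nanmean)
strict = aggregate_valid_only(data, "10min", np.mean, max_invalid_total=0)
```

Because the masking happens before the aggregation function is called, `np.mean` and `np.nanmean` give the same result here, and the threshold only decides whether an interval is aggregated at all.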
TESTING / DOCUMENTATION
- I rewrote the old harmonization tests so that they would apply to the wrapper. As a result, the new harmonization mechanism is tested at the same level as the old.
- dataprocessing is tested.
- I added documentation to the data processing functions in a syntax i copied from the dios project.
QUESTIONS / DISCUSSION
- @schaefed - I wanted to make the ts_operators last, first and count available in the config file. They are mainly dummies to trigger resample.count(), resample.first(), etc. when passed to proc_resample, so I added them to the visitor's environment. Uhm, that is the right place to do that, isn't it? Somehow I wasn't sure.
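A minimal sketch of how such dummy functions could trigger the built-in resampler methods, assuming a dispatch on function identity. The helper `resample_with` and the lookup table are hypothetical and not taken from proc_resample.

```python
import pandas as pd


def count(values):
    """Dummy: only used as a marker, never called on data."""
    raise NotImplementedError


def first(values):
    """Dummy: only used as a marker, never called on data."""
    raise NotImplementedError


def last(values):
    """Dummy: only used as a marker, never called on data."""
    raise NotImplementedError


# Hypothetical lookup: marker function -> name of the pandas resampler method
_RESAMPLER_SHORTCUTS = {count: "count", first: "first", last: "last"}


def resample_with(series, freq, func):
    """Resample `series`, using the built-in resampler method for marker functions."""
    resampler = series.resample(freq)
    shortcut = _RESAMPLER_SHORTCUTS.get(func)
    if shortcut is not None:
        return getattr(resampler, shortcut)()
    return resampler.apply(func)


index = pd.date_range("2020-01-01", periods=4, freq="5min")
series = pd.Series([1.0, 2.0, 3.0, 4.0], index=index)

counts = resample_with(series, "10min", count)
firsts = resample_with(series, "10min", first)
```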
Activity
added internal architecture label
added 1 commit
- 29fe3072 - updated deharm to return deharmed data at 'field'
ah yes, thought so too.
But there is only pd.Index.rename, which renames the name of an index, not the index entries. Also, there is pd.DataFrame.rename, which renames an entry in a dataframe's columns. But we have a dios column that is of type index. An index is not directly mutable and I found no built-in method doing the job.
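One possible workaround, sketched on a plain pandas Index (not on a dios object): since the index cannot be changed in place, build a new one with the entry swapped. The helper name `replace_index_entry` and the example labels are made up for illustration.

```python
import pandas as pd


def replace_index_entry(index, old, new):
    """Return a new Index in which every label equal to `old` is replaced by `new`."""
    return pd.Index(
        [new if label == old else label for label in index],
        name=index.name,
    )


columns = pd.Index(["temperature", "pressure"], name="fields")
renamed = replace_index_entry(columns, "pressure", "pressure_harmonized")
```

The original index stays untouched, so the result has to be assigned back to wherever the columns are stored.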
added 2 commits
- saqc/funcs/proc_functions.py 0 → 100644
    flagscol.drop(flagscol[drop_mask].index, inplace=True)

    # hack ahead! Resampling with overlapping intervals:
    # 1. -> no rolling over categories allowed in pandas, so we translate manually:
    cats = pd.CategoricalIndex(flagger.dtype.categories, ordered=True)
    cats_dict = {cats[i]: i for i in range(0, len(cats))}
    flagscol = flagscol.replace(cats_dict)
    # 3. -> combine resample+rolling to resample with overlapping intervals:
    flagscol = flagscol.resample(freq).max()
    initial = flagscol[0]
    flagscol = flagscol.rolling(2, center=True, closed="neither").max()
    flagscol[0] = initial
    cats_dict = {num: key for (key, num) in cats_dict.items()}
    flagscol = flagscol.astype(int, errors='ignore').replace(cats_dict)
    flagscol[flagscol.isna()] = empty_intervals_flag
    # ...hack done

Maybe a bit less hackish, and I guess less expensive (if this plays a role):
    codes, cats = flagscol.factorize()
    intflags = pd.Series(codes, index=flagscol.index)
    # do the rolling stuff
    # ...
    remap = pd.Categorical.from_codes(intflags, cats)
    flagscol = pd.Series(remap)
Edited by Bert Palm
I didn't change to .factorize() because, on the one hand, it is not noticeably faster, and on the other hand, it also gets hacky with .factorize, because factorization only covers the values present in flagscol. So, when mapping the codes back, one would have to add codes for flags that were not present in flagscol when factorizing, but are part of the flags dtype.
So, if that is OK, I would just leave the snippet as it is. Objections?
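A possible middle ground, sketched under the assumption that the flags column can be cast to an ordered categorical dtype: the categorical codes are defined by the full category list of the dtype, not only by the values present. The category names below are made up, and the rolling step of the original snippet is left out to keep the sketch short.

```python
import pandas as pd

# Hypothetical flag categories; "BAD" deliberately never occurs in the data
flag_dtype = pd.CategoricalDtype(["UNFLAGGED", "GOOD", "BAD"], ordered=True)

index = pd.date_range("2020-01-01", periods=6, freq="5min")
flagscol = pd.Series(
    ["GOOD", "UNFLAGGED", "UNFLAGGED", "UNFLAGGED", "GOOD", "UNFLAGGED"],
    index=index,
).astype(flag_dtype)

# Integer codes follow the order of the dtype's categories
codes = flagscol.cat.codes

# Numeric resampling is now possible; max() picks the worst flag per interval
resampled = codes.resample("10min").max()

# Map back with the full dtype, so unseen categories survive the round trip
restored = pd.Series(
    pd.Categorical.from_codes(resampled.astype(int).to_numpy(), dtype=flag_dtype),
    index=resampled.index,
)
```

Empty intervals would still need separate handling before the codes are mapped back, as in the original snippet.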
added 14 commits
- 45baa07c - implemented polynomial modelling function/initialised modelling module/added...
- 0a813ccf - bfx@modelling_polyFit
- ed8fed6d - poly fitting function implemented and documented
- ab0c62ca - Grubbs outlier test implemented
- ba2a9fd1 - grubs test documented
- dc26a553 - added outlier utils to the requirements
- c80b919e - test module for data modelling function under construction
- 206cefc9 - data modelling test module implemented/ bugfixes in polynomial fit
- 6c5a9425 - test for grubbs test implemented
- 49fff707 - Update requirements.txt
- 2072113b - Merge branch 'modellingModule' of https://git.ufz.de/rdm-software/saqc into modellingModule
- 89e68c3a - solved outlier package requiremnts confusion
- 0ebca7df - added modelling module to the init
- e59bf348 - still requirements confusion
- Resolved by Peter Lünenschloß
- Resolved by Peter Lünenschloß
- Resolved by Peter Lünenschloß
The harmonization - deharmonization concept isn't very clear to me.
Currently I am trying to undo a harm_linear2Grid(). There I do not need to specify a method, and it uses method='time' under the hood, but I can't get the deharmonization to run properly, as I don't know the correct value to pass to method=?? ... 'inverse_time'? Am I missing something 😕?
This is a very big merge request. I suggest not adding more here (except fixes) and doing further work on a new branch forked from this one, because it gets harder and harder to review.
Edited by Bert Palm
'inverse_time'? do i miss something
The keyword to pass would be "inverse_nagg"; that is a special case for interpolations. I will add an "inverse interpolation" keyword that maps onto "inverse_nagg". For the other harmonization methods, I think the inversion method to choose is quite clear, since they have method keywords that can just be prefixed by "inverse_" to get the inversion method.
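The naming rule described here could be sketched as a small lookup. This is only an illustration of the rule, not code from the merge request. The helper `inverse_method_for` and the set `INTERPOLATION_METHODS` are hypothetical, and only 'time' is confirmed in this thread as an interpolation method.

```python
# Hypothetical: method keywords that count as interpolations.
# Only 'time' is mentioned in this thread; others would have to be added.
INTERPOLATION_METHODS = {"time"}


def inverse_method_for(method):
    """Return the deharmonization keyword for a harmonization method keyword."""
    if method in INTERPOLATION_METHODS:
        # special case: interpolations are inverted via "inverse_nagg"
        return "inverse_nagg"
    # general rule: prefix the method keyword with "inverse_"
    return "inverse_" + method


inverse_of_time = inverse_method_for("time")
inverse_of_nagg = inverse_method_for("nagg")
```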
The harmonization - deharmonization concept isn't very clear to me.
Anything else unclear? Maybe we can have a VC talk on that, or a real-life talk as well.
i suggest to not add more here (except fixes), and do further work on a new branch forked from this
Okidoki, but it makes me nervous to work on a somehow unauthorized branch, and having to merge develop repeatedly and solve merge conflicts again and again gets kind of annoying in the long run. So maybe we could meet and go through the request together, to bring this thing to an end soon?
Edited by Peter Lünenschloß
- Resolved by Peter Lünenschloß