Perf improvements
Some performance improvements, mostly as discussed in #99 (closed).
This reduced the CLI runtime for a synthetic dataset with 1,000,000 rows and 20 columns, with a range test on every column, from ~103 to ~30 seconds, as measured with the Linux time utility.
As the masking is heavily undertested, please review these changes thoroughly, @palmb and @luenensc!
Not sure why the pipeline fails, as it runs on my machine. I have to dig into that...
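The benchmark setup described above could be reproduced roughly as follows. This is only a sketch with plain pandas: make_synthetic, range_flags and time_range_test are hypothetical helpers, the bounds are arbitrary, and the stand-in range test is not the saqc implementation, so absolute timings will not match the numbers quoted here.

```python
import time

import numpy as np
import pandas as pd


def make_synthetic(rows=1_000_000, cols=20, seed=0):
    """Build a frame of random floats with a regular datetime index."""
    rng = np.random.default_rng(seed)
    index = pd.date_range("2020-01-01", periods=rows, freq="min")
    values = rng.normal(size=(rows, cols))
    names = [f"var{i}" for i in range(cols)]
    return pd.DataFrame(values, index=index, columns=names)


def range_flags(series, lower=-2.0, upper=2.0):
    """Stand-in range test: True where a value lies outside [lower, upper]."""
    return (series < lower) | (series > upper)


def time_range_test(data):
    """Run the stand-in range test on every column; return (flags, seconds)."""
    start = time.perf_counter()
    flags = pd.DataFrame({name: range_flags(data[name]) for name in data.columns})
    return flags, time.perf_counter() - start
```

Calling time_range_test(make_synthetic()) times the in-process work only. The Linux time utility used for the measurements above also includes interpreter start-up and file I/O.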
Activity
Tried your solution. It does not speed things up enough: a simple range test still needs >2 sec. The main problem is that the dataset has >300 columns. The problem is not the copying; it's the access to and altering of every single column in masking and unmasking.
I could pass to_mask=[] to every test, but this seems like quite a bad idea to me; nevertheless, it is a temporary solution.
mentioned in issue #99 (closed)
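To illustrate why an empty to_mask avoids the per-column cost, here is a minimal sketch in plain pandas. It assumes that to_mask lists the flag values whose data should be hidden from a test; that reading, the function name mask_data and the string flags are assumptions for illustration, not the saqc code.

```python
import pandas as pd


def mask_data(data, flags, to_mask):
    """Hide values whose flag is listed in to_mask (hypothetical helper)."""
    if not to_mask:
        # Nothing to hide: skip the copy and the per-column loop entirely.
        return data
    masked = data.copy()
    for name in data.columns:
        # One lookup and one assignment per column; with >300 columns this
        # loop is where the time goes.
        hide = flags[name].isin(to_mask)
        masked[name] = data[name].mask(hide)
    return masked
```

With an empty list the input is returned untouched, which is fast but also means a test sees values that were already flagged, hence only a temporary workaround.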
Another speedup of about 0.5 (so about 0.25 of the original runtime ??) could be gained by improving unmasking: construct the result from the original data and the (masked) data returned by the flag function. Take only the written and newly created columns from the latter, and all other columns from the former. See $67
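The unmasking idea above could look roughly like this. It is a sketch in plain pandas under the assumption that all columns share one index (dios does not require that); unmask and its written parameter are hypothetical names, and the referenced snippet may differ.

```python
import pandas as pd


def unmask(original, returned, written):
    """Combine original and returned data column by column.

    Columns named in `written`, and columns that exist only in `returned`,
    are taken from `returned`; every other column comes from `original`,
    so its masked values never need to be restored.
    """
    new_columns = set(returned.columns) - set(original.columns)
    from_returned = set(written) | new_columns
    parts = {}
    for name in returned.columns:
        source = returned if name in from_returned else original
        parts[name] = source[name]
    return pd.DataFrame(parts)
```

Only the columns a test actually wrote to are handled individually, so the cost scales with the number of written columns rather than with the width of the dataset.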
- Resolved by David Schäfer
Please also update dios to the latest version. I optimized copy and copy_empty, so we also gain some performance improvement from that.
added 5 commits
- f92fba5d - data is reduced to the fields needed by a test
- c9c717ac - register takes the optional parameter all_data now
- 19347f07 - separeted the masking tests from test_core into new file
- 3b3b4daa - convert numpy arrays to pandas Series as assigning numpy arrays to
- a2f3d555 - Merge branch 'perf_improvements' of https://git.ufz.de/rdm-software/saqc into perf_improvements
added 1 commit
- 062db908 - WIP - rework the register and saqcFunc calling machinery
marked as a Work In Progress from 062db908
added 1 commit
- af4051ec - WIP - rework the register and saqcFunc calling machinery
added 2 commits