dios - hide the data access within a 'data source' layer
When the program hits the main loop (the function runner), all data needs to be available in main memory, which increases the chance of MemoryErrors. I think we could reduce that risk considerably if we hid the data DataFrame behind an abstraction layer that allows pulling out single or multiple variables. I am thinking of an interface like this:
    class AbstractDataSource():
        def getData(self, variables):
            pass
How the getData method works is then up to the concrete implementation. A simple implementation providing the same functionality could look like this:
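    import pandas as pd

    class SimpleCSVDataSource(AbstractDataSource):
        def __init__(self, file_name):
            # load everything up front, i.e. the same behaviour as today
            self._data = pd.read_csv(file_name)

        def getData(self, variables):
            # assuming variables are stored as columns, select along the column axis
            return self._data.loc[:, variables]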
A more elaborate version could be:
    import pandas as pd
    import requests

    class DMPRESTDataSource(AbstractDataSource):
        def __init__(self, credentials):
            self._credentials = credentials

        def getData(self, variables):
            # something along the lines of
            response = requests.get("endpoint_url", params={"variables": variables}, auth=self._credentials)
            # assumes the endpoint answers with JSON that pandas can turn into a table
            return pd.DataFrame(data=response.json())
Going down that route would allow us to read data only when needed. It would also open up the possibility to implement ideas like the harmonization of differing timeseries at a single position, and even to run them implicitly. The major downside, in my opinion, is that we add another layer of abstraction and thereby probably reduce the extensibility of the system. This is definitely not for rdm/saqc%"1.0", but maybe something to consider afterwards.
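To make the "read data only when needed" point a bit more concrete, here is a rough sketch of how a lazy CSV-backed source could look. It is only an illustration and not existing saqc code: the class name LazyCSVDataSource and its index_col parameter are made up, and it assumes the variables are stored as columns of a CSV file with one timestamp column.

```python
from abc import ABC, abstractmethod

import pandas as pd


class AbstractDataSource(ABC):
    """The interface proposed above: hand out the requested variables."""

    @abstractmethod
    def getData(self, variables):
        ...


class LazyCSVDataSource(AbstractDataSource):
    """Hypothetical variant that reads columns on demand instead of up front."""

    def __init__(self, file_name, index_col):
        # Only remember where the data lives; nothing is loaded here.
        self._file_name = file_name
        self._index_col = index_col

    def getData(self, variables):
        # Accept a single variable name as well as a list of names.
        if isinstance(variables, str):
            variables = [variables]
        # usecols restricts parsing to the index column and the requested
        # variables, so unrelated columns never end up in memory.
        data = pd.read_csv(
            self._file_name,
            usecols=[self._index_col, *variables],
            index_col=self._index_col,
            parse_dates=True,
        )
        # read_csv returns columns in file order; restore the requested order.
        return data[variables]
```

With something like this, `LazyCSVDataSource("data.csv", index_col="timestamp").getData(["temp"])` would only materialize the temp column. The price is that the file is parsed again on every call, so this trades I/O for memory; some form of caching might be needed if the same variables are requested repeatedly.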