“Pushing the code to the data”
A fundamental issue with big data is … well … the size of the data.
Running computation over large data sets requires us to bring the code and the data together. There are essentially two mechanisms:
1. Pull the data to the code.
2. Push the code to the data.
Pulling millions of records across a network and loading them into another program for processing can simply take too long. Pushing the code to the data and processing it locally is faster, usually by an order of magnitude.
Modern systems do this very effectively. For example, Snowflake and Databricks both allow procedures written in Python to run close to the data. Snowflake's Snowpark and its pandas API let you write pandas code locally and have the work pushed down to the data.
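The difference between the two mechanisms can be sketched on a small scale with Python's built-in sqlite3 module. This is only an illustration, not Snowflake or Databricks code: the `trades` table, its columns and the two function names are assumptions made up for the example. The "pull" version fetches every row and aggregates in Python; the "push" version sends the aggregation to the engine and receives one row per account.

```python
import sqlite3

# Hypothetical in-memory table standing in for a large remote data set.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (account TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO trades VALUES (?, ?)",
    [("A", 10.0), ("A", 5.0), ("B", 7.5), ("B", 2.5), ("C", 1.0)],
)


def totals_by_pull(conn):
    """Pull every row to the client, then aggregate in Python."""
    totals = {}
    for account, amount in conn.execute("SELECT account, amount FROM trades"):
        totals[account] = totals.get(account, 0.0) + amount
    return totals


def totals_by_push(conn):
    """Push the aggregation to the engine; only one row per account returns."""
    rows = conn.execute(
        "SELECT account, SUM(amount) FROM trades GROUP BY account"
    )
    return dict(rows)


pull_result = totals_by_pull(conn)
push_result = totals_by_push(conn)
```

Both functions give the same totals, but the pull version moves every record across the connection while the push version moves only the summary. With millions of rows and a real network in between, that difference is what dominates the run time.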
Idempotency and Temporal Data
Idempotency is a critical feature of a modern data platform. It is the ability to re-run a process and guarantee the same results each time. In financial organisations, this can be a legal requirement. Implementing idempotency is difficult, as all or most data needs to be stored temporally, either uni-temporally or bi-temporally. Developers very commonly underestimate the effort involved in storing data temporally, especially when bi-temporal data is used.
There are many design patterns available, with varying degrees of complexity. The pattern used will depend on your use case, your compute power and the required response times.
For slowly changing data, the most common mechanism is to store data uni-temporally, with a start date-time and an end date-time for each record. This works well where the “past” cannot be changed. Changing the past becomes complicated, as a correction may cause historic records to be split and new “filler” records to be created. In these cases, storing the data bi-temporally can be an option. It simplifies the storing of data but complicates the selection of data.
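A minimal sketch of the uni-temporal pattern, in plain Python rather than any particular database. The `Version` record, the `apply_change` and `as_of` helpers and the half-open interval convention (start inclusive, end exclusive) are all assumptions chosen for illustration, and the sketch assumes changes arrive in chronological order, i.e. the past is never corrected.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Version:
    key: str
    value: str
    valid_from: datetime
    valid_to: Optional[datetime] = None  # None marks the current, open record


def apply_change(history, key, value, changed_at):
    """Close the open record for `key` (if any) and append the new version."""
    for rec in history:
        if rec.key == key and rec.valid_to is None:
            rec.valid_to = changed_at
    history.append(Version(key, value, changed_at))


def as_of(history, key, when):
    """Return the value valid for `key` at `when`, or None if there is none."""
    for rec in history:
        starts_before = rec.valid_from <= when
        ends_after = rec.valid_to is None or when < rec.valid_to
        if rec.key == key and starts_before and ends_after:
            return rec.value
    return None


history = []
apply_change(history, "cust-1", "London", datetime(2024, 1, 1))
apply_change(history, "cust-1", "Paris", datetime(2024, 6, 1))

address_in_march = as_of(history, "cust-1", datetime(2024, 3, 15))
address_in_july = as_of(history, "cust-1", datetime(2024, 7, 1))
```

Here `address_in_march` is "London" and `address_in_july` is "Paris". A late correction dated, say, 1 February would not fit this simple helper: the existing London record would have to be split, which is exactly the complication described above.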
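For data that changes daily, a common mechanism is to store the data with just a date, and to record an incrementing version number and the load time with each load. This mechanism has the advantage of being a bi-temporal store: the date records when the data applies, and the load time records when it became known.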
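The daily versioned pattern, and the idempotency it buys, might look something like the following sketch. The in-memory `rows` list, the tuple layout and the `load` and `read` helpers are hypothetical names for illustration; a real platform would hold this in a table and express `read` as a query.

```python
from datetime import date, datetime

# Append-only store: each row is (business_date, version, loaded_at, value).
rows = []


def load(business_date, value, loaded_at):
    """Append a new version for business_date; never update or delete."""
    previous = (r[1] for r in rows if r[0] == business_date)
    version = 1 + max(previous, default=0)
    rows.append((business_date, version, loaded_at, value))


def read(business_date, known_at):
    """Return the value for business_date as it was known at known_at."""
    candidates = [
        r for r in rows if r[0] == business_date and r[2] <= known_at
    ]
    if not candidates:
        return None
    # The highest version loaded on or before known_at wins.
    return max(candidates, key=lambda r: r[1])[3]


d = date(2024, 3, 1)
load(d, 100.0, datetime(2024, 3, 2, 6, 0))  # original load
report_time = datetime(2024, 3, 2, 9, 0)
first_run = read(d, report_time)

load(d, 105.0, datetime(2024, 3, 5, 6, 0))  # later correction
second_run = read(d, report_time)  # re-run of the same report
latest = read(d, datetime(2024, 3, 6))
```

Because loads only ever append, re-running the report with the same `report_time` returns the original figure even after the correction has arrived, while a read at a later time picks up the corrected value. Storing is simple; the cost is that every read has to select the right version.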
