Data Platform Components

Data Platforms come in different forms depending on the use case. A good data platform is componentised so that each part can scale independently according to the data volumes and processing demands placed on it.

The components used will vary according to the type of Data Platform.

Storage Layer

This is the most important layer in a Data Platform and the only layer common to all platforms.
Without a Storage Layer there can be no Data Platform.

Orchestration

This is the system that manages, schedules, and monitors the processes and the flow of data through the pipelines. It ensures that tasks (ingestion, transformation, data quality) run in the correct sequence and acts as the dependency manager between them. It also handles errors and provides operational observability.
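As a minimal sketch of the dependency management and error handling described above, the following Python example runs tasks in dependency order using the standard library's graphlib module. The task names, the run_pipeline helper and the status values are hypothetical and chosen only for illustration; a real orchestrator would add scheduling, retries and richer monitoring.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, dependencies):
    """Run tasks in dependency order and record the outcome of each.

    tasks: maps a task name to a zero-argument callable.
    dependencies: maps a task name to the set of tasks it depends on.
    Both structures are hypothetical, for illustration only.
    """
    # Make sure every task appears in the graph, even with no dependencies.
    graph = {name: dependencies.get(name, set()) for name in tasks}
    status = {}
    for name in TopologicalSorter(graph).static_order():
        # Skip a task if anything it depends on did not succeed.
        if any(status.get(dep) != "success" for dep in graph[name]):
            status[name] = "skipped"
            continue
        try:
            tasks[name]()
            status[name] = "success"
        except Exception as exc:
            # Record the failure so that operators can see what went wrong.
            status[name] = f"failed: {exc}"
    return status


def failing_transform():
    raise ValueError("bad schema")


tasks = {
    "ingest": lambda: None,
    "transform": failing_transform,
    "dq": lambda: None,
}
dependencies = {"transform": {"ingest"}, "dq": {"transform"}}

status = run_pipeline(tasks, dependencies)
print(status)
```

In this sketch the failed transformation prevents the data quality task from running, and the returned status dictionary is the simplest possible form of operational observability.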

Ingestion

The ingestion layer is the foundational component responsible for importing raw data from sources into a central storage area (a data lake or lakehouse). It supports any form of loading (batch, real-time, or streaming); the form used will vary according to the use case.
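The difference between batch and streaming loads can be sketched as follows. This is an illustration only: a temporary directory stands in for the central storage area, and the function names, file names and JSON Lines format are assumptions rather than part of any particular platform.

```python
import json
import tempfile
from pathlib import Path


def ingest_batch(records, landing_dir, name):
    """Batch load: write a complete set of raw records in one go."""
    path = Path(landing_dir) / f"{name}.jsonl"
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    return path


def ingest_stream(record_iter, landing_dir, name):
    """Streaming load: append each raw record as it arrives."""
    path = Path(landing_dir) / f"{name}.jsonl"
    count = 0
    for record in record_iter:
        with path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
        count += 1
    return path, count


# A temporary directory stands in for a data lake location (assumption).
landing_dir = tempfile.mkdtemp()
batch_path = ingest_batch([{"id": 1}, {"id": 2}], landing_dir, "orders_batch")
stream_path, n = ingest_stream(iter([{"id": 3}]), landing_dir, "orders_stream")
```

In both cases the records are written unchanged, which matches the idea that ingestion lands raw data and leaves cleaning to later stages.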

Transformation

The transformation layer is the functional stage where raw, ingested data is cleaned, restructured, enriched and combined. It is the bridge between raw storage and the final consumption layer.
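The four operations named above can be sketched in a few lines of Python. The record layout (order_id, customer_id, amount), the customer lookup and the is_large threshold are all hypothetical and exist only to make each step visible.

```python
def transform(raw_orders, customers):
    """Clean, restructure, enrich and combine raw order records."""
    result = []
    for row in raw_orders:
        # Clean: drop rows without an order id and trim stray whitespace.
        if not row.get("order_id"):
            continue
        customer_id = row["customer_id"].strip()
        # Restructure: convert the amount from text to a number.
        amount = float(row["amount"])
        # Combine: look up the customer in a second data set.
        customer = customers.get(customer_id, {})
        # Enrich: add a derived field alongside the combined data.
        result.append({
            "order_id": row["order_id"],
            "customer_id": customer_id,
            "amount": amount,
            "country": customer.get("country", "unknown"),
            "is_large": amount >= 1000,
        })
    return result


raw_orders = [
    {"order_id": "A1", "customer_id": " c1 ", "amount": "1250.00"},
    {"order_id": "", "customer_id": "c2", "amount": "10"},
    {"order_id": "A2", "customer_id": "c9", "amount": "40.5"},
]
customers = {"c1": {"country": "GB"}}

curated = transform(raw_orders, customers)
```

The input rows stay untouched in raw storage; the curated output is what the consumption layer would read.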

Data Quality

The Data Quality (DQ) component sits between raw data ingestion and consumption to ensure that data is accurate, complete, consistent, and fit for purpose. It can act as a set of gates in front of consumers, or as a “decorator” that annotates the data with quality information that downstream systems can read.
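The two modes can be contrasted with a short Python sketch. The rules, rule names and the dq_failures field are hypothetical; the point is only that a gate withholds failing rows, while a decorator passes every row on with its quality result attached.

```python
def check_row(row):
    """Return the names of the (hypothetical) rules that the row fails."""
    failures = []
    if row.get("customer_id") in (None, ""):
        failures.append("customer_id_present")
    amount = row.get("amount")
    if not isinstance(amount, (int, float)) or amount < 0:
        failures.append("amount_non_negative")
    return failures


def gate(rows):
    """Gate mode: only rows that pass every rule reach consumers."""
    return [row for row in rows if not check_row(row)]


def decorate(rows):
    """Decorator mode: every row is kept, annotated with its DQ result."""
    return [{**row, "dq_failures": check_row(row)} for row in rows]


rows = [
    {"customer_id": "c1", "amount": 10.0},
    {"customer_id": "", "amount": -5},
]

passed = gate(rows)
annotated = decorate(rows)
```

With a gate, the consumer never sees the second row. With a decorator, the consumer sees both rows and decides for itself how to treat the one that failed.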

Data Governance – Lineage, Dictionary

Data Governance includes a range of subjects. It covers:

    • Data Cataloging, Dictionary and Discovery
    • Data Lineage
    • Data Ownership and Stewardship

Data Dissemination

Data dissemination is the process of distributing data to downstream systems and processes.
The dissemination process is a key difference between an EDM and a Data Platform.
Traditional EDMs think in terms of pushing data to downstream systems, whereas Data Platforms focus on allowing data to be used in situ.

Computation Layer

The ability to compute within the Data Platform is one of the fundamental differences between an EDM and a Data Platform. An EDM is viewed as a single-purpose component, whereas a Data Platform allows actors to “push the code to the data”. The code should not be viewed as part of the Data Platform; the Data Platform merely provides CPU power to its users.

The advantage of “pushing the code to the data” is that the data is typically many times larger than the code, so moving the code is far cheaper than moving the data.
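A minimal sketch of the idea in Python follows. The Platform class and its run method are hypothetical stand-ins, not the API of any real product: the user supplies a small function, the platform executes it where the data lives, and only the small result is returned.

```python
class Platform:
    """Hypothetical stand-in for a data platform that holds data in place."""

    def __init__(self, datasets):
        self._datasets = datasets

    def run(self, dataset_name, func):
        # The caller's function travels to the data; only the (small)
        # result travels back to the caller.
        return func(self._datasets[dataset_name])


def mean(values):
    """User-supplied code: a few lines, regardless of the data size."""
    return sum(values) / len(values)


# One million readings stay inside the platform.
platform = Platform({"readings": list(range(1_000_000))})

result = platform.run("readings", mean)
```

Here the user's code is a two-line function while the data set holds a million values, and the function itself remains the user's property rather than a part of the platform.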

AI