Data Platform Components

Data Platforms come in different forms according to the use case. A good Data Platform is componentised so that each part can scale according to the data volumes and processing demands placed on it.

The list of components used will vary according to the type of Data Platform.

Storage Layer

This is the most important layer in a Data Platform and the only layer common to all Platforms.
Without a Storage Layer there can be no Data Platform.

Storage can range from a simple file system to a document store database to a distributed database.

Orchestration

This is the system that manages, schedules, and monitors the processes and the flow of data through the pipelines. It ensures tasks (ingestion, transformation, Data Quality) run in the correct sequence and acts as the dependency manager between tasks. It also handles errors and provides operational observability.
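
As an illustration of the dependency-manager role, here is a minimal sketch in Python. It assumes tasks are plain callables and dependencies are given as a mapping from each task to the tasks it depends on. The names run_pipeline, tasks and dependencies are hypothetical, and a real orchestrator would also add scheduling, retries and monitoring.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, dependencies):
    """Run tasks in dependency order and return a status per task.

    tasks:        task name -> callable (hypothetical task bodies)
    dependencies: task name -> set of task names that must succeed first
    """
    status = {}
    # static_order() yields every task after all of its dependencies
    for name in TopologicalSorter(dependencies).static_order():
        upstream = dependencies.get(name, set())
        if any(status[dep] != "ok" for dep in upstream):
            status[name] = "skipped"  # an upstream task did not succeed
            continue
        try:
            tasks[name]()
            status[name] = "ok"
        except Exception as exc:  # error handling: record and carry on
            status[name] = f"failed: {exc}"
    return status


# Example run with three placeholder tasks
log = []
tasks = {
    "ingest": lambda: log.append("ingest"),
    "transform": lambda: log.append("transform"),
    "data_quality": lambda: log.append("data_quality"),
}
dependencies = {
    "ingest": set(),
    "transform": {"ingest"},
    "data_quality": {"transform"},
}
status = run_pipeline(tasks, dependencies)
```

The returned status dictionary is a very small form of operational observability: it records which tasks ran, which failed, and which were skipped because something upstream failed.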

Ingestion

The Ingestion Layer is the foundational component responsible for importing raw data from sources into a central storage area (a data lake or lakehouse). It supports any form of loading (batch, real-time, or streaming); the form used will vary according to the use case.

Transformation

The transformation layer is the functional stage where raw, ingested data is cleaned, restructured, enriched and combined. It’s the bridge between raw storage and the final consumption layer.
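
The four activities (clean, restructure, enrich, combine) can be sketched in a few lines of Python. The record layout, the field names and the customers lookup are assumptions made up for this illustration, not part of any particular platform.

```python
def transform(orders, customers):
    """Turn raw order records into typed, enriched records."""
    result = []
    for row in orders:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # clean: drop rows whose amount cannot be parsed
        customer_id = row["cust"].strip().upper()  # clean: normalise the key
        result.append({
            # restructure: typed fields with consistent names
            "order_id": int(row["order_id"]),
            "customer_id": customer_id,
            "amount": amount,
            # enrich / combine: look up the country from a second dataset
            "country": customers.get(customer_id),
        })
    return result


raw_orders = [
    {"order_id": "1", "cust": " c1 ", "amount": "10.50"},
    {"order_id": "2", "cust": "C2", "amount": "n/a"},
]
customers = {"C1": "UK", "C2": "FR"}
clean_orders = transform(raw_orders, customers)
```

The raw records are left untouched; the transformation produces a new dataset for the consumption layer.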

Data Quality

The Data Quality (DQ) component sits between raw data ingestion and consumption to ensure that data is accurate, complete, consistent, and fit for purpose. It can act as a set of gates that data must pass before it reaches consumers, or as a “decorator” of the data that can be read by downstream systems.
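
The difference between the gate and the “decorator” approach can be sketched as follows. The two rules in check_row and the dq_issues field are hypothetical; real rules would come from the business definition of “fit for purpose”.

```python
def check_row(row):
    """Return the list of rule violations for one record (example rules only)."""
    issues = []
    if row.get("customer_id") is None:
        issues.append("missing customer_id")
    amount = row.get("amount")
    if not isinstance(amount, (int, float)) or amount < 0:
        issues.append("invalid amount")
    return issues


def gate(rows):
    """Gate mode: only rows with no violations pass through."""
    return [row for row in rows if not check_row(row)]


def decorate(rows):
    """Decorator mode: every row passes, annotated with its DQ result."""
    return [{**row, "dq_issues": check_row(row)} for row in rows]


rows = [
    {"customer_id": 1, "amount": 10.0},
    {"customer_id": None, "amount": -5},
]
passed = gate(rows)
annotated = decorate(rows)
```

In gate mode the second row never reaches consumers. In decorator mode it does, and each downstream system decides for itself what to do with the recorded issues.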

Data Governance – Lineage, Dictionary

Data Governance includes a range of subjects. It covers:

  • Data Cataloging, Dictionary and Discovery
  • Data Lineage
  • Data Ownership and Stewardship
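
To make Data Lineage concrete, here is a minimal sketch that records which datasets each dataset is derived from and walks that record to find everything upstream. The dataset names and the lineage mapping are invented for the example; in practice lineage is usually captured automatically by tooling.

```python
# Hypothetical lineage record: dataset -> datasets it is derived from
lineage = {
    "sales_report": ["clean_sales", "customers"],
    "clean_sales": ["raw_sales"],
}


def upstream(dataset, lineage):
    """Return every dataset that the given dataset depends on, directly or not."""
    seen = []
    stack = list(lineage.get(dataset, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(lineage.get(current, []))
    return sorted(seen)


sources = upstream("sales_report", lineage)
```

The same record read in the opposite direction answers the impact question: which downstream datasets are affected if a source changes.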

Data Access (Consumption)

Data Access is a key difference between a Platform and other data systems.
Traditional EDMs think in terms of pushing data to downstream systems via files or messages, whereas Data Platforms focus on allowing data to be accessed in situ. A Data Platform therefore requires the processing power to service all these requests and, ultimately, to scale up as required.
There is nothing wrong with a Data Platform distributing data via the traditional means, but if all data is accessed this way, the Data Platform is more of an EDM.

Access to the data should always go through an access layer, so that engineers can re-model the data without impacting downstream systems. The access layer could be a simple “pass-through” view or an implementation of a transformation layer that provides a “schema on read” mechanism.
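
The sketch below shows the idea using Python's built-in sqlite3 module and an in-memory database. The table and column names are invented for the example. Consumers only ever query the customer view, so the underlying table can be re-modelled without changing their query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Initial model, with a pass-through view as the access layer
conn.executescript("""
    CREATE TABLE customer_v1 (id INTEGER, full_name TEXT);
    INSERT INTO customer_v1 VALUES (1, 'Ada Lovelace');
    CREATE VIEW customer AS SELECT id, full_name FROM customer_v1;
""")
consumer_query = "SELECT id, full_name FROM customer"
before = conn.execute(consumer_query).fetchall()

# Engineers re-model the storage: the name is split into two columns.
# The view is redefined so that it still presents the original shape.
conn.executescript("""
    CREATE TABLE customer_v2 (id INTEGER, first_name TEXT, last_name TEXT);
    INSERT INTO customer_v2 VALUES (1, 'Ada', 'Lovelace');
    DROP VIEW customer;
    CREATE VIEW customer AS
        SELECT id, first_name || ' ' || last_name AS full_name
        FROM customer_v2;
""")
after = conn.execute(consumer_query).fetchall()
```

The first view is a plain pass-through. The second applies a small transformation at read time, which is the “schema on read” style of access layer in miniature.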

Computation Layer

The ability to compute within the Data Platform is one of the fundamental differences between an EDM and a Data Platform. An EDM is viewed as a single-purpose program. A Data Platform allows actors to “push the code to the data”. The code should not be viewed as part of the Data Platform; the Data Platform merely provides CPU power to users.

The advantage of “pushing the code to the data” is that the data is usually many times larger than the code, so it is cheaper to move the code than to move the data.
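
A toy illustration of the pattern, assuming a hypothetical DataNode class that stands in for a platform node holding one partition of the data. The user supplies a function, the node runs it next to the data, and only the small result travels back.

```python
class DataNode:
    """Stands in for a platform node that holds one partition of the data."""

    def __init__(self, rows):
        self._rows = rows

    def run(self, func):
        # The caller's function executes where the data lives;
        # only its result leaves the node, never the rows themselves.
        return func(self._rows)


def total_amount(rows):
    """User code: belongs to the user, not to the platform."""
    return sum(row["amount"] for row in rows)


nodes = [
    DataNode([{"amount": 10}, {"amount": 20}]),
    DataNode([{"amount": 5}]),
]
partials = [node.run(total_amount) for node in nodes]
grand_total = sum(partials)
```

Here one number per node crosses the boundary, regardless of how many rows each node holds.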

Semantic Layer

A semantic layer is a non-technical, business-orientated abstraction layer that sits between data sources and end-user tools (e.g. BI dashboards, AI agents). It acts as a “universal translator,” mapping technical data structures, such as table names and SQL queries, into familiar business concepts.
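
A minimal sketch of that translation in Python: business terms are mapped to SQL fragments, and a query is generated from the business terms alone. The table names, column names and the SEMANTIC_MODEL structure are all invented for this illustration; real semantic layers are considerably richer.

```python
# Hypothetical semantic model: business terms -> technical SQL fragments
SEMANTIC_MODEL = {
    "metrics": {
        "Total Revenue": "SUM(f.net_amt)",
        "Order Count": "COUNT(DISTINCT f.ord_id)",
    },
    "dimensions": {
        "Region": "d.region_nm",
    },
    "source": "fct_orders f JOIN dim_region d ON f.region_id = d.region_id",
}


def build_query(metric, dimension, model=SEMANTIC_MODEL):
    """Translate a business question into SQL using the semantic model."""
    metric_sql = model["metrics"][metric]
    dim_sql = model["dimensions"][dimension]
    return (
        f'SELECT {dim_sql} AS "{dimension}", {metric_sql} AS "{metric}" '
        f'FROM {model["source"]} GROUP BY {dim_sql}'
    )


sql = build_query("Total Revenue", "Region")
```

The end user asks for “Total Revenue” by “Region” and never needs to know that the figures live in fct_orders.net_amt.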