
Before I set the stage, let me talk about the two big building blocks of modern data analytics: data streaming and data lakes (or lakehouses).
image by Author
The Lakehouse is a relatively new concept that merges the capabilities of a data lake and a data warehouse. Meanwhile, data streaming serves as the foundation for real-time processing and ensures data consistency between real-time and non-real-time systems.
A Lakehouse is typically used for analytical use cases, whereas data streaming supports both operational and analytical applications, bridging the gap between them.
Data streaming is built on event-driven architecture, a concept that has existed for over 20 years with message brokers. However, data streaming introduces fundamental differences. I usually describe it using four pillars:
Real-Time Messaging
Persistence Layer
Data Integration
Data Processing
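To make the persistence pillar concrete, here is a toy in-memory log. It is only a sketch under stated assumptions: the `EventLog` class and its `append`/`read` methods are hypothetical names invented for illustration, not the API of Kafka or any other product. What it shows is that events stay in the log after being read, so a consumer that arrives later can replay the same data a real-time consumer already saw, whereas a classic message queue typically removes a message once it has been consumed.

```python
from typing import Any, List


class EventLog:
    """Toy append-only log: events are retained after they are read."""

    def __init__(self) -> None:
        self._events: List[Any] = []

    def append(self, event: Any) -> int:
        """Store an event and return its offset (its position in the log)."""
        self._events.append(event)
        return len(self._events) - 1

    def read(self, from_offset: int = 0) -> List[Any]:
        """Return every event at or after from_offset without removing it."""
        return list(self._events[from_offset:])


log = EventLog()
for order_id in (1, 2, 3):
    log.append({"order_id": order_id})

# A real-time consumer that has already handled offsets 0 and 1
# only asks for what is new.
realtime_view = log.read(from_offset=2)

# A batch consumer that arrives later can still replay the full history.
batch_view = log.read(from_offset=0)

print(len(realtime_view), len(batch_view))
```

Because both consumers read from the same retained log, the real-time and the non-real-time views are derived from identical events.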
Rather than choosing between data streaming and the lakehouse, the best strategy is to combine them.
The key is to use open table formats: with Apache Iceberg, we can store data once and query it with any tool that supports the format, and this is what drives this blog.
While Kafka excels at delivering data with real-time latency, Iceberg and Delta Lake provide a significant advantage: they do not require server-side compute resources to access data, because they store the data as Parquet files in object storage.
As I described in this blog post, organizations should set Iceberg or Delta Lake as the default publisher location for their data products, and only opt for streaming (Kafka) when sub-second latency is mandatory.
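The rule of thumb above can be written down as a small decision function. This is a minimal sketch: the `DataProductRequirement` type, the `choose_publisher` function, the example product names, and the 1000 ms threshold standing in for "sub-second" are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical threshold standing in for "sub-second" latency.
SUB_SECOND_MS = 1000


@dataclass(frozen=True)
class DataProductRequirement:
    name: str
    max_latency_ms: int  # how stale the data may be for its consumers


def choose_publisher(req: DataProductRequirement) -> str:
    """Default to an open table format; stream only for sub-second needs."""
    if req.max_latency_ms < SUB_SECOND_MS:
        return "kafka"
    return "iceberg"  # Delta Lake would play the same role here


requirements = [
    DataProductRequirement("fraud_alerts", max_latency_ms=200),
    DataProductRequirement("daily_sales", max_latency_ms=15 * 60 * 1000),
]
for req in requirements:
    print(req.name, "->", choose_publisher(req))
```

The point of the sketch is the direction of the default: the table format is the fallback branch, and streaming has to be justified by an explicit latency requirement.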
More on data products later in the blog. For now, note that a new data processing architecture called Shift Left has emerged as a response to inefficiencies in earlier data processing paradigms such as Medallion and ETL. The main problem with the Medallion and ETL architectures is that they are inefficient and carry a high compute cost, because the same data is copied and reprocessed again and again at every stage.
In the multi-hop Medallion Architecture, raw data ("bronze") undergoes iterative refinement through silver (cleaned) and gold (enriched) layers via batch ETL processes. This requires duplicative processing at each layer, which increases compute costs, adds latency between data generation and availability, and creates silos in which operational and analytical systems use different datasets.
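The multi-hop flow can be sketched in a few lines of Python to show where the duplication comes from. Everything here is illustrative: the sample orders, the cleaning rule, and the use of a revenue-per-country aggregate to stand in for the "enriched" gold layer are assumptions, and plain lists stand in for stored tables.

```python
from typing import Dict, List

Row = Dict[str, object]


def to_silver(bronze: List[Row]) -> List[Row]:
    """Clean: drop rows without an amount and normalize the country code."""
    return [
        {**row, "country": str(row["country"]).upper()}
        for row in bronze
        if row.get("amount") is not None
    ]


def to_gold(silver: List[Row]) -> Dict[str, float]:
    """Refine further: revenue per country (stand-in for the gold layer)."""
    totals: Dict[str, float] = {}
    for row in silver:
        country = str(row["country"])
        totals[country] = totals.get(country, 0.0) + float(row["amount"])
    return totals


# Hypothetical raw orders landing in the bronze layer.
bronze: List[Row] = [
    {"order_id": 1, "country": "es", "amount": 10.0},
    {"order_id": 2, "country": "ES", "amount": None},
    {"order_id": 3, "country": "fr", "amount": 5.5},
]

# Each hop is a separate batch job that reads the previous layer
# and writes another materialized copy.
silver = to_silver(bronze)
gold = to_gold(silver)

stored_records = len(bronze) + len(silver) + len(gold)
print(gold, stored_records)
```

Every layer holds its own copy of overlapping data, and gold cannot be refreshed until both earlier batch steps have run, which is the cost and latency problem described above.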
The Multi-Hop Architecture
Data is typically extracted, transformed, and loaded (ETL) into an analytics environment following a multi-hop process:
Medallion architecture by Databricks
This Medallion Architecture (Bronze → Silver → Gold) is widely used but has significant drawbacks.
Slow Data Processing
High Costs
Brittle Pipelines
Duplicate Pipelines
Similar Yet Different Datasets
Shift Left addresses these inefficiencies by moving data preparation earlier in the process.
Shift Left Architecture. By Author
The key principles include:
A Stream-First Approach
Stream-to-Table Conversion with Amazon S3
Integration with Existing Analytics Workflows
The Shift Left Architecture builds on data products, a concept introduced by McKinsey consultants that is central to modern **data mesh** principles. The goal is to unify operational and analytical workloads by creating high-quality, consistent, and real-time data products.
Source: McKinsey, "a-better-way-to-put-your-data-to-work"
The Shift Left Approach addresses the Medallion inefficiencies by moving data processing closer to its source, that is, to the left side of the architecture. At its core, this approach relies on event-driven data streaming for real-time, scalable, and reliable processing.
The idea is to create the data product once and make it available for multiple systems to consume. This is cheaper and faster, and it supports both operational and analytical use cases, with the Iceberg format providing ACID transaction guarantees.
Amazon S3 with its Iceberg capability, using Iceberg as the table format that unifies operational and analytical workloads, allows businesses to store data once in object storage and consume it across various platforms (e.g., Snowflake, Athena) without additional processing or connectors. Other open table formats, such as Delta Lake, could serve the same purpose.
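The "create once, consume many times" idea can be sketched as follows. This is a simplified illustration rather than a real pipeline: a Python generator stands in for the stream processor, a list stands in for the single Iceberg table in object storage, and the names `build_data_product`, `latest_order`, and `revenue_by_country` are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List

Event = Dict[str, object]


def build_data_product(raw_events: Iterable[Event]) -> Iterator[Event]:
    """Clean each event once, close to the source, as it streams through."""
    for event in raw_events:
        if event.get("amount") is None:
            continue  # reject bad records before any consumer sees them
        yield {
            "order_id": event["order_id"],
            "country": str(event["country"]).upper(),
            "amount": float(event["amount"]),
        }


def latest_order(product: List[Event]) -> Event:
    """Operational consumer: look at the most recent order."""
    return product[-1]


def revenue_by_country(product: List[Event]) -> Dict[str, float]:
    """Analytical consumer: aggregate over the same stored records."""
    totals: Dict[str, float] = {}
    for row in product:
        country = str(row["country"])
        totals[country] = totals.get(country, 0.0) + float(row["amount"])
    return totals


# Hypothetical raw events arriving from the source system.
raw_events: List[Event] = [
    {"order_id": 1, "country": "es", "amount": 10.0},
    {"order_id": 2, "country": "ES", "amount": None},
    {"order_id": 3, "country": "fr", "amount": 5.5},
]

# The data product is written once (stand-in for one table in object storage).
orders_product: List[Event] = list(build_data_product(raw_events))

print(latest_order(orders_product))
print(revenue_by_country(orders_product))
```

Both consumers read the same cleaned records, so there is a single stored copy and no second pipeline that could drift out of sync with the first.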