Before I set the stage, let me talk about the two big building blocks of modern data analytics: data streaming and data lakes, or Lakehouses.

Image by Author

 

 

The Lakehouse is a relatively new concept that merges the capabilities of a data lake and a data warehouse. Meanwhile, data streaming serves as the foundation for real-time processing and ensures data consistency between real-time and non-real-time systems.

A Lakehouse is typically used for analytical use cases, whereas data streaming supports both operational and analytical applications, bridging the gap between them.

 

Data streaming is built on event-driven architecture, a concept that has existed for over 20 years with message brokers. However, data streaming introduces fundamental differences. I usually describe it using four pillars:

  1. Real-Time Messaging

    • Similar to traditional message brokers but built to scale massively—processing millions of messages per second.
    • Supports transactional workloads with features like exactly-once processing, which is crucial for applications like payment processing (see the sketch after this list).
  2. Persistence Layer

    • Unlike traditional message brokers, data streaming persists events.
    • This allows businesses to define retention policies—from a few hours (for logs or IoT data) to months or years (for critical business data).
    • This prevents the need to repeatedly query the ERP systems.
  3. Data Integration

    • Unlike message brokers, which require additional ETL tools, data streaming platforms offer built-in integration.
    • They connect monolithic systems (relational databases, ERPs, mainframes) and cloud-native services (object stores, cloud Lakehouses, SaaS applications).
  4. Data Processing

    • Stream processing enables continuous transformation and enrichment of data.
    • Instead of storing and analyzing raw data later, it can be aggregated, filtered, and enriched before reaching its destination.
    • Used for fraud detection, predictive maintenance, customer recommendations, and even real-time AI inference.
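To make the first two pillars concrete, here is a minimal sketch of a transactional, exactly-once producer using the confluent-kafka Python client. The broker address, topic, and transactional id are illustrative assumptions, not references to any real system.

```python
# Minimal sketch of exactly-once production with the confluent-kafka client.
# Broker, topic, and transactional.id are placeholders.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",      # placeholder broker
    "enable.idempotence": True,                 # retries cannot create duplicates
    "transactional.id": "payments-producer-1",  # enables atomic, exactly-once writes
})

producer.init_transactions()
producer.begin_transaction()
try:
    producer.produce("payments", key="order-42",
                     value='{"amount": 99.95, "currency": "EUR"}')
    producer.commit_transaction()               # either all events land, or none
except Exception:
    producer.abort_transaction()
```

Because the broker persists the topic (pillar 2), any downstream consumer can replay these events later without going back to the source system.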

Data Streaming Strengths:

  • Best for processing data in motion with low-latency event-driven architectures.
  • Used in real-time analytics, fraud detection, monitoring, alerting, and transactional processing.
  • Core banking systems, ERP solutions, and AI inference engines often use Kafka and Flink under the hood.
  • Ideal for continuous AI model inference where immediate action is required.

Lakehouse Strengths:

  • Originates from data warehouse and data lake concepts (e.g., Snowflake, Delta Lake).
  • Best for batch processing, historical data analysis, and business intelligence.
  • Used for long-term data storage, reporting, and model training.
  • AI workloads like model training and large-scale batch inference fit well in a Lakehouse.

Rather than choosing one over the other, the best strategy is to combine them:

  • Data streaming processes data in motion, enabling real-time decision-making.
  • Lakehouses store processed data at rest, making it available for historical analysis and AI training.
  • Streaming data can be persisted in a Lakehouse for further analytics (see the sketch below).
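As a sketch of that last point, a Spark Structured Streaming job can continuously persist a Kafka topic into an Iceberg table. This assumes a Spark session that already has the Kafka and Iceberg packages plus a catalog named lakehouse configured; the broker, topic, checkpoint path, and table name are placeholders.

```python
# Sketch: persist a Kafka topic into an Iceberg table with Spark Structured Streaming.
# Assumes the spark-sql-kafka and Iceberg runtime packages and a catalog named "lakehouse".
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-iceberg").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
    .option("subscribe", "orders")                         # placeholder topic
    .load()
    .select(col("key").cast("string"),
            col("value").cast("string"),
            col("timestamp"))
)

(
    events.writeStream.format("iceberg")
    .outputMode("append")
    .option("checkpointLocation", "s3://my-bucket/checkpoints/orders/")  # placeholder path
    .toTable("lakehouse.sales.orders_raw")                               # placeholder table
    .awaitTermination()
)
```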

Nevertheless, the key is to use open table formats: with Apache Iceberg we can store data once and query it with any tool, and that is what drives this blog.

While Kafka excels at delivering data in real time, Iceberg and Delta Lake provide a significant advantage: they do not require server-side compute resources in order to access the data, since it is stored as Parquet files in object storage.
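A minimal sketch of that client-side access pattern with the PyIceberg library follows; it assumes an AWS Glue catalog and placeholder names, and no warehouse or cluster is involved in the read.

```python
# Sketch: read an Iceberg table straight from object storage with PyIceberg.
# Catalog settings, namespace, and table name are placeholders.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "default",
    **{
        "type": "glue",                     # assumption: AWS Glue as the Iceberg catalog
        "warehouse": "s3://my-lakehouse/",  # placeholder warehouse location
    },
)

table = catalog.load_table("sales.orders_raw")           # placeholder namespace.table
df = table.scan(row_filter="amount > 100").to_pandas()   # filter pushed down to Parquet metadata
print(df.head())
```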

As I described in this blog post, organizations should set Iceberg or Delta Lake as the default publisher location for their data products, and only opt for streaming (Kafka) when sub-second latency is mandatory.

More on data products later in the blog. For now, a new data processing architecture called Shift Left has emerged as a response to inefficiencies in earlier data processing paradigms such as Medallion or classic ETL. The main problem of the Medallion and ETL architectures is that they are inefficient and carry an elevated compute cost, because data is copied over and over again at every hop.

In the Multi-Hop Medallion Architecture, raw data ("bronze") undergoes iterative refinement through silver (cleaned) and gold (enriched) layers via batch ETL processes. This requires duplicative processing at each layer, which increases compute costs, adds latency between data generation and availability, and creates silos where operational and analytical systems use different datasets.

The Multi-Hop Architecture
Data is typically extracted, transformed, and loaded (ETL) into an analytics environment following a multi-hop process (a small code sketch follows the list below):

  1. Bronze Layer (Landing Zone) – Raw data from operational systems, including Kafka topics, SaaS endpoints, or text files, is ingested here.
  2. Silver Layer (Refined Data) – Data is cleaned, structured, and enriched, making it usable for business purposes.
  3. Gold Layer (Business-Ready Data) – Purpose-built datasets are created for specific applications, reports, or exports.
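A compressed PySpark sketch of these three hops is shown below; paths, columns, and transformations are purely illustrative, but they make visible how each layer re-reads and re-writes the same data.

```python
# Sketch of a classic batch medallion pipeline in PySpark.
# Paths, columns, and transformations are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("medallion-batch").getOrCreate()

# Bronze: land the raw events as-is.
bronze = spark.read.json("s3://my-bucket/landing/orders/")
bronze.write.mode("append").parquet("s3://my-bucket/bronze/orders/")

# Silver: re-read bronze, clean and deduplicate, write a second copy.
silver = (
    spark.read.parquet("s3://my-bucket/bronze/orders/")
    .dropDuplicates(["order_id"])
    .filter(F.col("amount").isNotNull())
)
silver.write.mode("overwrite").parquet("s3://my-bucket/silver/orders/")

# Gold: re-read silver, aggregate for one report, write a third copy.
gold = (
    spark.read.parquet("s3://my-bucket/silver/orders/")
    .groupBy("customer_id")
    .agg(F.sum("amount").alias("total_spent"))
)
gold.write.mode("overwrite").parquet("s3://my-bucket/gold/customer_revenue/")
```

Every hop in this sketch is a full read plus a full write, which is exactly where the extra compute, latency, and storage cost comes from.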

Medallion architecture by Databricks

 

This Medallion Architecture (Bronze → Silver → Gold) is widely used but has significant drawbacks.

Challenges with Multi-Hop Architectures

  1. Slow Data Processing

    • Most multi-hop architectures rely on scheduled batch processing.
    • Each transformation step introduces a delay (e.g., a 15-minute interval per hop), resulting in outdated insights.
  2. High Costs

    • Every hop requires additional storage and processing power, increasing infrastructure expenses.
  3. Brittle Pipelines

    • Different teams manage different layers (e.g., database engineers, data engineers, and analysts), leading to coordination challenges and frequent breakages.
  4. Duplicate Pipelines

    • To avoid disruptions, teams often create their own redundant data pipelines, adding unnecessary complexity.
  5. Similar Yet Different Datasets

    • Multiple versions of the same dataset emerge, leading to confusion about which one is the authoritative source.
    • Inconsistent data can erode trust, especially when reports show conflicting numbers.

 

Shift Left addresses these inefficiencies by moving data preparation earlier in the process.

Shift Left Architecture. By Author

 

 

The key principles include:

  1. A Stream-First Approach

    • Instead of periodic batch jobs, data is processed in real-time using event-driven architectures (EDAs).
    • Technologies like Kafka and Change Data Capture (CDC) connectors ensure that updates are immediately available.
  2. Stream-to-Table Conversion with Amazon S3

    • Amazon now automatically converts streams into Apache Iceberg tables, making structured data instantly accessible.
    • Iceberg tables can be read by multiple engines (e.g., Trino, Flink, Presto, Spark, Databricks, and Snowflake).
  3. Integration with Existing Analytics Workflows

    • Shift Left doesn’t replace existing architectures but enhances them.
    • Cleaned and structured data from streams seamlessly integrates into the Silver layer, enabling faster and more reliable insights (see the sketch below).
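Below is a minimal sketch of this stream-first pattern using PyFlink SQL: a Kafka source is cleaned in the stream and continuously inserted into an Iceberg table, so the data lands already "silver-ready". The connector options, catalog settings, and table names are assumptions, and the Kafka and Iceberg Flink connector jars are presumed to be available.

```python
# Sketch: shift-left stream processing with PyFlink SQL. Clean a Kafka stream
# and continuously write it to an Iceberg table. All names are placeholders.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Raw events arriving on a Kafka topic (placeholder broker and topic).
t_env.execute_sql("""
    CREATE TABLE orders_stream (
        order_id    STRING,
        customer_id STRING,
        amount      DOUBLE,
        event_time  TIMESTAMP(3)
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'orders',
        'properties.bootstrap.servers' = 'localhost:9092',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'json'
    )
""")

# An Iceberg catalog backed by object storage (assumed Hadoop-style catalog on S3).
t_env.execute_sql("""
    CREATE CATALOG lakehouse WITH (
        'type' = 'iceberg',
        'catalog-type' = 'hadoop',
        'warehouse' = 's3://my-lakehouse/'
    )
""")
t_env.execute_sql("CREATE DATABASE IF NOT EXISTS lakehouse.sales")
t_env.execute_sql("""
    CREATE TABLE IF NOT EXISTS lakehouse.sales.orders_clean (
        order_id    STRING,
        customer_id STRING,
        amount      DOUBLE,
        event_time  TIMESTAMP(3)
    )
""")

# Clean in the stream, so no later re-processing hop is needed.
t_env.execute_sql("""
    INSERT INTO lakehouse.sales.orders_clean
    SELECT order_id, customer_id, amount, event_time
    FROM default_catalog.default_database.orders_stream
    WHERE order_id IS NOT NULL AND amount > 0
""").wait()
```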

Benefits of Shift Left

  • Faster Data Availability – Real-time streaming eliminates batch processing delays.
  • Lower Costs – Fewer data copies reduce storage and processing expenses.
  • Improved Reliability – Stronger schema enforcement prevents pipeline breakages.
  • Simpler Architecture – Eliminates redundant pipelines and conflicting datasets.
  • Incremental Adoption – Companies can gradually migrate workloads to Shift Left without disrupting operations.

 

The Shift Left Architecture builds on the concept of data products, introduced by McKinsey consultants, which are central to modern data mesh principles. The goal is to unify operational and analytical workloads by creating high-quality, consistent, and real-time data products.

Source: McKinsey, "a-better-way-to-put-your-data-to-work"

 

The Shift Left Approach addresses the medallion inefficiencies by moving data processing closer to its source—on the left side of the architecture. At its core, this approach relies on event-driven data streaming for real-time, scalable, and reliable processing.

The concept is to create the data product once and make it available for multiple systems to consume. This is cheaper and faster, and it supports both operational and analytical use cases with the Iceberg format and its ACID transaction guarantees.

Amazon S3 with its Iceberg capability, used as a table format to unify operational and analytical workloads, allows businesses to store data once in an object store (Amazon S3) and consume it across various platforms (e.g., Snowflake, Athena) without requiring additional processing or connectors, although other open table formats such as Delta Lake could serve the same purpose.
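To close the loop, here is a sketch of querying that same Iceberg data product in place with Amazon Athena through boto3; the region, database, table, and result location are illustrative assumptions.

```python
# Sketch: query the Iceberg data product in place with Amazon Athena via boto3.
# Region, database, table, and output location are placeholders.
import time
import boto3

athena = boto3.client("athena", region_name="eu-west-1")

execution = athena.start_query_execution(
    QueryString=(
        "SELECT customer_id, SUM(amount) AS total_spent "
        "FROM orders_clean GROUP BY customer_id"
    ),
    QueryExecutionContext={"Database": "sales"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes, then fetch the first page of results.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print(rows[:5])
```

The same table could just as well be opened by Snowflake, Trino, or Spark, because the data and its Iceberg metadata live once in S3.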
