Words are cheap. I will be crisp.

SAP is a goldmine of data. The classic data engineering challenge has been unifying that complex, scattered business data – especially from SAP applications like ECC and S/4HANA – with other enterprise data, and then truly leveraging it for advanced analytics and, for some use cases (like MRP or forecasting), machine learning.

The new challenge is to do the same, but for cutting-edge generative AI. To tackle this, SAP introduced the SAP Business Data Cloud (BDC) earlier this year.

BDC has three main concepts and three main SAP products. The three main concepts are the Data Products, the Insight Apps, and BDC itself. The products that make it possible are the classics SAP Datasphere and SAP Analytics Cloud (which have been around for a while), plus SAP Databricks, which is an optional component of BDC and was in preview... until today.

Now that it's GA, we'd better understand what it is and what it's not.

What SAP Databricks IS: the union of SAP business data with generative AI

Do not confuse Databricks with SAP Databricks. They are very similar, but different. Here is what SAP Databricks is:

  • A Native, Fully Managed Service within SAP Business Data Cloud: This isn't an external solution bolted on. SAP Databricks is a version of the Databricks platform included natively as a service within SAP BDC. It is a fully managed service, meaning SAP handles the underlying infrastructure, allowing us to focus on our data and AI initiatives.
  • Unifying Semantically Rich Business Data with Leading AI/ML: The core power lies in unifying our valuable, semantically rich data from SAP applications with the current industry-leading AI, machine learning, data science, and data engineering platform. SAP thereby integrates SAP data with the rest of the business data for exactly the reasons Databricks is booming: its advanced analytics and AI use cases.
  • Zero-Copy Data Sharing Powered by Delta Sharing: SAP moves away from data replication. BDC is an end destination, and SAP Databricks leverages zero-copy Delta Sharing. This enables seamless, bidirectional data sharing between SAP BDC and Databricks, so data does not need to leave the SAP platform to be used for generative AI use cases. This is crucial for integrating data products from SAP applications with semi-structured and unstructured data from any source, without needing to move the data physically (see the sketch after this list).
  • Custom Data Products: Within SAP Databricks, users can easily access SAP data products and create custom data products out of their AI/ML and analytics workloads. Data products are fundamental building blocks within BDC which I explained in this blog post, serving as packaged and governed data assets.
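
To make the zero-copy idea concrete, here is a minimal sketch of reading a shared table with the open-source delta-sharing Python client. The profile file path and the share/schema/table names are hypothetical placeholders; inside SAP Databricks itself, shared SAP data products surface through Unity Catalog, so treat this as an illustration of the underlying protocol, not an official BDC recipe.

```python
# Hypothetical example: reading a Delta Sharing table with the open-source client.
# The profile file and share/schema/table names below are placeholders.
import delta_sharing

profile = "/dbfs/FileStore/bdc_share.share"     # credentials file provided by the data provider
client = delta_sharing.SharingClient(profile)
print(client.list_all_tables())                 # discover what has been shared with us

# Load one shared table without copying it into our own storage
url = f"{profile}#sap_share.sales.sales_orders"
orders = delta_sharing.load_as_pandas(url)
print(orders.head())
```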

Which Databricks tools are available in SAP Databricks?

  • Databricks Notebooks: Ideal for pro-code data engineering tasks and building custom AI/ML models (see the sketch after this list).
  • Databricks SQL: Allows us to analyze datasets at scale using standard SQL queries.
  • Unity Catalog: Provides centralized governance for all our data assets, including structured and unstructured data, ML models, notebooks, and files. It enables governance for data products exposed from SAP Databricks via the SAP BDC catalog.
  • Mosaic AI: This is our gateway to building secure, governed, and custom AI/ML solutions, including advanced Generative AI applications and Large Language Models. I'll discuss this below.
  • Serverless Spark: Provides scalable compute resources for our data processing tasks.
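
As a taste of the notebook experience, here is a minimal sketch of a pro-code query against a shared SAP data product. The catalog/schema/table and column names are hypothetical placeholders; `spark` and `display` are the objects Databricks notebooks provide by default.

```python
# Minimal notebook sketch; the table and column names below are hypothetical placeholders
# for an SAP data product shared into Unity Catalog via BDC.
from pyspark.sql import functions as F

orders = spark.table("bdc_share.sales.sales_orders")

monthly_revenue = (
    orders
    .groupBy("order_month")                         # assumed column names
    .agg(F.sum("net_amount").alias("net_amount"))
    .orderBy("order_month")
)

display(monthly_revenue)                            # notebook-native visualization
```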

SAP Databricks and Databricks Feature Parity

The following table summarizes the availability status of key Databricks components in SAP Databricks (relative to standard Databricks), based on the knowledge I could gather; subject to correction.

 

| Feature/Component Category | Feature/Component Name | SAP Databricks |
|---|---|---|
| Core Platform | Workspace UI | Yes (Through BDC) |
| | REST APIs | Likely Yes (Potentially subset or different endpoints/authentication) |
| | CLI | Unconfirmed |
| Compute | All-Purpose Clusters (Classic) | Unconfirmed (irrelevant in SaaS) |
| | Jobs Compute (Classic) | Unconfirmed (irrelevant in SaaS) |
| | Instance Pools | Unconfirmed (irrelevant in SaaS) |
| | Serverless SQL Warehouses | Unconfirmed (irrelevant in SaaS) |
| | Serverless Notebooks/Jobs Compute | Unconfirmed (irrelevant in SaaS) |
| Data Management | Delta Lake | Yes |
| | DBFS (Databricks File System) | Unconfirmed (irrelevant in SaaS) |
| | Volumes (for non-tabular data) | Unconfirmed (irrelevant in SaaS) |
| Governance | Unity Catalog | Yes |
| | Data Lineage (via UC) | Yes |
| Data Sharing | Delta Sharing | Yes* (Between BDC and SAP Databricks) |
| Data Engineering | Notebooks | Yes |
| | DLT (Delta Live Tables) | Excluded |
| | Auto Loader | Excluded |
| | Jobs / Workflows | Limited (excluded ETL or ML Inference process) |
| | Streaming Tables | Excluded |
| ML/AI (Mosaic AI) | MLflow (Managed) | Yes |
| | Model Registry (via MLflow) | Yes |
| | Model Serving (Mosaic AI) | Yes |
| | Feature Store | Yes |
| | Vector Search (Mosaic AI) | Yes |
| | AutoML | Yes (Including forecasting for SAP) |
| | Agent Framework / Evaluation | Yes |
| | AI Functions | Yes |
| | AI Playground | Yes |
| BI & Visualization | SQL Editor | Yes |
| | Databricks SQL Dashboards | Yes |
| | Redash Integration | Unconfirmed (Likely Excluded) |
| Ecosystem & Extensibility | Marketplace | Excluded |
| | Partner Connect | Excluded |
| | Databricks Apps | Excluded |
| | Lakeflow Connect | Excluded |
| Security | IP Access Lists | Yes (With caveats regarding BDC connection) |
| | PrivateLink / Private Endpoints | Yes (AWS PrivateLink for us-west-1) |
| | Serverless Egress Control | Yes (With caveats regarding BDC connection) |

Generative AI with Mosaic AI

Databricks is generally about Spark, but to me, the most innovative piece of what Databricks has to offer is Mosaic AI. MosaicML was an acquisition meant to fuel the GenAI stack on top of the analytics platform. Mosaic AI allows us to:

  • Build Modern Generative AI Solutions: Create cutting-edge GenAI applications directly on the Databricks platform, integrated with our business data.
  • Leverage the Full ML Lifecycle: Build an end-to-end GenAI solution using Databricks' infrastructure, data, and tools, covering everything from data preparation and management (like building knowledge bases for RAG) to model training, deployment, securing, and monitoring. This allows for greater control and potentially lower costs compared to relying solely on external, black-box GenAI solutions. Databricks' AI Agent Framework, powered by Mosaic AI, even allows building domain-specific AI agents that can call external APIs.
  • Ground LLMs with our Business Data: Combine the power of LLMs with any organization's specific, governed business data to create accurate and relevant AI applications. This is a key pattern, often using Retrieval Augmented Generation (RAG) techniques, which Mosaic AI supports (see the retrieval sketch after this list).
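
To ground the RAG point, below is a minimal sketch of the retrieval step using the Mosaic AI Vector Search Python client (the databricks-vectorsearch package). The endpoint name, index name, columns, and query are hypothetical placeholders, and the step of feeding the retrieved chunks to an LLM endpoint is left out for brevity.

```python
# Sketch of the retrieval half of a RAG flow; endpoint/index/column names are hypothetical.
from databricks.vector_search.client import VectorSearchClient

vsc = VectorSearchClient()  # uses the workspace's ambient authentication

index = vsc.get_index(
    endpoint_name="kb_endpoint",                 # hypothetical Vector Search endpoint
    index_name="main.knowledge.docs_index",      # hypothetical UC-governed index
)

hits = index.similarity_search(
    query_text="How is the MRP run configured for plant 1000?",
    columns=["doc_id", "chunk_text"],
    num_results=5,
)

# The retrieved chunks would then be stuffed into the prompt of an LLM endpoint.
for row in hits["result"]["data_array"]:
    print(row)
```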

I know this might raise questions from the SAP community that I am not yet ready to answer.

How to get official knowledge on Databricks and Mosaic AI

SAP will, as usual, document in detail how the compute, the data catalog, and Delta Sharing work, because SAP Databricks brings a different UI compared to Databricks, limits data integration to external sources or destinations, and does not allow deploying the infrastructure on premises or at the hyperscaler of choice. The rest is the same, and getting the best out of Mosaic AI is on us.

If you have not used it, once you're comfortable with the basics, it's key to understand how Databricks approaches the machine learning lifecycle and how this transfers to generative AI:

Remember, MosaicML was acquired in 2023, and this part might look different since it's specific to generative AI.

  • Key Concepts:

    • MLflow: Understand how Databricks integrates this open-source platform for managing the ML lifecycle (tracking experiments, packaging code, registering and deploying models) – see the sketch after this list.
    • Feature Store: How to create, manage, and serve features for model training.
    • Model Training: Using libraries like scikit-learn, TensorFlow, PyTorch within Databricks notebooks.
    • Model Registry: Storing and versioning trained models (often within Unity Catalog).
    • Basic Model Serving: Understanding the concept of deploying models as APIs (aka Mosaic AI Model Serving).
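
Tying a few of these concepts together, here is a minimal MLflow sketch: train a toy scikit-learn model, track the run, and register the result. The Unity Catalog model name is a hypothetical placeholder, and whether the registry is wired up exactly like this in SAP Databricks is an assumption on my side.

```python
# Minimal MLflow lifecycle sketch; the registered model name is a hypothetical UC path.
import mlflow
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

mlflow.set_registry_uri("databricks-uc")   # register models in Unity Catalog (assumption for SAP Databricks)

X, y = make_regression(n_samples=500, n_features=8, noise=0.1, random_state=42)

with mlflow.start_run(run_name="demand_forecast_baseline"):
    model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("train_r2", model.score(X, y))
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name="main.demo.demand_forecast",   # hypothetical catalog.schema.model
    )
```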

There are too many concepts, so let's summarize them in a cheat sheet.

Mosaic AI Cheat Sheet

| Feature/Concept | Description | Key Details / How it Relates |
|---|---|---|
| Mosaic AI (Databricks) | Databricks' brand for its generative AI offerings and a complete suite for AI and ML | Integrated technology from the MosaicML acquisition. Provides a well-structured AI framework ensuring governance, inference, orchestration, and data management. Enables creating a complete GenAI solution using its own infrastructure, data, and tools. All-in-one platform for building, managing, and scaling GenAI applications. Offers complete ownership over models and data. |
| Mosaic AI Agent Framework | A set of tools on Databricks designed to help developers build, deploy, and evaluate production-quality AI agents | Supports building various agent types like RAG applications, Text-to-SQL agents, data analyst agents, customer support, research, business operations, advisory agents, and more. Part of the Orchestration Layer in the Mosaic AI architecture. Integrated with MLflow and compatible with LangChain/LangGraph and LlamaIndex. Includes robust agent evaluation. Key functions include intent recognition, dialogue management, context tracking, knowledge augmentation, fact-checking, memory management, tools & planning. Includes Agent SDK, Agent evaluation, Agent serving, and a Tools Catalog. Leverages Unity Catalog, Model Serving, and AI Playground. Designed for rapid experimentation and deployment with control over data sources. |
| Mosaic AI Vector Search | An integrated vector database and search capability within Databricks | Stores vector embeddings for fast similarity searches. Crucial component for building RAG applications. Index is built based on Delta tables. Can use Databricks foundation models for embeddings. An index is automatically synchronized with the source Delta tables. Supports similarity search via the Vector Search Endpoint. Part of the Storage & Knowledge Base Layer. Scalable to handle billions of embeddings and thousands of queries per second. Allows building a "Vector Database" internally without external options. Requires preparing a Delta table knowledge base. Supports incremental auto-sync verification. |
| Mosaic AI Model Training | Tools for fine-tuning or pre-training large language models | Integrated from the MosaicML acquisition. Part of the model lifecycle. Leverages scalable compute resources for demanding algorithms. Allows training custom LLMs at a lower cost. Requires selecting a base model (e.g., Llama, DBRX, etc.), specifying the type of task (code completion, instruction fine-tuning, continued pre-training), and providing the location of the dataset. Part of Mosaic AI capabilities. |
| Mosaic AI Batch Inference | Enables running batch LLM inference at high scale | Uses the ai_query function to apply AI directly on data (see the sketch after this table). Leverages Databricks Model Serving for speed, autoscaling, batching, and fault tolerance. Used for tasks like summarization, sentiment analysis, and topic analysis on transcripts. Can use models like OpenAI Whisper or open-source models deployed on Databricks. |
| Mosaic AI Gateway | Provides a unified interface and governance layer for querying and managing access to various AI model APIs (internal and external) | Part of the Governance Layer in the Mosaic AI architecture. Ensures compliance, security, and monitoring of AI requests. Includes features for Permissions & Rate Limiting, Payload Logging (for auditing), Usage Tracking (for performance/engagement), AI Guardrails & Traffic Routing, and Content Filtration & PII Detection. Builds on Model Serving. Unified interface to query foundation model APIs and audit usage. Supports any model endpoint (Azure OpenAI, Amazon Bedrock, Meta Llama, etc.). Logs usage in Unity Catalog with permissions enforced. Provides real-time spending insights for cost management. Safeguards data and GenAI deployments with centralized governance. Ensures responsible AI deployment. Content safety and quality checks (related to guardrails) are also part of monitoring. |
| Mosaic AI Model Serving | Component responsible for running AI models and serving responses in real-time or batch mode | Part of the Inference Layer in the Mosaic AI architecture. Provides a Unified Deployment Interface. Offers Scalable Model Deployment and Automatic Resource Scaling. Supports Model/Agent Endpoints. Agents registered in Unity Catalog can be deployed as Model Serving endpoints. Integrated with the Agent Framework. Enables real-time similarity search against the Vector Search index. Can serve custom LLMs. Used for Batch Inference via ai_query. Part of Mosaic AI capabilities. Enables seamless model deployment. Supports real-time serving & batch inference. |
| Mosaic AI Agent Evaluation | Tools and processes within the Mosaic AI Agent Framework for assessing the quality, reliability, and safety of AI agents | Focuses on robust evaluation through human feedback loops, cost/latency trade-offs, and quality metrics. Includes LLM Judge and Agent Metrics on parameters like chunk relevance, ground truth, query relevance, response safety, latency, and token count. Agent performance is evaluated against a ground truth dataset. Uses the MLflow Evaluate API. Allows iterative evaluation by modifying datasets and rerunning. Enables logging of interactions (questions, answers, feedback) for performance analysis. Supports using LLMs or human judges (internal/external) for evaluation. Ensures quality, reliability, and safety. Human-in-the-loop via the Review App is supported. Separate documentation is available. Part of the Agent Framework components. |
| Mosaic AI Tool Catalog (in UC) | A registry within Databricks, managed in Unity Catalog, for registering functions (SQL, Python, remote) and model endpoints that AI agents can call | Part of the Mosaic AI Agent Framework. Supports Tools & Planning functionality. Helps clients create a registry of SQL, Python, or remote functions, and model endpoints. Agents can use tools from the catalog to interact with data sources (Delta tables) or external systems/APIs. Use Unity Catalog Functions (uc_functions) as tools. Enables tool calling. Managed and registered in Unity Catalog. |
| RAG (Retrieval-Augmented Generation) | An architecture pattern for GenAI systems that uses retrieval of relevant data to augment model responses | Mosaic AI Vector Search is a component for building RAG applications on Databricks. Involves indexing documents/data, converting them to embeddings, and performing similarity searches to retrieve relevant information for the LLM. The Mosaic AI Agent Framework supports building RAG applications. Advanced RAG techniques exist beyond naive RAG. Requires a foundation of good data governance and metadata. Uses vector databases. |
| AI Functions | Functions, such as SQL, Python, remote, or specific AI capabilities like ai_query, that can be registered and called by AI agents or used for AI tasks on data | Can be registered in the Tools Catalog. Can be Unity Catalog Functions (uc_functions). Enable agents to interact with data or external systems. ai_query is a specific function for batch LLM inference. Used by agents to perform actions based on prompts. |
| Unity Catalog | Databricks' unified governance layer for data, AI models, notebooks, files, and other assets | Provides centralized governance. Vector Search indexes are created and managed within Unity Catalog. The Tools Catalog for the AI Agent Framework is registered in Unity Catalog. AI agents/models are registered in Unity Catalog via MLflow. Enables model governance, ensuring only authorized users can access, modify, or deploy models. Logs AI Gateway usage with permissions enforced. Provides attribute-based access control (ABAC) to all managed assets. Crucial for providing context and ensuring reliability for AI (agents, text-to-SQL) through metadata and governance. Facilitates seamless interoperability and data sharing (e.g., via Delta Sharing). Manages metadata for Delta Lake tables, often the source for Vector Search. Supports lineage tracking. Key component for an AI governance program, converging data and AI metadata. Essential for access control to prevent data/model mishandling. |
| Delta Lake | An open standard table format providing reliability and performance for data lakes | Often the foundational storage format for data used in Databricks. Data tables used as a source for Mosaic AI Vector Search are typically Delta tables. Provides transaction logs, schema evolution, and time travel. Vector Search indexes are automatically synchronized with updates to source Delta tables. Used for storing historical data that agents can access. Supports zero-copy data sharing (Delta Sharing). |
| Foundation Models / External Models | Pre-trained large language models (LLMs) provided or accessible through the Databricks platform or other platforms, via APIs/endpoints | Databricks provides access to foundation models (DBRX, Llama, Mixtral). Can use third-party models like OpenAI, Azure OpenAI, Amazon Bedrock, Meta Llama, Mistral, Google, Reka. Accessible via Model Serving endpoints and potentially through the AI Gateway. Used for various tasks like generating embeddings for Vector Search or performing analysis in Batch Inference via ai_query. Agents can use specific LLM endpoints. Other platforms also integrate or offer foundation/external models (IBM watsonx, Snowflake Cortex). |
| AI Agents | Applications designed to perform tasks by leveraging LLMs and external tools/functions | Built using the Mosaic AI Agent Framework. Can call tools (like Unity Catalog functions or external APIs) to retrieve information or perform actions. Can perform tasks like analyzing financial data, answering questions based on internal knowledge bases (RAG), or Text-to-SQL. Require data governance and metadata for accuracy and trustworthiness. Can be evaluated for quality and reliability. Can be deployed as Model Serving endpoints. |
| AutoML | Automated Machine Learning | End-to-end AI/ML lifecycle capability available in the SAP Databricks context. Aka Mosaic AI AutoML. Part of the collaborative and unified data science environment. |
| Managed MLflow | Databricks' integrated platform for managing the ML lifecycle | Integrated with the Mosaic AI Agent Framework. Supports experiment tracking and model logging. Enables model registration in Unity Catalog (UC). Used for Agent Evaluation (MLflow Evaluate API). Includes MLflow Tracing (open source). Supports deployment workflows. Part of end-to-end AI/ML lifecycle capabilities. Supports iterative evaluation and tracking metrics. Captures code and environment for deployment. |
| AI Playground | A sandbox environment on Databricks for prototyping and testing AI agents, particularly those using tool-calling | Provides an environment for integrating tools (UC functions) with LLMs for testing. Allows refining agent functionality before deployment. Supports exporting the agent from the playground for further development or deploying it directly as a Model Serving endpoint. Can auto-generate basic agent package notebooks. Leveraged as part of the Agent Framework benefits. |
| Feature Engineering & Serving (in UC) | The process of preparing data features for ML models, integrated within the Databricks platform, potentially leveraging an intelligent feature store | Feature Engineering is a capability within the single Databricks stack. Mosaic AI includes an Intelligent Feature Store for automatic feature engineering and selection. Leverages data governed by Unity Catalog. Features are used by models typically deployed via Model Serving. |
| Inference Tables | Tables used to log interactions (questions, answers, feedback) from AI agents or models during inference | Primarily used for performance analysis and monitoring. Supports ensuring the quality, reliability, and safety of AI responses. Related to Agent Evaluation monitoring. |
| AI Gateway Logging & Usage Tracking | Capabilities within Mosaic AI Gateway for recording user interactions and tracking system performance and user engagement | Part of the Governance Layer. Includes Payload Logging (records user interactions for auditing). Includes Usage Tracking (monitors system performance and user engagement). Enables auditing model usage and data sent/returned. Logs usage in Unity Catalog with permissions enforced. Provides real-time spending insights for cost management. |
| Agent Evaluation Monitoring UI | User interface components, such as the Review App, that support the monitoring, review, and analysis of AI agent performance and interactions | Supports reviewing agent query requests/responses. Allows stakeholders and human judges (internal/external) to give feedback and label responses. Used for iterative evaluation and tracking metrics. Enables analysis of logged interactions. The Review App is a specific mechanism for online evaluation and human-in-the-loop feedback. |
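
Since ai_query comes up in several of the rows above, here is a minimal batch inference sketch run from a notebook. The table, column, and serving endpoint names are hypothetical placeholders; in particular, I have not verified which foundation model endpoints are exposed in SAP Databricks.

```python
# Batch LLM inference sketch using the ai_query SQL function.
# Table, column, and endpoint names are hypothetical placeholders.
summaries = spark.sql("""
    SELECT
      ticket_id,
      ai_query(
        'databricks-meta-llama-3-3-70b-instruct',
        CONCAT('Summarize this support transcript in one sentence: ', transcript)
      ) AS summary
    FROM main.demo.support_transcripts
""")

display(summaries)
```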

Databricks AI/ML Documentation: Tutorials: Get started with AI and machine learning (AWS Example)

 
