In SAP Datasphere, flows are a key component for implementing ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) processes. A Data Flow resembles a traditional ETL pipeline, whereas a Replication Flow aligns more closely with modern data engineering practice, where raw data is first replicated or ingested into a data lake and transformation logic is applied afterwards; that second step corresponds to the Transformation Flow. Below are the three types of flows available.
In this blog post, we'll focus on the end-to-end functionality of Data Flow, covering the areas below.
A Data Flow in SAP Datasphere is used to integrate and transform data from both SAP and non-SAP sources, and move it between various targets within the Datasphere environment.
It supports:
- Standard transformation operations: filter, join, union, aggregation
- Advanced operations using Python scripts
- Batch processing workloads
It’s comparable to SAP BODS or Process Chains in SAP BW.
Source Compatibility:
Data Flows are supported for most sources currently integrated with Datasphere, including:
- SAP S/4HANA CDS views
- SLT (SAP Landscape Transformation)
- Non-SAP sources like Amazon Redshift, Google BigQuery, and Azure Data Lake
However, they do not support:
- ODP-based data sources (e.g., classic BW extractors)
- SAP BW/4HANA
- Amazon Athena
For delta extraction from S/4HANA, you must use delta-enabled CDS views.
Typical scenarios where a Data Flow is the right choice:
- To bring data into SAP Datasphere from various sources
- To perform complex transformations (e.g., business logic, calculations)
- To persist intermediate or final datasets within different layers (e.g., raw, harmonized, consumption)
Here is a quick comparison of the three flow types based on their use cases.
| Flow Type | Primary Use Case | Transformation | Notes |
|---|---|---|---|
| Data Flow | ETL | Yes | Transform and load |
| Replication Flow | ELT | No | Just extracts and loads |
| Transformation Flow | Post-load transforms | Yes | Works on already loaded data |
When loading data to a target table in a Data Flow, you can choose from three update modes:
1. Append
- Inserts all incoming records as new rows
- Does not affect existing data
- Use Case: Incremental/historical loads
2. Truncate
- Deletes all existing records from the table before loading new data
- Use Case: Full refresh scenarios
3. Delete
- Deletes only records in the target that match incoming records (based on keys/conditions)
- Use Case: Targeted clean-up before loading updated records.
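If the semantics of the three modes are unclear, the following minimal pandas sketch illustrates them conceptually. This is not how Datasphere implements the modes internally; the table, key, and column names (Sale_ID, Amount) are made up for illustration.
```
import pandas as pd

target = pd.DataFrame({"Sale_ID": [1, 2], "Amount": [100, 200]})    # existing target table
incoming = pd.DataFrame({"Sale_ID": [2, 3], "Amount": [250, 300]})  # new load

# Append: insert all incoming rows, existing data untouched
append_result = pd.concat([target, incoming], ignore_index=True)

# Truncate: delete everything in the target, then load the incoming rows
truncate_result = incoming.copy()

# Delete: remove target rows whose key matches an incoming row, then load the incoming rows
delete_result = pd.concat(
    [target[~target["Sale_ID"].isin(incoming["Sale_ID"])], incoming],
    ignore_index=True,
)
```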
Example: Persist Sales Transactions by combining them with Product and Store details.
Steps:
1. Go to Data Builder → Select Flows.
2. Click New Data Flow.
3. In the canvas:
- From the HANA_CLOUD connection → navigate to the TECHED folder
- Drag and drop Sales_Transactions123, Product, and Stores tables
4. Perform the joins (see the pandas sketch after these steps for the equivalent logic):
- Join Sales_Transactions with Stores on Store_ID
- Join Sales_Transactions with Product on Product_ID
5. Add the target table:
- Click the output node → give it a name
- Choose “Create and Deploy Table”
6. Save and Deploy the data flow
7. Click Run to execute the flow
8. We can use the Schedule option to run the flow at predefined intervals and monitor the execution using the Data Integration Monitor or Data Flow Monitor.
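For readers more comfortable with code, the two joins in step 4 correspond to the following pandas sketch. The data frames and most column values are invented stand-ins; only the join keys Store_ID and Product_ID follow the example.
```
import pandas as pd

# Stand-ins for the three source tables used in the example
sales = pd.DataFrame({"Sale_ID": [1, 2], "Store_ID": ["ST9", "ST1"], "Product_ID": ["P1", "P2"]})
stores = pd.DataFrame({"Store_ID": ["ST9", "ST1"], "Store_Name": ["Berlin", "Munich"]})
products = pd.DataFrame({"Product_ID": ["P1", "P2"], "Product_Name": ["Laptop", "Monitor"]})

# Join Sales_Transactions with Stores on Store_ID
sales_stores = sales.merge(stores, on="Store_ID", how="inner")

# Join the result with Product on Product_ID
result = sales_stores.merge(products, on="Product_ID", how="inner")
print(result)
```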
For simple filters and aggregations, we can use the filter and calculated-column functionalities available graphically. For more advanced transformations involving conditional logic, string parsing, or KPI buckets, we can use the script operator.
Let's embed a simple Python script inside the above Data Flow to calculate a few KPIs.
Steps:
1. I have copied the above data flow and created a new one with a script operator. Insert the script operator after the J_PRODUCT node.
2. Edit the Script Node and paste the below code:
Note: There is no need to import pandas in the Datasphere script operator; it is already available internally.
```
def transform(data):
    # 'data' arrives as a pandas DataFrame; work on a copy so the input stays untouched
    df = data.copy()
    epsilon = 1e-9  # guards against division by zero when Revenue is 0

    # Ratio KPIs expressed as percentages of revenue
    df['profit_margin'] = df['Profit'] / (df['Revenue'] + epsilon) * 100
    df['discount_rate'] = df['Discount'] / (df['Revenue'] + epsilon) * 100
    df['cost_to_revenue'] = df['Cost'] / (df['Revenue'] + epsilon) * 100

    # Flag rows where more than 20% of revenue is given away as discount
    df['high_discount_flag'] = df['discount_rate'] > 20

    # Bucket each row into High / Medium / Low based on profit margin
    df['profit_category'] = df['profit_margin'].apply(
        lambda x: 'High' if x > 30 else ('Medium' if x > 10 else 'Low')
    )
    return df
```
3. Create new columns in the script operator's output for all of the calculated values above.
4. Add the target table. Create and deploy the table.
5. Save and deploy the data flow.
6. Execute the flow; you will see the KPIs populated in the output. (A local sanity check of the same logic is sketched below.)
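To sanity-check the KPI logic outside Datasphere, you can run the same transform function locally against a small sample DataFrame. This assumes the transform function from step 2 is defined in the same Python session; the sample values below are invented for illustration.
```
import pandas as pd  # needed locally; inside the script operator pandas is already available

sample = pd.DataFrame({
    "Revenue":  [1000.0, 500.0, 0.0],
    "Cost":     [600.0, 480.0, 0.0],
    "Profit":   [400.0, 20.0, 0.0],
    "Discount": [50.0, 150.0, 0.0],
})

# Expect High / Low / Low profit categories and a high-discount flag on the second row
print(transform(sample)[["profit_margin", "discount_rate", "profit_category", "high_discount_flag"]])
```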
We can create input parameters to prompt the user for values at runtime; these values can then be used in filters and calculated columns.
1. Click on an empty area of the canvas editor and open Input Parameters from the details pane.
2. Create a new input parameter IP_STORE_ID with type string.
3. I have inserted a projection node between the J_PRODUCT node and the script node. Filters and calculated columns can be created in the projection node. I have defined the filter condition below, which uses the input parameter we just created.
4. Save and deploy the data flow. When you run the data flow, it will prompt you to enter the input parameter.
5. Here it is running with ST9 as the store ID.
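The exact filter expression follows Datasphere's projection-node editor; conceptually, the input parameter and filter behave like this small pandas sketch. The column name Store_ID and the value ST9 follow the example, everything else is made up.
```
import pandas as pd

def filter_by_store(df: pd.DataFrame, ip_store_id: str) -> pd.DataFrame:
    """Keep only the rows whose Store_ID matches the value supplied at runtime."""
    return df[df["Store_ID"] == ip_store_id]

sales = pd.DataFrame({"Store_ID": ["ST9", "ST1", "ST9"], "Revenue": [100, 200, 300]})
print(filter_by_store(sales, "ST9"))  # corresponds to running the flow with IP_STORE_ID = ST9
```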
A schedule is used to run the data flow on a recurring basis, similar to process chain batch loads. To run recurring scheduled tasks on your behalf, the job scheduling component must be authorized; this can also be done from your profile settings page.
1. Click Create from the Schedule drop-down option, then click Authorize.
2. I have scheduled it to run at 12:30 PM UTC.
3. In the Data Integration Monitor, we can see that it is scheduled to run every day.
This blog post introduced the Data Flow feature in SAP Datasphere, highlighted how it differs from the other flow types, and explained key concepts such as update modes. It also walked through both standard and Python-based examples, covering input parameters and scheduling options.