Technology Blog Posts by Members
saravananc2

Understanding Data Flows in SAP Datasphere

Introduction

In SAP Datasphere, flows are a key component for implementing ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) processes. You can think of a Data Flow as a traditional ETL pipeline, whereas a Replication Flow aligns more closely with modern data engineering practice, where raw data is first replicated or ingested into a data lake and the transformation logic is applied afterwards, which is the role of the Transformation Flow. Below are the three types of flows available.

  1. Data Flow
  2. Replication Flow
  3. Transformation Flow

In this blog post, we’ll focus on the end-to-end functionality of Data Flows, covering the areas below.

  1. What is a Data Flow?
  2. Why use a Data Flow?
  3. Comparison with Replication and Transformation Flows
  4. Data load modes for target tables
  5. How to create a Data Flow
  6. How to use a custom Python script in a Data Flow
  7. How to use input parameters in a Data Flow
  8. How to schedule a Data Flow

What Is a Data Flow in SAP Datasphere?

A Data Flow in SAP Datasphere is used to integrate and transform data from both SAP and non-SAP sources, and move it between various targets within the Datasphere environment.

It supports:

- Standard transformation operations: filter, join, union, aggregation

- Advanced operations using Python scripts

- Batch processing workloads

It’s comparable to SAP BODS or Process Chains in SAP BW.

Source Compatibility:

Data Flows are supported for most sources currently integrated with Datasphere, including:

- SAP S/4HANA CDS views

- SLT (SAP Landscape Transformation)

- Non-SAP sources like Amazon Redshift, Google BigQuery, and Azure Data Lake

However, they do not support:

- ODP-based data sources (e.g., classic BW extractors)

- SAP BW/4HANA 

- Amazon Athena

For delta extraction from S/4HANA, you must use delta-enabled CDS views.

Why Use a Data Flow?

- To bring data into SAP Datasphere from various sources

- To perform complex transformations (e.g., business logic, calculations)

- To persist intermediate or final datasets within different layers (e.g., raw, harmonized, consumption)

Data Flow vs Replication Flow vs Transformation Flow

Here is a quick comparison of the three flows based on their use cases.

| Flow Type | Primary Use Case | Transformation | Notes |
| --- | --- | --- | --- |
| Data Flow | ETL | Yes | Transforms and loads data |
| Replication Flow | ELT | No | Only extracts and loads |
| Transformation Flow | Post-load transforms | Yes | Works on already loaded data |

Data Load Modes for Target Tables

When loading data to a target table in a Data Flow, you can choose from three update modes:

 


1. Append

- Inserts all incoming records as new rows

- Does not affect existing data

- Use Case: Incremental/historical loads

2. Truncate

- Deletes all existing records from the table before loading new data

- Use Case: Full refresh scenarios

3. Delete

- Deletes only records in the target that match incoming records (based on keys/conditions)

- Use Case: Targeted clean-up before loading updated records.
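The row-level semantics of the three modes can be illustrated with a small pandas sketch (the table contents and column names here are made up for illustration; this only mimics the behavior, not Datasphere's actual engine):

```python
import pandas as pd

# Existing target table and an incoming batch (illustrative data).
target = pd.DataFrame({'Store_ID': ['ST1', 'ST2'], 'Revenue': [100, 200]})
incoming = pd.DataFrame({'Store_ID': ['ST2', 'ST3'], 'Revenue': [250, 300]})

# Append: every incoming record is inserted as a new row; existing data is untouched.
append_result = pd.concat([target, incoming], ignore_index=True)

# Truncate: all existing records are deleted first, then the batch is loaded.
truncate_result = incoming.copy()

# Delete: only target rows whose keys match incoming records are removed,
# then the incoming rows are loaded.
delete_result = pd.concat(
    [target[~target['Store_ID'].isin(incoming['Store_ID'])], incoming],
    ignore_index=True,
)

print(len(append_result), len(truncate_result), len(delete_result))  # 4 2 3
```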

How to Create a Data Flow

Example: Persist Sales Transactions by combining them with Product and Store details.

Steps:

1. Go to Data Builder → Select Flows.

2. Click New Data Flow.


 

3. In the canvas:

    - From the HANA_CLOUD connection → navigate to the TECHED folder

    - Drag and drop Sales_Transactions123, Product, and Stores tables


 

4. Perform the joins:

    - Join Sales_Transactions with Stores on Store_ID

    - Join Sales_Transactions with Product on Product_ID

    4a. Add the target table:

    - Click the output node and give it a name

    - Choose “Create and Deploy Table”

5. Save and Deploy the data flow


6. Click Run to execute the flow


 

7. We can use the schedule option to run the flow at predefined intervals and monitor its execution using the Data Integration Monitor or Data Flow Monitor.
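Conceptually, the two joins in step 4 behave like the pandas sketch below (the sample rows and non-key column names are assumptions; in Datasphere the joins run inside the flow itself):

```python
import pandas as pd

# Hypothetical miniature versions of the three source tables.
sales = pd.DataFrame({'Sale_ID': [1, 2], 'Store_ID': ['ST1', 'ST2'],
                      'Product_ID': ['P1', 'P2'], 'Revenue': [100, 200]})
stores = pd.DataFrame({'Store_ID': ['ST1', 'ST2'], 'City': ['Berlin', 'Munich']})
products = pd.DataFrame({'Product_ID': ['P1', 'P2'], 'Category': ['A', 'B']})

# Join Sales_Transactions with Stores on Store_ID,
# then join the result with Product on Product_ID.
joined = (sales.merge(stores, on='Store_ID', how='inner')
               .merge(products, on='Product_ID', how='inner'))

print(joined.columns.tolist())
```

The combined result is what gets written to the target table in the final step.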

Custom Python Step in a Data Flow

For simple filters and aggregations, we can use the filter and calculated-column functionality available graphically. For more advanced transformations involving conditional logic, string parsing, or KPI bucketing, we can make use of the script operator.

Let’s embed a simple Python script inside the above Data Flow to calculate a few KPIs.

Steps:

1. I have copied the above data flow and created a new one with a script operator. Insert the script operator after the J_PRODUCT node.


2. Edit the Script Node and paste the below code:

Note: There is no need to import pandas in Datasphere; it is already available internally, and the script operator’s data parameter is passed in as a pandas DataFrame.

```python
def transform(data):
    # 'data' arrives as a pandas DataFrame; no import needed.
    df = data.copy()

    # Small epsilon guards against division by zero when Revenue is 0.
    epsilon = 1e-9

    df['profit_margin'] = df['Profit'] / (df['Revenue'] + epsilon) * 100
    df['discount_rate'] = df['Discount'] / (df['Revenue'] + epsilon) * 100
    df['cost_to_revenue'] = df['Cost'] / (df['Revenue'] + epsilon) * 100

    # Flag rows where more than 20% of revenue was given as discount.
    df['high_discount_flag'] = df['discount_rate'] > 20

    # Bucket each row by its profit margin.
    df['profit_category'] = df['profit_margin'].apply(
        lambda x: 'High' if x > 30 else ('Medium' if x > 10 else 'Low')
    )
    return df
```

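Because the script operator works on a plain pandas DataFrame, the function can be sanity-checked locally before deploying. A minimal harness, with made-up sample values and the column names assumed above:

```python
import pandas as pd

# Same function as in the script node above.
def transform(data):
    df = data.copy()
    epsilon = 1e-9
    df['profit_margin'] = df['Profit'] / (df['Revenue'] + epsilon) * 100
    df['discount_rate'] = df['Discount'] / (df['Revenue'] + epsilon) * 100
    df['cost_to_revenue'] = df['Cost'] / (df['Revenue'] + epsilon) * 100
    df['high_discount_flag'] = df['discount_rate'] > 20
    df['profit_category'] = df['profit_margin'].apply(
        lambda x: 'High' if x > 30 else ('Medium' if x > 10 else 'Low')
    )
    return df

# Two illustrative rows: a high-margin sale and a heavily discounted one.
sample = pd.DataFrame({'Revenue': [1000.0, 500.0],
                       'Profit': [400.0, 50.0],
                       'Discount': [50.0, 150.0],
                       'Cost': [550.0, 300.0]})
out = transform(sample)
print(out[['profit_margin', 'profit_category', 'high_discount_flag']])
```

The first row lands in the 'High' bucket (40% margin) and the second in 'Low' with the high-discount flag set, which matches the bucketing logic above.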

3. Create new columns for all the above calculated values. 


4. Add the target table. Create and deploy the table.


5. Save and deploy the data flow.

6. Execute the flow and you’ll see the KPIs populated in the output.


Input Parameters in Data Flow

We can create input parameters to dynamically capture values entered by the user at runtime; these values can then be used in filters and calculated columns.

1. Click a blank space in the canvas editor and select Input Parameters from the details pane.


2. Create a new input parameter IP_STORE_ID with type string.


 

3. I have inserted a projection node between the J_PRODUCT node and the script node. Filters and calculated columns can be created in a projection node. I have defined the filter condition below, which uses the input parameter we just created.


 

4. Save and deploy the data flow. When you run the data flow, it will prompt you to enter a value for the input parameter.


 

5. It’s running with ST9 as the store ID.

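At runtime, such a parameterized filter behaves like a simple equality filter applied before the downstream nodes. A pandas sketch of the idea (the rows and the IP_STORE_ID value are illustrative):

```python
import pandas as pd

ip_store_id = 'ST9'  # value the user enters when the flow is run

# Hypothetical slice of the joined sales data.
rows = pd.DataFrame({'Store_ID': ['ST1', 'ST9', 'ST9'],
                     'Revenue': [100, 200, 300]})

# The projection node keeps only rows matching the input parameter.
filtered = rows[rows['Store_ID'] == ip_store_id]
print(len(filtered))  # 2
```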

 

Schedule

A schedule is used to run the data flow on a recurring basis, similar to process chain batch loads in SAP BW. To let Datasphere run recurring scheduled tasks on your behalf, you need to authorize the job scheduling component. You can also grant this authorization from your profile settings page.

1. Click Create from the Schedule drop-down, then click Authorize.

 


 

2. I have scheduled it to run at 12:30 PM UTC.


3. In the Data Integration Monitor, we can see that it is scheduled to run every day.


Conclusion

This blog post introduced the Data Flow feature in SAP Datasphere, highlighted how it differs from the other flow types, and explained key concepts such as update modes. It also walked through both standard and Python-based examples, covering input parameters and scheduling options.
