Hello Friends,
Welcome back to another series of blog posts, this time on an exciting feature available in SAP Datasphere: data flows. In this blog post, we will understand this concept with an example.
First, let us look at the different ETL objects available in SAP Datasphere. In this blog, we will mainly discuss data flows that use Python scripts.
Datasphere ETL Objects:
Data Flows
Data flows are a key component of SAP Datasphere, as they allow users to perform complex data transformations, enrich data, and structure it for reporting and analytics.
In this blog, we will explore the basic features and steps to create a data flow in SAP Datasphere.
A Data Flow is a graphical ETL (Extract, Transform, Load) tool that allows users to:
✔ Extract data from various sources (SAP and non-SAP).
✔ Apply transformations such as filtering, aggregations, joins, and calculated columns.
✔ Load transformed data into target tables for reporting and analytics.
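To make these transformation types concrete, here is a small pandas sketch of a filter, a join, a calculated column, and an aggregation. It is only an analogy for what the graphical operators do, not how SAP Datasphere implements them, and the tables and column names (orders, customers, QUANTITY, REGION, and so on) are hypothetical.

```python
import pandas as pd

# Hypothetical source data standing in for two source tables.
orders = pd.DataFrame({
    "ORDER_ID": [1, 2, 3, 4],
    "CUSTOMER_ID": ["C1", "C2", "C1", "C3"],
    "QUANTITY": [2, 5, 1, 3],
    "UNIT_PRICE": [10.0, 4.0, 25.0, 8.0],
})
customers = pd.DataFrame({
    "CUSTOMER_ID": ["C1", "C2", "C3"],
    "REGION": ["EMEA", "APJ", "EMEA"],
})

# Filter: keep only orders with a quantity above 1.
filtered = orders[orders["QUANTITY"] > 1]

# Join: enrich each order with the region of its customer.
joined = filtered.merge(customers, on="CUSTOMER_ID", how="inner")

# Calculated column: amount per order.
joined["AMOUNT"] = joined["QUANTITY"] * joined["UNIT_PRICE"]

# Aggregation: total amount per region.
result = joined.groupby("REGION", as_index=False)["AMOUNT"].sum()
print(result)
```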
Remote Tables:
Replication flows:
Transformation flows:
Task chains:
Let us walk through an example of a data flow that uses the Python script operator. Examples for the other operators have already been covered by others in their blog posts.
Now get set and go.
Please take a moment to understand the different operators provided in SAP Datasphere. Currently, SAP Datasphere does not have full-fledged ETL capabilities the way SAP BODS does. But in future releases, SAP Data Intelligence will be added as a feature that can be used in end-to-end (E2E) ETL scenarios.
Python Script
Supported Python libraries in SAP Datasphere as of now:
NumPy and pandas.
https://pandas.pydata.org/docs/user_guide/index.html#user-guide
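A minimal sketch of what a script for the Python script operator can look like. It assumes the operator hands over the incoming records as a pandas DataFrame through a function named transform and expects a DataFrame back; please verify this against the script template in your own tenant. The columns COUNTRY, AMOUNT and AMOUNT_CLASS are hypothetical, and the imports and sample rows are only there so the sketch can be run outside SAP Datasphere.

```python
import numpy as np
import pandas as pd


def transform(data):
    """Receive the incoming rows as a pandas DataFrame and return a DataFrame."""
    # Work on a copy so the input frame is left untouched.
    data = data.copy()
    # Clean a text column: trim whitespace and upper-case it (hypothetical column).
    data["COUNTRY"] = data["COUNTRY"].str.strip().str.upper()
    # Derive a flag with NumPy: "HIGH" when the amount exceeds 1000, otherwise "LOW".
    data["AMOUNT_CLASS"] = np.where(data["AMOUNT"] > 1000, "HIGH", "LOW")
    return data


# Local check with sample rows (assumed data, not taken from any real table).
sample = pd.DataFrame({"COUNTRY": [" de", "in ", "US"], "AMOUNT": [500, 1500, 1000]})
output = transform(sample)
print(output)
```

Any column the script adds, such as AMOUNT_CLASS here, would presumably also have to be declared as an output column of the operator so that it can be mapped to the target table.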
The Python script operator's editor is not as mature as a Jupyter notebook and is not very user-friendly. If the script contains a syntax error, it is a little tricky to locate, because the exact error only shows up in the data integration monitor after we execute the data flow. I hope SAP can integrate the Jupyter notebook IDE into SAP Datasphere, with support for many more libraries, so that the data science and AI capabilities can be explored further.
I will cover error handling and the running/scheduling of data flows in a separate post on the data integration monitor.
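One possible workaround is to check the script locally before pasting it into the operator. The sketch below uses Python's built-in compile() to catch syntax errors and then runs the function once on a few sample rows. The script text, the helper check_script, and the column names are all hypothetical, and the sketch assumes the script defines a function named transform that takes and returns a pandas DataFrame.

```python
import pandas as pd

# Hypothetical script text, as it would be pasted into the Python script operator.
script_source = '''
def transform(data):
    data["AMOUNT_NET"] = data["AMOUNT"] * 0.9
    return data
'''


def check_script(source, sample):
    """Compile the script, run transform on sample rows, and return the result."""
    # compile() raises SyntaxError with a line number if the script is malformed.
    code = compile(source, "<dataflow_script>", "exec")
    namespace = {"pd": pd}
    exec(code, namespace)
    return namespace["transform"](sample.copy())


sample = pd.DataFrame({"AMOUNT": [100.0, 200.0]})
checked = check_script(script_source, sample)
print(checked)

# A deliberately broken script (missing colon) is reported before any deployment.
try:
    compile("def transform(data)\n    return data", "<dataflow_script>", "exec")
    syntax_ok = True
except SyntaxError as err:
    syntax_ok = False
    print(f"Syntax error on line {err.lineno}: {err.msg}")
```

This only catches syntax errors and obvious runtime errors on the sample rows; problems that depend on the real data or on the operator's column mapping would still only surface in the data integration monitor.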
Target table and options.
Now, deploy and run the data flow. This is how we create a data flow in SAP Datasphere. We can also schedule the data flow directly, or add it to a task chain and schedule the task chain together with other flows.
Points to note and limitations of data flows:
When we create a new remote table, it is created virtually using the data federation mechanism. Original data stays in the remote source system, and SAP Datasphere just points to that table.
We cannot use the remote table directly inside a dataflow. We need to create a view on top of that remote table and then consume the view inside a dataflow.
Kindly test this scenario from your side as well. Thank you for reading this blog post, and I hope you liked the content.
Watch out for the next set of topics in the coming days.
Thanks,
Narasingha