jens_rannacher
Product and Topic Expert
SAP Data Intelligence 3.0, the next evolution of SAP Data Hub, provides new data transformation capabilities in the Modeler, which I will describe in this blog post.

There is a new set of pipeline operators that can be used to perform data transformations with an experience similar to SAP HANA Flowgraphs. The transformations include, for example, projection, filter, column operation and join on structured files such as CSV, Parquet and ORC files as well as on database tables and views. The new operators are all listed under the Structured Data Operators section in the Modeler:



The structured data operators use the Flowagent subengine for execution. In terms of look and feel, they are very similar to the existing operators in the Connectivity (via Flowagent) section:



The main difference between the two operator categories is that only the new Structured Data Operators can be used together with the new Data Transform operator, which is listed in the same category.

We plan to consolidate both operator categories in upcoming releases, and with that we will also extend the source/target options within the Data Transform to everything that is available through Flowagent.

For those of you who are wondering what the Flowagent subengine is: Flowagent is an engine that runs a lightweight, containerized version of SAP Data Services with each pipeline that contains Flowagent operators. It leverages SAP Data Services adapters to provide the same capabilities on structured data through the Flowagent operators in data pipelines.

So much for the background, let's now start using these new structured data operators.

Example Scenario:


In the following tutorial, we will combine product review data stored in Amazon S3 with enterprise data stored in SAP HANA using the new Data Transform operator. The scenario is depicted at a high level in the following diagram:



 

Product Review Data in Amazon S3:


In Amazon S3, we have a CSV file, product_reviews_2017-2019.csv, that contains product review data:



You can download the file from the attachments and upload it to any supported storage of your choice. Please refer to the documentation of the Structured File Consumer operator to find out which storage types are supported.
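If you prefer to script the upload instead of using your storage provider's console, a minimal sketch with the AWS SDK for Python could look as follows. The bucket name and object key are placeholders for your own setup, and your AWS credentials are assumed to be configured already:

```python
# Minimal sketch: upload the downloaded review CSV to S3 with boto3.
# Bucket name and object key are placeholders -- adjust them to your setup.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="product_reviews_2017-2019.csv",       # local path of the downloaded file
    Bucket="my-demo-bucket",                         # placeholder bucket name
    Key="reviews/product_reviews_2017-2019.csv",     # placeholder object key
)
```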

Product Master Data in SAP HANA:


In addition, we have connected an SAP HANA instance that stores product master data in a table called PRODUCT. This table includes product details such as the product name, the product category and the product price as it comes from an ERP system. Hence, we treat this data as enterprise data.



You can download the file HANA_export.tar.gz and import it into your SAP HANA Cloud instance (if available) to try out the same tutorial yourself. If you would like to use a different database type, please refer to the documentation of the Table Consumer operator to find out which databases are supported.

Read Product Reviews from Amazon S3


Now let's start using this data in a data pipeline. We first create a new Graph in the Modeler application and add a Structured File Consumer to read the product reviews from Amazon S3:


Open the Configuration of the Structured File Consumer operator and point it to the connected storage location where the file is stored.






Then choose the Connection ID (S3TEST in this case):



Open the Source File Browser and select the file product_reviews_2017-2019.csv:



At this point, the operator extracts the metadata from the CSV file without running the pipeline. This metadata is required later once we add and connect the Data Transform operator.
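If you want to double-check what the operator will pick up, you can also inspect the CSV locally, for example with pandas. This is only a sketch for verification and not part of the pipeline:

```python
# Sketch for local verification only -- not part of the Data Intelligence pipeline.
# Shows the column names and inferred types that correspond to the metadata the
# Structured File Consumer extracts from the CSV.
import pandas as pd

reviews = pd.read_csv("product_reviews_2017-2019.csv")
print(reviews.dtypes)   # column names and inferred data types
print(reviews.head())   # first few product review records
```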

Read Product Master Data from SAP HANA


Now, we will read the PRODUCT table from SAP HANA. For this, we add a Table Consumer operator to the same graph:



Open the Configuration of the Table Consumer operator and choose the Database type (HANA in this case):



Then choose the Connection ID which is HANA_CLOUD in my example:



Open the Table Browser and select the PRODUCT table:



In my example, I have connected an SAP HANA Cloud instance that stores the required product master data. If you have access to SAP HANA Cloud, you can import the file HANA_export.tar.gz via the SAP HANA Database Explorer as described below and then use it for this exercise.

Import Product Master Data into SAP HANA Cloud


Open the SAP HANA Database Explorer, right-click on the database icon and then click on Import Catalog Objects:



Browse for the HANA_export.tar.gz file on your local computer and then click on Import:



This will import the PRODUCT table into the database schema DEMO.
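To verify the import, you can run a quick count against the new table, either in the SQL console of the SAP HANA Database Explorer or, as sketched below, with the SAP HANA client for Python. Host, user and password are placeholders for your own SAP HANA Cloud instance:

```python
# Sketch: verify that the PRODUCT table was imported into schema DEMO.
# Requires the SAP HANA client for Python (pip install hdbcli); connection
# details are placeholders for your own SAP HANA Cloud instance.
from hdbcli import dbapi

conn = dbapi.connect(
    address="<your-instance>.hanacloud.ondemand.com",  # placeholder host
    port=443,
    user="<user>",
    password="<password>",
    encrypt=True,                                      # SAP HANA Cloud requires TLS
)
cursor = conn.cursor()
cursor.execute('SELECT COUNT(*) FROM "DEMO"."PRODUCT"')
print(cursor.fetchone()[0])   # number of imported product master records
conn.close()
```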

Join Product Reviews with Product Master Data using Data Transform


Now that we have both datasets available, let's continue by adding a Data Transform operator to the same graph (please make sure to choose the Data Transform from the Structured Data Operators category):



Drag and drop the output ports of the Structured File Consumer and the Table Consumer onto the Data Transform operator:



Double-click on the Data Transform operator to open the graphical transformation editor. As you have connected two source operators, you should now see two unconnected input nodes input1 and input2 in the editor:



Add a Join data operation to the editor and connect it with the input1 and input2 nodes:



Double-click on the Join node to define the join conditions. Drag and drop the column PRODUCT_ID from one table onto the column PRODUCTID of the other table and set the Join type to Left Outer. Please also make sure that Join_Input1 is the left table:



Remark: To reach maximum performance, you need to choose suitable settings for the Cache and Rank properties depending on the source type and the cardinality. Please refer to the documentation of the Data Transform operator to find the right settings for your scenario. In our example, the data volumes are rather small, so we can ignore these settings for simplicity.
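If you are more used to code than to the graphical editor, the join defined above corresponds to a left outer join on PRODUCTID = PRODUCT_ID. A rough pandas sketch of the semantics is shown below; the file names are placeholders, and it assumes the review data carries PRODUCTID while the product master data carries PRODUCT_ID (swap the columns if your data is the other way around):

```python
# Rough pandas sketch of the join semantics only -- the actual execution happens
# in the Flowagent engine inside the Data Transform operator.
import pandas as pd

reviews = pd.read_csv("product_reviews_2017-2019.csv")   # input1 (left table)
products = pd.read_csv("product_master.csv")             # input2 (right table), placeholder export of PRODUCT

enriched = reviews.merge(
    products,
    how="left",              # Left Outer: keep every review, even without a matching product
    left_on="PRODUCTID",     # assumed column name in the review data
    right_on="PRODUCT_ID",   # assumed column name in the product master data
)
```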

Next, click on the tab Columns and add the following columns to the output:


Click on the "<" icon in the upper left corner to get back to the transform editor.


Add a Projection node to the graph and connect it with the output of the Join node:



Double-click on the Projection node and add all columns to the output by clicking on the icon shown below:


Click on the Filter tab and then use the SQL helper to define a filter on the column COUNTRY so that only reviews with country code DE are included in the output:



Remark: Although the underlying engine performs optimizations at runtime, it may be better to add the projection with the filter before the join to make sure that the filter is pushed down. However, this example is meant to demonstrate the functionality, not to reach maximum performance.
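To make the pushdown idea from the remark concrete, here is the same pandas sketch with the country filter applied to the reviews before the join, so the join only has to process the German reviews. Again, this is only an illustration under the same column-name assumptions as above, not how the engine optimizes the graph:

```python
# Sketch of the pushdown idea: filter first, then join.
import pandas as pd

reviews = pd.read_csv("product_reviews_2017-2019.csv")
products = pd.read_csv("product_master.csv")             # placeholder export of PRODUCT

german_reviews = reviews[reviews["COUNTRY"] == "DE"]     # filter before the join
enriched_de = german_reviews.merge(
    products, how="left", left_on="PRODUCTID", right_on="PRODUCT_ID"
)
```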

Click on the "<" icon to get back to the data transform graph.

Right-click on the output port of the Projection node and then click on Create Data Target:



This should create an output port in the transformation editor that you can later use within the first-level graph:



With that we have finished defining the transformation logic. Let's now switch back to the first level of the graph by clicking on the "<" icon and then continue with the last step.

Write the enriched data as Parquet to Amazon S3


In the last step, we will write the results back to Amazon S3 using a different file format, Parquet. For this, we add a Structured File Producer operator to the same graph and connect it with the output port of the Data Transform operator:



Open the Configuration of the Structured File Producer and choose a Storage Type (S3 in this case):



Choose an existing Connection ID (S3TEST in this example):



Provide a meaningful target file name, e.g. enriched_product_reviews.parquet and choose PARQUET as Format:



Add a Graph Terminator operator to the graph and connect it with the Structured File Producer. The Graph Terminator will stop the execution of the graph automatically once the Parquet file is created:



That's it, let's now Save the graph and click on Run.

Wait until the graph switches into the Running and finally the Completed state:


Inspect the results:


After the graph execution has completed, you can use the Metadata Explorer application to inspect the written Parquet file:



There you should see that the product reviews have been enriched with product master data such as the product name, description and price, and that only product reviews with country code 'DE' are included.
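If you prefer to check the result outside the Metadata Explorer, you can also read the Parquet file back, for example with pandas. This is a sketch; the S3 path is a placeholder, and reading directly from S3 assumes the pyarrow and s3fs packages are installed:

```python
# Sketch: read the enriched Parquet file back for a quick check.
# Requires pyarrow and s3fs; the S3 path is a placeholder for your own bucket/key.
import pandas as pd

result = pd.read_parquet("s3://my-demo-bucket/enriched_product_reviews.parquet")
print(result["COUNTRY"].unique())   # should only contain 'DE'
print(result.head())                # reviews enriched with product name, description and price
```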

That's actually it, as simple as that :-).
16 Comments
Hi Jens,

That's a great explanation and helpful indeed!!

The structuredfileproducer1 operator overwrites the target file by default if it already exists, which is understood. It also creates new files if we use placeholders as a postfix, e.g. *_<counter>.csv, which also works.

Question: How can it append the data to the same target file? The "Mode" option is missing in the structuredfileproducer1 operator. My target is SDL, not GCS, S3 or WASB. So I am wondering: since the Mode option is missing and we cannot specify anything, will this overwrite the file or append to it?

BR, Mahesh
jens_rannacher
Product and Topic Expert
Hi Mahesh,

the Mode option is only exposed for storage types that support a physical append, such as ADL or HDFS. Object stores such as S3 and SDL do not support an append operation at storage level. Supporting it at File Producer level would mean that the file first has to be downloaded, appended locally and then uploaded again, which would be very expensive and slow for larger files. We are working on providing an option to write new data into part files, which results in append-style behavior at folder level while the individual files stay immutable.

Best regards
Jens
ravicondamoor
Advisor
How can I get access to the data used in this demo?

Thanks

ravi
jens_rannacher
Product and Topic Expert
Hi Ravi,

you can download the HANA demo data from here: https://github.com/SAP-samples/datahub-integration-examples/raw/master/SampleData/HANA/ProductMaster...

Best regards
Jens
ravicondamoor
Advisor
Thanks Jens.

Can I use the new Table Consumer/Producer/Transform operators with SAP IQ?
ravicondamoor
Advisor
Can I also get the product_reviews_2017-2019.csv file, please?

 
marcus_schiffer
Active Participant
Hi,

 

many thanks for the blog, it helped a lot with the new operators. I tried it with a different data set and need to create a new column.

I see functions like case(), but I cannot find any help on how to use them. I tried some variations without success.

Could you kindly give a hint where to find the documentation with examples?

Regards

Marcus
Hi Jens,

 

Thank you for the document.

Along the same lines, I tried reading a file from Azure ADL G2 and loading the data into the same ADL G2 target, but when I try to run the graph I face the issue "invalid port type table".

Could you please let me know what I should correct in my graph?

 


Error Screenshot

 

virenp_devi
Contributor
Thank you Jens. Is there any operator for pulling data from SAP tables like transparent tables?
jens_rannacher
Product and Topic Expert
Hi Ravi,

sorry for the late reply.

Yes, you can use both the Table Consumer and the Table Producer with SAP IQ. The transform is independent of the source/target used in the consumer/producer operators.

Best regards
Jens
jens_rannacher
Product and Topic Expert
Hi Marcus,

when you click on and hover over the SQL functions, in some cases, e.g. for case(), expression examples are shown. For more information on each function, see the "SQL Functions" topic in the SAP HANA SQL and System Views Reference guide.

Does this help?

Best regards
Jens
jens_rannacher
Product and Topic Expert
Hi,

did you run the graph in debug mode and the error occurred there, or does it happen when you run the graph in normal mode? In any case, you may create an SAP incident if you still face this issue.

Best regards
Jens
jens_rannacher
Product and Topic Expert
Hi,

you can pull data from cluster tables using the SLT Connector Operator in combination with the SAP cluster table splitter.

Best regards
Jens
marcus_schiffer
Active Participant
Hi Jens,

 

thanks. Found the documentation !

 

Regards

Marcus
marcus_schiffer
Active Participant

Hi,

 

sometimes it may be necessary to parameterize a data transform. We see that variables can be used, e.g. in the filter section, like "date" > ${date_variable}. But that requires the variable to be fixed and set at the start of the graph. How can this be used dynamically in a data transform graph, e.g. by reading the ${date_variable} from a file and then filtering on it?

Is this possible ?

any help appreciated.

Regards

Marcus

 

ericranang
Explorer
Hello everyone,

I want to know three things:

1. How can I read the Parquet format using HDFS?

2. I have other HANA database servers; how do I find out the port or hostname?

3. I can't use the sapdi command. I think it's because I'm only connected to the data lake port. Could you give me more information about it?

Thanks!