SAP Data Intelligence 3.0, the next evolution of SAP Data Hub, provides new data transformation capabilities in the Modeler, and in this blog post I will describe how to use them.
There is a new set of pipeline operators that can be used to perform data transformations with an experience similar to SAP HANA Flowgraphs. The transformations include, for example, projection, filter, column operation and join on structured files in formats such as CSV, Parquet and ORC, as well as on database tables and views. The new operators are all listed under the Structured Data Operators section in the Modeler:
The structured data operators use the Flowagent subengine for execution. In look and feel, they are very similar to the existing operators in the Connectivity (via Flowagent) section:
The main difference between both operator categories is that only the new Structured Data Operators can be used together with the new Data Transform operator that is listed in the same category.
We plan to consolidate both operator categories in upcoming releases, and with that we will also extend the source/target options of the Data Transform operator to everything that is available through Flowagent.
For those of you who are wondering what the Flowagent subengine is: Flowagent is an engine that runs a lightweight, containerized version of SAP Data Services alongside each pipeline that contains Flowagent operators. It leverages SAP Data Services adapters to provide the same capabilities on structured data through the Flowagent operators in data pipelines.
So much for the background; let's now start using these new structured data operators.
Example Scenario:
In the following tutorial, we will combine semi-structured data stored in Amazon S3 with enterprise data stored in SAP HANA using the new Data Transform operator. The scenario is depicted at a high level in the following diagram:
Product Review Data in Amazon S3:
In Amazon S3, we have a CSV file product_reviews_2017-2019.csv with product reviews from an online shop. We treat this data as semi-structured because, next to columnar (structured) data, it also contains written texts (unstructured data). The following picture illustrates the structure of this file in the data preview of the Metadata Explorer application:
You can download the file product_reviews_2017-2019.csv from the attachments and upload it to any supported storage of your choice. Please refer to the documentation of the Structured File Consumer operator to find out which storage types are supported.
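If you would like a quick look at the file before building the pipeline, a few lines of pandas reproduce roughly what the Metadata Explorer data preview shows. This is just an optional local sanity check; apart from the COUNTRY and product ID columns used later in this tutorial, the exact column names of the attachment are not spelled out in this post:

```
import pandas as pd

# Optional local preview of the attached CSV file, mirroring roughly
# what the Metadata Explorer data preview displays.
reviews = pd.read_csv("product_reviews_2017-2019.csv")

print(reviews.dtypes)   # inferred column types (the file's "metadata")
print(reviews.head(5))  # first rows, similar to the data preview
```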
Product Master Data in SAP HANA:
Next to this, we have connected an SAP HANA instance that stores product master data in a table called PRODUCT. This table includes product details such as the product name, the product category and the product price as they come from an ERP system. Hence, we treat this data as enterprise data.
The following picture illustrates this table in the data preview of the Metadata Explorer application:
You can download the file HANA_export.tar.gz and import it into your SAP HANA Cloud instance (if available) to try out the same tutorial by yourself. If you would like to use a different database type, please refer to the documentation of the Table Consumer operator to find out which databases are supported.
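If you want to verify the table outside of SAP Data Intelligence, the SAP HANA Python client (hdbcli) can query it directly. A minimal sketch with placeholder connection details, assuming the table ends up in the DEMO schema as described in the import step below:

```
# Minimal check of the PRODUCT table via the SAP HANA Python client (hdbcli).
# Host, user and password are placeholders; the DEMO schema matches the
# import step described later in this post.
from hdbcli import dbapi

conn = dbapi.connect(
    address="<your-hana-cloud-host>",
    port=443,         # SAP HANA Cloud SQL endpoint
    user="<user>",
    password="<password>",
    encrypt=True,     # SAP HANA Cloud requires TLS
)

cursor = conn.cursor()
cursor.execute("SELECT TOP 5 * FROM DEMO.PRODUCT")
for row in cursor.fetchall():
    print(row)
```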
Read Product Reviews from Amazon S3
Now let's start using this data in a data pipeline. We first create a new Graph in the Modeler application and add a Structured File Consumer to read the product reviews from Amazon S3:
Open the Configuration of the Structured File Consumer operator and point it to the connected storage location where the file product_reviews_2017-2019.csv resides:
First choose the Storage Type (S3 in this case):
Then choose the Connection ID (S3TEST in this case):
Open the Source File Browser, select the file product_reviews_2017-2019.csv and click on OK:
At this point, the operator extracts the metadata from the CSV file without running the pipeline. This metadata is required later once we add and connect the Data Transform operator.
Read Product Master Data from SAP HANA
Now, we will read the PRODUCT table from SAP HANA. For this, we add a Table Consumer operator to the same graph:
Open the Configuration of the Table Consumer operator and choose the Database type (HANA in this case):
Then choose the Connection ID, which is HANA_CLOUD in my example:
Open the Table Browser, select the table PRODUCT and click on OK:
In my example, I have connected an SAP HANA Cloud instance that stores the required product master data. If you have access to SAP HANA Cloud, you can import the file HANA_export.tar.gz via the SAP HANA Database Explorer as described below and then use it for this exercise.
Import Product Master Data into SAP HANA Cloud
Open the SAP HANA Database Explorer, right-click on the database icon and then click on Import Catalog Objects:
Browse for the HANA_export.tar.gz on your local computer and then click on Import:
This will import the PRODUCT table into the database schema DEMO.
Join Product Reviews with Product Master Data using Data Transform
Now that we have both datasets available, let's continue by adding a Data Transform operator to the same graph (please make sure to choose the Data Transform from the Structured Data Operators category):
Drag and drop the output ports of the Structured File Consumer and the Table Consumer onto the Data Transform operator:
Double-click on the Data Transform operator to open the graphical transformation editor. As you have connected two source operators, you should now see two unconnected input nodes, input1 and input2, in the editor:
Add a Join data operation to the editor and connect it with the input1 and input2 nodes:
Double-click on the Join node to define the join conditions. Drag and drop the column PRODUCT_ID from one table onto the column PRODUCTID of the other table and set the Join type to Left Outer. Please also make sure that Join_Input1 is the left table:
Remark: To reach maximum performance, it is necessary to choose different settings for the Cache and Rank properties depending on the source type and the cardinality. Please refer to the documentation of the Data Transform operator to find the right settings for your scenario. In our example, the data volumes are rather small, so we can neglect these settings for simplicity.
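To make the join semantics concrete, here is the equivalent logic in pandas terms. This is only a local analogy, not how the operator is implemented; which side carries PRODUCT_ID versus PRODUCTID is an assumption here, so swap the key names if your sources are reversed:

```
import pandas as pd

# Local stand-ins for the two sources (see the earlier snippets);
# conn is the hdbcli connection from the SAP HANA sketch above.
reviews = pd.read_csv("product_reviews_2017-2019.csv")
products = pd.read_sql("SELECT * FROM DEMO.PRODUCT", conn)

# Equivalent of the Join node: the reviews form the left table, so every
# review is kept even when no matching product exists (left outer join).
enriched = reviews.merge(
    products,
    how="left",
    left_on="PRODUCT_ID",   # key column in the reviews data (assumed)
    right_on="PRODUCTID",   # key column in the PRODUCT table (assumed)
)
```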
Next, click on the Columns tab and add the following columns to the output:
Click on the "<" icon in the upper left corner to get back to the transform editor.
Add a Projection node to the graph and connect it with the output of the Join node:
Double-click on the Projection node and add all columns to the output by clicking on the highlighted icon below:
Click on the Filter tab and then use the SQL helper to define a filter on the column COUNTRY so that only reviews with country code DE are included in the output:
Remark: Although the underlying engine performs optimizations at runtime, it may be better to add the projection with the filter before the join to make sure that the filter is really pushed down. However, this example is meant to demonstrate the functionality, not to reach maximum performance.
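The pushdown remark is easy to see in DataFrame terms: filtering before the join yields the same rows as filtering afterwards, but the join then has to process far fewer records. A sketch with the same assumed column names as above:

```
# Variant A - filter after the join (what the graph above does):
result = enriched[enriched["COUNTRY"] == "DE"]

# Variant B - push the filter down by applying it before the join;
# same resulting rows, but the join processes far fewer records:
reviews_de = reviews[reviews["COUNTRY"] == "DE"]
result_pushed = reviews_de.merge(
    products, how="left", left_on="PRODUCT_ID", right_on="PRODUCTID"
)
```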
Click on the "<" icon to get back to the data transform graph.
Right-click on the output port of the Projection node and then click on Create Data Target:
This should create an output port in the transformation editor that you can later use within the first-level graph:
With that we have finished defining the transformation logic. Let's now switch back to the first level of the graph by clicking on the "<" icon and then continue with the last step.
Write the enriched data as Parquet to Amazon S3
In the last step, we will write the results back to Amazon S3 in a different file format, namely Parquet. For this, we add a Structured File Producer operator to the same graph and connect it with the output port of the Data Transform operator:
Open the Configuration of the Structured File Producer and choose a Storage Type (S3 in this case):
Choose an existing Connection ID (S3TEST in this example):
Provide a meaningful target file name, e.g. enriched_product_reviews.parquet, and choose PARQUET as Format:
Add a Graph Terminator operator to the graph and connect it with the Structured File Producer. The Graph Terminator will stop the execution of the graph automatically once the Parquet file is created:
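For comparison, the pandas equivalent of this producer step is a single to_parquet call. This sketch assumes the optional pyarrow (Parquet engine) and s3fs (S3 filesystem support) packages are installed; the bucket name is a placeholder:

```
# Write the joined and filtered result as Parquet to S3.
# Requires pyarrow and s3fs; the bucket name is a placeholder.
result.to_parquet(
    "s3://<your-bucket>/enriched_product_reviews.parquet",
    index=False,
)
```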
That's it! Let's now Save the graph and click on Run.
Wait until the graph switches into the Running and then the Completed state:
Inspect the results:
After the graph execution has completed, you can use the Metadata Explorer application to inspect the written Parquet file:
There you should see that the product reviews have been enriched with product master data such as the product name, description and price, and that only product reviews with country code 'DE' are included.
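If you want to double-check the output outside of the Metadata Explorer, reading the Parquet file back is enough. Again assuming pyarrow/s3fs and the placeholder bucket name from above:

```
import pandas as pd

# Read the Parquet file back and verify the filter and the enrichment.
out = pd.read_parquet("s3://<your-bucket>/enriched_product_reviews.parquet")

print(out["COUNTRY"].unique())  # should contain only 'DE'
print(out.columns.tolist())     # should include the product master data columns
```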
That's actually it, as simple as that :-).