Technology Blogs by SAP
Learn how to extend and personalize SAP applications. Follow the SAP technology blog for insights into SAP BTP, ABAP, SAP Analytics Cloud, SAP HANA, and more.
I501256

Uploading AWS S3 bucket data into SAP HANA Cloud, Data lake


In this blog, we will explore how to load data from an AWS S3 bucket into SAP HANA Cloud, data lake. We will load a CSV file and a Parquet file.


Prerequisites

  • You should have the EXECUTE privilege on the SYSRDL#CG.REMOTE_EXECUTE procedure.

  • You should have read access to the objects in the AWS S3 bucket.

  • The objects in the AWS S3 bucket should use CESU-8 character encoding.

  • You should know the column delimiter and the row delimiter of the data residing in the AWS S3 bucket.

  • You should also know the AWS Access Key ID and the Secret Access Key.
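If you are unsure of the column delimiter, Python's standard library can guess it from a local copy of the file. A minimal sketch (the file name is illustrative):

```python
import csv

def detect_delimiter(path, sample_size=4096):
    """Read a sample of the file and guess the column delimiter."""
    with open(path, newline="") as f:
        sample = f.read(sample_size)
    # Restrict the sniffer to the delimiters we consider plausible.
    dialect = csv.Sniffer().sniff(sample, delimiters=",;|\t")
    return dialect.delimiter
```

This only inspects a local sample; the LOAD TABLE command itself still needs the delimiter spelled out explicitly.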


 

Let’s have a look at the AWS S3 bucket. It contains the CSV files that we will load into the data lake.


AWS bucket with CSV files


Log in to the SAP HANA Cloud console and open the SQL console. Create a table in the data lake with the same structure as the data in the Amazon S3 bucket.
CALL SYSRDL#CG.REMOTE_EXECUTE ('CREATE TABLE DEMO2 
(first_name VARCHAR(10) ,
last_name VARCHAR(13) ,
city VARCHAR(19) ,
state VARCHAR(2) ,
phone1 VARCHAR(12) )');

Note that all commands inside REMOTE_EXECUTE have to be enclosed in single quotes.

Next, grant the SELECT privilege on the table to the data lake container role. In an SQL console connected to an SAP HANA Cloud database, execute:
CALL SYSRDL#CG.REMOTE_EXECUTE ('GRANT SELECT ON DEMO2 TO SYSRDL#CG_ROLE');

Now load the table using the LOAD TABLE command as shown below:
CALL SYSRDL#CG.REMOTE_EXECUTE ('LOAD TABLE DEMO2 
(first_name, last_name, city, state, phone1 ''\x0a'')
USING FILE ''s3://hscbucket/demo2.txt''
DELIMITED BY '',''
ACCESS_KEY_ID ''SAMPLEACCESSKEY''
SECRET_ACCESS_KEY ''samplesecretaccesskey''
REGION ''ap-southeast-1''
ESCAPES OFF
');

In the above example, the field delimiter is a comma (,) and the row delimiter is a newline, specified as ''\x0a'' after the last column.
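For reference, a small file matching this layout (comma-separated columns, newline-terminated rows) can be generated locally like this (a sketch; the row values are made up):

```python
# Write a sample file in the layout LOAD TABLE expects:
# columns separated by commas, rows terminated by \x0a (newline).
rows = [
    ("John", "Doe", "Singapore", "SG", "555-010-1234"),
    ("Jane", "Roe", "Jurong", "SG", "555-010-5678"),
]
with open("demo2.txt", "w", newline="") as f:
    for row in rows:
        f.write(",".join(row) + "\x0a")
```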

We will need to specify the location of the file in S3, the AWS region, the Access Key ID, and the Secret Access Key in the load command.

Note that each single quotation mark inside the LOAD statement has to be escaped by doubling it with another single quote.
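Because of this doubling rule, it can be convenient to build the outer call programmatically. A minimal sketch (the helper function is ours, not part of any SAP client library):

```python
def remote_execute_sql(inner_sql: str) -> str:
    """Wrap a data lake statement in SYSRDL#CG.REMOTE_EXECUTE,
    doubling every single quote inside the inner statement."""
    escaped = inner_sql.replace("'", "''")
    return "CALL SYSRDL#CG.REMOTE_EXECUTE ('{}');".format(escaped)
```

For example, passing `DELIMITED BY ','` produces `DELIMITED BY '',''` in the wrapped statement, matching the escaping shown above.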

Here is the screenshot of commands in the SQL editor.


SQL commands in the editor


 

The LOAD TABLE command can load gzip-compressed files as well. The file in S3 must have the .gz extension for it to be treated as compressed.
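For example, a local CSV can be compressed before uploading (the sample data and bucket path are illustrative; the upload step assumes a configured AWS CLI):

```shell
# Create a small sample file (illustrative data), then compress it.
printf 'John,Doe,Singapore,SG,555-010-1234\n' > demo2.txt
# -k keeps the original; the .gz extension is what LOAD TABLE recognizes.
gzip -kf demo2.txt
# Upload to the bucket (uncomment if the AWS CLI is configured):
# aws s3 cp demo2.txt.gz s3://hscbucket/demo2.txt.gz
```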

Now we will navigate to Remote Sources and choose SYSRDL#CG_SOURCE, which is our data lake. We will choose the schema SYSRDL#CG. There we can find the table DEMO2.


Searching the data lake schema


Now we will check the box next to the table and click Create Virtual Object to create a virtual table in SAP HANA Cloud. The below dialog box will open.

 


Creating Virtual Table in SAP HANA Cloud


 

Once the table is created, we navigate to Tables and can see DEMO2 under the tables.


Virtual table DEMO2 in DBADMIN schema


 

We can right-click the table and preview the data.


Preview of data in the table


 

Now we will load a Parquet file from the S3 bucket. Here is a screenshot of the file userdata1.parquet residing in my S3 bucket.

 


Parquet file in the S3 bucket


 

First, let’s create a table in the data lake with the same schema as the Parquet file.
CALL SYSRDL#CG.REMOTE_EXECUTE ('CREATE TABLE PARQUET_TAB 
(C1 datetime ,
C2 int,
C3 VARCHAR(50),
C4 VARCHAR(50),
C5 VARCHAR(50),
C6 VARCHAR(50),
C7 VARCHAR(50),
C8 VARCHAR(50),
C9 VARCHAR(50),
C10 VARCHAR(50),
C11 double,
C12 VARCHAR(50),
C13 VARCHAR(50))
');

 

Next, grant the SELECT privilege on the table to the data lake container role. In an SQL console connected to an SAP HANA Cloud database, execute:
CALL SYSRDL#CG.REMOTE_EXECUTE ('GRANT SELECT ON PARQUET_TAB TO SYSRDL#CG_ROLE');

Next, we will load the file using the LOAD TABLE command below. Note that we need to specify FORMAT parquet along with the location of the file in S3. The AWS region, Access Key ID, and Secret Access Key also need to be specified in the load command.
CALL SYSRDL#CG.REMOTE_EXECUTE ( 'LOAD TABLE PARQUET_TAB (C1, C2 ,C3,C4,C5,C6,C7,C8,C9,C10,C11,C12,C13)  
USING FILE ''s3://hscbucket/userdata1.parquet''
FORMAT parquet
ACCESS_KEY_ID ''SAMPLEACCESSKEYID''
SECRET_ACCESS_KEY ''Secret/Access+Key''
REGION ''ap-southeast-1''
ESCAPES OFF
');

Here is a screenshot of the commands executed in the SQL console.


SQL console screenshot


Navigate to Remote Sources in the Database Explorer and choose SYSRDL#CG_SOURCE. Then choose SYSRDL#CG as the schema, enter the name of the table in the Object field, and click Search.


Data lake schema search


 

The newly created table will appear. Now select the checkbox and click Create Virtual Objects. The below dialog box will appear.

 


Creating virtual table in SAP HANA Cloud


 

Click Create and the virtual table will be created in the DBADMIN schema. You can navigate to the Tables section and select the table to see its schema.


PARQUET_TAB table schema


Now we can right-click and display the data that was loaded from the Parquet file.


Data preview of PARQUET_TAB table


 

Conclusion


SAP HANA Cloud, data lake can easily ingest data from S3 using the LOAD TABLE command. The command is versatile enough to load both CSV and Parquet files. It can also load compressed files and load multiple tables in parallel.
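The parallel loading mentioned above can be driven from a client script. A rough sketch (run_load is a stand-in for however you execute SQL against SAP HANA Cloud, for example via a database driver; the table/file pairs are illustrative and credentials are omitted):

```python
from concurrent.futures import ThreadPoolExecutor

# Table -> S3 object pairs to load (illustrative names from this blog).
LOADS = {
    "DEMO2": "s3://hscbucket/demo2.txt",
    "PARQUET_TAB": "s3://hscbucket/userdata1.parquet",
}

def run_load(table, s3_path):
    """Build the inner LOAD TABLE statement for one table.
    A real script would wrap this in REMOTE_EXECUTE (doubling the
    single quotes) and execute it over a HANA connection."""
    return "LOAD TABLE {} USING FILE '{}' ESCAPES OFF".format(table, s3_path)

# Each load runs in its own worker thread, so the tables load in parallel.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda kv: run_load(*kv), LOADS.items()))
```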