Uploading AWS S3 bucket data into SAP HANA Cloud, Data Lake
In this blog, we will explore how to load data from an AWS S3 bucket into the SAP HANA Cloud, Data Lake. We will load a csv file and a parquet file.
Prerequisites
- You should have the EXECUTE privilege on the SYSRDL#CG.REMOTE_EXECUTE procedure (see the example after this list).
- You should have read access rights to the objects in the AWS S3 bucket.
- The objects in the AWS S3 bucket should use CESU-8 character encoding.
- You should know the column delimiter and the row delimiter of the data residing in the AWS S3 bucket.
- You should also know the AWS Access Key ID and the Secret Access Key.
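For reference, a database administrator can grant the EXECUTE privilege with a statement like the one below. This is a minimal sketch; MY_USER is a placeholder for your actual database user.
GRANT EXECUTE ON SYSRDL#CG.REMOTE_EXECUTE TO MY_USER;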
Let’s have a look at the AWS bucket. It contains the csv files that we will upload to the data lake.
AWS bucket with csv files
Log into the SAP HANA Cloud console and open the SQL console. Create a table in the data lake with the same structure as the data in the Amazon S3 bucket.
CALL SYSRDL#CG.REMOTE_EXECUTE ('CREATE TABLE DEMO2
(first_name VARCHAR(10),
last_name VARCHAR(13),
city VARCHAR(19),
state VARCHAR(2),
phone1 VARCHAR(12))');
Note that all commands inside REMOTE_EXECUTE have to be enclosed in single quotes.
Next grant the SELECT privilege on the table to the data lake container role. In an SQL console connected to an SAP HANA Cloud database, execute:
CALL SYSRDL#CG.REMOTE_EXECUTE ('GRANT SELECT ON DEMO2 TO SYSRDL#CG_ROLE');
Now load the table using the LOAD TABLE command shown below:
CALL SYSRDL#CG.REMOTE_EXECUTE ( 'LOAD TABLE DEMO2
(first_name, last_name, city, state, phone1 ''\x0a'')
USING FILE ''s3://hscbucket/demo2.txt''
DELIMITED BY '',''
ACCESS_KEY_ID ''SAMPLEACCESSKEY''
SECRET_ACCESS_KEY ''samplesecretaccesskey''
REGION ''ap-southeast-1''
ESCAPES OFF
');
In the above example, the field delimiter is a comma (,) and the row delimiter is a newline (\x0a), which is specified after the last column in the column list.
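For illustration, the contents of demo2.txt would look something like the lines below (the values are made-up placeholders matching the table structure):
John,Doe,Chicago,IL,312-555-0101
Jane,Smith,Dallas,TX,214-555-0102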
We need to specify the location of the file in S3, the AWS region, the Access Key ID, and the Secret Access Key in the load command.
Note that each single quotation mark inside the LOAD TABLE command has to be escaped by another single quote; for example, the inner clause DELIMITED BY ',' is written as DELIMITED BY '','' within the outer string passed to REMOTE_EXECUTE.
Here is the screenshot of commands in the SQL editor.
SQL commands in the editor
The LOAD TABLE command can also load gzip-compressed files. The file in S3 must have the .gz extension for the compressed file to be loaded.
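For example, loading a gzip-compressed copy of the same file could look like the statement below. This is a sketch under the assumption that the compressed file is named demo2.txt.gz in the same bucket; the access keys and region are placeholders.
CALL SYSRDL#CG.REMOTE_EXECUTE ( 'LOAD TABLE DEMO2
(first_name, last_name, city, state, phone1 ''\x0a'')
USING FILE ''s3://hscbucket/demo2.txt.gz''
DELIMITED BY '',''
ACCESS_KEY_ID ''SAMPLEACCESSKEY''
SECRET_ACCESS_KEY ''samplesecretaccesskey''
REGION ''ap-southeast-1''
ESCAPES OFF
');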
Now we navigate to Remote Sources and choose SYSRDL#CG_SOURCE, which is our data lake. We then choose the schema SYSRDL#CG, where we can find the table DEMO2.
Searching the data lake schema
Now we check the box and click Create Virtual Object to create a virtual table in SAP HANA Cloud. The dialog box below will open.
Creating Virtual Table in SAP HANA Cloud
Once the table is created, we navigate to Tables and can see DEMO2 under the tables.
Virtual table DEMO2 in DBADMIN schema
We can right-click the table and see the data.
Preview of data in the table
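To verify the load from the SQL console, we can also query the virtual table directly. A minimal sketch, assuming the virtual table was created as DEMO2 in the DBADMIN schema as shown above:
SELECT * FROM DBADMIN.DEMO2 LIMIT 10;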
Now we will load a parquet file from the S3 bucket. Here is a screenshot of the parquet file userdata1.parquet residing in my S3 bucket.
parquet file in the S3 bucket
First, let’s create a table in the data lake with the same schema as the parquet file.
CALL SYSRDL#CG.REMOTE_EXECUTE ('CREATE TABLE PARQUET_TAB
(C1 DATETIME,
C2 INT,
C3 VARCHAR(50),
C4 VARCHAR(50),
C5 VARCHAR(50),
C6 VARCHAR(50),
C7 VARCHAR(50),
C8 VARCHAR(50),
C9 VARCHAR(50),
C10 VARCHAR(50),
C11 DOUBLE,
C12 VARCHAR(50),
C13 VARCHAR(50))
');
Next grant the SELECT privilege on the table to the data lake container role. In an SQL console connected to an SAP HANA Cloud database, execute:
CALL SYSRDL#CG.REMOTE_EXECUTE ('GRANT SELECT ON PARQUET_TAB TO SYSRDL#CG_ROLE');
Next we will load the file using the LOAD TABLE command below. Note that we need to specify the FORMAT as parquet along with the location of the file in S3. The AWS region, Access Key ID, and Secret Access Key also need to be specified in the load command.
CALL SYSRDL#CG.REMOTE_EXECUTE ( 'LOAD TABLE PARQUET_TAB (C1, C2, C3, C4, C5, C6, C7, C8, C9, C10, C11, C12, C13)
USING FILE ''s3://hscbucket/userdata1.parquet''
FORMAT parquet
ACCESS_KEY_ID ''SAMPLEACCESSKEYID''
SECRET_ACCESS_KEY ''Secret/Access+Key''
REGION ''ap-southeast-1''
ESCAPES OFF
');
Here is a screenshot of the commands executed in the SQL console.
SQL console screenshot
Navigate to Remote Sources in the Database Explorer and choose SYSRDL#CG_SOURCE. Then choose SYSRDL#CG as the schema, enter the name of the table in the Object field, and click Search.
Data lake schema search
The newly created table will appear. Now click the checkbox and then click Create Virtual Objects. The dialog box below will appear.
Creating virtual table in SAP HANA Cloud
Choose Create, and a virtual table will be created in the DBADMIN schema. You can navigate to the Tables section and select the table to see its schema.
PARQUET_TAB table schema
Now we can right-click the table and display the data that was loaded from the parquet file.
Data preview of PARQUET_TAB table
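As before, a quick query against the virtual table confirms the rows loaded from the parquet file. A minimal sketch, assuming the virtual table was created as PARQUET_TAB in the DBADMIN schema:
SELECT COUNT(*) FROM DBADMIN.PARQUET_TAB;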
Conclusion
The SAP HANA Cloud, Data Lake can easily ingest data from S3 using the LOAD TABLE command. The command is versatile enough to load both csv and parquet files. It can also load compressed files and load multiple tables in parallel.