SeungjoonLee
Product and Topic Expert
In today's data-driven world, enterprises collect more data than ever before. Unlocking the insights hidden within this data, while seamlessly connecting the dots between SAP and non-SAP applications, is not only essential for adapting to an ever-changing business environment but also crucial for differentiating your business from competitors.

Imagine having a treasure trove of business data stored in common formats like CSV, Parquet, and Delta table (Open Table Format), each holding a vast volume of valuable information. Starting with SAP HANA Cloud QRC 03/2024, I'm thrilled to announce that SAP HANA Database offers a native SQL on Files capability. It unlocks these chests with the same powerful analytical query processing that SAP HANA Database has always provided, making your structured data in files instantly accessible while leveraging all the existing capabilities of SAP HANA Cloud.

This new native SQL on Files capability allows you to directly access and read data in SAP HANA Cloud, data lake Files, specifically in CSV, Parquet, and Delta table (Open Table Format), without moving or loading it into SAP HANA Database. It also enables better control of the cost-performance balance for the workload, with increased flexibility. In a nutshell, this new capability includes the ability to:

  • Link your SAP HANA Cloud, SAP HANA Database to SAP HANA Cloud, data lake Files with one click in SAP HANA Cloud Central (only available in multi-environment edition).
  • Access data in files stored in SAP HANA Cloud, data lake Files on-the-fly by creating read-only virtual tables that point to the files or directory, and then executing SQL queries.
  • Benefit from support for file formats such as CSV, Parquet, and Delta table (Delta table is an Open Table Format table based on Parquet files). Since Delta table provides table semantics, it is the preferred choice for query processing.
  • Execute time travel queries on Delta table by specifying the version or timestamp in the SELECT statement.

Overview.png


Prerequisites and considerations

Please ensure that all the following prerequisites are met before following the steps in this blog.

When uploading files, if both Parquet and Delta Lake formats are used, it is highly recommended to keep them in separate directories, because both formats use the same file extension (.parquet); Delta Lake is distinguished only by an additional _delta_log folder for the transaction log.


Getting started with SAP HANA Database Native SQL on Files in SAP HANA Cloud

If the above conditions are met and the files are ready in data lake Files, let's configure the connection between the SAP HANA Database and the data lake Files with a simple one-click process in SAP HANA Cloud Central.

Link your SAP HANA Cloud, SAP HANA Database to SAP HANA Cloud, data lake Files:

If you have provisioned an SAP HANA Database and data lake Files in SAP HANA Cloud, you can see both in SAP HANA Cloud Central. To use SQL on Files, a secure connection between the SAP HANA Database and data lake Files must be created. If you click on your SAP HANA Database in SAP HANA Cloud Central, you will now see a new tab named Linked Data Lake Files, which lists all the data lake Files instances in the same subaccount, as shown below. Please note that this feature is only available in the multi-environment edition of SAP HANA Cloud Central. To use the multi-environment edition, please follow the instructions in Subscribing to the SAP HANA Cloud Administration Tools.

Link.png

On this screen, all you need to do is click Link, and the process handles everything, including certificates, trust, authorization, and remote source creation.

With this one click, as you can see in the example below, my demo-hdb is linked to demo-hdlf via an auto-generated remote source called demo-hdlf_rs.

Linked.png
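
You can also verify the result on the SQL side; for example, the auto-generated remote source shows up in the REMOTE_SOURCES system view:

-- list remote sources; the auto-generated one (here, demo-hdlf_rs) should appear
SELECT REMOTE_SOURCE_NAME, ADAPTER_NAME FROM SYS.REMOTE_SOURCES;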

You can also unlink and delete the auto-generated remote source by simply clicking Unlink. However, this will drop the remote source and all objects that depend on it, such as virtual tables, in the SAP HANA Database. Even though this action does not change anything in your data lake Files, proceed carefully, as it cannot be undone. Also, if you linked data lake Files via SAP HANA Cloud Central, please use Unlink rather than dropping the auto-generated remote source manually.

Please also refer to the links below for further details.

Alternatively, the one-click process described above can also be performed through SQL statements and manual configuration. However, since the one-click approach in SAP HANA Cloud Central is the recommended method, I will not cover the manual alternative in this blog. If you're interested, please refer to the links below.

Create virtual tables:

Once you link the SAP HANA Database to data lake Files, virtual tables can be created by pointing to files stored in the data lake Files. All examples in this blog are based on TLC Trip Record Data.

Let's start with a simple CSV file. To create a virtual table by pointing to a CSV file, you can use the CREATE VIRTUAL TABLE statement. In this case, column names and data types should be defined as in the example below.

-- create a virtual table by pointing to a CSV file stored in data lake Files
CREATE VIRTUAL TABLE DEMO.TAXI_ZONE (
    "LocationID" INT PRIMARY KEY,
    "Borough" NVARCHAR(20),
    "Zone" NVARCHAR(50),
    "service_zone" NVARCHAR(20)
    ) AT "demo-hdlf_rs"."/csv/" AS CSV;

In this case, you should make sure that the CSV file does not have column headers, and the defined data types should be large enough to hold the data. Otherwise, you will see a "too large value for column" error when executing a query on the virtual table, even though the virtual table itself is created successfully.
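
Once created, the virtual table can be queried like any other table in the SAP HANA Database; for example:

-- query the virtual table; the data is read from data lake Files on-the-fly
SELECT * FROM DEMO.TAXI_ZONE WHERE "Borough" = 'Manhattan' ORDER BY "LocationID";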

Please also find the example result in the screenshot below.

CSV Result1.png

In SQL on Files, you can also specify the file name (e.g., /csv/taxi_zone_lookup.csv) instead of just the directory (/csv/). Using the directory allows for the automatic detection of any additionally uploaded files, while using a specific file name does not.
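
For illustration, the same virtual table pointing to that single file (with a hypothetical table name) would look like this:

-- create a virtual table by pointing to a single CSV file instead of a directory
CREATE VIRTUAL TABLE DEMO.TAXI_ZONE_SINGLE_FILE (
    "LocationID" INT PRIMARY KEY,
    "Borough" NVARCHAR(20),
    "Zone" NVARCHAR(50),
    "service_zone" NVARCHAR(20)
    ) AT "demo-hdlf_rs"."/csv/taxi_zone_lookup.csv" AS CSV;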

Now, let’s exclude the second column from the virtual table. Please see the below example.

-- create a virtual table by pointing to a CSV file without second column
CREATE VIRTUAL TABLE DEMO.TAXI_ZONE_NO_BOROUGH (
    "LocationID" INT PRIMARY KEY,
    "Zone" NVARCHAR(50),
    "service_zone" NVARCHAR(20)
    ) AT "demo-hdlf_rs"."/csv/" AS CSV FILE COLUMN LIST(1, 3, 4);

Please also find the example result in the screenshot below.

CSV Result2.png

Here, as you can see, the second column Borough is excluded by using FILE COLUMN LIST(1, 3, 4), which means that only the first, third, and fourth columns are referred to when creating a virtual table.

Alright, now let's see a Parquet example. For Parquet and Delta table, you can use the GET_REMOTE_SOURCE_FILE_COLUMNS built-in procedure to retrieve column information. Please note that this built-in procedure cannot be used for CSV files.

-- retrieve column information (parquet, delta table only)
CALL GET_REMOTE_SOURCE_FILE_COLUMNS ('demo-hdlf_rs', '/parquet/', 'PARQUET');

Please also find the example result in the screenshot below.

Procedure.png

As you can see, you can find the recommended data types for each column in the SQL_DATA_TYPE column. For more information, please refer to the link below.

With this information, I can easily write a CREATE VIRTUAL TABLE statement like the one below.

-- create a virtual table by pointing to Parquet files
CREATE VIRTUAL TABLE DEMO.YELLOW_TRIP (
    "VendorID" INT,
    "tpep_pickup_datetime" TIMESTAMP,
    "tpep_dropoff_datetime" TIMESTAMP,
    "passenger_count" BIGINT,
    "trip_distance" DOUBLE,
    "RatecodeID" BIGINT,
    "store_and_fwd_flag" NVARCHAR(1),
    "PULocationID" INT,
    "DOLocationID" INT,
    "payment_type" BIGINT,
    "fare_amount" DOUBLE,
    "extra" DOUBLE,
    "mta_tax" DOUBLE,
    "tip_amount" DOUBLE,
    "tolls_amount" DOUBLE,
    "improvement_surcharge" DOUBLE,
    "total_amount" DOUBLE,
    "congestion_surcharge" DOUBLE,
    "Airport_fee" DOUBLE
    ) AT "demo-hdlf_rs"."/parquet/" AS PARQUET;

In this example, since I know the store_and_fwd_flag column only has Y or N, I set this as NVARCHAR(1).
Please also find the example result in the screenshot below.

Parquet Result.png
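
Since virtual tables behave like regular tables in queries, you can also join them across file formats. Here is a minimal sketch, assuming the TAXI_ZONE and YELLOW_TRIP virtual tables created above:

-- join the Parquet-based trip data with the CSV-based zone lookup
SELECT z."Borough", COUNT(*) AS "trips", AVG(t."total_amount") AS "avg_total"
FROM DEMO.YELLOW_TRIP AS t
JOIN DEMO.TAXI_ZONE AS z ON z."LocationID" = t."PULocationID"
GROUP BY z."Borough"
ORDER BY "trips" DESC;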

If needed, FILE COLUMN LIST can also be used for Parquet, as shown in the example below.

-- create a virtual table by pointing to Parquet files with 3 columns
-- (a distinct table name is used because DEMO.YELLOW_TRIP already exists)
CREATE VIRTUAL TABLE DEMO.YELLOW_TRIP_3COL (
    "VendorID" INT,
    "congestion_surcharge" DOUBLE,
    "Airport_fee" DOUBLE
    ) AT "demo-hdlf_rs"."/parquet/" AS PARQUET FILE COLUMN LIST(1, 18, 19);

For CSV and Parquet, directory-based partitioning is supported for better performance by pruning unnecessary files during query execution. For more information, please refer to the link below.
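
As a rough illustration (hypothetical paths, assuming the common column=value directory convention; see the linked documentation for the exact requirements), a partitioned layout could look like the one below, allowing files outside the matching directories to be skipped during query execution:

-- example directory layout for directory-based partitioning (hypothetical)
-- /parquet/year=2024/month=01/yellow_tripdata_2024-01.parquet
-- /parquet/year=2024/month=02/yellow_tripdata_2024-02.parquet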

Last but not least, let's see a Delta table example. For this example, the Parquet files shown above are converted into a Delta table. Also, the timestampNtz data types are converted to timestamp, because Reader Version 1 is currently the only version supported in SQL on Files.

To read a Delta table, the entire Delta table directory, including its transaction log, must be uploaded. Please also refer to Converting from Parquet to Delta Lake. Unlike CSV and Parquet, a virtual table can be created without defining the columns and corresponding data types, as shown in the example below.

-- create a virtual table by pointing to a delta table
CREATE VIRTUAL TABLE DEMO.YELLOW_TRIP_DELTA AT "demo-hdlf_rs"."/delta/" AS DELTA;

If necessary, you can also define data types for each column; however, unlike CSV or Parquet, the column names must match the Delta table column names. Additionally, FILE COLUMN LIST and directory-based partitioning are not supported with a Delta table.

For a Delta table, time travel queries are supported, as shown in the examples below.

-- the latest version
SELECT COUNT(*) FROM DEMO.YELLOW_TRIP_DELTA;

-- version 1
SELECT COUNT(*) FROM DEMO.YELLOW_TRIP_DELTA FOR VERSION AS OF '1';

In addition to the version (FOR VERSION AS OF), a timestamp (FOR SYSTEM_TIME AS OF) can also be used. For more information, please refer to the link below.
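
For example, a timestamp-based time travel query might look like this (the timestamp value is hypothetical):

-- time travel by timestamp instead of version
SELECT COUNT(*) FROM DEMO.YELLOW_TRIP_DELTA FOR SYSTEM_TIME AS OF '2024-10-01 00:00:00';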

Additionally, you can check the latest version of the Delta table with the DELTA_LAKE_TABLE_VERSION built-in procedure and retrieve the change data feed between given versions of a Delta table with the DELTA_LAKE_TABLE_CHANGES built-in procedure. However, to use the DELTA_LAKE_TABLE_CHANGES built-in procedure, the enableChangeDataFeed table property must be set for the Delta table. Please also refer to the links below.
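
As a sketch only (the argument order below mirrors GET_REMOTE_SOURCE_FILE_COLUMNS and is an assumption; please check the linked documentation for the exact signatures):

-- check the latest version of the Delta table (assumed argument order)
CALL DELTA_LAKE_TABLE_VERSION('demo-hdlf_rs', '/delta/');

-- retrieve the change data feed between versions 0 and 1 (assumed argument order;
-- requires the enableChangeDataFeed table property on the Delta table)
CALL DELTA_LAKE_TABLE_CHANGES('demo-hdlf_rs', '/delta/', 0, 1);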


Recommendations and supported data types

Now that you understand the basic operations of the SAP HANA Database native SQL on Files capability in SAP HANA Cloud, please carefully review the recommendations and supported data types provided in the links below before proceeding further.


Monitoring, performance analysis, and troubleshooting

For monitoring, various system views are available to observe SQL on Files operations. Please refer to the link provided below.

For further performance analysis, including EXPLAIN PLAN and SQL Analyzer Tool for SAP HANA, please refer to the detailed information provided in the link below.

For troubleshooting, traces can be used to address issues. Please also find information on how to set trace levels in the link below.


FAQs

Q: To use this SQL on Files feature, do I need to subscribe to any additional service or pay more?
A: No, all you need is a right-sized SAP HANA Database and data lake Files in SAP HANA Cloud. Proper sizing covers the memory and compute of the SAP HANA Database, as well as the amount of data read from data lake Files during a given month (API calls). Please refer to the SAP HANA Cloud Capacity Unit Estimator.

Q: Are there any limitations on the file size or volume that SQL on Files can handle?
A: There is no hard limit on file size or volume, but queries are constrained by the provisioned SAP HANA Database size in SAP HANA Cloud. If a SQL on Files query requires more resources than the provisioned SAP HANA Database offers, it cannot be executed, and scaling up the SAP HANA Database should be considered. To overcome the limits of a single SAP HANA Database instance, we plan to support elastic compute nodes in the future.

Q: This SQL on Files feature is read-only from the SAP HANA Cloud Database perspective. If I want to execute DMLs on a Delta table, how can I do this?
A: Since Apache Spark can be used with data lake Files, DML statements are expected to be executed separately via Apache Spark. Please refer to the link below.

Q: It seems that SQL on Files shares the same virtual table (federation) framework as SAP HANA smart data access (SDA). Does this mean it also supports toggling to replica tables or creating a view on a virtual table for a static result cache?
A: Yes, toggling between virtual tables and replica tables is supported, but only snapshot replicas are supported. Please refer to the link below.

Creating a view on a virtual table and then creating a static result cache on the view is also supported.
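
A minimal sketch of that pattern, assuming the YELLOW_TRIP_DELTA virtual table from above:

-- create a view on top of the virtual table
CREATE VIEW DEMO.YELLOW_TRIP_DELTA_V AS
    SELECT * FROM DEMO.YELLOW_TRIP_DELTA;

-- add a static result cache with a 60-minute retention
ALTER VIEW DEMO.YELLOW_TRIP_DELTA_V ADD CACHE RETENTION 60;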

Q: Is it recommended to create statistics for better performance?
A: SQL on Files virtual tables offer runtime data statistics collection, so you don't need to explicitly create statistics with the CREATE STATISTICS statement; this happens automatically. Please refer to the link below.

  • SAP HANA Cloud, SAP HANA Database SQL on Files Guide: Statistics


Conclusion

In conclusion, the new SAP HANA Database Native SQL on Files capability in SAP HANA Cloud marks a significant advancement in unlocking the potential of data within different file formats. This enhancement not only provides instant access to valuable business data in files but also seamlessly integrates them with existing data and capabilities in SAP HANA Database.

This direct, ingestion-free access to data lake Files in SAP HANA Cloud opens new possibilities for enterprises seeking to leverage their data for strategic decision-making, while enabling a modern data analytics and data engineering ecosystem.

We are continuously planning further enhancements and additional features, such as Delta Sharing support and access to external object stores. Please visit our SAP Road Map Explorer to stay tuned for upcoming updates.