
This blog is about data management, data tiering, data temperature, data archiving… A lot of topics, or labels really, but in essence they all have one thing at their core – making sure the necessary data is available at the right time and at the right cost.
In the good tradition of blog series developed by the SAP HANA Database & Analytics, Cross Solution Management team, I wanted to present a sample end-to-end use case, covering all its relevant components in as much detail as necessary.
I will use a sample data scenario to discuss the following topics:
If you have watched the TechEd'21 session Examine the Cloud-Native Features of SAP HANA Cloud and Discover the Value [DAT105], and more recently the TechEd'22 session SAP HANA Cloud: Data Lake and Data Tiering Overview [DA108], you would have heard the term Data Pyramid. If not, you would have most likely seen a picture like the one below in the context of data archiving, data tiering, data temperature, or cost-effective data storage/persistency.
(Please do not confuse this with other concepts like DIKW or data science pyramid.)
The idea is simple enough – place your data in the tier which provides the required capabilities (like access and performance) in the most cost-effective way. So, use your in-memory tier for data used frequently and/or with critical latency requirements, then push data down the pyramid as needed.
In this blog, I would like to share my thoughts on two approaches involving the data and the pyramid:
The difference between the two is rather subtle – in the end, we are looking at certain rules defining which data should reside in which tier. In the former case, we apply the rules to data already in the pyramid to demote or promote it within the pyramid. In the latter, we acquire data from external sources and decide on the fly in which tier of the pyramid the data will be placed. I hope the reason I draw that distinction will become clearer a little later.
A quick run through the tiers in our pyramid and the associated terminology. The pyramid depicted above is built using the following components (wording to a large degree sourced from the SAP HANA Cloud, Data Lake terminology):
Components of SAP HANA Cloud:
For completeness, it should be noted that several of the concepts and capabilities described in this blog are also available for SAP HANA on-premise deployments, but the focus here will be on SAP HANA Cloud.
In my scenario, I will be using SAP HANA Cloud, Data Lake Relational Engine (HANA DB-managed). Refer to Ways to Create a Data Lake to learn about this and the alternative (standalone) deployment option.
As in many of our End-to-End Data & Analytics Scenarios, I used the Tankerkönig data set. While this blog described the process of acquiring current data via API (using SAP Data Intelligence), I focused on the historical data set as published here. The historical data set I used has the following volumes:
| Year | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 (*) | Total |
|---|---|---|---|---|---|---|---|---|---|
| Record count | 16,264,865 | 35,727,313 | 44,340,894 | 58,402,045 | 65,544,845 | 82,326,042 | 100,117,404 | 90,051,867 | 492,775,275 |
(*) Data from year 2021 covers up to Oct 21st.
The data is available as a set of csv files, containing prices across all stations with several samples a day, organised in the following manner:
So, we have (number of days in a month x) files contained in a Month folder, which in turn is contained in a Year folder. This is quite handy – I will explain why later.
I want to distribute the data – according to its age – across the tiers:
My data set and its tier classification represented on the data pyramid:
In the case of HANA Cloud and Native Storage Extension, I have two options with regard to splitting the data:
For my exercise I decided on the second option – that is, to maintain a single table and partition it using range partitioning by YEAR. That way I can easily change the tier of a whole partition without needing to transfer data.
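As a minimal sketch of what such a tier change could look like (the table name matches the HC_PRICES table created later in this blog; the partition ID is an illustrative assumption), demoting or promoting a partition is a load-unit change rather than a data transfer:

/* Sketch only: move one partition to NSE (page loadable) and back to in-memory. */
ALTER TABLE TANKER."tanker::HC_PRICES" ALTER PARTITION 3 PAGE LOADABLE;
ALTER TABLE TANKER."tanker::HC_PRICES" ALTER PARTITION 3 COLUMN LOADABLE;

/* Partition IDs and current load units can be checked, for example, in M_TABLE_PARTITIONS. */
SELECT PART_ID, LOAD_UNIT
  FROM M_TABLE_PARTITIONS
 WHERE TABLE_NAME = 'tanker::HC_PRICES';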
There may be scenarios where the first option might be preferable – for example, when you want to restrict access to data in the NSE tier (like blocking changes), which can be easier to do with separate tables. In such a scenario, changing the tier of a data subset will require a transfer of data from one table to another.
In the context of the data pyramid and various tiers, it is important to consider capabilities and requirements with regards to data acquisition, data movement (between tiers), and data consumption:
The diagram below shows sample methods to acquire data into the various tiers of the data pyramid:
Depending on your target, various data acquisition or replication solutions and methods are available. While the diagram above focuses on some of the offerings from the SAP BTP portfolio, there are others.
Let’s consider my data example – which approach should I use? I could use a combination of techniques to load each subset of data directly into its target tier – for example, use SAP Data Intelligence to load years 2016 to 2021 directly into SAP HANA Cloud and then use Smart Data Access to push data (via SQL) from HANA Cloud to HDL IQ. For the older years (2014-2015) I could just use the HDL command line interface (hdlfscli) to load files into HDL Files.
In order to test some scenarios relevant for this blog, I decided to perform the bottom-up approach:
To understand how, let’s talk about…
… between various tiers of the pyramid.
There are two approaches shown on the diagram:
The “service” based approach is hypothetical at this stage, as no such service is currently offered. One could of course implement such a service – using either the relevant APIs offered by the individual layers or a data management solution like SAP Data Intelligence. I intend to enhance the blog at a later stage to elaborate on this option, hence planting the “service” approach as a placeholder here.
The “native SQL” based approach provides full coverage for moving data between all the tiers depicted on the pyramid:
The native SQL approach is quite powerful as it offers access to all capabilities of the respective tier. One can write procedures with dynamically calculated tiering parameters (like data “age” criteria) and combine them with the HANA Cloud built-in scheduler to automate execution.
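As a minimal sketch of that idea (the procedure name, age threshold, and cron pattern below are purely illustrative assumptions), a wrapper procedure with a calculated cut-off year could be scheduled directly in HANA Cloud:

/* Sketch only: a hypothetical procedure that derives the cut-off year dynamically
   and would contain the actual tiering statements described later in this blog. */
CREATE OR REPLACE PROCEDURE MOVE_AGED_PRICES AS
BEGIN
    DECLARE cutoff_year INT;
    cutoff_year := YEAR(CURRENT_DATE) - 3;
    -- ... INSERT into the lower tier / UPSERT of the tier configuration /
    --     DELETE from the upper tier, driven by :cutoff_year ...
END;

/* Run it on the 1st of every month at 02:00 via the built-in scheduler
   (cron format: year month date weekday hour minute second). */
CREATE SCHEDULER JOB MOVE_AGED_PRICES_JOB
    CRON '* * 1 * 2 0 0'
    ENABLE
    PROCEDURE MOVE_AGED_PRICES;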
In my scenario, to get a feel for the various options, I used a combination of the following “native SQL” approaches to facilitate data movements between tiers:
Just one more section to complete the overall picture, and then I will explain in detail how I did the above.
We collect data in order to consume it. In the context of the data pyramid, data can be consumed from any tier directly, using various methods – as per the examples in the diagram:
While I can reach each tier independently, in my case that would give me only a partial view of the data. Therefore, I am most interested in the ability to consume the complete data set from one place. This is where two-step federation will help me:
In order to facilitate access to a data set spread across three tiers, I created the following data model:
It is high time to elaborate a bit on SQL on Files – as per the blog Spotlight: SQL on Files (SAP HANA Cloud, Data Lake) and the documentation SAP HANA Cloud, Data Lake Administration Guide for SQL on Files:
SQL on Files is a capability of the Data Lake Files service in SAP HANA Cloud, data lake that allows you to query files that contain structured data that are sitting in your data lake file container.
The referenced blog mentions several use cases, mostly focused on accessing data which is still under exploration, with a view to uploading it to the SAP HANA Cloud, SAP HANA database or Data Lake Relational Engine to facilitate fast access.
In my scenario I decided to use the HDL Files layer to persist a portion of my “aged” data – mostly to illustrate the possibilities of SQL on Files, but also to provide an additional option to persist data at the lowest possible cost where access to that data is mostly required on an ad-hoc basis – for example for regulatory reasons.
It is important to assess whether in your case use of HDL Files for the multi-tier data model does in fact make sense. Arguably, HDL IQ offers excellent performance (check the blog Estimating the costs of SAP HANA Cloud to estimate the costs), so why bother with HDL Files? The choice will naturally be yours, and my aim is to provide you with necessary information to make that decision.
Finally, let’s implement the data model and movements. Below is a write-up and further down you will find a recording showing elements of my work. In each section I have included jump links to the associated location in the video, so you can quickly access relevant visualisation.
I decided to use the HDL Files Command Line Interface (hdlfscli) to upload data files from my storage to HDL Files. The tool can be obtained with the Data Lake Relational Engine Client and provides a wrapper for the HDL Files REST APIs.
Refer to following reference material for details on how to set the connectivity up and how to use the tool:
Key activities:
Uploading the first file:
> hdlfscli -config csm_dlake upload 2014-06-1_1-prices.csv TankerDataTiering/Prices/2014/06/2014-06-1_1-prices.csv
{"Location":"https://xxxxxxxxxx.files.hdl.prod-eu10.hanacloud.ondemand.com/webhdfs/v1/TankerDataTiering/Prices/2014/06/2014-06-1_1-prices.csv"}
where csm_dlake refers to the configuration created as per Set-up use of hdlfscli with configuration file.
After all files have been uploaded, we can check the content of the target file container:
> hdlfscli -config csm_dlake ls TankerDataTiering/Prices
DIRECTORY <owner> <group> 0 777 2014
DIRECTORY <owner> <group> 0 777 2015
> hdlfscli -config csm_dlake ls TankerDataTiering/Prices/2014
DIRECTORY <owner> <group> 0 777 06
DIRECTORY <owner> <group> 0 777 07
DIRECTORY <owner> <group> 0 777 08
DIRECTORY <owner> <group> 0 777 09
DIRECTORY <owner> <group> 0 777 10
DIRECTORY <owner> <group> 0 777 11
DIRECTORY <owner> <group> 0 777 12
> hdlfscli -config csm_dlake ls TankerDataTiering/Prices/2014/06
FILE <owner> <group> 10485661 666 2014-06-1_1-prices.csv
FILE <owner> <group> 10485663 666 2014-06-1_10-prices.csv
FILE <owner> <group> 10485637 666 2014-06-1_11-prices.csv
FILE <owner> <group> 10485631 666 2014-06-1_12-prices.csv
FILE <owner> <group> 10485597 666 2014-06-1_13-prices.csv
FILE <owner> <group> 10485646 666 2014-06-1_14-prices.csv
…
Refer to the embedded video for visualisation – direct jump link to relevant video section.
Let me start with the creation of the portion of my data model residing in HDL IQ:
At the current stage, the Cloud Application Programming Model (CAP) does not cover development of HANA Data Lake artefacts (nor does MTA deployment cover such artefacts). Therefore, I created these artefacts using SQL statements, but executed them from HANA Cloud. As per SQL Statements for Data Lake Relational Engine (HANA DB-Managed):
Some Data Lake Relational Engine SQL statements can only be executed within the REMOTE_EXECUTE procedure. A few can be run either within the REMOTE_EXECUTE procedure or when directly connected to Data Lake Relational Engine. Check the Restriction section at the beginning of each SQL statement for guidance.
I have decided to use the REMOTE_EXECUTE method wherever possible and used selected SQL on Files statements as per SQL on Files for Data Lake Relational Engine (SAP HANA DB-Managed). Please note that the statement passed to REMOTE_EXECUTE must be wrapped in single quote characters, and every single quote within it must be escaped by adding an additional single quote (refer to my code below).
To execute remote calls, I have granted my user a special role:
grant "HANA_SYSRDL#CG_ADMIN_ROLE" to <my-user>;
This role contains the necessary object privilege:
Equipped with that role, I can now create my HDL IQ portion of the data model.
Step #1: Creating a Schema [SQL on Files]:
/* Step 1: Create a schema in the Files Service. */
CALL SYSRDL#CG.REMOTE_EXECUTE('
    CREATE SCHEMA TANKER_SOF IN FILES_SERVICE;
');
Step #2: Creating a SQL on Files Table [SQL on Files]
/* Step 2: Create a table under the schema in the Files Service. */
CALL SYSRDL#CG.REMOTE_EXECUTE('
    CREATE TABLE TANKER_SOF.SOF_PRICES (
        YEAR integer,
        MONTH integer,
        DATE_TIME timestamp,
        STATION_UUID varchar(50),
        DIESEL decimal(20,5),
        E5 decimal(20,5),
        E10 decimal(20,5),
        DIESELCHANGE decimal(20,5),
        E5CHANGE decimal(20,5),
        E10CHANGE decimal(20,5)
    ) IN FILES_SERVICE;
');
Step #3: Creating a Virtual Table in Data Lake Relational Engine [SQL on Files]
/* Step 3: Create a virtual table in IQ that points to the table in the Files Service. */
CALL SYSRDL#CG.REMOTE_EXECUTE('
    CREATE EXISTING TABLE VT_SOF_PRICES AT ''SOF..TANKER_SOF.SOF_PRICES'';
');
Step #4: Add a Datasource to a SQL on Files Table [SQL on Files]
/* Step 4: Declare the data source for the table in the Files Service. */
CALL SYSRDL#CG.REMOTE_EXECUTE('
    ALTER TABLE TANKER_SOF.SOF_PRICES IN FILES_SERVICE ADD DATASOURCE AS prices_dl (
        YEAR FROM DIRECTORY $0,
        MONTH FROM DIRECTORY $1,
        DATE_TIME FROM COLUMN $0,
        STATION_UUID FROM COLUMN $1,
        DIESEL FROM COLUMN $2,
        E5 FROM COLUMN $3,
        E10 FROM COLUMN $4,
        DIESELCHANGE FROM COLUMN $5,
        E5CHANGE FROM COLUMN $6,
        E10CHANGE FROM COLUMN $7
    ) csv(''hdlfs:///TankerDataTiering/Prices/'') delimited by '','' encoding ''utf_8'';
');
Note how the data in the table is constructed: the YEAR and MONTH values are derived from the directory names (first and second level below the data source root), while the remaining columns are read in order from the columns of the csv files:
YEAR FROM DIRECTORY $0,
MONTH FROM DIRECTORY $1,

DATE_TIME FROM COLUMN $0,
STATION_UUID FROM COLUMN $1,
DIESEL FROM COLUMN $2,
E5 FROM COLUMN $3,
E10 FROM COLUMN $4,
DIESELCHANGE FROM COLUMN $5,
E5CHANGE FROM COLUMN $6,
E10CHANGE FROM COLUMN $7
Sample content of the csv file – fields reflecting the list above in sequence:
2015-12-01 10:10:01.000,cff3ef54-61bf-4546-b7ad-3ad75537a04f,1.11900,1.32900,1.30900,1.00000,1.00000,1.00000,
2015-12-01 10:10:01.000,d02f97ff-9546-4895-90b7-a5abe4338ed1,1.09900,1.31900,1.29900,1.00000,1.00000,1.00000,
2015-12-01 10:10:01.000,d067c1a1-ab23-411f-b6bd-a2ffce1622b5,1.13900,1.38900,1.36900,1.00000,0.00000,0.00000,
2015-12-01 10:10:01.000,d07d7688-a4e1-472e-8f91-9b8c39624763,1.10900,1.32900,1.30900,1.00000,1.00000,1.00000,
I can check the created objects using HDL IQ stored procedures. To do this via REMOTE_EXECUTE, I would need to create a temporary table where the results of the stored procedure execution are stored, and then query that temporary table – refer to the example.
Or, I can execute those directly from HDL IQ (using SAP HANA Database Explorer or IQ Interactive Client):
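As a minimal sketch (assuming a direct HDL IQ connection; the owner name below is an assumption, based on objects created via REMOTE_EXECUTE belonging to the SYSRDL#CG user):

/* Sketch only: list tables for that owner and sanity-check the SQL on Files virtual table. */
sp_iqtable table_owner='SYSRDL#CG';
SELECT COUNT(*) FROM SYSRDL#CG.VT_SOF_PRICES;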
Now, as a final activity, I am creating the HDL IQ table IQ_PRICES:
/* Create the HDL IQ table IQ_PRICES. */
CALL SYSRDL#CG.REMOTE_EXECUTE('
    CREATE TABLE IQ_PRICES (
        YEAR integer,
        MONTH integer,
        DATE_TIME timestamp,
        STATION_UUID varchar(50),
        DIESEL decimal(20,5),
        E5 decimal(20,5),
        E10 decimal(20,5),
        DIESELCHANGE decimal(20,5),
        E5CHANGE decimal(20,5),
        E10CHANGE decimal(20,5)
    );
');
I have now created all the artefacts in HDL IQ, including the remote table for accessing HDL Files via SQL on Files.
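Since I took the bottom-up approach described earlier, populating the IQ tier could be as simple as an INSERT from the SQL on Files virtual table. This is a hedged sketch only (the year range reflects my tier classification; for very large volumes an EXPORT/LOAD approach may be preferable, as discussed later):

/* Sketch only: load the IQ tier (years 2016-2018) from the SQL on Files virtual table created above. */
CALL SYSRDL#CG.REMOTE_EXECUTE('
    INSERT INTO IQ_PRICES
        SELECT YEAR, MONTH, DATE_TIME, STATION_UUID, DIESEL, E5, E10,
               DIESELCHANGE, E5CHANGE, E10CHANGE
          FROM VT_SOF_PRICES
         WHERE YEAR BETWEEN 2016 AND 2018;
');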
Refer to the embedded video for visualisation – direct jump link to relevant video section.
For SAP HANA Cloud, I decided to use Cloud Application Programming Model (CAP) with SAP Business Application Studio. The key artefacts:
One preparation activity – to allow creation of the virtual table via CAP as per Virtual Tables (.hdbvirtualtable), I need to grant a privilege to the HDI container object owner:
grant create virtual table on remote source SYSRDL#CG_SOURCE to "<HDI container schema name>#OO";
(in my case, <HDI container schema name>#OO = TANKER_1#OO)
After the first deployment, I partitioned my HANA Cloud table (HC_PRICES) by YEAR, with selected partitions using Native Storage Extension (NSE):
ALTER TABLE TANKER."tanker::HC_PRICES" PARTITION BY RANGE (YEAR) (
    PARTITION 2022 <= VALUES < 9999,
    PARTITION VALUE = '2021',
    PARTITION VALUE = '2020' PAGE LOADABLE,
    PARTITION VALUE = '2019' PAGE LOADABLE,
    PARTITION OTHERS PAGE LOADABLE
);
I had to add the ALTER object privilege to a role defined within the CAP model. That role was assigned to an operational user, who could then perform the partitioning operation.
Refer to the embedded video for visualisation – direct jump link to relevant video section.
I will now run aggregation queries against the calculation view with union node pruning defined – each of the queries below retrieves its data set exclusively from HANA Cloud (YEAR=2020), HDL IQ (YEAR=2017), or HDL Files (YEAR=2014). Next to each execution I have provided some runtime results (without any tuning):
| Query against CalcView with pruning | Runtime results |
|---|---|
| /* Execute CV query for data set in HANA Cloud only */ select year, month, tier, count(*), avg(e5) from "tanker::CV_PRICES_UNION" where year in ('2020') group by year, month, tier order by year, month, tier; | Client elapsed time: 678.0 ms; Statement execute time: 676.3 ms; Peak memory consumed: 24.00 MB |
| /* Execute CV query for data set in HDL IQ only */ select year, month, tier, count(*), avg(e5) from "tanker::CV_PRICES_UNION" where year in ('2017') group by year, month, tier order by year, month, tier; | Client elapsed time: 2.262 s; Statement execute time: 2.260 s; Peak memory consumed: 1.443 MB |
| /* Execute CV query for data set in HDL Files only */ select year, month, tier, count(*), avg(e5) from "tanker::CV_PRICES_UNION" where year in ('2014') group by year, month, tier order by year, month, tier; | Client elapsed time: 30.10 s; Statement execute time: 30.10 s; Peak memory consumed: 1.443 MB |
As you can see, and as expected, statement execution times grow as we travel down the data pyramid. That is because we explicitly direct each query only to the union node which contains the particular “data tier”.
It is also important to note that the performance of HDL – both IQ and Files – is affected by the configured capacity. The measurements above were obtained with HDL configured with 1 Coordinator node (2 vCPUs) and 1 Worker node (8 vCPUs). Experiments with higher CPU allocation returned better results – for example, increasing the Worker node capacity to 16 vCPUs reduced the statement execute time by 40% for the HDL IQ layer and by 27% for HDL Files.
Now, let’s repeat the test with the calculation view without union node pruning:
| Query against CalcView without pruning | Runtime results |
|---|---|
| /* Execute CV query for data set in HANA Cloud only */ select year, month, tier, count(*), avg(e5) from "tanker::CV_PRICES_UNION_NO_PRUNING" where year in ('2020') group by year, month, tier order by year, month, tier; | Client elapsed time: 34.77 s; Statement execute time: 34.77 s; Peak memory consumed: 24.20 MB |
| /* Execute CV query for data set in HDL IQ only */ select year, month, tier, count(*), avg(e5) from "tanker::CV_PRICES_UNION_NO_PRUNING" where year in ('2017') group by year, month, tier order by year, month, tier; | Client elapsed time: 36.37 s; Statement execute time: 36.37 s; Peak memory consumed: 2.422 MB |
| /* Execute CV query for data set in HDL Files only */ select year, month, tier, count(*), avg(e5) from "tanker::CV_PRICES_UNION_NO_PRUNING" where year in ('2014') group by year, month, tier order by year, month, tier; | Client elapsed time: 32.55 s; Statement execute time: 32.54 s; Peak memory consumed: 2.424 MB |
Without pruning, queries are always executed against all union nodes and the slowest one dictates overall response time.
This simple test shows the importance of using optimisation techniques like union node pruning when combining any data sources, but especially when the data sources have very different performance characteristics.
Regardless of the performance differences, we can now access a data set distributed across various tiers of the data pyramid – and do that in a manner completely transparent to the consumers of that information. We can, of course, add a filter which would by default exclude data from the bottom tier unless explicitly requested.
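Whether such a filter is added inside the view or at query time, conceptually it could look like this hedged sketch (using the TIER column already exposed by the calculation view in the queries above):

/* Sketch only: exclude the HDL Files tier from a default query. */
select year, tier, count(*), avg(e5)
  from "tanker::CV_PRICES_UNION"
 where tier <> 'SOF_TIER'
 group by year, tier
 order by year, tier;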
While the pruning technique described above works well for avoiding execution of queries against a particular union node, it does so only where the query criteria contain only YEAR values assigned to a single union node. For example, if my query includes YEAR>=2017, the query will be executed against every union node.
This raises another question – what happens when data is – albeit briefly – duplicated across tiers while I am moving it between them? How do I avoid fetching data twice?
Before I explain my approach, let’s talk about data movements.
While awaiting the readiness of a Data Tiering Service, I have decided to use the “native SQL” approach, for example:
Irrespective of the method chosen, the movement itself may not support distributed transaction consistency. When moving data between two tiers, the following activities occur:
Unless the two commits above can be executed as part of the same transaction (using two-phase commit), data subset A may exist in both Tier#1 and Tier#2 at the same time (until the deletion from Tier#1 takes effect). As it stands today, Smart Data Access does not provide distributed transaction consistency support for any sources. You can execute the following commands in the context of a single transaction in HANA Cloud:
Copy data from HANA Cloud to HDL IQ:
insert into "tanker::VT_IQ_PRICES" (year, month, date_time, station_uuid, diesel, e5, e10, dieselchange, e5change, e10change)
(
    select year, month, date_time, station_uuid, diesel, e5, e10, dieselchange, e5change, e10change
      from "tanker::HC_PRICES"
     where year = 2019
);
Delete data from HANA Cloud:
delete from "tanker::HC_PRICES" where year = 2019;
Commit transaction.
A local HANA Cloud commit or rollback will trigger the corresponding remote HDL commit or rollback on a “best effort” basis. This means that transactional consistency may seem to work in most cases, but in the worst case we can experience data loss – for example:
Therefore, data movement needs to be a 3-step approach:
You can check capabilities supported by your remote source as per List Current Properties for a Remote Source.
It is worth mentioning that the method I presented here for data movement from the HANA database to HDL IQ is simple, but may prove problematic where data sets are very large. For example, I had to break up the INSERT to run on a by-month basis in order to avoid hitting the statement memory limit (on my 60GB HANA instance). This is where using EXPORT (for example to HDL Files) from the HANA database and then LOAD on HDL IQ from HDL Files could prove beneficial.
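A minimal sketch of that by-month break-up (assuming the same tables as above; an anonymous SQLScript block keeps each individual statement below the memory limit):

/* Sketch only: move one year month by month instead of in a single INSERT. */
DO BEGIN
    DECLARE i INT;
    FOR i IN 1..12 DO
        INSERT INTO "tanker::VT_IQ_PRICES" (year, month, date_time, station_uuid, diesel, e5, e10, dieselchange, e5change, e10change)
            SELECT year, month, date_time, station_uuid, diesel, e5, e10, dieselchange, e5change, e10change
              FROM "tanker::HC_PRICES"
             WHERE year = 2019 AND month = :i;
    END FOR;
END;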
Union node pruning as described above is useful for performance reasons, but does not address an important aspect of data consistency. As illustrated above, when we start moving data between tiers, we will have – however briefly – the problem of data duplication.
Therefore, I have enhanced the pruning concept to ensure that only one data tier is queried for a particular data subset. I created a configuration table CFG_TIER_PRICES with the following content:
| YEAR | TIER |
|---|---|
| 2014 | SOF_TIER |
| 2015 | SOF_TIER |
| 2016 | IQ_TIER |
| 2017 | IQ_TIER |
| 2018 | IQ_TIER |
| 2019 | HC_TIER |
| 2020 | HC_TIER |
| 2021 | HC_TIER |
I then use this table in each union node of my calculation view to restrict selection from a particular data source to the years that should be sourced from that tier. In the screenshot below, I am showing the join definition within the node IQ_TIER:
The same logic applies to the other union nodes, with the only difference being the value of TIER in the filter.
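Conceptually, each union node performs something like the following hedged SQL sketch of the IQ_TIER node logic (the actual implementation is modelled graphically in the calculation view):

/* Sketch only: restrict the IQ_TIER source to the years currently assigned to that tier. */
select p.year, p.month, p.date_time, p.station_uuid, p.diesel, p.e5, p.e10, 'IQ_TIER' as tier
  from "tanker::VT_IQ_PRICES" as p
 inner join "tanker::CFG_TIER_PRICES" as c
    on c.year = p.year
 where c.tier = 'IQ_TIER';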
In my movement scenario, I introduce an additional step:
upsert "tanker::CFG_TIER_PRICES" (YEAR, TIER) values (2019, 'IQ_TIER') where year = 2019;
As a result, queries for data in YEAR 2019 will now be served by the IQ_TIER node only (and no longer by HC_TIER, as before the change).
Proceed to Step#3.
As a result, a query for YEAR=2019 will only consider data from the IQ_TIER node, that is, the VT_IQ_PRICES table.
Result of the query before and after the data move:
/* Check results of the query using the CalcView */
select year, tier, count(*), avg(e5)
  from "tanker::CV_PRICES_UNION"
 group by year, tier
 order by year, tier;
Before movement (left) ---> After movement (right):

| YEAR | TIER | COUNT(*) | AVG(E5) | YEAR | TIER | COUNT(*) | AVG(E5) |
|---|---|---|---|---|---|---|---|
| 2014 | SOF_TIER | 16,264,865 | 1.4839 | 2014 | SOF_TIER | 16,264,865 | 1.4839 |
| 2015 | SOF_TIER | 35,727,313 | 1.3469 | 2015 | SOF_TIER | 35,727,313 | 1.3469 |
| 2016 | IQ_TIER | 44,340,894 | 1.2838 | 2016 | IQ_TIER | 44,340,894 | 1.2838 |
| 2017 | IQ_TIER | 58,402,045 | 1.3529 | 2017 | IQ_TIER | 58,402,045 | 1.3529 |
| 2018 | IQ_TIER | 65,544,845 | 1.4370 | 2018 | IQ_TIER | 65,544,845 | 1.4370 |
| 2019 | HC_TIER | 82,326,042 | 1.4156 | 2019 | IQ_TIER | 82,326,042 | 1.4156 |
| 2020 | HC_TIER | 100,117,404 | 1.2713 | 2020 | HC_TIER | 100,117,404 | 1.2713 |
| 2021 | HC_TIER | 90,051,867 | 1.5308 | 2021 | HC_TIER | 90,051,867 | 1.5308 |
The video below provides a complete walkthrough of the key activities I described above (since the overall video is over 30 mins long, I have provided jump links to go straight to particular sections):
I hope you found my musings informative. My intent was to provide food for thought, give you some ideas, and share the existing capabilities of SAP HANA Cloud and SAP BTP. The key messages I would like you to take away:
You can see how flexible access to data across the pyramid can help in building a sophisticated scenario in the blog SAP BTP Data & Analytics Showcase – Machine Learning via Python in SAP Data Warehouse Cloud by wei.han7.
Refer to these useful references for additional information: