jacekklatt
Product and Topic Expert

This blog is about data management, data tiering, data temperature, data archiving… A lot of topics, or labels really, as in essence they all have one thing at their core – making sure the necessary data is available at the right time and at the right cost.

In the good tradition of blog series developed by the SAP HANA Database & Analytics, Cross Solution Management team, I wanted to present a sample end-to-end use case, covering all relevant components of such a case in as much detail as necessary.

I will use a sample data scenario to discuss the following topics:

  • data tiers explained,
  • data acquisition to various tiers,
  • data consumption from various tiers,
  • data movement between tiers.

Data storage/scale pyramid


If you have watched the TechEd'21 session Examine the Cloud-Native Features of SAP HANA Cloud and Discover the Value [DAT105], and more recently the TechEd'22 session SAP HANA Cloud: Data Lake and Data Tiering Overview [DA108], you would have heard the term Data Pyramid. If not, you would have most likely seen a picture like the one below in the context of data archiving, data tiering, data temperature, or cost-effective data storage/persistency.
(Please do not confuse this with other concepts like DIKW or data science pyramid.)

The idea is simple enough – place your data in a tier which provides the required capabilities (like access and performance) in the most cost-effective way. So, use your in-memory tier for data used frequently and/or with critical latency requirements, then push data down the pyramid as needed.

In this blog, I would like to share my thoughts on two approaches involving the data and the pyramid:

  • Data tiering
    A process where data is moved between tiers according to defined rules.
  • Data placement
    A process where data is placed in a particular tier during the acquisition process.


The difference between the two is rather subtle – in the end, we are looking at certain rules defining which data should reside in which tier. In the former case, we apply the rules to data already in the pyramid, to demote or promote it within the pyramid. In the latter, we acquire data from external sources and decide on the fly in which tier of the pyramid the data will be placed. I hope the reason I draw that distinction will become clearer a little later.

The tiers

A quick run through the tiers in our pyramid and the associated terminology. The pyramid depicted above is built using the following components (wording to a large degree sourced from the SAP HANA Cloud, Data Lake terminology):

Components of SAP HANA Cloud:

  • SAP HANA Cloud, SAP HANA Database -> provides SAP HANA in-memory and Native Storage Extension capabilities.
    I will use the acronym HC in this blog.
  • SAP HANA Cloud, Data Lake -> with the following capabilities:
    • Data Lake IQ (renamed to Data Lake Relational Engine) - provides high-performance SQL analysis of large volumes of data. It is based on on-premise SAP IQ and has been referred to as SAP HANA Cloud, data lake IQ.
      [Update QRC 1/2022] The capability has been renamed to SAP HANA Cloud, Data Lake Relational Engine.
      I will use the acronym HDL IQ in this blog.
    • Data Lake Files - provides managed access to structured, semi-structured and unstructured data stored as files in the data lake. Data Lake Files contains file containers for the data. Also referred to as SAP HANA Cloud, data lake Files.
      I will use the acronym HDL Files in this blog.
  • External cloud storage capabilities of supported Hyperscalers.

For completeness, it should be noted that several of the concepts and capabilities described in this blog are also available for SAP HANA on-premise deployments, but the focus here will be on SAP HANA Cloud.

In my scenario, I will be using SAP HANA Cloud, Data Lake Relational Engine (HANA DB-managed). Refer to Ways to Create a Data Lake to learn about this and alternative (standalone) deployment.

A word about my data set and use case

As in many of our End-to-End Data & Analytics Scenarios, I used the Tankerkönig data set. While this blog describes the process of acquiring current data via an API (using SAP Data Intelligence), I focused on the historical data set as published here. The historical data set I used has the following volumes:

Year      Record count
2014      16,264,865
2015      35,727,313
2016      44,340,894
2017      58,402,045
2018      65,544,845
2019      82,326,042
2020      100,117,404
2021 (*)  90,051,867
Total     492,775,275


(*) Data from year 2021 covers up to Oct 21st.

The data is available as a set of csv files containing prices across all stations, with several samples a day, organised in the following manner:


So, we have (number of days in a month x) files contained in a Month folder, which in turn is contained in a Year folder. This is quite handy – I will explain why later.

I want to distribute the data – according to its age – across the tiers:

  • HANA Cloud column loadable (HC),
  • HANA Cloud page loadable (HC NSE),
  • HANA Data Lake IQ (aka HANA Data Lake Relational Engine) (HDL IQ),
  • HANA Data Lake Files (HDL Files).

My data set and its tier classification represented on the data pyramid:


In the case of HANA Cloud and Native Storage Extension, I have two options with regard to splitting the data:

  • create dedicated tables for the column-loadable and page-loadable portions respectively, or
  • create a single table with partitions and alter the partition load characteristic.

For my exercise I decided on the second option – that is, maintain a single table and partition it using range partitioning by YEAR. That way I can easily change the tier of a whole partition without the need to transfer data.

There may be scenarios where the first option is preferable – for example, when you want to restrict access to data in the NSE tier (like blocking changes), which can be easier to do with separate tables. In such a scenario, changing the tier of a data subset will require transferring the data from one table to another.
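To illustrate the second option (the one I chose): once the table is range-partitioned by YEAR, the load unit of an individual partition can be changed in place. Below is a minimal sketch – the partition ID is looked up first (for example in the M_TABLE_PARTITIONS monitoring view), and the schema/table names anticipate the model created later in this blog:

/* Check partitions and their current load unit */
SELECT PART_ID, LOAD_UNIT
  FROM M_TABLE_PARTITIONS
 WHERE SCHEMA_NAME = 'TANKER'
   AND TABLE_NAME  = 'tanker::HC_PRICES';

/* Demote one partition (PART_ID 3 is used for illustration only) to the NSE tier */
ALTER TABLE TANKER."tanker::HC_PRICES" ALTER PARTITION 3 PAGE LOADABLE;

/* Promote it back to the in-memory tier when needed */
ALTER TABLE TANKER."tanker::HC_PRICES" ALTER PARTITION 3 COLUMN LOADABLE;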

Data acquisition, movement, and consumption


In the context of the data pyramid and its various tiers, it is important to consider the capabilities and requirements with regard to data acquisition, data movement (between tiers), and data consumption:

Data acquisition

The diagram below shows sample methods to acquire data into the various tiers of the data pyramid:


Depending on your target, various data acquisition or replication solutions and methods are available. While the diagram above focuses on some of the offerings from within the SAP BTP portfolio, there are others.

Let’s consider my data example – which approach should I use? I could use a combination of these techniques to load subsets of data directly to a particular target tier – for example, use SAP Data Intelligence to load years 2016 to 2021 directly to SAP HANA Cloud and then use Smart Data Access to push data (via SQL) from HANA Cloud to HDL IQ. For the older years (2014-2015) I could just use the HDL command line interface (hdlfscli) to load the files to HDL Files.

In order to test some scenarios relevant for this blog, I decided to take a bottom-up approach:

  • load all data to HDL Files,
  • load data to HDL IQ and HANA Cloud from HDL Files.


To understand how, let’s talk about…

Data movement

… between various tiers of the pyramid.


There are two approaches shown on the diagram:

  • using the native SQL capabilities of each of the tiers, enhanced with SQL on Files (more on that capability later),
  • using a service-based approach.

The “service”-based approach is hypothetical at this stage, as no such service is currently offered. One could of course implement such a service – using either the relevant APIs offered by the individual layers or a data management solution like SAP Data Intelligence. I intend to enhance the blog at a later stage to elaborate on this option, hence planting the “service” approach as a placeholder here.

The “native SQL” based approach provides full coverage for moving data between all the tiers depicted on the pyramid:

The native SQL approach is quite powerful, as it offers access to all capabilities of the respective tier. One can write procedures with dynamically calculated tiering parameters (like a data “age” criterion) and combine them with the HANA Cloud built-in scheduler to automate execution.
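As a sketch of what such automation could look like – the procedure name, its body, and the cron schedule below are hypothetical placeholders, not part of my actual model – a stored procedure encapsulating the tiering logic could be triggered by a HANA Cloud scheduler job:

/* Hypothetical wrapper procedure that would contain the tiering logic */
CREATE PROCEDURE "tanker::MOVE_AGED_DATA" AS
BEGIN
  -- tiering rules would go here (INSERT ... SELECT into a virtual table,
  -- UNLOAD via REMOTE_EXECUTE, update of the tier configuration table, ...)
  SELECT 'placeholder' AS STATUS FROM DUMMY;
END;

/* Run it on the first day of every month at 02:00:00
   (cron format: <year> <month> <date> <weekday> <hour> <minute> <second>) */
CREATE SCHEDULER JOB "tanker::MOVE_AGED_DATA_JOB"
  CRON '* * 1 * 2 0 0'
  ENABLE
  PROCEDURE "tanker::MOVE_AGED_DATA";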

In my scenario, to get a feel for the various options, I used a combination of the following “native SQL” approaches to facilitate the data movements between tiers:

  • Initial load of the complete data set (years 2014-2021) to HDL Files using the command line interface (hdlfscli).
  • Load of year 2016 from HDL Files to HDL IQ using the LOAD TABLE SQL statement (see the sketch after this list).
  • Load of years 2017-2018 from HDL Files to HDL IQ using SQL on Files.
  • Load of years 2019-2021 from HDL Files via HDL IQ (using SQL on Files) to HANA Cloud, into a table already partitioned by year with 2019-2020 configured to use NSE.
  • Subsequently, I decided that as part of good data management practices, I can perform the following tier re-classifications:
    • YEAR 2019 is now old enough to be pushed to HDL IQ
      -> used INSERT … as … SELECT to perform the data movement.
    • YEAR 2016 is no longer needed for regular analytical tasks and therefore can be moved to HDL Files
      -> used UNLOAD to perform the data movement.
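The LOAD TABLE step mentioned above, wrapped in REMOTE_EXECUTE, could look roughly like the sketch below. The file name is illustrative only, the IQ_PRICES target table is created in a later section, YEAR and MONTH are set in a separate statement because they come from the directory structure rather than the file content, and the exact load options should be verified against the LOAD TABLE documentation for Data Lake Relational Engine (HANA DB-managed):

/* Sketch: load one CSV file of year 2016 from HDL Files into the HDL IQ table
   (in practice this is repeated per file or scripted) */
CALL SYSRDL#CG.REMOTE_EXECUTE('
  LOAD TABLE IQ_PRICES (
    DATE_TIME, STATION_UUID, DIESEL, E5, E10, DIESELCHANGE, E5CHANGE, E10CHANGE
  )
  USING FILE ''hdlfs:///TankerDataTiering/Prices/2016/01/2016-01-1_1-prices.csv''
  DELIMITED BY '',''
  ESCAPES OFF
  QUOTES OFF;
');

/* YEAR and MONTH are derived from the folder names, so set them afterwards */
CALL SYSRDL#CG.REMOTE_EXECUTE('
  UPDATE IQ_PRICES SET YEAR = 2016, MONTH = 1 WHERE YEAR IS NULL;
');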

Just one more section to complete the overall picture and I will explain in detail how I’ve done the above.

Data consumption

We collect data in order to consume it. In the context of the data pyramid, data can be consumed from any tier directly, using various methods – as per the examples in the diagram:

While I can reach each tier independently, in my case that would give me only a partial view of the data. Therefore, I am most interested in the ability to consume the complete data set from one place. This is where two-step federation will help me:

  • Federate data in HDL Files to HDL IQ using SQL on Files.
  • Federate data in HDL IQ (including already federated data from HDL Files) to HANA Cloud.

My multi-tier data model

In order to facilitate access to the data set spread across three tiers, I created the following data model:


It is high time to elaborate a bit on SQL on Files – as per the blog Spotlight: SQL on Files (SAP HANA Cloud, Data Lake) and the documentation SAP HANA Cloud, Data Lake Administration Guide for SQL on Files:

SQL on Files is a capability of the Data Lake Files service in SAP HANA Cloud, data lake that allows you to query files that contain structured data that are sitting in your data lake file container.

The referenced blog mentions several use cases, mostly focused on accessing data which is still under exploration, with a view to uploading it to the SAP HANA Cloud, SAP HANA database or Data Lake Relational Engine to facilitate fast access.

In my scenario, I decided to use the HDL Files layer to persist a portion of my “aged” data – mostly to illustrate the possibilities of SQL on Files, but also to provide an additional option to persist data at the lowest possible cost where access to that data is mostly required on an ad-hoc basis – for example, for regulatory reasons.

It is important to assess whether, in your case, the use of HDL Files for the multi-tier data model does in fact make sense. Arguably, HDL IQ offers excellent performance (check the blog Estimating the costs of SAP HANA Cloud to estimate the costs), so why bother with HDL Files? The choice will naturally be yours, and my aim is to provide you with the necessary information to make that decision.

Finally, let’s implement the data model and movements. Below is a write-up, and further down you will find a recording showing elements of my work. In each section I have included jump links to the associated location in the video, so you can quickly access the relevant visualisation.

Starting with HDL Files

I decided to use the HDL Files Command Line Interface (hdlfscli) to upload the data files from my storage to HDL Files. The tool can be obtained with the Data Lake Relational Engine Client and provides a wrapper for the HDL Files REST APIs.

Refer to the following reference material for details on how to set up the connectivity and how to use the tool:

Key activities:

Uploading first file:

> hdlfscli -config csm_dlake upload 2014-06-1_1-prices.csv TankerDataTiering/Prices/2014/06/2014-06-1_1-prices.csv

{"Location":"https://xxxxxxxxxx.files.hdl.prod-eu10.hanacloud.ondemand.com/webhdfs/v1/TankerDataTiering/Prices/2014/06/2014-06-1_1-prices.csv"}


where csm_dlake refers to the configuration created as per Set-up use of hdlfscli with configuration file.

After all files have been uploaded, we can check the content of the target file container:

> hdlfscli -config csm_dlake ls TankerDataTiering/Prices

DIRECTORY <owner> <group>            0 777  2014
DIRECTORY <owner> <group>            0 777  2015

> hdlfscli -config csm_dlake ls TankerDataTiering/Prices/2014

DIRECTORY <owner> <group>            0 777  06
DIRECTORY <owner> <group>            0 777  07
DIRECTORY <owner> <group>            0 777  08
DIRECTORY <owner> <group>            0 777  09
DIRECTORY <owner> <group>            0 777  10
DIRECTORY <owner> <group>            0 777  11
DIRECTORY <owner> <group>            0 777  12

> hdlfscli -config csm_dlake ls TankerDataTiering/Prices/2014/06

FILE      <owner> <group>     10485661 666  2014-06-1_1-prices.csv
FILE      <owner> <group>     10485663 666  2014-06-1_10-prices.csv
FILE      <owner> <group>     10485637 666  2014-06-1_11-prices.csv
FILE      <owner> <group>     10485631 666  2014-06-1_12-prices.csv
FILE      <owner> <group>     10485597 666  2014-06-1_13-prices.csv
FILE      <owner> <group>     10485646 666  2014-06-1_14-prices.csv
…


Refer to the embedded video for visualisation – direct jump link to relevant video section.

Federation from HDL Files to HDL IQ (using SQL on Files)

Let me start with the creation of the portion of my data model residing in HDL IQ:


At the current stage, the Cloud Application Programming Model (CAP) does not cover development of SAP HANA data lake artefacts (nor does MTA deployment cover such artefacts). Therefore, I created these artefacts using SQL statements, but executed them from HANA Cloud. As per SQL Statements for Data Lake Relational Engine (HANA DB-Managed):

Some Data Lake Relational Engine SQL statements can only be executed within the REMOTE_EXECUTE procedure. A few can be run either within the REMOTE_EXECUTE procedure or when directly connected to Data Lake Relational Engine. Check the Restriction section at the beginning of each SQL statement for guidance.


I decided to use the REMOTE_EXECUTE method wherever possible and used selected SQL on Files statements as per SQL on Files for Data Lake Relational Engine (SAP HANA DB-Managed). Please note that all statements executed with REMOTE_EXECUTE must be wrapped in single quote characters, and every existing single quote must be escaped by adding an additional single quote (refer to my code below).

To execute remote calls, I granted my user a special role:

grant "HANA_SYSRDL#CG_ADMIN_ROLE" to <my-user>;


This role contains the necessary object privilege:

Equipped with that role, I can now create my HDL IQ portion of the data model.

Step #1: Creating a Schema [SQL on Files]:

/* Step 1, Create a Schema in the Files Service. */

CALL SYSRDL#CG.REMOTE_EXECUTE('
  CREATE SCHEMA TANKER_SOF IN FILES_SERVICE;
');


Step #2: Creating a SQL on Files Table [SQL on Files]

/* Step 2, Create a Table Under the Schema in the Files Service. */

CALL SYSRDL#CG.REMOTE_EXECUTE('
  CREATE TABLE TANKER_SOF.SOF_PRICES (
    YEAR integer,
    MONTH integer,
    DATE_TIME timestamp,
    STATION_UUID varchar(50),
    DIESEL decimal(20,5),
    E5 decimal(20,5),
    E10 decimal(20,5),
    DIESELCHANGE decimal(20,5),
    E5CHANGE decimal(20,5),
    E10CHANGE decimal(20,5)
    ) IN FILES_SERVICE;
');


Step #3: Creating a Virtual Table in Data Lake Relational Engine [SQL on Files]

/* Step 3, Create a Virtual Table in IQ that points to the Table in the Files Service. */

CALL SYSRDL#CG.REMOTE_EXECUTE('
  CREATE EXISTING TABLE VT_SOF_PRICES AT ''SOF..TANKER_SOF.SOF_PRICES'';
');


Step #4: Add a Datasource to a SQL on Files Table [SQL on Files]

/* Step 4, declare the data source for the Table in the Files Service. */

CALL SYSRDL#CG.REMOTE_EXECUTE('
  ALTER TABLE TANKER_SOF.SOF_PRICES IN FILES_SERVICE
    ADD DATASOURCE AS prices_dl (
      YEAR FROM DIRECTORY $0,
      MONTH FROM DIRECTORY $1,
      DATE_TIME FROM COLUMN $0,
      STATION_UUID FROM COLUMN $1,
      DIESEL FROM COLUMN $2,
      E5 FROM COLUMN $3,
      E10 FROM COLUMN $4,
      DIESELCHANGE FROM COLUMN $5,
      E5CHANGE FROM COLUMN $6,
      E10CHANGE FROM COLUMN $7
    )
    csv(''hdlfs:///TankerDataTiering/Prices/'') delimited by '','' encoding ''utf_8'';
');


Note how the data in the table is constructed:

  • From HDL Files directory structure:
YEAR FROM DIRECTORY $0,
MONTH FROM DIRECTORY $1,
  • From the csv files:
DATE_TIME FROM COLUMN $0,
STATION_UUID FROM COLUMN $1,
DIESEL FROM COLUMN $2,
E5 FROM COLUMN $3,
E10 FROM COLUMN $4,
DIESELCHANGE FROM COLUMN $5,
E5CHANGE FROM COLUMN $6,
E10CHANGE FROM COLUMN $7


Sample content of a csv file – fields reflecting the list above in sequence:

2015-12-01 10:10:01.000,cff3ef54-61bf-4546-b7ad-3ad75537a04f,1.11900,1.32900,1.30900,1.00000,1.00000,1.00000,
2015-12-01 10:10:01.000,d02f97ff-9546-4895-90b7-a5abe4338ed1,1.09900,1.31900,1.29900,1.00000,1.00000,1.00000,
2015-12-01 10:10:01.000,d067c1a1-ab23-411f-b6bd-a2ffce1622b5,1.13900,1.38900,1.36900,1.00000,0.00000,0.00000,
2015-12-01 10:10:01.000,d07d7688-a4e1-472e-8f91-9b8c39624763,1.10900,1.32900,1.30900,1.00000,1.00000,1.00000,


I can check the created objects using HDL IQ stored procedures. To do this via REMOTE_EXECUTE, I would need to create a temporary table where the results of the execution of the stored procedure are stored, and then query that temporary table – refer to the example.

Or, I can execute them directly from HDL IQ (using SAP HANA Database Explorer or the IQ Interactive Client):

Now, as the final activity, I am creating the HDL IQ table IQ_PRICES:

/* Create an HDL IQ Table IQ_PRICES. */

CALL SYSRDL#CG.REMOTE_EXECUTE('
	CREATE TABLE IQ_PRICES (
		YEAR integer,
		MONTH integer,
		DATE_TIME timestamp,
		STATION_UUID varchar(50),
		DIESEL decimal(20,5),
		E5 decimal(20,5),
		E10 decimal(20,5),
		DIESELCHANGE decimal(20,5),
		E5CHANGE decimal(20,5),
		E10CHANGE decimal(20,5)
	);
');


I have now created all the artefacts in HDL IQ, including the virtual table for accessing HDL Files via SQL on Files.
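As a quick sanity check (run directly against HDL IQ, for example from SAP HANA Database Explorer), a simple query on the virtual table confirms that both the directory-derived columns (YEAR, MONTH) and the file columns are populated – a minimal sketch:

-- Executed directly on HDL IQ: peek at a few rows exposed via SQL on Files
SELECT TOP 5 YEAR, MONTH, DATE_TIME, STATION_UUID, E5
  FROM VT_SOF_PRICES;

Keep in mind that such a query scans the underlying files, so it will be noticeably slower than a query against a regular HDL IQ table.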

Refer to the embedded video for visualisation – direct jump link to relevant video section.

Creation of HANA Cloud artefacts

For SAP HANA Cloud, I decided to use Cloud Application Programming Model (CAP) with SAP Business Application Studio. The key artefacts:

  • Tables
    • HC_PRICES > table for HANA Cloud tier containing the PRICES data set
    • CFG_TIER_PRICES > table for mapping of data tier to data subsets
  • Virtual tables
    • VT_IQ_PRICES > virtual table for HDL IQ table with PRICES data set
    • VT_IQ_SOF_PRICES > virtual table for HDL Files (via HDL IQ) table with PRICES data set
  • SQL Views
    • V_CFG_PRICES_PRUNING > view built on top of CFG_TIER_PRICES to be used for union node pruning
  • Calculation views
    • CV_PRICES_UNION > union view across all tiers for PRICES data set (with union node pruning)
    • CV_PRICES_UNION_NO_PRUNING > as above, but without union node pruning (for comparison)

One preparation activity – to allow the creation of virtual tables via CAP as per Virtual Tables (.hdbvirtualtable), I need to grant a privilege to the HDI container object owner:

grant create virtual table on remote source SYSRDL#CG_SOURCE to "<HDI container schema name>#OO";


(in my case, <HDI container schema name>#OO = TANKER_1#OO)
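For illustration, the design-time definition of such a virtual table (a file like VT_IQ_PRICES.hdbvirtualtable) could look like the sketch below – SYSRDL#CG_SOURCE is the remote source provided for the HANA DB-managed data lake and SYSRDL#CG is the schema owning the remotely created objects; the exact notation, in particular the empty database token, should be checked against the Virtual Tables (.hdbvirtualtable) documentation:

VIRTUAL TABLE "tanker::VT_IQ_PRICES"
  AT "SYSRDL#CG_SOURCE"."<NULL>"."SYSRDL#CG"."IQ_PRICES"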

After the first deployment, I partitioned my HANA Cloud table (HC_PRICES) by YEAR, with selected partitions using Native Storage Extension (NSE):

ALTER TABLE  TANKER."tanker::HC_PRICES"
PARTITION BY RANGE (YEAR) (
	PARTITION 2022 <= VALUES < 9999,
	PARTITION VALUE = '2021',
	PARTITION VALUE = '2020' PAGE LOADABLE,
	PARTITION VALUE = '2019' PAGE LOADABLE,
	PARTITION OTHERS PAGE LOADABLE
);


I had to add the ALTER object privilege to a role defined within the CAP model. That role was assigned to the operational user, who could then perform the partitioning operation.
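In plain SQL terms, the privilege added to that CAP-defined role corresponds to something like the grant below. This is shown only to illustrate the required privilege – in the actual project it is declared in the role (.hdbrole) definition and granted via HDI, rather than executed manually:

-- Equivalent object privilege required by the operational user (illustrative only)
GRANT ALTER ON TANKER."tanker::HC_PRICES" TO <operational user>;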

Refer to the embedded video for visualisation – direct jump link to relevant video section.

Query performance and effectiveness of pruning


I will now run an aggregation query against the calculation view with pruning defined – each of the queries below retrieves a data set exclusively from HC (YEAR=2020), HDL IQ (YEAR=2017), or HDL Files (YEAR=2014). Next to each execution I have provided some runtime results (without any tuning):

Queries against the CalcView with pruning, and their runtime results:

/* Execute CV query for data set in HANA Cloud only */
select year, month, tier, count(*), avg(e5) from "tanker::CV_PRICES_UNION" where year in ('2020') group by year, month, tier order by year, month, tier;


Client elapsed time:    678.0 ms

Statement execute time: 676.3 ms elapsed time

Peak memory consumed:   24.00 MB

/* Execute CV query for data set in HDL IQ only */

select year, month, tier, count(*), avg(e5)
from "tanker::CV_PRICES_UNION"
where year in ('2017')
group by year, month, tier
order by year, month, tier;


Client elapsed time:    2.262 s

Statement execute time: 2.260 s elapsed time

Peak memory consumed:   1.443 MB

/* Execute CV query for data set in HDL Files only */

select year, month, tier, count(*), avg(e5)
from "tanker::CV_PRICES_UNION"
where year in ('2014')
group by year, month, tier
order by year, month, tier;


Client elapsed time:    30.10 s

Statement execute time: 30.10 s elapsed time

Peak memory consumed:   1.443 MB


As you can see, and as expected, statement execution times grow as we travel down the data pyramid. That is because we explicitly direct the query only to the union node which contains the particular “data tier”.
It is also important to note that the performance of HDL – both IQ and Files – is affected by the configured capacity. The measurements above were obtained with HDL configured with 1 coordinator node (2 vCPUs) and 1 worker node (8 vCPUs). Experiments with higher CPU allocation returned better results – for example, increasing the worker node capacity to 16 vCPUs reduced the statement execute time by 40% for the HDL IQ layer and by 27% for HDL Files.

Now, let’s repeat the test with the calculation view without union node pruning:

Queries against the CalcView without pruning, and their runtime results:

/* Execute CV query for data set in HANA Cloud only */

select year, month, tier, count(*), avg(e5)
from "tanker::CV_PRICES_UNION_NO_PRUNING"
where year in ('2020')
group by year, month, tier 
order by year, month, tier;


Client elapsed time:    34.77 s

Statement execute time: 34.77 s elapsed time

Peak memory consumed:   24.20 MB

/* Execute CV query for data set in HDL IQ only */

select year, month, tier, count(*), avg(e5)
from "tanker::CV_PRICES_UNION_NO_PRUNING"
where year in ('2017')
group by year, month, tier 
order by year, month, tier;


Client elapsed time:    36.37 s

Statement execute time: 36.37 s elapsed time

Peak memory consumed:   2.422 MB

/* Execute CV query for data set in HDL Files only */

select year, month, tier, count(*), avg(e5)
from "tanker::CV_PRICES_UNION_NO_PRUNING"
where year in ('2014')
group by year, month, tier 
order by year, month, tier;


Client elapsed time:    32.55 s

Statement execute time: 32.54 s elapsed time

Peak memory consumed:   2.424 MB


Without pruning, queries are always executed against all union nodes, and the slowest one dictates the overall response time.

The above simple test shows the importance of using optimisation techniques like union node pruning when combining any data sources, but especially when the data sources have very different performance characteristics.

Regardless of the performance differences, we can now access a data set distributed across various tiers of the data pyramid – and do so in a manner completely transparent to consumers of that information. We can, of course, add a filter which would by default exclude data from the bottom tier unless explicitly requested.
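For example (a sketch using the TIER column already exposed by the calculation view), a consumer-facing query could skip the file tier by default:

/* Default reporting query: exclude the HDL Files tier unless explicitly requested */
select year, month, tier, count(*), avg(e5)
from "tanker::CV_PRICES_UNION"
where tier <> 'SOF_TIER'
group by year, month, tier
order by year, month, tier;

Whether such a filter also triggers union node pruning depends on the pruning configuration; at a minimum, it keeps the slowest tier out of the result unless a user asks for it.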

While the pruning technique described above works well for avoiding the execution of queries against a particular union node, it does so only where the query criteria contain only YEAR values assigned to one union node. For example, if my query includes YEAR >= 2017, the query will be executed against every union node.

This raises another question – what happens when data is – albeit briefly – duplicated across tiers while I am moving it between them? How can I avoid fetching the data twice?

Before I explain my approach, let’s talk about data movements.

Moving data


While awaiting the readiness of a data tiering service (the hypothetical “service” approach mentioned earlier), I decided to use the “native SQL” approach, for example:

  1. From HANA Cloud to HDL IQ: INSERT data to HDL IQ via virtual table using SELECT from HANA Cloud table.
  2. From HDL IQ to HDL Files: use UNLOAD to copy data from HDL IQ table to files in HDL Files.

Irrespective of the method chosen, the movement itself may not support distributed transaction consistency. When moving data between two tiers, the following activities occur:

  1. Data subset A is extracted from Tier#1 and copied to Tier#2.
  2. Copy is committed in Tier#2.
  3. Data is deleted from Tier#1.
  4. Data deletion is committed in Tier#1.

Unless the two commits above can be executed as part of the same transaction (using a two-phase commit), data subset A may exist in both Tier#1 and Tier#2 at the same time (until the deletion from Tier#1 takes effect). As it stands today, Smart Data Access does not provide distributed transaction consistency support for any sources. You can execute the following commands in the context of a single transaction in HANA Cloud:

Copy data from HANA Cloud to HDL IQ:

insert into "tanker::VT_IQ_PRICES" 
 (year, month, date_time, station_uuid, diesel, e5, e10, dieselchange, e5change, e10change) 
 (
   select year, month, date_time, station_uuid, diesel, e5, e10, dieselchange, e5change, e10change
   from "tanker::HC_PRICES"
   where year = 2019
 );

Delete data from HANA Cloud

delete from "tanker::HC_PRICES"
where year = 2019;

Commit transaction.

A local HANA Cloud commit or rollback will trigger the corresponding remote HDL commit or rollback on a “best effort” basis. This means that transactional consistency may seem to work in most cases, but in the worst case we can experience data loss – for example:

  • The remote (HDL IQ) commit failed, but the local (HANA Cloud) commit succeeded – in such a scenario, you have deleted the data in HANA Cloud, but it has not been successfully copied to HDL IQ…

Therefore, data movement needs to be a 3-step approach:

  1. Copy data between tiers.
  2. Validate data has been successfully copied and change classification of data tier in your configuration/pruning table.
  3. Delete data from original tier.


You can check capabilities supported by your remote source as per List Current Properties for a Remote Source.

It is worth mentioning that the method I presented here for data movement from the HANA database to HDL IQ is simple, but it may prove problematic where data sets are very large. For example, I had to break the INSERT up to a month-by-month basis in order to avoid hitting the statement memory limit (on my 60 GB HANA instance). This is where using EXPORT (for example to HDL Files) from the HANA database and then LOAD on HDL IQ from HDL Files could prove beneficial.
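A sketch of that alternative path, assuming the SAP HANA Cloud EXPORT INTO statement with an HDL Files (hdlfs://) target – the endpoint and path are placeholders, and the exact format options and source clause should be checked against the EXPORT INTO documentation:

/* Step A (HANA Cloud): export the 2019 slice as CSV directly into the HDL Files container */
EXPORT INTO CSV FILE 'hdlfs://<files-endpoint>/TankerDataTiering/Export/prices_2019.csv'
  FROM (SELECT * FROM "tanker::HC_PRICES" WHERE YEAR = 2019);

/* Step B (HDL IQ, via REMOTE_EXECUTE): load the exported file with LOAD TABLE,
   as sketched earlier in this blog */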

Changing visibility of data

Union node pruning as described above is useful for performance reasons, but it does not address an important aspect of data consistency. As illustrated above, when we start moving data between tiers, we will have – however briefly – a problem of data duplication.

Therefore, I have enhanced the pruning concept to ensure that only one data tier is queried for a particular data subset. I created a configuration table CFG_TIER_PRICES with the following content:

YEAR   TIER
2014   SOF_TIER
2015   SOF_TIER
2016   IQ_TIER
2017   IQ_TIER
2018   IQ_TIER
2019   HC_TIER
2020   HC_TIER
2021   HC_TIER


I then use this table in each union node of my calculation view to restrict the selection from a particular data source to the years that should be sourced from that tier. In the screenshot below, I am showing the join definition within the node IQ_TIER:

  • filter data in table tanker::CFG_TIER_PRICES on column TIER = ‘IQ_TIER’, and
  • join tables tanker::CFG_TIER_PRICES and tanker::VT_IQ_PRICES on column YEAR


The same logic applies to the other union nodes, with the only difference being the value of TIER in the filter.

In my movement scenario, I introduce the additional step of updating the configuration table:

  1. Copy data from HANA Cloud to the HDL IQ table (for YEAR=2019).
  2. Validate that the data has been copied.
     • If yes -> update the configuration table:

upsert "tanker::CFG_TIER_PRICES" (YEAR, TIER)
values (2019, 'IQ_TIER')
where year = 2019;

       As a result, queries for data in YEAR 2019 will now be answered by the IQ_TIER node only (and no longer by HC_TIER, as before the change). Proceed to Step 3.
     • If no -> identify the reasons and repeat Step 1.
  3. Delete the data from the HANA Cloud table.

As a result, a query for YEAR=2019 will only consider data from the IQ_TIER node, that is the VT_IQ_PRICES table.

Result of the query before and after the data move:

/* Check results of query using CalcView */

select year, tier, count(*), avg(e5)
from "tanker::CV_PRICES_UNION"
group by year, tier
order by year, tier; 
Before movement:

YEAR   TIER       COUNT(*)      AVG(E5)
2014   SOF_TIER   16,264,865    1.4839
2015   SOF_TIER   35,727,313    1.3469
2016   IQ_TIER    44,340,894    1.2838
2017   IQ_TIER    58,402,045    1.3529
2018   IQ_TIER    65,544,845    1.4370
2019   HC_TIER    82,326,042    1.4156
2020   HC_TIER    100,117,404   1.2713
2021   HC_TIER    90,051,867    1.5308

After movement (only YEAR 2019 changes tier):

YEAR   TIER       COUNT(*)      AVG(E5)
2014   SOF_TIER   16,264,865    1.4839
2015   SOF_TIER   35,727,313    1.3469
2016   IQ_TIER    44,340,894    1.2838
2017   IQ_TIER    58,402,045    1.3529
2018   IQ_TIER    65,544,845    1.4370
2019   IQ_TIER    82,326,042    1.4156
2020   HC_TIER    100,117,404   1.2713
2021   HC_TIER    90,051,867    1.5308

 

Summary

The video below provides a complete walkthrough of the key activities described above (since the overall video is over 30 minutes long, I have provided jump links to go straight to a particular section):

  1. Introduction: Data Pyramid and patterns of acquisition, movement and consumption (link)
  2. My scenario – Data and Data Model (direct jump link)
  3. Data Model implementation – SAP HANA Cloud, HANA database (direct jump link)
  4. Data Model implementation – SAP HANA Cloud, Data Lake (direct jump link)
  5. Working with SAP HANA Cloud, Data Lake Files (direct jump link)
  6. Summary (direct jump link)

Hope you found my musings informative. My intent was to provide food for thought, give you some ideas, and share the existing capabilities of SAP HANA Cloud and SAP BTP. The key messages I would like you to take away:

  • SAP HANA Cloud provides a full implementation of the data pyramid concept – including integration with 3rd-party object stores.
  • Integration between the SAP HANA Cloud, SAP HANA database, Data Lake Relational Engine and Data Lake Files is straightforward and, for the HANA DB-managed data lake deployment, is provided automatically.
  • There are several capabilities available to interact with each tier and to move data into and between the tiers.
  • Data federation from SAP HANA Cloud, Data Lake Relational Engine and Files (with SQL on Files and SDA) to the HANA database provides complete and transparent access to data across the pyramid.

You can see how this flexible access to data across the pyramid can help in building sophisticated scenarios in the blog SAP BTP Data & Analytics Showcase – Machine Learning via Python in SAP Data Warehouse Cloud by wei.han7.

Refer to these useful references for additional information:

 
