swapan_saha
Product and Topic Expert
Many SAP HANA customers use SAP HANA smart data integration to simplify their data integration landscape and run real-time applications, as announced in a previously published blog. Starting with SAP HANA Rev 122.04, HANA smart data integration introduced task partitioning in the Flowgraph Editor. Task partitioning helps customers load large initial data sets from various supported sources into SAP HANA faster, while making better use of the available memory. While SAP HANA Rev 122.04 introduced full partitioning support in the Flowgraph Editor, the SAP HANA Rev 122.05 release introduced single-level (or single-column) partitioning in the replication task, and SAP HANA Rev 122.06 includes multi-level (or multi-column) partitioning in replication tasks.

The goals of these enhancements are:

  1. Optimize the initial loading of large volumes of data from various supported sources into SAP HANA, in terms of loading time, memory utilization in SAP HANA, and resource utilization at the data sources

  2. Support a partitioned HANA column table as input with more than 2 billion rows in total, but fewer than 2 billion rows in each partition (see the sketch after this list)
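
As a loose illustration of the second goal, here is a minimal SQL sketch of a range-partitioned HANA column table sized so that no single partition exceeds the 2 billion row limit. The schema, table, and column names and the boundary values are hypothetical, not taken from the blog:

```sql
-- Hypothetical target table for ~3.5 billion rows: the key range is
-- split so each partition stays below 2 billion rows.
CREATE COLUMN TABLE "TARGET"."ORDERS" (
    "ORDER_ID" BIGINT NOT NULL,
    "AMOUNT"   DECIMAL(15,2)
)
PARTITION BY RANGE ("ORDER_ID") (
    PARTITION 1          <= VALUES < 1750000000,
    PARTITION 1750000000 <= VALUES < 3500000001,
    PARTITION OTHERS
);
```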


Using this newly introduced feature, internal testing teams and early customer adopters have experienced 2-10 times better performance completing large initial data loads using task partitioning while running the tasks in parallel. If initial loading time is not critical, customers can partition the data and run the tasks sequentially instead. This reduces memory consumption at the target SAP HANA system and avoids out-of-memory errors when the available memory is not sufficient.

To illustrate this task partitioning feature, let’s use two sample internal test scenarios. In the first scenario, we use a narrow table (few columns) and in the second scenario a wide table (many columns).

Details for Scenario 1

  • Number of rows = 3.5 billion

  • Number of columns = 14

  • Data size at source = 500 GB

  • Number of partitions = 12


For this scenario, we partitioned the source data into equal value ranges across all twelve partitions and executed the tasks both sequentially and in parallel; a sketch of deriving such boundaries follows, and the results are presented in the table after it.
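
As a side note, here is a minimal sketch of how such equal range boundaries can be derived, assuming a numeric key column; the virtual table and column names are hypothetical:

```sql
-- Hypothetical: read the key's actual min and max at the source, then
-- split that interval evenly across the 12 task partitions.
SELECT MIN("ORDER_ID") AS key_min,
       MAX("ORDER_ID") AS key_max
  FROM "SDI_SOURCE"."V_ORDERS";

-- If the ~3.5 billion rows are spread evenly over the key range, each
-- of the 12 partitions covers (key_max - key_min) / 12 key values,
-- i.e. roughly 292 million rows per partition.
```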

Mode                                         | Throughput (GB/hr) | Peak Memory at Target HANA (GB)
Source partitioned and executed sequentially | 38                 | 183
Source partitioned and executed in parallel  | 136                | 650

Without source partitioning, this scenario would have failed with an out-of-memory error on the test HANA server.

Details for Scenario 2

  • Number of rows = 66 million

  • Number of columns = 227

  • Data size at source = 500 GB

  • Number of partitions = 8


For this case, the source data was partitioned in the same way as in Scenario 1. The corresponding loading throughput and peak memory consumption are summarized in the table below.

Mode                                         | Throughput (GB/hr) | Peak Memory at Target HANA (GB)
No source partitioning                       | 76                 | 385
Source partitioned and executed sequentially | 77                 | 51
Source partitioned and executed in parallel  | 476                | 383

These two sample results show how loading large source data with the task partitioning feature improves performance. The first scenario shows a throughput improvement from 38 GB/hr to 136 GB/hr, whereas the second scenario shows throughput increasing from 77 GB/hr to 476 GB/hr. Task partitioning allows SAP HANA to read, process, and commit the partitioned virtual table input sources in parallel. Notice in the second scenario that running the replication task sequentially over the same amount of data uses only 51 GB of memory at the target. The second scenario also shows that partitioning and executing in parallel achieves much higher throughput than loading without task partitioning (476 GB/hr vs. 76 GB/hr) while consuming about the same memory on the HANA side.

You can define task partitions in the Partitions tab within the Replication Editor. Two partition types are available: range partitions and list partitions.
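
Conceptually, each task partition becomes an independent, filtered read of the virtual table, which is what lets the loads run in parallel. Here is a rough sketch of the filters the two partition types imply, with hypothetical object and column names:

```sql
-- Range partition: rows whose key falls in a half-open interval.
SELECT * FROM "SDI_SOURCE"."V_ORDERS"
 WHERE "ORDER_ID" >= 1 AND "ORDER_ID" < 300000000;

-- List partition: rows whose column value is in an enumerated set.
SELECT * FROM "SDI_SOURCE"."V_ORDERS"
 WHERE "REGION" IN ('EMEA', 'APJ');
```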

This feature is described in sections 6.1.3 and 6.1.4 of Best Practices for SAP HANA Smart Data Integration and SAP HANA Smart Data Quality.

With this enhancement, we believe all our customers will benefit from optimized HANA memory utilization when loading large initial data sets, and will be able to address HANA partitioned table scenarios with more than 2 billion records.
4 Comments
former_member183326
Active Contributor
Best to jump straight to 122.05.

Hi Swapan,

It seems this task partitioning won't be possible if the source is actually a calculation view. Please advise if any alternatives are possible to improve performance for the initial load when the source is a calculation view.

Thanks

Siva

The HANA SDI & Modelling Guide states:

Partitioning at the task level is useful when your input data has several million rows or more. Currently, SAP HANA has a limitation of processing more than two billion rows. Partitioning your data at the task level will likely reduce the load to less than two billion rows per partition. Typically, you only see a benefit of using task level partitioning with extremely large data sets.

Is the reference to 'several million' rows in the documentation a typo, when the guide goes on to talk about the 2 billion row limitation, and says that task partitioning should only be used on extremely large data sets?
sumitbajaj599
Discoverer
Has anyone worked on SDI flowgraph partitioning in HANA 1.0 SP12? We are using a calculation view which has current_date as the input parameter $$IP_DELTA_TIMESTAMP$$. We use this input parameter in the flowgraph to pass a value. When we execute the flowgraph without partitioning it runs fine, but when we apply partitioning it throws the error below.

Instantiation of calculation model failed;exception 306106: Undefined variable: $$IP_DELTA_TIMESTAMP$$. Variable is marked as required but not set in the query (please check lines: )