HANA PARTITIONING CONCEPTS, SQLs, TECHNICAL BACKGROUND
Here I wish to discuss HANA partitioning concepts: how to determine the number of partitions, how to determine the optimal column, how HANA handles the partition command, the types of basic partitions, and so on.
Why do we need to partition tables in HANA?
HANA cannot hold more than 2 billion records per column store table (or per partition), so any table that has crossed 1.5 billion rows is a valid candidate for partitioning. Partitioning also helps with parallel processing of queries: if the partitions are distributed across nodes in a HANA scale-out scenario, queries produce results faster.
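To spot candidate tables proactively, the record count per table and partition can be read from the M_CS_TABLES monitoring view; a minimal sketch using the 1.5 billion threshold discussed here:
SELECT SCHEMA_NAME, TABLE_NAME, PART_ID, RECORD_COUNT FROM M_CS_TABLES WHERE RECORD_COUNT > 1500000000 ORDER BY RECORD_COUNT DESC;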
CASE STUDY: Here I am going to change an existing table that is partitioned on all the key fields so that it is partitioned on 1 optimum key column instead. That is, the partitioning of table EDIDS is going to be changed from HASH 21 MANDT,DOCNUM,LOGDAT,LOGTIM,COUNTR to HASH 7 DOCNUM. The main reason: many SQLs and joins running against this table use a WHERE clause on DOCNUM.
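Before making the change, the current partition specification can also be confirmed via SQL; a sketch reading the PARTITION_SPEC column of the TABLES system view:
SELECT TABLE_NAME, PARTITION_SPEC FROM TABLES WHERE SCHEMA_NAME = 'SAPERP' AND TABLE_NAME = 'EDIDS';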
Steps:
Log into studio, display the table, and make a note of the details.

Find the table and display it. Observe the items highlighted below. Column Key: the primary key fields of the table.

Check the runtime properties:
Things to observe here:
Number of entries = row count (3.1 billion)
Memory consumption in Main and in Delta is also indicated there.
Partition specification: 21 partitions distributed across 7 slave nodes, 3 partitions per node.
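The same partition-to-host mapping can be pulled with SQL; a sketch against M_CS_TABLES, which lists one row per partition along with its host:
SELECT HOST, PART_ID, RECORD_COUNT, MEMORY_SIZE_IN_TOTAL FROM M_CS_TABLES WHERE SCHEMA_NAME = 'SAPERP' AND TABLE_NAME = 'EDIDS' ORDER BY HOST, PART_ID;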
So how does HANA handle this partitioning?
When the ALTER command specified earlier gets executed, the following things happen in the backend of the HANA DB.
We can check the status of the threads from the Performance tab, filtering on the master node and the server on which this table sits.

Look for the SQL that we executed: choose the master node and the slave node to which we moved the partitions.

Click OK.
Sort by application user in the Performance tab. Though I logged in as SYSTEM for the re-partition, HANA also knows the WTS via which I am connecting to this DB.

In the background:
1. My SQL gets prepared, and a delta merge for that table is executed first.
Here is the delta merge job that got triggered, and the other jobs that will follow it.
The delta merge job below has completed 59/63 steps. All the remaining jobs have 30 steps each and will run in parallel once the delta merge completes.
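If you prefer to get the delta merge out of the way before firing the re-partition, it can be triggered manually, and the merge/split job progress can be followed in M_JOB_PROGRESS (the view behind the progress figures shown here):
MERGE DELTA OF "SAPERP"."EDIDS";
SELECT JOB_NAME, OBJECT_NAME, CURRENT_PROGRESS, MAX_PROGRESS FROM M_JOB_PROGRESS WHERE OBJECT_NAME LIKE '%EDIDS%';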

So why does the max progress for _SYS_SPLIT_EDIDS show as 30? It is actually the total number of columns in that table (including internal columns).
I have used M_CS_ALL_COLUMNS rather than M_CS_COLUMNS because the former also holds HANA internal columns, unlike the latter.
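To verify the figure of 30 yourself, count the columns the split jobs have to process; a sketch (DISTINCT guards against the per-partition rows in the view):
SELECT COUNT(DISTINCT COLUMN_NAME) FROM M_CS_ALL_COLUMNS WHERE SCHEMA_NAME = 'SAPERP' AND TABLE_NAME = 'EDIDS';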

Let's check the threads now.

After 6 hours, my re-partition completed. Now I distribute the 7 partitions across the 7 HANA nodes. Moving back the partitions:
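These commands follow the same MOVE PARTITION syntax used below for the consolidation step; a sketch with placeholder host names, one target node per partition:
ALTER TABLE "SAPERP"."EDIDS" MOVE PARTITION 2 TO '<server_name_2>:3<nn>03' PHYSICAL;
ALTER TABLE "SAPERP"."EDIDS" MOVE PARTITION 3 TO '<server_name_3>:3<nn>03' PHYSICAL;
...and so on for partitions 4 to 7, each to its own node.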

Post re-partition and table distribution to different nodes.


Here we see the total of 3.1 billion records spread across the 21 partitions, with around 148 million records per partition.
For the partition command to work, we first need to move all the partitions from the different nodes to a single node.
Below I choose to leave partitions 1, 2, and 3 on the same server and move the other partitions from the other servers to this server.
Sample SQL:
ALTER TABLE "SAPERP"."EDIDS" MOVE PARTITION 4 TO '<server_name>:3<nn>03' PHYSICAL;
ALTER TABLE "SAPERP"."EDIDS" MOVE PARTITION 5 TO '<server_name>:3<nn>03' PHYSICAL;
ALTER TABLE "SAPERP"."EDIDS" MOVE PARTITION 6 TO '<server_name>:3<nn>03' PHYSICAL;
ALTER TABLE "SAPERP"."EDIDS" MOVE PARTITION 7 TO '<server_name>:3<nn>03' PHYSICAL;
ALTER TABLE "SAPERP"."EDIDS" MOVE PARTITION 8 TO '<server_name>:3<nn>03' PHYSICAL;
ALTER TABLE "SAPERP"."EDIDS" MOVE PARTITION 9 TO '<server_name>:3<nn>03' PHYSICAL;
ALTER TABLE "SAPERP"."EDIDS" MOVE PARTITION 10 TO '<server_name>:3<nn>03' PHYSICAL;
ALTER TABLE "SAPERP"."EDIDS" MOVE PARTITION 11 TO '<server_name>:3<nn>03' PHYSICAL;
ALTER TABLE "SAPERP"."EDIDS" MOVE PARTITION 12 TO '<server_name>:3<nn>03' PHYSICAL;
ALTER TABLE "SAPERP"."EDIDS" MOVE PARTITION 13 TO '<server_name>:3<nn>03' PHYSICAL;
ALTER TABLE "SAPERP"."EDIDS" MOVE PARTITION 14 TO '<server_name>:3<nn>03' PHYSICAL;
ALTER TABLE "SAPERP"."EDIDS" MOVE PARTITION 15 TO '<server_name>:3<nn>03' PHYSICAL;
ALTER TABLE "SAPERP"."EDIDS" MOVE PARTITION 16 TO '<server_name>:3<nn>03' PHYSICAL;
ALTER TABLE "SAPERP"."EDIDS" MOVE PARTITION 17 TO '<server_name>:3<nn>03' PHYSICAL;
ALTER TABLE "SAPERP"."EDIDS" MOVE PARTITION 18 TO '<server_name>:3<nn>03' PHYSICAL;
ALTER TABLE "SAPERP"."EDIDS" MOVE PARTITION 19 TO '<server_name>:3<nn>03' PHYSICAL;
ALTER TABLE "SAPERP"."EDIDS" MOVE PARTITION 20 TO '<server_name>:3<nn>03' PHYSICAL;
ALTER TABLE "SAPERP"."EDIDS" MOVE PARTITION 21 TO '<server_name>:3<nn>03' PHYSICAL;
Output:

After this, we have all the partitions on one node in the scale-out environment.

Now I proceed with the partitioning command.
ALTER TABLE SAPERP.EDIDS PARTITION BY HASH (DOCNUM) PARTITIONS 7;

DETERMINING THE OPTIMUM COLUMN and NUMBER FOR PARTITION
So why would I consider DOCNUM the optimal column to partition on, and why was the number of partitions chosen as 7?
In order to choose a column for partitioning, we need to analyze the table usage in depth.
Analyzing and understanding the SQLs on the table:
1. Check with the functional team. Ask them which columns they query frequently and which column is always part of the WHERE clause. If they are not very clear on this, we can help them with plan cache data.
E.g.: If a table keeps getting queried with a WHERE clause on year, and the table has data since 2015, a RANGE partition on the YEAR column will create a clear demarcation between hot and cold data.
Sample SQL for range partition:
ALTER TABLE SAPERP.BSEG PARTITION BY RANGE (BUKRS) (PARTITION 0 <= VALUES < 1000, PARTITION 1000 <= VALUES < 1101, PARTITION 2000 <= VALUES < 4000, PARTITION 4000 <= VALUES < 4031, PARTITION 4033 <= VALUES < 4055, PARTITION 4058 <= VALUES < 4257, PARTITION VALUE = 4900, PARTITION OTHERS);
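For the YEAR example mentioned in point 1, the range partition could look like the sketch below; the table and the GJAHR-style year column are hypothetical and must be part of the primary key:
ALTER TABLE SAPERP.<TABLE_NAME> PARTITION BY RANGE (GJAHR) (PARTITION '2015' <= VALUES < '2018', PARTITION '2018' <= VALUES < '2021', PARTITION OTHERS);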
2. Check M_SQL_PLAN_CACHE.
With the help of the query below, we can get the most expensive statements against the table and identify the WHERE clauses they use:
SELECT STATEMENT_STRING, EXECUTION_COUNT, TOTAL_EXECUTION_TIME FROM M_SQL_PLAN_CACHE WHERE STATEMENT_STRING LIKE '%<TABLE_NAME>%' ORDER BY TOTAL_EXECUTION_TIME DESC;

3. How do we check which columns are getting joined on this table, using join statistics?
Download the SQL HANA_SQL_Statistics_JoinStatistics_1.00.120+ from OSS note 1969700 and modify the SQL as below.
Go to the modification section.

Modify the modification section as per your requirement.

From the above we can see that most of the joins happen on DOCNUM, and hence a HASH on this column will make the query runtime faster.
4. There can be cases where we do not have enough information even from here, or the output is empty. In that case we have to enable the SQL trace for the specific table.
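A minimal sketch for switching the SQL trace on and off, assuming the 'tables' filter key of the [sqltrace] section (verify the key names against the documentation for your revision):
ALTER SYSTEM ALTER CONFIGURATION ('indexserver.ini', 'SYSTEM') SET ('sqltrace', 'trace') = 'on', ('sqltrace', 'tables') = 'EDIDS' WITH RECONFIGURE;
-- switch it off again once enough statements have been captured:
ALTER SYSTEM ALTER CONFIGURATION ('indexserver.ini', 'SYSTEM') SET ('sqltrace', 'trace') = 'off' WITH RECONFIGURE;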

5. If there are no specific range values that are frequently queried, and most of the columns are used most of the time, a HASH algorithm will be the best fit.
It is similar to a round-robin partition, but data is distributed according to the hash algorithm on one or two designated primary key columns.
- A HASH partition can only be created on PRIMARY KEY fields.
- It is optimal not to choose more than 2 primary key fields for HASH.
- Within the primary key, check which column has the maximum number of distinct values. That specific column can be chosen for the re-partition.
To determine which primary key column to choose for the re-partition, perform the below:
a. Load the table fully into memory.
b. Open the table from studio and find the primary key of the table.

c. Find the primary key column which has the maximum number of distinct records.
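A sketch of steps a and c in SQL, using the EDIDS primary key columns from this case study (the COUNT(DISTINCT ...) scan is heavy on a 3-billion-row table, so run it in a quiet window):
LOAD SAPERP.EDIDS ALL;
SELECT COUNT(DISTINCT MANDT) AS MANDT, COUNT(DISTINCT DOCNUM) AS DOCNUM, COUNT(DISTINCT LOGDAT) AS LOGDAT, COUNT(DISTINCT LOGTIM) AS LOGTIM, COUNT(DISTINCT COUNTR) AS COUNTR FROM SAPERP.EDIDS;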

Hence a HASH on DOCNUM is the best choice for the EDIDS table.
ALTER TABLE SAPERP.EDIDS PARTITION BY HASH (DOCNUM) PARTITIONS 7;
NOTE: Whenever a table is queried and that table or partition is not present in memory, HANA automatically loads it into memory, either partially or fully. If the table is partitioned, then whichever row is queried, the specific partition holding the required data gets loaded into memory. Even if you need only 1 row from a partition holding 1 billion records, that entire partition gets loaded, partially or fully. In HANA we can never load a single row alone into memory from a table.
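The load state per partition can be checked directly; a sketch against M_CS_TABLES, whose LOADED column shows NO, PARTIALLY, or FULL:
SELECT PART_ID, LOADED, RECORD_COUNT FROM M_CS_TABLES WHERE SCHEMA_NAME = 'SAPERP' AND TABLE_NAME = 'EDIDS' ORDER BY PART_ID;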
So, how do we determine the number of partitions in a scale-out scenario?
The number of partitions should be chosen such that the load gets distributed across the different HANA nodes in a scale-out scenario. This applies to any table which is not yet partitioned and is nearing 1.5 billion rows.
It is recommended to always go for n-1 partitions, where n is the total number of HANA nodes including the master node. Why -1? Because we should not have any partitions sitting on the master node; it is always better to leave the master node alone to perform transaction processing.
Eg:
HANA with 1 master node, 3 slaves, 1 standby => the number of partitions is 3. (Never count the standby node, as it does not have any data volumes assigned to it.)
HANA with 9 nodes = 1 master, 7 slaves, 1 standby => the number of partitions should be 7, distributed across the worker nodes via the MOVE PARTITION command shown above.
GENERAL RULES OF THUMB FOR PARTITIONING
- We can specify any number of partitions, but it is generally not advisable to have tables with more than 100 partitions.
- It is recommended to have at least 100 million records per partition, and each partition's memory utilization should not exceed 50 GB at any point in time.
- Any non-partitioned table with more than 2.14 billion records will make the system crash, hence it is best to partition it when it reaches 1.5 billion.
- For tables which are already partitioned, we have to re-partition if the current partitioning is not optimal or if the record count per partition exceeds 1.5 billion, in which case we need to increase the number of partitions.
- Always choose the new number of partitions as a multiple or divisor of the current number. E.g., if a table has 7 partitions, you can increase it to 7*2 = 14, and so on. This allows the re-partitioning job to run in parallel.
- HANA will crash if the number of rows in a partition exceeds 2.14 billion.
TYPES OF BASIC PARTITIONS AND THEORY INVOLVED
NOTE:
1. If we have primary key fields, we can use only HASH or RANGE.
2. ROUNDROBIN can only be used when we do not have any primary key fields.
3. A RANGE partition cannot be based on non-primary-key fields.
4. A HASH partition, like a RANGE partition, can only be done on primary key fields.
5. The HASH algorithm is based on the primary key field with the maximum number of distinct records, as already discussed, which can be determined after loading the table into memory.
6. We can find the most frequently run SQL on any table from M_SQL_PLAN_CACHE as shown, which might also help in determining the best algorithm based on the WHERE clause in the SQL:
SELECT STATEMENT_STRING, EXECUTION_COUNT, TOTAL_EXECUTION_TIME FROM M_SQL_PLAN_CACHE WHERE STATEMENT_STRING LIKE '%<TABLE_NAME>%' ORDER BY TOTAL_EXECUTION_TIME DESC;
Related reading: HANA Partitioning - Partitioning Performance Tips
Thanks for reading!! Please leave your comments and questions here!!