Data and Analytics Discussions
Get involved in Data & Analytics discussions. Engage in vibrant conversations, share insights, and explore perspectives on data and analytics topics.
cancel
Showing results for 
Search instead for 
Did you mean: 

Best approach to compare Huge HANA Tables

kirankumar440
Participant
618

Hello All,

I have 2 tables with 100 columns each with same structure & I have 70 Million rows in each table.

I wanted to do a apple to apple comparisons to make sure the data is same b/w those 2 tables.. But the below Query taking lot of time.

Select * From T1 Minus select * from T2

Any faster way to achieve this ?

Br

Kiran K

10 REPLIES 10

FrankRiesner
Product and Topic Expert
Product and Topic Expert
0 Kudos
548

Hi Kiran - I propose you model a Calculation View with calculated columns which do the calculation KF T1 - KF T2. This new column could be evaluated on a more aggregated level and you could analyze in more detail only if there are results <> 0.

KR Frank

0 Kudos
525

Thanks @FrankRiesner for the response. Do you mean the graphical CV with Join condition on 60 dimensions b/w t1,t2 ( lets say out of 100 columns 60 are Dims , 40 are measures ) would be much faster than the comparison using MINUS approach ? As I wanted to do the data comparison for each column including DIMs as well along with measures .

Thanks

Kiran  

XaviPolo
Active Contributor
510

I believe your option is the correct one. EXCEPT/MINUS is what I would use for efficiency reasons. I would discard the JOIN due to its poorer performance in HANA compared to columnar operations.

Perhaps you could consider creating a HASH of the row using something like HASH_MD5() and comparing them instead of using all the columns. However, I’m not sure if you would gain anything from this. Try creating a PLAN and evaluate the costs for comparison.

The more fields that are indexed, the faster the performance will be.

Another thing you could do is run the process in batches, using an indexed field or a primary key. For example, you could first compare the records from the year 2024, then 2023, and so on, using SQL Script.

Regards,

 


XP,

Seidor - Human Focused, Technology Experts

0 Kudos
500

Thank you @XaviPolo for the inputs.

I will explore on this function HASH_MD5() which I have not used before & apply.

Currently I have tried the below approach via Storedproc that seems to be working much faster than MINUS, JOIN operations.

Select D1,D2,D3,.....D50 , Sum(Amount_new) , Sum(Amount_OLD)

From 

(

Select D1,D2,D3,.....D50, Amount as Amount_new , 0 as Amount_Old from Table_New

Union All

Select D1,D2,D3,.....D50, 0 as Amount_New , Amount as  Amount_Old  as from Table_old

)

Group by D1,D2,D3,.....D50

Having  Sum(Amount_new) <>  Sum(Amount_old)

 

 

Br

Kiran K

0 Kudos
487

Yes, anything columnar (like union) is faster.

Here you are not really comparing all fields, just the dimensions and a single measure (the aggregate).
If that's OK in your scenario, it's better than comparing all fields like MINUS does.


XP,

Seidor - Human Focused, Technology Experts

0 Kudos
481

Yes , I have just given as one example measure.

Infact I have included all of my Dims, Meaures with a OR clause at Where condition.

Br

Kiran K

0 Kudos
424

Hi @kirankumar440 

I'm glad it's working faster with all the measures, and that it's working for you in your scenario.

Just as a note, doing this is not exactly the same as a MINUS.
When you aggregate rows you're looking at a different version than the original, so you're not comparing exactly the same thing. Let me explain.
If in one table you have A,1 ; A,2 ; A;3 aggregated you will have A,1+2+3 = A,6 ... if in the other table you have A,1 ; A,5, aggregated you will have also A,6 so comparing aggregates will tell you it is the same, and it is true, the sum is the same, but not the rows themselves.

Again, you know your scenario, and maybe this is not the case, and comparing aggregates is enough for you.

Regards,


XP,

Seidor - Human Focused, Technology Experts

0 Kudos
396

Yes , but I have used that aggregation to combine the 2 rows out of UNION ALL into a Single row to have my measures from both the sources together in one row. That will help me to compare the KPI values.

In fact I have included all of my table columns( Dims+measures) in the both inner , outer queries as below. That aggregation is used to bring them into a Single line for easy comparison.

Happy learn , If I am missing anything still.

lets assume I have 50 DIms , 3 measures in both Table_New , Table_Old

Select D1,D2,D3,.....D50 , Sum(M1_new) , Sum(M2_new), SUM(M3_new),Sum(M1_old) , Sum(M2_old), SUM(M3_old),

From 

(

Select D1,D2,D3,.....D50, M1 as M1_new , M2 as M2_new , M3 as M3_new ,0 as M1_old,

0 as M2_old , 0 as M3_Old    from Table_New

Union All

D1,D2,D3,.....D50, 0 as M1_new , 0 as M2_new , 0 as M3_new ,M1 as M1_old,

M2 as M2_old , M3 as M3_Old 

from Table_old

)

Group by D1,D2,D3,.....D50

Having  Sum(M1_new ) <>  Sum(M1_old ) OR

        Sum(M2_new ) <>  Sum(M2_old ) OR

    Sum(M3_new ) <>  Sum(M3_old ) 

     

Br

Kiran K

 

0 Kudos
377

If you CANNOT have multiple rows for the same dimension values, then OK.

If you CAN have multiple rows, then what I mentioned only impacts if you can have multiple rows because:

TABLE A

D1D2D3M1
ABC1
ABC3

TABLE B

D1D2D3M1
ABC2
ABC2

UNION ALL

D1D2D3M1 OLDM1 NEW
ABC10
ABC30
ABC02
ABC02

AGGR

D1D2D3M1 OLDM1 NEW
ABC44

As you can see, the AGGR comes out the same, but the rows in the two tables were not the same. This doesn't happen with the MINUS.

Regards,


XP,

Seidor - Human Focused, Technology Experts

373

Okay , it make sense when we have duplicates  , Thank for taking your time in explaining the case with an example. Its clear now what you are referring.

Br

Kiran K

 

Br

Kiran K