2 weeks ago - last edited 2 weeks ago
Hello All,
I have 2 tables with 100 columns each with same structure & I have 70 Million rows in each table.
I wanted to do a apple to apple comparisons to make sure the data is same b/w those 2 tables.. But the below Query taking lot of time.
Select * From T1 Minus select * from T2
Any faster way to achieve this ?
Br
Kiran K
2 weeks ago
Hi Kiran - I propose you model a Calculation View with calculated columns which do the calculation KF T1 - KF T2. This new column could be evaluated on a more aggregated level and you could analyze in more detail only if there are results <> 0.
KR Frank
2 weeks ago
Thanks @FrankRiesner for the response. Do you mean the graphical CV with Join condition on 60 dimensions b/w t1,t2 ( lets say out of 100 columns 60 are Dims , 40 are measures ) would be much faster than the comparison using MINUS approach ? As I wanted to do the data comparison for each column including DIMs as well along with measures .
Thanks
Kiran
2 weeks ago
I believe your option is the correct one. EXCEPT/MINUS is what I would use for efficiency reasons. I would discard the JOIN due to its poorer performance in HANA compared to columnar operations.
Perhaps you could consider creating a HASH of the row using something like HASH_MD5() and comparing them instead of using all the columns. However, I’m not sure if you would gain anything from this. Try creating a PLAN and evaluate the costs for comparison.
The more fields that are indexed, the faster the performance will be.
Another thing you could do is run the process in batches, using an indexed field or a primary key. For example, you could first compare the records from the year 2024, then 2023, and so on, using SQL Script.
Regards,
2 weeks ago
Thank you @XaviPolo for the inputs.
I will explore on this function HASH_MD5() which I have not used before & apply.
Currently I have tried the below approach via Storedproc that seems to be working much faster than MINUS, JOIN operations.
Select D1,D2,D3,.....D50 , Sum(Amount_new) , Sum(Amount_OLD)
From
(
Select D1,D2,D3,.....D50, Amount as Amount_new , 0 as Amount_Old from Table_New
Union All
Select D1,D2,D3,.....D50, 0 as Amount_New , Amount as Amount_Old as from Table_old
)
Group by D1,D2,D3,.....D50
Having Sum(Amount_new) <> Sum(Amount_old)
Br
Kiran K
2 weeks ago
Yes, anything columnar (like union) is faster.
Here you are not really comparing all fields, just the dimensions and a single measure (the aggregate).
If that's OK in your scenario, it's better than comparing all fields like MINUS does.
2 weeks ago
Yes , I have just given as one example measure.
Infact I have included all of my Dims, Meaures with a OR clause at Where condition.
Br
Kiran K
2 weeks ago
I'm glad it's working faster with all the measures, and that it's working for you in your scenario.
Just as a note, doing this is not exactly the same as a MINUS.
When you aggregate rows you're looking at a different version than the original, so you're not comparing exactly the same thing. Let me explain.
If in one table you have A,1 ; A,2 ; A;3 aggregated you will have A,1+2+3 = A,6 ... if in the other table you have A,1 ; A,5, aggregated you will have also A,6 so comparing aggregates will tell you it is the same, and it is true, the sum is the same, but not the rows themselves.
Again, you know your scenario, and maybe this is not the case, and comparing aggregates is enough for you.
Regards,
2 weeks ago
Yes , but I have used that aggregation to combine the 2 rows out of UNION ALL into a Single row to have my measures from both the sources together in one row. That will help me to compare the KPI values.
In fact I have included all of my table columns( Dims+measures) in the both inner , outer queries as below. That aggregation is used to bring them into a Single line for easy comparison.
Happy learn , If I am missing anything still.
lets assume I have 50 DIms , 3 measures in both Table_New , Table_Old
Select D1,D2,D3,.....D50 , Sum(M1_new) , Sum(M2_new), SUM(M3_new),Sum(M1_old) , Sum(M2_old), SUM(M3_old),
From
(
Select D1,D2,D3,.....D50, M1 as M1_new , M2 as M2_new , M3 as M3_new ,0 as M1_old,
0 as M2_old , 0 as M3_Old from Table_New
Union All
D1,D2,D3,.....D50, 0 as M1_new , 0 as M2_new , 0 as M3_new ,M1 as M1_old,
M2 as M2_old , M3 as M3_Old
from Table_old
)
Group by D1,D2,D3,.....D50
Having Sum(M1_new ) <> Sum(M1_old ) OR
Sum(M2_new ) <> Sum(M2_old ) OR
Sum(M3_new ) <> Sum(M3_old )
Br
Kiran K
2 weeks ago
If you CANNOT have multiple rows for the same dimension values, then OK.
If you CAN have multiple rows, then what I mentioned only impacts if you can have multiple rows because:
TABLE A
D1 | D2 | D3 | M1 |
A | B | C | 1 |
A | B | C | 3 |
TABLE B
D1 | D2 | D3 | M1 |
A | B | C | 2 |
A | B | C | 2 |
UNION ALL
D1 | D2 | D3 | M1 OLD | M1 NEW |
A | B | C | 1 | 0 |
A | B | C | 3 | 0 |
A | B | C | 0 | 2 |
A | B | C | 0 | 2 |
AGGR
D1 | D2 | D3 | M1 OLD | M1 NEW |
A | B | C | 4 | 4 |
As you can see, the AGGR comes out the same, but the rows in the two tables were not the same. This doesn't happen with the MINUS.
Regards,
2 weeks ago
Okay , it make sense when we have duplicates , Thank for taking your time in explaining the case with an example. Its clear now what you are referring.
Br
Kiran K
Br
Kiran K