In this article, I'm going to explain through experience why you need to break up UPDATE statements as part of a data loading process. And unlike the hombre from the movie "Treasure of the Sierra Madre" ---

Image from: http://www.deadredart.com/madre.jpg

You can't ignore issues like this when working in relatively scarce memory environments like HANA One.

Wikipedia and Big Data

I've been working on a "Big Data" project that uses AWS Hive against Wikipedia's page hit history files for the month of April 2013: aggregate the data, load the results into SAP HANA One, and then use the new Lumira tools to analyze them. Each hour, Wikipedia kicks out a file averaging around 7 million records that show the Wikipedia project code, the page name hit, the number of hits, and the number of bytes served for that page during the hour. If you go back to my SAP HANA – Hot N Cold – HANA for Hot Data - Hadoop for Cold Data post, my inclination was to process this using just AWS Hive. I'm sticking to my guns - especially after my little experiment with a HANA One-only approach.

What I did was use the Linux wget command to pull down the .gz files for each day of April 2013 into separate directories on a 500 GB EBS drive that I added to my AWS HANA One instance (blog post pending on this :smile: ). For April 1, 2013, I loaded up 167,879,044 records for the 24-hour period. On disk, the uncompressed files took up 8.45 GB of space. Now multiply this by 30 days and I'm looking at 253.66 GB of uncompressed data. You can see I'm going to push the boundaries of HANA One here. Not to mention, Wikipedia is a great target for testing your Big Data skills.

Loading up HANA

What I did was create the following COLUMN table to import the data:

CREATE COLUMN TABLE "WIKIPEDIA"."PAGEHITS"
(
    "PROJNAME" VARCHAR(50),
    "PAGENAME" VARCHAR(500),
    "PAGEHITCOUNTFORHOUR" BIGINT CS_FIXED,
    "BYTESFORHOUR" BIGINT CS_FIXED,
    "HITYEAR" VARCHAR(4),
    "HITMONTH" VARCHAR(2),
    "HITDAY" VARCHAR(2)
);

I then did a series of 30 IMPORT statements that looked like this to load the data.

IMPORT FROM CSV FILE '/wiki-data/year=2013/month=04/day=01/pagecounts-20130401-010000'
INTO wikipedia.pagehits
WITH RECORD DELIMITED BY '\n'
FIELD DELIMITED BY ' ';

I did a COUNT(*) query to make sure I got all the rows, and then ran the following UPDATE statement to set the HITYEAR, HITMONTH and HITDAY values for each of the records.

UPDATE wikipedia.pagehits SET hityear = '2013', hitmonth = '04',hitday='01' WHERE hitday IS NULL;

After more than 10 minutes of grinding away, I got the following error:

Could not execute 'update wikipedia.pagehits set hityear = '2013', hitmonth = '04', hitday='01'' in 10:50.841 minutes.

[129]: transaction rolled back by an internal error: Memory allocation failed

Ouch! Sure enough, I looked at the system memory using the System Monitor in SAP HANA Studio and it showed the Database Resident Memory was taking 31+ GB of RAM!
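
By the way, if you'd rather check this from SQL than from the Studio monitor, the monitoring views give you a similar picture. This is just a sketch - the view and column names below are to the best of my knowledge and may vary a bit between HANA revisions:

-- Host-level memory, roughly what the System Monitor reports as resident memory
SELECT HOST, USED_PHYSICAL_MEMORY, FREE_PHYSICAL_MEMORY
FROM SYS.M_HOST_RESOURCE_UTILIZATION;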

Time for Batches

It's obvious now, after looking over all the DELTA MERGE architecture diagrams, that HANA really doesn't like doing one large update against 167 million records. It's still a bit surprising, though, since I was setting the same handful of values on every row. I hope one of the SAP engineers can help explain this in detail. Regardless, my next step was to check and make sure that I could at least import and update one hour at a time.
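
As an aside, HANA does let you force the delta merge for a single table by hand. I didn't end up needing it for the batch approach below, but if you suspect a bloated delta store is eating your memory, the statement looks like this:

-- Force a delta merge for the page hits table
MERGE DELTA OF "WIKIPEDIA"."PAGEHITS";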

So, I imported each hour of data one file at a time; each load of roughly 7 million records took somewhere between 55 and 70 seconds, and 24 loads covered the full day. After each import, I performed the following update statement to set the date values.

UPDATE wikipedia.pagehits SET hityear = '2013', hitmonth = '04',hitday='01' WHERE hitday IS NULL;

The update statements averaged around 30 seconds each. Best of all, I was able to process an entire day at one time. I also watched the Database Memory like a hawk and saw it go from 0.3 GB to 7.7 GB after 20 import and update statements. Then an amazing thing happened after the next load - Database Memory dropped down to 3.9 GB. What I suspect happened is that the compression algorithm kicked in and realized that lots of the page names were really duplicates and decided to compress the data - sweet! :smile:
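
If you want to watch that compression happen from SQL instead of eyeballing the System Monitor, the column store monitoring view breaks the table's footprint into main and delta storage. Again, this is a sketch and the column names are to the best of my knowledge:

-- Main vs. delta footprint for the table, plus when the last merge ran
SELECT TABLE_NAME, MEMORY_SIZE_IN_MAIN, MEMORY_SIZE_IN_DELTA, RECORD_COUNT, LAST_MERGE_TIME
FROM SYS.M_CS_TABLES
WHERE SCHEMA_NAME = 'WIKIPEDIA' AND TABLE_NAME = 'PAGEHITS';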

At the end of the operation, memory size was at 4.1 GB. Plenty of room left to add more data. More importantly - I didn't run out of memory with the smaller batches. After restarting the server and then loading the table into memory, my memory size is sitting at 3.2 GB - I got even more compression when the table was loaded from disk!
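
For reference, loading the column table back into memory after the restart is a one-liner - this is standard HANA SQL as far as I know:

-- Load every column of the table into memory after a restart
LOAD "WIKIPEDIA"."PAGEHITS" ALL;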

What's Next

So, back to our hombre...

Yes - you do need batches! If you think you are pushing the memory limits of your HANA instance, consider breaking big set-based operations - like increasing the cost of all of your inventory items by 10% - into smaller batches.
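
As a purely hypothetical sketch of what I mean - the table, column and key ranges below are made up for illustration - that inventory example might look like this instead of one giant statement:

-- One giant UPDATE that touches every row in a single transaction:
-- UPDATE inventory SET cost = cost * 1.10;

-- Broken into key-range batches so each transaction stays small:
UPDATE inventory SET cost = cost * 1.10 WHERE item_id BETWEEN 1 AND 1000000;
UPDATE inventory SET cost = cost * 1.10 WHERE item_id BETWEEN 1000001 AND 2000000;
-- ...and so on for the remaining ranges, committing between batches.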

Back to Big Data and HANA. A good friend of mine, Buck Woody (http://twitter.com/buckwoody), once asked someone - which is faster - a dump truck or a sports car? If you are a DBA at heart, your answer should be - it all depends. If you are trying to move 10 tons of compost from the nursery to your home, it's going to take a lot more trips (time) in a sports car compared to the one trip needed for the dump truck.

With AWS, I was able to load all the needed Wikipedia files into S3 storage in their compressed form and then use a Hive job to get the top 1000 page hits for English language Wikipedia articles for each day in April using 6 beefy EC2 instances in under four hours. No import and update time was needed. No trying to configure the HANA One instance to increase the size of the 190 GB data drive to 400 GB to hold the uncompressed data. And there's plenty of headroom to process more than one month of data without having to unload data from HANA One to make room for the next month.
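
For the curious, the Hive side boils down to a query along these lines for each day. This is only a sketch - the external table name and the partition column are assumptions here for illustration, not the exact job I ran:

-- Top 1000 English Wikipedia pages for April 1, summed across the hourly files
SELECT pagename, SUM(pagehits) AS total_hits
FROM pagecounts_raw            -- hypothetical external table over the raw files in S3
WHERE projname = 'en'
  AND hitday = '01'            -- hypothetical partition column for the day
GROUP BY pagename
ORDER BY total_hits DESC
LIMIT 1000;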

I hope you found this article interesting. Please rate this as you see fit if you want to see more "real-life" experiences from me in my HANA adventures.

Regards,

Bill Ramos

P.S. Be sure to check out my interview with SAP at: Five Questions for... Bill Ramos
