Data Hackery is a passion of mine, and as long as I have been involved in SAP and OpenSource, I have delighted in finding new and interesting ways of getting more out of what we already have, or making connections between data in SAP and things that haven't been apparent to me before.
To me - data hackery is typified by curiosity, and the need/desire to iteratively reshape and query data to gain new insights, and to satisfy questions and problems that we are burning to resolve.
This Christmas just passed, I started investigating the in-memory database landscape - in particular in the OpenSource world - and found that there is already quite a lot to see out there (e.g. http://en.wikipedia.org/wiki/In-memory_database). What I really wanted, though, was to explore the paradigm of column-oriented in-memory stores: http://en.wikipedia.org/wiki/Column-oriented_DBMS - and here I found one that really stood out to me - MonetDB.
MonetDB is a column-oriented in-memory database that has a very simple implementation model and supports SQL'99 through SQL'03. Performance is primarily tuned in the OS, so the main thing to watch is that you have enough memory for what you are attempting to do. The second thing is to think carefully about your queries - don't use SELECT * when you don't need to; target only the columns required. Tuning based on indexes is largely meaningless, but you do need to consider how bitmap indexes work. To me, this is a brilliant low-cost solution to a fast data manipulation and aggregation problem.
With this, I get to the point where I have a personal column-oriented in-memory database whose scalability is (mostly) limited only by how much memory I can throw at it, plus R and SAP - this is my idea of fun, and all I need to do is mix them together.
I want to be able to take data from SAP, optionally augment it from external sources, and then rapidly perform data analysis in an iterative fashion - the key thing about 'rapid' is that there is low impedance to the process of discovery - and this is where I think this combination fits well:
We can achieve the coupling of this using two connectors - RSAP, and RMonet. (Note: this was updated from RMonetDB to RMonet as the package was renamed due to a naming clash - 9/02/2013).
I have covered the process for installing and using RSAP in a previous post here.
RMonet as a database connector is a new subject, and the installation process is described here. First, you must install MonetDB itself, along with the client tools. These are available at the download page. Installation for Ubuntu/Debian is very simple and described here.
Before you go much further, I recommend looking through (and trying) the Getting Started tutorial - this steps you through the basic administration process (quick and easy), and handling some simple queries.
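Once the server is up, a quick smoke test from R is worth doing before anything else. This is only a sketch - it assumes you have already created and released a database called 'voc' (as per the tutorial) with matching credentials, and have RMonet installed:

# minimal sanity check that RMonet can reach the new MonetDB instance
# (assumes a 'voc' database and user created during the Getting Started steps)
library(RMonet)
conn <- RMonetConnect(dbhost='localhost', dbuser='voc', dbpass='voc', dbname='voc')
print(dbquery(conn, 'SELECT 1'))   # should come back instantly with a single value
RMonetClose(conn)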
To demonstrate my point, my example is based on the generated test data in tables SFLIGHTS2 and SBOOK. The data is generated by running report SAPBC_DATA_GENERATOR.
As with all data analysis problems, I start with a question: "Which were the worst flights to travel on, based on predicted mood?"
I've equated the mood of a flight to the effect that smokers are likely to have on the atmosphere - I'm assuming that the flights are non-smoking, and that the length of the flight exacerbates withdrawal symptoms, so I weight smokers by flight time: stress points = smokers * (flight time in minutes / 30) - guessing that a smoker will want to light up every half hour when stressed, so their stress level ticks up every 30 minutes.
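As a back-of-the-envelope check of the metric before writing any SQL, here it is in plain R (the example numbers are simply a plausible worst case, matching the kind of figures that turn up in the results further down):

# stress points = smokers * (flight time in minutes / 30)
# i.e. one extra stress point per smoker for every 30 minutes in the air
stress_points <- function(smokers, fltime_minutes) {
  smokers * (fltime_minutes / 30)
}
stress_points(49, 1125)   # 49 smokers on an 1125 minute flight -> 1837.5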
The basic process is to source the data in SAP, load it into MonetDB, let MonetDB do the heavy lifting in terms of processing and manipulation, and then use R to marshal the process and take care of any presentation or visualisation requirements.
# load my dependent libraries
library(RMonet)
library(RSAP)
# connect to SAP and pull the data I require into data.frames
conn <- RSAPConnect("sap.yml")
print(system.time((sflights <- RSAPReadTable(conn, "SFLIGHTS2"))))
print(system.time((sbook <- RSAPReadTable(conn, "SBOOK", fields=list('CARRID', 'CONNID', 'FLDATE', 'SMOKER', 'CANCELLED')))))
RSAPClose(conn)
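A quick structural sanity check of the extract is worthwhile before going any further (just a sketch - the exact shape depends on the fields requested above):

# confirm the extract looks sane before loading it into MonetDB
nrow(sflights)   # row counts should match the figures quoted below
nrow(sbook)
str(sbook)       # confirm the five requested columns came through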
We now have two data.frames, sflights and sbook, that contain 4,880 and 1,370,282 rows respectively.
Which took:
SFLIGHTS2
user system elapsed
1.908 0.100 4.015
SBOOK
user system elapsed
106.559 19.661 347.879
These are actually shockingly bad times - almost 6 minutes for 1.3 million rows out of the SAP test-drive NPL system? Well, I only have a Dell XPS-17 with 8GB of RAM to run everything on ...
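As an aside, if the RFC read time ever becomes the bottleneck, one option is to push selection criteria down to the table read so that only the rows you care about travel over the wire. This is just a hedged sketch - check the RSAP documentation for the exact options syntax:

# illustrative only: restrict the SBOOK extract to a single carrier
# by passing WHERE-style selection criteria through the options argument
sbook_sq <- RSAPReadTable(conn, "SBOOK",
                          options=list("CARRID = 'SQ'"),
                          fields=list('CARRID', 'CONNID', 'FLDATE', 'SMOKER', 'CANCELLED'))

For this exercise, though, I want the full data set, so on with loading it into MonetDB.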
# connect to MonetDB and load my data.frames into tables
conn <- RMonetConnect(dbhost='localhost', dbuser='voc', dbpass='voc', dbname='voc')
print(system.time((res <- dbwrite(conn, sflights, "sflights", drop=TRUE))))
print(system.time((res <- dbwrite(conn, sbook, "sbook", drop=TRUE, tmp.file=TRUE))))
The two data frames (sflights, and sbook) are now loaded into MonetDB.
sflights loaded in:
user system elapsed
0.388 0.008 0.999
sbook loaded in:
user system elapsed
16.177 0.432 20.204
Not too bad for the old XPS ...
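Before running the real query, it is worth a quick check (again, only a sketch against the same connection) that the loads landed with the row counts we expect:

# confirm MonetDB now holds the same row counts as the source data.frames
print(dbquery(conn, 'SELECT COUNT(*) FROM voc.sflights'))   # expect 4880
print(dbquery(conn, 'SELECT COUNT(*) FROM voc.sbook'))      # expect 1370282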
# find the worst flights to travel - mood due to smokers
print(system.time((worst <- dbquery(conn, 'SELECT "bs"."CARRID", "bs"."CONNID", "bs"."FLDATE", "f"."FLTIME", "bs"."SMOKERS",
("bs"."SMOKERS" * ("f"."FLTIME" / 30)) AS "STRESS_POINTS"
FROM (SELECT "b"."CARRID", "b"."CONNID", "b"."FLDATE", COUNT(*) AS "SMOKERS"
FROM voc.sbook b WHERE "b"."SMOKER" <> \'\' AND "b"."CANCELLED" = \'\'
GROUP BY "b"."CARRID", "b"."CONNID", "b"."FLDATE") bs
LEFT JOIN voc.sflights f ON "bs"."CARRID" = "f"."CARRID" AND
"bs"."CONNID" = "f"."CONNID" AND
"bs"."FLDATE" = "f"."FLDATE"
ORDER BY "STRESS_POINTS" DESC'))))
RMonetClose(conn)
We now have a data.frame (worst) that has the results of bookings summarised by flights with a calculation of stress points relative to journey length and number of smokers on board (completely fictional, I know).
Query run time:
user system elapsed
0.036 0.000 0.363
Well - I'm impressed. 0.36 of a second to JOIN, GROUP and SORT those two tables (plus an inner SELECT), on my limited resources.
The result set looks like:
> head(worst, 10)
CARRID CONNID FLDATE FLTIME SMOKERS STRESS_POINTS
1 SQ 15 20120404 1125 49 1837.5
2 SQ 15 20130227 1125 49 1837.5
3 SQ 15 20120926 1125 49 1837.5
4 SQ 15 20120725 1125 49 1837.5
5 SQ 15 20120606 1125 49 1837.5
6 SQ 15 20110914 1125 48 1800.0
7 SQ 15 20111116 1125 48 1800.0
8 SQ 15 20121010 1125 48 1800.0
9 SQ 15 20120201 1125 48 1800.0
10 SQ 15 20111130 1125 48 1800.0
And the conclusion is that flying long haul with lots of smokers going rabid is not great - especially on SQ-15 on 4/4/2012.
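And because R is marshalling the whole process, the presentation or visualisation side mentioned earlier is only a couple of lines away - for example, a rough look (nothing more than a sketch) at how the stress points stack up per carrier:

# rough visual summary: total stress points by carrier across all flights
per_carrier <- aggregate(STRESS_POINTS ~ CARRID, data = worst, FUN = sum)
barplot(per_carrier$STRESS_POINTS, names.arg = per_carrier$CARRID,
        main = "Total stress points by carrier", ylab = "stress points")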
Exploration of data, and the time it takes, is a key concern in developing insights (especially in a world of shrinking deadlines and times to market, with a growing global attention deficit disorder). This problem is magnified many-fold when it comes to dealing with 'Big Data'. One way of dealing with it is to break enormous problems into little pieces and farm them out to cheap parallel processing, which is what Hadoop and MapReduce give you. However, this requires careful and elaborate planning, which is costly, and I rather get the feeling that the winners of the big data race will be those who can take advantage of tool-sets like R and MonetDB (and others yet to come - and yes, SAP HANA is right in there) that reduce the impedance of getting from question to answer in a cheap and accessible manner.