PAL is an optional component of SAP HANA, and its main purpose is to enable modelers to perform predictive analysis over big volumes of data. Currently, PAL includes classic and universal predictive analysis algorithms from six data-mining categories:
To be able to call PAL functions from SQLScript you need to be on SPS05 and install the Application Function Library (AFL), which includes PAL and BFL (Business Function Library).
In this post I’m going to focus on how to use the Apriori Algorithm. Let’s take a look at what the PAL user guide says about it:
“…Apriori is a classic predictive analysis algorithm for finding association rules used in association analysis. Association analysis uncovers the hidden patterns, correlations or causal structures among a set of items or objects. For example, association analysis enables you to understand what products and services customers tend to purchase at the same time. By analyzing the purchasing trends of your customers with association analysis, you can predict their future behavior…”
In other words, this algorithm finds frequent itemsets. It is designed to run on transactional tables, which means that you can, for example, use your Sales Order Items table to find patterns in how your customers consume your products or services without having to pre-process the information stored in the table. Running this algorithm is quite simple. The easiest way to understand its purpose is to go to Amazon and browse any product: you’ll notice a section called “Frequently Bought Together”. The content of that section is created using an association algorithm like the one we are talking about in this blog (if not the same).
When you run the Apriori procedure in SAP HANA, the results are stored in a column table that contains all the associations uncovered by the algorithm, with three measures: Support, Confidence and Lift.
Support is the proportion of transactions in the data set that contain a particular itemset. Let’s say we have a Sales Order Items table that looks like this:
| Transaction ID | Item |
|---|---|
| 1 | Milk |
| 1 | Bread |
| 2 | Butter |
| 3 | Beer |
| 4 | Milk |
| 4 | Bread |
| 4 | Butter |
| 5 | Milk |
| 5 | Butter |
In this example, the Itemset {Milk, Bread, Butter} will have a Support = 1/5, because only 1 transaction out of 5 contains that particular Itemset. If we use the Itemset {Milk, Bread}, the Support will be 2/5, because 2 transactions out of 5 contain that combination.
Confidence tells you how often a customer buys Item B when he buys Item A. For example, if we use the table above, the Confidence of {Milk, Bread} will be 2/3, because out of the 3 transactions that contain Milk, only 2 also contain Bread. So the Confidence is defined as:

Confidence(A ⇒ B) = Support(A ∪ B) / Support(A)
In general, if we have a high Support and a high Confidence, we have a strong association. However, this can be deceptive in some cases, because if Item A and Item B both have a high Support, we can get a high Confidence even when they are independent. A better measure to judge the strength of an association rule is the Lift. The Lift ratio is the confidence of the rule divided by the confidence assuming the consequent (Item B) and the antecedent (Item A) are independent:

Lift(A ⇒ B) = Confidence(A ⇒ B) / Support(B)
A Lift ratio greater than 1.0 suggests that we have a strong rule: the larger the Lift ratio, the greater the strength of the association. In our previous example, Support(Bread) = 2/5, so the Lift ratio for {Milk, Bread} will be:

Lift = (2/3) / (2/5) = 5/3 ≈ 1.67
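The three measures above can be checked by hand with a short Python sketch over the sample Sales Order Items table (the data and the {Milk, Bread} rule come from the post; the helper function names are mine):

```python
# Support, Confidence and Lift computed on the sample Sales Order Items table.
rows = [
    (1, "Milk"), (1, "Bread"),
    (2, "Butter"),
    (3, "Beer"),
    (4, "Milk"), (4, "Bread"), (4, "Butter"),
    (5, "Milk"), (5, "Butter"),
]

# Group line items into one itemset per transaction
transactions = {}
for tid, item in rows:
    transactions.setdefault(tid, set()).add(item)

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """Confidence(A => B) = Support(A ∪ B) / Support(A)."""
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    """Lift(A => B) = Confidence(A => B) / Support(B)."""
    return confidence(antecedent, consequent) / support(consequent)

print(support({"Milk", "Bread", "Butter"}))  # 1/5 = 0.2
print(support({"Milk", "Bread"}))            # 2/5 = 0.4
print(confidence({"Milk"}, {"Bread"}))       # 2/3 ≈ 0.667
print(lift({"Milk"}, {"Bread"}))             # (2/3)/(2/5) = 5/3 ≈ 1.667
```

This matches the numbers worked out above: a Lift of about 1.67 (> 1.0) says Milk and Bread are bought together more often than independence would predict.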
Ok, now that we understand the different concepts associated with this algorithm, let the coding begin!
For my example I would like to know which #HASHTAGS are frequently tweeted together on Twitter when tweeting about SAP HANA, so I downloaded a number of tweets (15,000 to be precise) that contain the words SAP and HANA. After downloading the data, I created a column table with the following structure to store the tweets:
CREATE COLUMN TABLE TWT (
"ID" SMALLINT,
"DATE" INTEGER, -- I use INTEGER for the date because I'm using the UNIX timestamp format
"CONTENT" NVARCHAR(200),
PRIMARY KEY ("ID"));
After uploading the Tweets to SAP HANA, I ran a Text Analysis process by executing the following statement:
CREATE FULLTEXT INDEX TWT_FTI ON TWT("CONTENT")
TEXT ANALYSIS ON CONFIGURATION 'EXTRACTION_CORE';
The reason why I ran this Text Analysis process is because I want to identify all the #HASHTAGS in my Tweets. After creating the Full Text Index, a new table called $TA_TWT_FTI is created. Let’s take a look at the content of the table:
As you can see, in this table I can easily identify all the different #HASHTAGS simply by filtering on column TYPE = 'SOCIAL_MEDIA/TOPIC_TWITTER'. If you would like to know more about the Text Analysis process, take a look at my previous post, where I describe in detail how to use it. My next step is creating a SQL view on this table to show only #HASHTAGS:
CREATE VIEW "PAL_TRANS_TAB" AS
SELECT ID, TOKEN FROM "$TA_TWT_FTI"
WHERE TYPE = 'SOCIAL_MEDIA/TOPIC_TWITTER';
The resulting view looks like this:
This looks pretty similar to a Sales Order Items table, doesn’t it? Just replace the ID with the Sales Order ID and the Token with the Product or Service ID included in the Sales Order. Now I’m ready to run the Apriori algorithm:
SET SCHEMA _SYS_AFL;
/* CREATE TABLE TYPE FOR MY INPUT DATA */
DROP TYPE PAL_DATA_T;
CREATE TYPE PAL_DATA_T AS TABLE(
"ID" INT,
"TOKEN" VARCHAR(100)
);
/* CREATE TABLE TYPE FOR THE OUTPUT TABLE */
DROP TYPE PAL_RESULT_T;
CREATE TYPE PAL_RESULT_T AS TABLE(
"PRERULE" VARCHAR(500),
"POSTRULE" VARCHAR(500),
"SUPPORT" DOUBLE,
"CONFIDENCE" DOUBLE,
"LIFT" DOUBLE
);
/* CREATE TABLE TYPE FOR THE OUTPUT PMML MODEL */
DROP TYPE PAL_PMMLMODEL_T;
CREATE TYPE PAL_PMMLMODEL_T AS TABLE(
"ID" INT,
"PMMLMODEL" VARCHAR(5000)
);
/* CREATE TABLE TYPE FOR THE TABLE THAT WILL CONTAIN THE INPUT PARAMETERS */
DROP TYPE PAL_CONTROL_T;
CREATE TYPE PAL_CONTROL_T AS TABLE(
"NAME" VARCHAR (50),
"INTARGS" INTEGER,
"DOUBLEARGS" DOUBLE,
"STRINGARGS" VARCHAR (100)
);
/* CREATE TABLE THAT WILL POINT TO THE DIFFERENT TYPES I'M USING TO RUN THE ALGORITHM */
DROP TABLE PDATA;
CREATE COLUMN TABLE PDATA(
"ID" INT,
"TYPENAME" VARCHAR(100),
"DIRECTION" VARCHAR(100) );
/* FILL THE TABLE */
INSERT INTO PDATA VALUES (1, '_SYS_AFL.PAL_DATA_T', 'in');
INSERT INTO PDATA VALUES (2, '_SYS_AFL.PAL_CONTROL_T', 'in');
INSERT INTO PDATA VALUES (3, '_SYS_AFL.PAL_RESULT_T', 'out');
INSERT INTO PDATA VALUES (4, '_SYS_AFL.PAL_PMMLMODEL_T', 'out');
/* GENERATE THE APRIORI PROCEDURE */
DROP PROCEDURE PAL_APRIORI_RULE;
DROP TYPE PAL_APRIORI_RULE__TT_P1;
DROP TYPE PAL_APRIORI_RULE__TT_P2;
DROP TYPE PAL_APRIORI_RULE__TT_P3;
DROP TYPE PAL_APRIORI_RULE__TT_P4;
call SYSTEM.afl_wrapper_generator('PAL_APRIORI_RULE', 'AFLPAL', 'APRIORIRULE', PDATA);
/* CREATE TABLE THAT WILL CONTAIN THE INPUT PARAMETERS */
DROP TABLE PAL_CONTROL_TAB;
CREATE COLUMN TABLE PAL_CONTROL_TAB(
"NAME" VARCHAR (50),
"INTARGS" INTEGER,
"DOUBLEARGS" DOUBLE,
"STRINGARGS" VARCHAR (100)
);
/* FILL THE TABLE */
/* NUMBER OF THREADS TO BE USED */
INSERT INTO PAL_CONTROL_TAB VALUES ('THREAD_NUMBER', 2, null, null);
/* MINIMUM SUPPORT THRESHOLD */
INSERT INTO PAL_CONTROL_TAB VALUES ('MIN_SUPPORT', null, 0.02, null);
/* MINIMUM CONFIDENCE THRESHOLD */
INSERT INTO PAL_CONTROL_TAB VALUES ('MIN_CONFIDENCE', null, 0.02, null);
/* CREATE THE OUTPUT TABLE */
DROP TABLE PAL_RESULT_TAB;
CREATE COLUMN TABLE PAL_RESULT_TAB(
"PRERULE" VARCHAR(500),
"POSTRULE" VARCHAR(500),
"SUPPORT" DOUBLE,
"CONFIDENCE" DOUBLE,
"LIFT" DOUBLE
);
/* CREATE THE OUTPUT TABLE WITH THE PMML MODEL */
DROP TABLE PAL_PMMLMODEL_TAB;
CREATE COLUMN TABLE PAL_PMMLMODEL_TAB (
"ID" INT,
"PMMLMODEL" VARCHAR(5000)
);
/* CALL THE APRIORI ALGORITHM */
CALL PAL_APRIORI_RULE(PAL_TRANS_TAB, PAL_CONTROL_TAB, PAL_RESULT_TAB, PAL_PMMLMODEL_TAB) WITH OVERVIEW;
/* SHOW THE RESULTS */
SELECT * FROM PAL_RESULT_TAB;
SELECT * FROM PAL_PMMLMODEL_TAB;
This is what the output table looks like:
Of course, the strongest rules are {#HANA, #SAP} and {#SAP, #HANA}, because that is the selection criteria I used to download the data from Twitter. But the interesting thing about this algorithm is that it does not only create rules with two items; rules can contain multiple items. For example, people who tweet using the hashtags #SAP and #SAPTechEd in the same tweet usually include the hashtag #HANA as well. I hope you enjoyed it!
Follow me on Twitter: @LukiSpa
Information in Spanish about SAP HANA™: