PAL is an optional component from SAP HANA and its main porpoise is to enable modelers to perform predictive analysis over big volumes of data. If this is the first time you hear about PAL, I would recommend reading the official documentation. You can also take a look at my prior post where I talk about the Apriori Algorithm.
In this post I’m going to focus on how to use the K-Means clustering algorithm included in PAL because it’s one of the most popular and most commonly used in data-mining. But before we jump into the code, let’s talk about how the algorithm works.
According to Wikipedia, “clustering is the task of grouping a set of objects in a way that objects in the same group (called cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters)”. In other words, grouping your data into multiple clusters. The most common use case for a clustering algorithm is customer segmentation, meaning you use a clustering algorithm to divide your customer database in multiple groups (or clusters) based on how similar they are or how similar they behave, e.g., age, gender, interests, spending habits and so on.
The K-Means algorithm works in a very simple way (for me that I don’t have to code it in C++ J). The first step is to plot all the database objects into space where each attribute is a dimension. So if we use a two attributes data set the resulting chart would look something like this:
After all the objects are plotted, the algorithm calculates the distance between them and the ones that are close to each other are grouped in the same cluster. So if we go back to our previous example we can create 4 different clusters:
Each cluster is associated with a centroid and each point is assigned to the cluster with the closest centroid. The centroid is the mean of the points in the cluster. The closeness can be measured using:
Every time a point is assigned to a cluster the centroid is recalculated. This is repeated in multiple iterations until centroids don’t change anymore (meaning all points have been assigned to a corresponding cluster) or until relatively few points change clusters. Usually most of the centroid movement happens in the first iterations.
One of the main drawbacks of the K-Means Algorithm is that you need to specify the number of Ks (or clusters) upfront as an input parameter. Knowing this value is usually very hard, that is why it is important to run quality measurement functions to check the quality of your clustering. Later in this post we will talk about this.
I came across a very interesting paper that talks about segmentation in the telecommunication industry, so I thought it would be a very nice use case to demo the K-Means algorithm in HANA (if you are interested in this topic, I very much recommend reading this paper). These are the steps I followed:
The first step is creating a table that will contain information on customers mobile phone usage habits with the following structure:
CREATE COLUMN TABLE "TELCO" (
"ID" INTEGER NOT NULL, --> Customer ID
"AVG_CALL_DURATION" DOUBLE, --> Average Call Duration
"AVG_NUMBER_CALLS_RCV_DAY" DOUBLE, --> Average Calls Received per Day
"AVG_NUMBER_CALLS_ORI_DAY" DOUBLE, --> Average Calls Originated per Day
"DAY_TIME_CALLS" DOUBLE, --> Percentage of Calls made during day time hours (9 a.m. - 6 p.m.)
"WEEK_DAY_CALLS" DOUBLE, --> Percentage of Calls made during week days (Monday thru Friday)
"CALLS_TO_MOBILE" DOUBLE, --> Percentage of Calls made to mobile phones
"SMS_RCV_DAY" DOUBLE, --> Number of SMSs received per day
"SMS_ORI_DAY" DOUBLE, --> Number of SMSs sent per day
PRIMARY KEY ("ID"))
So each row in this table will represent a unique customer. Now I need to fill it, but I do not have access to real data, so I had to build my own dataset. I created 30 different customers (30 rows) that can be grouped in 3 segments:
The resulting table looks like this:
Now that I have my dataset, I’m ready to start coding. The first thing we need to do is generate the PAL procedure by calling the AFL Wrapper Generator. To do so we need to create a number of Table Types that will be used to define the structure of the data that will be used as input and output parameters:
SET SCHEMA _SYS_AFL;
/* Table Type that will be used as the output parameter
that will contain which cluster has been assigned to each
customer and what is the distance to the mean of the cluster */
DROP TYPE PAL_KMEANS_RESASSIGN_TELCO;
CREATE TYPE PAL_KMEANS_RESASSIGN_TELCO AS TABLE(
"ID" INT,
"CENTER_ASSIGN" INT,
"DISTANCE" DOUBLE
);
/* Table Type that will be used as the input parameter
that will contain the data that I would like to cluster */
DROP TYPE PAL_KMEANS_DATA_TELCO;
CREATE TYPE PAL_KMEANS_DATA_TELCO AS TABLE(
"ID" INT,
"AVG_CALL_DURATION" DOUBLE,
"AVG_NUMBER_CALLS_RCV_DAY" DOUBLE,
"AVG_NUMBER_CALLS_ORI_DAY" DOUBLE,
"DAY_TIME_CALLS" DOUBLE,
"WEEK_DAY_CALLS" DOUBLE,
"CALLS_TO_MOBILE" DOUBLE,
"SMS_RCV_DAY" DOUBLE,
"SMS_ORI_DAY" DOUBLE,
primary key("ID")
);
/* Table Type that will be used as the output parameter
that will contain the centers for each cluster */
DROP TYPE PAL_KMEANS_CENTERS_TELCO;
CREATE TYPE PAL_KMEANS_CENTERS_TELCO AS TABLE(
"CENTER_ID" INT,
"V000" DOUBLE,
"V001" DOUBLE,
"V002" DOUBLE,
"V003" DOUBLE,
"V004" DOUBLE,
"V005" DOUBLE,
"V006" DOUBLE,
"V007" DOUBLE
);
/* Table Type that will be used to specify
the different parameters to run the KMeans Algorithm */
DROP TYPE PAL_CONTROL_TELCO;
CREATE TYPE PAL_CONTROL_TELCO AS TABLE(
"NAME" VARCHAR (50),
"INTARGS" INTEGER,
"DOUBLEARGS" DOUBLE,
"STRINGARGS" VARCHAR (100)
);
/* This table is used to generate the KMeans procedure
and will point the AFL Wrapper Generator to the different
table types that we just created */
DROP TABLE PDATA_TELCO;
CREATE COLUMN TABLE PDATA_TELCO(
"ID" INT,
"TYPENAME" VARCHAR(100),
"DIRECTION" VARCHAR(100) );
/* Fill the table */
INSERT INTO PDATA_TELCO VALUES (1, '_SYS_AFL.PAL_KMEANS_DATA_TELCO', 'in');
INSERT INTO PDATA_TELCO VALUES (2, '_SYS_AFL.PAL_CONTROL_TELCO', 'in');
INSERT INTO PDATA_TELCO VALUES (3, '_SYS_AFL.PAL_KMEANS_RESASSIGN_TELCO', 'out');
INSERT INTO PDATA_TELCO VALUES (4, '_SYS_AFL.PAL_KMEANS_CENTERS_TELCO', 'out');
/* Creates the KMeans procedure that executes the KMeans Algorithm */
call SYSTEM.afl_wrapper_generator('PAL_KMEANS_TELCO', 'AFLPAL', 'KMEANS', PDATA_TELCO);
After executing this code we should see a new procedure in the _SYS_AFL schema called PAL_KMEANS_TELCO
I generated the K-Means procedure so now I need to write the code that will execute it:
/* This table will contain the parameters that will be used
during the execution of the KMeans procedure.
For Eexample, the number of clusters you would like to use */
DROP TABLE PAL_CONTROL_TAB_TELCO;
CREATE COLUMN TABLE PAL_CONTROL_TAB_TELCO(
"NAME" VARCHAR (50),
"INTARGS" INTEGER,
"DOUBLEARGS" DOUBLE,
"STRINGARGS" VARCHAR (100)
);
/* Fill the parameters table */
INSERT INTO PAL_CONTROL_TAB_TELCO VALUES ('THREAD_NUMBER',2,null,null); --> Number of threads to be used during the execution
INSERT INTO PAL_CONTROL_TAB_TELCO VALUES ('GROUP_NUMBER',3,null,null); --> Number of clusters
INSERT INTO PAL_CONTROL_TAB_TELCO VALUES ('INIT_TYPE',4,null,null); --> This parameter will specify how to select the initial center of each cluster
INSERT INTO PAL_CONTROL_TAB_TELCO VALUES ('DISTANCE_LEVEL',2,null,null); --> Which distance to use. In this case I'm using Euclidean
INSERT INTO PAL_CONTROL_TAB_TELCO VALUES ('MAX_ITERATION',100,null,null); --> Maximum Iterations
INSERT INTO PAL_CONTROL_TAB_TELCO VALUES ('EXIT_THRESHOLD',null,0.000001,null); --> Threshold value for exiting an iteration
INSERT INTO PAL_CONTROL_TAB_TELCO VALUES ('NORMALIZATION',0,null,null); --> How to normalize the data. In this case I'm not normalizing at all
/* This table will contain the resulting cluster
assignment for each customer and the distance to the center of the cluster */
DROP TABLE PAL_KMEANS_RESASSIGN_TAB_TELCO;
CREATE COLUMN TABLE PAL_KMEANS_RESASSIGN_TAB_TELCO(
"ID" INT,
"CENTER_ASSIGN" INT,
"DISTANCE" DOUBLE,
primary key("ID")
);
/* This table will contain the resulting centers for each cluster */
DROP TABLE PAL_KMEANS_CENTERS_TAB_TELCO;
CREATE COLUMN TABLE PAL_KMEANS_CENTERS_TAB_TELCO(
"CENTER_ID" INT,
"V000" DOUBLE,
"V001" DOUBLE,
"V002" DOUBLE,
"V003" DOUBLE,
"V004" DOUBLE,
"V005" DOUBLE,
"V006" DOUBLE,
"V007" DOUBLE
);
/* Execute the K-Means procedure */
CALL PAL_KMEANS_TELCO(TELCO, PAL_CONTROL_TAB_TELCO, PAL_KMEANS_RESASSIGN_TAB_TELCO, PAL_KMEANS_CENTERS_TAB_TELCO) with overview;
Pretty easy huh?
Ok, I have my code ready, but I’m missing a very important part, I still don’t know how many Ks I need to specify as the input parameter (well, I do know because I created the sample data, but let’s pretend I don’t know). There are multiple techniques to find out how many groups will produce the best clustering, in this case I will use the Elbow Criterion. The elbow criterion is a common rule of thumb that says that one should choose a number of clusters so that adding another cluster does not add sufficient information. I will run the code above specifying different number of clusters and for each run I will measure the total intra-cluster distance. When the distance does not decrease much from one run to the other I will know the number of groups I need to use. I built the chart below with the results:
As you can see, the distance goes dramatically down between 2 and 3, and after 3 the distance keeps going down but in a smaller scale. So the “elbow” is clearly in cluster 3. This means that I should use 3 clusters to run the algorithm. So now I’m going to run the algorithm again using the right number of clusters. This is the result:
The first column is the Customer ID and the second column is the cluster that has been assigned to that customer. So based on how customers use their mobile phones, the K-Means algorithm clustered my customers in the following way:
Notice that the clustering result is exactly aligned with the segments that I defined in my sample data.
PAL gives us the ability to measure the quality of a clustering process by providing the VALIDATEKMEANS function. This function calculates the Clusters’ Silhouette. For each point it calculates the average distance to all other points in the same cluster, this is called the average dissimilarity. Then it calculates the average dissimilarity of each point to all other clusters where the point is not a member. The cluster with the lowest average dissimilarity is said to be the "neighboring cluster", as it is, aside from the cluster it has been assigned to, the cluster in which fits best. The average distance tofellow cluster members is then compared to the average distance to members of the neighboring cluster. The resulting value is the silhouette score. The silhouette score can go from -1 to 1. A negative value means that the record is more similar to the records of its neighboring cluster than to other members of its own cluster. A score near to 0 means that the record is in the border of the two clusters. A score near to 1 means that the record has been appropriately clustered. The silhouette of the entire dataset is the average of the silhouette scores of all the individual records. So if this number is close to 1, it means we used the right number of clusters. First we need to generate the Validation Procedure using the AFL Wrapper Generator the same way we did with the Clustering Procedure:
SET SCHEMA _SYS_AFL;
/* This type is used as the input parameter of the PAL
function that contains the information you clustered */
DROP TYPE T_KMEANS_DATA_TELCO_S;
CREATE TYPE T_KMEANS_DATA_TELCO_S AS TABLE(
"ID" INT,
"AVG_CALL_DURATION" DOUBLE,
"AVG_NUMBER_CALLS_RCV_DAY" DOUBLE,
"AVG_NUMBER_CALLS_ORI_DAY" DOUBLE,
"DAY_TIME_CALLS" DOUBLE,
"WEEK_DAY_CALLS" DOUBLE,
"CALLS_TO_MOBILE" DOUBLE,
"SMS_RCV_DAY" DOUBLE,
"SMS_ORI_DAY" DOUBLE,
primary key("ID")
);
/* This type will be used as the input parameter
of the table that contains the result of the clustering process */
DROP TYPE T_KMEANS_TYPE_ASSIGN_TELCO_S;
CREATE TYPE T_KMEANS_TYPE_ASSIGN_TELCO_S AS TABLE(
"ID" INTEGER,
"TYPE_ASSIGN" INTEGER
);
/* This is the type of the output parameter that will
show the silhouette score of the entire data set */
DROP TYPE T_KMEANS_RESULT_TELCO_S;
CREATE TYPE T_KMEANS_RESULT_TELCO_S AS TABLE(
"NAME" VARCHAR (50),
"S" DOUBLE
);
/* This is the type of the table that contains the input and output parameters */
DROP TYPE CONTROL_T_TELCO_S;
CREATE TYPE CONTROL_T_TELCO_S AS TABLE(
"Name" VARCHAR(100),
"intArgs" INTEGER,
"doubleArgs" DOUBLE,
"strArgs" VARCHAR(100));
/* This is the table that contains the input and output tables */
DROP table PDATA_TELCO_S;
CREATE column table PDATA_TELCO_S(
"ID" INTEGER,
"TYPENAME" VARCHAR(100),
"DIRECTION" VARCHAR(100));
insert into PDATA_TELCO_S values (1,'_SYS_AFL.T_KMEANS_DATA_TELCO_S','in');
insert into PDATA_TELCO_S values (2,'_SYS_AFL.T_KMEANS_TYPE_ASSIGN_TELCO_S','in');
insert into PDATA_TELCO_S values (3,'_SYS_AFL.CONTROL_T_TELCO_S','in');
insert into PDATA_TELCO_S values (4,'_SYS_AFL.T_KMEANS_RESULT_TELCO_S','out');
/* Generate the PAL function by calling the AFL Wrapper Generator */
call SYSTEM.afl_wrapper_generator('ValidateTelcoKM','AFLPAL','VALIDATEKMEANS',PDATA_TELCO_S);
Now we should have a procedure in the _SYS_AFL schema called ValidateTelcoKM
Next step is writing the code to execute the validation procedure:
/* This is the table that contains the input parameters */
DROP TABLE #CONTROL_TAB_TELCO_S;
CREATE LOCAL TEMPORARY COLUMN TABLE #CONTROL_TAB_TELCO_S (
"Name" VARCHAR(100),
"intArgs" INT,
"doubleArgs" DOUBLE,
"strArgs" VARCHAR(100));
/* Fill the Parameters Table */
INSERT INTO #CONTROL_TAB_TELCO_S VALUES ('VARIABLE_NUM', 8, null, null); --> Number of Attributes used to do the segmentation
INSERT INTO #CONTROL_TAB_TELCO_S VALUES ('THREAD_NUMBER', 2, null, null); --> Number of threads to be used during the execution
/* This view shows the result of the clustering process */
DROP VIEW V_KMEANS_TYPE_ASSIGN_TELCO_S;
CREATE VIEW V_KMEANS_TYPE_ASSIGN_TELCO_S AS
SELECT "ID", "CENTER_ASSIGN" AS "TYPE_ASSIGN" FROM PAL_KMEANS_RESASSIGN_TAB_TELCO;
/* This table will contain the silhouette score of the entire data set */
DROP TABLE KMEANS_SVALUE_TAB_TELCO_S;
CREATE COLUMN TABLE KMEANS_SVALUE_TAB_TELCO_S (
"NAME" VARCHAR (50),
"S" DOUBLE
);
/* Call the Validate KMeans procedure */
CALL ValidateTelcoKM(LSPARVIERI.TELCO, V_KMEANS_TYPE_ASSIGN_TELCO_S, "#CONTROL_TAB_TELCO_S", KMEANS_SVALUE_TAB_TELCO_S) with overview;
/* Show the results */
SELECT * FROM KMEANS_SVALUE_TAB_TELCO_S;
The resulting Silhouette Score for my clustering process is:
Since this value is greater than 0 and close enough to 1, it means we used the right number of Ks when running the clustering process.
I hope you liked it!
Follow me on Twitter: @LukiSpa
Info en Español sobre SAP HANA™:
www.HablemosHANA.com
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
User | Count |
---|---|
4 | |
4 | |
4 | |
3 | |
3 | |
3 | |
3 | |
3 | |
3 | |
2 |