Technology Blog Posts by SAP
shivamshukla12
Hi,

Welcome to the HANA ML Python API for sequential pattern mining (SPM). I explained the first four methods of association analysis in my previous blog post.

Note: Make sure your Python environment with HANA ML is up and running; if not, please follow the steps mentioned in the previous blog post.

Introduction and Algorithm description:


The sequential pattern mining (SPM) algorithm searches for frequent patterns in sequence databases. A sequence database consists of ordered elements or events. For example, a customer first buys bread, then eggs and cheese, and then milk; this forms a sequence of three ordered events. An event or subsequence is considered frequent if its support, which is the number of sequences that contain this event or subsequence, is greater than a certain value. The algorithm finds all patterns in the input sequences that satisfy a user-defined minimum support.
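To build intuition for "support", here is a minimal plain-Python sketch (not the PAL implementation, and simplified to single-item events rather than itemsets) that counts how many sequences in a toy database contain a candidate subsequence:

```python
def is_subsequence(pattern, sequence):
    """True if pattern's items occur in sequence in order (not necessarily contiguously)."""
    it = iter(sequence)
    return all(item in it for item in pattern)

def support(pattern, database):
    """Fraction of sequences in the database that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in database) / len(database)

# Toy sequence database: each customer's purchases in order
db = [
    ["bread", "eggs", "milk"],
    ["bread", "milk"],
    ["eggs", "cheese", "milk"],
]
print(support(["bread", "milk"], db))  # pattern occurs in 2 of the 3 sequences
```

With a minimum support of, say, 0.5, the pattern {bread -> milk} would count as frequent here, since 2/3 of the sequences contain it in that order.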

Understand sequential pattern mining before going into practice:

  • Find all subsets of items that occur in a specific order across the transactions:
    e.g. {Playing cricket -> high ECG -> Sweating}.

  • Find all rules that correlate the order of one set of items relative to another set in the transaction database:
    e.g. 72% of users who perform a web search and then gaze at the ads
    follow that with a successful ad click.


Prerequisites:

  • The input data does not contain null values.

  • There are no duplicated items within a transaction.
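Both prerequisites can be checked up front on the raw data with pandas. This is an illustrative sketch; the toy rows and column names below are assumptions matching the dataset described later:

```python
import pandas as pd

# Toy stand-in for the real input data (CUSTID, TRANSID, ITEMS)
raw = pd.DataFrame({
    "CUSTID":  ["A", "A", "A", "B"],
    "TRANSID": [1, 1, 2, 1],
    "ITEMS":   ["Apple", "Blueberry", "Apple", "Apple"],
})

# Prerequisite 1: no null values anywhere
assert not raw.isnull().values.any(), "input contains nulls"

# Prerequisite 2: no duplicated item within one (customer, transaction) pair
has_dups = raw.duplicated(subset=["CUSTID", "TRANSID", "ITEMS"]).any()
assert not has_dups, "duplicated item inside a transaction"
print("prerequisites satisfied")
```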


 

Let's Start:

Import Packages
##First, import the packages needed for data loading.
from hana_ml import dataframe
from data_load_utils import DataSets, Settings

Setup Connection

In our case, the data is loaded from a csv file into a table called "PAL_SPM_DATA_TBL" in HANA. To do that, a connection to HANA is created and passed to the data loader. To create such a connection, a config file, config/e2edata.ini, is used to hold the connection parameters. A sample section of the config file is shown below, including the HANA url, port, user, and password.

***************************

[hana]
url=host-url
user=username
passwd=userpassword
port=3xx15

***************************
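If the data_load_utils helper is not available in your setup, the same [hana] section can be read with Python's standard configparser module. This is a sketch; the section and key names follow the sample above, and the inline string stands in for reading the actual file:

```python
import configparser

# Inline copy of the sample section above; with a real file you would call
# config.read("../config/e2edata.ini") instead of read_string.
sample = """
[hana]
url=host-url
user=username
passwd=userpassword
port=3xx15
"""

config = configparser.ConfigParser()
config.read_string(sample)

url = config.get("hana", "url")
port = config.get("hana", "port")
user = config.get("hana", "user")
passwd = config.get("hana", "passwd")
print(url, port, user)
```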

Maintain the login information in the config file and have it ready in your root folder:
url, port, user, pwd = Settings.load_config("../config/e2edata.ini")
connection_context = dataframe.ConnectionContext(url, port, user, pwd)
print(connection_context.connection.isconnected())

If the connection is successful, this prints "True".


Data-Set:


We will analyze store data for frequent pattern mining. This is the sample data available on SAP's help page.

Attribute Information:

  1.  CUSTID -   Customer ID

  2.  TRANSID - Transaction ID

  3.  ITEMS - Item of Transaction


Load Data for SPM:


Check whether the table already exists in your schema: SELECT * FROM PAL_SPM_DATA_TBL


data_tbl = DataSets.load_spm_data(connection_context)

The function DataSets.load_spm_data() decides whether to load the data from scratch or reuse the existing table. On the first load, the returned messages look like this:

ERROR:hana_ml.dataframe:Failed to get row count for the current Dataframe, (259, 'invalid table name: Could not find table/view

PAL_SPM_DATA_TBL in schema DM_PAL: line 1 col 37 (at pos 36)')
Table PAL_SPM_DATA_TBL doesn't exist in schema DM_PAL

Creating table PAL_SPM_DATA_TBL in schema DM_PAL ....
Drop unsuccessful

Creating table DM_PAL.PAL_SPM_DATA_TBL

Data Loaded:100%

#####################

If the data is already loaded into HANA:
data_tbl = DataSets.load_spm_data(connection_context)
print("Table Name is: " +str(data_tbl))
import pandas as pd

Table PAL_SPM_DATA_TBL exists and data exists
Table Name is: PAL_SPM_DATA_TBL

Create a HANA dataframe from a pandas DataFrame for the data loaded into SAP HANA


##Create a dataframe df_spm for table PAL_SPM_DATA_TBL for the following steps.
df_spm = pd.DataFrame(columns=['CUSTID' , 'TRANSID' , 'ITEMS'])
df_spm = dataframe.create_dataframe_from_pandas(connection_context=connection_context, pandas_df=df_spm, table_name=data_tbl, force=False, replace=True)

Re-execute the command below and check the message it returns:
data_tbl = DataSets.load_spm_data(connection_context)
print("Table Name is: " +str(data_tbl))

Creating table DM_PAL.PAL_SPM_DATA_TBL
Data Loaded:100%
Table Name is: PAL_SPM_DATA_TBL


df = df_spm

Display dataframe records
df.collect().head(100) ##Display Data


df = df.dropna() ##Drop rows containing NULLs, if any (dropna returns a new dataframe)

print("Total Number of Records : " + str(df.count()))

print("Columns:")
df.columns

Data Operations:


Get insights into the dataset using the following operations; you can explore further by applying more functions.

Filter Data:


Fetch all records where CUSTID = 'A'
df.filter("CUSTID = 'A'").head(10).collect()



Fetch all records with transaction id = 1
df.filter('TRANSID = 1').head(10).collect()



Fetch all records with item = 'Apple'
df.filter("ITEMS = 'Apple'").head(10).collect()
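Since filter() takes a SQL-style predicate string, conditions can also be combined, e.g. df.filter("CUSTID = 'A' AND ITEMS = 'Apple'"). The same logic can be mirrored on a local pandas frame; the toy rows below are illustrative, not the actual table contents:

```python
import pandas as pd

local = pd.DataFrame({
    "CUSTID":  ["A", "A", "B"],
    "TRANSID": [1, 2, 1],
    "ITEMS":   ["Apple", "Blueberry", "Apple"],
})

# Equivalent of df.filter("CUSTID = 'A' AND ITEMS = 'Apple'") on the HANA side
subset = local.query("CUSTID == 'A' and ITEMS == 'Apple'")
print(subset)
```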

Group by on Data:


GroupBy ITEMS
df.agg([('count' , 'ITEMS' , 'TOTAL TRANSACTIONS')] , group_by='ITEMS').head(100).collect()



GroupBy CUSTID
df.agg([('count' , 'CUSTID', 'TOTAL TRANSACTIONS')] , group_by='CUSTID').head(100).collect()



GroupBy TRANSID
df.agg([('count' , 'TRANSID', 'TOTAL TRANSACTIONS')] , group_by='TRANSID').head(100).collect()
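For intuition, the ITEMS aggregation above corresponds to this pandas group-by on a small local copy of the data (toy rows for illustration, not the actual table contents):

```python
import pandas as pd

local = pd.DataFrame({
    "CUSTID":  ["A", "A", "A", "B", "B"],
    "TRANSID": [1, 1, 2, 1, 1],
    "ITEMS":   ["Apple", "Blueberry", "Apple", "Apple", "Blueberry"],
})

# Count rows per item, mirroring agg([('count', 'ITEMS', 'TOTAL TRANSACTIONS')], group_by='ITEMS')
counts = (local.groupby("ITEMS")["ITEMS"]
               .count()
               .rename("TOTAL TRANSACTIONS")
               .reset_index())
print(counts)
```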



 

Display the most popular items:


Import the matplotlib and wordcloud libraries to plot the most popular items in our dataset:
import matplotlib.pyplot as plt
from wordcloud import WordCloud
plt.rcParams['figure.figsize'] = (10, 10)
wordcloud = WordCloud(background_color = 'white', width = 500, height = 500, max_words = 120).generate(str(df_spm.head(100).collect()))
plt.imshow(wordcloud)
plt.axis('off')
plt.title('Most Popular Items',fontsize = 10)
plt.show()

 



 

The most popular items in the plot are Apple and Blueberry. Counting the transactions for each:

Count of Apple - df.filter("ITEMS = 'Apple'").head(100).count() = 10

Count of Blueberry - df.filter("ITEMS = 'Blueberry'").head(100).count() = 10
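These counts translate directly into SPM support values. A quick sanity check in plain Python; the sequence counts below are assumptions based on this toy dataset:

```python
# Support = fraction of customer sequences that contain the item.
total_sequences = 10        # assumed number of distinct customer sequences
sequences_with_apple = 10   # sequences containing at least one 'Apple'

support_apple = sequences_with_apple / total_sequences
min_support = 0.5           # threshold used for the SPM instance below
print(support_apple, support_apple >= min_support)  # 1.0 True -> 'Apple' is frequent
```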

 

Import SPM Method from HANA ML Library:


from hana_ml.algorithms.pal.association import SPM

Setup SPM instance:


sp = SPM(conn_context=connection_context,
         min_support=0.5,
         relational=False,
         ubiquitous=1.0,
         max_len=10,
         min_len=1,
         calc_lift=True)

Method

fit(data, customer=None, transaction=None, item=None, item_restrict=None, min_gap=None)

Performs sequential pattern mining on the input data. Execute the method:

Train the model on the dataset:
sp.fit(data=df_spm, customer='CUSTID', transaction='TRANSID', item='ITEMS')

Collect the result
sp.result_.collect()



Result Analysis:

  • The item set {Apple} has support 1.0, meaning the item appears in every customer's sequence; it is the most frequent item. Confidence and lift are 0 for all single items, since they have no antecedent and consequent.

  • Consider (Apple, Blueberry): support is 0.88, i.e. these items appear together in 88% of the sequences, and confidence is 88%, meaning that if a customer buys Apple there is an 88% chance Blueberry is also in their basket. The lift of 0.89 is close to 1, which indicates the two purchases are nearly independent; a lift above 1 would indicate positive association.

  • Most item sets in the result have support and confidence above 70%, indicating highly frequent item sets and reasonably strong association rules.
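The confidence and lift figures in the analysis follow directly from the supports. A sketch of the arithmetic, using the approximate values quoted above (the post reports lift 0.89, which implies slightly different underlying supports; the Blueberry support is an assumption here):

```python
# Rule Apple -> Blueberry, from the (approximate) supports reported above
support_apple = 1.0       # fraction of sequences containing Apple
support_blueberry = 1.0   # assumption: Blueberry also appears in every sequence
support_both = 0.88       # fraction of sequences containing both items

confidence = support_both / support_apple   # P(Blueberry | Apple)
lift = confidence / support_blueberry       # ~1 means near-independence, >1 positive association
print(confidence, lift)
```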


 

PAL Reading:


SAP HANA Predictive Analysis Library (PAL)


Note: I will attach the Python notebook and dataset soon; stay tuned.

 

 

Feedback/Suggestions/Questions are most welcome 🙂

Thanks,
Shivam