Hi,
Welcome to the HANA ML Python API for the Sequential Pattern Mining (SPM) method. I explained the first four methods of association analysis in my previous blog post.
Note: Make sure your Python environment with HANA ML is up and running; if not, please follow the steps mentioned in the previous blog post.
Introduction and Algorithm description:
The sequential pattern mining algorithm (SPM) searches for frequent patterns in sequence databases. A sequence database consists of ordered elements or events. For example, a customer first buys bread, then eggs and cheese, and then milk. This forms a sequence of three ordered events. An event or subsequence is considered frequent if its support, i.e. the number of sequences that contain it, is greater than a certain value. The algorithm finds all patterns in the input sequences that satisfy a user-defined minimum support.
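As a sketch of the support computation described above (a minimal pure-Python illustration, not the PAL implementation; the toy sequence database and helper names are my own):

```python
# Toy sequence database: each entry is one customer's ordered purchases.
sequences = [
    ["bread", "eggs", "milk"],
    ["bread", "milk"],
    ["eggs", "cheese", "milk"],
    ["bread", "eggs", "cheese"],
]

def contains_subsequence(sequence, pattern):
    """True if `pattern` occurs in `sequence` in order (not necessarily contiguously)."""
    it = iter(sequence)
    return all(item in it for item in pattern)

def support(pattern, db):
    """Fraction of sequences in the database that contain the pattern."""
    return sum(contains_subsequence(s, pattern) for s in db) / len(db)

print(support(["bread", "milk"], sequences))  # 0.5: two of the four sequences
```

With a user-defined minimum support of, say, 0.5, the pattern ["bread", "milk"] would be reported as frequent, while ["cheese", "eggs"] (support 0.0) would not.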
Understand sequential pattern mining before going into practice:
- Find all subsets of items that occur in a specific order across transactions:
e.g. {Playing cricket -> high ECG -> Sweating}.
- Find all rules that correlate the order of one set of items with a following set of items in the transaction database:
e.g. 72% of users who perform a web search and then make a long eye gaze over the ads follow that with a successful ad click.
Prerequisites:
- The input data does not contain null values.
- There are no duplicated items in each transaction.
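If your raw data might violate these prerequisites, a quick pandas clean-up before uploading could look like this (a sketch with made-up sample rows; the column names follow the dataset used later in this post):

```python
import pandas as pd

# Hypothetical raw data containing a null item and a duplicated item.
raw = pd.DataFrame({
    "CUSTID":  ["A", "A", "A", "B"],
    "TRANSID": [1, 1, 1, 1],
    "ITEMS":   ["Apple", "Apple", None, "Blueberry"],
})

clean = (raw.dropna(subset=["ITEMS"])                          # no null values
            .drop_duplicates(["CUSTID", "TRANSID", "ITEMS"]))  # no duplicated items per transaction
print(clean)  # two rows remain: (A, 1, Apple) and (B, 1, Blueberry)
```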
Let's Start:
Import Packages
##First, import packages needed in the data loading.
from hana_ml import dataframe
from data_load_utils import DataSets, Settings
Setup Connection
In our case, the data is loaded into a table called "PAL_SPM_DATA_TBL" in HANA from a csv file. To do that, a connection to HANA is created and then passed to the data loader. To create such a connection, a config file, config/e2edata.ini, is used to control the connection parameters. A sample section of the config file is shown below, which includes the HANA url, port, user and password information.
***************************
[hana]
url=host-url
user=username
passwd=userpassword
port=3xx15
***************************
Maintain the login information in the config file and have it ready in your root folder.
url, port, user, pwd = Settings.load_config("../config/e2edata.ini")
connection_context = dataframe.ConnectionContext(url, port, user, pwd)
print(connection_context.connection.isconnected())
If the connection is successful, this prints "True".
Data-Set:
We will analyze store data for frequent pattern mining. This is the sample data available on SAP's help webpage.
Attribute Information:
- CUSTID - Customer ID
- TRANSID - Transaction ID
- ITEMS - Item of Transaction
Load Data for SPM:
Check if the table already exists in your schema: SELECT * FROM PAL_SPM_DATA_TBL
data_tbl = DataSets.load_spm_data(connection_context)
The function DataSets.load_spm_data() decides whether to load the data from scratch or to reuse the existing table. If it is the first time the data is loaded, an example of the returned message is shown below:
ERROR:hana_ml.dataframe:Failed to get row count for the current Dataframe, (259, 'invalid table name: Could not find table/view
PAL_SPM_DATA_TBL in schema DM_PAL: line 1 col 37 (at pos 36)')
Table PAL_SPM_DATA_TBL doesn't exist in schema DM_PAL
Creating table PAL_SPM_DATA_TBL in schema DM_PAL ....
Drop unsuccessful
Creating table DM_PAL.PAL_SPM_DATA_TBL
Data Loaded:100%
#####################
If the data is already loaded into HANA:
data_tbl = DataSets.load_spm_data(connection_context)
print("Table Name is: " +str(data_tbl))
import pandas as pd
Table PAL_SPM_DATA_TBL exists and data exists
Table Name is: PAL_SPM_DATA_TBL
Create a HANA dataframe for the data loaded into SAP HANA
##Create a dataframe df_spm from PAL_SPM_DATA_TBL for the following steps.
df_spm = pd.DataFrame(columns=['CUSTID', 'TRANSID', 'ITEMS'])
df_spm = dataframe.create_dataframe_from_pandas(connection_context=connection_context, pandas_df=df_spm, table_name=data_tbl, force=False, replace=True)
Re-execute the command below and check what message it returns
data_tbl = DataSets.load_spm_data(connection_context)
print("Table Name is: " +str(data_tbl))
Creating table DM_PAL.PAL_SPM_DATA_TBL
Data Loaded:100%
Table Name is: PAL_SPM_DATA_TBL
df = df_spm
Display dataframe records
df.collect().head(100)  ##Display data
df = df.dropna()  ##Drop rows containing NaN, if any blank records are present in your dataset
print("Total Number of Records : " + str(df.count()))
print("Columns:")
df.columns
Data Operations:
Get insights into the dataset using the following operations; you can explore further by applying more functions.
Filter Data:
Fetch all records for customer CUSTID = 'A'
df.filter("CUSTID = 'A'").head(10).collect()
Fetch all records having transaction id = 1
df.filter('TRANSID = 1').head(10).collect()
Fetch all customers & transactions having item = 'Apple'
df.filter("ITEMS = 'Apple'").head(10).collect()
Group by on Data:
GroupBy ITEMS
df.agg([('count', 'ITEMS', 'TOTAL TRANSACTIONS')], group_by='ITEMS').head(100).collect()
GroupBy CUSTID
df.agg([('count', 'CUSTID', 'TOTAL TRANSACTIONS')], group_by='CUSTID').head(100).collect()
GroupBy TRANSID
df.agg([('count', 'TRANSID', 'TOTAL TRANSACTIONS')], group_by='TRANSID').head(100).collect()
Display the most popular items:
Import the matplotlib and wordcloud libraries to plot the most popular items in our dataset
import matplotlib.pyplot as plt
from wordcloud import WordCloud
plt.rcParams['figure.figsize'] = (10, 10)
wordcloud = WordCloud(background_color='white', width=500, height=500, max_words=120).generate(' '.join(df_spm.head(100).collect()['ITEMS']))  ##Build the cloud from the ITEMS column only, so column names don't appear as words
plt.imshow(wordcloud)
plt.axis('off')
plt.title('Most Popular Items',fontsize = 10)
plt.show()
The most popular items in the plot are Apple and Blueberry; counting the transactions for each:
Count of Apple: df.filter("ITEMS = 'Apple'").head(100).count() = 10
Count of Blueberry: df.filter("ITEMS = 'Blueberry'").head(100).count() = 10
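The same counts can be obtained without a plot by pulling the table into pandas and counting item frequencies (a sketch using a small stand-in frame in place of the real df.collect() output):

```python
import pandas as pd

# Stand-in for df.collect(); in the live session you would use the HANA dataframe.
pdf = pd.DataFrame({"ITEMS": ["Apple", "Blueberry", "Apple", "Cheese", "Apple", "Blueberry"]})

item_counts = pdf["ITEMS"].value_counts()  # most popular items first
print(item_counts.head(10))
```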
Import SPM Method from HANA ML Library:
from hana_ml.algorithms.pal.association import SPM
Setup SPM instance:
sp = SPM(conn_context=connection_context,
min_support=0.5,
relational=False,
ubiquitous=1.0,
max_len=10,
min_len=1,
calc_lift=True)
Method
fit(data,
customer=None,
transaction=None,
item=None,
item_restrict=None,
min_gap=None)
Performs sequential pattern mining on the input data. Execute the method to train the model on the dataset.
sp.fit(data=df_spm, customer='CUSTID', transaction='TRANSID', item='ITEMS')
Collect the result
sp.result_.collect()
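Once the result is collected, you can narrow it down to the strong rules. The sketch below filters a stand-in pandas frame (the PATTERN strings and sample values are illustrative, modelled on the result analysis in the next section):

```python
import pandas as pd

# Stand-in for sp.result_.collect(); column names follow the PAL SPM output.
result = pd.DataFrame({
    "PATTERN":    ["{Apple}", "{Apple},{Blueberry}", "{Cheese}"],
    "SUPPORT":    [1.0, 0.88, 0.50],
    "CONFIDENCE": [0.0, 0.88, 0.00],
    "LIFT":       [0.0, 0.89, 0.00],
})

# Keep only patterns with reasonably high support and confidence.
strong = result[(result["SUPPORT"] >= 0.7) & (result["CONFIDENCE"] >= 0.7)]
print(strong)  # only the {Apple},{Blueberry} pattern survives
```

On the live result you could filter server-side instead, e.g. sp.result_.filter('SUPPORT >= 0.7').collect().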
Result Analysis:
- Item-set {Apple} has support 1.0, meaning the item appears in every sequence; it is the most frequent item. Confidence and lift are 0 for all single items, since a single item has no antecedent and consequent.
- Consider (Apple, Blueberry): support is 0.88 (the two items appear together in 88% of sequences); a confidence of 88% means that if someone buys Apple, there is an 88% chance Blueberry is also in their basket; a lift of 0.89 is close to 1, indicating the two purchases are nearly independent (a lift well above 1 would indicate a strong positive association).
- The result shows that most item-sets have support and confidence above 70%, indicating highly frequent item-sets that can be considered strong association rules.
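To make the relationship between the three metrics explicit, here is a small arithmetic check of the figures quoted above (the helper functions are my own, but the formulas are the standard definitions):

```python
def confidence(support_ab, support_a):
    """confidence(A -> B) = support(A, B) / support(A)"""
    return support_ab / support_a

def lift(support_ab, support_a, support_b):
    """lift(A -> B) = support(A, B) / (support(A) * support(B))"""
    return support_ab / (support_a * support_b)

# Figures from the result analysis: support(Apple) = 1.0, support(Apple, Blueberry) = 0.88.
print(confidence(0.88, 1.0))  # 0.88, matching the quoted confidence
# A lift of exactly 1 means the items are bought independently of each other:
print(lift(0.5, 1.0, 0.5))    # 1.0
```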
PAL Reading:
SAP HANA Predictive Analysis Library (PAL)
Note: I will attach the Python notebook and dataset soon; stay tuned.
Feedback, suggestions and questions are most welcome 🙂
Thanks,
Shivam