Technology Blog Posts by SAP
shivamshukla12
Hi,

Welcome to the HANA ML Python API for sequential pattern mining (SPM). I explained the first four methods of association analysis in my previous blog post.

Note: Make sure your Python environment with HANA ML is up and running; if not, please follow the steps mentioned in the previous blog post.

Introduction and Algorithm description:


The sequential pattern mining (SPM) algorithm searches for frequent patterns in sequence databases. A sequence database consists of ordered elements or events. For example, a customer first buys bread, then eggs and cheese, and then milk; this forms a sequence of three ordered events. An event or subsequence is considered frequent if its support, which is the number of sequences that contain this event or subsequence, is greater than a certain value. The algorithm finds all patterns in the input sequences that satisfy a user-defined minimum support.
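To build intuition for "support", here is a minimal plain-Python sketch (not the PAL implementation, and simplified to single-item events rather than itemsets) that counts how many sequences in a toy database contain a candidate subsequence:

```python
def is_subsequence(pattern, sequence):
    """True if pattern's items occur in sequence in order (not necessarily contiguously)."""
    it = iter(sequence)
    return all(item in it for item in pattern)

def support(pattern, database):
    """Fraction of sequences in the database that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in database) / len(database)

# Toy sequence database: each customer's purchases in order
db = [
    ["bread", "eggs", "milk"],
    ["bread", "milk"],
    ["eggs", "cheese", "milk"],
]
print(support(["bread", "milk"], db))  # pattern occurs in 2 of the 3 sequences
```

With a minimum support of, say, 0.5, the pattern {bread -> milk} would count as frequent here, since 2/3 of the sequences contain it in that order.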

Understand sequential pattern mining before going into practice:

  • Find all subsets of items that occur in a specific order across the transactions:
    e.g. {Playing cricket -> high ECG -> Sweating}.

  • Find all rules that correlate the order of one set of items relative to another set in the transaction database:
    e.g. 72% of users who perform a web search and then gaze at the ads
    follow that with a successful ad click.


Prerequisites:

  • The input data does not contain null values.

  • There are no duplicated items within a transaction.
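Both prerequisites can be checked up front on the raw data with pandas. This is an illustrative sketch; the toy rows and column names below are assumptions matching the dataset described later:

```python
import pandas as pd

# Toy stand-in for the real input data (CUSTID, TRANSID, ITEMS)
raw = pd.DataFrame({
    "CUSTID":  ["A", "A", "A", "B"],
    "TRANSID": [1, 1, 2, 1],
    "ITEMS":   ["Apple", "Blueberry", "Apple", "Apple"],
})

# Prerequisite 1: no null values anywhere
assert not raw.isnull().values.any(), "input contains nulls"

# Prerequisite 2: no duplicated item within one (customer, transaction) pair
has_dups = raw.duplicated(subset=["CUSTID", "TRANSID", "ITEMS"]).any()
assert not has_dups, "duplicated item inside a transaction"
print("prerequisites satisfied")
```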


 

Let's Start:

Import Packages
##First, import the packages needed for data loading.
from hana_ml import dataframe
from data_load_utils import DataSets, Settings

Setup Connection

In our case, the data is loaded from a csv file into a table called "PAL_SPM_DATA_TBL" in HANA. To do that, a connection to HANA is created and passed to the data loader. To create such a connection, a config file, config/e2edata.ini, is used to hold the connection parameters. A sample section of the config file is shown below, including the HANA url, port, user, and password.

***************************

[hana]
url=host-url
user=username
passwd=userpassword
port=3xx15

***************************
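If the data_load_utils helper is not available in your setup, the same [hana] section can be read with Python's standard configparser module. This is a sketch; the section and key names follow the sample above, and the inline string stands in for reading the actual file:

```python
import configparser

# Inline copy of the sample section above; with a real file you would call
# config.read("../config/e2edata.ini") instead of read_string.
sample = """
[hana]
url=host-url
user=username
passwd=userpassword
port=3xx15
"""

config = configparser.ConfigParser()
config.read_string(sample)

url = config.get("hana", "url")
port = config.get("hana", "port")
user = config.get("hana", "user")
passwd = config.get("hana", "passwd")
print(url, port, user)
```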

Maintain the login information in the config file and have it ready in your root folder:
url, port, user, pwd = Settings.load_config("../config/e2edata.ini")
connection_context = dataframe.ConnectionContext(url, port, user, pwd)
print(connection_context.connection.isconnected())

If the connection is successful, this prints "True".


Data-Set:


We will analyze store data for frequent pattern mining. This is the sample data available on SAP's help page.

Attribute Information:

  1.  CUSTID -   Customer ID

  2.  TRANSID - Transaction ID

  3.  ITEMS - Item of Transaction


Load Data for SPM:


Check whether the table already exists in your schema: SELECT * FROM PAL_SPM_DATA_TBL


data_tbl = DataSets.load_spm_data(connection_context)

The function DataSets.load_spm_data() decides whether to load the data from scratch or reuse the existing table. On the first load, the returned messages look like this:

ERROR:hana_ml.dataframe:Failed to get row count for the current Dataframe, (259, 'invalid table name: Could not find table/view

PAL_SPM_DATA_TBL in schema DM_PAL: line 1 col 37 (at pos 36)')
Table PAL_SPM_DATA_TBL doesn't exist in schema DM_PAL

Creating table PAL_SPM_DATA_TBL in schema DM_PAL ....
Drop unsuccessful

Creating table DM_PAL.PAL_SPM_DATA_TBL

Data Loaded:100%

#####################

If the data is already loaded into HANA:
data_tbl = DataSets.load_spm_data(connection_context)
print("Table Name is: " +str(data_tbl))
import pandas as pd

Table PAL_SPM_DATA_TBL exists and data exists
Table Name is: PAL_SPM_DATA_TBL

Create a HANA dataframe from a pandas DataFrame for the data loaded into SAP HANA


##Create a dataframe df_spm for table PAL_SPM_DATA_TBL for the following steps.
df_spm = pd.DataFrame(columns=['CUSTID' , 'TRANSID' , 'ITEMS'])
df_spm = dataframe.create_dataframe_from_pandas(connection_context=connection_context, pandas_df=df_spm, table_name=data_tbl, force=False, replace=True)

Re-execute the command below and check the message it returns:
data_tbl = DataSets.load_spm_data(connection_context)
print("Table Name is: " +str(data_tbl))

Creating table DM_PAL.PAL_SPM_DATA_TBL
Data Loaded:100%
Table Name is: PAL_SPM_DATA_TBL


df = df_spm

Display dataframe records
df.collect().head(100) ##Display Data


df = df.dropna() ##Drop rows containing NULLs, if any (dropna returns a new dataframe)

print("Total Number of Records : " + str(df.count()))

print("Columns:")
df.columns

Data Operations:


Get insights into the dataset using the following operations; you can explore further by applying more functions.

Filter Data:


Fetch all records where CUSTID = 'A'
df.filter("CUSTID = 'A'").head(10).collect()



Fetch all records with transaction id = 1
df.filter('TRANSID = 1').head(10).collect()



Fetch all records with item = 'Apple'
df.filter("ITEMS = 'Apple'").head(10).collect()
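Since filter() takes a SQL-style predicate string, conditions can also be combined, e.g. df.filter("CUSTID = 'A' AND ITEMS = 'Apple'"). The same logic can be mirrored on a local pandas frame; the toy rows below are illustrative, not the actual table contents:

```python
import pandas as pd

local = pd.DataFrame({
    "CUSTID":  ["A", "A", "B"],
    "TRANSID": [1, 2, 1],
    "ITEMS":   ["Apple", "Blueberry", "Apple"],
})

# Equivalent of df.filter("CUSTID = 'A' AND ITEMS = 'Apple'") on the HANA side
subset = local.query("CUSTID == 'A' and ITEMS == 'Apple'")
print(subset)
```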

Group by on Data:


GroupBy ITEMS
df.agg([('count' , 'ITEMS' , 'TOTAL TRANSACTIONS')] , group_by='ITEMS').head(100).collect()



GroupBy CUSTID
df.agg([('count' , 'CUSTID', 'TOTAL TRANSACTIONS')] , group_by='CUSTID').head(100).collect()



GroupBy TRANSID
df.agg([('count' , 'TRANSID', 'TOTAL TRANSACTIONS')] , group_by='TRANSID').head(100).collect()
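For intuition, the ITEMS aggregation above corresponds to this pandas group-by on a small local copy of the data (toy rows for illustration, not the actual table contents):

```python
import pandas as pd

local = pd.DataFrame({
    "CUSTID":  ["A", "A", "A", "B", "B"],
    "TRANSID": [1, 1, 2, 1, 1],
    "ITEMS":   ["Apple", "Blueberry", "Apple", "Apple", "Blueberry"],
})

# Count rows per item, mirroring agg([('count', 'ITEMS', 'TOTAL TRANSACTIONS')], group_by='ITEMS')
counts = (local.groupby("ITEMS")["ITEMS"]
               .count()
               .rename("TOTAL TRANSACTIONS")
               .reset_index())
print(counts)
```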



 

Display the most popular items:


Import the matplotlib and wordcloud libraries to plot the most popular items in our dataset:
import matplotlib.pyplot as plt
from wordcloud import WordCloud
plt.rcParams['figure.figsize'] = (10, 10)
wordcloud = WordCloud(background_color = 'white', width = 500, height = 500, max_words = 120).generate(str(df_spm.head(100).collect()))
plt.imshow(wordcloud)
plt.axis('off')
plt.title('Most Popular Items',fontsize = 10)
plt.show()

 



 

The most popular items in the plot are Apple and Blueberry. Counting the transactions for each:

Count of Apple - df.filter("ITEMS = 'Apple'").head(100).count() = 10

Count of Blueberry - df.filter("ITEMS = 'Blueberry'").head(100).count() = 10
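These counts translate directly into SPM support values. A quick sanity check in plain Python; the sequence counts below are assumptions based on this toy dataset:

```python
# Support = fraction of customer sequences that contain the item.
total_sequences = 10        # assumed number of distinct customer sequences
sequences_with_apple = 10   # sequences containing at least one 'Apple'

support_apple = sequences_with_apple / total_sequences
min_support = 0.5           # threshold used for the SPM instance below
print(support_apple, support_apple >= min_support)  # 1.0 True -> 'Apple' is frequent
```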

 

Import SPM Method from HANA ML Library:


from hana_ml.algorithms.pal.association import SPM

Setup SPM instance:


sp = SPM(conn_context=connection_context,
         min_support=0.5,
         relational=False,
         ubiquitous=1.0,
         max_len=10,
         min_len=1,
         calc_lift=True)

Method

fit(data, customer=None, transaction=None, item=None, item_restrict=None, min_gap=None)

Performs sequential pattern mining on the input data. Execute the method:

Train the model on the dataset:
sp.fit(data=df_spm, customer='CUSTID', transaction='TRANSID', item='ITEMS')

Collect the result
sp.result_.collect()



Result Analysis:

  • The item set {Apple} has support 1.0, meaning the item appears in every customer's sequence; it is the most frequent item. Confidence and lift are 0 for all single items, since they have no antecedent and consequent.

  • Consider (Apple, Blueberry): support is 0.88, i.e. these items appear together in 88% of the sequences, and confidence is 88%, meaning that if a customer buys Apple there is an 88% chance Blueberry is also in their basket. The lift of 0.89 is close to 1, which indicates the two purchases are nearly independent; a lift above 1 would indicate positive association.

  • Most item sets in the result have support and confidence above 70%, indicating highly frequent item sets and reasonably strong association rules.
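The confidence and lift figures in the analysis follow directly from the supports. A sketch of the arithmetic, using the approximate values quoted above (the post reports lift 0.89, which implies slightly different underlying supports; the Blueberry support is an assumption here):

```python
# Rule Apple -> Blueberry, from the (approximate) supports reported above
support_apple = 1.0       # fraction of sequences containing Apple
support_blueberry = 1.0   # assumption: Blueberry also appears in every sequence
support_both = 0.88       # fraction of sequences containing both items

confidence = support_both / support_apple   # P(Blueberry | Apple)
lift = confidence / support_blueberry       # ~1 means near-independence, >1 positive association
print(confidence, lift)
```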


 

PAL Reading:


SAP HANA Predictive Analysis Library (PAL)


Note: I will attach the Python notebook and dataset soon; stay tuned.

 

 

Feedback/Suggestions/Questions are most welcome 🙂

Thanks,
Shivam