
Introduction

Most traditional machine learning models are designed for structured data, e.g. tabular data with features and dependent variables. Text, on the other hand, is a typical instance of unstructured data. However, text data can be tabularized through text vectorization, albeit with some information loss. One typical technique for text vectorization is the renowned term frequency-inverse document frequency (TF-IDF) analysis, which is adopted in SAP HANA Predictive Analysis Library (PAL) for text mining. The TF-IDF transformed text data, together with the labels, can be fed directly into traditional classification models to train a deployable classifier. The trained classifier can then be preceded by the TF-IDF transformer (derived from the statistics of the TF-IDF analysis of the original text data) to form a pipeline model for predicting the tags/labels of future incoming texts. The entire workflow for text classification can be illustrated as follows:

                Text Input → TF-IDF Vectorizer → Classifier → Result

The classifier employed in the workflow above is usually k-nearest neighbors (KNN). One key advantage of the KNN classifier is that it essentially requires no training phase. However, at inference time, for each piece of query text it needs to scan the whole TF-IDF transformed dataset (or some pre-built tree structure) for similarity search, which is both computationally intensive and time-consuming. This is especially true when the incoming text data for prediction is of high volume: for example, a text classifier used as a spam-email filter will very likely face a very large number of suspicious emails.

To accelerate the inference speed of text classification, we have enhanced the text classification capability in SAP HANA Predictive Analysis Library (PAL): an explicit training phase is introduced following the TF-IDF analysis, which allows users to train a Random Decision Trees (RDT) model in this phase and then deploy it for future inference. The Random Decision Trees classifier is selected for our text classification pipeline because its ensemble of trees usually improves accuracy and reduces overfitting; besides, since each tree can be built independently, and later performs inference independently as well (before results are aggregated), the entire workflow is highly parallelizable. In a nutshell, by training an RDT model with a small amount of extra effort, we may enjoy a much faster inference speed, and possibly some improvement in prediction accuracy as well.

Background Knowledge

The content of the introduction section can be better understood with some background knowledge, especially on topics like text vectorization, TF-IDF analysis, the KNN classifier and the Random Decision Trees classifier. In this section we provide a basic description of these topics.

Text Vectorization

Text vectorization is also known as text representation or text encoding. It refers to the process of converting textual data into numerical vectors that machine learning models can understand and process. Since most machine learning algorithms work on numerical data in essence, text vectorization is a critical step for analyzing and modeling text data. The most commonly used techniques for text vectorization include Bag-of-Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF), Word Embeddings, etc.
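As a quick illustration of the Bag-of-Words idea, the following sketch uses scikit-learn's CountVectorizer; it is shown purely for illustration and is not part of PAL or hana-ml:

from sklearn.feature_extraction.text import CountVectorizer

# three tiny "documents" (illustrative toy corpus)
corpus = ["the cat sat", "the dog sat", "the cat and the dog"]
bow = CountVectorizer()
X = bow.fit_transform(corpus)        # sparse document-term count matrix
print(bow.get_feature_names_out())   # vocabulary: ['and' 'cat' 'dog' 'sat' 'the']
print(X.toarray())                   # one count vector per document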

TF-IDF Analysis

TF-IDF (Term Frequency-Inverse Document Frequency) analysis is a technique used in natural language processing to assess the importance of words in a document relative to a corpus (i.e., a collection of text documents of interest). It is a numerical statistic that reflects how frequently a word appears in a document and how unique the word is to that document among all the other documents in the corpus. As a possible preprocessing step, some less meaningful words/components, like punctuation marks, stop-words (e.g. articles, prepositions, conjunctions, pronouns) and numbers, may be excluded from the analysis. Then, given a document, each remaining word is assigned a value which is the product of two quantities: term frequency (TF) and inverse document frequency (IDF).

  1. Term frequency (TF) is the frequency of a word in a document, calculated by dividing the number of occurrences of a word in a document by the total number of words in that document.
  2. Inverse document frequency (IDF) measures the rarity or uniqueness of a word in the entire corpus and is a constant for that word relative to the corpus. It is derived from the ratio of the total number of documents in the corpus to the number of documents that contain the specific word (in practice, the logarithm of this ratio is commonly used).
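In the common textbook formulation (PAL's exact weighting scheme may differ in details such as smoothing), for a term t, a document d and a corpus D of N documents:

\mathrm{tf}(t,d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad \mathrm{idf}(t,D) = \log \frac{N}{\lvert \{ d \in D : t \in d \} \rvert}, \qquad \operatorname{tf\text{-}idf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t,D)

where f_{t,d} denotes the raw count of term t in document d.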

The resulting value is referred to as the TF-IDF value of that word w.r.t. the document. TF-IDF vectorization is then simply the process of representing a document by its vector of TF-IDF values over all words in the entire corpus. Note that if a word does not appear in a document, its TF-IDF value for that document is 0 by definition, so TF-IDF vectors of documents usually contain many zeros. In other words, TF-IDF vectorization tends to produce sparse representations for documents.
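The following self-contained toy sketch makes this concrete; it implements the plain formulas above on a three-document corpus and is illustrative only (it is not the PAL implementation):

import math
from collections import Counter

corpus = ["the cat sat on the mat",
          "the dog sat on the log",
          "cats and dogs are pets"]
docs = [doc.split() for doc in corpus]
vocab = sorted({w for doc in docs for w in doc})
n_docs = len(docs)
# document frequency: in how many documents each term occurs
df = {t: sum(1 for doc in docs if t in doc) for t in vocab}

def tfidf_vector(doc):
    counts, total = Counter(doc), len(doc)
    # tf = relative frequency in the document; idf = log(N / df)
    return [(counts[t] / total) * math.log(n_docs / df[t]) for t in vocab]

for doc in docs:
    # most entries are 0 -- the sparse representation mentioned above
    print([round(v, 3) for v in tfidf_vector(doc)])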

KNN Classifier

The KNN classifier is a non-parametric classification algorithm: it determines the class of a new data point by considering its k nearest neighbors in the training set, where k is some prespecified number (usually odd, to avoid voting ties). The KNN classifier works as follows:

  1. Training phase: the KNN algorithm simply stores the feature vectors and their corresponding class labels from the training set, i.e., without explicit model training or data compression.
  2. Prediction phase: for a new data point represented by a feature vector, the algorithm calculates the distances between the new point and all the points in the training set, and then selects the k nearest neighbors based on the sorted distances. The predicted class label of the new data point is then simply the most common label among those of the selected neighbors.
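A minimal sketch of the prediction phase (illustrative only; Euclidean distance is used here for simplicity, while cosine similarity is more common for TF-IDF vectors):

import numpy as np
from collections import Counter

def knn_predict(query, train_X, train_y, k=3):
    # distance from the query to every training point: O(n) work per query,
    # which is what makes high-volume inference expensive
    dists = np.linalg.norm(train_X - query, axis=1)
    nearest = np.argsort(dists)[:k]
    # majority vote among the labels of the k nearest neighbors
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]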

RDT Classifier

The Random Decision Trees (RDT) classifier is an ensemble learning method that combines multiple decision tree classifiers to improve prediction accuracy. Its randomness mainly originates from the following two aspects of the training phase, both illustrated in the sketch after this list:

  1. Each decision tree in the RDT model is built upon a random subset of the original training data.
  2. Each decision tree is built by randomly selecting a subset of features at each node for splitting.
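These two sources of randomness map directly onto the parameters of scikit-learn's RandomForestClassifier, an open-source analogue shown here purely for illustration (it is not the PAL API):

from sklearn.ensemble import RandomForestClassifier

rdt_like = RandomForestClassifier(
    n_estimators=20,      # number of trees in the ensemble
    max_depth=60,         # cap on the depth of each individual tree
    bootstrap=True,       # aspect 1: each tree trains on a random sample of the data
    max_features="sqrt",  # aspect 2: a random feature subset is considered at each split
    n_jobs=-1,            # trees are independent, so training parallelizes well
)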

 

The RDT classifier offers several advantages: it can handle large datasets with high dimensionality, it is resistant to overfitting, and it can effectively handle missing data. It is also insensitive to feature scaling thanks to its tree-based nature. Besides, since each tree can be built independently and performs inference independently as well, both the training and inference processes are highly parallelizable.

Overall, the RDT classifier is a powerful and versatile machine learning algorithm that can be used for various classification problems and often produces accurate predictions.

Example

In this section we show a simple example to demonstrate the effectiveness of the RDT model for text label inference. We use the Text Emotion Recognition dataset for demonstration. This dataset consists of many pieces of text content, each associated with a human emotion (joy or sadness), illustrated as follows:

[Figure: sample rows of the Text Emotion Recognition dataset, pairing text content with emotion labels]

The whole dataset has 282822 rows. After performing a stratified 1:9 train-test split, we have 28282 texts for training and the remaining 254540 for testing. This choice resembles the case where the data for inference is of high volume (relative to the volume of the training data).
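For reference, a 1:9 stratified split like the one above could be produced with scikit-learn as follows (a hedged sketch; emotion_df and its column names are hypothetical placeholders, not taken from the PAL example):

from sklearn.model_selection import train_test_split

# emotion_df: hypothetical pandas DataFrame with columns ["text", "label"]
train_df, test_df = train_test_split(
    emotion_df,
    test_size=0.9,                   # 10% train / 90% test, as in the experiment
    stratify=emotion_df["label"],    # keep the joy/sadness ratio equal in both parts
    random_state=0,
)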

Key statistics of our experiments are listed in the table below (classifiers are grouped for comparison based on their similar performance on the test dataset; execution times are averaged over 10 repetitions of the experiment):

 

Classifier                            | TF-IDF Analysis + Model Build (s) | Inference (s) | Test Accuracy
--------------------------------------|-----------------------------------|---------------|--------------
KNN (k=1)                             | 2.67                              | 42.91         | 0.7664
RDT (max_depth=60, n_estimators=20)   | 9.97                              | 17.99         | 0.8455
--------------------------------------|-----------------------------------|---------------|--------------
KNN (k=11)                            | 2.67                              | 45.89         | 0.8853
RDT (max_depth=200, n_estimators=20)  | 17.44                             | 18.31         | 0.9055
--------------------------------------|-----------------------------------|---------------|--------------
KNN (k=101)                           | 2.67                              | 79.48         | 0.9193
RDT (max_depth=1000, n_estimators=20) | 19.74                             | 18.77         | 0.9278

[Note: Due to randomness, the test accuracy of the RDT classifier recorded in the table above is the second-worst value over the 10 repetitions of the experiment, which ensures that the RDT model performs no worse than the corresponding KNN model with probability approximately 0.9.]

It can be seen from the table above that the training phase of KNN is much shorter than that of RDT, while the situation is reversed in the inference phase. In all three cases, compared with KNN, the extra time spent on RDT model training is more than compensated by the time saved in the inference phase. In particular, the inference time of the RDT model does not increase significantly as its base learners become more sophisticated (in this case, deeper), which is clearly advantageous compared to KNN.

Conclusion

The KNN classifier can be an inefficient component for text classification, especially in terms of inference speed. SAP HANA PAL now provides a new option that allows users to train an RDT model instead for text classification, so that the inference speed of text classification can be greatly improved. It suits particularly well the case where the cost of training an RDT model is small or moderate (e.g. the training corpus contains only a relatively small number of meaningful terms), while the inference demand is comparatively high.

Code Samples

In the following we show how to apply the RDT model for text classification through the Python machine learning client for SAP HANA (hana-ml).

In hana-ml, the newly introduced training and inference functionality for text classification is wrapped in a new class called TextClassificationWithModel. In the following code samples (with comments), text_train_data is assumed to be the training data for building the TF-IDF vectorizer and the RDT classifier, while text_test_data is the dataset for evaluating the performance of the composite model (i.e. TF-IDF vectorizer plus RDT classifier).

 

 

from hana_ml.text.tm import TextClassificationWithModel
# Step 1. Initialize a TextClassificationWithModel instance, including basic configs
# for the TF-IDF vectorizer and the RDT classifier.
tc = TextClassificationWithModel(lang='en', enable_stopwords=True, keep_numeric=False,
                                 n_estimators=10, max_depth=100)
# Step 2. Call the fit() method with the training data as input, which is equivalent to
# applying TF-IDF analysis to the training data first, and then building an RDT
# classifier upon the TF-IDF vectorized training data.
tc.fit(data=text_train_data, thread_ratio=1.0)  # use full threads: RDT fit is highly parallelizable
# Step 3. Predict the labels of new text data with the fitted instance, which applies
# TF-IDF vectorization followed by RDT classification.
prediction = tc.predict(data=text_test_data, thread_ratio=1.0)  # RDT inference is also highly parallelizable
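If the fitted pipeline needs to be persisted for later reuse, hana-ml's ModelStorage facility (see the Model Storage blog linked below) is the natural candidate; the following sketch assumes TextClassificationWithModel integrates with ModelStorage the way other hana-ml estimators do:

from hana_ml.model_storage import ModelStorage

# Assumption: TextClassificationWithModel supports ModelStorage like other hana-ml models.
ms = ModelStorage(connection_context=cc)   # cc: an existing hana_ml ConnectionContext
tc.name = 'text-emotion-rdt'               # a user-chosen model name
ms.save_model(model=tc)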

 

In contrast, SAP HANA PAL also provides the traditional workflow for text classification, i.e. TF-IDF vectorizer plus KNN classifier, which is likewise wrapped in hana-ml. The following code sample demonstrates how this functionality can be employed for text classification:

 

 

from hana_ml.text.tm import tf_analysis, text_classification
# Step 1. Apply TF-IDF analysis to the training data; the result serves both
# TF-IDF vectorization and KNN classification.
train_text_tf_res = tf_analysis(data=text_train_data)
# Step 2. The prediction data is TF-IDF vectorized and then passed to a KNN classifier,
# with the TF-IDF vectorized training text data as the reference data.
prediction = text_classification(pred_data=text_test_data, ref_data=train_text_tf_res, thread_ratio=1.0)

 

 

Useful Links

Global Explanation Capabilities in SAP HANA Machine Learning

Exploring ML Explainability in SAP HANA PAL – Classification and Regression

Fairness in Machine Learning - A New Feature in SAP HANA Cloud PAL

A Multivariate Time Series Modeling and Forecasting Guide with Python Machine Learning Client for SA...

Outlier Detection using Statistical Tests in Python Machine Learning Client for SAP HANA

Outlier Detection by Clustering using Python Machine Learning Client for SAP HANA

Anomaly Detection in Time-Series using Seasonal Decomposition in Python Machine Learning Client for ...

Outlier Detection with One-class Classification using Python Machine Learning Client for SAP HANA

Learning from Labeled Anomalies for Efficient Anomaly Detection using Python Machine Learning Client...

Python Machine Learning Client for SAP HANA

Import multiple excel files into a single SAP HANA table

COPD study, explanation and interpretability with Python machine learning client for SAP HANA

Model Storage with Python Machine Learning Client for SAP HANA

Identification of Seasonality in Time Series with Python Machine Learning Client for SAP HANA