Introduction
Most traditional machine learning models are designed for structured data, e.g. tabular data with features and dependent variables. Text, on the other hand, is a typical instance of unstructured data. However, text data can be tabularized through text vectorization, albeit with some information loss. One typical technique for text vectorization is the well-known term frequency-inverse document frequency (TF-IDF) analysis, which is adopted in SAP HANA Predictive Analysis Library (PAL) for text mining. The TF-IDF transformed text data, together with the labels, can be fed directly into traditional classification models to train a deployable classifier. The trained classifier can then be preceded by the TF-IDF transformer (derived from the statistics of the TF-IDF analysis of the original text data) to form a pipeline model for predicting the tags/labels of future incoming texts. The entire workflow for text classification can be illustrated as follows:
Text Input → TF-IDF Vectorizer → Classifier → Result
The classifier employed in the workflow above is usually k-nearest-neighbor (KNN). One key advantage of the KNN classifier is that it requires essentially no training phase. However, at inference time it must search the whole TF-IDF transformed data (or some pre-built tree structure) for similar documents for each piece of query text, which is both computationally intensive and time-consuming. This is especially true when the incoming text data for prediction is of high volume. For example, a text classifier used as a spam-email filter will very likely face a very large number of suspicious emails.
To accelerate the inference speed of text classification, we have enhanced the capability of text classification in SAP HANA Predictive Analysis Library (PAL): an explicit training phase is introduced following the TF-IDF analysis, which allows users to train a Random Decision Trees (RDT) model in this phase and then deploy it for future inference. The Random Decision Trees classifier is selected for our text classification pipeline because its ensemble of trees usually leads to improved accuracy and reduced overfitting. Besides, since each tree can be built independently and later does inference independently as well (before the results are aggregated), the entire workflow is highly parallelizable. In a nutshell, by training an RDT model with a small amount of extra effort, we may enjoy a much faster inference speed, possibly with some improvement in prediction accuracy as well.
Background Knowledge
The introduction can be better understood with some background knowledge, especially on text vectorization, TF-IDF analysis, the KNN classifier and the Random Decision Trees classifier. This section provides a basic description of these topics.
Text Vectorization
Text vectorization is also known as text representation or text encoding. It refers to the process of converting textual data into numerical vectors that machine learning models can understand and process. Since most machine learning algorithms essentially work on numerical data, text vectorization is a critical step for analyzing and modeling text data. The most commonly used techniques for text vectorization include Bag-of-Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF), Word Embeddings, etc.
TF-IDF Analysis
TF-IDF (Term Frequency-Inverse Document Frequency) analysis is a technique used in natural language processing to assess the importance of words in a document relative to a corpus (i.e., a collection of text documents of interest). It is a numerical statistic that reflects how frequently a word appears in a document and how unique it is to that document among all the other documents in the corpus. As a possible preprocessing step, some less meaningful words/components, like punctuation marks, stop-words (e.g. articles, prepositions, conjunctions, pronouns) and numbers, may be excluded from the analysis. Then, given a document, each remaining word is assigned a value that is the product of two values: term frequency (TF), which measures how often the word occurs in the document, and inverse document frequency (IDF), which down-weights words that occur in many documents of the corpus.
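For concreteness, one common formulation of the two factors is given below. It is an assumption for illustration only; many variants exist (e.g. with smoothing or different normalization), and the exact weighting used inside PAL may differ. Here f_{t,d} denotes the number of occurrences of term t in document d, N the number of documents in the corpus, and n_t the number of documents that contain t.

```latex
\[
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \log\frac{N}{n_t}, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
\]
```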
The resulting value is referred to as the TF-IDF value of that word w.r.t. the document. TF-IDF vectorization is then simply the process of representing a document by its vector of TF-IDF values for all words in the entire corpus. Note that if a word does not appear in a document, its TF-IDF value is by definition 0 for that document, so there are usually many zeros in the TF-IDF vectors of documents. In other words, TF-IDF vectorization tends to produce sparse representations of documents.
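The vectorization and its sparsity can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the weighting is tf = count / document length and idf = log(N / document frequency), the function name tfidf_vectorize and the toy corpus are hypothetical, and none of this reflects how PAL implements the analysis internally.

```python
import math
from collections import Counter

def tfidf_vectorize(docs):
    """Return (vocabulary, vectors) for a list of tokenized, non-empty documents."""
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n_docs / doc_freq[term]) for term in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)  # a missing term counts as 0
        vectors.append([counts[t] / len(doc) * idf[t] for t in vocab])
    return vocab, vectors

# Hypothetical toy corpus, already tokenized by whitespace.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "birds sing at dawn".split(),
]
vocab, vectors = tfidf_vectorize(docs)
n_zero = sum(v == 0 for vec in vectors for v in vec)
print(f"{len(vocab)} terms, {n_zero} of {len(vocab) * len(docs)} entries are zero")
```

Even on this tiny corpus most entries are zero, because each document uses only a small part of the corpus vocabulary.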
KNN Classifier
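The KNN classifier is a non-parametric classification algorithm. It determines the class of a new data point by considering its k nearest neighbors in the training set, where k is a prespecified positive integer (often chosen to be odd to reduce ties). The KNN classifier works as follows: it measures the distance (or similarity) between the new data point and every point in the training set, selects the k closest training points, and assigns the class that is most frequent among them.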
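A minimal sketch of KNN prediction, assuming cosine similarity as the similarity measure and a plain majority vote, may help to see why inference is costly. The function names and the toy vectors are hypothetical, and the brute-force scan below is not the PAL implementation.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def knn_predict(query, train_vectors, train_labels, k=3):
    # Every training vector is scored against the query: this full scan
    # is what makes KNN inference expensive on large reference sets.
    scored = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: cosine_similarity(query, pair[0]),
        reverse=True,
    )
    top_labels = [label for _, label in scored[:k]]
    # Majority vote among the k most similar training points.
    return Counter(top_labels).most_common(1)[0][0]

# Hypothetical 3-dimensional "TF-IDF" vectors with their labels.
train_vectors = [[1.0, 0.0, 0.2], [0.9, 0.1, 0.0], [0.0, 1.0, 0.3], [0.1, 0.8, 0.0]]
train_labels = ["joy", "joy", "sad", "sad"]
print(knn_predict([0.8, 0.1, 0.1], train_vectors, train_labels, k=3))
```

Note that there is nothing to train: all the work happens in knn_predict, once per query.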
RDT Classifier
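The random decision trees (RDT) classifier is an ensemble learning method that combines multiple decision tree classifiers to improve prediction accuracy. Its randomness mainly originates from two aspects of the training phase: each tree is typically trained on a random sample of the training data (e.g. a bootstrap sample), and each split typically considers only a random subset of the features.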
The RDT classifier offers several advantages: it can handle large datasets with high dimensionality, it is resistant to overfitting, and it can effectively handle missing data. It is also insensitive to feature scaling thanks to its tree-based nature. Besides, since each tree can be built independently and can do inference independently as well, both its training and inference processes are highly parallelizable.
Overall, the RDT classifier is a powerful and versatile machine learning algorithm that can be used for various classification problems and often produces accurate predictions.
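The ensemble idea can be sketched with a toy forest. The sketch assumes the two usual sources of randomness (a bootstrap sample of rows and a random subset of features per tree) and uses one-level decision stumps in place of full trees to stay short. All names and data are hypothetical, and this is not the PAL algorithm.

```python
import random
from collections import Counter

def train_stump(rows, labels, feature_ids):
    """Pick the (feature, threshold) split with the fewest training errors."""
    best = None
    for f in feature_ids:
        for threshold in sorted({row[f] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[f] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = sum(l != left_label for l in left) + sum(r != right_label for r in right)
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    if best is None:  # no usable split: always predict the majority label
        majority = Counter(labels).most_common(1)[0][0]
        return (feature_ids[0], float("inf"), majority, majority)
    return best[1:]

def train_forest(rows, labels, n_estimators=15, n_features=1, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_estimators):
        # Randomness 1: bootstrap sample of the training rows.
        idx = [rng.randrange(len(rows)) for _ in rows]
        # Randomness 2: random subset of candidate features.
        feats = rng.sample(range(len(rows[0])), n_features)
        forest.append(train_stump([rows[i] for i in idx], [labels[i] for i in idx], feats))
    return forest

def predict(forest, row):
    # Each stump votes independently; the votes are then aggregated.
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical two-feature data: "joy" rows score high on feature 0, "sad" rows on feature 1.
rows = [[0.9, 0.1], [0.8, 0.2], [0.85, 0.15], [0.95, 0.05],
        [0.1, 0.9], [0.2, 0.8], [0.15, 0.85], [0.05, 0.95]]
labels = ["joy"] * 4 + ["sad"] * 4
forest = train_forest(rows, labels)
print(predict(forest, [1.0, 0.0]), predict(forest, [0.0, 1.0]))
```

Since no stump depends on any other, both the loop in train_forest and the vote collection in predict could run in parallel, which is the property the pipeline relies on.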
Example
In this section we show a simple example to demonstrate the effectiveness of the RDT model for text label inference. We use the Text Emotion Recognition dataset for demonstration. This dataset consists of many pieces of text content, each associated with a human emotion (joy or sadness).
The whole dataset has 282822 rows. After performing a stratified 1:9 train-test split, we have 28282 texts for training and the remaining 254540 for testing. This choice resembles the case where the data for inference is of high volume (w.r.t. the volume of the training data).
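As an illustration of what a stratified split does, the following sketch samples each class at the same rate, so that the class proportions of the training and test parts match those of the whole dataset. The function name and the class balance are hypothetical; the split used for the experiment was not necessarily produced this way.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction, seed=0):
    """Return (train_idx, test_idx), sampling every class at the same rate."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return train_idx, test_idx

labels = ["joy"] * 600 + ["sad"] * 400  # hypothetical class balance
train_idx, test_idx = stratified_split(labels, train_fraction=0.1)
print(len(train_idx), len(test_idx))
```

With train_fraction=0.1 this reproduces the 1:9 ratio used above on a smaller scale.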
Key statistics of our experiments are listed as follows (classifiers are grouped for comparison based on their similar performance on the test dataset, and execution times are averaged over 10 repetitions of the experiment):
Classifier | Time for TF-IDF Analysis and Model Build (in seconds) | Time for Inference (in seconds) | Test Accuracy |
KNN(k=1) | 2.67 | 42.91 | 0.7664 |
RDT(max_depth=60, n_estimators=20) | 9.97 | 17.99 | 0.8455 |
| | | |
KNN(k=11) | 2.67 | 45.89 | 0.8853 |
RDT(max_depth=200, n_estimators=20) | 17.44 | 18.31 | 0.9055 |
| | | |
KNN(k=101) | 2.67 | 79.48 | 0.9193 |
RDT(max_depth=1000, n_estimators=20) | 19.74 | 18.77 | 0.9278 |
[Note: Due to randomness, the test accuracy of each RDT classifier recorded in the table above is the second-worst value over the 10 repetitions of the experiment, which ensures that the RDT model performs no worse than the KNN model with a probability of approximately 0.9.]
The table above shows that the training phase of KNN is much shorter than that of RDT, while the situation is reversed for the inference phase. In all three cases, the additional time spent on RDT model training is more than compensated by the time saved in the inference phase. In particular, the inference time of the RDT model does not increase significantly as its base learner gets more sophisticated (in this case, deeper), which is clearly advantageous compared to KNN.
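The trade-off can be checked by adding up the two time columns of the table above for each pair of classifiers. The short script below does only this arithmetic; the dictionary keys are shortened labels chosen here for readability.

```python
# (build_seconds, inference_seconds), copied from the table above
timings = {
    "KNN(k=1)": (2.67, 42.91),
    "RDT(max_depth=60)": (9.97, 17.99),
    "KNN(k=11)": (2.67, 45.89),
    "RDT(max_depth=200)": (17.44, 18.31),
    "KNN(k=101)": (2.67, 79.48),
    "RDT(max_depth=1000)": (19.74, 18.77),
}
pairs = [
    ("KNN(k=1)", "RDT(max_depth=60)"),
    ("KNN(k=11)", "RDT(max_depth=200)"),
    ("KNN(k=101)", "RDT(max_depth=1000)"),
]
for knn, rdt in pairs:
    knn_total = sum(timings[knn])
    rdt_total = sum(timings[rdt])
    print(f"{knn}: {knn_total:.2f} s | {rdt}: {rdt_total:.2f} s | saved: {knn_total - rdt_total:.2f} s")
```

The end-to-end totals are 45.58 s vs. 27.96 s, 48.56 s vs. 35.75 s, and 82.15 s vs. 38.51 s, so the RDT pipeline is faster overall in each group for this train/test ratio.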
Conclusion
The KNN classifier can be an inefficient component for text classification, especially in terms of inference speed. SAP HANA PAL now provides a new option that allows users to train an RDT model for text classification instead. Consequently, the inference speed of text classification can be greatly improved. This option is particularly well suited to cases where the cost of training an RDT model is small or moderate (e.g. the training corpus contains only a relatively small number of meaningful terms), while the inference demand is comparatively high.
Code Samples
In the following we show how to apply the RDT model for text classification through the Python machine learning client for SAP HANA (hana-ml).
In hana-ml, the newly introduced training and inference functionality for text classification is wrapped in a new class called TextClassificationWithModel. In the following code samples (with comments), text_train_data is assumed to be the training data for building the TF-IDF vectorizer and the RDT classifier, while text_test_data is the dataset for evaluating the performance of the composite model (i.e. TF-IDF vectorizer plus RDT classifier).
from hana_ml.text.tm import TextClassificationWithModel
# Step 1. Initialize a TextClassificationWithModel instance, including basic configs for the TF-IDF vectorizer and the RDT classifier.
tc = TextClassificationWithModel(lang='en', enable_stopwords=True, keep_numeric=False, n_estimators=10, max_depth=100)
# Step 2. Call the fit() method with the training data as input. This is equivalent to first applying TF-IDF analysis to the training data and then building an RDT classifier on the TF-IDF vectorized training data.
tc.fit(data=text_train_data, thread_ratio=1.0)  # use all available threads, since RDT training is highly parallelizable
# Step 3. Predict the labels of the text data with the previously trained instance, using TF-IDF vectorization followed by RDT classification.
prediction = tc.predict(data=text_test_data, thread_ratio=1.0)  # use all available threads, since RDT inference is also highly parallelizable
In contrast, SAP HANA PAL also provides the traditional workflow for text classification, i.e. TF-IDF vectorizer plus KNN classifier, which is also wrapped in hana-ml. The following code samples demonstrate how this functionality can be employed for text classification:
from hana_ml.text.tm import text_analysis, text_classification
# Step 1. Apply TF-IDF analysis to the training data. The result is used both for TF-IDF vectorization and as reference data for KNN classification.
train_text_tf_res = text_analysis(data=text_train_data, thread_ratio=1.0)
# Step 2. Input prediction data is TF-IDF vectorized and then passed into a KNN classifier, with TF-IDF vectorized training text data as the reference data.
prediction = text_classification(pred=text_test_data, ref_data=train_text_tf_res, thread_ratio=1.0)