1. Background
In this fast-paced information era, data keeps increasing in both volume and modality. For example, data collected from an e-commerce website usually contains both structured (numerical and categorical) data (e.g. product type, rating, recommendation index) and non-structured data (e.g. product name, product description in both text and image, review comments). The collected data can be applied to mining and predictive analysis tasks, e.g. product recommendation. In a typical case of this kind, the non-structured data may carry a large proportion of the information relevant to the targeted task. It is therefore highly desirable to fully utilize the multi-modality of data in predictive analysis tasks, i.e. to combine the information in both the structured and non-structured data. Thanks to recent advances in deep learning, one can now convert non-structured data like texts and videos into real vectors of a pre-defined dimension using suitable embedding models, hence facilitating the joint use of text data and structured data in various hybrid predictive analysis tasks.
In this blogpost, we mainly focus on predictive analysis tasks with hybrid structured data and text embeddings. We show how embedding data is supported by a very popular machine learning algorithm called hybrid gradient boosting trees, and we conduct a simple analysis of the pros and cons of using embedding data through a concrete example.
The Predictive Analysis Library (PAL) in SAP HANA Cloud Q4 2024 supports multiple classification and regression algorithms that can process embedding vectors as features, such as Hybrid Gradient Boosting Tree (HGBT) and Multi-Task MLP.
2. Example Use Case of Hybrid Predictive Analysis with Text Embeddings
2.1 Dataset Description and Proposed Task
In this blogpost we use the (modified) wine review dataset (originally collected by Zack Thoutt) for demonstration. This dataset contains numerical and categorical features like sale price and country & province of origin, as well as a non-structured text feature describing the wine. To make this non-structured text feature usable, we apply the text embedding model provided by the Predictive Analysis Library (PAL) in SAP HANA Cloud to convert each piece of description text into a real vector of dimension 768. The text embeddings can then be used just like other structured features in subsequent predictive analysis. The task proposed in this blogpost is to predict the variety of wine using the tabular features mentioned above together with the embedding vectors of the non-structured description texts.
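As a rough sketch of how such embeddings can be generated with hana-ml, the snippet below assumes the `PALEmbeddings` class available in recent hana-ml releases and a hypothetical text column named `description`; the exact class and parameter names may differ in your installed version, so please verify against the hana-ml documentation. It requires a live SAP HANA Cloud connection and is not runnable standalone.

```python
>>> # Sketch: turn the free-text column into an embedding vector column.
>>> # `PALEmbeddings` and its parameters are assumptions; verify against
>>> # your installed hana-ml version before use.
>>> from hana_ml.text.pal_embeddings import PALEmbeddings
>>> emb = PALEmbeddings()
>>> wine_train_emb = emb.fit_transform(data=wine_train_df,
                                       key='ID',
                                       target='description')
```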
Key statistics of the experimental dataset are presented in the following table:
| Statistics Name | Statistics Value | Comment |
|---|---|---|
| Data Size (Train) | 84123 | |
| Data Size (Test) | 21031 | |
| No. of Features (w/o Embeddings) | 4 | 2 numerical, 2 categorical |
| Dimension of Text Embeddings | 768 | Generated by PAL fine-tuned model for text embeddings |
| No. of Classes | 30 | Classes are not balanced |
2.2 Experimental Results
We run three sets of experiments for wine variety prediction: the 1st one is a baseline experiment without text embeddings (i.e. with tabular features only), the 2nd one predicts with embedding vectors only, and the 3rd one predicts using hybrid data (i.e. both tabular features and text embeddings). The results of the experiments are summarized as follows:
For illustrative purposes, in the figure above we also include the baseline accuracy score of random guessing (i.e. assuming a uniform distribution of classes), as well as that of the majority-class predictor (i.e. predicting that all samples in the test data belong to the majority class of the training data).
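Both baseline scores can be computed from the label distribution alone. The toy snippet below (with made-up labels, not the actual wine data) illustrates the calculation:

```python
from collections import Counter

# Hypothetical labels purely for illustration; the real wine dataset
# has 30 imbalanced classes.
train_labels = ["pinot"] * 6 + ["chard"] * 3 + ["merlot"] * 1
test_labels = ["pinot"] * 5 + ["chard"] * 4 + ["merlot"] * 1

# Random guessing under an assumed uniform class distribution.
n_classes = len(set(train_labels))
random_guess_acc = 1.0 / n_classes

# Majority-class predictor: always predict the most frequent training label.
majority = Counter(train_labels).most_common(1)[0][0]
majority_acc = sum(1 for y in test_labels if y == majority) / len(test_labels)

print(random_guess_acc)  # 1/3 for these 3 toy classes (1/30 for the wine data)
print(majority_acc)      # 0.5 here
```

For the wine dataset with its 30 imbalanced classes, the random-guess baseline is thus only 1/30 ≈ 0.033, which makes it a very weak reference point.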
Comparing the two baseline accuracy scores with the prediction scores produced by the HGBT models, we can see that both tabular features and text embeddings provide useful information about the target, i.e. the variety of wine. The hybrid prediction using both the tabular features and the text embedding vectors yields the best, and significantly improved, accuracy score on the test dataset.
However, it is also seen from the above that introducing text embeddings as predictive features can significantly increase the training time of the target models. In our example, the training time for models involving text embeddings is roughly 30 times that of the model without text embeddings. Besides, including text embeddings in the training data also significantly increases storage consumption. For the training set of the wine data, the storage consumption for all features without text embeddings is only around 2.5 MB, while the text embedding data can be as large as over 250 MB.
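The >250 MB figure is consistent with a back-of-envelope estimate, assuming each of the 768 embedding dimensions is stored as a 4-byte float (the actual in-database representation may add overhead):

```python
# Rough storage estimate for the embedding column of the training set.
n_rows = 84123          # training-set size from the table above
dim = 768               # embedding dimension
bytes_per_value = 4     # assuming 4-byte floats

embedding_mb = n_rows * dim * bytes_per_value / (1024 ** 2)
print(round(embedding_mb))  # 246, i.e. roughly a quarter of a gigabyte
```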
The good news is that, as we shall see in another blogpost, dimensionality reduction techniques (like PCA) can be applied to partially alleviate those pain points. In this example, if the vector dimension is reduced from 768 to 64 using VECPCA in PAL, the test accuracy for the hybrid scenario is 0.774, the training time drops to around 300 seconds, and the storage consumption is reduced from over 250 MB to slightly over 50 MB.
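To illustrate the idea behind such a reduction (this is not PAL's in-database VECPCA, just a conceptual NumPy sketch), the snippet below projects 768-dimensional vectors onto their top 64 principal components, using random data as a stand-in for the real embeddings:

```python
import numpy as np

# Stand-in for the embedding matrix: 1000 samples, 768 dimensions.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 768))

# PCA via SVD: center the data, then keep the top 64 principal components.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
X_reduced = Xc @ Vt[:64].T  # 768-d vectors compressed to 64-d

print(X_reduced.shape)  # (1000, 64)
```

The reduced vectors occupy 64/768 = 1/12 of the original storage, which matches the order of magnitude of the 250 MB to ~50 MB reduction reported above (the exact ratio depends on storage overhead).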
2.3 Code Recipes using Python Machine Learning client for SAP HANA
In this subsection we include the Python code for reproducing the main results stated in the previous section, using the Python machine learning client for SAP HANA (hana-ml).
>>> wine_train_df.count(), wine_test_df.count()
(84123, 21031)
>>> wine_train_df.columns
['ID', 'PAL_EMBEDDINGS', 'country', 'points', 'price', 'province', 'variety']
>>> from hana_ml.algorithms.pal.trees import HybridGradientBoostingClassifier as HGBC
>>> hgc = HGBC(n_estimators=1000,
...            random_state=2023,
...            validation_set_rate=0.1,
...            tolerant_iter_num=10,
...            learning_rate=0.1,
...            adopt_prior=True,
...            max_depth=4,
...            lamb=1.0,
...            stratified_validation_set=True)
>>> # Fit an HGBT model with text embeddings and score on the test data
>>> hgc.fit(wine_train_df, key='ID', label='variety')
>>> hgc.score(wine_test_df, key='ID', label='variety')
>>> # Fit an HGBT model without text embeddings and score on the test data;
>>> # the embedding column must also be dropped from the test data so that
>>> # its features match those seen during training
>>> hgc.fit(wine_train_df.deselect('PAL_EMBEDDINGS'),
...         key='ID',
...         label='variety')
>>> hgc.score(wine_test_df.deselect('PAL_EMBEDDINGS'),
...           key='ID',
...           label='variety')