1. Introduction
Text, as a typical type of unstructured data, can be seen everywhere. In many application scenarios, text comes along with structured, tabular (numerical and categorical) data. However, many classical data mining algorithms only target structured data. To unlock the potential of hybrid data containing both text and tabular inputs, it is crucial to convert the unstructured text inputs into structured formats. Traditionally, texts are converted to real vectors through word-counting based approaches. However, these traditional approaches are usually considered insufficient for representing the semantic meaning of text, since they ignore the order and relations of the words appearing in it. In recent years, we have witnessed the emergence of large language models (LLMs), which process text with an effectiveness beyond many people's wildest imagination. LLMs are pre-trained neural networks that accept raw text as input. During the forward propagation of the network, raw text is first tokenized and then passed through subsequent layers to create semantically richer representations called embedding vectors (or simply text embeddings). These embedding vectors are passed into the final layers of the network, which are usually task specific.
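To see why word counting loses meaning, consider a minimal sketch (the sentences below are illustrative, not from the dataset): two sentences with opposite meanings produce identical count vectors, because word order is discarded.

```python
from collections import Counter

# Two sentences with opposite meanings but identical word counts:
s1 = "the model beats the baseline"
s2 = "the baseline beats the model"

# A word-counting representation keeps only per-word frequencies:
vec1 = Counter(s1.split())
vec2 = Counter(s2.split())

# The two representations are indistinguishable — word order is lost:
print(vec1 == vec2)  # True
```

Text embeddings produced by LLMs avoid this problem, since the tokenized sequence (including word order) flows through the network.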
In many machine learning tasks involving hybrid input data, text embeddings have already shown advantages over their predecessors. However, current general-purpose text embedding models usually produce vectors with dimensions ranging from a few hundred to a few thousand, which is higher than most classical machine learning algorithms can efficiently handle. In such cases, dimensionality reduction becomes a proper choice for improving the efficiency of various machine learning algorithms.
The Predictive Analysis Library (PAL) in SAP HANA Cloud Q4 2024 supports multiple dimensionality reduction algorithms based on principal component analysis that are capable of processing high-dimensional embedding vector data, e.g. CATPCA and VECPCA.
2. PCA for Reducing the Dimensionality of Text Embeddings in SAP HANA Cloud
A few recent studies have demonstrated that traditional principal component analysis (PCA) is a good choice for reducing the dimensionality of text embeddings while maintaining most of the performance score of downstream learning tasks. Therefore, the PCA algorithm in the SAP HANA Cloud Predictive Analysis Library (PAL) Q4 2024 release has been enhanced with native REAL_VECTOR type support, making it available for reducing the dimensionality of text embeddings. More specifically, a new SQL procedure called PAL_VECPCA has been introduced.
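Conceptually, what PAL_VECPCA computes can be sketched in a few lines of NumPy (random data stands in for real text embeddings here; this is an illustration of plain PCA via SVD, not the PAL implementation itself):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for text embeddings: 1000 vectors of dimension 768
X = rng.normal(size=(1000, 768))

# PCA via SVD of the mean-centered data
mean = X.mean(axis=0)
Xc = X - mean
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 192
loadings = Vt[:k].T      # 768 x 192 component loadings
scores = Xc @ loadings   # 1000 x 192 reduced embeddings

print(scores.shape)      # (1000, 192)
```

For real text embeddings (unlike the random data above), the leading components typically capture most of the variance, which is why the downstream performance loss is small.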
2.1 Key Parameters
PAL_VECPCA has several parameters, with key ones listed as follows:
Another procedure closely related to PAL_VECPCA is the SQL procedure PAL_VECPCA_PROJECT, which is used to transform the input data into the space spanned by the principal components output by PAL_VECPCA. The key parameter of this procedure is MAX_COMPONENTS.
3. Example Use Case
In this section we show an exemplary classification scenario using the wine review dataset (originally collected by Zack Thoutt). It is a multi-class classification problem: predicting the wine variety based on several tabular (i.e. structured) features like sale price and country of origin, along with the unstructured text describing the wines (the wine reviews). A coarse view of the dataset is shown as follows (to facilitate the use case, some features of the original dataset are excluded):
Among the features shown in the view above, 'ID' is the identifier column for each data piece (i.e. row); 'country', 'points', 'price' and 'province' are structured features (either numerical or categorical) of the original data; and 'description' contains the text of the wine reviews. 'PAL_EMBEDDINGS' is the vectorization of 'description', newly generated through the text embedding model in SAP HANA Cloud (see also this blogpost), with vector dimension 768.
Assume that the data is stored in a table called WINE_DATA_TRAIN_TBL. Then, to reduce the dimensionality of the text embeddings from, say, 768 to 192, one executes the following SQL statements:
DROP TABLE VECPCA_PARAM_TBL;
CREATE COLUMN TABLE VECPCA_PARAM_TBL (
"PARAM_NAME" NVARCHAR(256),
"INT_VALUE" INTEGER,
"DOUBLE_VALUE" DOUBLE,
"STRING_VALUE" NVARCHAR(1000)
);
INSERT INTO VECPCA_PARAM_TBL VALUES ('N_COMPONENTS', 192, NULL, NULL);
INSERT INTO VECPCA_PARAM_TBL VALUES ('SCORES', 1, NULL, NULL);
DROP TABLE PAL_PCA_LOADINGS_TBL;
--Specify the output table for component loadings
CREATE COLUMN TABLE PAL_PCA_LOADINGS_TBL ("VARIABLE_NAME" NVARCHAR(100), "COMPONENT_ID" INTEGER, "COMPONENT_LOADING" DOUBLE);
DROP TABLE PAL_PCA_SCORES;
--Specify component score for each data instance, corresponding to PCA result of the train data
CREATE COLUMN TABLE PAL_PCA_SCORES ("ID" INTEGER, "SCORE_VECTOR" REAL_VECTOR);
DROP TABLE PAL_PCA_SCALING_INFORMATION_TBL;
-- Specify the mean and scale info for each variable
CREATE COLUMN TABLE PAL_PCA_SCALING_INFORMATION_TBL ("VARIABLE_NAME" NVARCHAR(100), "MEAN" DOUBLE, "SCALE" DOUBLE);
DO BEGIN
lt_data = SELECT ID, PAL_EMBEDDINGS FROM WINE_DATA_TRAIN_TBL;
lt_param = SELECT * FROM VECPCA_PARAM_TBL;
CALL _SYS_AFL.PAL_VECPCA (:lt_data, :lt_param, lt_loadings, lt_loadings_info, lt_scores, lt_scaling_info);
--loadings are a necessity of a PCA model
INSERT INTO PAL_PCA_LOADINGS_TBL
SELECT * FROM :lt_loadings;
--scores stores the PCA transformed result of the train data
INSERT INTO PAL_PCA_SCORES
SELECT * FROM :lt_scores;
--scaling info is also a necessity of a PCA model
INSERT INTO PAL_PCA_SCALING_INFORMATION_TBL
SELECT * FROM :lt_scaling_info;
END;
Note that the PCA result of the training data is contained in the output table "PAL_PCA_SCORES", which holds the row IDs and the new embedding vectors of reduced dimensionality 192. The other two tables, namely PAL_PCA_LOADINGS_TBL and PAL_PCA_SCALING_INFORMATION_TBL, store the information needed for applying the PCA transformation to further incoming data.
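The stored loadings and scaling information are sufficient to project any new embedding: center (and optionally scale) it with the per-variable statistics, then multiply by the loadings matrix. A minimal NumPy sketch of that projection step (with synthetic stand-ins for the values PAL_VECPCA would store):

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 768, 192

# Stand-ins for what a previous PCA fit stores in the loadings and
# scaling-information tables:
mean = rng.normal(size=d)                              # per-variable mean
scale = np.ones(d)                                     # per-variable scale (1.0 = no scaling)
loadings = np.linalg.qr(rng.normal(size=(d, k)))[0]    # orthonormal d x k loadings

# Projecting a new embedding into the principal-component space:
x_new = rng.normal(size=d)
score = ((x_new - mean) / scale) @ loadings

print(score.shape)  # (192,)
```

This is conceptually what PAL_VECPCA_PROJECT does for the test data in the next step.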
The multi-class classifier of choice is a properly configured hybrid gradient boosting tree (HGBT) classifier under the unified classification interface, with parameters defined as follows:
DROP TABLE UHGBT_PARAM_TBL;
CREATE COLUMN TABLE UHGBT_PARAM_TBL (
"PARAM_NAME" NVARCHAR(256),
"INT_VALUE" INTEGER,
"DOUBLE_VALUE" DOUBLE,
"STRING_VALUE" NVARCHAR(1000)
);
INSERT INTO UHGBT_PARAM_TBL VALUES ('FUNCTION', NULL, NULL, 'HGBT');
INSERT INTO UHGBT_PARAM_TBL VALUES ('KEY', 1, NULL, NULL);
INSERT INTO UHGBT_PARAM_TBL VALUES ('N_ESTIMATORS', 1000, NULL, NULL);
INSERT INTO UHGBT_PARAM_TBL VALUES ('MAX_DEPTH', 4, NULL, NULL);
INSERT INTO UHGBT_PARAM_TBL VALUES ('LEARNING_RATE', NULL, 0.1, NULL);
INSERT INTO UHGBT_PARAM_TBL VALUES ('VALIDATION_SET_RATE', NULL, 0.1, NULL);
INSERT INTO UHGBT_PARAM_TBL VALUES ('STRATIFIED_VALIDATION_SET', 1, NULL, NULL);
INSERT INTO UHGBT_PARAM_TBL VALUES ('TOLERANT_ITER_NUM', 10, NULL, NULL);
After the model configuration, we can train a classifier as follows:
DROP TABLE UHGBT_MODEL_TBL;
CREATE COLUMN TABLE UHGBT_MODEL_TBL (
"ROW_INDEX" INTEGER,
"PART_INDEX" INTEGER,
"MODEL_CONTENT" NCLOB
);
DO BEGIN
lt_data = SELECT "WINE_DATA_TRAIN_TBL"."ID", "SCORE_VECTOR","country", "points", "price", "province", "variety" FROM WINE_DATA_TRAIN_TBL INNER JOIN PAL_PCA_SCORES ON "WINE_DATA_TRAIN_TBL"."ID"="PAL_PCA_SCORES"."ID";
lt_param = SELECT * FROM UHGBT_PARAM_TBL;
CALL _SYS_AFL.PAL_UNIFIED_CLASSIFICATION (:lt_data, :lt_param, lt_model, lt_varimp, lt_stats, lt_optparam, lt_confusion, lt_metrics, lt_partition, lt_ph);
INSERT INTO UHGBT_MODEL_TBL
SELECT * FROM :lt_model;
END;
When the fitting process has finished, one can evaluate the performance of the model on the test dataset. Assuming that the test data is stored in a table named 'WINE_DATA_TEST_TBL', the first thing to do is apply PCA to reduce the dimensionality of the embedding vectors (to be consistent with the embedding dimension in the training data of the HGBT model). To transform the embedding vectors in the test data, we pass the data along with the loadings and scaling info (stored in tables 'PAL_PCA_LOADINGS_TBL' and 'PAL_PCA_SCALING_INFORMATION_TBL' respectively) from the previous execution of PAL_VECPCA, illustrated as follows:
DROP TABLE VECPCA_PROJ_PARAM_TBL;
CREATE COLUMN TABLE VECPCA_PROJ_PARAM_TBL (
"PARAM_NAME" NVARCHAR(256),
"INT_VALUE" INTEGER,
"DOUBLE_VALUE" DOUBLE,
"STRING_VALUE" NVARCHAR(1000)
);
--Optional: restrict the number of projected components
--INSERT INTO VECPCA_PROJ_PARAM_TBL VALUES ('MAX_COMPONENTS', 192, NULL, NULL);
DROP TABLE VECPCA_PROJ_RESULT_TBL;
CREATE COLUMN TABLE VECPCA_PROJ_RESULT_TBL (
"ID" INTEGER,
"SCORE_VECTOR" REAL_VECTOR
);
DO BEGIN
lt_data = SELECT ID, PAL_EMBEDDINGS FROM WINE_DATA_TEST_TBL;
lt_loadings = SELECT * FROM PAL_PCA_LOADINGS_TBL;
lt_scale_info = SELECT * FROM PAL_PCA_SCALING_INFORMATION_TBL;
lt_param = SELECT * FROM VECPCA_PROJ_PARAM_TBL;
CALL _SYS_AFL.PAL_VECPCA_PROJECT(:lt_data, :lt_loadings, :lt_scale_info, :lt_param, lt_result);
INSERT INTO VECPCA_PROJ_RESULT_TBL
SELECT * FROM :lt_result;
END;
Now we can evaluate the performance of the previously trained HGBT model on the test dataset, illustrated as follows:
DROP TABLE UHGBT_SCORE_PARAM_TBL;
CREATE COLUMN TABLE UHGBT_SCORE_PARAM_TBL (
"PARAM_NAME" NVARCHAR(256),
"INT_VALUE" INTEGER,
"DOUBLE_VALUE" DOUBLE,
"STRING_VALUE" NVARCHAR(1000)
);
INSERT INTO UHGBT_SCORE_PARAM_TBL VALUES ('FUNCTION', NULL, NULL, 'HGBT');
DO BEGIN
lt_data = SELECT "WINE_DATA_TEST_TBL"."ID", "SCORE_VECTOR", "country", "points", "price", "province", "variety" FROM WINE_DATA_TEST_TBL INNER JOIN VECPCA_PROJ_RESULT_TBL ON "WINE_DATA_TEST_TBL"."ID" = "VECPCA_PROJ_RESULT_TBL"."ID";
--loading the trained HGBT model into a table variable
lt_model = SELECT * FROM UHGBT_MODEL_TBL;
lt_param = SELECT * FROM UHGBT_SCORE_PARAM_TBL;
CALL _SYS_AFL.PAL_UNIFIED_CLASSIFICATION_SCORE(:lt_data, :lt_model, :lt_param, lt_result, lt_stats, lt_cm, lt_metrics);
--all evaluation scores are stored in table variable lt_stats
SELECT * FROM :lt_stats;
END;
4. The Effect of Dimensionality Reduction
In this section we show some key statistics on the effect of dimensionality reduction on text embeddings, illustrated as follows:
As seen from the figure above, the use of text embeddings as a structured feature benefits the model's scoring performance on the test data (e.g. from 0.48 to 0.77+). However, with the full dimension of the text embeddings, the time consumption for training the model is rather high (around 40 minutes in our test). In contrast, if we reduce the dimensionality of the text embeddings from 768 to 192, the model's training time is reduced by over 80%, at the price of only a small degradation of test accuracy (from 0.795 to 0.784). This can be useful for, e.g., scenarios where resources are limited or constrained. In general, the high-dimensional nature of text embeddings can be too heavy for traditional machine learning models, reflected in high time and memory consumption. A PCA transformer with a properly chosen dimension can convert high-dimensional text embedding vectors into low-dimensional ones without much affecting the models' overall performance, effectively reducing both data storage consumption and training time.
Note: the training data (excluding the unstructured description texts) with full embedding size (i.e. 768) occupies over 250 MB, while it can be as small as 2.5 MB if the text embeddings are removed; so for this dataset, the majority of the static memory consumption is attributed to the text embeddings.
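A quick back-of-envelope check makes this plausible (assuming the REAL_VECTOR stores 4-byte floats; the row count below is implied by the stated sizes, not taken from the dataset):

```python
# Storage per row for a 768-dimensional embedding of 4-byte floats:
dim = 768
bytes_per_row = dim * 4          # 3072 bytes per embedding
print(bytes_per_row)             # 3072

# Rows needed for the embedding column alone to reach ~250 MB:
rows = 250 * 1024 * 1024 // bytes_per_row
print(rows)                      # 85333
```

So a dataset in this size range easily accumulates hundreds of megabytes from the embedding column alone, which PCA shrinks proportionally to the reduced dimension.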