Technology Blog Posts by SAP
likun_hou
Product and Topic Expert

1. Introduction 

Text, a typical type of unstructured data, can be seen everywhere. In many application scenarios, text comes along with structured or tabular (numerical and categorical) data. However, many classical data mining algorithms target only structured data, so to unveil the potential of hybrid data containing both text and tabular inputs, it is crucial to convert the unstructured text into a structured format. Traditionally, texts are converted to real vectors through word-counting based approaches. However, these traditional approaches are usually considered insufficient for representing the semantic meaning of the text, since they ignore the order and relations of the words that appear in it. In recent years, we have witnessed the emergence of large language models (LLMs), which process text with an effectiveness beyond many people's wildest imagination. LLMs are pre-trained neural networks that accept raw text as input. In the forward propagation of the network, raw text is first tokenized and then passed into subsequent layers to create more semantically rich representations called embedding vectors (or simply text embeddings). The embedding vectors are then passed into the final layers of the network, which are usually task specific.

In many machine learning tasks involving hybrid input data, text embeddings have already shown advantages over their predecessors. However, current general-purpose text embedding models usually produce embedding vectors with dimensions ranging from a few hundred to a few thousand, which is higher than most classical machine learning algorithms can efficiently handle. In such cases, dimensionality reduction becomes a proper choice for improving the efficiency of various machine learning algorithms.
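Conceptually, reducing the dimensionality of embeddings with PCA amounts to projecting the centered vectors onto their top principal directions. The following minimal NumPy sketch illustrates the idea, with random vectors standing in for real text embeddings (all names and sizes here are illustrative, not PAL internals):

```python
import numpy as np

# Hypothetical stand-in for text embeddings: 300 vectors of dimension 768.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 768))

# Center the data and compute the principal directions via SVD.
mean = X.mean(axis=0)
Xc = X - mean
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Keep the top 192 components, mirroring a reduction from 768 to 192.
k = 192
X_reduced = Xc @ Vt[:k].T
print(X_reduced.shape)  # (300, 192)
```

The rows of `Vt[:k]` play the role of the component loadings: multiplying the centered data by their transpose yields the low-dimensional representation.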

The Predictive Analysis Library (PAL) in SAP HANA Cloud (Q4 2024 release) supports multiple dimensionality reduction algorithms capable of processing high-dimensional embedding vectors with principal component analysis, e.g. CATPCA and VECPCA.

2.  PCA for Reducing the Dimensionality of Text Embeddings in SAP HANA Cloud 

A few recent studies have demonstrated that traditional principal component analysis (PCA) is a good choice for reducing the dimensionality of text embeddings while maintaining most of the performance of downstream learning tasks. Therefore, the PCA algorithm in the SAP HANA Cloud Predictive Analysis Library (PAL) is enhanced in the Q4 2024 release with native REAL_VECTOR type support, which makes it available for reducing the dimensionality of text embeddings. More specifically, a new SQL procedure called PAL_VECPCA has been introduced.

2.1 Key Parameters 

PAL_VECPCA has several parameters, with key ones listed as follows: 

  • N_COMPONENTS: This parameter determines the number of principal components of the training data to retain in the resulting PCA model, and it is also the maximum dimension that the transformed data can have after being transformed by the fitted PCA model. 
  • SCORES: This parameter determines whether to return the scoring result of the input data for PAL_VECPCA (1 means YES and 0 means NO). In particular, the scoring result is the dimensionality reduction result of the input data via PCA transformation. 

Another procedure closely related to PAL_VECPCA is the SQL procedure PAL_VECPCA_PROJECT, which is used to transform input data into the space spanned by the principal components output by PAL_VECPCA. The key parameter of this procedure is MAX_COMPONENTS: 

  •  MAX_COMPONENTS: This parameter determines the dimensionality of the output vector after PCA transformation is applied. The value of MAX_COMPONENTS must be no greater than the value of N_COMPONENTS in PAL_VECPCA (which is the default value of MAX_COMPONENTS).  
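To make the division of labor between the two procedures concrete, here is a minimal NumPy sketch of the same workflow: a "fit" step that stores the per-variable mean/scale and the component loadings (the information PAL_VECPCA writes to its output tables), and a "project" step that applies the stored scaling and loadings to new data, truncated to MAX_COMPONENTS. The scaling convention and all names here are illustrative assumptions, not PAL's exact internals:

```python
import numpy as np

rng = np.random.default_rng(1)
train = rng.normal(size=(200, 16))   # hypothetical training embeddings
test = rng.normal(size=(5, 16))      # new incoming data

# "Fit" (analogue of PAL_VECPCA): store per-variable mean/scale and loadings.
mean = train.mean(axis=0)
scale = train.std(axis=0)
_, _, Vt = np.linalg.svd((train - mean) / scale, full_matrices=False)
n_components = 8                     # analogue of N_COMPONENTS
loadings = Vt[:n_components]         # analogue of the loadings output table

# "Project" (analogue of PAL_VECPCA_PROJECT): apply the stored scaling,
# then the loadings; MAX_COMPONENTS may truncate further but cannot
# exceed n_components.
max_components = 4
projected = ((test - mean) / scale) @ loadings[:max_components].T
print(projected.shape)  # (5, 4)
```

The key point is that projection reuses the statistics learned at fit time; the new data is never re-fitted, so train and test land in the same reduced space.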

3. Example Use Case 

In this section we show an exemplary classification scenario using the wine review dataset (originally collected by Zack Thoutt). It is a multi-class classification problem: predicting the wine variety based on several tabular (i.e. structured) features like sale price and country of origin, along with unstructured text describing the wines (the wine reviews). A coarse view of the dataset is shown as follows (to facilitate the use case, some features of the original dataset are excluded):

[Figure: a coarse view of the wine review dataset, showing the columns ID, country, points, price, province, description and PAL_EMBEDDINGS]

Among the features shown in the view above, 'ID' is the identifier column for each row; 'country', 'points', 'price' and 'province' are structured (i.e. numerical or categorical) features of the original data; 'description' is the column containing the text of the wine reviews; and 'PAL_EMBEDDINGS' is the vectorization of 'description', newly generated through the text embedding model in SAP HANA Cloud (see also this blogpost), with vector dimension 768.

Assume that the data is stored in a table called WINE_DATA_TRAIN_TBL. Then, to reduce the dimensionality of the text embeddings, for example from 768 to 192, one can execute the following SQL statements:

 

DROP TABLE VECPCA_PARAM_TBL; 
CREATE COLUMN TABLE VECPCA_PARAM_TBL ( 
    "PARAM_NAME" NVARCHAR(256), 
    "INT_VALUE" INTEGER, 
    "DOUBLE_VALUE" DOUBLE, 
    "STRING_VALUE" NVARCHAR(1000) 
); 
INSERT INTO VECPCA_PARAM_TBL VALUES ('N_COMPONENTS', 192, NULL, NULL); 
INSERT INTO VECPCA_PARAM_TBL VALUES ('SCORES', 1, NULL, NULL); 
DROP TABLE PAL_PCA_LOADINGS_TBL; 
--Specify the output table for component loadings 
CREATE COLUMN TABLE PAL_PCA_LOADINGS_TBL ("VARIABLE_NAME" NVARCHAR(100), "COMPONENT_ID" INTEGER, "COMPONENT_LOADING" DOUBLE); 
DROP TABLE PAL_PCA_SCORES; 
--Specify component score for each data instance, corresponding to PCA result of the train data 
CREATE COLUMN TABLE PAL_PCA_SCORES ("ID" INTEGER, "SCORE_VECTOR" REAL_VECTOR); 
DROP TABLE PAL_PCA_SCALING_INFORMATION_TBL; 
-- Specify the mean and scale info for each variable 
CREATE COLUMN TABLE PAL_PCA_SCALING_INFORMATION_TBL ("VARIABLE_NAME" NVARCHAR(100), "MEAN" DOUBLE, "SCALE" DOUBLE); 

DO BEGIN 
lt_data = SELECT ID, PAL_EMBEDDINGS FROM WINE_DATA_TRAIN_TBL; 
lt_param = SELECT * FROM VECPCA_PARAM_TBL; 
CALL _SYS_AFL.PAL_VECPCA (:lt_data, :lt_param, lt_loadings, lt_loadings_info, lt_scores, lt_scaling_info); 
--loadings are a necessity of a PCA model 
INSERT INTO PAL_PCA_LOADINGS_TBL 
SELECT * FROM :lt_loadings; 
--scores stores the PCA transformed result of the train data 
INSERT INTO PAL_PCA_SCORES 
SELECT * FROM :lt_scores; 
--scaling info is also a necessity of a PCA model 
INSERT INTO PAL_PCA_SCALING_INFORMATION_TBL 
SELECT * FROM :lt_scaling_info; 
END; 

 

 

Note that the PCA result of the training data is contained in the output table PAL_PCA_SCORES, which holds the row IDs and the new embedding vectors of reduced dimensionality 192. The other two tables, PAL_PCA_LOADINGS_TBL and PAL_PCA_SCALING_INFORMATION_TBL, store the information needed to apply the PCA transformation to further incoming data.

The chosen multi-class classifier is a properly configured hybrid gradient boosting tree (HGBT) classifier under the unified classification interface, with parameters defined as follows:

 

DROP TABLE UHGBT_PARAM_TBL; 
CREATE COLUMN TABLE UHGBT_PARAM_TBL ( 
    "PARAM_NAME" NVARCHAR(256), 
    "INT_VALUE" INTEGER, 
    "DOUBLE_VALUE" DOUBLE, 
    "STRING_VALUE" NVARCHAR(1000) 
); 
INSERT INTO UHGBT_PARAM_TBL VALUES ('FUNCTION', NULL, NULL, 'HGBT'); 
INSERT INTO UHGBT_PARAM_TBL VALUES ('KEY', 1, NULL, NULL); 
INSERT INTO UHGBT_PARAM_TBL VALUES ('N_ESTIMATORS', 1000, NULL, NULL); 
INSERT INTO UHGBT_PARAM_TBL VALUES ('MAX_DEPTH', 4, NULL, NULL); 
INSERT INTO UHGBT_PARAM_TBL VALUES ('LEARNING_RATE', NULL, 0.1, NULL); 
INSERT INTO UHGBT_PARAM_TBL VALUES ('VALIDATION_SET_RATE', NULL, 0.1, NULL); 
INSERT INTO UHGBT_PARAM_TBL VALUES ('STRATIFIED_VALIDATION_SET', 1, NULL, NULL); 
INSERT INTO UHGBT_PARAM_TBL VALUES ('TOLERANT_ITER_NUM', 10, NULL, NULL); 

 

 After the model configuration, we can train a classifier as follows: 

 

DROP TABLE UHGBT_MODEL_TBL; 
CREATE COLUMN TABLE UHGBT_MODEL_TBL ( 
    "ROW_INDEX" INTEGER, 
    "PART_INDEX" INTEGER, 
    "MODEL_CONTENT" NCLOB 
); 
DO BEGIN 
lt_data = SELECT "WINE_DATA_TRAIN_TBL"."ID", "SCORE_VECTOR","country", "points", "price", "province", "variety" FROM WINE_DATA_TRAIN_TBL INNER JOIN PAL_PCA_SCORES ON "WINE_DATA_TRAIN_TBL"."ID"="PAL_PCA_SCORES"."ID"; 
lt_param = SELECT * FROM UHGBT_PARAM_TBL; 
CALL _SYS_AFL.PAL_UNIFIED_CLASSIFICATION (:lt_data, :lt_param, lt_model, lt_varimp, lt_stats, lt_optparam, lt_confusion, lt_metrics, lt_partition, lt_ph); 
INSERT INTO UHGBT_MODEL_TBL
SELECT * FROM :lt_model;
END; 

 

When the fitting process has finished, one can evaluate the performance of the model on the test dataset. Assuming the test data is stored in a table named 'WINE_DATA_TEST_TBL', the first thing to do is to apply PCA to reduce the dimensionality of its embedding vectors (to be consistent with the dimension of the embedding vectors in the training data of the HGBT model). To transform the embedding vectors in the test data, we pass the data along with the loadings and scaling-info results (stored in tables 'PAL_PCA_LOADINGS_TBL' and 'PAL_PCA_SCALING_INFORMATION_TBL', respectively) from the previous execution of PAL_VECPCA, illustrated as follows:

 

DROP TABLE VECPCA_PROJ_PARAM_TBL; 
CREATE COLUMN TABLE VECPCA_PROJ_PARAM_TBL ( 
    "PARAM_NAME" NVARCHAR(256), 
    "INT_VALUE" INTEGER, 
    "DOUBLE_VALUE" DOUBLE, 
    "STRING_VALUE" NVARCHAR(1000) 
); 

--INSERT INTO VECPCA_PROJ_PARAM_TBL VALUES ('MAX_COMPONENTS', 192, NULL, NULL); 
DROP TABLE VECPCA_PROJ_RESULT_TBL; 
CREATE COLUMN TABLE VECPCA_PROJ_RESULT_TBL ( 
    "ID" INTEGER, 
    "SCORE_VECTOR" REAL_VECTOR 
); 
DO BEGIN 
lt_data = SELECT ID, PAL_EMBEDDINGS FROM WINE_DATA_TEST_TBL; 
lt_loadings = SELECT * FROM PAL_PCA_LOADINGS_TBL; 
lt_scale_info = SELECT * FROM PAL_PCA_SCALING_INFORMATION_TBL; 
lt_param = SELECT * FROM VECPCA_PROJ_PARAM_TBL; 
CALL _SYS_AFL.PAL_VECPCA_PROJECT(:lt_data, :lt_loadings, :lt_scale_info, :lt_param, lt_result); 
INSERT INTO VECPCA_PROJ_RESULT_TBL 
SELECT * FROM :lt_result; 
END;

 

Now we can evaluate the performance of the previously trained HGBT model on the test dataset, illustrated as follows:

 

DROP TABLE UHGBT_SCORE_PARAM_TBL;
CREATE COLUMN TABLE UHGBT_SCORE_PARAM_TBL (
    "PARAM_NAME" NVARCHAR(256),
    "INT_VALUE" INTEGER,
    "DOUBLE_VALUE" DOUBLE,
    "STRING_VALUE" NVARCHAR(1000)
);
INSERT INTO UHGBT_SCORE_PARAM_TBL VALUES ('FUNCTION', NULL, NULL, 'HGBT');
DO BEGIN
lt_data = SELECT "WINE_DATA_TEST_TBL"."ID", "SCORE_VECTOR", "country", "points", "price", "province", "variety" FROM WINE_DATA_TEST_TBL INNER JOIN VECPCA_PROJ_RESULT_TBL ON "WINE_DATA_TEST_TBL"."ID" = "VECPCA_PROJ_RESULT_TBL"."ID";
--load the trained HGBT model into a table variable
lt_model = SELECT * FROM UHGBT_MODEL_TBL;
lt_param = SELECT * FROM UHGBT_SCORE_PARAM_TBL;
CALL _SYS_AFL.PAL_UNIFIED_CLASSIFICATION_SCORE(:lt_data, :lt_model, :lt_param, lt_result, lt_stats, lt_cm, lt_metrics);
--all evaluation scores are stored in table variable lt_stats
SELECT * FROM lt_stats;
END;

 

4. The Effect of Dimensionality Reduction 

In this section we show some key statistics illustrating the effect of dimensionality reduction on the text embeddings:

[Figure: test accuracy and training time with no embeddings, full-dimensional embeddings, and PCA-reduced embeddings]

It can be seen from the figure above that using the text embeddings as a structured feature benefits the model's scoring performance on the test data (e.g. from 0.48 to 0.77+). However, with the full dimension of the text embeddings, the training time of the model is fairly high (around 40 minutes in our test). In contrast, if we reduce the dimensionality of the text embeddings from 768 to 192, the training time is reduced by over 80%, with only a small degradation of test accuracy (from 0.795 to 0.784) as the price paid. This can be useful in, e.g., scenarios where resources are limited or constrained. Moreover, the high-dimensional nature of text embeddings can be too heavy for traditional machine learning models, reflected in high time/memory consumption. A PCA transformer with a proper dimension can transform high-dimensional text embedding vectors into low-dimensional ones without much affecting the models' overall performance, and both the data storage consumption and the models' training time can be effectively reduced.
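A common way to choose a "proper dimension" for the PCA transformer is to look at the cumulative explained variance of the principal components. The following NumPy sketch, on synthetic data with decaying variance (the data, the 95% threshold, and all names are illustrative assumptions), picks the smallest dimension that retains at least 95% of the variance:

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical embeddings whose variance concentrates in the first directions.
X = rng.normal(size=(500, 64)) * np.linspace(3.0, 0.1, 64)

# Singular values of the centered data give the per-component variance.
Xc = X - X.mean(axis=0)
_, S, _ = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)
cumulative = np.cumsum(explained)

# Smallest number of components keeping at least 95% of the variance.
k = int(np.searchsorted(cumulative, 0.95) + 1)
print(k)
```

In practice, one would compute this once on the training embeddings and then use the chosen `k` as N_COMPONENTS, trading a known fraction of variance for the storage and training-time savings discussed above.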

Note: the training data (excluding the unstructured description texts) with full embedding size (i.e. 768) takes over 250 MB, while it can be as small as 2.5 MB if the text embeddings are removed, so for this dataset the majority of the static memory consumption is attributed to the text embeddings.