Enterprise Resource Planning Blog Posts by SAP
likun_hou
Product and Topic Expert

1. Introduction 

Text is ubiquitous. On one hand, we constantly understand and make sense of the world around us through text descriptions of all kinds. On the other hand, fully unveiling the power of texts automatically by machines has long proven difficult. Many algorithms for predictive analysis are effective at handling structured/tabular (i.e. numerical and categorical) features, yet text features are usually excluded since they are commonly non-structured. 

To facilitate the use of texts in various predictive analysis tasks, there has been a long-lasting endeavor to convert non-structured texts into structured data types. Earlier efforts of this kind include the renowned bag-of-words model, which represents a text by the collection of words in it while ignoring the fact that those words appear in sequential order. A typical implementation of the bag-of-words model is TF-IDF analysis, where the frequencies of words in a document are 'normalized' by their inverse document frequencies. Ignoring the order of words does no harm in some specific application scenarios, e.g. for clustering documents gathered from several domains where text contents are dominated by domain-specific words. Bag-of-words approaches are also simple and fast for processing texts. However, ignoring word order also makes the corresponding representations insufficient for determining the semantic meaning of the original texts, making them less useful in real applications. 
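To make the bag-of-words idea concrete, the following is a minimal, self-contained sketch of TF-IDF weighting in plain Python (an illustration only, not any library's implementation; the toy sentences reused here are hypothetical):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenized documents.

    TF is the raw term frequency within a document; IDF is
    log(N / df), where df is the number of documents containing
    the term. Word order is ignored entirely."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # count each term once per document
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return weights

docs = [
    "the dog is awesome".split(),
    "the weather is awesome".split(),
    "today is a sunny day".split(),
]
w = tf_idf(docs)
# Terms occurring in every document (e.g. 'is') get weight 0, while
# rarer, more discriminative terms (e.g. 'dog') get larger weights.
```

Note how this representation cannot distinguish "the dog bit the man" from "the man bit the dog", which is exactly the limitation discussed above.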

In recent years, the use of deep neural networks has emerged as an effective technique for processing non-structured texts, as revealed by the power of large language models (LLMs). Specifically, texts can be mapped into a high-dimensional latent space by various text embedding models. After the embedding process, texts become points in the latent space, represented by numerical vectors. In contrast to bag-of-words models, text embedding models commonly adopt the transformer network structure as their main building block, which puts more emphasis on the sequential order of words in texts. Text embedding models are trained/fine-tuned so that semantically similar texts are mapped to metrically neighboring points, and semantically distinct texts are mapped far apart in the latent space. Based on this property, text embedding can be useful for applications that require semantic text similarity. Several typical applications of this kind are listed as follows: 

  • Text/document clustering 
  • Text search & document retrieval 
  • Text classification 
  • Hybrid predictive analysis with structured features and texts 

2. Text Embedding in SAP HANA Cloud Predictive Analysis Library (PAL) 

As mentioned above, texts are widely available and can carry important information. Embedding vectors, as structured numerical data, are currently the best choice for encoding the semantic meaning of texts in applications that include predictive analysis tasks such as classification, regression, and clustering. In view of this, text embedding models and services are now readily available in SAP HANA Cloud and are reachable through the API provided by the SAP HANA Cloud Predictive Analysis Library (PAL). In particular, the text embedding model deployed for PAL usage is of the transformer structure. 

It is first pre-trained on a large corpus and later fine-tuned on the PAL side. Please refer to Text Embedding Model for further details. 

2.1 SQL API for Text Embedding 

The native API for text embedding in SAP HANA Cloud PAL is the SQL interface. Basically, to convert a list of text documents into their corresponding embedding vectors, one first selects the text documents along with their corresponding IDs, and then passes the selected data to the SQL procedure PAL_TEXTEMBEDDING together with the following configuration parameters: 

  • MODEL_VERSION - This parameter specifies the version of the model used for generating text embeddings. Note that the latest model is used if the model version is not specified or is specified incorrectly. 
  • IS_QUERY - This parameter specifies the type of the input text: an integer value of 0 for normal text, and 1 for query text. The same piece of text produces different embedding vectors depending on the specified type. 
  • MAX_TOKEN_NUM - Texts are first tokenized and then passed to subsequent layers of the text embedding model for generating latent representations, i.e. embedding vectors. This parameter specifies the maximum number of tokens for the backend text embedding model to handle. If the length of the tokenized text exceeds MAX_TOKEN_NUM, the text is truncated, causing information loss. A larger value enables the embedding model to handle longer texts more effectively, yet it also increases the resource consumption for generating text embeddings. In most applications, it is recommended to set a moderate value (e.g. the default value) for this parameter in conjunction with text chunking for efficiently handling long texts. Note that the default value of this parameter is model dependent, and the upper limit is 1024. 
  • BATCH_SIZE - In SAP HANA Cloud PAL, texts are batched into requests and then sent to the backend embedding service. This parameter specifies the number of pieces of text batched into a single request. 
  • THREAD_NUMBER - This parameter specifies how many HTTP connections are established simultaneously to the backend embedding service. A larger thread number usually results in less time for generating the embedding vectors. 
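The interplay of BATCH_SIZE and THREAD_NUMBER can be pictured with a tiny Python sketch (purely illustrative; PAL performs this batching and request dispatch internally, and `fake_embed` is a hypothetical stand-in for one request to the embedding service):

```python
from concurrent.futures import ThreadPoolExecutor

def make_batches(texts, batch_size):
    """Group texts into consecutive batches of at most batch_size items
    (the role of BATCH_SIZE)."""
    return [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]

def fake_embed(batch):
    """Hypothetical stand-in for one request to the embedding service:
    returns one dummy 1-dimensional 'vector' per text."""
    return [[float(len(t))] for t in batch]

texts = [f"document {i}" for i in range(7)]
batches = make_batches(texts, batch_size=3)   # batch sizes: 3, 3, 1

# max_workers plays the role of THREAD_NUMBER: how many requests
# are in flight concurrently.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = [vec for vecs in pool.map(fake_embed, batches) for vec in vecs]
```

Larger batches reduce per-request overhead, while more threads overlap the round trips to the service; both trade resource usage for throughput.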

Interested users may refer to the original PAL documentation of Text Embeddings for more details.  

After the embedding process has finished, the embedding vectors are stored in the result table, which is the first output table of the procedure PAL_TEXTEMBEDDING. 

There is also a scalar SQL function VECTOR_EMBEDDING, but this blogpost focuses on the PAL SQL and Python interfaces. One advantage of PAL_TEXTEMBEDDING is its batch processing option, which yields better throughput. 

Note that the embedding model processes text up to a maximum token number (e.g. 256), so users are advised to apply text chunking as described in this blogpost [link to blogpost on text chunking] to manage long texts with many tokens. The token size of an input text can only be determined exactly by tokenizing it with the tokenizer of the corresponding text embedding model, which could be computationally expensive. A simple rule of thumb for common western languages is to take the total number of characters in the text divided by 4 as the approximate token size of the target text.
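The rule of thumb and a simple chunking strategy can be sketched as follows (hypothetical helper functions for illustration, not part of PAL or hana-ml; the exact token count always depends on the model's own tokenizer):

```python
def approx_token_count(text: str) -> int:
    """Rough token estimate for common western languages:
    about one token per 4 characters."""
    return max(1, len(text) // 4)

def chunk_text(text: str, max_tokens: int = 256) -> list:
    """Greedily split text at word boundaries into chunks whose
    approximate token count stays within max_tokens."""
    max_chars = max_tokens * 4
    chunks, current = [], ""
    for word in text.split():
        candidate = (current + " " + word).strip()
        if len(candidate) > max_chars and current:
            chunks.append(current)   # close the current chunk
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

long_text = "lorem ipsum " * 500            # roughly 6000 characters
chunks = chunk_text(long_text, max_tokens=256)
# Each chunk now fits within the approximate 256-token budget and can
# be embedded separately without truncation.
```

Each chunk would then be passed to the embedding procedure as its own row, with an ID scheme that lets the chunks be traced back to their source document.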

2.2 Code Example using SQL API 

In the following, we show an example of how to use the SQL API for text embedding in SAP HANA Cloud PAL.

 

DROP TABLE SAMPLE_TEXT_TAB; 
CREATE COLUMN TABLE SAMPLE_TEXT_TAB( 
    "ID" INTEGER,  
    "TEXT" NVARCHAR(1000) 
); 
INSERT INTO SAMPLE_TEXT_TAB VALUES (1, 'The dog is awesome.'); 
INSERT INTO SAMPLE_TEXT_TAB VALUES (2, 'The weather is awesome.'); 
INSERT INTO SAMPLE_TEXT_TAB VALUES (3, 'Today is a sunny day.'); 
DROP TABLE PAL_TEXT_EMB_PARAMETER_TBL; 
CREATE COLUMN TABLE PAL_TEXT_EMB_PARAMETER_TBL( 
    "PARAM_NAME" NVARCHAR(256),  
    "INT_VALUE" INTEGER,  
    "DOUBLE_VALUE" DOUBLE, 
    "STRING_VALUE" NVARCHAR(1000) 
); 
INSERT INTO PAL_TEXT_EMB_PARAMETER_TBL VALUES ('MODEL_VERSION', NULL, NULL, '20240715'); 
INSERT INTO PAL_TEXT_EMB_PARAMETER_TBL VALUES ('IS_QUERY', 0, NULL, NULL); 
INSERT INTO PAL_TEXT_EMB_PARAMETER_TBL VALUES ('THREAD_NUMBER', 25, NULL, NULL); 
INSERT INTO PAL_TEXT_EMB_PARAMETER_TBL VALUES ('BATCH_SIZE', 20, NULL, NULL); 
DROP TABLE PAL_TEXT_EMB_VEC_RESULT_TBL; 
CREATE COLUMN TABLE PAL_TEXT_EMB_VEC_RESULT_TBL( 
    "ID" INTEGER, 
    "VECTOR" REAL_VECTOR, 
    "EXT" NVARCHAR(5000) 
); 
DROP TABLE PAL_TEXT_EMB_VEC_STAT_TBL; 

CREATE COLUMN TABLE PAL_TEXT_EMB_VEC_STAT_TBL( 
    "KEY" NVARCHAR(5000), 
    "VALUE" NVARCHAR(5000), 
    "EXT" NVARCHAR(5000) 
); 
DO BEGIN 
  tv_data = SELECT * FROM SAMPLE_TEXT_TAB; 
  tv_param = SELECT * FROM PAL_TEXT_EMB_PARAMETER_TBL; 
  CALL _SYS_AFL.PAL_TEXTEMBEDDING(:tv_data, :tv_param, tv_out, tv_out2); 
  INSERT INTO PAL_TEXT_EMB_VEC_RESULT_TBL SELECT * FROM :tv_out; 
END; 

 

Now the embedding vectors of the input texts are stored in the table PAL_TEXT_EMB_VEC_RESULT_TBL, so to view the embedding results directly, one can execute the following SQL statement: 

 

SELECT ID, TO_NVARCHAR(VECTOR) FROM PAL_TEXT_EMB_VEC_RESULT_TBL; 

 

After obtaining the embedding vectors, we can calculate cosine-similarity scores between them to reveal the relative semantic closeness of the original sentences. The following example computes the cosine similarity between the embedding vector of the sentence 'The weather is awesome.' and that of the sentence 'Today is a sunny day.': 

 

SELECT COSINE_SIMILARITY((SELECT VECTOR FROM PAL_TEXT_EMB_VEC_RESULT_TBL WHERE ID=2), (SELECT VECTOR FROM PAL_TEXT_EMB_VEC_RESULT_TBL WHERE ID=3)) AS COS_SIM_SCORE FROM DUMMY; 

 

Other pairwise similarity scores of the three sentences in "SAMPLE_TEXT_TAB" can be computed similarly, and the results are summarized in the following table: 

Sentence Pair (Text1 ID, Text2 ID)    Cosine Similarity Score 
(1, 2)                                0.710001 
(1, 3)                                0.572349 
(2, 3)                                0.801566 

Let us take Sentence 2, i.e. 'The weather is awesome.', as the anchor point. Sentence 1 shares the keyword 'awesome' with Sentence 2, yet the two are semantically unrelated; in contrast, Sentence 2 and Sentence 3 share no keyword, yet they are semantically related since both talk about good weather. As revealed by the pairwise cosine similarity scores, in the latent embedding space Sentence 2 and Sentence 3 are much more similar than Sentence 2 and Sentence 1. This observation is consistent with the claim that text embeddings encode the semantic meaning of the target texts and map semantically similar texts close to each other. 
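For reference, the COSINE_SIMILARITY function used above computes the standard cosine of the angle between two vectors. A plain-Python equivalent looks like this (the three toy 3-dimensional vectors are made-up stand-ins, not real PAL embeddings):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy stand-ins for embedding vectors:
vec_a = [0.2, 0.8, 0.1]
vec_b = [0.25, 0.75, 0.15]   # points in nearly the same direction as vec_a
vec_c = [0.9, 0.05, 0.4]     # points in a clearly different direction

sim_ab = cosine_similarity(vec_a, vec_b)   # close to 1
sim_ac = cosine_similarity(vec_a, vec_c)   # noticeably smaller
```

Because cosine similarity depends only on direction, not magnitude, it is a natural match for embedding vectors, where semantic meaning is encoded in the direction within the latent space.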

2.3 Python API for Text Embedding 

The Python API for text embedding is provided in the Python machine learning client for SAP HANA (hana-ml). It is wrapped in a Python class called PALEmbeddings, with a fit_predict() method that accepts configurable parameters for generating embedding vectors from input texts. The configurable parameters are the same as those listed for the SQL API, except that they are in lower case. However, there are also some notable distinctions between the SQL API and the Python API, listed as follows: 

  • In the SQL API, all five configurable parameters are passed to the procedure PAL_TEXTEMBEDDING; in the Python API, the parameters model_version and max_token_num are placed in the initialization method of the class PALEmbeddings, while the other three parameters are placed in its fit_predict() method. 
  • In the SQL API, the input data for embedding must be text pieces with IDs. In the Python API, the input data for the fit_predict() method of the PALEmbeddings class can be any data that contains the target text pieces and their corresponding IDs. The fit_predict() method also provides two additional parameters, namely key and target, to help users specify the ID column and the target text column in the input data. Moreover, the result generated by the Python API (i.e. by the fit_predict() method of PALEmbeddings) contains not only the text embeddings but also retains the other non-target columns of the original input data. 

2.4 Code Example using the Python API 

To better illustrate the use of the Python API for text embedding, we now show the Python code for processing the texts from subsection 2.2. 

 

from hana_ml.text.pal_embeddings import PALEmbeddings

# cc is assumed to be a ConnectionContext object representing a valid
# connection to the SAP HANA Cloud instance where the text data
# (with IDs) is stored in the table SAMPLE_TEXT_TAB.
embed_model = PALEmbeddings(model_version='20240715')
text_data = cc.table('SAMPLE_TEXT_TAB')
text_embed = embed_model.fit_predict(data=text_data,
                                     key='ID',
                                     target='TEXT',
                                     thread_number=20,
                                     batch_size=25,
                                     is_query=False)

 

To see the resulting text embeddings in the python interface, one can simply apply 

 

text_embed.collect() 

 

which converts the SAP HANA DataFrame text_embed into a Pandas DataFrame in Python and then displays its content. 

Suppose we have a query sentence 'What is the weather like today?' and would like to retrieve a reasonable answer from the sentences in SAMPLE_TEXT_TAB. Intuitively, we know that sentence 2 ('The weather is awesome.') and sentence 3 ('Today is a sunny day.') are both relevant, while sentence 1 ('The dog is awesome.') is not. We can transform the query sentence into an embedding vector, and then compute the cosine similarity scores between this vector and the embedding vectors of the sentences in table SAMPLE_TEXT_TAB. Assume that the query sentence is contained in a table QUERY_TEXT_TAB structured as follows: 

 

ID    QUERY 
1     'What is the weather like today?' 

 

Continuing from the code for embedding the sentences in SAMPLE_TEXT_TAB, the code for query embedding and cosine similarity score computation can be illustrated as follows: 

 

query_data = cc.table('QUERY_TEXT_TAB')
query_embed = embed_model.fit_predict(data=query_data,
                                      key='ID',
                                      target='QUERY',
                                      is_query=True)  # True, since the input sentence serves as a query
sim_scores = []  # for storing pairwise similarity scores
for i in [1, 2, 3]:
    sql = ("SELECT COSINE_SIMILARITY((SELECT VECTOR FROM ({}) WHERE ID=1), "
           "(SELECT VECTOR FROM ({}) WHERE ID={})) AS COS_SIM_SCORE "
           "FROM DUMMY").format(query_embed.select_statement,
                                text_embed.select_statement, i)
    sim_scores.append(cc.sql(sql).collect()["COS_SIM_SCORE"][0])

 

 

 

 

The resulting similarity scores are listed as follows: 

Sentence Pair (Query ID, Text ID)    Cosine Similarity Score 
(1, 1)                               0.342104 
(1, 2)                               0.617587 
(1, 3)                               0.700786 

As illustrated by the similarity scores, the query is much more related to Sentence 2 and Sentence 3 compared to Sentence 1 in SAMPLE_TEXT_TAB, which agrees with our intuition. 

3. Discussion on the Dimensionality of Text Embedding Vectors 

Embedding models are powerful tools for encoding non-structured texts as structured embedding vectors, greatly facilitating the use of text features in various applications and improving upon traditional bag-of-words models. However, the benefits of using embedding vectors come with extra costs. Besides the computational cost of generating text embeddings, the high-dimensional nature of embedding vectors also makes them storage-demanding. Since many predictive analysis algorithms are sensitive to the dimensionality of the input data, they require more computational and memory resources when embedding vectors get involved. 

A straightforward way to reduce these extra costs would be to choose a lower-dimensional latent space, but this is usually inappropriate for generic texts, since it results in a significant loss of information. However, since in most applications the text data is restricted to specific domains, it is often possible to represent those texts by numerical vectors in a much lower-dimensional latent space. One way to achieve this is to use a universal text embedding model with a high-dimensional latent space, followed by an application-dependent pre-processor for dimensionality reduction. To cater to this need, a new procedure called PAL_VECPCA is introduced in SAP HANA Cloud PAL, which reduces high-dimensional embedding vectors to lower-dimensional ones via principal component analysis (PCA). 
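The idea behind such a PCA pre-processor can be sketched with NumPy (a simplified illustration on randomly generated stand-in vectors, not the PAL_VECPCA procedure itself; the dimensions and the rank-10 structure are made up for the example):

```python
import numpy as np

def pca_reduce(X: "np.ndarray", k: int) -> "np.ndarray":
    """Project the rows of X onto their top-k principal components."""
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data; the rows of Vt are the principal directions.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:k].T

rng = np.random.default_rng(0)
# 100 stand-in "embedding vectors" of dimension 768, generated from a
# 10-dimensional latent structure (as domain-restricted texts might be):
latent = rng.normal(size=(100, 10))
X = latent @ rng.normal(size=(10, 768))

X_reduced = pca_reduce(X, k=10)
# The 768-dimensional vectors are reduced to 10 dimensions; because the
# true rank here is 10, essentially no variance is lost.
```

For real embeddings the variance is spread over more components, so k trades storage and compute savings against information loss; in-domain data typically tolerates far smaller k than generic text.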

4. Summary 

In this blogpost we have introduced how to use the text embedding service through the API provided by the SAP HANA Cloud Predictive Analysis Library (PAL). Both the SQL interface and the Python interface are illustrated, with detailed parameter descriptions and code examples. We have also discussed the dimensionality issue of text embedding vectors and presented the solution provided in SAP HANA Cloud PAL for alleviating it. 
