With the SAP HANA Cloud 2024 Q4 release, several new embedded Machine Learning / AI and NLP functions have been released with the SAP HANA Cloud Predictive Analysis Library (PAL). Key new capabilities include text analysis, text vectorization, and vector data as input to many more Machine Learning functions, such as classification and regression, in the SAP HANA Cloud database. An enhancement summary is available in the What’s new document for SAP HANA Cloud database 2024.40 (QRC 4/2024).
Text data and documents typically need to be broken down into smaller, more manageable segments, such as sentences, paragraphs, or semantic units, before further analysis. This facilitates subsequent natural language processing (NLP) tasks. Moreover, processing overly long text units with embedding models and Large Language Models (LLMs) can result in information loss or inaccuracies, as these models often have token length limits. Text chunking therefore ensures efficient text processing and mitigates these issues.
A new text chunking function is introduced in the Predictive Analysis Library (PAL), which supports multiple splitting methods. For text chunking, a SQL function as well as a new method in the Python ML client is made available:
The new text splitting function is explained in more detail in the following blog post: Text chunking - an exciting new NLP function in SAP HANA Cloud, by @xinchen.
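As a purely conceptual sketch (plain Python, not the PAL implementation, whose splitting methods and parameters differ), fixed-size chunking with overlap can be illustrated as follows:

```python
# Conceptual sketch of fixed-size text chunking with overlap.
# This is NOT the PAL text chunking function -- only an illustration of
# keeping segments within a model's length limit while preserving context
# at the boundaries via overlap.

def chunk_text(text, chunk_size=40, overlap=10):
    """Split text into overlapping character chunks of at most chunk_size."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the remaining text is already covered
    return chunks

sample = "SAP HANA Cloud PAL introduces text chunking for NLP preprocessing."
pieces = chunk_text(sample, chunk_size=30, overlap=5)
```

The overlap keeps a few characters of shared context between consecutive chunks, which helps downstream embedding and retrieval quality.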
A key technique in Natural Language Processing (NLP) is text analysis, which aims to automatically extract information, such as expressed sentiment or detected entities, from unstructured text data.
A new text analysis function is now introduced within the Predictive Analysis Library in SAP HANA Cloud, which utilizes tiny BERT NLP models to process the text data for the following tasks:
The text analysis task to be applied to each text object can be specified using the task column (e.g. ‘pos, ner, sentiment-phrase-score’) of the input table of the PAL_TEXT_Analysis function:
A more detailed introduction to the new capabilities is provided in the following blog post: New Text Analysis in SAP HANA Cloud Predictive Analysis Library (PAL), by @likun_hou.
Text analysis in SAP HANA Cloud initially supports the following five languages: English, German, French, Spanish, and Portuguese.
Among recent innovations in the field of NLP, the use of deep neural networks has emerged as an efficient technique for processing unstructured texts, as demonstrated by the power of large language models (LLMs). Specifically, texts can be mapped into a high-dimensional latent space by various text embedding models, which map semantically similar or distinct texts to numerically neighboring or distant points in the embedding vector space. Embedding vectors, as structured numerical data, are therefore an excellent choice for encoding the semantic meaning of texts in applications such as similarity search, classification, regression, or clustering.
SAP HANA Cloud introduces new text vectorization capabilities through new text embedding functions (PAL_TEXTEMBEDDING, and the SQL function VECTOR_EMBEDDING), exposing the use of a specific Text Embedding Model in SAP HANA Cloud. This is a big step toward making text processing smoother and more efficient, right where your text data lives: within the database.
Text embedding vectors are stored in SAP HANA Cloud columns of type REAL_VECTOR. Depending on the analysis task and the purpose the embedding vectors are used for, additional techniques may be applied (discussed further below) to reduce the dimensionality of the vector data, lowering the in-memory footprint for storage and processing.
For a more detailed capability description, see the following blog post: Text Embedding Service in SAP HANA Cloud Predictive Analysis Library (PAL), by @likun_hou.
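The idea that semantically related texts map to nearby vectors can be illustrated with a small hypothetical sketch using cosine similarity (the toy 4-dimensional vectors below are made up for illustration; real embedding models emit hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up toy "embeddings" -- real models produce 768+ dimensions.
emb_cat = [0.9, 0.1, 0.0, 0.2]         # e.g. "the cat sleeps"
emb_kitten = [0.85, 0.15, 0.05, 0.25]  # e.g. "a kitten naps"
emb_invoice = [0.0, 0.1, 0.95, 0.1]    # e.g. "invoice due date"
```

The two animal-related texts score far higher against each other than against the unrelated invoice text, which is exactly the property similarity search exploits.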
As part of the SAP HANA Cloud instance provisioning and configuration steps, the advanced settings capability “Natural Language Processing (NLP)” can be selected to provision the NLP model services and enable the use of the text embedding model and the text analysis models. Further details are given in SAP Note 3527706. The capability is also available with the SAP HANA Cloud trial.
Information retrieval from text typically starts with text search techniques. A new keyword-based search function, BM25 search (SEARCH_DOCS_BY_KEYWORDS), is provided with the SAP HANA Cloud Predictive Analysis Library (PAL). BM25 is a widely used ranking function in text search; it estimates the relevance of documents with respect to a list of provided search keywords and builds on term and document frequencies (TF-IDF values).
The new keyword-based search function is explained in more detail in the following blog post: New information retrieval techniques in SAP HANA Cloud using BM25 and ANNS for Advanced Text Mining, by @xinchen.
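To convey the intuition behind BM25 ranking, here is a minimal plain-Python sketch (not the PAL SEARCH_DOCS_BY_KEYWORDS implementation; the exact IDF variant and normalization PAL uses may differ):

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query terms with BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n   # average document length
    df = Counter()                          # document frequency per term
    for d in docs:
        for term in set(d):
            df[term] += 1
    scores = []
    for d in docs:
        tf = Counter(d)                     # term frequencies in this doc
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            # Length normalization: long documents are penalized via b.
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "sap hana cloud vector engine".split(),
    "cooking pasta at home".split(),
    "hana cloud machine learning".split(),
]
scores = bm25_scores(["hana", "cloud"], docs)
```

Documents containing the keywords score above zero, and the shorter matching document ranks slightly higher thanks to length normalization.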
Text embedding vectors make it possible to identify matching documents by similarity or distance in the high-dimensional vector space. Approximate nearest neighbor search (ANNS) techniques scale this task to large, voluminous vectorized text stores by approximating the similarity computation.
The new embedding-based advanced text mining function (approximate nearest neighbor search) applies the Flat IVF algorithm: the original set of vectors is grouped into a smaller number of clusters, and during search the query vector is first compared to the centers of these clusters. Moreover, the PAL ANN search models are instantiated as stateful in-memory models. The combination of approximation and stateful in-memory search models provides much faster responses in search predictions.
The new advanced text mining function is explained in more detail in the following blog post: New information retrieval techniques in SAP HANA Cloud using BM25 and ANNS for Advanced Text Mining, by @xinchen.
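The Flat IVF idea, cluster first and then scan only the most promising cluster(s), can be sketched in a few lines of plain Python (an illustration only; the PAL function learns the clusters and manages the stateful in-memory model internally):

```python
import math

def l2(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_ivf(vectors, centroids):
    """Assign every vector to its nearest centroid (the inverted lists)."""
    lists = {i: [] for i in range(len(centroids))}
    for idx, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda i: l2(v, centroids[i]))
        lists[nearest].append(idx)
    return lists

def ivf_search(query, vectors, centroids, lists, nprobe=1):
    """Scan only the nprobe closest clusters instead of every vector."""
    order = sorted(range(len(centroids)), key=lambda i: l2(query, centroids[i]))
    candidates = [idx for c in order[:nprobe] for idx in lists[c]]
    return min(candidates, key=lambda idx: l2(query, vectors[idx]))

# Toy 2-D "embeddings" forming two obvious groups
vectors = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.2, 4.9]]
centroids = [[0.0, 0.0], [5.0, 5.0]]   # cluster centers (normally learned)
lists = build_ivf(vectors, centroids)
```

Because only the vectors in the probed cluster are compared exactly, the search cost drops roughly by the number of clusters, at the price of a small approximation risk near cluster boundaries.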
Embedding vectors are based on a high-dimensional complex data type, often composed of 768, 1536, or more dimensions (numeric elements), and hence impose a significant storage and processing footprint. Not all application scenarios require the original-scale vector; many can benefit from dimensionality reduction techniques like Principal Component Analysis (PCA). PCA is an efficient technique to extract a lower-dimensional space of variance-representing components (a new “vector space”) by selecting a smaller number of components, e.g. 200, 100, or 64, compared to the dimensionality of the input vector data.
The new vector dimension reduction function in the Predictive Analysis Library, VECPCA, supports REAL_VECTOR data input and provides output vectors of the same type with lower dimensionality. The CATPCA PAL function has been enhanced to support vector input data mixed with other features.
For an in-depth use case description, see the following blog post: Dimensionality Reduction of Text Embeddings for Hybrid Prediction Data, by @likun_hou.
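A minimal NumPy sketch of PCA-style dimensionality reduction (illustrative only; VECPCA operates on REAL_VECTOR columns inside the database, and the data below is random stand-in for real embeddings):

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project the rows of X onto the top n_components principal directions."""
    Xc = X - X.mean(axis=0)                      # center the data
    # Thin SVD of the centered matrix; the rows of Vt are the principal
    # directions, sorted by decreasing explained variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 768))   # toy stand-in for 768-dim embedding vectors
X_reduced = pca_reduce(X, 64)     # keep e.g. 64 components
```

The reduced matrix keeps the directions of highest variance first, which is why a modest number of components often preserves most of the embeddings' useful signal.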
Identifying groups of similar vectors is a common use case, for example to identify groups of similar text documents via their text embedding vectors. Text analysis and term-frequency analysis can then provide simple, out-of-the-box techniques in SAP HANA Cloud to describe the groups by their common or discriminating terms and entities.
In addition to DBSCAN and HDBSCAN, the current release adds support for vector clustering using KMEANS.
The example above shows KMEANS clustering of vectors reduced beforehand with CATPCA, visualizing the cluster IDs mapped onto the PAL TSNE components of the text embedding vectors (see example below).
Another detailed use case description can be found in the following blog post: Document Clustering using KMeans and Text Embeddings, by @likun_hou.
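As a conceptual illustration of KMeans on embedding-like vectors (plain NumPy Lloyd's algorithm, not the PAL KMEANS function), two well-separated toy groups are recovered as two clusters:

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance of every point to every centroid, shape (n, k)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Two well-separated toy "embedding" groups in 4 dimensions
X = np.vstack([np.zeros((5, 4)), np.ones((5, 4)) * 10])
labels, _ = kmeans(X, 2)
```

With real embeddings, the resulting cluster IDs can then be joined back to the documents and profiled via term-frequency analysis, as described above.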
With the use of vector data in machine learning processing, the semantic understanding of your text data can now be unlocked, either as standalone vector attributes or mixed with classic numeric or categorical feature attributes. Such hybrid models, which combine text embeddings with structured features, typically yield superior predictions because they leverage rich contextual information from both textual and structured data.
The PAL functions HGBT and MLP_MULTI_TASK for classification and regression, along with the PARTITION and MLP_RECOMMENDER functions, are now enabled to process and fully exploit the value of your text vector data to build even better prediction models.
The example above illustrates the potential benefit of utilizing text embedding vectors (and VECPCA-reduced vectors) for classifying car complaint categories: classification accuracy is greatly improved by unlocking the insights from the complaint texts and their embedding vectors.
For an in-depth use case description, see the following blog post: Hybrid Prediction with Tabular and Text Inputs using Hybrid Gradient Boosting Trees, by @likun_hou.
However, especially for these classification and regression scenarios, reducing the high dimensionality of text embedding vectors to lower-dimensional PCA vectors may provide a very significant runtime benefit (a much smaller memory footprint and better performance) while maintaining most of the text embeddings' contextual insight. It is therefore clearly recommended to evaluate preprocessing the text embedding vectors with VECPCA or CATPCA.
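The hybrid input described above can be sketched as follows (a hypothetical illustration, not the PAL API: feature names and values are made up, showing only the shape of rows that mix structured attributes with a reduced embedding vector):

```python
# Hypothetical illustration (not the PAL API): assembling hybrid feature rows
# that mix structured attributes with a (PCA-reduced) text embedding vector --
# the kind of combined input HGBT or MLP_MULTI_TASK can now consume.

def hybrid_features(records):
    """Concatenate each record's structured features with its embedding."""
    return [list(struct) + list(emb) for struct, emb in records]

records = [
    ([3.5, 1], [0.12, -0.03, 0.88]),   # numeric/encoded features + embedding
    ([1.2, 0], [0.40, 0.10, -0.20]),
]
X = hybrid_features(records)
```

Each resulting row carries both the structured signal and the semantic signal, which is what gives hybrid models their accuracy advantage.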
New automated outlier detection for regression during AutoML optimization
A new blog post has been added giving an overview of machine learning model and prediction explainability for PAL AutoML and pipeline models, by @xinchen.
Enhanced Multi-task MLP Neural Network modeling
The full list of new methods and enhancements with hana_ml 2.23 is summarized in the changelog for hana-ml 2.23 as part of the documentation. The key enhancements in this release include:
- AutoML and pipeline modeling improvements
- New text- and vector-processing PAL methods
- Further miscellaneous enhancements
You can find example notebooks illustrating the highlighted feature enhancements here: 24QRC04_2.23.ipynb and 24QRC04_2.23_unlocking_text_data.ipynb.