Technology Blog Posts by SAP
ChristophMorgen
Product and Topic Expert

With the SAP HANA Cloud 2025 Q2 release, several new embedded Machine Learning / AI functions have been released with the SAP HANA Cloud Predictive Analysis Library (PAL) and the Automated Predictive Library (APL).

Key new capabilities include

  • a new text embedding model version covering more languages and short-text embedding scenarios,
  • more tabular AI functions that support vector data processing, such as AutoML and k-nearest neighbors (k-NN),
  • outlier detection using isolation forest with support for outlier explainability,
  • machine learning experiment tracking and task scheduling capabilities,
  • data drift analysis using the Automated Predictive Library.

An enhancement summary is available in the What’s new document for SAP HANA Cloud database 2025.14 (QRC 2/2025).

Text processing, text embedding and vector processing using ML functions

Text tokenization

A new text tokenization function has been released. It splits text into tokens, a fundamental preparation step in natural language processing (NLP) for text analysis and for downstream tasks such as text embedding or vector similarity search. The new function supports key tokenization capabilities such as

  • removal of common stopwords (e.g., "the", "and", "is"),
  • filtering out purely numeric and alphanumeric tokens,
  • specification of characters or symbols that should always be kept or removed,
  • controls for stemming and language detection.
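
To make these options more tangible, here is a minimal plain-Python sketch of a tokenizer with stopword removal, numeric filtering and keep/remove character lists. It does not call the PAL function; the function name `tokenize`, its parameters and the tiny stopword list are assumptions made for illustration, and stemming and language detection are left out.

```python
import re

# Illustrative stopword list (an assumption); PAL's own lists are not shown in this post.
STOPWORDS = {"the", "and", "is", "a", "of", "in"}


def tokenize(text, remove_stopwords=True, drop_numeric=True,
             keep_chars="", remove_chars=""):
    """Split text into lowercase tokens and apply a few optional filters."""
    # Characters listed in remove_chars are turned into separators first.
    for ch in remove_chars:
        text = text.replace(ch, " ")
    # A token is a run of word characters plus any explicitly kept characters.
    pattern = r"[\w" + re.escape(keep_chars) + r"]+"
    tokens = re.findall(pattern, text.lower())
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    if drop_numeric:
        # Only purely numeric tokens are dropped in this sketch.
        tokens = [t for t in tokens if not t.isdigit()]
    return tokens


print(tokenize("The invoice 4711 is overdue and unpaid"))
print(tokenize("C# and F#", keep_chars="#"))
```

The first call keeps only the content words, the second shows how a kept character ("#") stays attached to its token.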

Text embedding

A new text embedding model version (SAP_GXY.20250407) is available. It has been fine-tuned from a roberta-base-encoder model with additional training data to improve retrieval accuracy, to better handle short-text scenarios, and to extend multilingual and cross-lingual retrieval scenarios. Newly supported languages include Chinese (CH), Japanese (JP) and Italian (IT). The default token length for the embedding functions has been increased to 512.

Extended embedding vector processing by Machine Learning functions

Embedding vectors unlock the semantic understanding of text data stored in SAP HANA Cloud, not only for similarity search but also for machine learning scenarios such as document or text classification and clustering. Support for vector input has now been extended to more PAL machine learning functions:

  • K-nearest neighbor models (K-NN) for classification/regression/similarity search
  • Random Decision Trees for classification/regression models
  • AutoML scenarios for classification, regression and time series
    • including the fit, prediction, scoring, and pipeline procedures
    • supported with all pipeline operators (algorithms), except for the Imputer and ImputeTS operators
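
As a conceptual illustration of what a k-NN classifier does with embedding vectors, the following plain-Python sketch ranks labeled vectors by cosine similarity and takes a majority vote. It is not the PAL implementation; the function names, the toy 3-dimensional vectors and the labels are assumptions chosen for the example.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def knn_classify(query, labeled_vectors, k=3):
    """Majority vote among the k labeled vectors most similar to the query."""
    ranked = sorted(labeled_vectors,
                    key=lambda item: cosine_similarity(query, item[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy "embeddings" (assumed data); real embedding vectors have far more dimensions.
training = [
    ([0.9, 0.1, 0.0], "invoice"),
    ([0.8, 0.2, 0.1], "invoice"),
    ([0.1, 0.9, 0.2], "complaint"),
    ([0.0, 0.8, 0.3], "complaint"),
]

print(knn_classify([0.85, 0.15, 0.05], training, k=3))
```

The same ranking step, without the vote, is what a similarity search returns.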

Machine Learning function enhancements

Outlier detection enhancements using Isolation Forests

The Isolation Forest function for outlier detection has been enhanced with

  • support for categorical columns as features in outlier models,
  • a new massive outlier detection function for parallel analysis of multiple data subsets (PAL_MASSIVE_ISOLATION_FOREST),
  • a new outlier explanation method based on Shapley values (PAL_ISOLATION_FOREST_EXPLAIN)
    • providing local explanations for the predicted outlier classification,
    • describing the contribution of each feature to the predicted classification, based on a Tree-SHAP explainer model,
    • with options to configure the explanation: contamination level (proportion of outliers), scope (explanations for outliers only or for all predictions), and top-k contributions (number of features included in each explanation).

Isolation Forest is a popular and effective method for outlier detection that can be applied to any data inside the database. This makes it especially suitable for use cases where data must not leave the system or is too large to be copied out for analysis, for example detecting outliers in financial accounting data such as the universal journal (ACDOCA).
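
For readers new to the method, the sketch below shows the textbook isolation forest anomaly score and how a contamination level turns scores into outlier flags. It follows the standard definition of the algorithm, not PAL internals, which this post does not describe; the function names and sample scores are assumptions.

```python
import math

EULER_GAMMA = 0.5772156649


def average_path_length(n):
    """Normalization term c(n): average path length of an unsuccessful
    binary search tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0  # special case used by common implementations
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(mean_path_length, n):
    """Score 2^(-E[h(x)] / c(n)) for n >= 2; values near 1 indicate outliers."""
    return 2.0 ** (-mean_path_length / average_path_length(n))


def flag_outliers(scores, contamination):
    """Flag the top `contamination` share of points by score (ties are all flagged)."""
    n_outliers = max(1, round(contamination * len(scores)))
    threshold = sorted(scores, reverse=True)[n_outliers - 1]
    return [s >= threshold for s in scores]


# A point isolated after few splits scores higher than one that needs many.
print(anomaly_score(2.0, 256) > anomaly_score(10.0, 256))
print(flag_outliers([0.45, 0.48, 0.81, 0.50], contamination=0.25))
```

Points that are isolated after only a few random splits have short paths and therefore scores closer to 1, which is why the contamination setting can be read as "flag the highest-scoring share of rows".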

Constrained clustering

A new constrained clustering function is introduced. Constrained clustering is an advanced form of clustering that incorporates domain-specific constraints or prior information to guide the clustering process, so that the resulting groups better match specific analytical needs. Traditional clustering methods cannot take prior knowledge about the data into account, which often makes it hard to obtain meaningful or contextually relevant groupings.

Prior knowledge can be included in the clustering process as

  • Pairwise constraints: must-link / must-not-link constraints state that two data points must, or must not, be placed in the same cluster.
  • Triplet constraints: given an anchor instance (a), a positive instance (p), and a negative instance (n), the constraint states that a is more similar to p than to n.
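
The two constraint types can be expressed as simple checks on a clustering result. The following sketch is only meant to make the definitions concrete; it is not how the PAL function enforces constraints, and the function names, the use of Euclidean distance and the sample data are assumptions.

```python
def pairwise_violations(labels, must_link, must_not_link):
    """Return the pairwise constraints violated by a cluster assignment.

    labels[i] is the cluster of point i; constraints are (i, j) index pairs.
    """
    violated = []
    for i, j in must_link:
        if labels[i] != labels[j]:
            violated.append(("must-link", i, j))
    for i, j in must_not_link:
        if labels[i] == labels[j]:
            violated.append(("must-not-link", i, j))
    return violated


def triplet_satisfied(points, anchor, positive, negative):
    """True if the anchor is closer to the positive than to the negative instance."""
    def squared_distance(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    return (squared_distance(points[anchor], points[positive])
            < squared_distance(points[anchor], points[negative]))


labels = [0, 0, 1, 1]
print(pairwise_violations(labels, must_link=[(0, 1), (1, 2)], must_not_link=[(2, 3)]))

points = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
print(triplet_satisfied(points, anchor=0, positive=1, negative=2))
```

A constrained clustering algorithm searches for an assignment that keeps such violations at zero, or as low as possible.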

A detailed introduction to the new function is given in the following blog post: https://community.sap.com/t5/technology-blog-posts-by-sap/clustering-text-documents-using-constraine...

 

Further enhanced machine learning algorithms for Tabular AI scenarios

The recently introduced Multi-task MLP (multi-layer perceptron) neural network function, which predicts multiple targets / labels with a single model, has been enhanced with an improved built-in model evaluation and parameter search interface. This shortens the path to a well-tuned model.

The new optimization can be used by calling the function directly or through the Unified Classification/Regression functions. It supports searching for and selecting optimal values of the following neural network model parameters:

  • LEARNING_RATE, EMBEDDED_NUM, RESIDUAL_NUM, DROPOUT_PROB, HIDDEN_LAYER_SIZE, HIDDEN_LAYER_ACTIVE_FUNC, OPTIMIZER
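
To illustrate what a parameter search does in principle, here is a generic random search sketch in plain Python. It is not the PAL search interface: the search strategy, the candidate values and the `toy_score` function are assumptions, and only the parameter names are taken from the list above.

```python
import random


def random_search(search_space, evaluate, n_trials=20, seed=0):
    """Sample parameter combinations at random and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: rng.choice(values) for name, values in search_space.items()}
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical candidate values; only the parameter names come from the post.
search_space = {
    "LEARNING_RATE": [0.001, 0.01, 0.1],
    "DROPOUT_PROB": [0.0, 0.1, 0.3],
    "HIDDEN_LAYER_SIZE": [32, 64, 128],
}


def toy_score(params):
    """Stand-in for a validation metric; a real run would train and score a model."""
    return -abs(params["LEARNING_RATE"] - 0.01) - params["DROPOUT_PROB"]


best_params, best_score = random_search(search_space, toy_score, n_trials=30)
print(best_params, best_score)
```

In practice the evaluation step is the expensive part, since every candidate requires a model to be trained and scored, which is why a built-in search interface inside the database is convenient.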

 

In the domain of time series analysis and forecasting, the ARIMA forecasting function now retains the time index context of the training data:

  • the trained ARIMA model keeps track of the final time index value and the determined index interval,
  • ARIMA_PREDICT subsequently uses this information to output forecast values with a continuous index.
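
The idea of a continuous index can be shown in a few lines: once the last training index and the interval are known, the forecast index simply continues from there. This is a plain-Python illustration, not the ARIMA_PREDICT implementation; the function name and sample values are assumptions.

```python
from datetime import date, timedelta


def forecast_index(last_index, interval, horizon):
    """Continue an index after `last_index` in steps of `interval`.

    Works for integer indexes as well as date/timedelta pairs.
    """
    return [last_index + interval * step for step in range(1, horizon + 1)]


# Integer index: training data ended at 100 with a step of 1.
print(forecast_index(100, 1, 3))

# Daily time stamps: training data ended on 2025-06-30.
print(forecast_index(date(2025, 6, 30), timedelta(days=1), 2))
```

Without this stored context, a forecast would typically be returned with a generic index that has to be mapped back to the original time axis by hand.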

 

Machine Learning experiment tracking and task scheduling

Experiment tracking and monitoring for PAL ML models

Machine learning experimentation requires robust tracking capabilities to ensure reproducibility, comparison, and auditability of models. SAP HANA Cloud's new ML tracking feature provides seamless integration with Predictive Analysis Library (PAL) procedures, enabling automatic logging of critical experiment artifacts. This end-to-end tracking solution captures parameters, datasets, models, metrics, and visualizations in a structured way, transforming how data scientists manage ML workflows.

The new execution tracking of PAL procedures supports

  • use with all Unified Regression/Classification and Pipeline fit/score procedures (through new interfaces with the “_TRACK” suffix),
  • tracking of content including parameters, dataset metadata, model signature, metrics (including variable importance), figures (discrete and continuous), etc.,
  • management of tracking data in a built-in schema PAL_ML_TRACK with tables for metadata, log and header information,
  • activation of tracking through procedure parameters such as LOG_ML_TRACK, TRACK_ID, etc.

A more detailed introduction is provided in the following blog post: https://community.sap.com/t5/technology-blog-posts-by-sap/comprehensive-guide-to-mltrack-in-sap-hana...

 

Task scheduling for PAL procedures

The new PAL task scheduling allows you to run SQLScript procedures (which call PAL procedures) asynchronously through the cron-based SAP HANA Cloud scheduler. The targeted SQLScript (PAL) procedure call is mapped to a defined task with a task ID, task description, owner, etc. A task can be scheduled for execution; a job is an instance of a scheduled task.

  • The built-in schema PAL_SCHEDULED_EXECUTION provides tables for storing task definitions/metadata, relationships, procedure parameters, TASK_SCHEDULE_JOB, and the task (operation) log.
  • The actual job execution information is provided in the system views SCHEDULE_JOBS and M_SCHEDULE_JOBS.
  • New PAL procedures are provided for task creation (AFLPAL_CREATE_TASK_PROC) and removal, for creating a scheduled job for a task (AFLPAL_CREATE_TASK_SCHEDULE_PROC), and for altering, pausing, resuming and removing task jobs.

The Python ML client (hana-ml) adds interfaces and methods that make these new capabilities easy to use when developing HANA ML scenarios.

Data Drift detector with the Automated Predictive Library (APL)

The new data drift detector in the APL helps you spot changes or deviations between a given dataset and a reference. The reference could be an earlier version of the data, a particular segment of customers or employees, or an expected distribution (e.g. Benford). Example use cases for such data comparisons are:

  • This year's employee survey results versus last year's results, by country
  • Machine learning inference dataset versus training dataset
  • Male staff versus female staff
  • Payment amounts by legal entity versus Benford’s law to find potential fraud

This APL feature is available for both Python and SQL. For more details, see the blog post on the HANA ML Data Drift Detector.
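
The Benford comparison from the list of use cases can be illustrated with a short plain-Python sketch. Benford's law gives the expected share of leading digit d as log10(1 + 1/d); the code compares this with the observed first-digit shares of a set of amounts. The distance measure (total variation distance), the function names and the sample amounts are assumptions for illustration and do not describe how APL measures drift.

```python
import math
from collections import Counter


def benford_expected():
    """Expected share of each leading digit 1..9 under Benford's law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit_shares(amounts):
    """Observed share of each leading digit among the non-zero amounts."""
    digits = [int(str(abs(a)).lstrip("0.")[0]) for a in amounts if a != 0]
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}


def total_variation_distance(observed, expected):
    """Half the sum of absolute differences; 0 means identical distributions."""
    return 0.5 * sum(abs(observed[d] - expected[d]) for d in range(1, 10))


# Assumed sample of payment amounts.
amounts = [120, 0.05, 19, 5, 1300, 27.5, 110, 3.2]
observed = first_digit_shares(amounts)
print(total_variation_distance(observed, benford_expected()))
```

A distance close to 0 means the amounts follow the expected distribution; with a sample this small the value is only illustrative, and a larger value on a real dataset would be a reason to look more closely at the data.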

Python ML client (hana-ml) enhancements

The full list of new methods and enhancements in hana-ml 2.25 is summarized in the changelog for hana-ml 2.25, which is part of the documentation. The key enhancements in this release include

Text and vector processing enhancements

  • New embedding model version SAP_GXY.20250407

Classification / regression function enhancements

  • Use of Multi-task MLP with UnifiedClassification and UnifiedRegression
    • Benefit from unified interface capabilities such as the score function, resampling, parameter search and optimization, model report, …
  • HGBT regressor with Linear Tree trend extrapolation

AutoML and pipeline modeling improvements

  • Faster random search with hyperband optimization support
  • Progress monitor enhancements for fine-tuning and random search

HANA ML experiment tracking and task schedule execution

  • Experiment tracking and experiment monitor UI (detailed introduction is provided in the referenced blog post)
  • Enhanced PAL task schedule and job schedule execution UI

An example notebook illustrating the highlighted feature enhancements is available here: 25QRC02_2.25.ipynb

 
