Technology Blog Posts by SAP
ChristophMorgen
Product and Topic Expert

With the SAP HANA Cloud 2025 Q3 release, several new embedded Machine Learning / AI functions have been released with the SAP HANA Cloud Predictive Analysis Library (PAL) and the Automated Predictive Library (APL).

 

Time series analysis and forecasting function enhancements

Threshold support in time series outlier detection

In a time series, an outlier is a data point that deviates from the general behavior of the remaining data points. In the PAL time series outlier detection function, the outlier detection task is divided into two steps:

  • In step 1, the residual values are derived from the original series.
  • In step 2, the outliers are detected from the residual values.

Multiple methods are available to evaluate whether a data point is an outlier or not:

  • Z1 score, Z2 score, IQR score, MAD score, Isolation Forest, DBSCAN
  • If used in combination, outlier voting can be applied for a combined evaluation.

In addition, threshold values for outlier scores are now supported:

  • New parameter OUTPUT_OUTLIER_THRESHOLD
  • Based on the given threshold value, if a time series value is beyond the (upper or lower) outlier threshold, the corresponding data point is flagged as an outlier.
  • Only valid when outlier_method = 'iqr', 'isolationforest', 'mad', 'z1', 'z2'.
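The two-step approach can be illustrated with a minimal, self-contained sketch: derive residuals against a simple moving-average trend, then flag residuals whose z-score exceeds a threshold. This is a conceptual analogue of the 'z1' method with a threshold, not PAL's actual implementation; the function name and smoothing choice here are illustrative.

```python
from statistics import mean, stdev

def detect_outliers(series, threshold=3.0):
    """Toy two-step time series outlier detection (illustrative only,
    not the PAL algorithm): step 1 derives residuals from a centered
    moving-average trend, step 2 flags residuals whose z-score
    exceeds the given threshold."""
    n = len(series)
    # Step 1: residuals = series minus a centered moving average
    trend = [mean(series[max(0, i - 1):min(n, i + 2)]) for i in range(n)]
    residuals = [x - t for x, t in zip(series, trend)]
    # Step 2: flag residuals beyond threshold * standard deviation
    mu, sigma = mean(residuals), stdev(residuals)
    return [i for i, r in enumerate(residuals)
            if sigma > 0 and abs(r - mu) / sigma > threshold]

series = [10, 11, 10, 12, 11, 95, 10, 11, 12, 10]
print(detect_outliers(series, threshold=2.0))  # → [5]
```

Raising the threshold makes the detection more conservative; with threshold=3.0 the spike at index 5 would no longer be flagged in this toy example.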


Classification and regression function enhancements

Coreset sampling support with SVM models

Coreset sampling is a machine learning technique that

  • selects a small, representative subset (the "coreset") from a larger dataset,
  • enables faster, more efficient training and processing while maintaining similar model accuracy to using the full data,
  • works by identifying the most "informative" samples, filtering out redundant or noisy data, and allowing complex algorithms to run on a manageable dataset size.

Support Vector Machine (SVM) model training is computationally expensive, and the cost is particularly sensitive to the number of training points, which often makes SVM models impractical for large datasets.

Therefore, SVM in the Predictive Analysis Library has been enhanced and now

  • offers embedded coreset sampling capabilities,
  • enabled with the new parameters USE_CORESET and CORESET_SCALE (the sampling ratio used when constructing the coreset).

This enhancement significantly reduces SVM training time with minimal impact on accuracy.
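As a rough intuition for what a sampling ratio like CORESET_SCALE controls, the sketch below draws a fixed fraction of points per class before training. Real coreset constructions weight points by importance rather than sampling uniformly, and this stratified draw is only a hypothetical simplification, not PAL's method.

```python
import random
from collections import defaultdict

def coreset_sample(X, y, scale=0.1, seed=42):
    """Naive stratified subsampling sketch: keep a `scale` fraction of
    points per class (loosely analogous to a sampling ratio such as
    CORESET_SCALE; actual coreset algorithms select points by
    informativeness, not uniformly at random)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    chosen = []
    for label, idxs in by_class.items():
        k = max(1, round(len(idxs) * scale))  # keep at least one per class
        chosen.extend(rng.sample(idxs, k))
    chosen.sort()
    return [X[i] for i in chosen], [y[i] for i in chosen]

X = [[float(i)] for i in range(100)]
y = [0] * 50 + [1] * 50
Xc, yc = coreset_sample(X, y, scale=0.1)
print(len(Xc))  # → 10 points instead of 100
```

An SVM trained on the reduced set then sees an order of magnitude fewer points, which is where the training-time savings come from.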


AutoML and pipeline function enhancements

Target encoding support in AutoML 

The PAL AutoML framework introduces a new pipeline operator for target encoding of categorical features.

  • Categorical data often needs to be preprocessed and converted from non-numerical features into formats suitable for the respective machine learning algorithm, i.e. numeric values.
    • Example features: text labels (e.g., “red,” “blue”) or discrete categories (e.g., “high,” “medium,” “low”)
  • One-hot encoding converts each categorical feature value into a binary column (0 or 1), which works well for features with a limited number of unique values. PAL already applies an optimized one-hot encoding method that aggregates very infrequent values.
  • Target encoding replaces the categorical values with the mean of the target / label column for high-cardinality features, which avoids creating large and sparse one-hot encoded feature matrices.
    • Examples of high-cardinality features: a “city” column with hundreds to thousands of unique values, postal codes, product IDs, etc.

The PAL AutoML engine analyzes the input feature cardinality and then automatically decides whether to apply target encoding or another encoding method. For medium- to high-cardinality categorical features, target encoding may improve performance significantly.

By automating target encoding, the PAL AutoML engine aims to improve model performance and generalization, especially when dealing with complex, high-cardinality categorical features, without requiring manual intervention.
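The core idea of target encoding can be sketched in a few lines: each category is replaced by the mean of the target column over the rows carrying that category. This is a minimal illustration; the exact smoothing and regularization behavior of the PAL pipeline operator is not documented here, and the optional smoothing parameter below is an assumption for demonstration.

```python
from collections import defaultdict
from statistics import mean

def target_encode(categories, targets, smoothing=0.0):
    """Minimal target encoding sketch: map each category to the mean
    of the target column for that category. `smoothing` optionally
    blends the category mean with the global mean (a common trick
    against overfitting on rare categories; illustrative only)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    global_mean = mean(targets)
    enc = {c: (sums[c] + smoothing * global_mean) / (counts[c] + smoothing)
           for c in counts}
    return [enc[c] for c in categories]

cities = ["berlin", "paris", "berlin", "rome", "paris", "berlin"]
churn  = [1, 0, 1, 0, 1, 0]
print(target_encode(cities, churn))  # berlin → 2/3, paris → 0.5, rome → 0.0
```

Note that a single numeric column replaces what one-hot encoding would expand into one column per distinct city, which is exactly the sparsity problem target encoding avoids.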

In addition, the AutoML and pipeline function now also support columns of type half precision vector.

 

Misc. Machine Learning and statistics function enhancements

High-dimensional feature data reduction using UMAP

UMAP (Uniform Manifold Approximation and Projection) is a non-linear dimensionality reduction algorithm used to simplify complex, high-dimensional feature spaces while preserving their essential structure. It is widely considered a modern standard for visualization-oriented dimensionality reduction of large-scale datasets, because it balances computational speed with the ability to maintain both local and global relationships.

  • It reduces thousands of variables (dimensions) into 2D or 3D scatter plots that humans can easily interpret.
  • Unlike comparable methods like t-SNE, UMAP is better at preserving global structure, meaning the relative positions between different clusters remain more meaningful.
  • It is significantly faster and more memory-efficient than t-SNE, capable of processing datasets with millions of points in a reasonable timeframe.
  • It can be used as a "transformer" preprocessing step in Machine Learning scenarios to reduce large feature spaces before applying clustering (e.g., k-means, HDBSCAN) or classification models, often improving their performance.

The following new functions are introduced:

  • _SYS_AFL.PAL_UMAP, with the most important parameters N_NEIGHBORS, MIN_DIST, N_COMPONENTS, DISTANCE_LEVEL
  • _SYS_AFL.PAL_TRUSTWORTHINESS, used to measure the structural similarity between the original high-dimensional space and the embedded low-dimensional space based on the K nearest neighbors.
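To make the trustworthiness idea concrete, the sketch below implements the commonly used definition of the score (as popularized by scikit-learn): it penalizes points that are k-nearest neighbors in the embedded space but were not neighbors in the original space. Whether PAL_TRUSTWORTHINESS uses exactly this formula is an assumption; treat this as a conceptual illustration.

```python
def pairwise_dists(A):
    """Squared Euclidean distances between all row pairs."""
    return [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in A] for p in A]

def trustworthiness(X, X_emb, k=2):
    """Trustworthiness score in [0, 1]: 1.0 means every embedded-space
    k-NN was already a k-NN in the original space; lower values mean
    the embedding pulled in "false" neighbors. Valid for k < n/2."""
    n = len(X)
    d_orig, d_emb = pairwise_dists(X), pairwise_dists(X_emb)
    total = 0
    for i in range(n):
        order_orig = sorted(range(n), key=lambda j: d_orig[i][j])
        rank = {j: r for r, j in enumerate(order_orig)}  # rank 0 = self
        knn_orig = set(order_orig[1:k + 1])
        knn_emb = sorted((j for j in range(n) if j != i),
                         key=lambda j: d_emb[i][j])[:k]
        for j in knn_emb:
            if j not in knn_orig:
                total += rank[j] - k  # penalty grows with original rank
    return 1 - 2 * total / (n * k * (2 * n - 3 * k - 1))

X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
print(trustworthiness(X, X, k=1))  # identity embedding → 1.0
```

Collapsing all points onto a single embedded coordinate, by contrast, destroys the neighborhood structure and drives the score below 1.0.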

 

Calculating pairwise distances

Many algorithms, for example clustering algorithms, utilize distance matrices as a preprocessing step, often built into the functions themselves. However, it is frequently desirable to decouple the distance matrix calculation from the follow-up task, such as the actual clustering. Moreover, once decoupled, custom-calculated matrices can be fed into algorithms as input.

  • Most PAL clustering functions support feeding in a pre-calculated similarity matrix.

Now, a dedicated pairwise distance calculation function is provided:

  • It supports distance metrics such as Manhattan, Euclidean, Minkowski, and Chebyshev, as well as Levenshtein
  • The Levenshtein distance (or “edit distance”) is a distance metric specifically targeting the distance between text columns.
    • It calculates the minimum number of single-character edits (insertions, deletions, or substitutions) needed to transform one word into another, acting as a measure of their similarity. A lower distance indicates a higher similarity.
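The edit-distance definition above maps directly onto the classic dynamic-programming algorithm, sketched here in a row-by-row form (this is the standard Levenshtein algorithm, not PAL's internal code):

```python
def levenshtein(a, b):
    """Classic DP edit distance: minimum number of single-character
    insertions, deletions, or substitutions turning `a` into `b`."""
    prev = list(range(len(b) + 1))       # distance from "" to b[:j]
    for i, ca in enumerate(a, start=1):
        curr = [i]                        # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb))) # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # → 3
```

Applied to two text columns, small distances between value pairs indicate candidate matches, which is what makes the metric useful for the data cleaning and column mapping scenarios described below.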

Applicable use cases

  • It is useful in data cleaning and for table column similarity analysis between columns of the same data type.
  • After calculating the column similarity across all data types, clustering such as K-Means can be applied to group similar fields and propose mappings for fields within the same cluster.

 

Real Vector data type support

The following PAL functions have been enhanced to support columns of type real vector:

  • Spectral Clustering
  • Cluster Assignment
  • Decision tree
  • Sampling


 

Creating Vector Embeddings enhancements

The SAP HANA Database Vector Engine function VECTOR_EMBEDDING() has added support for remote embedding models exposed via SAP AI Core. Detailed instructions are given in the documentation at Creating Text Embeddings with SAP AI Core | SAP Help Portal

 

Python ML client (hana-ml) enhancements

The full list of new methods and enhancements in hana-ml 2.26 is summarized in the changelog for hana-ml 2.26 as part of the documentation. The key enhancements in this release include:

New Functions

  • Added text tokenization API.
  • Added explainability support for Isolation Forest outlier detection.
  • Added constrained clustering API.
  • Added intermittent time series data test in the time series report.

Enhancements

  • Added support for time series SHAP visualizations for AutoML time series model explanations.

You can find an example notebook illustrating the highlighted feature enhancements here: 25QRC03_2.26.ipynb