1. Introduction
Document clustering divides a collection of text documents into separate clusters based on the content of each document. Since text is unstructured while classical clustering methods work only on structured data, a necessary first step for clustering text documents is to convert the unstructured text into a structured (i.e. tabular) representation. Traditionally, this conversion was mostly term-based (e.g. TF-IDF analysis), which typically represents documents as high-dimensional sparse vectors. Most document clustering tasks, however, are semantic in nature. Thanks to the recent progress of large language models (LLMs), today we can create more semantically meaningful representations of text documents using advanced text embedding models. Unlike term-based approaches such as TF-IDF analysis, text embedding models produce dense numerical vectors of a fixed dimensionality. It is natural to expect that using text embeddings leads to more semantically meaningful clustering results. In this blog post, we will demonstrate this assertion through some simple experiments.
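The contrast between the two kinds of representation can be illustrated with a small sketch. The snippet below uses scikit-learn's TF-IDF vectorizer on a hypothetical three-document toy corpus (not part of the experiments in this post) to show how a term-based representation grows with the vocabulary and stays sparse; an embedding model would instead map every document to a dense vector of a fixed, model-defined width.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A toy corpus for illustration only.
docs = [
    "the cat sat on the mat",
    "dogs are loyal companions",
    "stock markets rallied on earnings news",
]

# Term-based vectorization: one dimension per vocabulary term, so the
# matrix width grows with the corpus vocabulary and most entries are zero.
tfidf = TfidfVectorizer().fit_transform(docs)
print(tfidf.shape)  # (3, vocabulary_size)
print(tfidf.nnz)    # number of non-zero entries; far below 3 * vocabulary_size

# A text embedding model, in contrast, would map each document to a dense
# vector of a pre-fixed dimensionality (e.g. 768), independent of vocabulary.
```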
2. Datasets and Experiments
The datasets used for demonstration are subsampled from two well-known benchmark datasets: 20NewsGroup and AG News. Both experimental datasets come with ground-truth labels, i.e. they are already categorized. Thus, we can evaluate the congruence between the given category labels and the clustering labels using the v-measure score. For comparison, baseline v-measure scores on the chosen datasets have been computed using latent semantic analysis (LSA for short, a combination of TF-IDF vectorization and SVD) followed by KMeans clustering. In this blog post, we shall see whether we can do better with text embeddings. For text vectorization we utilize the text embedding model provided by SAP HANA Cloud (reference to text embedding blog and documentation).
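For readers who want to reproduce an LSA baseline of this kind, the pipeline can be sketched entirely in scikit-learn: TF-IDF vectorization, truncated SVD, KMeans, then v-measure. The tiny two-class corpus below is a stand-in assumption for illustration; the actual baselines in this post were computed on subsamples of 20NewsGroup and AG News.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score

# Toy stand-in corpus with two obvious topics (sports vs. law).
docs = [
    "the game ended with a late goal",
    "the striker scored twice",
    "the court ruled on the appeal",
    "the judge delivered the verdict",
]
true_labels = [0, 0, 1, 1]

# LSA: TF-IDF vectorization followed by truncated SVD.
tfidf = TfidfVectorizer().fit_transform(docs)
lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

# Cluster the LSA representation and score against the true labels.
pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(lsa)
score = v_measure_score(true_labels, pred)
print(round(score, 3))
```

On real corpora such as 20NewsGroup, the same pipeline (with a larger number of SVD components) yields the baseline scores reported in the table below.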
For each dataset, let N be the number of classes in its true labels. We run KMeans clustering with N clusters, random initialization, and a maximum of 100 iterations, repeated 10 times, and evaluate each clustering result using the v-measure score. The v-measure score measures the congruence between the data grouping induced by the given true labels and the grouping induced by the predicted cluster labels; the larger, the better. A perfect score of 1.0 is achieved only if the true labels and the cluster labels induce the same grouping of the data (irrespective of the order of the groups).
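The invariance to group order can be checked directly with scikit-learn's `v_measure_score` on a small made-up labeling:

```python
from sklearn.metrics import v_measure_score

true_labels = [0, 0, 1, 1, 2, 2]

# Same grouping with the cluster IDs permuted: still a perfect score.
perfect = v_measure_score(true_labels, [2, 2, 0, 0, 1, 1])
print(perfect)  # 1.0

# One point assigned to the wrong group: the score drops below 1.0.
imperfect = v_measure_score(true_labels, [0, 0, 1, 2, 2, 2])
print(imperfect)

# Everything in a single cluster: completeness is 1 but homogeneity is 0,
# so their harmonic mean (the v-measure) is 0.
degenerate = v_measure_score(true_labels, [0, 0, 0, 0, 0, 0])
print(degenerate)  # 0.0
```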
To facilitate the use of text embeddings for document clustering, the KMeans clustering algorithm in the SAP HANA Cloud Predictive Analysis Library (PAL) has been enhanced with native support for the REAL_VECTOR data type, which is exactly the type of the embedding vectors produced for unstructured inputs in the SAP HANA Cloud database.
Key statistics of the obtained v-measure scores are listed in the following table:
| Dataset | Average | SD | Min | Max | LSA baseline average |
|---|---|---|---|---|---|
| 20NewsGroup | 0.555247 | 0.000724 | 0.553948 | 0.556130 | ~0.44 |
| AG News | 0.647168 | 0.000558 | 0.645908 | 0.647914 | ~0.35 |

(The Average, SD, Min, and Max columns refer to the v-measure scores of KMeans clustering with text embedding vectors over the 10 repetitions.)
From the table above, we can see that for 20NewsGroup the (average) v-measure score improves from 0.44 to about 0.55, while for AG News it improves from 0.35 to approximately 0.65. Moreover, judging from the standard deviation, minimum, and maximum of the v-measure scores over the 10 repetitions, the clustering results for both datasets are very stable when text embedding vectors are used. This demonstrates the superiority of text embedding vectors over vectors produced by traditional term-based vectorization methods for text/document clustering.
3. Code Recipes
In this section, we show the Python code used to obtain the experimental results listed in the previous section. The code covers the AG News data only; the code for 20NewsGroup is nearly identical, with merely a change of dataset.
>>> import numpy as np
>>> from hana_ml.algorithms.pal.clustering import KMeans
>>> from sklearn.metrics import v_measure_score
>>> # Inspect the first few rows of the AG News data with embedding vectors
>>> agnews_full.head(3).collect()
>>> # Ground-truth class labels, used later for v-measure evaluation
>>> labels = agnews_full[['Class Index']].collect()['Class Index']
>>> rept = 10
>>> v_ms = []
>>> for i in range(rept):
...     km = KMeans(n_clusters=4,
...                 init='no_replace',
...                 normalization='no',
...                 max_iter=100)
...     # 'VECTOR_COL' holds the REAL_VECTOR text embeddings
...     agnews_clus = km.fit_predict(data=agnews_full[['ID', 'VECTOR_COL']],
...                                  key='ID')
...     # Align predicted cluster labels with true labels by sorting on 'ID'
...     agnews_clus_labels = agnews_clus.sort('ID')[['CLUSTER_ID']].collect()['CLUSTER_ID']
...     v_measure_ag = v_measure_score(labels_true=labels,
...                                    labels_pred=agnews_clus_labels)
...     v_ms.append(v_measure_ag)
>>> # Compute the mean, min, max and std of the v-measure score list v_ms
>>> np.mean(v_ms), np.min(v_ms), np.max(v_ms), np.std(v_ms)
4. Summary
In this blog post we have shown how to use text embedding vectors for document clustering. Numerical results show that the use of text embedding vectors can significantly enhance the performance of clustering algorithms compared to LSA, a typical traditional text vectorization technique.