Clustering is a fundamental technique in data analysis and machine learning, widely used for identifying and grouping similar objects within a dataset based on inherent characteristics or patterns. It involves partitioning data points into distinct subsets known as clusters, where data points within the same cluster are more similar to each other than to those in other clusters. This unsupervised learning approach facilitates data exploration, pattern recognition, and information retrieval across various fields, from market segmentation to image analysis. However, traditional clustering methods operate without prior knowledge about the data, often leading to challenges in achieving meaningful or contextually relevant groupings. This limitation paves the way for constrained clustering, an advanced form of clustering that incorporates domain-specific constraints or prior information to guide the clustering process, ensuring more accurate and meaningful results tailored to specific analytical needs.
We are excited to announce the integration of a newly developed constrained clustering algorithm into SAP HANA Predictive Analysis Library (PAL). In this blog post, we will explore the fundamentals of constrained clustering and demonstrate its application in document clustering.
As data complexity increases, traditional clustering algorithms such as k-means and hierarchical clustering often fall short in capturing the intrinsic structure of the data. Deep clustering addresses these limitations by combining deep learning and clustering techniques to provide powerful representations and facilitate more accurate clustering. This approach serves as an excellent foundation for developing constrained clustering algorithms, which incorporate domain-specific constraints into the clustering process.
Deep Embedded Clustering (DEC) is a framework that leverages deep neural networks to learn low-dimensional representations of data, which are subsequently used for clustering. The key idea behind DEC is to use a deep autoencoder to obtain meaningful embeddings and then refine these embeddings through clustering-oriented learning. The components of DEC include:
- Autoencoder pretraining: a deep autoencoder is trained to minimize reconstruction error, yielding low-dimensional embeddings of the input data.
- Soft cluster assignment: each embedded point is softly assigned to cluster centroids using a Student's t-distribution kernel.
- Self-training refinement: a sharpened target distribution emphasizes high-confidence assignments, and the encoder and centroids are jointly updated to match it.
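For reference, the original DEC formulation defines the soft assignment $q_{ij}$ of embedding $z_i$ to centroid $\mu_j$ and the sharpened target distribution $p_{ij}$ as

$$q_{ij} = \frac{(1 + \lVert z_i - \mu_j \rVert^2)^{-1}}{\sum_{j'} (1 + \lVert z_i - \mu_{j'} \rVert^2)^{-1}}, \qquad p_{ij} = \frac{q_{ij}^2 / \sum_i q_{ij}}{\sum_{j'} \big( q_{ij'}^2 / \sum_i q_{ij'} \big)},$$

and trains by minimizing the clustering loss $L = \mathrm{KL}(P \,\Vert\, Q) = \sum_i \sum_j p_{ij} \log (p_{ij} / q_{ij})$. These equations describe the original DEC paper's objective; the PAL implementation may differ in details.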
Constrained clustering is an extension of traditional clustering paradigms that incorporates domain-specific knowledge into the clustering process through constraints. This approach is particularly beneficial in scenarios where certain data points are known to belong together or where certain separations should be maintained. By building on the DEC framework, constrained clustering can leverage deep representations while respecting the additional constraints. To integrate constraints into DEC, the important modifications include:
- Augmenting the clustering objective with constraint-specific loss terms: penalties that pull must-link pairs together and push cannot-link pairs apart, or a margin-based loss for triplet constraints.
- Sampling batches of constraints alongside batches of data during training, so that the learned embedding is shaped jointly by the clustering objective and the supplied constraints.
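As a rough sketch only (this post does not spell out the exact PAL objective), a combined loss for the pairwise case might take the form

$$L = \mathrm{KL}(P \,\Vert\, Q) + \lambda_{ML} \sum_{(a,b) \in ML} \lVert z_a - z_b \rVert^2 + \lambda_{CL} \sum_{(a,b) \in CL} \max\big(0,\, m - \lVert z_a - z_b \rVert\big)^2,$$

where $\lambda_{ML}$ and $\lambda_{CL}$ correspond to the ML_PENALTY and CL_PENALTY parameters used later, and $m$ is a separation margin (an illustrative detail, not a documented PAL parameter). For triplet constraints, a standard triplet loss with margin $\theta$ (the THETA parameter) over anchor, positive, and negative embeddings is the natural counterpart:

$$L_{triplet} = \sum_{(a,p,n)} \max\big(0,\, \lVert z_a - z_p \rVert^2 - \lVert z_a - z_n \rVert^2 + \theta\big).$$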
In the following section, we will explore the application of constrained clustering in the context of document clustering.
Document clustering is a crucial process in natural language processing (NLP) and information retrieval. It involves grouping text documents into clusters, where documents in the same cluster share similar topics or content. The technique has broad applications in organizing, filtering, and searching information, making it easier to navigate large text archives. For this demonstration, we use the 20 Newsgroups dataset.
The 20 Newsgroups dataset is a benchmark dataset used extensively for text classification and clustering tasks. It comprises approximately 20,000 newsgroup documents, partitioned across 20 different newsgroups, each corresponding to a different topic. This dataset serves as a useful test bed for document clustering due to its diversity in topics and ease of accessibility. These topics range broadly from computer science themes such as "comp.graphics" to recreational or opinion-based themes like "rec.sport.baseball."
Before applying clustering algorithms, documents must be preprocessed to transform the text into a format suitable for analysis. This includes the following steps (a short code sketch follows the list):
- Tokenization: Breaking down text into individual words or tokens.
- Stopwords Removal: Excluding common words that do not contribute to topic differentiation (e.g., "the," "is," "and").
- Stemming/Lemmatization: Reducing words to their base or root form (e.g., "running" to "run").
- Vectorization: Converting text into numerical vectors using techniques like TF-IDF or embeddings. Interested readers can find more information on text embeddings in the topics linked at the bottom of the page.
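As a minimal illustration of these steps outside of SAP HANA, the sketch below loads a four-topic subset of 20 Newsgroups and produces TF-IDF vectors with scikit-learn. The category subset and feature count are illustrative assumptions on our part, not choices documented in this post.

# Minimal preprocessing sketch using scikit-learn (not the PAL pipeline itself).
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative four-topic subset, matching the GROUP_NUMBER = 4 used later.
categories = ["comp.graphics", "rec.sport.baseball", "sci.med", "talk.politics.misc"]
newsgroups = fetch_20newsgroups(subset="all", categories=categories,
                                remove=("headers", "footers", "quotes"))

# TfidfVectorizer performs tokenization, stopword removal, and TF-IDF weighting;
# stemming/lemmatization would need an extra library such as NLTK or spaCy.
vectorizer = TfidfVectorizer(stop_words="english", max_features=2000)
X = vectorizer.fit_transform(newsgroups.data)  # sparse matrix: documents x terms
y = newsgroups.target  # topic labels, used below only to derive constraints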
Once the document data has been vectorized, we can proceed to prepare the constraint information. Since the 20 Newsgroups dataset includes target (label) information, we employ a straightforward strategy to generate constraints automatically.
We randomly select pairs of instances to generate pairwise constraints. To ensure the constraints are transitive, we compute the transitive closure of all must-linked instances and then derive the additional cannot-link constraints entailed by it. The resulting constraints are written to the constraints table: the first column specifies the type of constraint, with '1' representing a must-link and '-1' a cannot-link, and the second and third columns contain the IDs of the instance pair. Here is an example of a pairwise constraint table for illustration (a generation sketch follows the table):
| CONSTRAINT_TYPE | ID1 | ID2 |
|---|---|---|
| 1 | 0 | 1 |
| -1 | 1 | 2 |
| -1 | 2 | 3 |
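The following Python sketch shows one way to generate such pairwise constraints from the known targets. The helper generate_pairwise_constraints is hypothetical and written here for illustration; it is not part of PAL.

import random
from collections import defaultdict

def generate_pairwise_constraints(labels, n_pairs, seed=1):
    # Hypothetical helper: sample instance pairs, mark each as must-link (1) or
    # cannot-link (-1) using the known labels, then enforce transitivity.
    rng = random.Random(seed)
    n = len(labels)
    parent = list(range(n))

    def find(x):  # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    ml, cl = set(), set()
    for _ in range(n_pairs):
        i, j = rng.sample(range(n), 2)
        if labels[i] == labels[j]:
            ml.add((min(i, j), max(i, j)))
            parent[find(i)] = find(j)  # merge must-link components
        else:
            cl.add((min(i, j), max(i, j)))

    # Transitive closure: every pair inside a must-link component is must-linked.
    comps = defaultdict(list)
    for x in range(n):
        comps[find(x)].append(x)
    for members in comps.values():
        ml.update((members[a], members[b])
                  for a in range(len(members)) for b in range(a + 1, len(members)))

    # Entailed cannot-links: if (i, j) cannot link, every must-link partner of i
    # cannot link with every must-link partner of j.
    entailed_cl = {(min(a, b), max(a, b))
                   for i, j in cl
                   for a in comps[find(i)] for b in comps[find(j)]}

    return [(1, i, j) for i, j in sorted(ml)] + [(-1, i, j) for i, j in sorted(entailed_cl)]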
Triplet constraints indicate that instance i is more similar to instance j than to instance k. To emulate human input on these triplet constraints, we randomly choose instances to serve as anchors (i). For each anchor, we then randomly select two instances (j and k) and order them by their similarity to the anchor, measured by the Euclidean distance between the preprocessed embeddings. The following is an example of generated triplet constraints specifying the anchor, positive, and negative instances (a generation sketch follows the table):
| ANCHOR | POSITIVE | NEGATIVE |
|---|---|---|
| 1 | 2 | 3 |
| 2 | 4 | 5 |
| 3 | 6 | 7 |
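A matching sketch for triplet generation is shown below. Again, generate_triplet_constraints is a hypothetical helper, and embeddings is assumed to be a dense NumPy array (for example, X.toarray() from the TF-IDF step above, or a reduced embedding of it).

import numpy as np

def generate_triplet_constraints(embeddings, n_triplets, seed=1):
    # Hypothetical helper: for each random anchor, rank two random candidates
    # by Euclidean distance; the closer becomes positive, the farther negative.
    rng = np.random.default_rng(seed)
    n = len(embeddings)
    triplets = []
    for _ in range(n_triplets):
        a, j, k = rng.choice(n, size=3, replace=False)
        d_j = np.linalg.norm(embeddings[a] - embeddings[j])
        d_k = np.linalg.norm(embeddings[a] - embeddings[k])
        pos, neg = (j, k) if d_j < d_k else (k, j)
        triplets.append((int(a), int(pos), int(neg)))
    return triplets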
Once the input data and constraints are prepared, using constrained clustering is as straightforward as calling any other algorithm in PAL. Below is an annotated code example for your reference.
--########## COLUMN TABLE CREATION ##########
CREATE LOCAL TEMPORARY COLUMN TABLE #PAL_PARAMETER_TBL ("PARAM_NAME" NVARCHAR (256), "INT_VALUE" INTEGER, "DOUBLE_VALUE" DOUBLE, "STRING_VALUE" NVARCHAR (1000));
CREATE COLUMN TABLE PAL_DCC_PREMODEL_TBL ("ROW_INDEX" INTEGER, "MODEL_CONTENT" NVARCHAR (5000));
CREATE COLUMN TABLE PAL_DCC_RESULT_TBL ("ID" INTEGER, "CLUSTER_ID" INTEGER, "CONFIDENCE" DOUBLE);
CREATE COLUMN TABLE PAL_DCC_MODEL_TBL ("ROW_INDEX" INTEGER, "MODEL_CONTENT" NVARCHAR (5000));
CREATE COLUMN TABLE PAL_DCC_LOG_TBL ("EPOCH" INTEGER, "BATCH" NVARCHAR (256), "VALUE" DOUBLE);
CREATE COLUMN TABLE PAL_DCC_STATISTICS_TBL ("STAT_NAME" NVARCHAR (256), "STAT_VALUE" NVARCHAR (1000), "STAT_VECTOR" REAL_VECTOR);
--########## TABLE INSERTS ##########
-- Assume the data is inserted in the PAL_DCC_DATA_TBL and the constraints are in the PAL_DCC_CONSTRAINTS_TBL table.
--########## PAL_PARAMETER_TBL DATA INSERTION ##########
-- mandatory, number of groups
INSERT INTO #PAL_PARAMETER_TBL VALUES ('GROUP_NUMBER', 4, NULL, NULL);
-- mandatory, type of constraints
INSERT INTO #PAL_PARAMETER_TBL VALUES ('CONSTRAINT_TYPE', 0, NULL, NULL);
-- hidden layer sizes of encoder
INSERT INTO #PAL_PARAMETER_TBL VALUES ('ENCODER_HIDDEN_DIMS', NULL, NULL, '50, 200');
-- dimension of latent space
INSERT INTO #PAL_PARAMETER_TBL VALUES ('EMBEDDING_DIM', 3, NULL, NULL);
-- whether to use normalization
INSERT INTO #PAL_PARAMETER_TBL VALUES ('NORMALIZATION', 1, NULL, NULL);
-- seed for random number generator
INSERT INTO #PAL_PARAMETER_TBL VALUES ('SEED', 1, NULL, NULL);
-- learning rate of pretraining
INSERT INTO #PAL_PARAMETER_TBL VALUES ('PRETRAIN_LEARNING_RATE', NULL, 0.01, NULL);
-- number of pretraining epochs
INSERT INTO #PAL_PARAMETER_TBL VALUES ('PRETRAIN_EPOCHS', 200, NULL, NULL);
-- number of training samples in a batch
INSERT INTO #PAL_PARAMETER_TBL VALUES ('PRETRAIN_BATCH_SIZE', 16, NULL, NULL);
-- ratio of total number of threads that can be used
INSERT INTO #PAL_PARAMETER_TBL VALUES ('THREAD_RATIO', NULL, 1.0, NULL);
-- degree of distorting latent space
INSERT INTO #PAL_PARAMETER_TBL VALUES ('GAMMA', NULL, 0.1, NULL);
-- pairwise, penalty for must-link constraints
INSERT INTO #PAL_PARAMETER_TBL VALUES ('ML_PENALTY', NULL, 0.1, NULL);
-- pairwise, penalty for cannot-link constraints
INSERT INTO #PAL_PARAMETER_TBL VALUES ('CL_PENALTY', NULL, 1.0, NULL);
-- triplet, margin in triplet loss
INSERT INTO #PAL_PARAMETER_TBL VALUES ('THETA', NULL, 0.1, NULL);
-- learning rate
INSERT INTO #PAL_PARAMETER_TBL VALUES ('LEARNING_RATE', NULL, 0.01, NULL);
-- maximum number of training epochs
INSERT INTO #PAL_PARAMETER_TBL VALUES ('MAX_EPOCHS', 100, NULL, NULL);
-- number of training samples in a batch
INSERT INTO #PAL_PARAMETER_TBL VALUES ('BATCH_SIZE', 16, NULL, NULL);
-- frequency of updating target distribution
INSERT INTO #PAL_PARAMETER_TBL VALUES ('UPDATE_INTERVAL', 1, NULL, NULL);
-- pairwise, number of must-link constraints in a batch
INSERT INTO #PAL_PARAMETER_TBL VALUES ('ML_BATCH_SIZE', 16, NULL, NULL);
-- pairwise, number of cannot-link constraints in a batch
INSERT INTO #PAL_PARAMETER_TBL VALUES ('CL_BATCH_SIZE', 16, NULL, NULL);
-- triplet, number of triplet constraints in a batch
INSERT INTO #PAL_PARAMETER_TBL VALUES ('TRIPLET_BATCH_SIZE', 16, NULL, NULL);
-- pairwise, frequency of training with must-link constraints
INSERT INTO #PAL_PARAMETER_TBL VALUES ('ML_UPDATE_INTERVAL', 1, NULL, NULL);
-- pairwise, frequency of training with cannot-link constraints
INSERT INTO #PAL_PARAMETER_TBL VALUES ('CL_UPDATE_INTERVAL', 1, NULL, NULL);
-- triplet, frequency of training with triplet constraints
INSERT INTO #PAL_PARAMETER_TBL VALUES ('TRIPLET_UPDATE_INTERVAL', 1, NULL, NULL);
-- stopping threshold
INSERT INTO #PAL_PARAMETER_TBL VALUES ('TOLERANCE', NULL, 0.01, NULL);
-- verbosity of log
INSERT INTO #PAL_PARAMETER_TBL VALUES ('VERBOSE', 0, NULL, NULL);
--########## PAL CONSTRAINED CLUSTERING CALL ##########
DO BEGIN
lt_data = SELECT * FROM PAL_DCC_DATA_TBL;
lt_constraints = SELECT * FROM PAL_DCC_CONSTRAINTS_TBL;
lt_param = SELECT * FROM #PAL_PARAMETER_TBL;
lt_premodel = SELECT * FROM PAL_DCC_PREMODEL_TBL;
CALL _SYS_AFL.PAL_CONSTRAINED_CLUSTERING(:lt_data, :lt_constraints, :lt_param, :lt_premodel, lt_result, lt_model, lt_log, lt_statistics);
INSERT INTO PAL_DCC_RESULT_TBL SELECT * FROM :lt_result;
INSERT INTO PAL_DCC_MODEL_TBL SELECT * FROM :lt_model;
INSERT INTO PAL_DCC_LOG_TBL SELECT * FROM :lt_log;
INSERT INTO PAL_DCC_STATISTICS_TBL SELECT * FROM :lt_statistics;
END;
--########## VIEW OUTPUT TABLES ##########
SELECT * FROM PAL_DCC_RESULT_TBL;
SELECT * FROM PAL_DCC_MODEL_TBL;
SELECT * FROM PAL_DCC_LOG_TBL;
SELECT * FROM PAL_DCC_STATISTICS_TBL;
You can view the clustering results, model, and training log in the output tables. Below are illustrative snapshots of the output tables.
The result table presents the assigned cluster ID for each instance, along with a CONFIDENCE column that gives the soft assignment probability.
The model table documents crucial training information and the weights for the trained neural networks.
The entire training process can be monitored through the log table, which contains loss values for each training iteration. Additionally, the verbosity of the log can be adjusted using the VERBOSE parameter.
Deep clustering provides a robust framework that pushes the boundaries of unsupervised learning through the integration of deep learning. By extending DEC to incorporate constraints, we show a variant that combines the strengths of deep representations with domain-specific knowledge, enhancing clustering applicability in complex real-world problems.
Recent topics on HANA machine learning:
Hybrid Prediction with Tabular and Text Inputs using Hybrid Gradient Boosting Trees
Dimensionality Reduction of Text Embeddings for Hybrid Prediction Data
Text Embedding Service in SAP HANA Cloud Predictive Analysis Library (PAL)
Document Clustering using KMeans and Text Embeddings
New Text Analysis in SAP HANA Cloud Predictive Analysis Library (PAL)
New Information Retrieval Techniques in SAP HANA Cloud using BM25 and ANNS for Advanced Text Mining
Text Chunking – An Exciting New NLP Function in SAP HANA Cloud