A cluster model groups data so that objects in the same group have similar characteristics (so they are “homogeneous”) compared to those in other groups (which are “heterogeneous”). According to the
2017 KDnuggets poll, clustering is the second most popular machine learning method, and
k-means is undoubtedly one of the most popular algorithms used. It’s used in a wide variety of ways, for example to find groups of similar customers, to segment a market, to optimize a communication strategy and improve marketing effectiveness, to position products, or to select test markets.
Cluster models find groups of observations that are relatively close to one another, and therefore the concept of “distance” between the data points is important. Also, most k-means algorithms require the number of clusters, referred to as “k”, to be specified in advance, and this is one of the biggest drawbacks of the algorithm.
However, the algorithm is easy to understand. Basically, it’s an optimization problem: find the
k cluster centers and assign the objects to their nearest cluster center, such that the sum of the squared distance (measured from each object to its center) for the clusters is minimized. This results in the objects
in a cluster being as close as possible to its center, so their characteristics are therefore as similar as possible.
The algorithm is easy to follow in a simple 2-D representation, but in real-world situations you will be using many more dimensions. That means that instead of observations being described by a two-element vector (x
_{1}, x
_{2}), you will be working with observations described by an
n-element vector (x
_{1}, x
_{2}, …., x
_{n}), where each one of the elements are the different input variables that you choose to create the cluster groups.
You start with the observations in your data set. You will have prepared the data carefully, and the necessary data wrangling processes are described later in this blog.
You need to select how many clusters you want to find. Some implementations of k-means will build cluster models with a range of clusters that you specify and retain the model with the number of clusters considered to be the best. However, your first decision is to specify the number or range of clusters:
Step 1 – Assign one cluster center for each of the clusters.
Let’s say you decide to try a 3-cluster solution, so 3 cluster centers (
+) are selected at random observation locations:
Step 2: The algorithm calculates the distance between each observation to each cluster center (often using the Euclidean straight-line distance):
Step 3: Cluster “labels” are assigned so that each observation is assigned to its nearest cluster center:
Step 4: The algorithm calculates the mean value for every observation in each cluster (like a “group by” SQL statement for each cluster, then computing the mean). This mean value then becomes the new cluster center for each of the clusters:
Step 5: Repeat steps 2 through to 4 until the cluster membership no longer changes and the model has converged to the final solution:
As you can see, this technique requires a lot of trial and error (like when you are testing different
k values) and to find the optimal solution will involve some preliminary analysis, multiple runs, and evaluation of the different solutions.
Challenges of the K-Means Algorithm
So, this solution looks straight forward—so where are the challenges? What you are trying to achieve is a solution where observations in a cluster have a natural association, so they’re more similar to each other than to observations in another cluster. To do this, you are translating the concept of association into some sort of numerical measure of the degree of similarity.
The approach is to translate all dimensions into numeric values so that the observations are treated as points in space. If two points are close in a geometric sense, then you assume they’re similar.
There are two problems:
- Geometrically, the contributions of each dimension are of equal importance. However, the dimensions represent different characteristics with different categories and units of measurement, so a small change in one dimension may be much more important than a large change in another dimension. For example, if you have three dimensions, x_{1}, x_{2} and x_{3}, where x_{1} is measured in yards, x_{2} in centimetres and x_{3} in nautical miles, then a change of 1 in x_{3} is equivalent to a change of 185200 in x_{2} or 2025 in x_{1}. Therefore, these need to be converted to a common scale before distances make sense.
- Some variable types (like categorical variables), don’t have the right behaviour to be treated as components of a position vector.
Correctly Formatting Your Data
Therefore, one of the main challenges is to prepare the data so that it’s in the right format for the k-means algorithm to work correctly. There are a number of steps you should follow:
- Identify missing values and either fill them in or remove the records with the missing values.
- Identify outliers and either remove them or transform the value in some way so they are no longer influential.
- Identify any dimensions that have a skew and create transformations to adjust the data distribution so it has a normal distribution.
- If you have any categorical dimensions then you will need to create “dummy” (disjunctive) versions, where each category of each variable becomes a new numeric variable, with values of 0 or 1.
- Identify and remove any “redundant” dimensions. Redundant dimensions occur due to correlation between the explanatory dimensions (called “multi-collinearity”), and one approach is to analyse the correlation matrices, identify any pairs of dimensions with a high correlation, and then remove one of them so as to remove any redundancy from the model.
- Scale all the explanatory dimensions so that they are on the same scale and can be compared sensibly. You could use min-max normalization or z-scores for this.
There are many variations on all of these steps, resulting in the thousands of articles that have been written describing the different approaches. However, you’ll appreciate that with large numbers of dimensions there is a considerable amount of time and effort required to prepare the data correctly so that it can be used in this relatively simple algorithm.
The Auto Clustering Algorithm Approach
One approach I would recommend you check out is the auto clustering algorithm that is available in SAP Predictive Analytics. This uses a k-means, but with a slight twist. It can use a target variable (so it is a type of “supervised” clustering) and it has automated data encoding, so all of the data preparation steps are completed for you, saving you all that time and effort!