Understanding K-Means and GMMs: A Deep Dive into EM Algorithms
Chapter 1: Introduction to Unsupervised Learning
Unsupervised machine learning involves training algorithms to group data points without any prior knowledge of the true labels. While this approach can be more complex than supervised learning, it offers a unique and enjoyable experience. However, it also presents challenges, particularly in assessing algorithm performance, since there is no ground truth to compare against. In this article, we will explore the mechanics of K-Means and Gaussian Mixture Models (GMMs), along with an intriguing evaluation metric known as completeness.
Section 1.1: The Mechanics of K-Means and GMMs
K-Means and GMMs are iterative algorithms that utilize the Expectation-Maximization (EM) technique. This method performs maximum likelihood estimation, optimizing the likelihood function of a distribution when latent variables are present. The process consists of two primary steps: the expectation step, where the algorithm estimates the parameters for the latent variables, and the maximization step, which refines those estimates. These steps continue until the algorithm converges.
K-Means employs Euclidean distance to assess similarity among data points. It begins by randomly selecting k centroids, with k being a hyperparameter. The algorithm then strives to minimize the distances within clusters while maximizing the separation between them. The loss function can be expressed as follows:

J = Σᵢ Σⱼ ρᵢ[j] · ‖xᵢ − μ[j]‖²

Here, μ[j] denotes the centroid of cluster j, and ρᵢ[j] is a boolean indicator variable equal to 1 if data point i belongs to cluster j and 0 otherwise. The main goal of K-Means is to minimize the sum of the squared distances from each data point to its nearest centroid, optimizing this loss function through the EM algorithm. Although the method is guaranteed to converge, it may not always reach a global optimum.
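The two EM steps described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation (the function name and initialization scheme are my own choices, not taken from the article): the E-step assigns each point to its nearest centroid, and the M-step recomputes each centroid as the mean of its assigned points.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-Means sketch using the EM-style alternation
    described in the text: assign, then update, until converged."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by sampling k distinct data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # E-step: squared Euclidean distance from every point to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # M-step: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if (labels == j).any() else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: assignments will no longer change
        centroids = new_centroids
    return labels, centroids
```

Note that the result depends on the random initialization, which is exactly why K-Means can converge to a local rather than global optimum; libraries typically rerun the algorithm from several random starts and keep the best solution.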
In contrast, a Gaussian Mixture Model assumes the data is generated by a mixture of k components, each following a Gaussian distribution. The objective of a GMM is to estimate the parameters that best fit this mixture: the mean (μ), which locates the center of each component; the covariance matrix, which describes its spread; and the mixing probability, which indicates the relative weight of each Gaussian in the mixture.
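To make the three parameter sets concrete, here is a short sketch assuming scikit-learn is available (the article does not name a specific library, and the synthetic two-blob data is my own stand-in):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data: two Gaussian blobs of different sizes.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(200, 2)),
    rng.normal(loc=8.0, scale=1.5, size=(100, 2)),
])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.means_)        # the means (mu): centers of the two components
print(gmm.covariances_)  # covariance matrices: the spread of each component
print(gmm.weights_)      # mixing probabilities: relative weights, summing to 1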
While K-Means operates under hard clustering—where each data point belongs entirely to a single cluster—GMMs employ soft clustering, assigning probabilities to each data point for belonging to various clusters. Additionally, K-Means performs well primarily with spherical clusters, whereas GMMs can effectively identify ellipsoidal clusters, often demonstrating superior performance.
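The hard-versus-soft distinction is easy to see side by side. As a sketch assuming scikit-learn (again my choice of library, not the article's), K-Means returns a single label per point while a GMM returns a probability per cluster:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(6, 1, (100, 2))])

# Hard clustering: each point gets exactly one cluster label.
hard = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Soft clustering: each point gets a probability for every cluster,
# and the probabilities in each row sum to 1.
soft = GaussianMixture(n_components=2, random_state=0).fit(X).predict_proba(X)

print(hard[:3])           # integer labels, e.g. 0 or 1
print(soft[:3].round(3))  # rows of per-cluster probabilities
```

Points near a cluster center get probabilities near 1, while points between clusters receive split probabilities, which is precisely the information hard clustering throws away.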
Section 1.2: Evaluating Clustering with the Completeness Score
It's essential to differentiate the completeness score from accuracy since unsupervised algorithms lack knowledge of ground truth labels. Instead, these algorithms cluster data points without understanding the true nature of those clusters. Completeness is achieved when all data points within a given class are grouped into the same cluster, independent of the absolute label values.
Formally, completeness can be defined as follows:

completeness = 1 − H(Ypred | Ytrue) / H(Ypred)

In this equation, H represents the entropy function, Ypred is the predicted cluster label, and Ytrue is the actual class label. This score, rooted in information theory, ranges from 0 to 1, with a score of 1 indicating that all members of each class are clustered together.
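A small example with scikit-learn's implementation makes the "independent of absolute label values" point concrete (the toy label arrays are my own illustration):

```python
from sklearn.metrics import completeness_score

y_true = [0, 0, 1, 1]

# Swapped label values still score 1.0: every member of each true
# class lands in a single cluster, and the names don't matter.
print(completeness_score(y_true, [1, 1, 0, 0]))

# Splitting class 0 across two clusters lowers the score.
print(completeness_score(y_true, [0, 1, 1, 1]))
```

Note that completeness alone can be gamed: assigning every point to one giant cluster also scores 1.0, so it is usually paired with its counterpart, homogeneity, or their harmonic mean, the V-measure.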
Chapter 2: Conclusion
This discussion is part of my research on leveraging unsupervised machine learning to automate the analysis of large Whole Slide Images. I encourage you to explore my findings further! A highly recommended resource that significantly aided my understanding of these topics is "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow." Feel free to purchase the book through my affiliate link, as it has greatly contributed to my knowledge and career advancement.
If you're interested in staying updated with the latest AI and machine learning research, tutorials, and reviews, subscribe to my newsletter here:
The first video, "StatQuest: K-means clustering," delves into the foundational concepts of K-means clustering, offering insights into its mechanics and applications.
The second video, "What is KMeans Clustering? - A Quick Introduction to the Machine Learning Method," provides a concise overview of K-means clustering and its relevance in machine learning.