Gaussian Mixture Models (GMM)
Soft Clustering with GMM
Hard vs Soft Clustering
Hard Clustering (K-Means, DBSCAN)
Each point belongs to exactly ONE cluster.
Point: [5.1, 3.5] → Cluster 0 (100%)
Soft Clustering (GMM)
Each point has a PROBABILITY of belonging to each cluster.
Point: [5.1, 3.5] → 70% Cluster 0, 30% Cluster 1
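A soft assignment can always be collapsed into a hard one by taking the most probable cluster. The sketch below illustrates this with the 70%/30% example above; the helper name `hard_label` is made up for illustration.

```python
def hard_label(probabilities):
    """Return the index of the most probable cluster (a hard assignment)."""
    return max(range(len(probabilities)), key=lambda k: probabilities[k])


# Soft assignment from the example above: 70% Cluster 0, 30% Cluster 1.
soft = [0.7, 0.3]
label = hard_label(soft)

# The soft output keeps the confidence; the hard label alone discards it.
print(f"Soft: {soft} -> hard label: Cluster {label} (confidence {soft[label]:.0%})")
```

Going the other way is not possible: a hard label alone does not tell you how close the call was.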
The Model
Assume each cluster is a Gaussian distribution (bell curve):
- Cluster A: Mean=μ_A, Covariance=Σ_A
- Cluster B: Mean=μ_B, Covariance=Σ_B
- ...
Each data point is assumed to be sampled from exactly one of these Gaussians; we just don't observe which one!
Graphically:
Two overlapping bell curves.
A point near the overlap has a substantial probability of belonging to each cluster (for example, close to 50/50).
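The generative story above can be sketched in one dimension using only the standard library. The parameter values and the mixing weights (how often each cluster is picked) are assumptions made up for illustration; they are not taken from the text.

```python
import math
import random


def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def mixture_pdf(x, weights, means, stds):
    """Density of the mixture: weighted sum of the component densities."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds))


def sample_mixture(n, weights, means, stds, seed=0):
    """Draw n points: pick a component by weight, then sample from its Gaussian."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        k = rng.choices(range(len(weights)), weights=weights)[0]
        points.append((rng.gauss(means[k], stds[k]), k))
    return points


# Hypothetical parameters for two overlapping clusters A (index 0) and B (index 1).
weights, means, stds = [0.5, 0.5], [0.0, 3.0], [1.0, 1.0]
for x, k in sample_mixture(5, weights, means, stds):
    density = mixture_pdf(x, weights, means, stds)
    print(f"x = {x:6.2f} came from component {k}, mixture density {density:.3f}")
```

When fitting a GMM, only the x values are observed; the component index is the hidden quantity the model tries to recover.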
EM Algorithm (Expectation-Maximization)
- Initialization: Randomly place K Gaussians
- E-step (Expectation): For each point, calculate the probability that it belongs to each Gaussian
- M-step (Maximization): Update each Gaussian's parameters (μ, Σ, and its mixing weight) using those probabilities as weights
- Repeat until convergence
Why it works:
- E-step: "How likely is it that this point came from each cluster?"
- M-step: "Refit each cluster to all points, weighted by those probabilities"
- Iterate until stable
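The loop above can be sketched for one-dimensional data with the standard library alone. This is a minimal illustration, not a production implementation: the function name `em_gmm_1d` is made up, the starting means are passed in explicitly to keep the run reproducible (use `random.sample(data, k)` for the random placement described above), and the shared starting variance, equal starting weights, and variance floor are assumptions.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, init_means, n_iter=200, tol=1e-8):
    """Fit a 1-D Gaussian mixture with EM, starting from init_means."""
    n, k = len(data), len(init_means)
    means = list(init_means)
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    variances = [overall_var] * k   # assumption: shared starting variance
    weights = [1.0 / k] * k         # assumption: equal starting weights
    prev_ll = -math.inf
    for _ in range(n_iter):
        # E-step: probability of each point belonging to each Gaussian.
        resp, ll = [], 0.0
        for x in data:
            joint = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(joint)
            ll += math.log(total)
            resp.append([p / total for p in joint])
        # M-step: refit each Gaussian using the probabilities as weights.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor guards against a collapsing component
            weights[j] = nj / n
        # Stop once the log-likelihood no longer improves noticeably.
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    # resp holds the probabilities from the last E-step that was run.
    return weights, means, variances, resp


# Synthetic data: two clusters centred near 0 and 6 (values chosen for illustration).
rng = random.Random(42)
data = [rng.gauss(0.0, 1.0) for _ in range(150)] + [rng.gauss(6.0, 1.0) for _ in range(150)]
weights, means, variances, resp = em_gmm_1d(data, init_means=[1.0, 4.0])
print("weights:", [round(w, 2) for w in weights])
print("means:  ", [round(m, 2) for m in means])
```

Different starting means can lead to different final fits, which is why practical implementations often run EM several times and keep the fit with the highest log-likelihood.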
Advantages & Disadvantages
Pros:
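- Probabilistic (each assignment comes with a confidence)
- Can handle overlapping clusters
- More flexible than K-Means (clusters can be elliptical, with different sizes and orientations)
- Solid theoretical foundation (maximum-likelihood fit of a probabilistic model)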
Cons:
- Assumes each cluster is Gaussian-shaped (may not hold)
- Sensitive to the chosen number of components (K)
- Slower than K-Means
- Can get stuck in local optima, so results depend on initialization
Choosing Number of Clusters
Use AIC (Akaike Information Criterion) or BIC (Bayesian Information Criterion):
- Train GMM with K=1,2,3,... up to 10
- Calculate AIC/BIC for each
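- Lower is better: both scores reward a good fit and penalize extra parameters
- Pick the K with the lowest BIC (BIC penalizes extra components more heavily than AIC, so it tends to choose smaller models)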
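The selection procedure above can be sketched with the standard definitions AIC = 2p - 2 ln L and BIC = p ln(n) - 2 ln L, where p is the number of free parameters, n the number of samples, and ln L the fitted log-likelihood. The log-likelihood values below are invented placeholders, not real results, and the parameter count assumes full covariance matrices.

```python
import math


def gmm_param_count(k, d):
    """Free parameters of a K-component GMM in d dimensions with full covariances."""
    # (k - 1) mixing weights + k mean vectors + k symmetric covariance matrices
    return (k - 1) + k * d + k * d * (d + 1) // 2


def aic(log_likelihood, n_params):
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_samples):
    return n_params * math.log(n_samples) - 2 * log_likelihood


# Hypothetical log-likelihoods for K = 1..4 (placeholders for illustration only).
log_likelihoods = {1: -520.0, 2: -430.0, 3: -425.0, 4: -422.0}
n_samples, d = 200, 2

scores = {
    k: bic(ll, gmm_param_count(k, d), n_samples)
    for k, ll in log_likelihoods.items()
}
for k, score in scores.items():
    print(f"K={k}: BIC={score:.1f}")

best_k = min(scores, key=scores.get)
print("Pick K =", best_k)
```

With these placeholder numbers, going from K=2 to K=3 barely improves the fit, so the parameter penalty makes K=2 the lowest-BIC choice. In scikit-learn, a fitted `GaussianMixture` exposes `aic(X)` and `bic(X)` methods that compute these scores directly.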
GMM vs K-Means vs DBSCAN
| Aspect | K-Means | GMM | DBSCAN |
|---|---|---|---|
| Soft clusters? | No | Yes | No |
| Assumes shape | Spherical | Gaussian | Any |
| Speed | Fast | Medium | Slow |
| K needed? | Yes | Yes | No |
| Interpretability | High | Medium | Low |
| Output | Labels | Probabilities | Labels+Noise |