Module

Machine Learning Fundamentals


Beyond K-Means

25 min Advanced

DBSCAN — Density-Based Clustering

Problem with K-Means

K-Means assumes:

Clusters are roughly spherical
You know K (number of clusters) in advance
All clusters have similar sizes

Cases where these assumptions fail:

Crescent-shaped (non-convex) clusters
Clusters of very different sizes
Outliers, which should not belong to any cluster
Data where the number of clusters is unknown

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

Core Idea

Two points are in the same cluster if they're close together and surrounded by other close points.

How it works:

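ε (epsilon): How close is "close"? (distance threshold)
minPts: Minimum number of points within ε (commonly counting the point itself) for a point to be a core point
For each point:
- If at least minPts points lie within ε → Core point
- If not a core point, but within ε of a core point → Border point
- Otherwise → Noise/Outlier
Core points within ε of each other belong to the same cluster, and each border point joins the cluster of a core point it is close to.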
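The steps above can be sketched in plain Python. This is a minimal, unoptimized illustration rather than a reference implementation: the names `region_query`, `dbscan`, `NOISE`, and `UNVISITED` are chosen here for illustration, the distance is assumed to be Euclidean, and a point is assumed to count toward its own neighborhood.

```python
import math

NOISE = -1        # label for points that belong to no cluster
UNVISITED = None  # label for points not yet processed


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (i itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE otherwise."""
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may still become a border point later
            continue
        cluster += 1           # i is a core point: start a new cluster
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: near a core point, not core itself
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also core, so keep expanding
    return labels


# Two small squares, one border point at (2, 0), and one isolated point at (5, 5).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0),
          (10, 10), (10, 11), (11, 10), (11, 11), (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=4)
print(labels)  # [0, 0, 0, 0, 0, 1, 1, 1, 1, -1]
```

In this toy data, (2, 0) has only three points within ε, so it is not a core point, but it sits within ε of the core point (1, 0) and is therefore attached to the first cluster as a border point. The point (5, 5) has no neighbors and stays labeled as noise.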

Parameters:

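ε: Too small = most points become noise. Too large = clusters merge into one big cluster. Use a k-distance graph to pick a starting value.
minPts: Often 2×dimensions, or 4 for 2D data. Larger values make clusters more robust to noise but can turn small clusters into noise.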
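To see why ε matters so much, it can help to count how many points fall inside each point's ε-neighborhood. The sketch below uses a hypothetical helper `neighbor_counts` and a made-up toy dataset of two well-separated groups. It assumes Euclidean distance and counts each point as its own neighbor.

```python
import math


def neighbor_counts(points, eps):
    """Number of points within eps of each point (the point itself included)."""
    return [sum(1 for q in points if math.dist(p, q) <= eps) for p in points]


# Two tight groups of four points, far apart from each other.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]

for eps in (0.5, 1.5, 20.0):
    print(eps, neighbor_counts(points, eps))
```

With minPts = 4, the smallest ε leaves every point with a neighborhood of size 1, so nothing is a core point and everything is noise. The middle value gives each point exactly its own group as neighbors. The largest value puts the whole dataset inside every neighborhood, so both groups would collapse into one cluster.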

Advantages over K-Means

Pros:

✓ No need to specify K (discovers automatically)
✓ Finds arbitrary-shaped clusters
✓ Detects outliers (noise points)
✓ Robust to outliers (noise points do not distort clusters, since there are no centroids to pull)

Cons:

✗ Sensitive to ε and minPts (must tune)
✗ Slower than K-Means (O(n²) in the worst case; a spatial index can bring it closer to O(n log n) on low-dimensional data)
✗ Struggles with varying cluster densities
✗ High-dimensional curse (distances become less meaningful)
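The last point can be illustrated with a small experiment. The sketch below is only a rough demonstration under stated assumptions: points are drawn uniformly from the unit hypercube, the helper name `relative_contrast` is made up for this example, and the exact numbers depend on the random seed.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [math.dist(query, [rng.random() for _ in range(dim)])
             for _ in range(n_points)]
    return (max(dists) - min(dists)) / min(dists)


for dim in (2, 20, 500):
    print(dim, round(relative_contrast(dim), 3))
```

As the dimension grows, the nearest and farthest points end up at similar distances, so the contrast shrinks. A single threshold ε then has a hard time separating "close" from "far", which is what hurts DBSCAN on high-dimensional data.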

Choosing ε

Use the k-distance graph:

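For each point, calculate the distance to its kth nearest neighbor (k is usually tied to minPts)
Sort the distances in ascending order
Plot them as a line graph
Find the "elbow" where the distances start to increase sharply
Use the distance at the elbow as a starting value for ε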
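The procedure above can be sketched without a plotting library. In this example the function names are illustrative, k is assumed to be minPts − 1 (the point itself is excluded from its own neighbors), and the "largest jump" rule is a crude stand-in for reading the elbow off the plot by eye, not part of DBSCAN itself.

```python
import math


def k_distances(points, k):
    """Sorted distances from each point to its k-th nearest neighbor (self excluded)."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)


def elbow_by_largest_jump(sorted_dists):
    """Crude elbow heuristic (an assumption): the value just before the biggest increase."""
    jumps = [b - a for a, b in zip(sorted_dists, sorted_dists[1:])]
    return sorted_dists[jumps.index(max(jumps))]


# Two small squares plus one isolated point at (5, 5).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11), (5, 5)]

kd = k_distances(points, k=3)  # k = minPts - 1 for minPts = 4
eps = elbow_by_largest_jump(kd)
print([round(d, 3) for d in kd])
print("suggested eps:", round(eps, 3))
```

On this toy data the eight clustered points all share the same 3rd-neighbor distance (the diagonal of a unit square), while the isolated point sits much farther out, so the curve is flat and then jumps. On real data the curve is smoother and the elbow is usually judged visually.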

DBSCAN vs K-Means

Aspect	K-Means	DBSCAN
Cluster shape	Spherical	Any shape
Outliers	Forced into cluster	Labeled as noise
K required?	Yes	No
Scalability	Fast (roughly linear in n per iteration)	Slower (up to O(n²))
Parameter tuning	Simple (just K)	Complex (ε, minPts)
Use case	Balanced, round clusters	Arbitrary shapes, unknown K
