Module

Intro to Pandas

Detecting & Handling Outliers

17 min Advanced

Handling Outliers & Validation

What is an Outlier?

An outlier is an unusually extreme value that may:

Be a measurement or data-entry error (a typo: 500 instead of 50)
Represent a rare but real event (a fraudulent transaction)
Simply be natural variation (a person who is 7 feet tall)

Methods to Detect Outliers

1. Z-Score (Standardized Distance)

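Values more than 3 standard deviations from the mean (|z| > 3) are suspicious. This works best on roughly normal data, because extreme values also inflate the mean and standard deviation they are measured against.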

import numpy as np
from scipy import stats

# Absolute z-score: distance from the mean, in standard deviations
z_scores = np.abs(stats.zscore(df['salary']))
outliers = z_scores > 3  # True where outlier
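As a self-contained illustration of the z-score rule, here is a minimal sketch that uses plain pandas instead of scipy. The sample salaries and the column names z_score and is_outlier are made up for this example; ddof=0 is chosen to match the default of scipy.stats.zscore.

```python
import pandas as pd

# Hypothetical sample: 19 typical salaries plus one likely typo (500,000)
salaries = [48_000 + 500 * i for i in range(19)] + [500_000]
df = pd.DataFrame({"salary": salaries})

# Population standard deviation (ddof=0), the same default scipy.stats.zscore uses
mean = df["salary"].mean()
std = df["salary"].std(ddof=0)

# z-score: how many standard deviations each value lies from the mean
df["z_score"] = (df["salary"] - mean) / std
df["is_outlier"] = df["z_score"].abs() > 3

# Rows flagged for investigation
flagged = df[df["is_outlier"]]
print(flagged)
```

One caveat worth knowing: with ddof=0 the largest possible |z| in a sample of n values is sqrt(n - 1), so in a sample of 10 or fewer values nothing can ever exceed 3. The rule is only meaningful on reasonably large samples.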

2. IQR (Interquartile Range)

A value is flagged as an outlier if it lies more than 1.5 × IQR below the first quartile (Q1) or above the third quartile (Q3).

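Q1 = df['salary'].quantile(0.25)  # first quartile (25th percentile)
Q3 = df['salary'].quantile(0.75)  # third quartile (75th percentile)
IQR = Q3 - Q1
# True where the value is below the lower fence or above the upper fence
outliers = (df['salary'] < Q1 - 1.5*IQR) | (df['salary'] > Q3 + 1.5*IQR)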
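To make the IQR fences concrete, the sketch below wraps the same calculation in a small helper and runs it on made-up data. The function name iqr_bounds, its k parameter, and the sample salaries are assumptions for illustration, not part of pandas.

```python
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical sample: 19 typical salaries plus one extreme value
salaries = [48_000 + 500 * i for i in range(19)] + [500_000]
df = pd.DataFrame({"salary": salaries})

lower, upper = iqr_bounds(df["salary"])

# between() is inclusive, so values exactly on a fence are not flagged
df["is_outlier"] = ~df["salary"].between(lower, upper)

print(f"fences: {lower:.0f} to {upper:.0f}")
print(df[df["is_outlier"]])
```

Because quartiles barely move when a single extreme value is added, the fences stay close to the bulk of the data. That makes this rule more robust than the z-score on small or skewed samples.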

3. Isolation Forest (ML-based)

Well suited to multivariate outliers: rows whose combination of values is unusual, even when each value looks normal on its own.

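from sklearn.ensemble import IsolationForest

# random_state makes the flagged rows reproducible between runs
clf = IsolationForest(random_state=42)
# fit_predict returns 1 for inliers and -1 for outliers
predictions = clf.fit_predict(df[['age', 'salary', 'experience']])
outliers = predictions == -1  # True where outlier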
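The following is a rough, self-contained sketch of the same idea on synthetic data. The generated values, the injected row, and the choice of contamination=0.01 are all assumptions made for illustration; in practice the contamination level is a judgment call about how many rows you expect to be anomalous.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic data: 200 ordinary rows plus one deliberately odd row
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": np.append(rng.normal(40, 8, n), 22),
    "salary": np.append(rng.normal(60_000, 8_000, n), 250_000),
    "experience": np.append(rng.normal(15, 5, n), 30),
})

features = df[["age", "salary", "experience"]]

# contamination is the assumed share of outliers; random_state fixes the result
clf = IsolationForest(contamination=0.01, random_state=0)
predictions = clf.fit_predict(features)  # 1 = inlier, -1 = outlier

df["is_outlier"] = predictions == -1
# score_samples: the lower the score, the more anomalous the row
df["anomaly_score"] = clf.score_samples(features)

# Review the most anomalous rows first
print(df.sort_values("anomaly_score").head())
```

The injected row should appear at or near the top of the sorted output, but Isolation Forest is a randomized method, so treat its flags as candidates for review rather than as a verdict.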

Handling Outliers

Remove (if clearly errors)
Cap (replace values above a chosen threshold, such as the 95th percentile, with that threshold)
Transform (log scale to reduce impact)
Keep but flag (for investigation)

Rule: Never blindly remove outliers. Investigate first!
