Handling Outliers & Validation

Detecting & Handling Outliers

What is an Outlier?

An outlier is an unusually extreme value that may:

  • Be a measurement or data-entry error (e.g. 500 typed instead of 50)
  • Represent a rare but real event (a fraudulent transaction)
  • Simply be natural variation (a person 7 feet tall)

Methods to Detect Outliers

1. Z-Score (Standardized Distance)

Values more than 3 standard deviations from the mean (|z| > 3) are suspicious.

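import numpy as np
from scipy import stats

# Absolute z-score of each salary; nan_policy='omit' skips missing values
z_scores = np.abs(stats.zscore(df['salary'], nan_policy='omit'))
outliers = z_scores > 3  # True where outlier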
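
As a minimal sketch of the z-score rule on made-up data (the salary values below are invented for illustration), the following computes the score by hand with pandas so the formula is visible. Using `ddof=0` matches the default of `scipy.stats.zscore`.

```python
import pandas as pd

# Hypothetical salaries (in thousands); 500 is a typo for 50.
salaries = pd.Series([48, 52, 50, 47, 53, 49, 51, 50, 46, 54,
                      50, 52, 48, 49, 51, 47, 53, 50, 52, 500])

# z = (x - mean) / std, using the population standard deviation (ddof=0)
z_scores = (salaries - salaries.mean()) / salaries.std(ddof=0)
is_outlier = z_scores.abs() > 3

flagged = salaries[is_outlier].tolist()
print(flagged)
```

One caveat: an extreme value inflates the very mean and standard deviation it is measured against. With `ddof=0` the largest possible |z| in a sample of n values is sqrt(n - 1), so in a sample of 10 or fewer no value can exceed 3.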

2. IQR (Interquartile Range)

Outliers fall more than 1.5 × IQR below Q1 or above Q3.

Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
outliers = (df['salary'] < Q1 - 1.5*IQR) | (df['salary'] > Q3 + 1.5*IQR)
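
The same rule can be wrapped in a small helper. This is only a sketch: `iqr_bounds` and its `k` parameter are hypothetical names introduced here, and the salary values are invented.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical salaries (in thousands); 500 is a typo for 50.
salaries = pd.Series([46, 47, 48, 49, 50, 51, 52, 53, 54, 500])

lower, upper = iqr_bounds(salaries)
is_outlier = (salaries < lower) | (salaries > upper)

print(lower, upper)
print(salaries[is_outlier].tolist())
```

Because quartiles barely move when a single extreme value is added, the fences stay close to the bulk of the data even in a sample this small.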

3. Isolation Forest (ML-based)

Well suited to multivariate outliers, where a row may be unusual only in its combination of features.

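from sklearn.ensemble import IsolationForest

# random_state makes the randomized tree splits reproducible
clf = IsolationForest(random_state=42)
predictions = clf.fit_predict(df[['age', 'salary', 'experience']])
outliers = predictions == -1  # fit_predict returns -1 for outliers, 1 for inliers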
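
To try this end to end, here is a hedged sketch on synthetic data. The column distributions, the injected extreme row, and the `contamination=0.02` setting (the assumed share of outliers) are all illustrative choices, not recommendations.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic, made-up data: 200 typical employees...
df = pd.DataFrame({
    "age": rng.normal(40, 8, 200),
    "salary": rng.normal(60_000, 8_000, 200),
    "experience": rng.normal(15, 5, 200),
})
# ...plus one row that is extreme in every column.
df.loc[len(df)] = [95, 400_000, 70]

# contamination is the assumed fraction of outliers in the data
clf = IsolationForest(contamination=0.02, random_state=0)
labels = clf.fit_predict(df[["age", "salary", "experience"]])

outlier_rows = df[labels == -1]
print(len(outlier_rows))
print(labels[-1])  # label of the injected extreme row
```

Note that `contamination` fixes roughly how many rows get flagged, so a few ordinary rows are labelled -1 alongside the injected one. The flags are candidates for review, not verdicts.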

Handling Outliers

  • Remove (if clearly errors)
  • Cap (replace values above a chosen percentile, e.g. the 95th, with that percentile's value)
  • Transform (log scale to reduce impact)
  • Keep but flag (for investigation)
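
The four options above could look like this in pandas. This is a sketch on invented data that uses the IQR rule as the detector; which option is appropriate depends on what investigation of the flagged rows shows.

```python
import numpy as np
import pandas as pd

# Hypothetical salaries (in thousands): 30..69 plus one typo (500).
df = pd.DataFrame({"salary": list(range(30, 70)) + [500]})

q1, q3 = df["salary"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["salary"] < q1 - 1.5 * iqr) | (df["salary"] > q3 + 1.5 * iqr)

# Keep but flag: store the mask so the rows can be investigated later.
df["is_outlier"] = is_outlier

# Remove: only once the flagged rows are confirmed to be errors.
removed = df.loc[~is_outlier, "salary"]

# Cap: clip everything above the 95th percentile down to that value.
cap = df["salary"].quantile(0.95)
capped = df["salary"].clip(upper=cap)

# Transform: log1p compresses large values (requires values >= 0).
logged = np.log1p(df["salary"])

print(len(removed), cap, capped.max(), round(logged.max(), 2))
```

Capping and transforming keep every row, which matters when the extreme values might be real; removal is the only option that discards information.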

Rule: Never blindly remove outliers. Investigate first!
