Handling Outliers & Validation
Detecting & Handling Outliers
What is an Outlier?
An outlier is a value that lies unusually far from the rest of the data. It may:
- Be a measurement or data-entry error (a typo: 500 instead of 50)
- Represent a rare but real event (a fraudulent transaction)
- Simply reflect natural variation (a person 7 feet tall)
Methods to Detect Outliers
1. Z-Score (Standardized Distance)
Values more than 3 standard deviations from the mean (beyond ±3) are suspicious.
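import numpy as np
from scipy import stats
# Absolute z-score: distance from the mean, in standard deviations
z_scores = np.abs(stats.zscore(df['salary']))
outliers = z_scores > 3  # True where the value is an outlier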
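To see the z-score rule on concrete numbers, here is a minimal self-contained sketch. The salary figures are invented for illustration, and `df` is built inline rather than loaded from a real dataset.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical data: 20 ordinary salaries plus one extreme value
salaries = list(range(45_000, 55_000, 500)) + [500_000]
df = pd.DataFrame({"salary": salaries})

# Absolute z-score of every row (scipy uses the population std, ddof=0, by default)
z_scores = np.abs(stats.zscore(df["salary"]))
outliers = z_scores > 3

# Show the rows that were flagged
print(df[outliers])
```

One caveat worth keeping in mind: with the population standard deviation, no z-score can exceed the square root of n - 1, so in a sample of 10 rows or fewer nothing can ever pass the |z| > 3 threshold.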
2. IQR (Interquartile Range)
Outliers fall more than 1.5 × IQR below the first quartile (Q1) or above the third quartile (Q3).
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
outliers = (df['salary'] < Q1 - 1.5*IQR) | (df['salary'] > Q3 + 1.5*IQR)
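The same IQR rule can be wrapped in a small reusable helper. This is only a sketch: the function name `iqr_outlier_mask`, its parameter `k`, and the sample salaries are assumptions made for illustration, not part of the lesson's code.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies more than k * IQR below Q1 or above Q3."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Hypothetical salaries with one suspiciously large entry
salary = pd.Series([42_000, 48_000, 50_000, 52_000, 55_000, 58_000, 61_000, 500_000])
mask = iqr_outlier_mask(salary)

# Show the values that fall outside the fences
print(salary[mask])
```

Because quartiles barely move when a single extreme value is added, this rule tends to be less affected by the outlier itself than the z-score is.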
3. Isolation Forest (ML-based)
Well suited to multivariate outliers, where a row is unusual because of its combination of features rather than any single value.
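from sklearn.ensemble import IsolationForest
# random_state makes the result reproducible; contamination stays at its default ('auto')
clf = IsolationForest(random_state=42)
# fit_predict returns 1 for inliers and -1 for outliers
predictions = clf.fit_predict(df[['age', 'salary', 'experience']])
outliers = predictions == -1  # True where the row is an outlier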
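A runnable sketch of the Isolation Forest approach on synthetic data follows. Everything about the data is made up: the 200 generated rows, the one implausible row appended at the end, and the choice of `contamination=0.01` (the expected share of outliers) are assumptions for illustration and would need tuning on real data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Hypothetical data: 200 plausible employees
n = 200
age = rng.integers(22, 60, size=n)
experience = np.clip(age - 22 - rng.integers(0, 5, size=n), 0, None)
salary = 30_000 + 2_000 * experience + rng.normal(0, 5_000, size=n)

# Append one implausible row: 23 years old, 40 years of experience, huge salary
age = np.append(age, 23)
experience = np.append(experience, 40)
salary = np.append(salary, 900_000)

df = pd.DataFrame({"age": age, "salary": salary, "experience": experience})

# contamination=0.01 is an assumed outlier share, not a recommended default
clf = IsolationForest(contamination=0.01, random_state=0)
predictions = clf.fit_predict(df[["age", "salary", "experience"]])
df["outlier"] = predictions == -1  # -1 = outlier, 1 = inlier

# Show the rows the model flagged
print(df[df["outlier"]])
```

Note that `contamination` fixes roughly how many rows get flagged, so the model will report some outliers even in clean data. Flagged rows are candidates for inspection, not confirmed errors.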
Handling Outliers
- Remove (if they are clearly errors)
- Cap (replace values beyond a chosen percentile, e.g. the 95th, with that percentile's value)
- Transform (e.g. a log scale, to reduce their impact)
- Keep but flag (for investigation)
Rule: Never blindly remove outliers. Investigate first!