Data Cleaning: Handling Missing Data
Why Does Dirty Data Exist?
Real-world datasets are messy. Missing values arise from:
- Data entry errors (human mistakes)
- System failures (sensors going offline)
- Merging datasets with different schemas
- Survey non-responses
Detecting Missing Values
df.isnull() # True where NaN
df.isnull().sum() # count per column
df.isnull().sum() / len(df) * 100 # percentage
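Putting the three detection calls together on a small hypothetical DataFrame (the column names and values here are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical example frame with one gap per column
df = pd.DataFrame({
    "age": [25, np.nan, 31],
    "city": ["Oslo", "Lima", None],
    "salary": [50000, 62000, np.nan],
})

counts = df.isnull().sum()            # NaN count per column
pct = df.isnull().sum() / len(df) * 100  # percentage missing per column
print(counts)
print(pct)
```

Note that `isnull()` and `isna()` are aliases in pandas; both mark `NaN` and `None` as missing.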
Handling Missing Values
Strategy 1: Drop rows/columns
df.dropna() # drop rows with ANY NaN
df.dropna(subset=["salary"]) # drop only if salary is NaN
df.dropna(thresh=5) # keep rows with at least 5 non-NaN
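A quick sketch of how the three `dropna` variants differ on the same data (again with an invented DataFrame; `thresh=3` is used here because the frame has three columns):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Ana", "Ben", "Cam", "Dee"],
    "age": [25, np.nan, 31, 40],
    "salary": [50000, 62000, np.nan, 58000],
})

any_nan = df.dropna()                      # drops rows 1 and 2 (each has a NaN)
only_salary = df.dropna(subset=["salary"])  # drops only row 2 (salary is NaN)
at_least_3 = df.dropna(thresh=3)           # keeps rows with >= 3 non-NaN values
print(len(any_nan), len(only_salary), len(at_least_3))
```

`subset` is the middle ground: it keeps rows that are incomplete in columns you do not care about.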
Strategy 2: Fill / Impute
df.fillna(0) # fill all with 0
df["age"].fillna(df["age"].mean()) # fill with mean
df["city"].fillna("Unknown") # fill with constant
df.ffill() # forward fill (fillna(method="ffill") is deprecated in pandas 2.x)
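The fill strategies above can be combined, choosing a different imputation per column. A minimal sketch, using an invented DataFrame (note that `fillna` returns a new object, so the result must be assigned back):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 40.0],
    "city": ["Oslo", None, "Lima", None],
})

# Mean imputation for a numeric column: (25 + 31 + 40) / 3 = 32
df["age"] = df["age"].fillna(df["age"].mean())

# Constant fill for a categorical column
df["city"] = df["city"].fillna("Unknown")

print(df)
```

Mean imputation preserves the column average but shrinks its variance, which is one reason to understand the data before filling.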
Best practice: never blindly drop or fill. First understand why the data is missing; values missing at random call for different handling than values missing for a systematic reason (e.g. a sensor that only fails at high temperatures).