
Handling Missing Data

Data Cleaning

Why Does Dirty Data Exist?

Real-world datasets are messy. Missing values arise from:

  • Data entry errors (human mistakes)
  • System failures (sensors going offline)
  • Merging datasets with different schemas
  • Survey non-responses

[Figure: the data cleaning process]

Detecting Missing Values

df.isnull()           # True where NaN
df.isnull().sum()     # count per column
df.isnull().sum() / len(df) * 100  # percentage
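As a minimal, self-contained sketch of the calls above (the small DataFrame and its values are hypothetical, invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy DataFrame with deliberate gaps in each column
df = pd.DataFrame({
    "age":    [25, np.nan, 31, np.nan],
    "salary": [50000, 60000, np.nan, 55000],
    "city":   ["NYC", "LA", None, "Chicago"],
})

mask = df.isnull()                        # boolean frame, True where NaN/None
counts = df.isnull().sum()                # missing count per column
pct = df.isnull().sum() / len(df) * 100   # missing percentage per column
```

Here `counts` reports 2 missing values in `age` and 1 each in `salary` and `city`; `pct` expresses the same counts as a fraction of the 4 rows.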

Handling Missing Values

Strategy 1: Drop rows/columns

df.dropna()                   # drop rows with ANY NaN
df.dropna(subset=["salary"])  # drop only if salary is NaN
df.dropna(thresh=5)           # keep rows with at least 5 non-NaN
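The three drop variants can be compared on a hypothetical toy DataFrame (column names and values are invented for illustration); with 3 columns, `thresh=3` keeps only fully complete rows:

```python
import numpy as np
import pandas as pd

# Hypothetical toy DataFrame: row 0 is complete, rows 1-3 each miss one value
df = pd.DataFrame({
    "age":    [25, np.nan, 31, 40],
    "salary": [50000, 60000, np.nan, 55000],
    "city":   ["NYC", "LA", "Boston", None],
})

any_dropped = df.dropna()                   # drops rows 1-3 (each has a NaN)
salary_kept = df.dropna(subset=["salary"])  # drops only row 2 (missing salary)
thresh_kept = df.dropna(thresh=3)           # keep rows with >= 3 non-NaN values
```

`dropna()` is the bluntest tool: here it discards 3 of 4 rows, which is why targeting specific columns with `subset` or setting a `thresh` floor is usually safer.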

Strategy 2: Fill / Impute

df.fillna(0)                                     # fill every NaN with 0 (returns a copy)
df["age"] = df["age"].fillna(df["age"].mean())   # fill with the column mean
df["city"] = df["city"].fillna("Unknown")        # fill with a constant
df.ffill()                                       # forward fill (fillna(method="ffill") is deprecated)
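Putting the fill strategies together on a hypothetical toy DataFrame (column names and values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy DataFrame with one gap per strategy
df = pd.DataFrame({
    "age":  [25.0, np.nan, 31.0, 40.0],
    "city": ["NYC", None, "LA", None],
    "temp": [20.1, np.nan, np.nan, 22.3],
})

df["age"] = df["age"].fillna(df["age"].mean())  # mean of [25, 31, 40] = 32.0
df["city"] = df["city"].fillna("Unknown")       # constant for a categorical column
df["temp"] = df["temp"].ffill()                 # carry the last observation forward
```

Note that `fillna` and `ffill` return copies, so each result is assigned back to its column; forward fill suits ordered data such as time series, where repeating the last reading is a reasonable guess.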

Best practice: Never blindly drop or fill. Always understand why data is missing.
