Weight Initialization, Regularization & Dropout · Page 1 of 2
Weight Initialization & Regularization
Why Weight Initialization Matters
Scenario 1: All weights = 0
All neurons in a layer produce the same output and receive the same gradient update
No diversity → Can't learn!
Scenario 2: Random huge weights (e.g., N(0, 100))
Activations explode → Gradients explode → Training unstable
Scenario 3: Random tiny weights (e.g., N(0, 0.0001))
Activations too small → Gradients vanish → Learning too slow
Goal: Find the Goldilocks zone!
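The three scenarios can be reproduced with a small experiment. This is a minimal sketch, not part of the original lesson: it assumes a stack of purely linear layers (no activation function) so that only the weight scale matters, and it reads N(0, 100) and N(0, 0.0001) as variances, i.e. standard deviations of 10 and 0.01. The helper name final_rms and the layer sizes are made up for illustration.

```python
import math
import random


def final_rms(weight_std, width=32, depth=8, seed=0):
    """Push one random input through `depth` linear layers and return
    the root-mean-square of the final activations."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        # Each output unit is a weighted sum of all inputs; every weight
        # is drawn from Normal(mean=0, std=weight_std).
        x = [
            sum(rng.gauss(0.0, weight_std) * xi for xi in x)
            for _ in range(width)
        ]
    return math.sqrt(sum(v * v for v in x) / width)


zero_rms = final_rms(0.0)    # Scenario 1: all weights are 0
huge_rms = final_rms(10.0)   # Scenario 2: variance 100 (assumed reading)
tiny_rms = final_rms(0.01)   # Scenario 3: variance 0.0001 (assumed reading)

print(f"zero weights: {zero_rms:.3e}")
print(f"huge weights: {huge_rms:.3e}")
print(f"tiny weights: {tiny_rms:.3e}")
```

With these settings the signal is exactly zero in the first case, grows by many orders of magnitude in the second, and shrinks toward zero in the third. Gradients flow back through the same weights, so they scale the same way.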
Xavier (Glorot) Initialization
W ~ Uniform(-√(6/(n_in + n_out)), √(6/(n_in + n_out)))
Or Gaussian:
W ~ Normal(mean = 0, std = √(2/(n_in + n_out)))
Intuition: Scale weights based on layer size
- Large layer → smaller weights
- Small layer → larger weights
- Keeps activations from exploding/vanishing
When: For sigmoid/tanh layers
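Both Xavier variants target the same weight variance, 2/(n_in + n_out), because a Uniform(-a, a) distribution has variance a²/3. The sketch below checks this numerically. It is an illustration only: the function names xavier_uniform, xavier_normal and variance, and the layer sizes, are assumptions chosen for the example.

```python
import math
import random


def xavier_uniform(n_in, n_out, rng):
    """Sample an n_in x n_out matrix from Uniform(-limit, limit)."""
    limit = math.sqrt(6.0 / (n_in + n_out))
    return [[rng.uniform(-limit, limit) for _ in range(n_out)]
            for _ in range(n_in)]


def xavier_normal(n_in, n_out, rng):
    """Sample an n_in x n_out matrix from Normal(0, std)."""
    std = math.sqrt(2.0 / (n_in + n_out))
    return [[rng.gauss(0.0, std) for _ in range(n_out)]
            for _ in range(n_in)]


def variance(matrix):
    """Population variance over every entry of a nested-list matrix."""
    values = [w for row in matrix for w in row]
    mean = sum(values) / len(values)
    return sum((w - mean) ** 2 for w in values) / len(values)


rng = random.Random(0)
n_in, n_out = 200, 100                      # illustrative layer sizes
limit = math.sqrt(6.0 / (n_in + n_out))
target_var = 2.0 / (n_in + n_out)

w_uniform = xavier_uniform(n_in, n_out, rng)
w_normal = xavier_normal(n_in, n_out, rng)

print(f"target variance:  {target_var:.5f}")
print(f"uniform variance: {variance(w_uniform):.5f}")
print(f"normal variance:  {variance(w_normal):.5f}")
```

The two empirical variances should land close to the target, which is why the uniform and Gaussian forms are interchangeable in practice.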
He Initialization
W ~ Normal(mean = 0, std = √(2/n_in))
Better for ReLU:
- ReLU outputs zero for negative inputs, which roughly halves the signal variance at each layer
- The factor of 2 compensates, so weights are slightly larger than Xavier's
- Better for deep networks
When: For ReLU layers (the modern default)
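A minimal sketch of the He formula, assuming a nested-list weight matrix and a hypothetical helper named he_normal. It samples a layer, measures the empirical standard deviation, and compares the target with the Xavier standard deviation for the same layer shape.

```python
import math
import random


def he_normal(n_in, n_out, rng):
    """Sample an n_in x n_out matrix from Normal(0, sqrt(2 / n_in))."""
    std = math.sqrt(2.0 / n_in)
    return [[rng.gauss(0.0, std) for _ in range(n_out)]
            for _ in range(n_in)]


rng = random.Random(0)
n_in, n_out = 256, 128                      # illustrative layer sizes

he_std = math.sqrt(2.0 / n_in)              # depends on fan-in only
xavier_std = math.sqrt(2.0 / (n_in + n_out))

weights = he_normal(n_in, n_out, rng)
values = [w for row in weights for w in row]
mean = sum(values) / len(values)
empirical_std = math.sqrt(sum((w - mean) ** 2 for w in values) / len(values))

print(f"He target std:    {he_std:.4f}")
print(f"He empirical std: {empirical_std:.4f}")
print(f"Xavier std:       {xavier_std:.4f}")
```

Because n_out is dropped from the denominator, the He standard deviation is always larger than the Xavier one for the same layer.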
Comparison
Xavier: Works well for sigmoid/tanh
He: Better for ReLU
Unscaled random (fixed spread regardless of layer size): Bad! Don't use!
Modern practice: Use He initialization for ReLU networks!
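The difference shows up once the layers use ReLU. The sketch below is an illustration under stated assumptions: a stack of equal-width ReLU layers, a single random input, and a hypothetical helper relu_stack_mean_square that reports the mean squared activation after the last layer.

```python
import math
import random


def relu_stack_mean_square(weight_std, width=64, depth=10, seed=0):
    """Push one random input through `depth` ReLU layers and return
    the mean squared activation of the final layer."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        # Weighted sum of all inputs followed by ReLU (max with 0).
        x = [
            max(0.0, sum(rng.gauss(0.0, weight_std) * xi for xi in x))
            for _ in range(width)
        ]
    return sum(v * v for v in x) / width


width = 64
xavier_std = math.sqrt(2.0 / (width + width))   # variance 1/width
he_std = math.sqrt(2.0 / width)                 # variance 2/width

xavier_ms = relu_stack_mean_square(xavier_std, width=width)
he_ms = relu_stack_mean_square(he_std, width=width)

print(f"Xavier, after 10 ReLU layers: {xavier_ms:.3e}")
print(f"He, after 10 ReLU layers:     {he_ms:.3e}")
```

With equal layer widths, Xavier's variance is half of He's, so the Xavier signal loses roughly a factor of 2 in mean square at every ReLU layer while the He signal stays near its starting scale.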
Layer Normalization / Batch Normalization
Problem: Even with good initialization, activations drift during training.
Solution: Normalize the activations at each layer!
Batch Normalization:
x_norm = (x - batch_mean) / √(batch_var + ε)
x_scaled = γ × x_norm + β
γ, β are learnable!
Effect: Stabilizes training
Benefits:
- Faster convergence
- Less sensitive to initialization
- Acts as a mild regularizer (batch statistics add noise)
- Allows higher learning rates
When: Add after dense/conv layers, before activation
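The two batch normalization formulas can be sketched as a training-mode forward pass. This is a minimal illustration, assuming a batch stored as a list of samples (each a list of features) and a made-up function name batch_norm. It is not a framework implementation, and γ and β are passed in as fixed lists rather than learned.

```python
import math


def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale and shift.

    batch: list of samples, each a list of feature values.
    gamma, beta: one scale and one shift value per feature.
    """
    n = len(batch)
    n_features = len(batch[0])
    # Per-feature statistics computed across the batch dimension.
    means = [sum(row[j] for row in batch) / n for j in range(n_features)]
    variances = [
        sum((row[j] - means[j]) ** 2 for row in batch) / n
        for j in range(n_features)
    ]
    return [
        [
            gamma[j] * (row[j] - means[j]) / math.sqrt(variances[j] + eps)
            + beta[j]
            for j in range(n_features)
        ]
        for row in batch
    ]


# Two features on very different scales.
batch = [[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]]

normalized = batch_norm(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
shifted = batch_norm(batch, gamma=[2.0, 2.0], beta=[5.0, 5.0])

for row in normalized:
    print([round(v, 3) for v in row])
```

After normalization both features have mean close to 0 and variance close to 1 despite their different input scales. Setting γ = 2 and β = 5 moves every feature to mean 5 with standard deviation close to 2. At inference time, frameworks typically replace the batch statistics with running averages collected during training, which this sketch omits.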