Modern Optimizers: SGD, Momentum, Adam
Optimization Algorithms
Problems with basic gradient descent:
- Slow convergence on large datasets (every update requires a pass over all the data)
- Can get stuck in local minima
- No adaptive learning rates
Solution: Modern optimizers!
Stochastic Gradient Descent (SGD)
W := W - α × ∇W Loss
Same as basic gradient descent, but:
- Process mini-batches (32-256 samples) instead of all data
- Faster updates, escapes local minima
Advantage: Noisy gradients help escape bad local minima.
Disadvantage: Jerky updates; the loss curve is noisy and convergence can be slow.
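A minimal NumPy sketch of a mini-batch SGD loop on synthetic linear-regression data (the data, batch size of 64, and learning rate are illustrative assumptions, not fixed choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 10))          # synthetic inputs
y = rng.normal(size=1024)                # synthetic targets
W = np.zeros(10)
lr, batch_size = 0.01, 64

for epoch in range(5):
    perm = rng.permutation(len(X))       # shuffle each epoch
    for i in range(0, len(X), batch_size):
        idx = perm[i:i + batch_size]
        Xb, yb = X[idx], y[idx]
        # Gradient of mean squared error on this mini-batch only
        grad = 2 * Xb.T @ (Xb @ W - yb) / len(idx)
        W -= lr * grad                   # W := W - α × ∇W Loss
```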
SGD with Momentum
v := β × v + (1-β) × gradient
W := W - α × v
Keep moving in the direction of previous gradients (momentum).
Intuition: Like rolling a ball downhill—builds up speed in good directions.
When gradients keep pointing downhill → velocity builds up → fast progress!
When the gradient briefly reverses (a bump or shallow local minimum) → momentum carries the update through → escapes!
β = 0.9 (most common) means: Use 90% of previous velocity.
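The same update written as a sketch with a velocity term; the `grad` placeholder stands in for whatever mini-batch gradient the loop above produced:

```python
import numpy as np

def momentum_step(W, v, grad, lr=0.01, beta=0.9):
    """v := β × v + (1-β) × gradient;  W := W - α × v"""
    v = beta * v + (1 - beta) * grad     # blend the new gradient into the velocity
    W = W - lr * v                       # move along the smoothed direction
    return W, v

W = np.zeros(10)
v = np.zeros_like(W)                     # velocity starts at zero
# Inside the mini-batch loop from the SGD sketch:
#   W, v = momentum_step(W, v, grad)
```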
Adam (Adaptive Moment Estimation) — Best for Most Tasks
Keep two running averages:
- m := β₁ × m + (1-β₁) × gradient (1st moment, mean)
- v := β₂ × v + (1-β₂) × gradient² (2nd moment, uncentered variance)
Adaptive learning rate:
W := W - α × m̂ / (√v̂ + ε)
where m̂ = m / (1-β₁^t) and v̂ = v / (1-β₂^t) are bias-corrected estimates (needed because m and v start at zero).
Why Adam works:
- Adaptive: Different learning rates for different parameters
- Momentum: Accelerates in good directions
- Variance: Adapts to noisy gradients
- Works great in practice: Default choice for deep learning
Default parameters:
- α = 0.001
- β₁ = 0.9
- β₂ = 0.999
- ε = 10⁻⁸ (small constant that prevents division by zero)
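A sketch of one Adam step using the defaults above, including the bias-correction terms; the step counter `t` is assumed to start at 1:

```python
import numpy as np

def adam_step(W, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad        # 1st moment: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad**2     # 2nd moment: running mean of squared gradients
    m_hat = m / (1 - beta1**t)                # bias correction (m and v start at zero)
    v_hat = v / (1 - beta2**t)
    W = W - lr * m_hat / (np.sqrt(v_hat) + eps)
    return W, m, v

W = np.zeros(10)
m, v = np.zeros_like(W), np.zeros_like(W)
# Inside the training loop, with t counting steps from 1:
#   W, m, v = adam_step(W, m, v, grad, t)
```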
When to Use
| Optimizer | Speed | Stability | Use Case |
|---|---|---|---|
| SGD | Slow | Stable | Fine-tuning, small models |
| Momentum | Fast | Good | CNNs, vision tasks |
| Adam | Very Fast | Excellent | RNNs, transformers (DEFAULT) |
Modern practice: Start with Adam, rarely change it!
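As a usage note, here is a minimal sketch of that default in PyTorch (assuming a PyTorch workflow; the `Linear` model is just a placeholder):

```python
import torch

model = torch.nn.Linear(10, 1)   # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8)

# Typical training step:
#   loss.backward()
#   optimizer.step()
#   optimizer.zero_grad()
```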