Loss Functions & Optimization (Adam, SGD)

Modern Optimizers: SGD, Momentum, Adam

Optimization Algorithms

Problems with basic gradient descent:

  • Slow convergence on large datasets
  • Gets stuck in local minima
  • No adaptive learning rates

Solution: Modern optimizers!

Stochastic Gradient Descent (SGD)

W := W - α × ∇W Loss

Same as basic gradient descent, but:

  • Process mini-batches (32-256 samples) instead of all data
  • Faster updates, escapes local minima

Advantage: the noisy gradient helps escape bad local minima.
Disadvantage: jerky updates, slow convergence.
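
To make this concrete, here is a minimal mini-batch SGD sketch in NumPy (illustrative only; the synthetic data, grad_fn, and hyperparameters are assumptions, not part of this page):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                 # synthetic inputs (assumption)
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)   # noisy linear targets

def grad_fn(W, Xb, yb):
    # Gradient of mean squared error w.r.t. W for one mini-batch
    return 2.0 * Xb.T @ (Xb @ W - yb) / len(yb)

W = np.zeros(3)
alpha, batch_size = 0.01, 32
for epoch in range(20):
    idx = rng.permutation(len(X))              # shuffle, then walk through mini-batches
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]
        W -= alpha * grad_fn(W, X[b], y[b])    # W := W - α × ∇W Loss

print(W)                                       # should end up near [2.0, -1.0, 0.5]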

SGD with Momentum

v := β × v + (1-β) × gradient
W := W - α × v

Keep moving in the direction of recent gradients by accumulating a velocity term (momentum).

Intuition: Like rolling a ball downhill—builds up speed in good directions.

When successive gradients point in the same direction → velocity builds up → fast progress.
When the gradient briefly reverses (a shallow dip or a noisy batch) → momentum carries the update through → helps escape.

β = 0.9 (the most common value) means: keep 90% of the previous velocity at each update.
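
A minimal sketch of this update on a toy quadratic loss (illustrative; the loss, hyperparameters, and variable names are assumptions):

import numpy as np

def grad(w):
    return w                          # ∇f for the toy loss f(w) = 0.5 × ||w||²

w = np.array([5.0, -3.0])
v = np.zeros_like(w)                  # velocity (running average of gradients)
alpha, beta = 0.1, 0.9

for step in range(200):
    g = grad(w)
    v = beta * v + (1 - beta) * g     # v := β × v + (1-β) × gradient
    w = w - alpha * v                 # W := W - α × v

print(w)                              # close to the minimum at [0, 0]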

Adam (Adaptive Moment Estimation) — Best for Most Tasks

Keep two running averages:
- m := β₁ × m + (1-β₁) × gradient        (1st moment, mean)
- v := β₂ × v + (1-β₂) × gradient²       (2nd moment, uncentered variance)

Adaptive learning rate:
W := W - α × m / (√v + ε)

Why Adam works:

  • Adaptive: Different learning rates for different parameters
  • Momentum: Accelerates in good directions
  • Variance: Adapts to noisy gradients
  • Works great in practice: Default choice for deep learning

Default parameters:

  • α = 0.001
  • β₁ = 0.9
  • β₂ = 0.999
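
Putting the pieces together, here is a minimal Adam sketch using those defaults (illustrative; the toy loss is an assumption, and the bias-correction terms m̂, v̂ follow the original Adam paper rather than the simplified update above):

import numpy as np

def grad(w):
    return w                              # ∇f for the toy loss f(w) = 0.5 × ||w||²

w = np.array([5.0, -3.0])
m = np.zeros_like(w)                      # 1st moment (running mean of gradients)
v = np.zeros_like(w)                      # 2nd moment (running mean of squared gradients)
alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8

for t in range(1, 8001):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g       # m := β₁ × m + (1-β₁) × gradient
    v = beta2 * v + (1 - beta2) * g**2    # v := β₂ × v + (1-β₂) × gradient²
    m_hat = m / (1 - beta1**t)            # bias correction for the warm-up phase
    v_hat = v / (1 - beta2**t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)   # W := W - α × m̂ / (√v̂ + ε)

print(w)                                  # close to the minimum at [0, 0]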

When to Use

Optimizer | Speed     | Stability | Use Case
SGD       | Slow      | Stable    | Fine-tuning, small models
Momentum  | Fast      | Good      | CNNs, vision tasks
Adam      | Very fast | Excellent | RNNs, transformers (DEFAULT)

Modern practice: Start with Adam, rarely change it!
