Module

Deep Learning & Neural Networks

Progress65%

13 / 20 pages

Lesson 1: Neurons & Perceptrons — Building Blocks

Lesson 2: Forward & Backpropagation — How Networks Learn

Lesson 3: Loss Functions & Optimization (Adam, SGD)

Lesson 4: Tokenization, Word Embeddings & Word2Vec

Lesson 5: Convolutional Neural Networks (CNN) — Image Processing

Lesson 6: Recurrent Neural Networks (RNN, LSTM, GRU)

Lesson 7: Attention Mechanisms & Transformers

Lesson 8: Generative Adversarial Networks (GAN)

Lesson 9: Weight Initialization, Regularization & Dropout

Lesson 10: Transfer Learning & Model Deployment

Back to Module Overview

Alt+←/→to navigatePage13/2065

Attention Mechanisms & Transformers · Page 1 of 2

Attention Mechanism

40 min Advanced

Attention & Transformers

The Problem with RNNs/LSTMs

Sequential processing is slow:

Step 1: Input word 1 → hidden state 1 (wait for completion)
Step 2: Input word 2 → hidden state 2 (can't start until step 1 done)
...
Step N: Process word N

For a 1000-word sentence: 1000 sequential steps! Bottleneck!

Solution: Process all words at once (parallelization)
But then how do we know which words relate to which?

Attention: "Focus on Relevant Words"

When processing a word, don't treat all other words equally. Attend to relevant ones.

Sentence: "The cat sat on the mat"
         word:    1   2   3  4  5   6

Processing word 3 ("sat"):
- High attention to "cat" (subject)
- Low attention to "the" (article)
- Medium attention to "mat" (object)

This tells the model which words matter for "sat"!

Self-Attention Mechanism

Query Q = W_q × x     (what am I looking for?)
Key K = W_k × x       (what do I have?)
Value V = W_v × x     (what info to use?)

Attention(Q, K, V) = softmax(Q × K^T / √d_k) × V

Intuition:

Q × K^T computes similarity between each pair
softmax converts to attention weights
Weights × V gives weighted average (which words matter)

Example: Computing Attention

Query for "sat": "what verbs happened?"
Keys for all words: ["noun", "noun", "verb", "preposition", "article", "noun"]
Similarity: [low, low, high, low, very_low, low]

After softmax: [0.01, 0.01, 0.85, 0.05, 0.005, 0.01]
↑ High attention to "sat" (the verb!)

Weighted sum of values: 0.85 × V_sat + 0.05 × V_on + ...
Result: Information focused on the verb!

Why Attention Works

Parallelization: All words processed simultaneously (GPU-friendly)
Long-range dependencies: Can attend to any word (no sequential bottleneck)
Interpretability: Attention weights show what model focused on
Flexibility: Can learn different attention patterns per layer

Multi-Head Attention

Use multiple attention mechanisms in parallel:

Head 1: Attends to subjects
Head 2: Attends to verbs
Head 3: Attends to adjectives
...

Concatenate: [head1_output, head2_output, head3_output, ...]

Why? Different "heads" learn different relationships!

Example: 8 heads × 64 dimensions = 512-dimensional output

main.py

OUTPUT

▶Click "Run Code" to execute…