Attention Mechanisms & Transformers · Page 1 of 2

Attention Mechanism

Attention & Transformers

The Problem with RNNs/LSTMs

Sequential processing is slow:

Step 1: Input word 1 → hidden state 1 (wait for completion)
Step 2: Input word 2 → hidden state 2 (can't start until step 1 done)
...
Step N: Process word N

For a 1000-word sentence: 1000 sequential steps! Bottleneck!

Solution: Process all words at once (parallelization)
But then how do we know which words relate to which?
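
The contrast between the two styles can be sketched in a few lines of NumPy. This is only an illustration under assumed sizes: the weight names `W_x` and `W_h`, the `tanh` update, and the dimensions are hypothetical choices, not something the lesson specifies.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_in, d_hidden = 5, 4, 3            # hypothetical sizes

x = rng.normal(size=(seq_len, d_in))         # one row per word
W_x = rng.normal(size=(d_in, d_hidden))      # input-to-hidden weights (assumed name)
W_h = rng.normal(size=(d_hidden, d_hidden))  # hidden-to-hidden weights (assumed name)

# Recurrent style: step t needs the hidden state from step t-1,
# so the loop cannot be spread across time steps.
h = np.zeros(d_hidden)
rnn_states = []
for t in range(seq_len):
    h = np.tanh(x[t] @ W_x + h @ W_h)
    rnn_states.append(h)
rnn_states = np.stack(rnn_states)

# Parallel style: rows do not depend on each other,
# so all words are transformed in a single matrix multiply.
parallel_out = x @ W_x

print(rnn_states.shape, parallel_out.shape)
```

The parallel version is fast, but each row is computed in isolation, so on its own it carries no information about how the words relate to one another.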

Attention: "Focus on Relevant Words"

When processing a word, don't treat all other words equally. Attend to relevant ones.

Sentence: "The cat sat on the mat"
word:      1   2   3   4  5   6

Processing word 3 ("sat"):
- High attention to "cat" (subject)
- Low attention to "the" (article)
- Medium attention to "mat" (object)

This tells the model which words matter for "sat"!

Self-Attention Mechanism

Query Q = W_q × x     (what am I looking for?)
Key K = W_k × x       (what do I have?)
Value V = W_v × x     (what info to use?)

Attention(Q, K, V) = softmax(Q × K^T / √d_k) × V

Intuition:

  1. Q × K^T computes similarity between each pair
  2. softmax converts to attention weights
  3. Weights × V gives weighted average (which words matter)
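
The three steps above can be sketched in NumPy roughly as follows. This is a minimal sketch, not a reference implementation: the function names, the random weights, and the sizes are assumptions made for illustration. The code also stacks the words as rows of a matrix `X`, so the projections are written `X @ W_q` rather than `W_q × x` for a single word vector.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    Q = X @ W_q                          # what each word is looking for
    K = X @ W_k                          # what each word offers
    V = X @ W_v                          # the information each word carries
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # step 1: pairwise similarity, scaled
    weights = softmax(scores, axis=-1)   # step 2: each row sums to 1
    return weights @ V, weights          # step 3: weighted average of values

rng = np.random.default_rng(0)
n_words, d_model, d_k = 6, 8, 4          # hypothetical sizes
X = rng.normal(size=(n_words, d_model))
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

output, weights = self_attention(X, W_q, W_k, W_v)
print(output.shape)          # one output vector per word
print(weights.sum(axis=1))   # each row of attention weights sums to 1
```

Row i of `weights` holds how much word i attends to every word in the sentence, including itself.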

Example: Computing Attention

Query for "sat": "what verbs happened?"
Keys for all words: ["article", "noun", "verb", "preposition", "article", "noun"]
Similarity: [very_low, low, high, low, very_low, low]

After softmax: [0.02, 0.03, 0.85, 0.05, 0.02, 0.03]   (weights sum to 1)
↑ High attention to "sat" (the verb!)

Weighted sum of values: 0.85 × V_sat + 0.05 × V_on + ...
Result: Information focused on the verb!
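
To make the arithmetic concrete, here is a small sketch of the same computation with made-up numbers. The raw scores and the 2-dimensional value vectors are hypothetical, chosen only so that the verb gets the highest score; they do not come from a trained model.

```python
import numpy as np

words = ["The", "cat", "sat", "on", "the", "mat"]

# Hypothetical raw similarity scores between the query for "sat" and each key.
scores = np.array([0.5, 1.0, 4.5, 1.5, 0.5, 1.0])

# Softmax: exponentiate and normalize so the weights sum to 1.
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# Hypothetical 2-dimensional value vectors, one row per word.
V = np.array([
    [0.1, 0.0],
    [0.9, 0.2],
    [0.3, 1.0],
    [0.0, 0.4],
    [0.1, 0.0],
    [0.8, 0.1],
])

# Weighted sum of values: the result is dominated by the row for "sat".
context = weights @ V

for word, w in zip(words, weights):
    print(f"{word:>4}: {w:.3f}")
print("context:", context)
```

Because the weights are non-negative and sum to 1, `context` is a weighted average of the rows of `V`, so each of its components stays between the smallest and largest value in that column.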

Why Attention Works

  1. Parallelization: All words processed simultaneously (GPU-friendly)
  2. Long-range dependencies: Can attend to any word (no sequential bottleneck)
  3. Interpretability: Attention weights show what the model focused on
  4. Flexibility: Can learn different attention patterns per layer

Multi-Head Attention

Use multiple attention mechanisms in parallel:

Head 1: Attends to subjects
Head 2: Attends to verbs
Head 3: Attends to adjectives
...

Concatenate: [head1_output, head2_output, head3_output, ...]

Why? Different "heads" learn different relationships!

Example: 8 heads × 64 dimensions = 512-dimensional output
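
A rough NumPy sketch of the split, attend, and concatenate pattern with 8 heads of 64 dimensions is shown below. The function name, the random weights, and the 6-word input are assumptions for illustration. The sketch stops at the concatenation described above and leaves out the final output projection that the standard Transformer applies afterwards.

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, n_heads):
    n_words, d_model = X.shape
    d_head = d_model // n_heads

    def split_heads(M):
        # (n_words, d_model) -> (n_heads, n_words, d_head)
        return M.reshape(n_words, n_heads, d_head).transpose(1, 0, 2)

    Q = split_heads(X @ W_q)
    K = split_heads(X @ W_k)
    V = split_heads(X @ W_v)

    # Each head runs scaled dot-product attention independently.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, n_words, n_words)
    weights = softmax(scores)
    heads = weights @ V                                   # (n_heads, n_words, d_head)

    # Concatenate the head outputs back into one vector per word.
    return heads.transpose(1, 0, 2).reshape(n_words, d_model)

rng = np.random.default_rng(0)
n_words, n_heads, d_head = 6, 8, 64      # hypothetical sentence length; 8 x 64 as above
d_model = n_heads * d_head               # 512

X = rng.normal(size=(n_words, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
                 for _ in range(3))

out = multi_head_attention(X, W_q, W_k, W_v, n_heads)
print(out.shape)  # (6, 512)
```

Projecting once to 512 dimensions and then reshaping into 8 slices of 64 gives each head its own slice of the projection, which is a common way to implement the heads without a Python loop.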
