Tokenization, Word Embeddings & Word2Vec · Page 1 of 2
From Text to Numbers
30 min Intermediate
Text Processing & Embeddings
The Challenge: Text to Neural Networks
Neural networks need numbers, but we have text!
"The cat sat on the mat" → ???
Step 1: Tokenization
Break text into tokens (words, subwords, characters).
"The cat sat on the mat"
↓
["The", "cat", "sat", "on", "the", "mat"]
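A minimal sketch of word-level tokenization, assuming tokens are runs of letters, digits, or apostrophes. The function name `tokenize` and the regular expression are illustrative choices, not a standard; real tokenizers (especially subword ones) are considerably more involved.

```python
import re

def tokenize(text: str) -> list[str]:
    """Split text into word tokens, dropping whitespace and punctuation."""
    # Assumption: a "word" is a run of letters, digits, or apostrophes.
    return re.findall(r"[A-Za-z0-9']+", text)

tokens = tokenize("The cat sat on the mat")
print(tokens)  # ['The', 'cat', 'sat', 'on', 'the', 'mat']

# Character-level tokenization is simply the list of characters.
print(list("cat"))  # ['c', 'a', 't']
```

Note that case is preserved here, so "The" and "the" are still distinct tokens at this stage.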
Step 2: Vocabulary & Indexing
Map each token to an integer.
Vocabulary:
- "cat" → 2
- "mat" → 5
- "on" → 4
- "sat" → 3
- "the" → 1
Text: "The cat sat on the mat" (lowercased first, so "The" and "the" share ID 1)
↓
[1, 2, 3, 4, 1, 5]
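The mapping above can be sketched as follows. The helper names `build_vocab` and `encode` are hypothetical, and the sketch assumes tokens are lowercased and IDs are assigned in order of first appearance, starting at 1, which reproduces the numbers shown.

```python
def build_vocab(tokens: list[str]) -> dict[str, int]:
    """Assign each distinct token an integer ID in order of first appearance."""
    vocab: dict[str, int] = {}
    for tok in tokens:
        if tok not in vocab:
            # Assumption: IDs start at 1, leaving 0 free (e.g. for padding).
            vocab[tok] = len(vocab) + 1
    return vocab

def encode(tokens: list[str], vocab: dict[str, int]) -> list[int]:
    """Replace each token with its ID; raises KeyError for unseen tokens."""
    return [vocab[tok] for tok in tokens]

# Lowercase so that "The" and "the" map to the same entry.
tokens = [t.lower() for t in "The cat sat on the mat".split()]
vocab = build_vocab(tokens)
ids = encode(tokens, vocab)

print(vocab)  # {'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5}
print(ids)    # [1, 2, 3, 4, 1, 5]
```

A word that was never added to the vocabulary has no ID in this sketch; practical pipelines usually reserve a special "unknown" entry for that case.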
Step 3: Convert to Vectors
Each integer ID is mapped to a vector, and the neural network processes the resulting sequence of vectors.
[1, 2, 3, 4, 1, 5] → Vectors → Neural Network → Output
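One way to picture this step is a lookup table with one row per ID. The sketch below is an illustration only: the table is filled with random numbers, and `VOCAB_SIZE`, `EMBED_DIM`, and `embed` are hypothetical names. In a real model the table's values are learned during training.

```python
import random

random.seed(0)   # fixed seed so the sketch is repeatable
VOCAB_SIZE = 6   # IDs 0..5 (ID 0 is unused in this example)
EMBED_DIM = 4    # hypothetical vector size

# One row of EMBED_DIM numbers per ID; random here, learned in practice.
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]

def embed(ids: list[int]) -> list[list[float]]:
    """Look up the vector for each ID."""
    return [embedding_table[i] for i in ids]

vectors = embed([1, 2, 3, 4, 1, 5])
print(len(vectors), len(vectors[0]))  # 6 4
print(vectors[0] == vectors[4])       # True: both positions hold ID 1 ("the")
```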
One-Hot Encoding (Naive Approach)
Represent each word as a vector with a 1 at that word's index and 0 everywhere else:
Vocabulary: {cat, dog, mat, sat, the}
"cat" → [1, 0, 0, 0, 0] (one-hot)
"dog" → [0, 1, 0, 0, 0]
"the" → [0, 0, 0, 0, 1]
Problem: No semantic relationship!
- "cat" and "dog" should be similar (both animals)
- "cat" and "pizza" should be different
- But any two different one-hot vectors are equally far apart, so the encoding captures no relationship at all
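A short sketch makes the problem concrete, using the five-word vocabulary from the example. The helpers `one_hot` and `dot` are hypothetical names. The dot product of any two different one-hot vectors is 0, so "cat" looks exactly as unrelated to "dog" as it does to "the".

```python
vocab = ["cat", "dog", "mat", "sat", "the"]

def one_hot(word: str, vocab: list[str]) -> list[int]:
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

def dot(a: list[int], b: list[int]) -> int:
    """Dot product, used here as a simple similarity score."""
    return sum(x * y for x, y in zip(a, b))

print(one_hot("cat", vocab))  # [1, 0, 0, 0, 0]
print(one_hot("the", vocab))  # [0, 0, 0, 0, 1]

print(dot(one_hot("cat", vocab), one_hot("dog", vocab)))  # 0
print(dot(one_hot("cat", vocab), one_hot("the", vocab)))  # 0
```

The vectors also grow with the vocabulary: a vocabulary of 50,000 words needs 50,000-dimensional vectors that are almost entirely zeros.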
Dense Word Embeddings (Better Approach)
Instead of sparse one-hot vectors, learn dense embeddings:
"cat" → [0.2, -0.5, 0.8, 0.1, -0.3] (5D embedding)
"dog" → [0.25, -0.48, 0.75, 0.15, -0.25] (similar!)
"the" → [-0.1, 0.2, -0.3, 0.8, 0.1] (different)
Magic: Similar words have similar vectors! (The values above are illustrative, not from a trained model.)
- Euclidean distance between "cat" and "dog" vectors: ≈ 0.10 ← close!
- Euclidean distance between "cat" and "the" vectors: ≈ 1.56 ← far!
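The comparison can be checked directly on the three example vectors. This sketch assumes Euclidean distance as the measure and also shows cosine similarity, a common alternative for embeddings. The function names are illustrative.

```python
import math

# The illustrative 5D embeddings from the example above.
embeddings = {
    "cat": [0.2, -0.5, 0.8, 0.1, -0.3],
    "dog": [0.25, -0.48, 0.75, 0.15, -0.25],
    "the": [-0.1, 0.2, -0.3, 0.8, 0.1],
}

def euclidean(a: list[float], b: list[float]) -> float:
    """Straight-line distance between two vectors (smaller = more similar)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (closer to 1 = more similar)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

cat, dog, the = embeddings["cat"], embeddings["dog"], embeddings["the"]

print(round(euclidean(cat, dog), 2))  # 0.1  (close)
print(round(euclidean(cat, the), 2))  # 1.56 (far)

print(cosine_similarity(cat, dog))  # close to 1: nearly the same direction
print(cosine_similarity(cat, the))  # negative: pointing in different directions
```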