Tokenization, Word Embeddings & Word2Vec
From Text to Numbers
Text Processing & Embeddings
The Challenge: Text to Neural Networks
Neural networks need numbers, but we have text!
"The cat sat on the mat" β ???
Step 1: Tokenization
Break text into tokens (words, subwords, characters).
"The cat sat on the mat"
↓
["The", "cat", "sat", "on", "the", "mat"]
Step 2: Vocabulary & Indexing
Map each token to an integer (lowercasing first, so "The" and "the" share the same index).
Vocabulary:
- "cat" β 2
- "mat" β 5
- "on" β 4
- "sat" β 3
- "the" β 1
Text: "The cat sat on the mat"
↓
[1, 2, 3, 4, 1, 5]
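A sketch of building this vocabulary and encoding the sentence, assuming tokens are lowercased before indexing:

def build_vocab(tokens):
    # Assign each unique lowercased token the next integer ID, starting at 1
    vocab = {}
    for tok in tokens:
        tok = tok.lower()
        if tok not in vocab:
            vocab[tok] = len(vocab) + 1
    return vocab

tokens = ["The", "cat", "sat", "on", "the", "mat"]
vocab = build_vocab(tokens)
ids = [vocab[t.lower()] for t in tokens]
print(vocab)  # {'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5}
print(ids)    # [1, 2, 3, 4, 1, 5]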
Step 3: Convert to Vectors
Now the neural network can process it!
[1, 2, 3, 4, 1, 5] → Neural Network → Output
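A minimal sketch of this lookup step, assuming PyTorch is available: nn.Embedding is a table of learnable vectors, and indexing it with the integer IDs yields one vector per token for the rest of the network to consume.

import torch
import torch.nn as nn

# 6 rows for IDs 0-5 (ID 0 is unused here, since our vocabulary starts at 1),
# each row a 5-dimensional learnable vector
embedding = nn.Embedding(num_embeddings=6, embedding_dim=5)

ids = torch.tensor([1, 2, 3, 4, 1, 5])
vectors = embedding(ids)  # one vector per token
print(vectors.shape)      # torch.Size([6, 5])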
One-Hot Encoding (Naive Approach)
Represent each word as a sparse vector with a single 1 at its index:
Vocabulary: {cat, dog, mat, sat, the}
"cat" β [1, 0, 0, 0, 0] (one-hot)
"dog" β [0, 1, 0, 0, 0]
"the" β [0, 0, 0, 0, 1]
Problem: No semantic relationship!
- "cat" and "dog" should be similar (both animals)
- "cat" and "pizza" should be different
- But all one-hot vectors are equally distant (orthogonal), so no relationship is captured, as the check below shows
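A quick NumPy check of the problem (vocabulary order as above): every pair of distinct one-hot vectors has dot product 0, so "cat" looks exactly as unrelated to "dog" as to "the".

import numpy as np

vocab = ["cat", "dog", "mat", "sat", "the"]

def one_hot(word):
    # All zeros except a single 1 at the word's index
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

cat, dog, the = one_hot("cat"), one_hot("dog"), one_hot("the")
print(np.dot(cat, dog))  # 0.0 -- no more similar to "dog"...
print(np.dot(cat, the))  # 0.0 -- ...than to "the"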
Dense Word Embeddings (Better Approach)
Instead of sparse one-hot vectors, learn dense embeddings:
"cat" β [0.2, -0.5, 0.8, 0.1, -0.3] (5D embedding)
"dog" β [0.25, -0.48, 0.75, 0.15, -0.25] (similar!)
"the" β [-0.1, 0.2, -0.3, 0.8, 0.1] (different)
Magic: Similar words have similar vectors!
- Euclidean distance between "cat" and "dog" vectors: ≈ 0.10 → close!
- Euclidean distance between "cat" and "the" vectors: ≈ 1.56 → far!
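Checking those numbers with NumPy, using the example vectors above:

import numpy as np

cat = np.array([0.2, -0.5, 0.8, 0.1, -0.3])
dog = np.array([0.25, -0.48, 0.75, 0.15, -0.25])
the = np.array([-0.1, 0.2, -0.3, 0.8, 0.1])

# Small Euclidean distance for related words, large for unrelated ones
print(np.linalg.norm(cat - dog))  # ~0.10 -> close
print(np.linalg.norm(cat - the))  # ~1.56 -> far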