Tokenization, Word Embeddings & Word2Vec · Page 2 of 2
Word2Vec & Learning Embeddings
Word2Vec (Context-Based Embeddings)
Idea: Learn word embeddings by predicting context!
Skip-Gram Model
Train a network that, given a word, predicts its surrounding words.
"the cat sat on the mat"
For word "cat":
- Context words: ["the", "sat"] (the nearby words, with a window size of 1)
Network learns:
- Input: embed("cat")
- Output: predict ["the", "sat"]
By training on millions of examples:
- Similar words appear in similar contexts
- Similar contexts produce similar embeddings! (see the pair-extraction sketch below)
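To make the training data concrete, here is a minimal sketch of how (center, context) pairs can be extracted from a sentence; make_pairs and window_size are illustrative names, not from any library:

```python
# A minimal sketch of skip-gram training-pair extraction; make_pairs and
# window_size are illustrative names, not from any library.
def make_pairs(tokens, window_size=1):
    """Yield (center_word, context_word) pairs for skip-gram training."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:  # skip the center word itself
                yield center, tokens[j]

tokens = "the cat sat on the mat".split()
for center, context in make_pairs(tokens):
    print(center, "->", context)
# For "cat" (index 1) this yields: cat -> the, cat -> sat
```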
Example Training
Sentence: "king queen man woman"
Training examples (simplified here to predicting the next word):
- "king" → "queen"
- "queen" → "man"
- "man" → "woman"
After training:
embed("king") - embed("man") β embed("queen") - embed("woman")
Why? Both pairs are {masculine β feminine} relationships!
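As a sketch, this is how a skip-gram model can be trained with the gensim library (assumes gensim 4.x is installed: pip install gensim). A four-word toy corpus is far too small to learn real semantics; it only demonstrates the API shape:

```python
# A sketch of skip-gram training with gensim (assumes gensim 4.x installed:
# pip install gensim). This toy corpus is far too small to learn real
# semantics; it only demonstrates the API shape.
from gensim.models import Word2Vec

corpus = [["king", "queen", "man", "woman"]]  # one tokenized sentence
model = Word2Vec(
    sentences=corpus,
    vector_size=50,  # dimensionality of the embeddings
    window=2,        # context window size
    min_count=1,     # keep even words that appear once
    sg=1,            # 1 = skip-gram, 0 = CBOW
)
print(model.wv["king"].shape)  # (50,) -- one 50D vector per word
```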
Semantic Algebra (Magic!)
Pre-trained embeddings capture semantics:
embed("king") - embed("man") + embed("woman") β embed("queen")
king is to man as queen is to woman!
embed("Paris") - embed("France") + embed("Germany") β embed("Berlin")
Paris is to France as Berlin is to Germany!
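Here is a minimal sketch of this arithmetic using cosine similarity. The 3D vectors below are hypothetical and hand-picked just to show the mechanics; real embeddings are learned and typically 100-300D:

```python
# A sketch of the analogy arithmetic with cosine similarity. The 3D vectors
# below are hypothetical and hand-picked to show the pattern; real embeddings
# are learned and typically 100-300D.
import numpy as np

emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.9, 0.1, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "queen": np.array([0.1, 0.8, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target = emb["king"] - emb["man"] + emb["woman"]
best = max(emb, key=lambda w: cosine(emb[w], target))
print(best)  # queen
```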
Pre-Trained Embeddings
Don't train from scratch! Use pre-trained:
- Word2Vec: 300D vectors trained on Google News (billions of words)
- GloVe: Global vectors, captures global word co-occurrence
- FastText: Handles out-of-vocabulary words (subword information)
Advantage: These embeddings already understand language! Just load and use.
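As a sketch, pre-trained vectors can be loaded through gensim's downloader (assuming gensim is installed; "word2vec-google-news-300" is a model name in the gensim-data catalog, roughly 1.6 GB on first download):

```python
# A sketch of loading pre-trained vectors via gensim's downloader.
# "word2vec-google-news-300" is a model name in the gensim-data catalog;
# the first call downloads it (roughly 1.6 GB).
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")  # 300D Google News vectors
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# Expected top result: ("queen", <similarity score>)
```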
Embedding Dimensions
- 50D: fast, low memory, OK quality
- 100D: good balance
- 300D: high quality (the standard Word2Vec size)
- 1000D: very high quality, but slow
Most applications use 300D as the sweet spot.
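A back-of-the-envelope memory estimate shows why dimensionality matters, assuming a hypothetical 1M-word vocabulary stored as float32:

```python
# Back-of-the-envelope memory cost of an embedding table:
# vocab_size x dimensions x 4 bytes (float32). The 1M-word vocabulary
# is an assumption for illustration.
vocab_size = 1_000_000
for dims in (50, 100, 300, 1000):
    gigabytes = vocab_size * dims * 4 / 1e9
    print(f"{dims:>4}D: {gigabytes:.1f} GB")
# 50D: 0.2 GB, 100D: 0.4 GB, 300D: 1.2 GB, 1000D: 4.0 GB
```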