Tokenization, Word Embeddings & Word2Vec

Word2Vec & Learning Embeddings

Word2Vec (Context-Based Embeddings)

Idea: Learn word embeddings by predicting context!

Skip-Gram Model

Train a network that, given a word, predicts the words around it.

"the cat sat on the mat"

For word "cat":
- Context words: ["the", "sat"] (nearby words)

Network learns:
- Input: embed("cat")
- Output: predict ["the", "sat"]

By training on millions of such pairs:
- Words that appear in similar contexts
- End up with similar embeddings!
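Here is a minimal sketch (plain Python; a window size of 1 is assumed for this illustration) of how the (center word, context word) training pairs can be generated:

# Generate skip-gram training pairs from a tokenized sentence.
def skipgram_pairs(tokens, window=1):
    pairs = []
    for i, center in enumerate(tokens):
        # Look at neighbors within `window` positions of the center word
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the cat sat on the mat".split()
print(skipgram_pairs(tokens))
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ...]
# Each pair is one training example: given the center word, predict the neighbor.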

Example Training

Sentence: "king queen man woman"

Training examples (predict next word):
- "king" β†’ "queen"
- "queen" β†’ "man"
- "man" β†’ "woman"

After training (on a real corpus, not just this toy sentence):
embed("king") - embed("man") ≈ embed("queen") - embed("woman")

Why? (king, man) and (queen, woman) differ by the same "royalty" offset, just as (king, queen) and (man, woman) differ by the same {masculine → feminine} offset.
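A quick numeric sanity check of the offset idea, using tiny hand-made 3-D vectors (hypothetical values, not trained):

import numpy as np

# Hypothetical 3-D embeddings: dims ~ (royalty, masculinity, noise)
king  = np.array([0.9, 0.8, 0.1])
queen = np.array([0.9, 0.1, 0.1])
man   = np.array([0.1, 0.8, 0.0])
woman = np.array([0.1, 0.1, 0.0])

print(king - man)     # [0.8 0.  0.1] -> the "royalty" offset
print(queen - woman)  # [0.8 0.  0.1] -> the same offset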

Semantic Algebra (Magic!)

Pre-trained embeddings capture semantics:

embed("king") - embed("man") + embed("woman") β‰ˆ embed("queen")

king is to man as queen is to woman!

embed("Paris") - embed("France") + embed("Germany") β‰ˆ embed("Berlin")

Paris is to France as Berlin is to Germany!
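You can try this yourself with gensim's pre-trained vectors. A sketch, assuming gensim is installed (it downloads roughly 130 MB of GloVe vectors on first run):

import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # 100-D GloVe KeyedVectors

# king - man + woman ≈ ?
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# Typically prints something like [('queen', 0.77)]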

Pre-Trained Embeddings

Don't train from scratch! Use pre-trained embeddings:

  • Word2Vec: 300D vectors trained on Google News (billions of words)
  • GloVe: Global vectors, captures global word co-occurrence
  • FastText: Handles out-of-vocabulary words (subword information)

Advantage: these embeddings already encode word relationships learned from huge corpora. Just load and use them.
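For example, loading the classic 300-D Google News vectors with gensim looks like this (a sketch; assumes you have already downloaded GoogleNews-vectors-negative300.bin, a multi-gigabyte file):

from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)
print(wv.similarity("cat", "dog"))       # high similarity, e.g. ~0.76
print(wv.most_similar("Paris", topn=3))  # nearest words in embedding space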

Embedding Dimensions

50D embeddings:   Fast, low memory, OK quality
100D embeddings:  Good balance of speed and quality
300D embeddings:  High quality (the Word2Vec standard)
1000D embeddings: Diminishing returns; slow and memory-hungry

Most projects use 300D as the sweet spot.
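If you do train your own, the dimensionality is just a parameter. A sketch with gensim (a tiny toy corpus is assumed here, just to show the API):

from gensim.models import Word2Vec

sentences = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
]
# sg=1 selects the skip-gram architecture; vector_size sets the dimension
model = Word2Vec(sentences, vector_size=300, window=2, min_count=1, sg=1)
print(model.wv["cat"].shape)  # (300,) -- one 300-D vector per word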
