Tokenization, Word Embeddings & Word2Vec

From Text to Numbers

Text Processing & Embeddings

The Challenge: Text to Neural Networks

Neural networks need numbers, but we have text!

"The cat sat on the mat" β†’ ???

Step 1: Tokenization

Break text into tokens (words, subwords, characters).

"The cat sat on the mat"
↓
["The", "cat", "sat", "on", "the", "mat"]

Step 2: Vocabulary & Indexing

Map each token to an integer ID (tokens are lowercased first, so "The" and "the" share the same ID).

Vocabulary:
- "cat" β†’ 2
- "mat" β†’ 5
- "on" β†’ 4
- "sat" β†’ 3
- "the" β†’ 1

Text: "The cat sat on the mat"
↓
[1, 2, 3, 4, 1, 5]
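
A small sketch of this step, assuming tokens are lowercased and IDs are assigned in order of first appearance starting at 1 (0 is often reserved for padding), which reproduces the mapping above:

# Lowercase the tokens, then assign IDs in order of first appearance.
tokens = [t.lower() for t in "The cat sat on the mat".split()]

vocab = {}
for tok in tokens:
    if tok not in vocab:
        vocab[tok] = len(vocab) + 1   # start at 1; 0 kept for padding

ids = [vocab[tok] for tok in tokens]
print(vocab)  # {'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5}
print(ids)    # [1, 2, 3, 4, 1, 5]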

Step 3: Convert to Vectors

Now the neural network can process the sequence! (Under the hood, each ID is first turned into a vector, as the next sections show.)

[1, 2, 3, 4, 1, 5] → Neural Network → Output
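
A hedged sketch of this hand-off, assuming PyTorch is available: an nn.Embedding layer looks up a learned vector for each ID, and those vectors are what the network actually consumes (dense embeddings are covered below):

import torch
import torch.nn as nn

# Embedding table with room for IDs 0..5 (0 left free for padding),
# each ID mapped to a learned 5-dimensional vector.
embedding = nn.Embedding(num_embeddings=6, embedding_dim=5)

ids = torch.tensor([1, 2, 3, 4, 1, 5])
vectors = embedding(ids)
print(vectors.shape)  # torch.Size([6, 5]) -> 6 tokens, 5 dimensions each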

One-Hot Encoding (Naive Approach)

Represent each word as a vector that is all zeros except for a single 1 at that word's index:

Vocabulary: {cat, dog, mat, sat, the}

"cat" β†’ [1, 0, 0, 0, 0]  (one-hot)
"dog" β†’ [0, 1, 0, 0, 0]
"the" β†’ [0, 0, 0, 0, 1]

Problem: No semantic relationship!

  • "cat" and "dog" should be similar (both animals)
  • "cat" and "pizza" should be different
  • But one-hot gives them no relationship

Dense Word Embeddings (Better Approach)

Instead of sparse one-hot vectors, learn dense embeddings:

"cat" β†’ [0.2, -0.5, 0.8, 0.1, -0.3]  (5D embedding)
"dog" β†’ [0.25, -0.48, 0.75, 0.15, -0.25]  (similar!)
"the" β†’ [-0.1, 0.2, -0.3, 0.8, 0.1]  (different)

Magic: Similar words have similar vectors!

  • Euclidean distance between the "cat" and "dog" vectors: ≈ 0.1 ← close!
  • Euclidean distance between the "cat" and "the" vectors: ≈ 1.6 ← far! (see the sketch below)
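
A small sketch that reproduces these numbers with NumPy, assuming Euclidean distance over the 5-D example vectors above:

import numpy as np

# The 5-D example embeddings from above.
cat = np.array([0.2, -0.5, 0.8, 0.1, -0.3])
dog = np.array([0.25, -0.48, 0.75, 0.15, -0.25])
the = np.array([-0.1, 0.2, -0.3, 0.8, 0.1])

print(np.linalg.norm(cat - dog))  # ~0.10 -> "cat" and "dog" are close
print(np.linalg.norm(cat - the))  # ~1.56 -> "cat" and "the" are far apart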