Concept
One-Hot vs Embedding
VOCABULARY: [cat, dog, fish, bird, car, bus]
ONE-HOT ENCODING (sparse):
cat → [1, 0, 0, 0, 0, 0]
dog → [0, 1, 0, 0, 0, 0]
fish → [0, 0, 1, 0, 0, 0]
Problems:
- No similarity info (the cat-dog distance is the same as the cat-car distance)
- Huge vectors for large vocabularies
- No generalization between words
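A quick sketch of the problem in NumPy, using the toy vocabulary above (nothing here comes from a trained model):

import numpy as np

vocab = ['cat', 'dog', 'fish', 'bird', 'car', 'bus']
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # a single 1 at the word's index, 0 everywhere else
    vec = np.zeros(len(vocab))
    vec[index[word]] = 1.0
    return vec

cat, dog, car = one_hot('cat'), one_hot('dog'), one_hot('car')

# every pair of distinct words is equally far apart,
# so the encoding carries no similarity information
print(np.linalg.norm(cat - dog))   # 1.414...
print(np.linalg.norm(cat - car))   # 1.414... (identical)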
EMBEDDING (dense):
cat → [0.2, -0.5, 0.8, 0.1]
dog → [0.3, -0.4, 0.7, 0.2] ← Similar to cat!
fish → [0.1, 0.6, -0.3, 0.4] ← Different
car → [-0.8, 0.1, 0.2, -0.6] ← Very different
Benefits:
- Captures similarity (cat ≈ dog)
- Compact (4 dims vs 50,000+)
- Generalizes: similar words sit near each other, so what a model learns about one word transfers to its neighbors
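Cosine similarity on the toy vectors above makes the contrast concrete (the numbers are just the illustrative ones from this example, not from a trained model):

import numpy as np

embeddings = {
    'cat':  np.array([0.2, -0.5, 0.8, 0.1]),
    'dog':  np.array([0.3, -0.4, 0.7, 0.2]),
    'fish': np.array([0.1, 0.6, -0.3, 0.4]),
    'car':  np.array([-0.8, 0.1, 0.2, -0.6]),
}

def cosine(a, b):
    # 1.0 = same direction, 0 = unrelated, negative = opposite
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(embeddings['cat'], embeddings['dog']))  # ~0.98, very similar
print(cosine(embeddings['cat'], embeddings['car']))  # negative, very different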
Famous Example
Word Arithmetic
SEMANTIC ARITHMETIC (word2vec discovery):
king - man + woman ≈ queen
Vector space captures relationships:
- king and man are related (male royalty)
- queen and woman are related (female royalty)
- king - man = "royalty" concept
- "royalty" + woman = queen
MORE EXAMPLES:
Paris - France + Italy ≈ Rome
walked - walk + swim ≈ swam
bigger - big + small ≈ smaller
# Word analogy with gensim
from gensim.models import KeyedVectors

# Load pretrained word2vec vectors (binary=True for the .bin format)
model = KeyedVectors.load_word2vec_format('vectors.bin', binary=True)

# king - man + woman: add the positives, subtract the negatives
result = model.most_similar(
    positive=['king', 'woman'],
    negative=['man'],
)
# Returns: [('queen', 0.89), ...]
This works because embeddings encode relationships
as directions in vector space!
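A minimal sketch of that idea in plain NumPy: the analogy is just vector addition followed by a nearest-neighbor search. Here `vectors` is assumed to be any word-to-vector dict (for example, built from a loaded word2vec model), and the `analogy` function name is hypothetical.

import numpy as np

def analogy(vectors, a, b, c, topn=1):
    # b - a + c, e.g. king - man + woman, then find the closest remaining word
    target = vectors[b] - vectors[a] + vectors[c]
    target = target / np.linalg.norm(target)

    scored = []
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue  # skip the query words themselves
        scored.append((word, np.dot(vec, target) / np.linalg.norm(vec)))
    return sorted(scored, key=lambda x: -x[1])[:topn]

# analogy(vectors, 'man', 'king', 'woman') is expected to return [('queen', ...)]
# when the vectors come from a real word2vec model.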