Word Embeddings

Word2Vec, GloVe & FastText — Explained for Engineering Students

Last Updated: March 2026

📌 Key Takeaways

  • Definition: Word embeddings map words to dense, low-dimensional vectors that capture semantic meaning — similar words have similar vectors.
  • Problem solved: One-hot encoding has no notion of similarity. Embeddings encode relationships like king − man + woman ≈ queen.
  • Word2Vec: Learns embeddings by predicting words from context (CBOW) or context from words (Skip-gram).
  • GloVe: Learns from global co-occurrence statistics. Often slightly better than Word2Vec on analogy tasks.
  • FastText: Uses character n-grams — handles unknown words and works well for Indian languages.
  • Similarity: Measured using cosine similarity — values range from −1 to +1; +1 = identical direction.

1. The Problem with One-Hot Encoding

One-hot encoding represents each word as a vector of zeros with a single 1 at the word’s index. For a vocabulary of 50,000 words, each word becomes a 50,000-dimensional vector — mostly zeros.

This has two critical problems:

  1. High dimensionality: 50,000-dimensional vectors are computationally expensive and suffer from the curse of dimensionality.
  2. No semantic similarity: Every pair of words is equally distant from every other pair. The cosine similarity between “cat” and “kitten” is the same as between “cat” and “aeroplane” — both are 0. The model has no way to know that cats and kittens are related.

Word embeddings solve both problems: they are low-dimensional (typically 100–300 dimensions) and encode semantic relationships — “cat” and “kitten” are close; “cat” and “aeroplane” are far apart.
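The contrast is easy to verify numerically. The sketch below compares one-hot vectors against tiny hand-picked 3-dimensional "embeddings" (the dense values are invented for illustration, not learned from data):

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# One-hot vectors: every pair of distinct words is orthogonal,
# so every pairwise cosine similarity is exactly 0
vocab = ["cat", "kitten", "aeroplane"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}
print(cosine(one_hot["cat"], one_hot["kitten"]))     # 0.0
print(cosine(one_hot["cat"], one_hot["aeroplane"]))  # 0.0

# Toy dense "embeddings" (hand-picked, not learned): related words
# can now be close while unrelated words stay far apart
emb = {
    "cat":       np.array([0.9, 0.8, 0.1]),
    "kitten":    np.array([0.8, 0.9, 0.2]),
    "aeroplane": np.array([0.1, 0.1, 0.9]),
}
print(cosine(emb["cat"], emb["kitten"]))     # high, close to 1
print(cosine(emb["cat"], emb["aeroplane"]))  # low
```

With one-hot vectors the first two similarities are both exactly zero; with the dense vectors "cat" and "kitten" score near 1 while "cat" and "aeroplane" score much lower.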

2. What are Word Embeddings?

Word embeddings are dense, real-valued vector representations of words learned from large text corpora, where words with similar meanings are mapped to similar positions in vector space.

The key insight behind word embeddings is the distributional hypothesis: “words that appear in similar contexts tend to have similar meanings” (Firth, 1957). A neural network trained to predict missing words learns, as a byproduct, that words appearing in similar contexts (like “cat” and “dog” — both appear near “pet”, “fur”, “veterinarian”) should have similar representations.

The classic demonstration of learned semantic structure: king − man + woman ≈ queen. Vector arithmetic on word embeddings captures semantic relationships — gender, tense, geography, and more — all learned automatically from raw text.

3. Word2Vec — CBOW & Skip-gram

Word2Vec (Mikolov et al., Google, 2013) trains a shallow neural network to predict words from their context (or the reverse). The embedding vectors are the rows of the learned input weight matrix — they are never directly trained to be embeddings; they emerge as a byproduct of the prediction task.

CBOW — Continuous Bag of Words

Given a context window of surrounding words, predict the target (centre) word.

Example (window=2): Context = [“The”, “cat”, “_”, “on”, “the”] → Predict: “sat”

CBOW averages the context word vectors to predict the target. It is faster to train and works better for frequent words and smaller datasets.
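The averaging step can be sketched in plain NumPy. This is a toy forward pass with random weights and an invented five-word vocabulary, not gensim's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
V, D = len(vocab), 4                # toy vocabulary size and embedding dimension
W_in = rng.normal(size=(V, D))      # input embeddings (these rows become the word vectors)
W_out = rng.normal(size=(D, V))     # output weights used only for prediction

# CBOW step: average the context embeddings, then score every vocabulary word
context = ["the", "cat", "on", "the"]          # window around the target "sat"
h = np.mean([W_in[vocab[w]] for w in context], axis=0)
scores = h @ W_out
probs = np.exp(scores) / np.exp(scores).sum()  # softmax over the vocabulary

print({w: round(float(probs[i]), 3) for w, i in vocab.items()})
```

Training would adjust `W_in` and `W_out` so the probability of the true centre word ("sat") rises; the useful byproduct is `W_in`.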

Skip-gram

Given a target word, predict the surrounding context words.

Example: Input = “sat” → Predict: [“The”, “cat”, “on”, “the”]

Skip-gram trains more slowly but produces better embeddings for rare words, making it better for large vocabularies and large corpora. Skip-gram with Negative Sampling (SGNS) is the most commonly used variant in practice.
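The SGNS objective for one training pair can be written in a few lines: the centre vector is pushed towards the true context word and away from k randomly sampled "noise" words. The random vectors below are purely illustrative:

```python
import numpy as np

def sgns_loss(v_center, u_context, u_negatives):
    """Loss for one (centre, context) pair with k negative samples:
    -log sigmoid(u_o . v_c) - sum_k log sigmoid(-u_k . v_c)"""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    loss = -np.log(sigmoid(u_context @ v_center))
    loss -= np.log(sigmoid(-(u_negatives @ v_center))).sum()
    return float(loss)

rng = np.random.default_rng(42)
v_c = rng.normal(size=50)           # centre word vector, e.g. "sat"
u_ctx = rng.normal(size=50)         # a true context word, e.g. "cat"
u_neg = rng.normal(size=(5, 50))    # 5 words sampled from the noise distribution
print("SGNS loss:", sgns_loss(v_c, u_ctx, u_neg))
```

Minimising this loss over billions of pairs is what makes SGNS fast: each update touches only k + 1 output vectors instead of the whole vocabulary.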

| Feature | CBOW | Skip-gram |
|---|---|---|
| Input | Context words → predict target | Target word → predict context |
| Training speed | Faster | Slower |
| Rare words | Worse | Better |
| Best for | Small datasets, frequent words | Large corpora, rare words |

4. GloVe — Global Vectors for Word Representation

GloVe (Pennington et al., Stanford, 2014) takes a different approach from Word2Vec. Instead of using local context windows, it builds a global word co-occurrence matrix — counting how often each word pair appears together across the entire corpus — and then factorises this matrix to learn embeddings.

The key insight: the ratio of co-occurrence probabilities between word pairs encodes meaningful relationships. GloVe trains embeddings so that the dot product of two word vectors approximates the log of their co-occurrence probability.
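The objective from the GloVe paper is a weighted least-squares fit to log co-occurrence counts, with a weighting function (defaults: x_max = 100, alpha = 0.75) that down-weights rare pairs. A minimal NumPy sketch, using a hand-made 4-word co-occurrence matrix and random untrained vectors:

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe's weighting f(x): down-weights rare co-occurrences, caps at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_loss(W, W_tilde, b, b_tilde, X):
    """Weighted least-squares objective over nonzero co-occurrence counts:
    sum f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2"""
    loss = 0.0
    for i, j in zip(*np.nonzero(X)):
        diff = W[i] @ W_tilde[j] + b[i] + b_tilde[j] - np.log(X[i, j])
        loss += glove_weight(X[i, j]) * diff ** 2
    return float(loss)

rng = np.random.default_rng(0)
V, D = 4, 3                                       # toy vocabulary and dimension
X = np.array([[0, 5, 2, 0],                       # invented co-occurrence counts
              [5, 0, 1, 3],
              [2, 1, 0, 0],
              [0, 3, 0, 0]], dtype=float)
W, Wt = rng.normal(size=(V, D)), rng.normal(size=(V, D))
b, bt = np.zeros(V), np.zeros(V)
print("GloVe loss:", glove_loss(W, Wt, b, bt, X))
```

Actual training minimises this loss with AdaGrad over the nonzero entries; the sketch only evaluates it once to show the structure of the objective.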

GloVe pre-trained vectors are available in 50, 100, 200, and 300 dimensions, trained on Wikipedia and Common Crawl. For most NLP tasks, downloading and using pre-trained GloVe vectors is more practical than training Word2Vec from scratch.

5. FastText

FastText (Bojanowski et al., Facebook, 2017) extends Word2Vec by representing each word as a bag of character n-grams. For example, the word “engineering” with n=3 is represented as: <en, eng, ngi, gin, ine, nee, eer, eri, rin, ing, ng> plus the whole word.
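The n-gram decomposition itself is simple to reproduce. A small helper (our own sketch, not FastText's internal code) that adds the boundary markers `<` and `>` and keeps the whole word:

```python
def char_ngrams(word, n=3):
    """FastText-style character n-grams with boundary markers < and >."""
    marked = f"<{word}>"
    grams = [marked[i:i + n] for i in range(len(marked) - n + 1)]
    return grams + [marked]   # FastText also keeps the whole word itself

print(char_ngrams("engineering"))
# ['<en', 'eng', 'ngi', 'gin', 'ine', 'nee', 'eer', 'eri', 'rin', 'ing', 'ng>', '<engineering>']
```

A word's embedding is then the sum of the vectors of these pieces, which is why an unseen word still gets a sensible vector: its pieces have been seen.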

Key advantages:

  • Out-of-vocabulary words: FastText can generate embeddings for words never seen during training by composing their character n-gram vectors. Word2Vec and GloVe cannot handle unknown words at all.
  • Morphologically rich languages: Works much better for Hindi, Tamil, German, and other languages with complex morphology where many word forms exist.
  • Misspellings: Since similar character sequences have similar embeddings, FastText is more robust to typos and spelling variations.

FastText is the recommended choice for Indian language NLP and any application where out-of-vocabulary words are expected.

6. Comparison Table

| Feature | Word2Vec | GloVe | FastText |
|---|---|---|---|
| Developer | Google (2013) | Stanford (2014) | Facebook (2017) |
| Training approach | Local context window | Global co-occurrence matrix | Character n-grams + context |
| OOV words | Cannot handle | Cannot handle | Handles via subwords |
| Rare words | Poor (Skip-gram better) | Moderate | Good |
| Morphology | No | No | Yes |
| Pre-trained vectors | Google News (3M words) | Wikipedia, Common Crawl | 157 languages available |
| Best for | General English NLP | Analogy tasks, general NLP | Indian languages, OOV words |

7. Cosine Similarity

The similarity between two word vectors is measured using cosine similarity — the cosine of the angle between the vectors:

cosine_similarity(A, B) = (A · B) / (||A|| × ||B||)
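The formula is one line of NumPy, and a few contrived vectors confirm the range of values:

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(theta) = (A . B) / (||A|| x ||B||)"""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
print(round(cosine_similarity(a, a), 4))       # 1.0  : identical direction
print(round(cosine_similarity(a, 2 * a), 4))   # 1.0  : magnitude is ignored
print(round(cosine_similarity(a, -a), 4))      # -1.0 : opposite direction
print(round(cosine_similarity(np.array([1.0, 0.0]),
                              np.array([0.0, 1.0])), 4))  # 0.0 : orthogonal
```

Because only the angle matters, cosine similarity is unaffected by vector length, which is why it is preferred over Euclidean distance for comparing embeddings.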

| Value | Interpretation | Example |
|---|---|---|
| +1.0 | Identical direction — most similar | “king” vs “king” |
| ~0.8 | Very similar | “cat” vs “kitten” |
| ~0.5 | Somewhat related | “cat” vs “animal” |
| ~0.0 | Unrelated | “cat” vs “democracy” |
| −1.0 | Opposite direction | “good” vs “bad” (in some spaces) |

8. Word Analogies — The Famous King − Man + Woman = Queen

One of the most striking demonstrations of word embedding quality is that vector arithmetic captures semantic relationships:

  • Gender: king − man + woman ≈ queen
  • Capital cities: Paris − France + Germany ≈ Berlin
  • Verb tense: running − run + walk ≈ walking
  • Comparative: bigger − big + small ≈ smaller

This happens because embeddings encode relationships as directions in vector space. The “royalty” direction is similar for both male and female royals; the “gender” direction is consistent across many word pairs. Linear combinations of these directions produce the observed analogies.

9. Python Code


```python
import gensim.downloader as api
from gensim.models import Word2Vec, FastText

# --- Load pre-trained Word2Vec (Google News) ---
# Note: large download (~1.6 GB) — use a smaller model for testing
model_w2v = api.load('word2vec-google-news-300')

# Word similarity
print("Similarity (cat, kitten):", model_w2v.similarity('cat', 'kitten'))
print("Similarity (cat, dog):",    model_w2v.similarity('cat', 'dog'))
print("Similarity (cat, car):",    model_w2v.similarity('cat', 'car'))

# Most similar words
print("\nMost similar to 'engineering':")
print(model_w2v.most_similar('engineering', topn=5))

# Word analogy: king - man + woman = ?
result = model_w2v.most_similar(positive=['king', 'woman'], negative=['man'])
print(f"\nking - man + woman = {result[0][0]}")

# --- Load pre-trained GloVe (smaller, faster) ---
model_glove = api.load('glove-wiki-gigaword-100')
print("\nGloVe similarity (Paris, London):", model_glove.similarity('paris', 'london'))

# --- Train Word2Vec from scratch ---
sentences = [
    ["machine", "learning", "is", "powerful"],
    ["deep", "learning", "uses", "neural", "networks"],
    ["machine", "learning", "and", "deep", "learning", "are", "related"]
]
custom_model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, epochs=100)
print("\nCustom model vector for 'learning':", custom_model.wv['learning'][:5], "...")

# --- FastText for OOV words ---
ft_model = FastText(sentences, vector_size=100, window=5, min_count=1, epochs=100)
# FastText composes subword vectors, so even an unseen (misspelled) word gets a vector
print("FastText OOV vector for 'learningg' (typo):", ft_model.wv['learningg'][:5], "...")
```

10. Frequently Asked Questions

Should I train my own embeddings or use pre-trained ones?

Use pre-trained embeddings unless you have a very large domain-specific corpus (millions of documents) and the domain is significantly different from general text (e.g., medical jargon, legal text, code). Pre-trained GloVe or FastText vectors work excellently as starting points and can be fine-tuned on your task. Training from scratch requires massive data and compute that most projects do not have.

Are word embeddings still relevant with transformers?

Traditional static word embeddings (Word2Vec, GloVe) have largely been replaced by contextual embeddings from transformers (BERT, GPT) for state-of-the-art NLP. However, static embeddings are still widely used in production for: low-latency applications (they are much faster), resource-constrained environments (smaller models), and as features for simple ML models. Understanding word embeddings also provides the conceptual foundation for understanding transformer embeddings.

Next Steps