Word Embeddings

Word2Vec, GloVe & FastText — Explained for Engineering Students

Last Updated: March 2026

📌 Key Takeaways

  • Definition: Word embeddings map words to dense, low-dimensional vectors that capture semantic meaning — similar words have similar vectors.
  • Problem solved: One-hot encoding has no notion of similarity. Embeddings encode relationships like king − man + woman ≈ queen.
  • Word2Vec: Learns embeddings by predicting words from context (CBOW) or context from words (Skip-gram).
  • GloVe: Learns from global co-occurrence statistics. Often slightly better than Word2Vec on analogy tasks.
  • FastText: Uses character n-grams — handles unknown words and works well for Indian languages.
  • Similarity: Measured using cosine similarity — values range from −1 to +1; +1 = identical direction.

1. The Problem with One-Hot Encoding

One-hot encoding represents each word as a vector of zeros with a single 1 at the word’s index. For a vocabulary of 50,000 words, each word becomes a 50,000-dimensional vector — mostly zeros.

This has two critical problems:

  1. High dimensionality: 50,000-dimensional vectors are computationally expensive and suffer from the curse of dimensionality.
  2. No semantic similarity: Every pair of words is equally distant from every other pair. The cosine similarity between “cat” and “kitten” is the same as between “cat” and “aeroplane” — both are 0. The model has no way to know that cats and kittens are related.

Word embeddings solve both problems: they are low-dimensional (typically 100–300 dimensions) and encode semantic relationships — “cat” and “kitten” are close; “cat” and “aeroplane” are far apart.
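The contrast is easy to verify numerically. The sketch below compares one-hot vectors against tiny hand-picked 3-dimensional "embeddings" (the dense values are invented for illustration, not learned from data):

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# One-hot vectors: every pair of distinct words is orthogonal,
# so every pairwise cosine similarity is exactly 0
vocab = ["cat", "kitten", "aeroplane"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}
print(cosine(one_hot["cat"], one_hot["kitten"]))     # 0.0
print(cosine(one_hot["cat"], one_hot["aeroplane"]))  # 0.0

# Toy dense "embeddings" (hand-picked, not learned): related words
# can now be close while unrelated words stay far apart
emb = {
    "cat":       np.array([0.9, 0.8, 0.1]),
    "kitten":    np.array([0.8, 0.9, 0.2]),
    "aeroplane": np.array([0.1, 0.1, 0.9]),
}
print(cosine(emb["cat"], emb["kitten"]))     # high, close to 1
print(cosine(emb["cat"], emb["aeroplane"]))  # low
```

With one-hot vectors the first two similarities are both exactly zero; with the dense vectors "cat" and "kitten" score near 1 while "cat" and "aeroplane" score much lower.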

2. What are Word Embeddings?

Word embeddings are dense, real-valued vector representations of words learned from large text corpora, where words with similar meanings are mapped to similar positions in vector space.

The key insight behind word embeddings is the distributional hypothesis: “words that appear in similar contexts tend to have similar meanings” (Firth, 1957). A neural network trained to predict missing words learns, as a byproduct, that words appearing in similar contexts (like “cat” and “dog” — both appear near “pet”, “fur”, “veterinarian”) should have similar representations.

The classic demonstration of learned semantic structure: king − man + woman ≈ queen. Vector arithmetic on word embeddings captures semantic relationships — gender, tense, geography, and more — all learned automatically from raw text.

3. Word2Vec — CBOW & Skip-gram

Word2Vec (Mikolov et al., Google, 2013) trains a shallow neural network to predict words from their context (or the reverse). The embedding vectors are the rows of the learned input weight matrix — they are never directly trained to be embeddings; they emerge as a byproduct of the prediction task.

CBOW — Continuous Bag of Words

Given a context window of surrounding words, predict the target (centre) word.

Example (window=2): Context = [“The”, “cat”, “_”, “on”, “the”] → Predict: “sat”

CBOW averages the context word vectors to predict the target. It is faster to train and works better for frequent words and smaller datasets.
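The averaging step can be sketched in plain NumPy. This is a toy forward pass with random weights and an invented five-word vocabulary, not gensim's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
V, D = len(vocab), 4                # toy vocabulary size and embedding dimension
W_in = rng.normal(size=(V, D))      # input embeddings (these rows become the word vectors)
W_out = rng.normal(size=(D, V))     # output weights used only for prediction

# CBOW step: average the context embeddings, then score every vocabulary word
context = ["the", "cat", "on", "the"]          # window around the target "sat"
h = np.mean([W_in[vocab[w]] for w in context], axis=0)
scores = h @ W_out
probs = np.exp(scores) / np.exp(scores).sum()  # softmax over the vocabulary

print({w: round(float(probs[i]), 3) for w, i in vocab.items()})
```

Training would adjust `W_in` and `W_out` so the probability of the true centre word ("sat") rises; the useful byproduct is `W_in`.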

Skip-gram

Given a target word, predict the surrounding context words.

Example: Input = “sat” → Predict: [“The”, “cat”, “on”, “the”]

Skip-gram trains more slowly but produces better embeddings for rare words, making it better for large vocabularies and large corpora. Skip-gram with Negative Sampling (SGNS) is the most commonly used variant in practice.
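The SGNS objective for one training pair can be written in a few lines: the centre vector is pushed towards the true context word and away from k randomly sampled "noise" words. The random vectors below are purely illustrative:

```python
import numpy as np

def sgns_loss(v_center, u_context, u_negatives):
    """Loss for one (centre, context) pair with k negative samples:
    -log sigmoid(u_o . v_c) - sum_k log sigmoid(-u_k . v_c)"""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    loss = -np.log(sigmoid(u_context @ v_center))
    loss -= np.log(sigmoid(-(u_negatives @ v_center))).sum()
    return float(loss)

rng = np.random.default_rng(42)
v_c = rng.normal(size=50)           # centre word vector, e.g. "sat"
u_ctx = rng.normal(size=50)         # a true context word, e.g. "cat"
u_neg = rng.normal(size=(5, 50))    # 5 words sampled from the noise distribution
print("SGNS loss:", sgns_loss(v_c, u_ctx, u_neg))
```

Minimising this loss over billions of pairs is what makes SGNS fast: each update touches only k + 1 output vectors instead of the whole vocabulary.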

| Feature | CBOW | Skip-gram |
|---|---|---|
| Input | Context words → predict target | Target word → predict context |
| Training speed | Faster | Slower |
| Rare words | Worse | Better |
| Best for | Small datasets, frequent words | Large corpora, rare words |

4. GloVe — Global Vectors for Word Representation

GloVe (Pennington et al., Stanford, 2014) takes a different approach from Word2Vec. Instead of using local context windows, it builds a global word co-occurrence matrix — counting how often each word pair appears together across the entire corpus — and then factorises this matrix to learn embeddings.

The key insight: the ratio of co-occurrence probabilities between word pairs encodes meaningful relationships. GloVe trains embeddings so that the dot product of two word vectors approximates the log of their co-occurrence probability.
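The objective from the GloVe paper is a weighted least-squares fit to log co-occurrence counts, with a weighting function (defaults: x_max = 100, alpha = 0.75) that down-weights rare pairs. A minimal NumPy sketch, using a hand-made 4-word co-occurrence matrix and random untrained vectors:

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe's weighting f(x): down-weights rare co-occurrences, caps at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_loss(W, W_tilde, b, b_tilde, X):
    """Weighted least-squares objective over nonzero co-occurrence counts:
    sum f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2"""
    loss = 0.0
    for i, j in zip(*np.nonzero(X)):
        diff = W[i] @ W_tilde[j] + b[i] + b_tilde[j] - np.log(X[i, j])
        loss += glove_weight(X[i, j]) * diff ** 2
    return float(loss)

rng = np.random.default_rng(0)
V, D = 4, 3                                       # toy vocabulary and dimension
X = np.array([[0, 5, 2, 0],                       # invented co-occurrence counts
              [5, 0, 1, 3],
              [2, 1, 0, 0],
              [0, 3, 0, 0]], dtype=float)
W, Wt = rng.normal(size=(V, D)), rng.normal(size=(V, D))
b, bt = np.zeros(V), np.zeros(V)
print("GloVe loss:", glove_loss(W, Wt, b, bt, X))
```

Actual training minimises this loss with AdaGrad over the nonzero entries; the sketch only evaluates it once to show the structure of the objective.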

GloVe pre-trained vectors are available in 50, 100, 200, and 300 dimensions, trained on Wikipedia and Common Crawl. For most NLP tasks, downloading and using pre-trained GloVe vectors is more practical than training Word2Vec from scratch.

5. FastText

FastText (Bojanowski et al., Facebook, 2017) extends Word2Vec by representing each word as a bag of character n-grams. For example, the word “engineering” with n=3 is represented as: <en, eng, ngi, gin, ine, nee, eer, eri, rin, ing, ng> plus the whole word.
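The n-gram decomposition itself is simple to reproduce. A small helper (our own sketch, not FastText's internal code) that adds the boundary markers `<` and `>` and keeps the whole word:

```python
def char_ngrams(word, n=3):
    """FastText-style character n-grams with boundary markers < and >."""
    marked = f"<{word}>"
    grams = [marked[i:i + n] for i in range(len(marked) - n + 1)]
    return grams + [marked]   # FastText also keeps the whole word itself

print(char_ngrams("engineering"))
# ['<en', 'eng', 'ngi', 'gin', 'ine', 'nee', 'eer', 'eri', 'rin', 'ing', 'ng>', '<engineering>']
```

A word's embedding is then the sum of the vectors of these pieces, which is why an unseen word still gets a sensible vector: its pieces have been seen.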

Key advantages:

  • Out-of-vocabulary words: FastText can generate embeddings for words never seen during training by composing their character n-gram vectors. Word2Vec and GloVe cannot handle unknown words at all.
  • Morphologically rich languages: Works much better for Hindi, Tamil, German, and other languages with complex morphology where many word forms exist.
  • Misspellings: Since similar character sequences have similar embeddings, FastText is more robust to typos and spelling variations.

FastText is the recommended choice for Indian language NLP and any application where out-of-vocabulary words are expected.

6. Comparison Table

| Feature | Word2Vec | GloVe | FastText |
|---|---|---|---|
| Developer | Google (2013) | Stanford (2014) | Facebook (2017) |
| Training approach | Local context window | Global co-occurrence matrix | Character n-grams + context |
| OOV words | Cannot handle | Cannot handle | Handles via subwords |
| Rare words | Poor (Skip-gram better) | Moderate | Good |
| Morphology | No | No | Yes |
| Pre-trained vectors | Google News (3M words) | Wikipedia, Common Crawl | 157 languages available |
| Best for | General English NLP | Analogy tasks, general NLP | Indian languages, OOV words |

7. Cosine Similarity

The similarity between two word vectors is measured using cosine similarity — the cosine of the angle between the vectors:

cosine_similarity(A, B) = (A · B) / (||A|| × ||B||)
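The formula is one line of NumPy, and a few contrived vectors confirm the range of values:

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(theta) = (A . B) / (||A|| x ||B||)"""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
print(round(cosine_similarity(a, a), 4))       # 1.0  : identical direction
print(round(cosine_similarity(a, 2 * a), 4))   # 1.0  : magnitude is ignored
print(round(cosine_similarity(a, -a), 4))      # -1.0 : opposite direction
print(round(cosine_similarity(np.array([1.0, 0.0]),
                              np.array([0.0, 1.0])), 4))  # 0.0 : orthogonal
```

Because only the angle matters, cosine similarity is unaffected by vector length, which is why it is preferred over Euclidean distance for comparing embeddings.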

| Value | Interpretation | Example |
|---|---|---|
| +1.0 | Identical direction — most similar | “king” vs “king” |
| ~0.8 | Very similar | “cat” vs “kitten” |
| ~0.5 | Somewhat related | “cat” vs “animal” |
| ~0.0 | Unrelated | “cat” vs “democracy” |
| −1.0 | Opposite direction | “good” vs “bad” (in some spaces) |

8. Word Analogies — The Famous King − Man + Woman = Queen

One of the most striking demonstrations of word embedding quality is that vector arithmetic captures semantic relationships:

  • Gender: king − man + woman ≈ queen
  • Capital cities: Paris − France + Germany ≈ Berlin
  • Verb tense: running − run + walk ≈ walking
  • Comparative: bigger − big + small ≈ smaller

This happens because embeddings encode relationships as directions in vector space. The “royalty” direction is similar for both male and female royals; the “gender” direction is consistent across many word pairs. Linear combinations of these directions produce the observed analogies.

9. Python Code


```python
import gensim.downloader as api
from gensim.models import Word2Vec, FastText

# --- Load pre-trained Word2Vec (Google News) ---
# Note: large download (~1.6 GB) — use a smaller model for testing
model_w2v = api.load('word2vec-google-news-300')

# Word similarity
print("Similarity (cat, kitten):", model_w2v.similarity('cat', 'kitten'))
print("Similarity (cat, dog):",    model_w2v.similarity('cat', 'dog'))
print("Similarity (cat, car):",    model_w2v.similarity('cat', 'car'))

# Most similar words
print("\nMost similar to 'engineering':")
print(model_w2v.most_similar('engineering', topn=5))

# Word analogy: king - man + woman = ?
result = model_w2v.most_similar(positive=['king', 'woman'], negative=['man'])
print(f"\nking - man + woman = {result[0][0]}")

# --- Load pre-trained GloVe (smaller, faster) ---
model_glove = api.load('glove-wiki-gigaword-100')
print("\nGloVe similarity (Paris, London):", model_glove.similarity('paris', 'london'))

# --- Train Word2Vec from scratch ---
sentences = [
    ["machine", "learning", "is", "powerful"],
    ["deep", "learning", "uses", "neural", "networks"],
    ["machine", "learning", "and", "deep", "learning", "are", "related"]
]
custom_model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, epochs=100)
print("\nCustom model vector for 'learning':", custom_model.wv['learning'][:5], "...")

# --- FastText for OOV words ---
ft_model = FastText(sentences, vector_size=100, window=5, min_count=1, epochs=100)
# FastText composes subword vectors, so even an unseen (misspelled) word gets a vector
print("FastText OOV vector for 'learningg' (typo):", ft_model.wv['learningg'][:5], "...")
```

10. Frequently Asked Questions

Should I train my own embeddings or use pre-trained ones?

Use pre-trained embeddings unless you have a very large domain-specific corpus (millions of documents) and the domain is significantly different from general text (e.g., medical jargon, legal text, code). Pre-trained GloVe or FastText vectors work excellently as starting points and can be fine-tuned on your task. Training from scratch requires massive data and compute that most projects do not have.

Are word embeddings still relevant with transformers?

Traditional static word embeddings (Word2Vec, GloVe) have largely been replaced by contextual embeddings from transformers (BERT, GPT) for state-of-the-art NLP. However, static embeddings are still widely used in production for: low-latency applications (they are much faster), resource-constrained environments (smaller models), and as features for simple ML models. Understanding word embeddings also provides the conceptual foundation for understanding transformer embeddings.

Next Steps