From Word2Vec to Embeddings: How Text Became Vectors

For most of computing history, a machine had no notion that "king" and "queen" were related. To a program they were just two different strings, as alike or unlike as "king" and "xylophone." The standard representation, one-hot encoding, made this explicit and absurd: every word was a vector of all zeros with a single 1 in its own slot, so every pair of words was exactly equidistant and orthogonal. You could store text, search it, and count it, but you couldn't compute meaning. The idea that fixed that, and quietly set up everything happening in NLP right now, is the embedding: representing a word as a dense vector where geometry encodes meaning.

This is a foundations piece, written from the vantage point of late 2018, as contextual models are just arriving. I'll trace the line from the one insight that makes it all work (the distributional hypothesis) through word2vec and GloVe, the surprising vector arithmetic they enabled, the hard limits of "one vector per word," and the contextual turn that's beginning. It's the on-ramp to the Transformer and, later, to vector search and retrieval.

The distributional hypothesis

The whole field rests on a 1950s linguistics observation: "you shall know a word by the company it keeps." Words that appear in similar contexts tend to have similar meanings. "Coffee" and "tea" show up around the same neighbors (drink, cup, morning, hot), so whatever they mean, it's related. This is the distributional hypothesis, and its power is that it turns a philosophical problem (what does a word mean?) into a statistical one you can learn from raw text: look at the contexts a word appears in, and let those contexts define it.

That reframing is everything. It means meaning can be learned from unlabeled text (no dictionary, no annotation, just a large corpus) by training a model to predict a word's context. The vector that falls out of that training is the embedding.
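
To make "let those contexts define it" concrete, here is a minimal sketch that counts which words appear near each word in a tiny corpus. The corpus, the window size, and the helper names (`context_counts`, `shared_contexts`) are all made up for illustration; a real system would use a far larger corpus and learn vectors rather than compare raw neighbor sets.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus, invented for this example.
corpus = [
    "i drink hot coffee every morning",
    "i drink hot tea every morning",
    "she poured coffee into a cup",
    "she poured tea into a cup",
    "he played the xylophone on stage",
]

def context_counts(sentences, window=2):
    """Count, for each word, the words seen within `window` positions of it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts

def shared_contexts(counts, a, b):
    """Context words that both `a` and `b` have been seen next to."""
    return set(counts[a]) & set(counts[b])

counts = context_counts(corpus)
print(sorted(shared_contexts(counts, "coffee", "tea")))
print(sorted(shared_contexts(counts, "coffee", "xylophone")))
```

In this toy corpus "coffee" and "tea" share every one of their neighbors, while "coffee" and "xylophone" share none. That overlap is the raw signal an embedding model compresses into a vector.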

Word2vec: predicting context

Word2vec (2013) made this practical and fast. It's a shallow neural network with a deceptively simple training task, in one of two flavors:

  • Skip-gram: given a word, predict the words around it. Show it "coffee" and it learns to predict "cup," "morning," "hot."
  • CBOW (continuous bag of words): the reverse. Given the surrounding words, predict the missing center word.
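
The two flavors differ only in which side of the window is the input and which is the target. The sketch below builds the training examples for each; the function names and the window size are illustrative choices, not taken from any particular word2vec implementation.

```python
def skipgram_pairs(tokens, window=2):
    """(center, context) pairs: the center word is the input, each neighbor a target."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """(context_words, center) examples: the neighbors are the input, the center the target."""
    examples = []
    for i, center in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        examples.append((context, center))
    return examples

tokens = "hot coffee in a cup".split()
print(skipgram_pairs(tokens, window=1))
print(cbow_examples(tokens, window=1))
```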

You don't actually care about the prediction. You care about the weights the network learns along the way: after training on billions of words, each word's row in the input weight matrix is its embedding, a dense vector of a few hundred numbers. Words that predict similar contexts end up with similar vectors, exactly as the distributional hypothesis promised. A clever trick called negative sampling made training tractable: instead of updating against the entire vocabulary every step (hopelessly expensive), the model just learns to tell the real context words from a handful of random "negative" words, turning one enormous multi-class classification into a few cheap binary ones.
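
A rough NumPy sketch of one negative-sampling update follows. It is a simplified illustration, not the reference implementation: the vocabulary size, dimension, learning rate, and the names `W_in`, `W_out`, `sample_negatives`, and `sgns_step` are assumptions, and negatives are drawn uniformly here, whereas the original word2vec samples them from a smoothed word-frequency distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 50, 8   # toy sizes, assumed for illustration

# Row i of W_in is word i's embedding (the part you keep);
# W_out holds the context-side vectors used only for prediction.
W_in = rng.normal(scale=0.1, size=(vocab_size, dim))
W_out = rng.normal(scale=0.1, size=(vocab_size, dim))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_negatives(k, exclude):
    """Draw k random word ids, skipping the true context word (uniform: a simplification)."""
    out = []
    while len(out) < k:
        w = int(rng.integers(0, vocab_size))
        if w != exclude:
            out.append(w)
    return out

def sgns_step(center, context, negatives, lr=0.05):
    """One SGD step on a (center, context) pair plus its negatives; returns the pre-update loss."""
    v = W_in[center].copy()
    targets = np.array([context] + list(negatives))
    labels = np.zeros(len(targets))
    labels[0] = 1.0                          # real context = 1, negatives = 0
    u = W_out[targets]                       # shape (k + 1, dim); indexing with an array copies
    scores = u @ v                           # k + 1 independent binary decisions
    signs = np.where(labels == 1.0, 1.0, -1.0)
    loss = np.sum(np.logaddexp(0.0, -signs * scores))   # sum of -log sigmoid(sign * score)
    err = sigmoid(scores) - labels           # gradient of the loss w.r.t. each score
    W_in[center] -= lr * (err @ u)
    np.subtract.at(W_out, targets, lr * np.outer(err, v))  # safe if an id repeats
    return float(loss)

negatives = sample_negatives(k=5, exclude=7)
losses = [sgns_step(center=3, context=7, negatives=negatives) for _ in range(200)]
print(losses[0], losses[-1])
```

Each step touches only six rows of `W_out` (one real context plus five negatives) instead of all fifty, which is the whole point: the cost per update no longer grows with the vocabulary.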

graph LR
    ONEHOT["One-hot world:<br/>every word orthogonal,<br/>all pairs equidistant<br/>(no meaning)"]
    TRAIN["Train on context<br/>(skip-gram / CBOW<br/>+ negative sampling)"]
    SPACE["Dense vector space:<br/>'coffee' near 'tea',<br/>'king' near 'queen':<br/>meaning becomes direction"]
    ONEHOT --> TRAIN --> SPACE

The shift word2vec made. One-hot vectors carry no relationships: every word is equally far from every other. Training a shallow network to predict context collapses words into a few-hundred-dimensional space where distance and direction encode meaning: similar words cluster, and relationships become consistent geometric offsets.
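
The one-hot half of that contrast is easy to check numerically. In this small sketch the four-word vocabulary is invented for illustration; the point is only that every dot product is zero and every distance is the same.

```python
import numpy as np

# Hypothetical toy vocabulary, chosen only for illustration.
vocab = ["king", "queen", "coffee", "xylophone"]

# Row i of the identity matrix is the one-hot vector for word i.
one_hot = np.eye(len(vocab))

def vec(word):
    return one_hot[vocab.index(word)]

def dot(a, b):
    return float(vec(a) @ vec(b))

def distance(a, b):
    return float(np.linalg.norm(vec(a) - vec(b)))

for other in ["queen", "coffee", "xylophone"]:
    print("king", other, dot("king", other), round(distance("king", other), 3))
```

"King" is exactly as far from "queen" as it is from "xylophone" (the square root of 2 in each case), so the representation has nothing to say about which pair is related.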

Meaning as direction: the analogy trick

The result that made word2vec famous is that the geometry isn't just about closeness; directions in the space carry meaning too. The canonical demonstration:

vec("king") − vec("man") + vec("woman") ≈ vec("queen")

The vector you get from "king" minus "man" plus "woman" lands closest to "queen" (in practice, once the three input words themselves are excluded from the nearest-neighbor search). The offset from "man" to "woman" is roughly the same as the offset from "king" to "queen": the model learned a "gender" direction without ever being told gender exists. Similar consistent offsets show up for capital-of-country, verb tense, and singular-plural. Nobody designed these axes; they emerged from co-occurrence statistics alone. That's the moment a lot of people, myself included, realized this was something deeper than a lookup table.
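
The mechanics of the analogy query can be sketched in a few lines. Note the heavy caveat: the three-dimensional vectors below are hand-made for illustration, so the result holds by construction. With real trained embeddings the same arithmetic is applied to learned vectors of a few hundred dimensions, and the `analogy` helper is a hypothetical name, not a library function.

```python
import numpy as np

# Hand-made toy vectors (NOT trained): axis 0 ~ royalty, axis 1 ~ gender, axis 2 ~ other.
vectors = {
    "king":  np.array([0.9,  0.8, 0.1]),
    "queen": np.array([0.9, -0.8, 0.1]),
    "man":   np.array([0.1,  0.9, 0.0]),
    "woman": np.array([0.1, -0.9, 0.1]),
    "apple": np.array([0.0,  0.0, 1.0]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c):
    """Word closest to vec(a) - vec(b) + vec(c), skipping the three input words."""
    target = vectors[a] - vectors[b] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("king", "man", "woman"))

# The two "gender" offsets point in nearly the same direction.
print(cosine(vectors["woman"] - vectors["man"], vectors["queen"] - vectors["king"]))
```

Skipping the input words matters: the raw result vector often sits closest to one of the words it was built from, so analogy evaluations conventionally leave those out of the search.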

GloVe and fastText: variations on the theme

Two close relatives round out this generation. GloVe (2014) reaches similar embeddings from a different angle: instead of sliding a prediction window, it factorizes a global word co-occurrence matrix, baking in corpus-wide statistics directly. fastText (2016) adds a fix for a real weakness: it represents a word as the sum of the vectors of its character n-grams (sub-word pieces), so it can build a reasonable vector for a word it never saw in training (a typo, a rare inflection) by composing the pieces. That is something word2vec, which only knows whole words, simply can't do.
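
The sub-word idea can be sketched as follows. This is a simplified illustration rather than fastText's actual code: the n-gram range, the table size, the use of `zlib.crc32` as the hash, and the random (untrained) n-gram table are all assumptions made for the example.

```python
import zlib
import numpy as np

def char_ngrams(word, n_min=3, n_max=4):
    """Character n-grams of a word wrapped in boundary markers '<' and '>'."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(wrapped) - n + 1)]

dim, buckets = 16, 1000
rng = np.random.default_rng(0)
# Stand-in for a trained n-gram table: random here, learned in a real model.
ngram_table = rng.normal(size=(buckets, dim))

def word_vector(word):
    """Sum the vectors of the word's n-grams, hashing each n-gram to a table row."""
    rows = [zlib.crc32(g.encode("utf-8")) % buckets for g in char_ngrams(word)]
    return ngram_table[rows].sum(axis=0)

print(char_ngrams("where", 3, 3))
shared = set(char_ngrams("coffee", 3, 3)) & set(char_ngrams("cofee", 3, 3))
print(sorted(shared))
print(word_vector("cofee").shape)
```

The misspelling "cofee" was never a vocabulary entry, yet it still gets a vector, and that vector is built partly from the same pieces as "coffee." A whole-word lookup table has nothing to return in that case.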

The limit that breaks everything: one vector per word

Here's the wall this generation hits, and it's fundamental. These embeddings are static: each word gets exactly one vector, no matter how it's used. But words are polysemous. Consider:

  • "I sat on the bank of the river."
  • "I deposited the check at the bank."

Word2vec gives "bank" a single vector: an awkward blur of both senses, anchored wherever the training data leaned. It can't tell the river from the financial institution, because at lookup time it never sees the sentence; it only ever sees the word. For any task where context decides meaning (which is most of language), a fixed per-word vector is a ceiling you can't break by adding more data.
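
In code, the limitation is almost embarrassingly plain. The sketch below uses a random table as a stand-in for trained word2vec vectors (an assumption for illustration); the lookup behaves the same way with real ones.

```python
import numpy as np

rng = np.random.default_rng(0)
sentences = ["I sat on the bank of the river.", "I deposited the check at the bank."]

def tokenize(sentence):
    return sentence.lower().rstrip(".").split()

# Hypothetical static table: one fixed vector per word.
vocab = sorted({tok for s in sentences for tok in tokenize(s)})
table = {word: rng.normal(size=4) for word in vocab}

def embed(sentence):
    """Static embedding: every token is looked up on its own, ignoring its neighbors."""
    return {tok: table[tok] for tok in tokenize(sentence)}

bank_by_river = embed(sentences[0])["bank"]
bank_with_check = embed(sentences[1])["bank"]
print(np.array_equal(bank_by_river, bank_with_check))  # True: same vector in both sentences
```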

This static-vector limit is the seam the whole field is splitting along right now. If meaning depends on context, the embedding has to depend on context too: the vector for "bank" should be computed for this sentence, not looked up from a table. That's the entire premise of the contextual models arriving as I write this, and it's why the next chapter of NLP isn't "bigger word vectors" but "vectors that change with their neighbors."

The contextual turn

The answer, just landing, is contextual embeddings: instead of one fixed vector per word, a model reads the whole sentence and produces a vector for each word in that context. ELMo (early 2018) did this with deep bidirectional LSTMs, and the gains across NLP tasks were large enough that it was obvious the direction was right. The architecture that's about to make this dramatically more effective, by replacing recurrence with attention so a model can weigh every other word directly, is the Transformer, and the pretrain-then-fine-tune models built on it are the subject of the next piece.
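
As a taste of what "weigh every other word directly" means, here is a deliberately stripped-down sketch: each word's contextual vector is a softmax-weighted mix of the static vectors in its sentence. This is not ELMo (which uses LSTMs) and not a real Transformer layer (which adds learned query, key, and value projections, multiple heads, and more); the two-dimensional vectors, the `scale` value, and the `contextual` helper are assumptions made for illustration.

```python
import numpy as np

# Hand-made static vectors: axis 0 ~ "nature", axis 1 ~ "finance".
static = {
    "river": np.array([1.0, 0.0]),
    "money": np.array([0.0, 1.0]),
    "bank":  np.array([0.5, 0.5]),
}

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def contextual(tokens, i, scale=4.0):
    """Vector for tokens[i] as an attention-weighted mix of every token in the sentence."""
    X = np.stack([static[t] for t in tokens])
    weights = softmax(scale * (X @ X[i]))   # how much tokens[i] attends to each token
    return weights @ X

bank_river = contextual(["river", "bank"], i=1)
bank_money = contextual(["money", "bank"], i=1)
print(bank_river, bank_money)
```

The same word now gets two different vectors: next to "river" it leans toward the nature axis, next to "money" toward the finance axis. That dependence on neighbors is exactly what a static table cannot provide.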

But notice what carries forward unchanged: the core idea that text becomes vectors and meaning is geometry. Contextual models produce better, context-aware vectors; they don't abandon the embedding. And that same geometry, applied to whole documents rather than single words, is exactly what later powers semantic search and retrieval: embed a query and your documents into the same space, and "relevant" becomes "nearby," measured by cosine distance.
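
The retrieval step itself is small enough to sketch. The document texts and their three-dimensional vectors below are invented for illustration (a real system would get them from an embedding model), and `top_k` is a hypothetical helper name.

```python
import numpy as np

def cosine_similarity(query, docs):
    """Cosine similarity between one query vector and each row of `docs`."""
    q = query / np.linalg.norm(query)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    return d @ q

def top_k(query, docs, k=2):
    """Indices of the k documents most similar to the query, best first."""
    sims = cosine_similarity(query, docs)
    return np.argsort(-sims)[:k].tolist()

# Hypothetical document embeddings, hand-made for this example.
doc_texts = ["brewing coffee at home", "tea ceremony basics", "tuning a xylophone"]
doc_vecs = np.array([[0.9, 0.1, 0.0],
                     [0.8, 0.3, 0.1],
                     [0.0, 0.1, 0.9]])
query_vec = np.array([1.0, 0.2, 0.0])   # stands in for the embedding of "hot drinks"

for i in top_k(query_vec, doc_vecs):
    print(doc_texts[i])
```

Ranking by highest cosine similarity is the same as ranking by lowest cosine distance, since the distance is one minus the similarity.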

What to carry away

Embeddings turned text from opaque strings into geometry. The distributional hypothesis (a word is defined by its company) let meaning be learned from raw text; word2vec (skip-gram/CBOW with negative sampling) and GloVe turned that into dense vectors where similar words cluster and relationships become consistent directions, famously enough that king − man + woman lands near queen. fastText added sub-word robustness. The hard limit is that these vectors are static (one per word, blind to context), which is precisely the constraint contextual embeddings (ELMo now, Transformers next) are breaking.

If you only remember one thing: representing things as vectors in a learned space, where distance means similarity, is the durable idea, far more durable than any one model. It's the foundation under the Transformer and, years on, under every retrieval system that finds meaning by finding what's nearby.