A while ago I took the time to explain that matrix multiplication can be seen as a linear transformation. Having that perspective helps in understanding the inner workings of all AI models across various domains (audio, images, etc.). Building on that, the next couple of posts will help you understand the inputs used in these matrix multiplication operations, especially if you want to understand how text-based models and LLMs work. Our focus is on the infamous one-hot encoding, as it is the key to unlocking the underlying concept. It will give you, I hope, the often-elusive intuition behind word embeddings.
I use the following simple example sentences:
S1: “The dog chased the cat”
S2: “The cat watched the bird”
For simplicity I use the term ‘words’, but in practice we use the more general term ‘tokens’, which can stand for anything, for example a question mark or word fragments, rather than whole words. Still, we will proceed as if our tokens are just whole words. The starting point of feeding these texts to the machine is the conversion to numerical values. Text-to-numbers conversion can be done in a few different ways, which might seem a bit confusing for newcomers. Here is a summary table to make sure we are all discussing the same thing.
Method | Definition | Structure | Information Preserved | Example
---|---|---|---|---
One-Hot Encoding | Each unique word as a sparse vector with a single “1” | Vector with length = vocabulary size | Identity only | “cat” = [0,0,0,1,0,0]
Sequence of One-Hot Vectors | Sentence as a sequence of one-hot vectors | Matrix, but can be flattened | Identity and order | S1 = [1,0,0,0,0,0], [0,1,0,0,0,0], [0,0,1,0,0,0], [1,0,0,0,0,0], [0,0,0,1,0,0]
Bag-of-Words | Counts word frequencies in a single document | Vector with length = vocabulary size | Frequencies, without order | S1 = [2,1,1,1,0,0], S2 = [2,0,0,1,1,1]
Term-Document Matrix | Column- or row-wise concatenation of bag-of-words vectors for multiple documents or sentences | Matrix: vocabulary size × documents | Word frequencies across the corpus | S1 and S2 bag-of-words vectors stacked as columns
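To make the four representations concrete, here is a minimal sketch that reproduces the vectors in the table, assuming lowercased whitespace tokenization and a vocabulary ordered by first appearance (the same ordering the table's examples use):

```python
import numpy as np

s1 = "the dog chased the cat".split()
s2 = "the cat watched the bird".split()

# Vocabulary in order of first appearance: ['the', 'dog', 'chased', 'cat', 'watched', 'bird']
vocab = list(dict.fromkeys(s1 + s2))
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """One-hot encoding: a single 1 at the word's vocabulary index."""
    vec = np.zeros(len(vocab), dtype=int)
    vec[index[word]] = 1
    return vec

def sequence_of_one_hots(sentence):
    """Sequence of one-hot vectors: one row per word occurrence, order preserved."""
    return np.stack([one_hot(word) for word in sentence])

def bag_of_words(sentence):
    """Bag-of-words: word counts, order discarded."""
    return sequence_of_one_hots(sentence).sum(axis=0)

# Term-document matrix: bag-of-words vectors stacked as columns (vocabulary size x documents).
term_document = np.stack([bag_of_words(s1), bag_of_words(s2)], axis=1)

print(one_hot("cat"))                  # [0 0 0 1 0 0]
print(sequence_of_one_hots(s1).shape)  # (5, 6)
print(bag_of_words(s1))                # [2 1 1 1 0 0]
print(term_document.shape)             # (6, 2)
```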
As mentioned, our focus here is on one-hot encoding (the first row in the table above). The size of the full one-hot matrix is: Total Words × Unique Words, where:
- Total Words = the sum of all word occurrences across all documents/sentences
- Unique Words = the number of distinct words in the vocabulary
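Here is a small sketch of that matrix for our toy corpus (note that the two example sentences happen to give 6 unique words, so the dimensions differ slightly from the generic “Small” row in the table further down):

```python
import numpy as np

corpus = "the dog chased the cat".split() + "the cat watched the bird".split()
vocab = list(dict.fromkeys(corpus))          # unique words, in order of first appearance
index = {word: i for i, word in enumerate(vocab)}

# One row per word occurrence, one column per unique word.
one_hot_matrix = np.zeros((len(corpus), len(vocab)), dtype=int)
for row, word in enumerate(corpus):
    one_hot_matrix[row, index[word]] = 1

print(one_hot_matrix.shape)                     # (10, 6): Total Words x Unique Words
print((one_hot_matrix.sum(axis=1) == 1).all())  # True: exactly one "1" per row
print(one_hot_matrix.mean())                    # ~0.17: fraction of non-zero cells
```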
In the one-hot matrix each row corresponds to a single word occurrence in the corpus, each column corresponds to a unique word in the vocabulary, and each row contains exactly one “1” with zeros everywhere else. As the corpus grows, the matrix becomes very large and increasingly sparse, which is why one-hot encoding by itself is usually a no-go for direct storage and manipulation in large-scale applications. But it is essential for grasping the concepts we will build on later. The following table illustrates how sparsity grows with dimensionality:
Corpus Size | Total Words | Unique Words | Matrix Size | Non-Zero Elements | Non-Zero Percentage
---|---|---|---|---|---
Small | 10 | 7 | 10×7 = 70 | 10 | 14.3%
Medium | 1,000 | 200 | 1,000×200 = 200,000 | 1,000 | 0.5%
Large | 1,000,000 | 100,000 | 1,000,000×100,000 = 100 billion | 1,000,000 | 0.001%
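The non-zero percentage is just Total Words divided by (Total Words × Unique Words), which simplifies to 1 / Unique Words, so the sparsity is driven entirely by the vocabulary size. A quick check of the table's numbers:

```python
# Reproduce the non-zero percentages from the table above.
for total_words, unique_words in [(10, 7), (1_000, 200), (1_000_000, 100_000)]:
    matrix_size = total_words * unique_words
    non_zero_pct = 100 * total_words / matrix_size   # = 100 / unique_words
    print(f"{matrix_size:>15,} cells, {non_zero_pct:.3g}% non-zero")
```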
While in itself the one-hot matrix is not super practical, all word embeddings originate from that one-hot matrix. Heads up: algebra ahead 📐!
Why is the one-hot matrix the foundation for embedding learning?
The one-hot matrix is a basis in the algebraic sense. I will first explain the meaning of a basis, then we will see why it is relevant for word embeddings. A basis in the algebraic sense means that:
- The vectors in that matrix are linearly independent
- Any other vector you can think of can be expressed as a linear combination of the one-hot vectors. More formally, the one-hot vectors span the vector space.
In less abstract terms you can think of the one-hot vectors as colors. Imagine your colors as a “palette.” A complete palette, able to produce any color, has these properties: completeness – it “spans” all colors, and non-redundancy – its colors are “independent”. Red, yellow, blue, white, and black are complete and non-redundant. In contrast: red, blue, and white are incomplete, missing colors like yellow and green. The set: red, orange, yellow, green, blue, purple, white, and black has some redundancies; for example, purple can be created by mixing red and blue.
Returning to our usual formal perspective:
- Linear dependence means one vector can be written exactly as a combination of others. To state that two vectors are linearly dependent is like saying: “one vector is entirely predictable from the others,” like a perfect correlation. When vectors are linearly independent, no vector is “explaining” or “duplicating” another – similar to uncorrelated variables. To show that one-hot vectors are linearly independent we use the zero vector (where all entries are zero). The zero vector can be created by a linear combination of any vectors – how? You trivially set all the weights (coefficients) of the other vectors to zero. So, if the only way we can get the zero vector is by setting all the weights to zero, it means that no vector can be “cancelled” by another vector, meaning there is no redundancy, or, put differently, there is no linear dependence.
Example
Take these three one-hot vectors in ℝ³:
- v1 = [1, 0, 0]
- v2 = [0, 1, 0]
- v3 = [0, 0, 1]
And test for:
a v1 + b v2 + c v3 = [0, 0, 0]
a[1, 0, 0] + b[0, 1, 0] + c[0, 0, 1] = [a, b, c] = [0, 0, 0]
For the above to hold there is no other way but: a = 0, b = 0, c = 0.
The only way to make the zero vector is by making all coefficients zero. So there is your linear independence (a quick numeric check is sketched right below); now let's move on to spanning.
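For readers who prefer code, the same conclusion can be checked numerically: the homogeneous system having only the trivial solution is equivalent to the matrix of stacked vectors having full rank, which NumPy can confirm directly.

```python
import numpy as np

v1 = np.array([1, 0, 0])
v2 = np.array([0, 1, 0])
v3 = np.array([0, 0, 1])

# Full rank (rank == number of vectors) means the only solution to
# a*v1 + b*v2 + c*v3 = 0 is a = b = c = 0, i.e. the vectors are linearly independent.
stacked = np.stack([v1, v2, v3])
print(np.linalg.matrix_rank(stacked))  # 3
```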
For a set of vectors to span the vector space, you need to show that every vector in that space can be written as a linear combination of the given set. The standard basis vectors (the one-hot vectors in our case) span ℝⁿ (ℝⁿ means a vector of real numbers of size n, where n should be thought of as the size of the vocabulary), because any vector v in ℝⁿ, where v = [v1, v2, …, vn], can be written as a linear combination of the basis vectors. Here is why:
Define ei as the one-hot vector with a 1 in position i, for each i = 1, …, n. Then:
v = v1·e1 + v2·e2 + … + vn·en
In words, each ei vector has entries that are all zeros except the single entry needed to construct the corresponding component of the v vector exactly; when you multiply that entry (the 1) by the coefficient vi, you get vi back. This shows that v can be expressed as a linear combination of e1, …, en, so they span ℝⁿ. Meaning: all and any embedding vectors (even in a lower-dimensional space) are linear combinations of the one-hot vectors.
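Here is a minimal sketch of that last claim, assuming a toy vocabulary of size n = 6 and a made-up n × d embedding matrix E (any real embedding matrix would behave the same way): multiplying a one-hot vector by E simply selects a row of E, so every embedding vector is a (trivial) linear combination of the one-hot inputs.

```python
import numpy as np

n, d = 6, 3                              # vocabulary size, embedding dimension (toy values)
rng = np.random.default_rng(0)

# Any vector v in R^n is the sum of its entries times the one-hot basis vectors.
v = rng.normal(size=n)
basis = np.eye(n)                        # rows are the one-hot vectors e1..en
reconstructed = sum(v[i] * basis[i] for i in range(n))
print(np.allclose(v, reconstructed))     # True

# Multiplying a one-hot vector by an embedding matrix E (n x d) picks out one
# row of E: the embedding lookup is itself a linear combination of one-hot vectors.
E = rng.normal(size=(n, d))              # stand-in for a learned embedding matrix
word_id = 2
print(np.allclose(basis[word_id] @ E, E[word_id]))  # True
```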
While one-hot encoding isn't practical for real-world use, understanding its algebra explains why word embeddings actually work, and why it is theoretically legitimate to shift from that large, discrete, and sparse space to a smaller, dense, continuous, and semantically meaningful space.
With the algebra covered, the next post will explore the geometric interpretation of word embeddings.