The dot product of two embedding vectors $\mathbf{a}$ and $\mathbf{b}$ with dimension $n$ is defined as

$$\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i$$

Hardly the first thing that jumps to mind when thinking about a “similarity score”. After all, the result of a dot product is a single number (a scalar) with no predefined range (e.g. not between zero and one), so it’s hard to judge whether a particular score is high or low on its own. Nonetheless, the Transformer family of deep learning models relies heavily on the dot product in the attention mechanism, to weigh the importance of different parts of the input sentence. This post explains why the dot product, which seems like an odd choice for a “similarity score”, actually makes good sense.
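To make that connection concrete, here is a minimal sketch (toy numbers and variable names of my own, not taken from any particular model) of how dot products become attention weights: every query is dotted with every key, and the resulting scores are pushed through a row-wise softmax.

```python
import numpy as np

# Toy query/key embeddings for a 3-token sentence, dimension 4 (made-up numbers).
Q = np.array([[1.0, 0.5, 0.0, 0.2],
              [0.1, 0.9, 0.3, 0.0],
              [0.4, 0.4, 0.8, 0.1]])
K = Q.copy()  # in self-attention, the keys come from the same tokens

scores = Q @ K.T  # pairwise dot products: one "similarity score" per token pair
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # row-wise softmax

print(weights)  # each row sums to 1: how much each token attends to every other token
```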
Dot Product and Vector Similarity
I assume here that you already understand the geometry of word embeddings.
Consider the following: two vectors pointing the same way (0° angle), for which we want a maximum similarity score; two vectors pointing in opposite ways (180° angle), for which we want a minimum (least similar) score; and two perpendicular vectors (90° angle), for which we want a zero similarity score, meaning no relationship.
If we just used the angle itself, the numbers would be backwards: 180° (least similar) is a higher number than 0° (most similar). That’s not at all intuitive for a “similarity” scale. The solution: feed the angle to the cosine function. This gives us exactly the ordering we want:
| Angle (°) | Cosine |
|-----------|--------|
| 180 | -1.000 |
| 150 | -0.866 |
| 120 | -0.500 |
| 90 | 0.000 |
| 60 | 0.500 |
| 30 | 0.866 |
| 0 | 1.000 |
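As a quick sanity check, a small NumPy snippet (my own illustration) reproduces the table above by building unit vectors at each angle and taking their cosine similarity with a reference vector along the x-axis:

```python
import numpy as np

reference = np.array([1.0, 0.0])  # unit vector at 0°

for angle_deg in [180, 150, 120, 90, 60, 30, 0]:
    theta = np.deg2rad(angle_deg)
    v = np.array([np.cos(theta), np.sin(theta)])  # unit vector at the given angle
    cos_sim = (v @ reference) / (np.linalg.norm(v) * np.linalg.norm(reference))
    print(f"{angle_deg:>3}° -> {cos_sim:+.3f}")
```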
The formula for the cosine of the angle $\theta$ between two vectors $\mathbf{a}$ and $\mathbf{b}$ of dimension $n$ is:

$$\cos\theta = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \, \sqrt{\sum_{i=1}^{n} b_i^2}}$$

Manipulating the above just a bit to get our dot product friend on the left-hand side gives us:

$$\mathbf{a} \cdot \mathbf{b} = \|\mathbf{a}\| \, \|\mathbf{b}\| \cos\theta$$

Now, what is $\|\mathbf{b}\| \cos\theta$? It is how closely the two vectors point in the same direction, multiplied by the magnitude of $\mathbf{b}$. Or, put differently: how much of $\mathbf{b}$ lies in the direction of $\mathbf{a}$. In other words, if I project $\mathbf{b}$ onto $\mathbf{a}$, how much is “captured”? (Answer: exactly $\|\mathbf{b}\| \cos\theta$.) Then we multiply it by the magnitude of $\mathbf{a}$, $\|\mathbf{a}\|$, so as to account for its magnitude as well.
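A short sketch (with arbitrarily chosen vectors) verifies both readings of the formula: the dot product equals $\|\mathbf{a}\|\,\|\mathbf{b}\|\cos\theta$, and $\|\mathbf{b}\|\cos\theta$ is exactly the length of the projection of $\mathbf{b}$ onto $\mathbf{a}$:

```python
import numpy as np

a = np.array([2.0, 1.0, 0.0])
b = np.array([1.0, 3.0, 1.0])

dot = a @ b
norm_a, norm_b = np.linalg.norm(a), np.linalg.norm(b)
cos_theta = dot / (norm_a * norm_b)

# Reading 1: dot product = |a| * |b| * cos(theta)
print(np.isclose(dot, norm_a * norm_b * cos_theta))       # True

# Reading 2: |b| * cos(theta) is the length of b's projection onto a
projection_length = (a @ b) / norm_a                      # scalar projection of b onto a
print(np.isclose(projection_length, norm_b * cos_theta))  # True
```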
Can a Vector Be More “Similar” to Others Than to Itself???
If you google self-attention matrix images you will see that the diagonal is usually quite high. That’s because, as we mentioned, when the angle is zero the cosine is 1, and the score boils down to multiplying the magnitudes of the two embedding vectors.
That said, I got to thinking about this topic because of the observation that, while a vector should theoretically be most similar to itself, that is not always the case in some of the images I saw in papers or generated myself (the diagonal does not always hold the highest scores, as you can see in the figure above as well). So it is possible for different vectors to receive higher similarity scores than a vector does with itself, even in self-attention matrices. This is because the embedding vectors are not always normalized to a consistent length. So even when two vectors point in almost the same direction (a very small angle), the sheer “size” of one of them can still make their similarity score surprisingly high. That is why I explained it that way above.
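Here is a tiny constructed example of that effect: `w` points almost the same way as `v`, but `w` is much longer, so the dot product of `v` with `w` exceeds the dot product of `v` with itself.

```python
import numpy as np

v = np.array([1.0, 0.0])   # unit-length vector
w = np.array([3.0, 0.1])   # almost the same direction as v, but much longer

print(v @ v)  # 1.0 -> similarity of v with itself
print(v @ w)  # 3.0 -> higher, even though the angle between v and w is not zero
```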
By the way, that is also the reason for the normalizing factor $\sqrt{d_k}$. As $d_k$ grows, the variance of the dot product grows; the numbers there can become very large because of the magnitudes of the vectors themselves (larger $d_k$ means more components to square and sum). It is not good to feed large numbers to the softmax ($\frac{e^{x_i}}{\sum_j e^{x_j}}$), because it will blow up, so before we exponentiate ($e^{x_i}$), we “stabilize” the dot product by dividing it by $\sqrt{d_k}$. Why $\sqrt{d_k}$? No real reason. You could choose another term if you wanted. The story is that $\sqrt{d_k}$ is the standard deviation of the dot product if you assume the embedding vectors are normally distributed (they aren’t, but this Gaussian assumption provides a good excuse for using $\sqrt{d_k}$).
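A quick simulation (my own sketch, assuming components drawn i.i.d. from a standard normal, which is exactly the Gaussian excuse above) shows that the standard deviation of the raw dot product grows like $\sqrt{d_k}$, and that dividing by $\sqrt{d_k}$ brings it back to roughly 1 regardless of the dimension:

```python
import numpy as np

rng = np.random.default_rng(0)

for d_k in [16, 64, 256, 1024]:
    q = rng.standard_normal((10_000, d_k))  # 10k random "query" vectors
    k = rng.standard_normal((10_000, d_k))  # 10k random "key" vectors
    dots = np.sum(q * k, axis=1)            # one dot product per (q, k) pair
    print(f"d_k={d_k:>4}  std(raw)={dots.std():6.2f}  "
          f"std(scaled)={(dots / np.sqrt(d_k)).std():4.2f}")
```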