attention math

Why $\frac{1}{\sqrt{d_k}}$

Let X and Y be two independent random variables with E[X] = E[Y] = 0 and Var[X] = Var[Y] = 1.

$Var[XY] = E[X]^2 \cdot Var[Y] + E[Y]^2 \cdot Var[X] + Var[X] \cdot Var[Y]$

Since E[X] = E[Y] = 0, the first two terms vanish and $Var[XY] = Var[X] \cdot Var[Y] = 1$.
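As a quick sanity check, the identity can be tested numerically. The sketch below assumes X and Y are standard normal, which is one convenient choice satisfying mean 0 and variance 1; the derivation itself does not require it. The sample size and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary fixed seed for reproducibility
n_samples = 1_000_000           # arbitrary; larger means a tighter estimate

# Assumption: X, Y ~ N(0, 1), drawn independently.
x = rng.standard_normal(n_samples)
y = rng.standard_normal(n_samples)

# Empirical variance of the product vs. the product of the variances.
empirical_var = np.var(x * y)
predicted_var = np.var(x) * np.var(y)

print(f"empirical Var[XY] = {empirical_var:.3f}")
print(f"Var[X] * Var[Y]   = {predicted_var:.3f}")
```

Both printed values should be close to 1, up to sampling noise.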

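For $C = QK^T$, $C_{ij} = \sum_{k=1}^{N} Q_{ik} K_{jk}$. Each entry is a sum of N products of random variables. If the entries of Q and K are independent with mean 0 and variance 1, these N products are independent of each other, so their variances add.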

$Var[C_{ij}] = d_k \cdot 1 \cdot 1 = d_k$, where $N = d_k$ and every entry of Q and K has variance 1.
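The linear growth of the variance with the key dimension can be seen empirically. This is a minimal sketch, again assuming standard normal entries; `dot_product_variance` is a hypothetical helper written for this illustration, and the dimensions tested are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary fixed seed


def dot_product_variance(d_k: int, n_samples: int = 50_000) -> float:
    """Estimate Var[q . k] for random q, k of dimension d_k.

    Assumption: all entries are independent draws from N(0, 1).
    """
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    scores = np.sum(q * k, axis=1)  # one dot product per sample row
    return float(np.var(scores))


variances = {d_k: dot_product_variance(d_k) for d_k in (4, 16, 64)}
for d_k, var in variances.items():
    print(f"d_k = {d_k:3d}  empirical variance = {var:.2f}")
```

Each estimate should land near its `d_k`, so unscaled attention scores spread out more as the dimension grows.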

We want to find a value $a$ to scale $QK^T$ so that $Var[a \cdot QK^T] = 1$.

$Var[aX] = a^2 \cdot Var[X]$, so $Var[aQK^T] = a^2 \cdot Var[QK^T] = a^2 \cdot d_k = 1$, which gives $a = \frac{1}{\sqrt{d_k}}$.
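The effect of the scaling factor can be checked on a full score matrix. The sketch below assumes Q and K have independent standard normal entries; the sequence length `n_tokens` and the dimension `d_k` are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary fixed seed
n_tokens, d_k = 512, 64         # illustrative sizes, not from any specific model

# Assumption: entries of Q and K are independent N(0, 1).
Q = rng.standard_normal((n_tokens, d_k))
K = rng.standard_normal((n_tokens, d_k))

raw_scores = Q @ K.T                      # shape (n_tokens, n_tokens)
scaled_scores = raw_scores / np.sqrt(d_k)  # apply a = 1 / sqrt(d_k)

raw_var = np.var(raw_scores)
scaled_var = np.var(scaled_scores)

print(f"Var[QK^T]             = {raw_var:.2f}")
print(f"Var[QK^T / sqrt(d_k)] = {scaled_var:.3f}")
```

The unscaled variance should come out near `d_k`, and the scaled variance near 1, matching the derivation.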



