Sunday, April 19, 2026

Gaussian distributed weights for LLMs


The previous post looked at the FP4 4-bit floating point format. This post will look at another 4-bit floating point format, NF4, and higher precision analogs. NF4 and FP4 are common bitsandbytes 4-bit data types. If you download LLM weights from Hugging Face quantized to 4 bits, the weights may be in NF4 or FP4 format. Or maybe some other format: there's a surprising amount of variety in how 4-bit numbers are implemented.

Why NF4

LLM parameters have a roughly Gaussian distribution, and so evenly spaced numeric values are not ideal for representing parameters. Instead, you'd like numbers that are closer together near 0.

The FP4 floating point numbers, described in the previous post, are spaced 0.5 apart for small values, and the larger values are spaced 1 or 2 apart. That's hardly a Gaussian distribution, but it's closer to Gaussian than a uniform distribution would be. NF4 deliberately follows more of a Gaussian distribution.
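For reference, here are the positive FP4 values under the E2M1 reading from the previous post (the exact layout is my assumption here) along with their spacing:

```python
import numpy as np

# Positive FP4 values under an E2M1 layout (1 sign, 2 exponent,
# 1 mantissa bit), as discussed in the previous post
fp4_positive = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

gaps = np.diff(fp4_positive)
print(gaps)  # [0.5 0.5 0.5 0.5 1.  1.  2. ]
```

The spacing is constant near zero and only roughly doubles in the tail, which is why FP4 is closer to uniform than to Gaussian.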

QLoRA

The QLoRA formats [1], unlike FP4, are not analogs of IEEE numbers. The bits are not interpreted as sign, exponent, and mantissa, but rather as integers to be used as indexes. An NFn number is an index into a list of 2^n real numbers with Gaussian spacing. To put it another way, the numbers represented by NFn have uniformly distributed z-scores.
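As a sketch of that idea (my own illustration, not the paper's exact construction): take 2^n probabilities evenly spaced inside (0, 1) and push them through the normal quantile function.

```python
from scipy.stats import norm

n = 4
# 2^n evenly spaced probabilities strictly inside (0, 1)
p = [(i + 1) / (2**n + 1) for i in range(2**n)]
codebook = norm.ppf(p)

# Uniform z-scores produce values that cluster near 0
# and spread out toward the tails.
gap_middle = codebook[8] - codebook[7]
gap_end = codebook[15] - codebook[14]
print(gap_middle < gap_end)  # True
```

The resulting code points are dense near 0 and sparse in the tails, matching the shape of the weight distribution.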

This makes sense at a high level, but the paper [1] is hard to follow in detail. It says

More formally, we estimate the 2^k values q_i of the data type as follows:

    q_i = ½ ( Q_X( i / (2^k + 1) ) + Q_X( (i + 1) / (2^k + 1) ) )

where Q_X(·) is the quantile function of the standard normal distribution N(0, 1).

The paper doesn't give the range of i, but it says there are 2^k values, implying that i runs from 0 to 2^k − 1 or from 1 to 2^k. Either way runs into infinite values, since Q(0) = −∞ and Q(1) = ∞. We could avoid the infinities by letting i run from 1 to 2^n − 1.
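The endpoint behavior is easy to verify numerically with SciPy's quantile function:

```python
import math
from scipy.stats import norm

Q = norm.ppf  # quantile function of N(0, 1)

print(Q(0.0))  # -inf
print(Q(1.0))  # inf
print(Q(0.5))  # 0.0
```

So any index scheme that evaluates Q at 0 or 1 produces an infinite code value.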

The following sentence is puzzling.

A problem for a symmetric k-bit quantization is that this approach does not have an exact representation of zero, which is an important property to quantize padding and other zero-valued elements with no error.

I understand the desire to represent 0 exactly, but the equation above has an exact representation of 0 when i = 2^(n−1). Perhaps the authors had in mind that i takes on the values ½, 1 + ½, 2 + ½, …, 2^n − ½. This would be reasonable, but a highly unusual use of notation. It seems that the real problem is not the lack of a representation of 0 but an unused index, with i running from 1 to 2^n − 1.
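Reading the paper's formula as q_i = (Q(i/(2^n + 1)) + Q((i + 1)/(2^n + 1)))/2, here is a quick check that letting i run from 1 to 2^n − 1 avoids the infinities, leaves one index unused, and represents zero at the middle index:

```python
from scipy.stats import norm

Q = norm.ppf
n = 4
M = 2**n + 1

# q_i = (Q(i/M) + Q((i+1)/M))/2 for i = 1, ..., 2^n - 1
q = [(Q(i/M) + Q((i + 1)/M))/2 for i in range(1, 2**n)]

print(len(q))  # 15 values, so one 4-bit index goes unused
# The middle index i = 2^(n-1) gives zero, up to floating point rounding
print(abs(q[2**(n - 1) - 1]) < 1e-12)  # True
```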

To be fair, the first sentence quoted above says "we estimate the 2^k values …" and so the equation above may not be intended as a definition but as motivation for the actual definition.

Reproducing NF4

The authors give a procedure for using all 2^n values of i while obtaining an exact representation of 0, and they give a list of NF4 values in Appendix E. I was not able to get the two to match. I implemented a few possible interpretations of the procedure described in the paper, and each approximates the list of values in the appendix, but not closely.
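One such interpretation, sketched below, splits the quantiles into a negative half and a positive half and merges the duplicated zero. The tail cutoff is a guess on my part; the paper does not pin it down.

```python
import numpy as np
from scipy.stats import norm

k = 4
offset = 1/32  # assumed tail probability cutoff; the paper leaves this open

# Quantiles of N(0, 1): 2^(k-1) values for the negative half and
# 2^(k-1) + 1 for the positive half, each half including zero
neg = norm.ppf(np.linspace(offset, 0.5, 2**(k - 1)))
pos = norm.ppf(np.linspace(0.5, 1 - offset, 2**(k - 1) + 1))

q = np.union1d(neg, pos)  # the duplicated zero collapses, leaving 16 values
q = q / np.abs(q).max()   # normalize into [-1, 1]

print(len(q))  # 16
print(q[7])    # 0.0
```

This produces 16 values in [−1, 1] with an exact zero, but the interior values only approximate those in Appendix E.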

The following code, written with the help of ChatGPT, reverse engineers the NF4 values to eight decimal places, i.e. to the precision of a 32-bit floating point number.


from scipy.stats import norm

Q = norm.ppf

α  = 0.9677083
Z  = Q(α)
δ1 = (α - 0.5)/7
δ2 = (α - 0.5)/8

q = [0]*16
for i in range(7):
    q[i] = -Q(α - i*δ1)/Z         # negative values; q[7] stays exactly 0
for i in range(8):
    q[i+8] = Q(0.5 + (i+1)*δ2)/Z  # positive values
    
# Values given in Appendix E
NF4 = [
    -1.0,
    -0.6961928009986877,
    -0.5250730514526367,
    -0.39491748809814453,
    -0.28444138169288635,
    -0.18477343022823334,
    -0.09105003625154495,
    0.0,
    0.07958029955625534,
    0.16093020141124725,
    0.24611230194568634,
    0.33791524171829224,
    0.44070982933044434,
    0.5626170039176941,
    0.7229568362236023,
    1.0
]

# Compare with the values from Appendix E
for i in range(16):
    print(i, NF4[i] - q[i])

The magic number α = 0.9677083 is a mystery. I asked ChatGPT to look into this further, and it said that bitsandbytes uses α = 929/960 = 0.9677083333333333. When I use this value for α the precision is about the same, which is fine. However, the values in the paper were given to 16 decimal places, so I thought the more precise value of α might match the values to more precision.

Quibbles over the exact values of NF4 aside, the NF4 format works well in practice. Models quantized to 4 bits using NF4 perform better than models quantized to other 4-bit formats on some benchmarks.
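To illustrate how such a codebook is used, here is a sketch (my own, not the bitsandbytes implementation) of blockwise absmax quantization: scale a block of weights by its largest absolute value, store the 4-bit index of the nearest codebook entry for each weight, and rescale on dequantization.

```python
import numpy as np

# NF4 codebook from Appendix E of the QLoRA paper
NF4 = np.array([
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
])

def quantize(w):
    """Return 4-bit indexes and the absmax scale for a block of weights."""
    scale = np.abs(w).max()
    idx = np.abs(w[:, None] / scale - NF4[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), scale

def dequantize(idx, scale):
    return NF4[idx] * scale

rng = np.random.default_rng(0)
w = rng.normal(size=64)
idx, scale = quantize(w)
err = np.abs(w - dequantize(idx, scale)).max()

# Worst-case error: half the widest codebook gap, times the scale
print(err <= 0.5 * np.diff(NF4).max() * scale)  # True
```

Since normalized weights land in [−1, 1], the reconstruction error per weight is bounded by half the widest gap between adjacent codebook values, scaled by the block's absmax.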

Related posts

[1] QLoRA: Efficient Finetuning of Quantized LLMs by Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. https://arxiv.org/abs/2305.14314
