Thursday, April 23, 2026

How to read the %21x format, part 2


In my previous posting last week, I explained how computers store binary floating-point numbers, how Stata's %21x display format displays those binary floating-point numbers with fidelity, how %21x can help you uncover bugs, and how %21x can help you understand behaviors that are not bugs even though they surprise us base-10 thinkers. The point is, it is often useful to think in binary, and with %21x, thinking in binary is not difficult.

This week, I want to discuss double versus float precision.

Double (8-byte) precision provides 53 binary digits; float (4-byte) precision provides 24. Let me show you what float precision looks like.


. display %21x sqrt(2) _newline %21x float(sqrt(2))
+1.6a09e667f3bcdX+000
+1.6a09e60000000X+000

All those zeros in the float result are not really there;
%21x merely padded them on. The display would be more honest if it were


+1.6a09e6       X+000

Of course, +1.6a09e60000000X+000 is a perfectly valid way of writing +1.6a09e6X+000, just as 1.000 is a valid way of writing 1, but it is important to remember that float has fewer digits than double.
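If you want to experiment with this notation outside Stata, here is a small illustration of mine (not part of the original post): Python's float.hex() method writes doubles in essentially the same base-16 notation as %21x, and a pack/unpack round-trip through a 4-byte float mimics Stata's float() function.

```python
import math
import struct

# Python's float.hex() uses the same base-16 notation as %21x
print(math.sqrt(2).hex())       # 0x1.6a09e667f3bcdp+0

# mimic Stata's float(): round the double through 4-byte (single) precision
f32 = struct.unpack('f', struct.pack('f', math.sqrt(2)))[0]
print(f32.hex())                # 0x1.6a09e60000000p+0, padded zeros again
```

Note that Python, too, pads the float-precision result out to 13 hexadecimal digits, because the rounded value is stored back in a double.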

Hexadecimal 1.6a09e6 is a rounded version of 1.6a09e667f3bcd, and you can think of this in one of two ways:


     double     =  float   + extra precision
1.6a09e667f3bcd = 1.6a09e6 + 0.00000067f3bcd

or


  float   =      double     -  lost precision
1.6a09e6  = 1.6a09e667f3bcd - 0.00000067f3bcd

Note that more digits are lost than appear in the float result! The float result provides six hexadecimal digits (ignoring the 1), yet seven digits appear under the heading lost precision. Double precision is more than twice float precision: double provides 53 binary digits and float provides 24, so double is really 53/24, or about 2.21 times, float precision.
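You can check the double = float + lost-precision arithmetic exactly. A quick sketch of mine in Python (not from the original post), using float.fromhex() to enter the hexadecimal values directly:

```python
# double = float + extra precision, verified with exact hexadecimal literals
d     = float.fromhex('0x1.6a09e667f3bcdp+0')    # sqrt(2) in double precision
f     = float.fromhex('0x1.6a09e6p+0')           # sqrt(2) rounded to float
extra = float.fromhex('0x0.00000067f3bcdp+0')    # the lost digits

print(f + extra == d)    # True: the sum fits in 53 bits, so no rounding occurs
print(d - f == extra)    # True: float = double - lost precision, rearranged
```

Both comparisons are exact because every quantity involved is representable in 53 binary digits, so the double-precision arithmetic introduces no rounding of its own.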

The double in double precision refers to the total number of binary digits used to store the mantissa and the exponent in z = a*2^b, which is 64 versus 32. Precision is 53 binary digits versus 24.

In this case, we obtained the float from float(sqrt(2)), meaning that we rounded a more accurate double-precision result. One usually rounds when producing a less precise representation. One of the rounding rules is to round up if the digits being omitted (with a decimal point in front of them) exceed 1/2, meaning 0.5 in decimal. The equivalent rule in base-16 is to round up if the digits being omitted (with a hexadecimal point in front of them) exceed 1/2, meaning 0.8 in base-16. The lost digits were .67f3bcd, which are less than 0.8, and therefore the last digit of the rounded result was not adjusted.

Actually, rounding to float precision is more difficult than I make out, and verifying that numbers are rounded correctly when displayed in %21x can be difficult, too. These difficulties have to do with the relationship between base-2 (the base in which the computer works) and base-16 (a base related, but not identical, to base-2 that we humans find easier to read). The fact is that %21x was designed for double precision, so it does only an adequate job of showing single precision. When %21x displays a float-precision number, it shows you the exactly equal double-precision number, and that turns out to matter.

We use base-16 because it is easier to read. But why base-16 and not base-15 or base-17? We use base-16 because it is an integer power of 2, the base the computer uses. One advantage of bases that are powers of each other is that base conversion can be done more easily; in fact, conversion can be done almost digit by digit. Base conversion is usually a tedious process. Try converting 2394 (base-10) to base-11. Well, you say, 11^3 = 1331, and 2*1331 = 2662 > 2394, so the first digit is 1 and the remainder is 2394-1331 = 1063. Repeating the process with 1063, I note that 11^2 = 121 and that 1063 is bounded by 8*121 = 968 and 9*121 = 1089, so the second digit is 8 and I have a remainder of …. And eventually you produce the answer 1887 (base-11).
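The divide-and-remainder procedure above can be sketched in a few lines of Python (an illustration of mine, not from the original post):

```python
def to_base(n, base):
    """Convert a nonnegative integer to a digit string in the given base."""
    digits = '0123456789abcdefghijklmnopqrstuvwxyz'
    out = ''
    while n > 0:
        n, r = divmod(n, base)   # peel off the least significant digit
        out = digits[r] + out
    return out or '0'

print(to_base(2394, 11))    # 1887
print(to_base(2394, 16))    # 95a
```

The point is that for an arbitrary pair of bases, every digit requires a full division; nothing about one digit helps you with the next.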

Converting between bases when one base is a power of the other is not only easier, it is so easy you can do it in your head. To convert from base-2 to base-16, group the binary digits into groups of four (because 2^4 = 16) and then translate each group separately.

For instance, to convert 011110100010, proceed as follows:


0111 1010 0010
--------------
   7    a    2

I have performed this process often enough that I hardly need to think. But here is how you should think: Divide the binary number into four-digit groups. The four columns of each group stand for 8, 4, 2, and 1. When you look at 0111, say to yourself 4+2+1 = 7. When you look at 1010, say to yourself 8+2 = 10, and remember that the digit for 10 in base-16 is a.
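That grouping rule is mechanical enough to automate. A minimal sketch of mine in Python (not from the original post):

```python
def bin_to_hex(bits):
    """Translate a binary string to hex, one 4-bit group at a time."""
    assert len(bits) % 4 == 0
    return ''.join('0123456789abcdef'[int(bits[i:i+4], 2)]
                   for i in range(0, len(bits), 4))

print(bin_to_hex('011110100010'))    # 7a2
```

Each group is translated independently; no group's value depends on any other, which is exactly why the conversion can be done in your head.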

Converting back is nearly as easy:


   7    a    2
--------------
0111 1010 0010

Look at 7 and remember the binary columns 8-4-2-1. Though 7 does not contain an 8, it does contain a 4 (leaving 3), and 3 contains a 2 and a 1.
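The reverse direction is just as mechanical, since each base-16 digit expands to exactly four binary digits. A sketch (again mine, not from the post):

```python
def hex_to_bin(hexdigits):
    """Expand each base-16 digit into its four-digit binary group."""
    return ''.join(format(int(h, 16), '04b') for h in hexdigits)

print(hex_to_bin('7a2'))    # 011110100010
```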

I admit that converting base-16 to base-2 is more tedious than converting base-2 to base-16, but eventually, you will have the four-digit binary table memorized; there are only 16 lines. Say 7 to me, and 0111 just pops into my head. Well, I have been doing this a long time, and anyway, I am a geek. I think I carry the as-yet-undiscovered binary gene, which means I came into this world with the base-2-to-base-16 conversion table hardwired:

base-2 base-16
0000 0
0001 1
0010 2
0011 3
0100 4
0101 5
0110 6
0111 7
1000 8
1001 9
1010 a
1011 b
1100 c
1101 d
1110 e
1111 f

Now that you can convert base-2 to base-16 (that is, convert from binary to hexadecimal) and can convert back again, let's return to floating-point numbers.

Remember how floating-point numbers are stored:

z = a * 2^b, 1<=a<2 or a==0

For instance,


    0.0 = 0.0000000000000000000000000000000000000000000000000000 * 2^-big
    0.5 = 1.0000000000000000000000000000000000000000000000000000 * 2^-1
    1.0 = 1.0000000000000000000000000000000000000000000000000000 * 2^0
sqrt(2) = 1.0110101000001001111001100110011111110011101111001101 * 2^0
    1.5 = 1.1000000000000000000000000000000000000000000000000000 * 2^0
    2.0 = 1.0000000000000000000000000000000000000000000000000000 * 2^1
    2.5 = 1.0100000000000000000000000000000000000000000000000000 * 2^1
    3.0 = 1.1000000000000000000000000000000000000000000000000000 * 2^1
    _pi = 1.1001001000011111101101010100010001000010110100011000 * 2^1
    etc.

In double precision, there are 53 binary digits of precision. One of the digits is written to the left of the binary point, and the remaining 52 are written to the right. Next note that the 52 binary digits to the right of the binary point can be written in 52/4 = 13 hexadecimal digits. That is exactly what %21x does:


    0.0 = +0.0000000000000X-3ff
    0.5 = +1.0000000000000X-001
    1.0 = +1.0000000000000X+000
sqrt(2) = +1.6a09e667f3bcdX+000
    1.5 = +1.8000000000000X+000
    2.0 = +1.0000000000000X+001
    2.5 = +1.4000000000000X+001
    3.0 = +1.8000000000000X+001
    _pi = +1.921fb54442d18X+001

You can perform the binary-to-hexadecimal translation for yourself. Consider _pi. The first group of four binary digits after the binary point is 1001, and 9 appears after the binary point in the %21x result. The second group is 0010, and 2 appears in the %21x result.
The %21x result is an exact representation of the underlying binary, and thus you are equally entitled to think in either base.
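You can confirm this translation programmatically. The sketch below is mine (it uses Python's struct module, not Stata): it pulls the 52 mantissa bits of _pi out of the IEEE representation and translates them four at a time.

```python
import math
import struct

# reinterpret the 8 bytes of a double as a 64-bit integer, then as a bit string
bits = format(struct.unpack('>Q', struct.pack('>d', math.pi))[0], '064b')
mantissa = bits[12:]    # skip 1 sign bit and 11 exponent bits; 52 bits remain

# translate the mantissa to hex, one 4-bit group at a time
hexdigits = ''.join(format(int(mantissa[i:i+4], 2), 'x') for i in range(0, 52, 4))
print(hexdigits)    # 921fb54442d18, the digits %21x shows after the point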

In single precision, the rule is the same:

z = a * 2^b, 1<=a<2 or a==0

but this time, only 24 binary digits are provided for a, and so we have


    0.0 = 0.00000000000000000000000 * 2^-big
    0.5 = 1.00000000000000000000000 * 2^-1
    1.0 = 1.00000000000000000000000 * 2^0
sqrt(2) = 1.01101010000010011110011 * 2^0
    1.5 = 1.10000000000000000000000 * 2^0
    2.0 = 1.00000000000000000000000 * 2^1
    2.5 = 1.01000000000000000000000 * 2^1
    3.0 = 1.10000000000000000000000 * 2^1
    _pi = 1.10010010000111111011011 * 2^1
    etc.

In single precision, there are 24-1 = 23 binary digits of precision to the right of the binary point, and 23 is not divisible by 4. If we try to convert to base-16, we end up with


sqrt(2) = 1.0110 1010 0000 1001 1110 011   * 2^0
          1.   6    a    0    9    e    ?  * 2^0

To fill in the last digit, we can pad on an extra 0 because we are to the right of the binary point; for example, 1.101 == 1.1010. If we pad on the extra 0, we have


sqrt(2) = 1.0110 1010 0000 1001 1110 0110  * 2^0
          1.   6    a    0    9    e    6  * 2^0

That is precisely the result %21x shows us:


. display %21x float(sqrt(2))
+1.6a09e60000000X+000

although we might wish that %21x would omit the 0s that are not really there and instead display this as +1.6a09e6X+000.

The problem with that solution is that it could be misleading: the last digit would look as if it contained four binary digits when in fact it contains only three. To show how easily you can be misled, look at _pi in double and float precisions:


. display %21x _pi _newline %21x float(_pi)
+1.921fb54442d18X+001
+1.921fb60000000X+001
        ^
  digit incorrectly rounded?

The computer rounded the last digit up from 5 to 6. The digits after the rounded-up digit in the full-precision result, however, are 0.4442d18, clearly less than 0.8 (1/2). Shouldn't the rounded result be 1.921fb5X+001? The answer is yes, 1.921fb5X+001 would be a better result if we had 6*4 = 24 binary digits to the right of the binary point. But we have only 23 digits; correctly rounding to 23 binary digits and then translating into base-16 results in 1.921fb6X+001. Because of the missing binary digit, the last base-16 digit can take on only the values 0, 2, 4, 6, 8, a, c, and e.

The computer performs the rounding in binary. Look at the relevant piece of this double-precision number in binary:


+1.921f   b    5    4    4    42d18X+001      number
       1011 0101 0100 0100 0100               expansion into binary
       1011 01?x xxxx xxxx xxxxxxxx           thinking about rounding
       1011 011x xxxx xxxx xxxxxxxx           performing the rounding
+1.921f   b    6                   X+001      convert back to base-16

The part I converted to binary in the second line surrounds the part to be rounded. In the third line, I put x's under the part we must discard to round this double into a float. The x'd-out part, 10100…, is clearly greater than 1/2, so the last digit (where I put the question mark) must be rounded up. Thus, _pi in float precision rounds to 1.921fb6X+001, just as the computer said.
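The same rounding is easy to reproduce outside Stata. In this Python sketch of mine (not from the post), a pack/unpack round-trip through 4 bytes plays the role of Stata's float():

```python
import math
import struct

# round _pi to single (float) precision, then look at it in base-16
fpi = struct.unpack('f', struct.pack('f', math.pi))[0]
print(fpi.hex())    # 0x1.921fb60000000p+1, the rounded-up 6, as Stata showed

# with only 23 mantissa bits, the 6th hex digit holds 3 real bits plus a
# padded 0, so it can only be even: 0, 2, 4, 6, 8, a, c, or e
print(int(fpi.hex()[9], 16) % 2)    # 0
```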

Float precision does not play much of a role in Stata even though most users store their data as floats. Regardless of how data are stored, Stata makes all calculations in double precision, and float provides more than enough precision for most data applications. The U.S. deficit in 2011 is projected to be $1.5 trillion. One hopes that a grand total of $26,624 (the error that would be introduced by storing this projected deficit in float precision) would not be a significant factor in any lawmaker's decision on the issue. People in the U.S. are said to work about 40 hours per week, or roughly 0.238 of the hours in a week. I doubt that number is accurate to 0.4 milliseconds, the error that float would introduce in recording the fraction. A cancer survivor might live 350.1 days after a treatment, but we would introduce an error of roughly 1/2 second if we recorded the number as a float; one might question whether the instant of death can even conceptually be determined that accurately. The moon is said to be 384.401 thousand kilometers from the Earth. Record that in thousands of kilometers as a float, and the error is almost 1 meter. At its closest and farthest, the moon is 356,400 and 406,700 kilometers away. Most fundamental constants of the universe are known only to a few parts in a million, which is to say, to less than float precision, although we do know the speed of light in a vacuum to one decimal digit beyond float accuracy: it is 299,792.458 kilometers per second. Round that to float, and you will be off by 0.01 km/s.

The largest integer that can be recorded without rounding in float precision is 16,777,215. The largest integer that can be recorded without rounding in double precision is 9,007,199,254,740,991.
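These limits are 2^24 - 1 and 2^53 - 1, and they are easy to verify; a quick Python check of mine (not from the post):

```python
import struct

def as_float(x):
    """Round a number through 4-byte (single) precision, like Stata's float()."""
    return struct.unpack('f', struct.pack('f', x))[0]

print(as_float(16777215.0) == 16777215.0)    # True: 2^24 - 1 survives intact
print(as_float(16777217.0))                  # 16777216.0: 2^24 + 1 rounds away
print(float(2**53 - 1) == 2**53 - 1)         # True: largest exact double integer
print(float(2**53 + 1) == 2**53)             # True: 2^53 + 1 rounds to 2^53
```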

People working with dollar-and-cent data in Stata usually find it best to use doubles, both to avoid rounding issues and in case the total exceeds $167,772.15. Rounding issues with amounts such as 0.01 and 0.02 are inherent in binary floating point, regardless of precision. To avoid all problems, such people should use doubles and record amounts in pennies. That will have no difficulty with sums up to $90,071,992,547,409.91, which is to say, about $90 trillion. That's 9 quadrillion pennies. In my childhood, I thought a quadrillion just meant a lot, but it has a formal definition.
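Both halves of that advice are visible in hex. In this Python illustration of mine (not from the post), 0.01 has no exact binary representation at any precision, while integer pennies stored in a double are exact up to 2^53 - 1:

```python
# 0.01 is a repeating fraction in binary; the stored double is only approximate
print((0.01).hex())    # 0x1.47ae147ae147bp-7, note the repeating 47ae1 pattern

# integer pennies, however, are exact in a double up to 2**53 - 1
pennies = 9007199254740991    # $90,071,992,547,409.91 in pennies
print(float(pennies) == pennies)    # True
```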

All of which is a long way from where I started, but now you are an expert in understanding binary floating-point numbers the way a scientific programmer needs to understand them: z = a*2^b. You are nearly all the way to understanding the IEEE 754-2008 standard. That standard merely states how a and b are packed into 32 and 64 bits, and the entire point of %21x is to avoid those details because, packed together, the numbers are unreadable by humans.

References

Cox, N. J. 2006. Stata tip 33: Sweet sixteen: Hexadecimal formats and precision problems. Stata Journal 6: 282-283.

Gould, W. 2006. Mata Matters: Precision. Stata Journal 6: 550-560.

Linhart, J. M. 2008. Mata Matters: Overflow and IEEE floating-point format. Stata Journal 8: 255-268.


