In Part 1, I wrote about precision issues in English. If you enjoyed that, you may want to stop reading now, because I’m about to go into the technical details. Actually, these details are pretty interesting.
For instance, I offered the following formula for calculating error due to float precision:
maximum_error = 2^(-24) X
I later mentioned that the formula is an approximation, and said that the true formula is,
maximum_error = 2^(-24) 2^floor(log2X)
I didn’t explain how I got either formula.
I need to be more precise today than I was in my previous posting. For instance, I previously used x for two concepts, the true value and the rounded-after-storage value. Today I need to distinguish these concepts.
X is the true value.
x is the value after rounding due to storage.
The issue is the difference between x and X when X is stored in 24-binary-digit float precision.
Base 10
Although I harp on the value of learning to think in binary and hexadecimal, I admit that I, too, find it easier to think in base 10. So let’s start that way.
Say we record numbers to two digits of accuracy, which I will call d=2. Examples of d=2 numbers include
1.0 1.6 12 47 52*10^1 (i.e., 520, but with only two significant digits)
To say that we record numbers to two digits of accuracy is to say that, coming upon the recorded number 1.0, we know only that the true number lies between 0.95 and 1.05; or coming upon 12, that the true number lies between 11.5 and 12.5, and so on. I assume that numbers are rounded efficiently, which is to say, stored values record midpoints of intervals.
Before we get into the math, let me note that most of us would be willing to say that numbers recorded this way are accurate to 1 part in 10 or, if d=3, to 1 part in 100. If numbers are accurate to 1 part in 10^(d-1), then couldn’t we just multiply the number by 1/(10^(d-1)) to obtain the width of the interval? Let’s try:
Assume X=520 and d=2. Then 520/(10^(2-1)) = 52. The true interval, however, is (515, 525] and it has width 10. So the simple formula does not work.
The simple formula does not work, yet I presented its base-2 equivalent in Part 1 and I even recommended its use! We will get to that. It turns out the smaller the base, the more accurately the simple formula approximates the true formula, but before I can show that, I need the true formula.
Let’s start by thinking about d=1.
- The recorded number 0 will contain all numbers between [-0.5, 0.5). The recorded number 1 will contain all numbers between [0.5, 1.5), and so on. For 0, 1, …, 9, the width of the intervals is 1.
- The recorded number 10 will contain all numbers between [5, 15). The recorded number 20 will contain all numbers between [15, 25), and so on. For 10, 20, …, 90, the width of the intervals is 10.
The derivation for the width of interval goes like this:
- If we recorded the value of X to one decimal digit, the recorded digit will be b, the recorded value will be x = b*10^p, and the power of ten will be p = floor(log10X). More importantly, W1 = 10^p will be the width of the interval containing X.
- It therefore follows that if we recorded the value of X to two decimal digits, the interval length would be W2 = W1/10. Whatever the width with one digit, adding another must reduce the width to one-tenth of what it was.
- If we recorded the value of X to three decimal digits, the interval length would be W3 = W2/10.
- Thus, if d is the number of digits to which numbers are recorded, the width of the interval is 10^p where p = floor(log10X) - (d-1).
The above formula is exact.
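As a quick check of the base-10 derivation, here is a minimal Python sketch that compares the exact width with the simple 1-part-in-10^(d-1) rule. The function names interval_width_base10 and simple_width_base10 are mine, chosen for illustration, and the sketch assumes X > 0.

```python
import math

def interval_width_base10(X, d):
    """Exact width: 10^p with p = floor(log10 X) - (d-1). Assumes X > 0."""
    p = math.floor(math.log10(X)) - (d - 1)
    return 10.0 ** p

def simple_width_base10(X, d):
    """The simple rule: X times 1/(10^(d-1))."""
    return X / 10 ** (d - 1)

if __name__ == "__main__":
    # X=520, d=2 is the example from the text.
    print(interval_width_base10(520, 2))   # exact width of the interval
    print(simple_width_base10(520, 2))     # what the simple rule claims
```

For X=520 and d=2 the first function gives 10 and the second gives 52, matching the numbers worked out by hand above.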
Base 2
Converting the formula
interval_width = 10^(floor(log10X)-(d-1))
from base 10 to base 2 is easy enough:
interval_width = 2^(floor(log2X)-(d-1))
In Part 1, I presented this formula for d=24 as
maximum_error = 2^(floor(log2X)-24) = 2^(-24) 2^floor(log2X)
In interval_width, it is d-1 and not d that appears in the formula. You might think I made an error and should have put -23 where I put -24 in the maximum_error formula. There is no mistake. In Part 1, the maximum error was defined as a plus-or-minus quantity and is thus half the width of the overall interval. So I divided by 2, and in effect, I did put -23 into the maximum_error formula, at least before I subtracted one more from it, making it -24 again.
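The d=24 bound can be checked numerically. The sketch below is an illustration, not part of the original derivation: it uses Python's struct module to round a double to a 32-bit float and then confirms that |x - X| does not exceed the maximum error. The helper names to_float32 and maximum_error are assumptions of mine, and the code assumes X > 0 and within float range.

```python
import math
import struct

def to_float32(X):
    """Round a Python float (a double) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", X))[0]

def maximum_error(X, d=24):
    """Half the interval width: 2^(floor(log2 X) - d). Assumes X > 0."""
    return 2.0 ** (math.floor(math.log2(X)) - d)

if __name__ == "__main__":
    for X in (0.1, 1.1, 520.3, 16777217.0):
        x = to_float32(X)                      # value after storage
        print(X, abs(x - X), maximum_error(X))  # error, then its bound
```

In each row the observed error should be no larger than the bound in the last column; the final value, 2^24 + 1, is a case where the error reaches the bound exactly.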
I started out this posting by considering and dismissing the base-10 approximation formula
interval_width = 10^(-(d-1)) X
which in maximum-error units is
maximum_error = 10^(-d) X
and yet in Part 1, I presented — and even recommended — its base-2, d=24 equivalent,
maximum_error = 2^(-24) X
It turns out that the approximation formula is not as inaccurate in base 2 as it would be in base 10. The correct formula,
maximum_error = 2^(floor(log2X)-d)
can be written
maximum_error = 2^(-d) 2^floor(log2X)
so the question becomes about the accuracy of substituting X for 2^floor(log2X). We know by examination that X ≥ 2^floor(log2X), so making the substitution will overstate the error and, in that sense, is a safe thing to do. The question becomes how much the error is overstated.
X can be written 2^(log2X) and thus we need to compare 2^(log2X) with 2^floor(log2X). The floor() function cannot reduce its argument by more than 1, and thus 2^(log2X) cannot differ from 2^floor(log2X) by more than a factor of 2. Under the circumstances, this seems a reasonable approximation.
In the case of base 10, the floor() function reducing its argument by up to 1 results in a decrease of up to a factor of 10. That, it seems to me, is not a reasonable amount of error.
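To see the factor-of-2 versus factor-of-10 difference in numbers, here is a small Python sketch that computes how much substituting X for base^floor(log_base X) overstates the error. The function name overstatement is hypothetical, and the sketch handles only bases 2 and 10 with X > 0.

```python
import math

def overstatement(X, base):
    """Ratio X / base^floor(log_base X); supports base 2 and base 10 only."""
    if base == 2:
        p = math.floor(math.log2(X))
    elif base == 10:
        p = math.floor(math.log10(X))
    else:
        raise ValueError("this sketch supports only base 2 and base 10")
    return X / base ** p

if __name__ == "__main__":
    for X in (520, 999, 1023):
        # Base-2 ratio stays below 2; base-10 ratio can approach 10.
        print(X, overstatement(X, 2), overstatement(X, 10))
```

For X=520 the base-2 ratio is 520/512, barely above 1, while the base-10 ratio is 5.2, which is the 52-versus-10 discrepancy from the base-10 example.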
