I wish to begin a collection on utilizing Stata’s random-number operate. Stata the truth is has ten random-number features:
- runiform() generates rectangularly (uniformly) distributed random quantity over [0,1).
- rbeta(a, b) generates beta-distribution beta(a, b) random numbers.
- rbinomial(n, p) generates binomial(n, p) random numbers, where n is the number of trials and p the probability of a success.
- rchi2(df) generates χ2 with df degrees of freedom random numbers.
- rgamma(a, b) generates Γ(a, b) random numbers, where a is the shape parameter and b, the scale parameter.
- rhypergeometric(N, K, n) generates hypergeometric random numbers, where N is the population size, K is the number of in the population having the attribute of interest, and n is the sample size.
- rnbinomial(n, p) generates negative binomial — the number of failures before the nth success — random numbers, where p is the probability of a success. (n can also be noninteger.)
- rnormal(μ, σ) generates Gaussian normal random numbers.
- rpoisson(m) generates Poisson(m) random numbers.
- rt(df) generates Student’s t(df) random numbers.
You already know that these random-number generators do not really produce random numbers; they produce pseudo-random numbers. This series is not about that, so we’ll be relaxed about calling them random-number generators.
You should already know that you can set the random-number seed before using the generators. That is not required but it is recommended. You set the seed not to obtain better random numbers, but to obtain reproducible random numbers. In fact, setting the seed too often can actually reduce the quality of the random numbers! If you don’t know that, then read help set seed in Stata. I should probably pull out the part about setting the seed too often, expand it, and turn it into a blog entry. Anyway, this series is not about that either.
This series is about the use of random-number generators to solve problems, just as most users usually use them. The series will provide practical advice. I’ll stay away from describing how they work internally, although long-time readers know that I won’t keep the promise. At least I’ll try to make sure that any technical details are things you really need to know. As a result, I probably won’t even get to write once that if this is the kind of thing that interests you, StataCorp would be delighted to have you join our development staff.
runiform(), generating uniformly distributed random numbers
Mostly I’m going to write about runiform() because runiform() can solve such a variety of problems. runiform() can be used to solve,
- shuffling data (putting observations in random order),
- drawing random samples without replacement (there’s a minor detail we’ll have to discuss because runiform() itself produces values drawn with replacement),
- drawing random samples with replacement (which is easier to do than most people realize),
- drawing stratified random samples (with or without replacement),
- manufacturing fictional data (something teachers, textbook authors, manual writers, and blog writers often need to do).
runiform() generates uniformly, a.k.a. rectangularly distributed, random numbers over the interval, I quote from the manual, “0 to nearly 1”.
Nearly 1? “Why not all the way to 1?” you should be asking. “And what exactly do you mean by nearly 1?”
The answer is that the generator is more useful if it omits 1 from the interval, and so we shaved just a little off. runiform() produces random numbers over [0, 0.999999999767169356].
Listed here are two helpful formulation it is best to decide to reminiscence.
- If you wish to generate steady random numbers between a and b, use
generate double u = (b–a)*runiform() + a
The random numbers won’t truly be between a and b, they are going to be between a and almost b, however the high will probably be so near b, specifically 0.999999999767169356*b, that it’s going to not matter.
Keep in mind to retailer steady random values as doubles.
- If you wish to generate integer random numbers between a and b, use
generate ui = flooring((b–a+1)*runiform() + a)
Specifically, don’t even think about using the formulation for steady values however rounded to integers, which is to say, spherical(u) = spherical((b–a)*runiform() + a). In case you use that formulation, and if b–a>1, then a and b will probably be underneath represented by 50% every within the samples you generate!
I saved ui as a default float, so I’m assuming that -16,777,216 ≤ a < b ≤ 16,777,216. When you have integers exterior of that vary, nevertheless, retailer as a lengthy or double.
I’m going to spend the remainder of this weblog entry explaining the above.
First, I wish to present you ways I obtained the 2 formulation and why it’s essential to use the second formulation for producing integer uniform deviates.
Then I would like clarify why we shaved just a little from the highest of runiform(), specifically (1) whereas it wouldn’t matter for formulation 1, it made formulation 2 just a little simpler, (2) the code would run extra shortly, (3) we may extra simply show that we had carried out the random-number generator appropriately, and (4) anybody digging deeper into our random numbers wouldn’t be misled into pondering they’d greater than 32 bits of decision. That final level will probably be necessary in a future weblog entry.
Steady uniforms over [a, b)
runiform() produces random numbers over [0, 1). It therefore obviously follows that (b–a)*runiform()+a produces number over [a, b). Substitute 0 for runiform() and the lower limit is obtained. Substitute 1 for runiform() and the upper limit is obtained.
I can tell you that in fact, runiform() produces random numbers over [0, (232-1)/232].
Thus (b–a)*runiform()+a produces random numbers over [a, ((232-1)/232)*b].
(232-1)/232) roughly equals 0.999999999767169356 and precisely equals 1.fffffffeX-01 if you’ll permit me to make use of %21x format, which Stata understands and which you’ll be able to perceive if you happen to see my earlier weblog posting on precision.
Thus, if you’re involved about outcomes being within the interval [a, b) rather than [a, b], you should use the formulation
generate double u = ((b–a)*runiform() + a) / 1.fffffffeX-01
There are seven f’s adopted by e within the hexadecimal fixed. Alternatively, you may kind
generate double u = ((b–a)*runiform() + a) * ((2^32-1)/2^32)
however multiplying by 1.fffffffeX-01 is much less typing so I’d kind that. Really I wouldn’t kind both one; the small distinction between values mendacity in [a, b) or [a, b] is unimportant.
Integer uniforms over [a, b]
Whether or not we produce actual, steady random numbers over [a, b) or [a, b] could also be unimportant, but when we wish to draw random integers, the excellence is necessary.
runiform() produces steady outcomes over [0, 1).
(b–a)*runiform()+a produces continuous results over [a, b).
To produce integer results, we might round continuous results over segments of the number line:
a a+.5 a+1 a+1.5 a+2 a+2.5 b-1.5 b-1 b-.5 b
real line +-----+-----+-----+-----+-----+-----------+-----+-----+-----+
int line |<-a->|<---a+1--->|<---a+2--->| |<---b-1--->|<-b->|
In the diagram above, think of the numbers being produced by the continuous formula u=(b–a)*runiform()+a as being arrayed along the real line. Then imagine rounding those values, say by using Stata’s round(u) function. If you rounded in that way, then
- Values of u between a and a+0.5 will be rounded to a.
- Values of u between a+0.5 and a+1.5 will be rounded to a+1.
- Values of u between a+1.5 and a+2.5 will be rounded to a+2.
- …
- Values of u between b-1.5 and b-0.5 will be rounded to b-1.
- Values of u between b-0.5 and b-1 will be rounded to b.
Note that the width of the first and last intervals is half that of the other intervals. Given that u follows the rectangular distribution, we thus expect half as many values rounded to a and to b as to a+1 or a+2 or … or b-1.
And indeed, that is exactly what we would see:
. set obs 100000
obs was 0, now 100000
. gen double u = (5-1)*runiform() + 1
. gen i = round(u)
. summarize u i
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
u | 100000 3.005933 1.156486 1.000012 4.999983
i | 100000 3.00489 1.225757 1 5
. tabulate i
i | Freq. Percent Cum.
------------+-----------------------------------
1 | 12,525 12.53 12.53
2 | 24,785 24.79 37.31
3 | 24,886 24.89 62.20
4 | 25,284 25.28 87.48
5 | 12,520 12.52 100.00
------------+-----------------------------------
Total | 100,000 100.00
To avoid the problem we need to make the widths of all the intervals equal, and that is what the formula floor((b–a+1)*runiform() + a) does.
a a+1 a+2 b-1 b b+1
real line +-----+-----+-----+-----+-----------------------+-----+-----+-----+-----+
int line |<--- a --->|<-- a+1 -->| |<-- b-1 -->|<--- b --->)
Our intervals are of equal width and thus we expect to see roughly the same number of observations in each:
. gen better = floor((5-1+1)*runiform() + 1)
. tabulate better
better | Freq. Percent Cum.
------------+-----------------------------------
1 | 19,808 19.81 19.81
2 | 20,025 20.02 39.83
3 | 19,963 19.96 59.80
4 | 20,051 20.05 79.85
5 | 20,153 20.15 100.00
------------+-----------------------------------
Total | 100,000 100.00
So now you know why we shaved a little off the top when we implemented runiform(); it made the formula
floor((b–a+1)*runiform() + a):
easier. Our integer [a, b] formulation didn’t need to concern itself that runiform() would typically — not often — return 1. If runiform() did return the occasional 1, the straightforward formulation above would produce the (correspondingly occasional) b+1.
How Stata calculates steady random numbers
I’ve stated that we shaved just a little off the highest, however the truth was that it was simpler for us to do the shaving than not.
runiform() is predicated on the KISS random quantity generator. KISS produces 32-bit integers, which means integers the vary [0, 232-1], or [0, 4,294,967,295]. You would possibly marvel how we transformed that vary to being steady over [0, 1).
Start by thinking of the number KISS produces in its binary form:
b31b30b29b28b27b26b25b24b23b22b21b20b19b18b17b16b15b14b13b12b11b10b9b8b7b6b5b4b3b2b1b0
The corresponding integer is b31*231 + b31*230 + … + b0*20. All we did was insert a binary point out front:
0 . b31b30b29b28b27b26b25b24b23b22b21b20b19b18b17b16b15b14b13b12b11b10b9b8b7b6b5b4b3b2b1b0
making the real value b31*2-1 + b30*2-2 + … + b0*2-32. Doing that is equivalent to dividing by 2-32, except insertion of the binary point is faster. Nonetheless, if we had wanted runiform() to produce numbers over [0, 1], we may have divided by 232-1.
Anyway, if the KISS random quantity generator produced 3190625931, which in binary is
10111110001011010001011010001011
we transformed that to
0.10111110001011010001011010001011
which equals 0.74287549 in base 10.
The most important quantity the KISS random quantity generator can produce is, after all,
11111111111111111111111111111111
and 0.11111111111111111111111111111111 equals 0.999999999767169356 in base 10. Thus, the runiform() implementation of KISS generates random numbers within the vary [0, 0.999999999767169356].
I may have introduced all of this mathematically in base 10: KISS produces integers within the vary [0, 232-1], and in runiform() we divide by 232 to thus produce steady numbers over the vary [0, (232-1)/232]. I may have stated that, but it surely loses the flavour and instinct of my longer clarification, and it might gloss over the truth that we simply inserted the binary level. If I requested you, a base-10 person, to divide 232 by 10, you wouldn’t truly divide in the identical manner that they might divide by, say 9. Dividing by 9 is figure. Dividing by 10 merely requires shifting the decimal level. 232 divided by 10 is clearly 23.2. You might not have realized that trendy digital computer systems, when programmed by “superior” programmers, comply with comparable procedures.
Oh gosh, I do get to say it! If this form of factor pursuits you, think about a profession at StataCorp. We’d like to have you ever.
Is it necessary that runiform() values be saved as doubles?
Generally it can be crucial. It’s clearly not necessary when you find yourself producing random integers utilizing flooring((b–a+1)*runiform() + a) and -16,777,216 ≤ a < b ≤ 16,777,216. Integers in that vary match right into a float with out rounding.
When creating steady values, keep in mind that runiform() produces 32 bits. floats retailer 23 bits and doubles retailer 52, so if you happen to retailer the results of runiform() as a float, it will likely be rounded. Generally the rounding issues, and typically it doesn’t. Subsequent time, we’ll talk about drawing random samples with out substitute. In that case, the rounding issues. In most different instances, together with drawing random samples with substitute — one thing else for later — the rounding
doesn’t matter. Moderately than pondering onerous in regards to the situation, I retailer all my non-integer
random values as doubles.
Tune in for the subsequent episode
Sure, please do tune in for the subsequent episode of every thing you’ll want to find out about utilizing random-number mills. As I already talked about, we’ll talk about drawing random samples with out substitute. Within the third installment, I’m fairly positive we’ll talk about random samples with substitute. After that, I’m just a little uncertain in regards to the ordering, however I wish to talk about oversampling of some teams relative to others and, individually, talk about the manufacturing of fictional knowledge.
Am I forgetting one thing?
