Friday, May 1, 2026

Stata/MP – having fun with millions


I was reviewing some timings from the Stata/MP Performance Report this morning. (For those who don’t know, Stata/MP is the version of Stata that has been programmed to take advantage of multiprocessor and multicore computers. It is functionally equivalent to the largest version of Stata, Stata/SE, and it is faster on multicore computers.)

What was unusual this morning is that I was running Stata/MP interactively. We usually run MP for large batch jobs that run thousands of timings on large datasets, either to tune performance or to produce reports like the Performance Report. That is the sort of work Stata/MP was designed for: big jobs on big datasets.

I’ll admit right now that I mostly run Stata interactively using the auto dataset, which has 74 observations. I run Stata/MP using all 4 cores of my quad-core computer, but I’m mostly wasting 3 of them; there is no speeding up the computations on 74 observations. This morning I was running Stata/MP interactively on a 24-core computer using a somewhat larger dataset.

After a while, I was struck by the fact that I wasn’t noticing any annoying delays waiting for commands to run. It felt almost as if I were running on the auto dataset. But I wasn’t. I was running commands using 50 covariates on 1 million observations! Regressions, summary statistics, etc.; this was fun. I had never played interactively with a million-observation dataset before.

Out of curiosity, I turned off multicore support. The change was dramatic. Commands that had been taking less than a second were now taking longer, too long. My coffee cup was full, but I contemplated fetching a snack. Running on only one processor was not much fun.

For your information, I set rmsg on and ran a few timings:

                              Timing (seconds)
Analysis                    24 cores     1 core
generate a new variable          .03        .33
summarize 50 variables           .88      19.55
twoway tabulation                .45        .45
linear regression                .65      11.48
logistic regression             7.19      59.27

All timings are on a 1 million observation dataset.
The two regressions included 50 covariates.

OK, the timings with 24 cores are not quite the same as with the auto dataset, but they are well within comfortable interactive use.
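To put the table in terms of speedup, here is a minimal Python sketch that divides each 1-core timing by its 24-core timing. The timings are copied from the table above; the dictionary and the function name `speedup` are only illustrative and are not part of Stata.

```python
# Timings in seconds, copied from the table above: (24 cores, 1 core).
timings = {
    "generate a new variable": (0.03, 0.33),
    "summarize 50 variables": (0.88, 19.55),
    "twoway tabulation": (0.45, 0.45),
    "linear regression": (0.65, 11.48),
    "logistic regression": (7.19, 59.27),
}


def speedup(multi_core_seconds, single_core_seconds):
    """Return how many times faster the multicore run was."""
    return single_core_seconds / multi_core_seconds


speedups = {name: speedup(mp, sp) for name, (mp, sp) in timings.items()}

for name, ratio in speedups.items():
    print(f"{name:<26} {ratio:5.1f}x")
```

On these numbers, summarize comes out roughly 22 times faster, linear regression roughly 18 times, generate about 11 times, logistic regression about 8 times, and twoway tabulation shows no change at all.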

Careful readers will have noticed that the 24-core and 1-core timings for twoway tabulation are the same. We have not rewritten the code for tabulate to support multiple cores, partly because tabulate is already very fast, and partly because the code for tabulate is isolated, so changing it would not improve the performance of other commands. Thus, parallelizing tabulate is on our long-run, not short-run, list of additions to Stata/MP. We have rewritten about 250 sections of Stata’s internal code to support symmetric multiprocessing (SMP). Each rewritten section typically improves the performance of many commands.

I switched back to using all 24 cores and returned to my original work: stress testing changes in the number of covariates and observations. My fun was quelled when I started running some timings of Cox proportional hazards regressions. With my 50 covariates and 1 million observations, a Cox regression took just over two minutes. Most estimators in Stata are parallelized, including the estimators for parametric survival models. The Cox proportional hazards estimator is not. It is not parallelized because it uses a clever algorithm that requires sequential computations. When I say sequential, I mean that some computations are wholly dependent on previous computations, so they simply cannot be performed simultaneously, in parallel. There are other algorithms for fitting the Cox model, but they are orders of magnitude slower. Even parallelized, they would not be faster than our current sequential algorithm unless run on 20 or more processors. When more computers start shipping with dozens of cores, we will evaluate adding a parallelized algorithm for the Cox estimator.
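The difference between independent and sequential computations can be illustrated with a toy Python sketch. This is not the Cox algorithm and not how Stata is implemented; both functions are hypothetical examples. The first splits a sum of squares into chunks that could each be handed to a separate core, while the second is a recurrence in which every step needs the result of the step before it.

```python
def sum_of_squares_in_chunks(values, n_chunks):
    """Independent work: each chunk could be processed on its own core."""
    size = max(1, -(-len(values) // n_chunks))  # ceiling division
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    # The partial sums do not depend on one another, so their order
    # (or simultaneous evaluation) does not affect the final total.
    partial_sums = [sum(v * v for v in chunk) for chunk in chunks]
    return sum(partial_sums)


def chained_recurrence(values, start=0):
    """Sequential work: step i cannot begin until step i - 1 has finished."""
    state = start
    for v in values:
        # The new state is computed from the previous state.
        state = (state * state + v) % 97
    return state


values = list(range(1, 11))
total = sum_of_squares_in_chunks(values, n_chunks=3)
final_state = chained_recurrence(values)
print(total, final_state)
```

The chunked sum gives the same answer however many chunks are used, which is what makes that kind of computation easy to spread across cores. The recurrence offers no such split, because no step can start before the previous one has produced its state.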

The computer I was running on is about a year old. There has been a spate of new and faster server-grade processors from Intel and AMD in the past year. You can get pretty close to the performance of my 24-core computer using just 8 cores and the newer chips. That means that with a newer 32-core computer, I could increase my threshold for interactive analysis to about 4 million observations.
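The arithmetic behind that estimate can be written out as a short Python sketch. It assumes, as a simplification not stated in the text, that run time grows in proportion to the number of observations and shrinks in proportion to the number of cores; the variable names are illustrative.

```python
# Figures taken from the text above.
new_cores_matching_old_machine = 8     # 8 newer cores ~ the 24-core computer
new_machine_cores = 32                 # the hypothetical newer computer
comfortable_observations = 1_000_000   # interactive threshold on the 24-core computer

# Assumption: throughput scales linearly with the number of cores.
relative_capacity = new_machine_cores / new_cores_matching_old_machine

# Assumption: run time scales linearly with the number of observations.
estimated_threshold = comfortable_observations * relative_capacity

print(f"{relative_capacity:.0f}x capacity, about {estimated_threshold:,.0f} observations")
```

Commands that do not scale linearly with cores, such as the logistic regression timed earlier, would see a smaller gain than this back-of-the-envelope figure suggests.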

There are 4 speed comparisons above. To see 450 more, including graphs and a discussion of SMP and its implementation in Stata, see the Stata/MP white paper, a.k.a. the Stata/MP Performance Report.


