Monday, December 15, 2025

How to automate common tasks


Automating common tasks is essential to effective data analysis. Automation saves you lots of time from repeating the same sets of operations, and it reduces errors by reducing what you have to repeat.

Let’s automate something using Stata. The task we’re automating doesn’t much matter. What matters is that we get comfortable with how to automate tasks.

We’ll automate the simple task of normalizing a variable. That is to say, subtracting the variable’s mean and dividing by its standard deviation.

Just so you know, there are already community-contributed commands to do this and to do it more flexibly than we will. Type search normalize variable in Stata, and you will see one of those commands. (You will see things about other types of normalization that have nothing to do with normalizing a variable, but the command of interest is easy to pick out.) You can also normalize a single variable using Stata’s egen command, but we’re going to do more than that.
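For comparison, here is a minimal sketch of the egen route. It assumes a numeric variable named x is already in memory; xN is just my choice of name for the result.

```stata
* egen's std() function computes (x - mean) / sd in a single step
egen xN = std(x)
```

That covers one variable per call, which is part of why we are going to build something that does more.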

As with all the articles in this series, I assume the reader is new to automating tasks in Stata. So, if you are already an expert, these articles may hold little interest for you. Or perhaps you will still find something novel.

Scripting

First, we’ll just perform the normalization directly in the middle of our analysis script. In Stata, we call analysis scripts do-files because they do something.

Let’s normalize the variable named x. I don’t like to change the content of existing variables, so I’m going to create a new variable xN, where the N suffix indicates normalization. If you don’t like the N suffix, use something else, perhaps _norm. Or use a prefix. Stata’s summarize command will give us the mean and standard deviation.

mydo.do

...

summarize x
generate xN = (x - r(mean)) / r(sd)

...

That’s it. It takes only two lines to normalize a variable.

What are r(mean) and r(sd), and how did I know about them? In Stata, almost all commands return results. Estimation commands return their results as e() values, and most other commands return their results as r() values. I learned the names r(mean) and r(sd) by typing help summarize and scrolling to the bottom of the help file. There I found all the results returned by summarize and their descriptions. I could also have simply typed return list after the summarize command. return list shows us each returned result and its value. Or I could have gone to the full manual entry for summarize and read about the returned results there. You don’t even need Stata to see the help file or documentation.

Click on this to see the help file,

https://www.stata.com/help.cgi?summarize

Or click on this to see the manual entry,

https://www.stata.com/manuals/rsummarize.pdf

That’s the quick way. If you just want to browse Stata’s documentation, click on

https://www.stata-press.com/manuals/documentation-set/

Click on your manual of interest, and browse the table of contents.
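If you would rather poke at returned results yourself, here is a minimal sketch. The auto dataset and the variable mpg are my choices for illustration; any numeric variable would do.

```stata
sysuse auto, clear      // example dataset shipped with Stata
summarize mpg
return list             // lists every r() result that summarize left behind
display r(mean)         // the mean of mpg
display r(sd)           // the standard deviation of mpg
```

Run return list right after the command of interest, because the next r-class command replaces the r() results.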

Sorry for the digression, but finding things is important. Back to our script.

Our task is only two lines long. Why on earth would we automate it? Even in these two lines, there is ample room for error. If you block copy the code to normalize another variable, or say you block copy it 100 times to normalize 100 other variables, take care that you change x to your new variable name everywhere it must be changed. Forget to change it in summarize, and your new variable, say, y, is normalized by x’s mean and standard deviation. Forget to change it in xN, and you get an error message. Forget to change it in the expression for xN, and your new variable will be x normalized by y’s mean and standard deviation. I have made all of these mistakes.

And you’re still listening to me?

Do-file automation

Let’s put our script into its own do-file.

normalize.do (1)

version 15.1

summarize x
generate xN = (x - r(mean)) / r(sd)

I added one thing. The version command at the top. Always, always, always version your do-files. I’m running Stata 15.1, so that is what I put at the top. If I do that, this script will always work the way it does today, even if some future Stata, say, Stata version 42, does away with the summarize command or completely changes how summarize works.

We run our new script by typing

. do normalize

or by putting do normalize in an analysis do-file.

Our current normalize.do is not too interesting. We need it to work on variables other than x.

Here is a version that does just that:

normalize.do (2)

version 15.1

summarize `1'
generate `1'N = (`1' - r(mean)) / r(sd)

We then type

. do normalize y

What changed from (1) to (2)? All we did was replace every occurrence of x with `1'. Why `1'? Stata’s do-files parse their arguments into local macros numbered 1, 2, 3, and so on. The first argument goes into local macro 1, the second into 2, and so on. What is a local macro? It is just a name that holds a value. Yes, 1 can be a local macro name. Why do we surround the 1 with a left tick and a right tick? If we just type 1, that would be the number 1. We want the value in 1, so we dereference it. Dereference is just a fancy word for get its value. Because we typed do normalize y, our first (and only) argument is y, so `1' dereferences to y. If you don’t like the word “dereference”, just say `1' expands to y.

When you substitute y for `1' in our second version of normalize.do, it becomes the first version with y in place of x. That is exactly what Stata does.
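Macro expansion is easy to try on its own. This is a small sketch with a named local macro; the name myvar is hypothetical and chosen only for illustration. Run the lines together from a do-file, because a local macro disappears when the do-file that defined it ends.

```stata
local myvar y              // store the text y in a local macro
display "`myvar'"          // displays y
display "`myvar'N"         // displays yN, the name our do-file would create
```

The numbered macros that do fills in behave the same way; they just happen to have numbers for names.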

With our new normalize.do, we can gleefully type

. do normalize myvariable
. do normalize myothervariable
. do normalize x1
. do normalize x2
...
. do normalize x100

I am much less likely to make mistakes.

There is still a lot of redundant typing. We’ll return to that later.

What I want to ask now is, Can we make this do-file respect Stata’s if qualifier? The answer must be “yes”, and easily. Otherwise, I wouldn’t have asked.

Why do we want an if qualifier? We might want to type

. do normalize revenue if male == 0

and restrict our normalization to females in the sample. That’s what if male == 0 says.

Here’s a do-file that respects both the if and in qualifiers.

(If you don’t know what an in qualifier is, click on https://www.stata.com/help.cgi?in to see.)

normalize.do (3)

version 15.1

syntax varlist(min=1 max=1) [if] [in]

summarize `varlist' `if' `in'
generate `varlist'N = (`varlist' - r(mean)) / r(sd)   `if' `in'

When we look at the last two lines of code, the ones adapted from our earlier do-file, we see two changes: `1' has been replaced everywhere with `varlist', and both commands have `if' `in' added to the end of the command. Because we claimed our do-file now directly supports if and in qualifiers, that new syntax command seems to be performing a lot of magic, and indeed it is.

syntax parses commands that look like standard Stata commands. That is to say, commands that have a variable list (varlist), an optional if qualifier, an optional in qualifier, and options. We don’t have any options yet, but we have everything else. I am simplifying here; syntax can do much more.

What is really cool about the syntax command is that you basically type what your command itself looks like, and syntax parses the command line, filling in local macros with relevant pieces of your syntax. It also issues error messages when what is typed does not match the syntax you have specified. That is why we went to the trouble to add (min=1 max=1) to varlist on our syntax command. We could have just typed varlist, but then syntax would have allowed more than one variable to be specified. And that would not work on our generate command. We want only one variable. The if and in qualifiers are optional, and that is why they appear in brackets on the syntax command. If we had typed if in rather than [if] [in], the if and in qualifiers would be required. Requiring both would be unusual, but I have made the if qualifier required on some commands.

There was a problem with our first two do-files that I ignored. I never checked that `1' was an unabbreviated variable name. Stata allows abbreviated variable names. If you have a variable foreign and no other variables that are abbreviated to for, then typing

. do normalize for

would have created the new variable forN, not foreignN. You may be fine with that; you may not. Regardless, you would have to be careful. There are ways to fix that in our earlier versions, but we won’t bother.

Our current version does not have that problem. Even if for is typed on the command line, `varlist' expands to the unabbreviated variable name foreign. That is part of the magic of syntax.
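To watch syntax at work, here is a sketch of a throwaway do-file that only reports what was parsed. The file name showparse.do is hypothetical, and the auto dataset is my choice for illustration.

```stata
* showparse.do: a hypothetical helper that only reports what syntax parsed
version 15.1

syntax varlist(min=1 max=1) [if] [in]

display `"varlist: `varlist'"'
display `"if:      `if'"'
display `"in:      `in'"'
```

With the auto dataset in memory (sysuse auto), typing do showparse for if mpg > 20 in 1/50 should report foreign as the varlist, along with the if and in qualifiers as typed. Typing two variable names, or a name that matches no variable, should produce an error message from syntax rather than from any code of ours.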

Now, back to that redundant typing. What if we want to normalize a bunch of variables en masse? That, too, is easy enough, but we must finally add to our two lines of computational code.

Here’s a do-file that takes a list of variables and normalizes each while respecting if and in.

normalize.do (4)

version 15.1

syntax varlist [if] [in]

foreach var in `varlist' {
    summarize `var' `if' `in'
    generate `var'N = (`var' - r(mean)) / r(sd)   `if' `in'
}

Taking it from the top. We removed the (min=1 max=1) from the syntax command because now we want to accept a varlist.

The foreach command is new but easy to understand. For each var in the variable list, run the two commands we have been running all along. `varlist' expands to the list of variables specified to our do-file. var is just a name to hold a single variable name as we loop over them one at a time. We could have used variable, or just v, or even z; it would not matter.

In our summarize and generate commands, we now use `var', so we are accessing a single variable.
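Here is the loop on its own, as a minimal sketch outside of normalize.do. The auto dataset and the three variable names are my choices for illustration.

```stata
sysuse auto, clear
foreach var in mpg weight length {
    summarize `var'
    display "`var' has mean " r(mean) " and sd " r(sd)
}
```

Each pass through the braces sees a different value of the local macro var, and r(mean) and r(sd) always refer to the summarize that just ran.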

That’s it.

We can now type things like

. do normalize x1 x2 x3  if male==0

or

. do normalize x*

normalize.do will now take valid Stata varlists. If you don’t already know, click on https://www.stata.com/help.cgi?varlist to see what all that means.

Creating a new command

Our little automation process has led to something pretty flexible and useful. Perhaps too useful to keep it as a do-file. Maybe we should turn it into a new Stata command that we can use on any of our projects or even share with our colleagues.

Again, if that were hard, I would not have raised the possibility.

We are going to create an ado-file, an automatic do-file. A program defined in an ado-file acts like a new command in Stata. It is automatically found and run.

Without further ado, here it is,

normalize.ado (a)

program normalize
    version 15.1

    syntax varlist [if] [in]

    foreach var in `varlist' {
        summarize `var' `if' `in'
        generate `var'N = (`var' - r(mean)) / r(sd)   `if' `in'
    }
end

What did we do? We indented the code from version (4) of our do-file, but that was just for prettiness. We added program normalize at the top of the file. We added end at the bottom of the file. These latter two things say to treat this as a command, so we do not need to type do in front of it.

At the risk of being repetitive, that’s it.

We now have a program that will be automatically found and run whenever we type normalize.

We can now type

. normalize x1 x2 x3  if male==0

or

. normalize x*

We can give the file normalize.ado to our colleagues, and it will work for them too.
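A quick sanity check is worth the few seconds it takes. This sketch assumes normalize.ado, version (a), is somewhere Stata can find it; the auto dataset and variable names are my choices for illustration.

```stata
sysuse auto, clear
normalize mpg weight if foreign == 0
summarize mpgN weightN        // means very close to 0, standard deviations very close to 1
```

The new variables are missing for the observations excluded by the if qualifier, so summarize reports on the same observations that were normalized.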

Now get out there and automate some tasks of your own.

Some bookkeeping

I say that normalize.ado will just be automatically available. It will be, if you put it where it can be found. If it is in your current working directory, it can be found. But you may not want to place it in each working directory. And what if you make it better? Then you have to change it in multiple places. Instead, in Stata, type

. adopath

One of the directories on that path will be labeled (PERSONAL). Copy normalize.ado there. It will now be found in all of your projects, regardless of what directory you are working in.

If you give normalize.ado to colleagues, tell them to copy it to their (PERSONAL) directories.

I do not always take automation this far. I have found it useful to stop at versions (1), (2), (3), or (4) of our do-files. Or go all the way to a new command.

Also, it was no accident that we called the program program normalize and put it in the file normalize.ado. The name of the program and the name of the file must be the same.

One more detail. Your do-file is reloaded from the normalize.do file every time you type do normalize …. Your ado-file program stays in Stata’s memory after you type normalize …. The next time you type normalize, Stata runs the program from memory without rereading the normalize.ado file. Great, that is faster. But … if you are debugging your program and editing the file, your changes will not be reloaded. You must type discard before typing normalize …. That way, your program will be dropped from memory and will be reloaded from your file.
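While debugging, my edit-and-rerun cycle looks something like this sketch. The variable x1 is assumed to exist in your data.

```stata
which normalize      // shows which normalize.ado Stata finds on the adopath
discard              // drop automatically loaded programs from memory
normalize x1         // Stata rereads normalize.ado before running it
```

The which command is handy when you have copies of the file in more than one directory and are not sure which one Stata is running.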

There are easy ways to share your new commands with the whole Stata Community. Take a look at the FAQ How do I share a new command with Stata users?

Typical process

Here is a typical automation process:

   1. Code the solution to a specific problem.
      a. Notice you are copying that code over and over.
      b. Ask what you change from one problem to another.

   2. Write a do-file that takes those things that change as arguments.
      a. Refine.
      b. Test.
      c. Repeat 2a and 2b until happy.

   3. Maybe turn your do-file into an ado-file.

   4. Maybe share your ado-file with your colleagues.

   5. Maybe share your ado-file with the whole Stata community.

Congratulations, you can now automate common tasks in Stata. Whether you meant to or not, you are on your way to becoming a programmer. Grab yourself a highly caffeinated beverage.

If you are happy with what we have done so far, this would be a great time to quit reading.

Addendum: One more addition

You might not like automatically attaching an N to the end of your original variable name to designate the normalized variable. Maybe you would like to use a different letter, or even a set of characters, say, _norm. Or you might prefer a prefix to a suffix. Goodness, maybe you want both.

We can accommodate that.

normalize.ado (b)

program normalize
    version 15.1

    syntax varlist [if] [in] [ , prefix(name) suffix(name) ]

    foreach var in `varlist' {
        summarize `var' `if' `in'
        generate `prefix'`var'`suffix' = (`var' - r(mean)) / r(sd)   `if' `in'
    }
end

From version (a) to version (b), all we did was change

syntax varlist [if] [in]

to

syntax varlist [if] [in] [, prefix(name) suffix(name)]

and change

generate `var'N = ...

to

generate `prefix'`var'`suffix' = ...

Let’s understand the changes to the syntax line.

The square brackets again mean optional; users do not have to type anything here.

If they do type anything, they must first type a comma ,. They can then type either prefix(pstuff) or suffix(sstuff), or both. If they type prefix(pstuff), then the local macro prefix will contain whatever they type within the parentheses, here pstuff. The local macro suffix will contain whatever users type in the parentheses of the suffix option.

We were careful when we wrote our syntax command. Because we wrote (name) and not (string), users cannot type just anything in the parentheses. Whatever they type must be a legal Stata variable name. We intend to use what is typed as a prefix or suffix to a variable name, so that string must itself not contain anything that would be illegal in a variable name.

Now, what does

generate `prefix'`var'`suffix' = ...

imply?

The macros `prefix' and `suffix' are just expanded to whatever the user typed in the prefix and suffix options. Our new variable will have the prefix and the suffix that the user typed.

With our new ado-file, we can now type things like

. normalize x1 x2 x3 x4 , prefix(norm_of_)
. normalize x* , prefix(norm_of_)

The first line creates four new variables: norm_of_x1, norm_of_x2, norm_of_x3, norm_of_x4. I am not fond of those names, but it is pretty clear what they mean. Unless you are thinking matrices. Sometimes, these are called standardized variables, so you might prefer prefix(std_).

In the second line, x* matches all variables that begin with x. Each of them will be normalized and a new variable created with the designated prefix norm_of_.

You might have noticed a lurking bug. If the user types neither a prefix() nor a suffix() option, then both `prefix' and `suffix' will be blank. Our generate command is going to try to create a variable with the same name as the original variable. And that … is an error: generate refuses to create a variable that already exists.

One way to avoid that error is to default to our original behavior of suffixing the new variable with an “N”. We do that by adding the following three lines right below our syntax line,

if "`prefix'`suffix'" == "" {
    local suffix "N"
}

They simply say, if both prefix and suffix (`prefix'`suffix') are empty, then assign “N” to suffix.

Another nice little improvement would be to add a label to our new variable. Here’s a possibility:

label variable `prefix'`var'`suffix' "`var' normalized"

Which we would add right after the generate command. Inside the foreach loop.
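Putting those two improvements together, the whole file might look like the sketch below. The comments are mine, and I think of this as version (c), though that label is only my own bookkeeping.

```stata
program normalize
    version 15.1

    syntax varlist [if] [in] [ , prefix(name) suffix(name) ]

    // fall back to the original N suffix when neither option is given
    if "`prefix'`suffix'" == "" {
        local suffix "N"
    }

    foreach var in `varlist' {
        summarize `var' `if' `in'
        generate `prefix'`var'`suffix' = (`var' - r(mean)) / r(sd)   `if' `in'
        label variable `prefix'`var'`suffix' "`var' normalized"
    }
end
```

Remember to type discard before trying the new version if an older normalize is already loaded in memory.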

And that is how programs become long. You improve them, and you add features. Keep at this, and you will soon be writing blocks of code that intimidate your colleagues.


