From datasets to framesets and alias variables: Data management advances in Stata

The purpose of this blog is to explain two novel features introduced in Stata 18 (released in 2023): 1) framesets and 2) alias variables across frames. These features enable Stata to handle a multiplicity of potentially very large datasets efficiently and conveniently. Framesets allow you to bundle, save on file, and load in memory a set of related frames that hold datasets. Alias variables allow you to access variables in other frames as if they were part of the current frame, with very little memory overhead.

Data management in Stata

When Stata 1.0 was released in 1985, data were organized in a tabular form as observations (rows) and variables (columns) and were called a dataset. Datasets were kept entirely in memory (then measured in kilobytes) and saved on disk as .dta files. Data types, like integers, real numbers, and especially strings, were frugally managed. Most of the maiden 44 commands were for data management, including the still inescapable generate, replace, and list. This underlying framework has remained the bedrock for the 17 versions of Stata that followed: datasets are still kept as tables entirely in memory, with strongly typed languages to process the data. This makes Stata fast and allows billions of observations to be processed in milliseconds. However, holding entire datasets in memory is restrictive with very large datasets. Nonetheless, leveraging the remarkable growth of affordable memory, Stata's data management capabilities kept getting bigger, stronger, and faster.

In this blog, I discuss new features for handling large datasets, namely, frames, framesets, and alias variables. I describe these features in detail in the next three sections. In an appendix at the end of this blog, I provide an overview of how Stata's data management capabilities have grown over time.

Frames: A framework for multiple datasets

With large and complex data, there is often a need to work with multiple, and potentially huge, datasets simultaneously. You may want to multitask and work with different datasets for different projects. Or you may be working with a set of related datasets and want to consolidate statistics across them. There are Stata commands, like preserve and restore, that let you switch from one dataset to another. But these require some careful coding and entail a time penalty for saving and restoring datasets to and from disk.
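
As a rough sketch of the trade-off just described, here is the same summary computed first with preserve and restore and then in a separate frame. The frame name mpgsum is made up for illustration, and the example assumes the auto dataset that ships with Stata; the frame commands themselves are introduced in the rest of this section.

```stata
* Assumption: a clean session, so the data load into the frame named default
clear all
sysuse auto

* Approach 1: preserve sets the data aside, restore brings them back
preserve
collapse (mean) mpg, by(foreign)
list
restore

* Approach 2: collapse a copy held in another frame (mpgsum is hypothetical)
frame copy default mpgsum
frame mpgsum: collapse (mean) mpg, by(foreign)
frame mpgsum: list
```

In the second approach the original data stay in the current frame throughout, so there is nothing to restore afterward.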

In Stata 16 (2019), a new framework for handling multiple datasets was introduced: frames. Multiple datasets can be kept in memory in multiple frames. For example, here is how you can create a frame with frame create, make that frame the current (working) frame with frame change, and load a dataset into it:

. frame create auto

. frame change auto

. sysuse auto
(1978 automobile data)

You can make a copy of a frame and rename a frame:

. frame copy auto auto1

. frame rename auto1 cars

Names of datasets and the frames that hold them can be different. Also, even when there are multiple frames in memory, you can interactively work with one frame (the current frame) at a time. You can identify the current frame with pwf (print working frame):

. pwf
  (current frame is auto)

You always work with the current frame, by default. That said, the frame prefix capability allows you to run a command on a frame other than the current one. For example, you can generate a new variable, say, newvar (with random values here), in frame cars:

. frame cars: generate newvar = runiform()

You can also use frlink to create a link between the current frame and another frame. For instance, you can create a one-to-one link (by specifying 1:1) between current frame auto and frame cars by matching the observations on variable make (which holds the makes of cars):

. frlink 1:1 make, frame(cars)
(all observations in frame auto matched)

You can delete a frame (if it is not the current one) with frame drop:

. frame drop cars

You can reset frames with

. frames reset

That will reset Stata to a state where a single, empty frame is in memory.

You can do more with frames: copy variables and observations with frame put, add new observations with frame post, and so on.
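
A minimal sketch of frame put and frame post, assuming the auto dataset; the frame names subset and results and the variable names stat and val are invented for this illustration:

```stata
* Assumption: a clean session with the auto data in frame default
clear all
sysuse auto

* frame put copies the selected variables into a new frame (here: subset)
frame put make mpg foreign, into(subset)
frame subset: describe, short

* frame create can declare the variables of an empty results frame
frame create results str20 stat double val

* frame post appends one observation to that frame per call
summarize mpg
frame post results ("mean of mpg") (r(mean))
frame results: list
```

Collecting results this way leaves the data in the current frame untouched while the results accumulate in their own frame.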

Note that commands related to frames and framesets work exactly the same way whether you type frame or frames; they are synonymous. For a good introduction to frames, see help frames intro.

Stata supports up to 100 frames. Just like individual datasets, all frames are kept entirely in memory. This makes working with frames, too, very fast. But it assumes you can fit the data of all frames in memory, a constraint that motivated the two new features in Stata 18.

New in Stata 18: Framesets

Stata 18 offers a natural evolution of the frames concept: users can now save on disk, in a memory-efficient way, a set of frames. A new data file format is introduced for framesets: .dtas, the plural of .dta.

For example, let's create three frames and load into them three different datasets (related to life expectancy):

. frame create life0

. frame create life1

. frame create life2

. frame life0: sysuse lifeexp
(Life expectancy, 1998)

. frame life1: sysuse uslifeexp
(U.S. life expectancy, 1900-1999)

. frame life2: sysuse uslifeexp2
(U.S. life expectancy, 1900-1940)

You can save these three frames in a single frameset file, say, life.dtas, with

. frames save life, frames(life0 life1 life2)
file life.dtas saved

You can later reset or clear all frames and load the frames saved in life.dtas with

. frames reset

. frames use life
  life0  68 x 6; Life expectancy, 1998
  life1  100 x 10; U.S. life expectancy, 1900-1999
  life2  41 x 2; U.S. life expectancy, 1900-1940

When working with a set of frames, you have to consider a number of factors. For example, what if the frames you want to load from disk have the same names as those in memory? Which frame becomes the current frame when a frameset is loaded? What if you try to load a previously linked frame that does not exist anymore?

We provide frames describe, which takes stock of frames and the variables they hold, both in memory and on disk. For example, the following gives a (short) description of the frames in frameset life.dtas:

. frames describe using life, short

-------------------------------------------------------------------------------
Frame: life0

Contains data                                 Life expectancy, 1998
 Observations:            68                  26 Aug 2023 20:06
    Variables:             6
Sorted by:
-------------------------------------------------------------------------------

-------------------------------------------------------------------------------
Frame: life1

Contains data                                 U.S. life expectancy, 1900-1999
 Observations:           100                  26 Aug 2023 20:06
    Variables:            10
Sorted by: year
-------------------------------------------------------------------------------

-------------------------------------------------------------------------------
Frame: life2

Contains data                                 U.S. life expectancy, 1900-1940
 Observations:            41                  26 Aug 2023 20:06
    Variables:             2
Sorted by: year
-------------------------------------------------------------------------------

Frameset commands also store a number of r-results to keep track of what is happening, for example, the subset of frames being saved or loaded, whether data in each frame have changed in memory, and so on.

As with .dta files, we provide the low-level description of .dtas files. help dtas provides all the details needed to have other software read and write .dtas files.

The syntax and options of frameset commands follow, quite naturally, those of dataset commands, like save, use, and describe. For example, dataset and frameset commands handle, in the same way, things like labels, empty datasets, the level of detail in describing datasets, and so on.

Stata uses its native zipfile to compress frameset files in frames save, and unzipfile to extract files in frames use. The user can specify the compression level for frames save. This can be done in two ways: through the complevel(#) option or through set dtascomplevel #. # is an integer between 0 and 9; 0 means no compression and 9 means maximum compression. The default is 1. For example, life.dtas can be saved and replaced on disk with maximum compression by typing

. frames save life, frames(life0 life1 life2) complevel(9) replace
file life.dtas saved
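
The second route, set dtascomplevel, can be sketched as follows. This is only an illustration: the file name life_max is hypothetical, and the example rebuilds two of the life-expectancy frames so that it runs on its own.

```stata
* Assumption: a clean session; rebuild two of the frames used above
clear all
frame create life0
frame create life1
frame life0: sysuse lifeexp
frame life1: sysuse uslifeexp

* Set the compression level for subsequent frames save commands
set dtascomplevel 9

* No complevel() option here; the setting above is expected to apply
* (life_max is a hypothetical file name)
frames save life_max, frames(life0 life1) replace

* Check what was written to disk
frames describe using life_max, short
```

The setting is convenient when several framesets are saved in one session, whereas complevel(#) is handy for a one-off choice.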

Note that frames and framesets are built on top of datasets. That means you can keep working with datasets in exactly the same way as you did before, if frames and framesets are not of practical interest to you. The only thing you probably need to know is that when you use a dataset, it goes into a frame by default, and this frame is, unsurprisingly, named default. At the end of the day, even with frames, you interactively work with one dataset or one frame at any given time.
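
A short check of this point, assuming a fresh session and the auto dataset: loading data the usual way, with no frame command at all, still places them in a frame.

```stata
* Assumption: a clean session, so only the default frame exists
clear all

* Load a dataset exactly as in earlier versions of Stata
sysuse auto

* The data now live in the frame named default
pwf
frames dir
```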

New in Stata 18: Alias variables across frames

In this section, I describe how alias variables can be used to access variables across frames in a memory-efficient manner.

Two datasets in different frames can be related by having matching variables. As mentioned earlier, you can link frames with frlink by matching observations in the current frame with observations in the related frame, based on common variables.

After creating links with frlink, you can use fralias add to define variable aliases, that is, names that reference variables in a linked frame.

Here is an example of adding an alias variable. First, let's set up the auto and cars frames in memory as we did above.

. clear all

. frame create auto

. frame change auto

. sysuse auto
(1978 automobile data)

. frame copy auto cars

. frame cars: generate newvar = runiform()

. pwf
 (current frame is auto)

The two frames are the same, except for variable newvar added to cars. From the current frame auto, you can create a one-to-one link with cars, based on common variable make:

. frlink 1:1 make, frame(cars)
(all observations in frame auto matched)

Now, an alias variable, say, newvar, can be created in current frame auto to access variable newvar in cars:

. fralias add newvar, from(cars)
(1 variable aliased from linked frame)

Here the alias variable has the same name as the variable it points to. But it can be different. We will show how in the next example.

In essence, fralias add defines references from the current frame to variables in linked frames. The references let you work with the linked variables without copying them into the current frame. These references consume very little memory; the variables are actually stored in only one frame or dataset but can be made available in different frames.

Here are a few more comments about frlink, on which fralias is based. When you use frlink, a new variable is created in the current frame. It references the linked frame. By default, the new variable is named after the linked frame. But a different variable name can be generated with option generate().

Also, the matching of observations with common variables performed by frlink can be one to one or many to one. Rather usefully, frlink can even handle variables that are common in different frames but with different names. Additionally, frlink can match groups of variables using wildcard * in variable names. Should there be changes in data, or frames renamed, links can be rebuilt with frlink rebuild or dropped by dropping the link variable.
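
The following sketch puts several of these frlink features together, under the assumption that the key is called make in the current frame and model in the linked frame. The names model and lnk are invented for illustration.

```stata
* Assumption: a clean session with the auto data in frame default
clear all
sysuse auto
frame copy default cars

* Give the key a different name in the linked frame (model is hypothetical)
frame cars: rename make model

* Match make in the current frame to model in cars; name the link variable lnk
frlink 1:1 make, frame(cars model) generate(lnk)
frlink describe lnk

* Reordering the linked frame makes the stored observation numbers stale ...
frame cars: gsort -model

* ... so rebuild the link
frlink rebuild lnk

* Dropping the link variable removes the link
drop lnk
```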

Alias variables created by fralias add are treated like any other variable in your dataset, with the caveat that you are not allowed to change their values. For a given alias variable, if you change the corresponding variable's values in the linked frame where they reside, the changed values are automatically available the next time you use the alias variable. So changing the variables in one frame is sufficient, and the change is reflected in all frames that reference them.
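
A small sketch of this behavior, assuming the auto dataset and a constant-valued newvar so that the effect is easy to see in the summary statistics:

```stata
* Assumption: a clean session with the auto data in frame default
clear all
sysuse auto
frame copy default cars
frame cars: generate newvar = 1

* Link the current frame to cars and alias newvar
frlink 1:1 make, frame(cars)
fralias add newvar, from(cars)
summarize newvar

* Change the values where they actually reside, in frame cars
frame cars: replace newvar = 2

* The alias in the current frame should now reflect the new values
summarize newvar

* Changing the alias itself is expected to be refused, per the text above
capture noisily replace newvar = 3
```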

Alias variables allow many frames to have the same variable as if it belongs to all of them, but the variable is stored in only one frame. This avoids creating duplicates of variables or using expensive commands like merge or frget. The latter, for example, copies variables from a linked frame with a large memory footprint, especially with expensive data types like double and string. In contrast, alias variables, being mere references in memory, have small, fixed memory footprints. Using alias variables is therefore memory efficient and helps afford keeping all frames in memory, which keeps Stata quick and nimble.
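
To make the contrast concrete, here is a hedged sketch that brings the same variable into the current frame both ways. The names newvar_copy and newvar_alias are hypothetical, and the example only shows that the two give the same values; it does not measure memory use.

```stata
* Assumption: a clean session with the auto data in frame default
clear all
sysuse auto
frame copy default cars
frame cars: generate double newvar = runiform()
frlink 1:1 make, frame(cars)

* frget makes a physical copy of the variable in the current frame
frget newvar_copy = newvar, from(cars)

* fralias add creates only a reference to the variable in cars
fralias add newvar_alias = newvar, from(cars)

* Both can be used the same way and hold the same values
describe newvar_copy newvar_alias
fralias describe newvar_alias
assert newvar_copy == newvar_alias
```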

Example of frameset and alias variable

In this section, I show a more complete example and delve into additional features of frameset and alias variable commands.

Suppose you are working on a project about the income level in the state of Texas in the United States and want to analyze the data at the person and county levels (each state in the United States comprises counties).

You are using two Stata datasets: persons.dta and txcounty.dta. You can load the two datasets in two frames, say, persons and counties, as follows:

. clear all

. frame create persons

. frame change persons

. webuse persons

. frame create counties

. frame change counties

. webuse txcounty
(Median income in Texas counties)

You can describe the two frames with the frame prefix:

. frame persons: describe

Contains data from https://www.stata-press.com/data/r18/persons.dta
 Observations:            20
    Variables:             3                  16 Apr 2022 13:36
                                              (_dta has notes)
----------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
----------------------------------------------------------------------
personid        byte    %9.0g                 Person ID
countyid        byte    %9.0g                 County ID
income          float   %9.0g                 Household income
----------------------------------------------------------------------
Sorted by:

. frame counties: describe

Contains data from https://www.stata-press.com/data/r18/txcounty.dta
 Observations:             8                  Median income in Texas counties
    Variables:             2                  30 Dec 2022 06:13
                                              (_dta has notes)
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
countyid        byte    %9.0g      cty        County ID
median_income   float   %9.0g                 Household median income
-------------------------------------------------------------------------------
Sorted by:

With clear all above, we automatically started with an empty working frame called default. We then added two frames on top of default. We can list the frames in memory and identify the current frame with

. frames dir
  counties  8 x 2; Median income in Texas counties
  default   0 x 0
  persons   20 x 3; persons.dta

. pwf
  (current frame is counties)

counties is the current frame because it is the last frame we changed to. If we want to work with persons, we have to change to that frame:

. frame change persons

Because frames persons and counties have common variable countyid, we can use frlink to link current frame persons to frame counties, based on countyid. Because many persons belong to the same county, the matching here is many to one (m:1):

. frlink m:1 countyid, frame(counties)
(all observations in frame persons matched)

The matching variables do not have to have the same name. It is rather straightforward to do the linking in such cases. help frlink has the details.

Note that the frlink command above created a new variable in persons, named counties. It is named after the linked frame. Option generate() could have been used in frlink to create a different variable name. The values of the new variable are the matching observation numbers in counties.

You can now use frames save to save, on disk, frame persons and all other frames linked to it by specifying option linked; all frames are saved in file myproject.dtas:

. frames save myproject, frames(persons) linked
file myproject.dtas saved

Note that only frame counties is linked to the current frame in this case, given the frlink command above. So counties is also saved in myproject.dtas, besides persons.

Next, you can reset all frames in memory and later remind yourself what is in myproject.dtas with frames describe (we use option simple for a compact description):

. frames reset

. frames describe using myproject, simple

--------------------------------------------
Frame: persons
personid  countyid  income    counties
--------------------------------------------

--------------------------------------------
Frame: counties
countyid       median_income
--------------------------------------------

You can later load all frames saved in myproject.dtas in memory with frames use:

. frames use myproject, frames(_all)
  counties  8 x 2; Median income in Texas counties
  persons   20 x 4

Note that, at this point, the current frame is default, as pwf shows:

. pwf
 (current frame is default)

Although two frames were loaded in memory, the current frame (default in this case) did not change with frames use. To work with one of the loaded frames, say, persons, you have to explicitly specify it as the working frame:

. frame change persons

Next, you want to compare incomes of individual persons with the median income of the county. The median income is available in frame counties. We know that persons is linked to counties based on the frlink command above. We can verify the existing linkages from the current frame (persons) with

. frlink dir
  (1 frlink variable found)
  -----------------------------------------------------------------------------
  counties created by frlink m:1 countyid, frame(counties)
  -----------------------------------------------------------------------------
  Note: Type "frlink describe varname" to find out more, including whether the
  variable is still valid.

To access variable median_income in frame counties, you can add an alias variable, say, median, to reference the variable as follows:

. fralias add median = median_income, from(counties)
(1 variable aliased from linked frame)

You can describe the alias variable with

. fralias describe median

----------------------------------------------------
Alias    Type    Target          Link       Frame
----------------------------------------------------
median   float   median_income   counties   counties
----------------------------------------------------

You can now run analyses in frame persons that include variable median. Very simply here, you can find the ratio of individual income to the corresponding county median income:

. generate ratio = income/median

Note that alias variable median merely references median_income in counties, which consumes little memory. So you can work with the variable as if it were part of the frame, with very little memory overhead. But you cannot change the variable; it can be changed only in frame counties. Any change in the variable will be available in all frames that reference it.

Summary

In this post, I described two data management features recently introduced in Stata: framesets and alias variables. While sticking to basic principles that make data processing in Stata simple, intuitive, and fast (like holding the entire dataset in memory), we kept augmenting Stata's capabilities in data management. The ability to handle large and complex datasets took a leap in Stata 16 with the introduction of frames: multiple, and possibly related, datasets can be simultaneously kept in memory as frames. In Stata 18, we followed up with a natural evolution of frames: the ability to save multiple datasets or frames in a single, compressed file and later restore the saved frames in memory. We introduced a new file format, the .dtas file. We also provided alias variables that enable access to variables in linked frames. Alias variables across frames are a powerful tool that conveniently and efficiently allows access to variables in different frames without spending memory on copies of the variables or using expensive commands to combine very big datasets.

Appendix: A summary of Stata's data management capabilities

In this section, I describe the key stations along Stata's journey in data management. This provides context for the recently introduced features. While the core design and principles that make Stata intuitive and fast have not changed, the data management capabilities have continuously made major strides to handle increasingly large and complex datasets. Here are the highlights.

The limit on the number of observations has grown steadily, to over a billion in the MP (multiprocessor) version of Stata 14 (2015) and to over a trillion currently. Terabytes of memory are now supported. The maximum number of variables was increased to 120,000 in Stata/MP 15 (2017). Other data maxima also kept increasing: length of various names, number of options for a command, length of value labels, macros, and so on. In practice, with automatic memory management introduced in Stata 12 (2011) and enhanced in Stata 14 (2015), the maximum size limits, like the number of observations and variables, are primarily constrained by how much memory is available. help limits will tell you more.
Before Stata 13 (2013), strings were limited to 244 characters. Stata 13 introduced a new data type called strL (long strings), which increased the maximum length of strings to 2 billion characters. This enabled reading large files into strings and writing long strings to files. Thus, a variety of files could be handled in Stata commands and functions: Word documents, JPEG images, plain text ASCII, EBCDIC, binary, VARCHARs (variable character fields), BLOBs (binary large objects), CLOBs (character large object strings), and more. help datatypes has more details.
Stata gradually introduced support for an increasingly wide variety of specialized data: longitudinal/panel, survival/duration, time series, survey, discrete choice, spatial, and multiple imputations (to handle missing data).
All editions of Stata are available on all major operating systems and hardware platforms, with full compatibility. Stata datasets, programs, and other data can be shared across editions and platforms without translation.
With Stata/MP, the multiprocessor edition of Stata introduced in version 9 (2005), massive speedup was achieved. Up to 64 cores/processors can be supported. Subsequently, many commands and built-in routines have been modified to take advantage of parallelization wherever possible, from data management tasks like adding variables and sorting to analysis tasks like regression and other computationally intense estimation commands.
Data can be imported from, and exported to, a growing number of popular file formats, including Excel, SAS, SPSS, and dBase, besides standard formats like comma-separated values (.csv) and fixed-column files. Stata provides support for JDBC and ODBC and database products like Oracle, MySQL, Amazon Redshift, Snowflake, Microsoft SQL Server, and DB2. Stata also provides access to data repositories like the Federal Reserve Economic Data, Wharton Research Data Services, Haver Analytics, and the International Statistical Classification of Diseases and Related Health Problems (ICD-9 and ICD-10).
Stata's interoperability capabilities also made significant inroads. There was increasing integration with other development platforms like Java, Python, and H2O (for machine learning and predictive analytics). Stata became web aware in version 8 (2003) with commands like webuse. Thereafter, Stata made strides to seamlessly and efficiently access and interoperate with data sources and platforms over the Internet and the Cloud.
The graphics-driven data editor was introduced in Stata 8 (2003) and then continuously improved. Spreadsheet editing capabilities, like live view of data, adding and changing observations/variables/cells, importing data, and copying and pasting, were repeatedly enhanced.
Stata 14 (2015) introduced support for Unicode (UTF-8). Subsequently, Stata added support for multiple languages in its interface, menus, and dialogs. Besides English, Stata speaks Chinese, Japanese, Korean, Spanish, and Swedish.
Mata is a programming language introduced in Stata 9 (2005) with powerful matrix capabilities. The matrices may contain portions of or entire datasets. In fact, Mata matrices can be made with views of Stata datasets and frames and can have up to 281 trillion rows and columns, if the computer has sufficient memory. Mata is compiled and is very efficient. It can run up to 40 times faster than Stata's interpreted languages and is useful for CPU- and memory-intensive numerical methods involving large vectors and matrices.
To handle larger and more complex projects, Stata introduced a Project Manager in release 13 (2013) to organize data and analysis files under multiple projects.
Stata 16 (2019) introduced frames. This provides the ability to keep multiple datasets in memory and work with them simultaneously. Building on the framework for frames, two new features (the focus of this blog) were introduced in Stata 18 (2023): the ability to save and load sets of frames (or framesets) and the ability to access variables in different frames through alias variables.

Reference

Cox, N. J. 2015. A brief history of Stata on its 30th anniversary. In Thirty Years with Stata: A Retrospective, ed. E. Pinzon, 135–147. College Station, TX: Stata Press.

Resources

[D] frames
[D] frames intro
[D] frames save
[D] frames use
[D] frames describe
set dtascomplevel
[D] frlink
[D] fralias

https://www.stata.com/new-in-stata/frameset/
https://www.stata.com/new-in-stata/alias-variables-across-frames/
https://www.stata.com/features/overview/multiple-datasets-in-memory/
https://www.stata.com/features/data-management/
