Before we fit statistical models to our datasets, we typically go through a few checks to confirm that our data are accurate and complete. Regardless of whether you obtained data from an organization or built the dataset yourself, it is worth checking for data-entry errors. Below, we will show you four essential Stata commands for performing quality checks on your data: duplicates, isid, assert, and misstable.
Duplicates
We have fictional data on patients who underwent corrective eye surgery. For each patient, we have an identification number, the date they were admitted for surgery, their age and sex, and their systolic blood pressure.
. use datacheck1
. describe
Contains data from datacheck1.dta
Observations: 19
Variables: 7 22 Apr 2026 12:06
-----------------------------------------------------------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
-----------------------------------------------------------------------------------------------------------------------------------
patient_id float %9.0g Patient ID
sex float %9.0g Sex
age float %9.0g Age
surgery_date float %td Surgery date
birth_date float %td Birth date
bpsystol float %9.0g Systolic BP
highbp float %9.0g BP 160+
-----------------------------------------------------------------------------------------------------------------------------------
Sorted by:
Patients can have the surgery only once, with a touch-up years later. Therefore, we first want to make sure that our information on patient IDs and dates is correct. We begin by checking for any duplicate observations.
. duplicates report
Duplicates in terms of all variables
--------------------------------------
Copies | Observations Surplus
----------+---------------------------
1 | 15 0
2 | 4 2
--------------------------------------
Out of the 19 observations in this dataset, 15 are unique; for these 15 observations, we have a single copy of the information. However, there are 4 observations that are duplicates: there are two patients for whom we have two copies. We list them below:
. duplicates list
Duplicates in terms of all variables
+--------------------------------------------------------------------------------+
| Group Obs patien~d sex age surgery~e birth_d~e bpsystol highbp |
|--------------------------------------------------------------------------------|
| 1 4 3 1 51 09sep2025 13aug1974 135 0 |
| 1 5 3 1 51 09sep2025 13aug1974 135 0 |
| 2 8 6 1 38 18nov2025 10oct1987 125 0 |
| 2 9 6 1 38 18nov2025 10oct1987 125 0 |
+--------------------------------------------------------------------------------+
We have two copies of patient IDs 3 and 6; we can see that the same information is repeated across all variables. Below, we drop the duplicates.
. duplicates drop
Duplicates in terms of all variables
(2 observations deleted)
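If you would rather review the duplicated observations before deleting them, the examples subcommand shows one representative observation from each group of duplicates; a minimal sketch:

```stata
* display one example observation from each group of duplicates
duplicates examples
```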
A community-contributed command that is also useful is distinct; this command reports the number of distinct values for one or more variables. You can also report the number of distinct groups defined by several variables, such as the number of unique groups defined by patient ID and surgery date. Type search distinct to learn more, and follow the instructions to install it if you would like to use this command.
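As a sketch of how distinct might be used on these data, assuming it has been installed (for example, with ssc install distinct):

```stata
* number of distinct values of patient ID
distinct patient_id

* number of distinct combinations of patient ID and surgery date
distinct patient_id surgery_date, joint
```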
Unique identifiers
With the duplicates removed, we now check whether observations are uniquely identified by the combination of patient ID and surgery date; if they are, isid will report nothing. We use the prefix capture to capture the return code in case isid does produce an error; this is useful in do-files because it allows your do-file to continue running despite any errors. We also use the noisily prefix so we can see the error message.
. capture noisily: isid patient_id surgery_date, sort
variables patient_id and surgery_date do not uniquely identify the observations
We see that patient_id and surgery_date do not uniquely identify observations. Let's check whether we have any duplicates for patient ID:
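When this check runs inside a do-file, you can branch on the return code that capture stores in _rc; a minimal sketch:

```stata
capture isid patient_id surgery_date
if _rc {
    * isid exited with an error, so the two variables are not a unique key
    display as error "patient_id and surgery_date do not uniquely identify observations"
    duplicates list patient_id surgery_date
}
```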
. duplicates report patient_id
Duplicates in terms of patient_id
--------------------------------------
Copies | Observations Surplus
----------+---------------------------
1 | 13 0
2 | 4 2
--------------------------------------
These duplicates are observations that have the same value for patient ID but different values for other variables; otherwise, they would have been reported in our prior call to duplicates report. Let's take a closer look at the duplicates:
. duplicates list patient_id
Duplicates in terms of patient_id
+------------------------+
| Group Obs patien~d |
|------------------------|
| 1 1 1 |
| 1 2 1 |
| 2 10 9 |
| 2 11 9 |
+------------------------+
. list if patient_id == 1 | patient_id == 9, abbrev(14)
+------------------------------------------------------------------------+
| patient_id sex age surgery_date birth_date bpsystol highbp |
|------------------------------------------------------------------------|
1. | 1 0 34 15mar2020 10feb1986 163 1 |
2. | 1 0 39 20mar2025 10feb1986 165 1 |
10. | 9 1 45 25sep2025 20jun1980 140 0 |
11. | 9 0 47 25sep2025 17jul1978 135 0 |
+------------------------------------------------------------------------+
Observations 1 and 2 both have patient IDs equal to 1; they have the same values for birth_date and sex. This seems to be the same patient; they originally had surgery in 2020 and visited in 2025 for a touch-up. Therefore, these two observations are duplicates for patient_id but are not truly duplicates because they differ on other variables, like age and surgery_date. For some data purposes, you may want to drop these types of observations; you can do so by typing the following:
duplicates drop patient_id, force
The force option is required here because you are dropping observations that are duplicates in terms of one variable but that are unique based on values of other variables. If we were to issue this command, we would be dropping information about this patient's touch-up surgery, which we do not want to do; therefore, be aware that you are dropping data when dropping these types of observations.
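If you want to inspect such observations before deciding whether to drop them, duplicates tag offers a sketch of a safer workflow:

```stata
* flag every observation that shares a patient ID with another observation
duplicates tag patient_id, generate(dup_id)
list if dup_id > 0, abbrev(14)
```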
We also see that observations 10 and 11 both have patient IDs equal to 9. They have the same surgery date but different values for birth date and sex, so this seems to be a data-entry error. We need to change the patient ID for one of these observations to another value; let's check the current range of ID numbers.
. codebook patient_id
-----------------------------------------------------------------------------------------------------------------------------------
patient_id Patient ID
-----------------------------------------------------------------------------------------------------------------------------------
Type: Numeric (float)
Range: [1,15] Units: 1
Unique values: 15 Missing .: 0/17
Mean: 7.64706
Std. dev.: 4.52688
Percentiles: 10% 25% 50% 75% 90%
1 4 8 11 14
We have patient IDs ranging from 1 to 15. To make sure that the ID number is unique to each patient, we can change this patient's ID to 0 or 16; we choose 16.
. replace patient_id = 16 in 11
(1 real change made)
codebook is useful for checking the range, units, and number of missing values of a variable. If you want a closer look at the frequency of each value, consider using fre; this community-contributed command creates one-way frequency tables, and it is especially useful when you are using value labels. For example, you may want to check how many observations there are per county; fre would display the county number and label, such as "1 Los Angeles" and "2 Bronx". Type search fre to learn more, and follow the instructions to install it if you would like to use this command.
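Assuming fre has been installed (for example, with ssc install fre), and using the hypothetical labeled variable county from the example above, usage is simply:

```stata
* one-way frequency table showing both numeric codes and value labels
fre county
```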
We now run isid once more to confirm that we can uniquely identify each patient.
. isid patient_id surgery_date
Nothing is reported. We can confirm that we have one observation per patient and surgery date.
Verify truth of claim
Next, we want to make sure that our variable highbp was coded correctly. We consider systolic blood pressures of 160 or greater to be high. Let's confirm that we have a value of 1 for highbp for observations with a systolic blood pressure of at least 160. We specify the expression that highbp is equal to 1 when bpsystol is greater than or equal to 160; if the statement is true for all observations, assert will report nothing. However, if it is not true, even for just one observation, the output will let us know that it is false.
. capture noisily: assert highbp == 1 if bpsystol >= 160
2 contradictions in 9 observations
assertion is false
assert checks whether our expression is true for each observation, and it reports that there are 2 contradictions. If you are working with a large dataset, consider using the fast option, which forces assert to stop at the first contradiction. This way, you do not have to wait while assert checks every observation.
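On a large dataset, the same check with the fast option would look like this sketch:

```stata
* stop at the first contradiction instead of scanning every observation
assert highbp == 1 if bpsystol >= 160, fast
```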
There are two observations for which our expression is false. This could be because systolic blood pressure was in fact low but highbp was mistakenly coded as 1, or because blood pressure was high but highbp was mistakenly coded as 0. We check for both below.
. list if highbp == 1 & bpsystol < 160
. list if highbp == 0 & bpsystol >= 160
+------------------------------------------------------------------+
| patien~d sex age surgery~e birth_d~e bpsystol highbp |
|------------------------------------------------------------------|
13. | 11 0 24 12dec2025 11nov2001 179 0 |
15. | 13 1 34 26oct2025 15sep1991 . 0 |
+------------------------------------------------------------------+
For observation 13, highbp should instead have been coded as 1. We make that change below. Observation 15 is also listed because Stata treats missing values as larger than any number, so bpsystol >= 160 is true when bpsystol is missing; we address that observation in the next section.
. replace highbp = 1 in 13
(1 real change made)
Check for missing values
For observation 15, the blood pressure was missing, so highbp should be missing too. Let's see how many missing values we have in our dataset.
. misstable summarize
Obs<.
+------------------------------
| | Unique
Variable | Obs=. Obs>. Obs<. | values Min Max
-------------+--------------------------------+------------------------------
bpsystol | 2 15 | 12 115 187
-----------------------------------------------------------------------------
The variable bpsystol is the only one with missing values. Let's make sure that highbp is missing for the other observation for which bpsystol is also missing.
. list if missing(bpsystol)
+------------------------------------------------------------------+
| patien~d sex age surgery~e birth_d~e bpsystol highbp |
|------------------------------------------------------------------|
14. | 12 1 26 14nov2025 10oct1999 . 1 |
15. | 13 1 34 26oct2025 15sep1991 . 0 |
+------------------------------------------------------------------+
We need to replace both values with the system missing value.
. replace highbp = . if bpsystol == .
(2 real changes made, 2 to missing)
With that final change, we check the truth of our claim once more. We assert that highbp is equal to 1 when bpsystol is not missing and is greater than or equal to 160.
. assert highbp == 1 if bpsystol >= 160 & bpsystol != .
Our assertion is now true.
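An equivalent way to write this condition uses the missing() function, which also guards against the extended missing values .a through .z:

```stata
* missing() catches ., .a, .b, ..., .z
assert highbp == 1 if bpsystol >= 160 & !missing(bpsystol)
```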
That is how you can check for duplicates and missing values, and how you can confirm whether you have a unique identifier and whether statements about your data are in fact true.
