Thursday, May 14, 2026

Essential tools for data quality checks


Before we fit statistical models with our datasets, we typically go through a few checks to confirm that our data are accurate and complete. Regardless of whether you have received data from an organization or built the dataset yourself, you need to check for data-entry errors. Below, we will show you four essential Stata commands for performing quality checks on your data: duplicates, isid, assert, and misstable.

Duplicates

We have fictional data on patients who underwent corrective eye surgery. For each patient, we have an identification number, the date they were admitted for surgery, their age and sex, and their systolic blood pressure.

. use datacheck1

. describe

Contains data from datacheck1.dta
 Observations:            19
    Variables:             7                  22 Apr 2026 12:06
-----------------------------------------------------------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-----------------------------------------------------------------------------------------------------------------------------------
patient_id      float   %9.0g                 Patient ID
sex             float   %9.0g                 Sex
age             float   %9.0g                 Age
surgery_date    float   %td                   Surgery date
birth_date      float   %td                   Birth date
bpsystol        float   %9.0g                 Systolic BP
highbp          float   %9.0g                 BP 160+
-----------------------------------------------------------------------------------------------------------------------------------
Sorted by:

Patients can have the surgery only once, with an enhancement years later. Therefore, we first want to make sure that our information on patient IDs and dates is correct. We begin by checking for any duplicate observations.

. duplicates report

Duplicates in terms of all variables

--------------------------------------
   Copies | Observations       Surplus
----------+---------------------------
        1 |           15             0
        2 |            4             2
--------------------------------------

Out of the 19 observations in this dataset, 15 are unique. For those 15 observations, we have a single copy of the information. However, there are 4 observations that are duplicates: there are two patients for whom we have two copies. We list them below:

. duplicates list

Duplicates in terms of all variables

  +---------------------------------------------------------------------------+
  | Group   Obs   patien~d   sex   age   surger~e   birth_d~e   bpsystol   highbp |
  |---------------------------------------------------------------------------|
  |     1     4          3     1    51   09sep2025   13aug1974        135        0 |
  |     1     5          3     1    51   09sep2025   13aug1974        135        0 |
  |     2     8          6     1    38   18nov2025   10oct1987        125        0 |
  |     2     9          6     1    38   18nov2025   10oct1987        125        0 |
  +---------------------------------------------------------------------------+

We have two copies each of patient IDs 3 and 6; we can see that the same information is repeated for all variables. Below, we drop the duplicates.

. duplicates drop

Duplicates in terms of all variables

(2 observations deleted)

A community-contributed command that is also useful is distinct; this command reports the number of distinct values for one or more variables. You can also report the number of distinct groups defined by multiple variables, such as the number of unique groups defined by patient ID and surgery date. Type search distinct to learn more, and follow the instructions to install it if you would like to use this command.
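If you also work in Python, the same duplicate checks can be sketched with pandas. This is a rough analogue of duplicates report, duplicates drop, and distinct, not Stata's implementation; the tiny data frame below is invented for illustration and only mimics the duplicated rows shown above.

```python
import pandas as pd

# Invented rows mimicking the duplicated patients in datacheck1.dta
df = pd.DataFrame({
    "patient_id":   [3, 3, 6, 6, 9],
    "surgery_date": ["09sep2025", "09sep2025",
                     "18nov2025", "18nov2025", "25sep2025"],
})

# Analogue of -duplicates report-: count surplus copies of fully duplicated rows
surplus = df.duplicated().sum()   # second and later copies only
print(surplus)                    # 2

# Analogue of -duplicates drop-: keep one copy of each duplicated row
df = df.drop_duplicates()

# Analogue of -distinct-: distinct values of one variable, and distinct
# groups defined by several variables
print(df["patient_id"].nunique())                                      # 3
print(len(df[["patient_id", "surgery_date"]].drop_duplicates()))       # 3
```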

Unique identifiers

With the duplicates removed, we now check whether observations are uniquely identified by the combination of patient ID and surgery date; if they are, isid will report nothing. We use the prefix capture to capture a return code in case isid does produce an error; this is useful in do-files because it allows your do-file to continue to run despite any errors. We also use the noisily prefix so we can see the error message.

. capture noisily: isid patient_id surgery_date, sort
variables patient_id and surgery_date do not uniquely identify the observations

We see that patient_id and surgery_date do not uniquely identify observations. Let's check whether we have any duplicates for patient ID:

. duplicates report patient_id

Duplicates in terms of patient_id

--------------------------------------
   Copies | Observations       Surplus
----------+---------------------------
        1 |           13             0
        2 |            4             2
--------------------------------------

These duplicates are observations that have the same value for patient ID but different values for other variables; otherwise, they would have been reported in our prior call to duplicates report. Let's take a closer look at the duplicates:

. duplicates list patient_id

Duplicates in terms of patient_id

  +------------------------+
  | Group   Obs   patien~d | 
  |------------------------|
  |     1     1          1 | 
  |     1     2          1 | 
  |     2    10          9 |
  |     2    11          9 |
  +------------------------+

. list if patient_id == 1 | patient_id == 9, abbrev(14)

     +------------------------------------------------------------------------+      
     | patient_id   sex   age   surgery_date   birth_date   bpsystol   highbp |
     |------------------------------------------------------------------------|
  1. |          1     0    34      15mar2020    10feb1986        163        1 |
  2. |          1     0    39      20mar2025    10feb1986        165        1 |
 10. |          9     1    45      25sep2025    20jun1980        140        0 |
 11. |          9     0    47      25sep2025    17jul1978        135        0 |
     +------------------------------------------------------------------------+

Observations 1 and 2 both have patient IDs equal to 1, and they have the same values for birth_date and sex. This appears to be the same patient: they originally had surgery in 2020 and visited in 2025 for a touch-up. Therefore, these two observations are duplicates for patient_id but are not truly duplicates because they differ on other variables, like age and surgery_date. For some data purposes, you may want to drop these types of observations; you could do so by typing the following:

duplicates drop patient_id, force

The force option is required here because you are dropping observations that are duplicates in terms of one variable but that are unique based on the values of other variables. If we were to issue this command, we would be dropping information about this patient's enhancement surgery, which we do not want to do; therefore, be aware that you are losing data when dropping these types of observations.

We also see that observations 10 and 11 both have patient IDs equal to 9. They have the same surgery date but different values for birth date and sex, so this appears to be a data-entry error. We need to change the patient ID for one of these observations to another value; let's check the current range of ID numbers.

. codebook patient_id
     
-----------------------------------------------------------------------------------------------------------------------------------
patient_id                                                                                                               Patient ID
-----------------------------------------------------------------------------------------------------------------------------------

                  Type: Numeric (float)

                 Range: [1,15]                        Units: 1
         Unique values: 15                        Missing .: 0/17

                  Mean: 7.64706
             Std. dev.: 4.52688

           Percentiles:     10%       25%       50%       75%       90%
                              1         4         8        11        14

We have patient IDs ranging from 1 to 15. To make sure that the ID number is unique to each patient, we can change the patient ID to 0 or 16; we choose 16.

. replace patient_id = 16 in 11
(1 real change made)

codebook is useful for checking the range, units, and number of missing values for a variable. If you want a closer look at the frequency of each value, consider using fre; this community-contributed command creates one-way frequency tables, and it is especially useful when you are using value labels. For example, you may want to check how many observations there are per county; fre would display the county number and label, such as ``1 Los Angeles'' and ``2 Bronx''. Type search fre to learn more, and follow the instructions to install it if you would like to use this command.

We now run isid once more to confirm that we can uniquely identify each patient.

. isid patient_id surgery_date

Nothing is reported. We can confirm that we have one observation per patient and surgery date.
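The same unique-identifier check can be sketched in pandas, for readers mirroring these steps in Python. This is only an analogue of isid under invented toy data, not Stata's implementation:

```python
import pandas as pd

# Invented rows: one patient with two surgery dates (original + enhancement)
df = pd.DataFrame({
    "patient_id":   [1, 1, 9],
    "surgery_date": ["15mar2020", "20mar2025", "25sep2025"],
})

# Analogue of -isid patient_id surgery_date-: fail loudly if the pair of
# variables does not uniquely identify the observations
key_dupes = df.duplicated(subset=["patient_id", "surgery_date"])
assert not key_dupes.any(), \
    "patient_id and surgery_date do not uniquely identify the observations"
print("key is unique")
```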

Verify truth of claim

Next, we want to make sure that our variable highbp was coded correctly. We consider systolic blood pressures of 160 or higher to be high. Let's confirm that we have a value of 1 for highbp for observations with a systolic blood pressure of at least 160. We specify the expression that highbp equals 1 when bpsystol is greater than or equal to 160; if the statement is true for all observations, assert will report nothing. However, if it is not true, even for just one observation, the output will let us know that it is false.

. capture noisily: assert highbp == 1 if bpsystol >= 160
2 contradictions in 9 observations
assertion is false

assert checks whether our expression is true for each observation, and it reports that there are 2 contradictions. If you are working with a large dataset, consider using the fast option, which forces assert to stop at the first contradiction. This way, you do not have to wait while assert checks every observation.
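For comparison, a rough pandas analogue of this assertion follows; rather than stopping with an error, it counts the contradicting observations directly. The data frame is invented for illustration:

```python
import pandas as pd

# Invented rows: one observation with bpsystol 179 mistakenly coded highbp == 0
df = pd.DataFrame({
    "bpsystol": [135, 179, 165],
    "highbp":   [0,   0,   1],
})

# Analogue of -assert highbp == 1 if bpsystol >= 160-:
# an observation contradicts the claim when bpsystol >= 160 but highbp != 1
contradictions = ((df["bpsystol"] >= 160) & (df["highbp"] != 1)).sum()
print(contradictions)   # 1
```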

There are two observations for which our expression is false. This could be because systolic blood pressure was in fact low but highbp was mistakenly coded as 1, or because blood pressure was high but highbp was mistakenly coded as 0. We check for both below.

. list if highbp == 1 & bpsystol < 160

. list if highbp == 0 & bpsystol >= 160

     +------------------------------------------------------------------+
     | patien~d   sex   age   surger~e   birth_d~e   bpsystol   highbp |
     |------------------------------------------------------------------|
 13. |       11     0    24   12dec2025   11nov2001        179        0 |
 15. |       13     1    34   26oct2025   15sep1991          .        0 |
     +------------------------------------------------------------------+

For observation 13, highbp should instead have been coded as 1. We make that change below.

. replace highbp = 1 in 13
(1 real change made)

Check for missing values

For observation 15, the blood pressure was missing, so highbp should be missing too. Let's see how many missing values we have in our dataset.

. misstable summarize
                                                               Obs<.
                                                +------------------------------ 
               |                                | Unique
      Variable |     Obs=.     Obs>.     Obs<.  | values        Min         Max
  -------------+--------------------------------+------------------------------
      bpsystol |         2                  15  |     12        115         187
  -----------------------------------------------------------------------------

The variable bpsystol is the only one with missing values. Let's make sure that highbp is missing for the other observation for which bpsystol is also missing.

. list if missing(bpsystol)

     +------------------------------------------------------------------+
     | patien~d   sex   age   surger~e   birth_d~e   bpsystol   highbp |
     |------------------------------------------------------------------|
 14. |       12     1    26   14nov2025   10oct1999          .        1 |
 15. |       13     1    34   26oct2025   15sep1991          .        0 |
     +------------------------------------------------------------------+

We need to replace both values with the system missing value.

. replace highbp = . if bpsystol == .
(2 real changes made, 2 to missing)

With that final change, we check the truth of our claim once more. We assert that highbp equals 1 when bpsystol is not missing and is greater than or equal to 160.

. assert highbp == 1 if bpsystol >= 160 & bpsystol != .

Our assertion is now true.
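The missing-value workflow above can likewise be sketched in pandas, where missing values are NaN rather than Stata's system missing. This is an invented toy analogue of misstable summarize, the replace command, and the final assert:

```python
import pandas as pd
import numpy as np

# Invented rows: two observations with bpsystol missing, as in the listing above
df = pd.DataFrame({
    "bpsystol": [135.0, np.nan, np.nan],
    "highbp":   [0.0,   1.0,    0.0],
})

# Analogue of -misstable summarize-: missing counts per variable
print(df.isna().sum())

# Analogue of -replace highbp = . if bpsystol == .-
df.loc[df["bpsystol"].isna(), "highbp"] = np.nan

# Analogue of the final -assert-: among nonmissing bpsystol, require
# highbp == 1 whenever bpsystol >= 160
high = df["bpsystol"].notna() & (df["bpsystol"] >= 160)
assert (df.loc[high, "highbp"] == 1).all()
```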

That is how you can check for duplicates and missing values, confirm whether you have a unique identifier, and verify whether statements about your data are in fact true.


