Friday, April 17, 2026

Merging information, half 2: A number of-key merges


A number of-key merges come up when a couple of variable is required to uniquely establish the observations in your information. In Merging information, half 1, I mentioned single-key merges corresponding to


        . merge 1:1 personid utilizing ...

In that dialogue, every statement within the dataset might be uniquely recognized on the premise of a single variable. In panel or longitudinal datasets, there are a number of observations on every individual or factor and to uniquely establish the observations, we’d like at the least two key variables, corresponding to


        . merge 1:1 personid date utilizing ...

On this dataset we now have repeated observations on individuals and, inside individual, the observations might be uniquely recognized by the calendar date.

Simply to repair concepts, let’s assume I’ve two datasets. The primary, pattern.dta, is the one in every of analytic curiosity to me. I’ve information on 737 individuals. For every individual, I’ve information recorded on the primary and fifteenth of each month, from yr 2000 to 2009. Total, my dataset has 176,880 observations.

The second dataset accommodates further info (variables) on the pattern of curiosity. It has over 3,000 folks in it and it covers an extended stretch of time. I’ve been instructed that the majority of my 737 folks must be on this second dataset, however I’ve been warned that, as a consequence of information assortment or information processing errors over time, a small fraction won’t be discovered.

“What number of?” I requested Bob from information processing throughout a fictional dialog, my paranoia kicking in.

“I don’t know. Most. Ninety-nine p.c. It’s simply random stuff,” he replied, realizing how nicely I reply to the phrase random.

Let’s name this second set of knowledge the payroll information, though if I can think about fictional conversations, you may think about the info are one thing else. They may be information from follow-up visits of a medical experiment.

In any case, I obtain the info, and here’s what occurred once I merged the info with my pattern:


        . use pattern, clear

        . merge 1:1 personid date utilizing payroll, preserve(grasp match)

            Outcome                           # of obs.
            -----------------------------------------
            not matched                         2,352  
                from grasp                     2,352  (_merge==1)
                from utilizing                          0  (_merge==2)

            matched                           174,528  (_merge==3)
            -----------------------------------------

In my pattern information, I’ve 174,520 + 2,352 = 176,872 observations. Of these, 174,528 matched, which is 98.7 p.c. (The rationale that the variety of information from the utilizing (payroll) information that weren’t matched is zero is as a result of I specified possibility preserve(grasp match), which means I discarded the unequalled payroll information. Had I not, the quantity would have been within the low tens of millions.)

For a lot of on this scenario, the story would cease proper right here. Not for me. I wish to present you tear into multiple-key merges to reassure your self that issues actually are as they seem. You understand, in fact, that I manufactured this fictional information for this weblog entry and I buried a little bit one thing that after we discover it, would scare you if this have been an actual story. So I’ll inform you now, this story is loosely primarily based on an actual story.

Step 1: Following my very own recommendation

In Merging information, half 1 I advisable that you simply merge on all frequent variables, not simply the identification variables. This weblog entry will not be going to rehash the earlier weblog entry, however I wish to emphasize that every thing I mentioned within the earlier entry about single-key merges applies equally to multiple-key merges. These two datasets share a variable recording the division during which the worker works, so I’m included it among the many match variables:


        . use pattern, clear

        . merge 1:1 personid date division utilizing payroll, preserve(grasp match)

            Outcome                           # of obs.
            -----------------------------------------
            not matched                         2,352  
                from grasp                     2,352  (_merge==1)
                from utilizing                          0  (_merge==2)

            matched                           174,528  (_merge==3)
            -----------------------------------------

The output above matches the output once I merged date and division alone, so I don’t seem to have a merge-gone-bad downside. These merged information are wanting higher and higher.

Step 2: Merge on every key variable in isolation

Let’s think about what might go unsuitable. Think about that every one the info for sure individuals have been lacking, or that every one the info for sure dates have been lacking. Which may not be an issue, however it could definitely increase questions. Relying on the solutions, it could be price a footnote or regarding sufficient to return the info and ask for a refund.

Discovering individuals or dates which can be totally unmatched is a number of work until you already know the next trick: Merge on one key variable at a time.

Let’s begin with personid:


        . use pattern, clear

        . kind personid

        . by personid: preserve if _n==1           // <- do not skip this step
        (176143 observations deleted)

        . merge 1:m personid utilizing payroll, preserve(grasp match)

            Outcome                           # of obs.
            -----------------------------------------
            not matched                             0
            matched                           174,528  (_merge==3)
            -----------------------------------------

The output above proves that payroll.dta accommodates information on each individual that seems in pattern.dta.

Let me clarify. I started by utilizing my pattern information and maintaining only one statement for each worth of personid. I do not care which statement I preserve, I simply must preserve one and just one. Then I merged on personid, maintaining (1) the information that match and (2) the information from the grasp that don’t match. I’ve little interest in the ensuing dataset; I simply wished to see the desk merge would report. merge stories that 174,528 personids matched, and 0 didn’t. Ergo, each worth of personid that seems in pattern.dta additionally seems in payroll.dta.

Had merge reported “not matched” = 2, that might imply there would have been two values of personid showing in pattern.dta that don’t seem in payroll.dta. It might not have been an indictment of the info if two individuals weren’t matched of their entirety, however I will surely have appeared into the problem. With the merged lead to reminiscence, I might have typed


        . listing personid if _merge==1
          (output would seem) 

I might have written down the 2 personids listing reported. Then I might have returned to my pattern information and appeared on the information I had on these two folks:


        . use pattern, clear

        . listing if personid==735527
          (output would seem)

        . listing if personid==29887
          (output would seem)

It may be that 735527 was with the corporate for less than a short while and thus the lacking payroll document a plausible random occasion. If 735527 had been with the corporate all ten years, nevertheless, I might be again on the telephone in search of a proof. (If these have been medical information, definitely you’ll wish to understand how an individual who by no means reported for a follow-up go to is understood to nonetheless be alive after ten years.)

A lot for personid. Let’s do the identical for date:


        . use pattern, clear  
        . kind date
        . by date: preserve if _n==1
        (176640 observations deleted)
        . merge 1:m date utilizing payroll, preserve(grasp match)
            Outcome                           # of obs.
            -----------------------------------------
            not matched                             0
            matched                           236,832  (_merge==3)
            -----------------------------------------

Each date that seems in pattern.dta additionally seems in payroll.dta.

Lastly, let’s take a look at division:


        . use pattern, clear

        . kind division date

        . by division date: preserve if _n==1
        (175200 observations deleted)

        . merge 1:m division date utilizing payroll, preserve(grasp match)

            Outcome                           # of obs.
            -----------------------------------------
            not matched                            24
                from grasp                        24  (_merge==1)
                from utilizing                          0  (_merge==2)

            matched                           236,832  (_merge==3)
            -----------------------------------------

Each division that seems in pattern.dta seems in payroll.dta

These information are wanting higher and higher.

If we had solely two key variables, we might be finished. We, nevertheless, are performing the total merge on three variables, particularly personid, date, and division, and so there’s yet another set of comparisons we should always look at.

Step 3: Merge on each pair of key variables

With three key variables, the potential pairs are (personid, date), (personid, division), and (division, date). We’ve already checked out (personid, date), in order that simply leaves (personid, division) and (division, date).

The strategy is identical as in Step 2 besides that we sort two variables the place we beforehand typed one:


        . use pattern, clear

        . kind personid division

        . by personid division: preserve if _n==1
        (176143 observations deleted)

        . merge 1:m personid division utilizing payroll, preserve(grasp match)

            Outcome                           # of obs.
            -----------------------------------------
            not matched                             0
            matched                           174,528  (_merge==3)
            -----------------------------------------

We uncover that each personid-division mixture that seems in pattern.dta additionally seems in payroll.dta.

Final is (division, date):


        . use pattern, clear

        . kind division date

        . by division date: preserve if _n==1
        (175200 observations deleted)

        . merge 1:m division date utilizing payroll, preserve(grasp match)

            Outcome                           # of obs.
            -----------------------------------------
            not matched                            24
                from grasp                        24  (_merge==1)
                from utilizing                          0  (_merge==2)

            matched                           236,832  (_merge==3)
            -----------------------------------------

Shock! Nicely, you are not shocked as a result of I instructed you earlier we have been going to search out one thing, but when this have been actual life, you’ll be shocked in any case these reassuring outcomes.

We uncover that there are 24 division-date combos showing in pattern.dta that don’t seem in payroll.dta. Let us take a look at the 24 lacking combos:


        . listing division date if _merge==1

                +----------------------+
                | division        date |
                |----------------------|
          1129. |        5   01jan2007 |
          1130. |        5   15jan2007 |
          1131. |        5   01feb2007 |
          1132. |        5   15feb2007 |
          1133. |        5   01mar2007 |
                |----------------------|
          1134. |        5   15mar2007 |
          1135. |        5   01apr2007 |
          1136. |        5   15apr2007 |
          1137. |        5   01may2007 |
          1138. |        5   15may2007 |
                |----------------------|
          1139. |        5   01jun2007 |
          1140. |        5   15jun2007 |
          1141. |        5   01jul2007 |
          1142. |        5   15jul2007 |
          1143. |        5   01aug2007 |
                |----------------------|
          1144. |        5   15aug2007 |
          1145. |        5   01sep2007 |
          1146. |        5   15sep2007 |
          1147. |        5   01oct2007 |
          1148. |        5   15oct2007 |
                |----------------------|
          1149. |        5   01nov2007 |
          1150. |        5   15nov2007 |
          1151. |        5   01dec2007 |
          1152. |        5   15dec2007 |
                +----------------------+

When you look carefully, you’ll discover that each payroll date in 2007 is listed. So what occurred to the payroll information for division 5 in 2007? This may increasingly certainly be precisely the type of random occasion that Bob had in thoughts throughout our fictional dialog. In some way the corporate misplaced a little bit cluster of payroll information. The loss might imply imply nothing. Or it may be of vital significance. Think about there’s been an allegation that the corporate treats older staff poorly and picture that division 5 has the very best common age. Not random. Not random in any respect.

Step 4: Merge on each triplet of key variables

A lot for the fictional story.

If we had 4 or extra key variables, we might now must merge on each triplet of key variables, and if we had 5 or extra key variables, we then must merge on each quadruplet of key variables, and if …

Overlook the story. Or bear in mind it if it scares you. Information processing and paranoia make a superb pairing. What’s necessary is how simple it’s to take difficult, multiple-key merges aside. I’ve by no means met anybody but who knew this trick.



Related Articles

Latest Articles