Monday, December 1, 2025

Import COVID-19 information from Johns Hopkins College


Like lots of you, I’m working from house and checking the most recent information on COVID-19 often. I see a variety of numbers and graphs, so I appeared round for the “official information”. Probably the greatest information sources I’ve discovered is on the GitHub web site for Johns Hopkins Whiting College of Engineering Middle for Techniques Science and Engineering. The info for every day are saved in a separate file, so I wrote just a little Stata command referred to as covid19 to obtain, mix, save, and graph these information.

The command creates a desk that incorporates the date, the variety of confirmed instances, the variety of deaths, and the quantity recovered. I added a calculated variable named newcases, which is the distinction between confirmed instances for 2 contiguous days.

The unique information embody all reporting nations, so I added a nation() choice that permits me to view the desk for a selected nation.

. covid19, nation("US") graph saving(covid19_usa)

   COVID-19 information for US as of 19 Mar 2020

  +--------------------------------------------------------+
  |       date   confirmed   newcases   deaths   recovered |
  |--------------------------------------------------------|
  | 01/22/2020           1          1        0           0 |
  | 01/23/2020           1          0        0           0 |
  | 01/24/2020           2          1        0           0 |
  | 01/25/2020           2          0        0           0 |

                     (Output omitted)

  | 03/14/2020        2726        547       54          12 |
  | 03/15/2020        3499        773       63          12 |
  | 03/16/2020        4632       1133       85          17 |
  | 03/17/2020        6421       1789      108          17 |
  | 03/18/2020        7786       1365      118         106 |
  +--------------------------------------------------------+

I additionally added a graph choice that plots the variety of confirmed instances.

Determine 1: Confirmed instances of COVID-19 in the USA

The saving() choice saves the information to a file named covid19_usa.dta.

. use covid19_usa, clear

. listing in -10/l

     +------------------------------------------------------+
     |       date   confir~d   newcases   deaths   recove~d |
     |------------------------------------------------------|
 48. | 03/09/2020        605         68       22          8 |
 49. | 03/10/2020        959        354       28          8 |
 50. | 03/11/2020       1281        322       36          8 |
 51. | 03/12/2020       1663        382       40         12 |
 52. | 03/13/2020       2179        516       47         12 |
     |------------------------------------------------------|
 53. | 03/14/2020       2726        547       54         12 |
 54. | 03/15/2020       3499        773       63         12 |
 55. | 03/16/2020       4632       1133       85         17 |
 56. | 03/17/2020       6421       1789      108         17 |
 57. | 03/18/2020       7786       1365      118        106 |
     +------------------------------------------------------+

I like my covid19 command, however I’ve no plans to distribute it. I haven’t examined it totally, and it may very well be rendered ineffective if or when Johns Hopkins modifications its information. So I’m going to indicate you the steps to obtain and course of the information so you are able to do this your self. If the information change, you’ll have the instruments to switch your code and obtain future information.

Johns Hopkins College GitHub information

GitHub is a well-liked web site/platform for software program improvement and distribution. The Johns Hopkins Whiting College of Engineering Middle for Techniques Science and Engineering has a GitHub repository that features usually up to date COVID-19 information from all over the world. The uncooked information can be found right here: Johns Hopkins GitHub information.

Determine 2: Johns Hopkins GitHub information for COVID-19

graph1

The info are saved in separate .csv information for every day. Let’s click on on the file named 01-29-2020.csv to view its contents. This file incorporates information for the Province/State, the Nation/Area, the date and time of the Final Replace, the variety of Confirmed instances, the variety of Deaths, and the quantity Recovered.

Determine 3: Contents of 01-29-2020.csv

graph1

You’ll be able to view the uncooked, comma-delimited information by clicking on the “Uncooked” button subsequent to the pink arrow. Word that I added the pink arrow; it doesn’t seem on the web site.

Determine 4: Uncooked comma-delimited information in 01-29-2020.csv

graph1

You’ll be able to import these uncooked information from the web site into Stata utilizing import delimited. The filename is the URL within the deal with bar of your internet browser.

import delimited "https://uncooked.githubusercontent.com/CSSEGISandData/COVID-19/grasp/csse_covid_19_data/csse_covid_19_daily_reports/01-29-2020.csv"

Your internet browser will seemingly wrap the URL to a second line. Word that it is a one-line command in Stata. I can then describe and listing the information in Stata.

. describe

Incorporates information
  obs:            54
 vars:             6
------------------------------------------------------------------------
              storage   show    worth
variable title   sort    format     label      variable label
------------------------------------------------------------------------
ïprovincestate  str16   %16s                  Province/State
countryregion   str20   %20s                  Nation/Area
lastupdate      str13   %13s                  Final Replace
confirmed       int     %8.0g                 Confirmed
deaths          int     %8.0g                 Deaths
recovered       byte    %8.0g                 Recovered
------------------------------------------------------------------------
Sorted by:
     Word: Dataset has modified since final saved.

. listing in 1/5
   +---------------------------------------------------------------------------+
   | ïprovin~e    countryregion      lastupdate   confir~d   deaths   recove~d |
   |---------------------------------------------------------------------------|
1. |     Hubei   Mainland China   1/29/20 19:30       3554      125         88 |
2. |  Zhejiang   Mainland China   1/29/20 19:30        296        .          3 |
3. | Guangdong   Mainland China   1/29/20 19:30        277        .          5 |
4. |     Hunan   Mainland China   1/29/20 19:30        221        .          . |
5. |     Henan   Mainland China   1/29/20 19:30        206        2          1 |
   +---------------------------------------------------------------------------+

We did it! We efficiently imported the uncooked information for January 29, 2020, from the Johns Hopkins GitHub repository into Stata! It’s just one file, and there may be no less than one apparent subject with the information (the variable named ïprovincestate), nevertheless it’s a very good place to begin.
Utilizing verify and if to repair variable names

I realized by way of trial and error that the variable provincestate is imported as ïprovincestate in some information and provincestate in others. We need to import many information, and we don’t need to should look by way of every file manually. So let’s use verify to verify every file for a variable named ïprovincestate.

. verify variable ïprovincestate

. show _rc
0

If the variable ïprovincestate exists, the command will retailer a worth of zero to _rc. We will use this end in an if assertion that can rename and appropriately label the variable.

if _rc == 0 {
   rename ïprovincestate provincestate
   label variable provincestate "Province/State"
}

This dataset incorporates the variable ïprovincestate, so the instructions contained in the if block are executed. We will verify this utilizing describe.

. describe provincestate

              storage   show    worth
variable title   sort    format     label      variable label
------------------------------------------------------------------------
provincestate   str16   %16s                  Province/State

Our dataset now incorporates a variable named provincestate. Let’s sort verify variable once more to see what occurs if the variable ïprovincestate shouldn’t be in our dataset.

. verify variable ïprovincestate
variable ïprovincestate not discovered
r(111);

This dataset doesn’t comprise a variable named ïprovincestate, so verify variable returns an error message. If we execute verify variable inside a do-file, the do-file will cease working. We will use seize to permit our do-file to proceed working even when verify variable returns an error.

. seize verify variable ïprovincestate

. show _rc
111

Word that _rc equals 111, which isn’t equal to 0. So the code under will import the uncooked information and “repair” ïprovincestate if obligatory.

import delimited "https://uncooked.githubusercontent.com/CSSEGISandData/COVID-19/grasp/csse_covid_19_data/csse_covid_19_daily_reports/01-29-2020.csv"
if _rc == 0 {
   rename ïprovincestate provincestate
   label variable provincestate "Province/State"
}

Utilizing macros to import completely different information

Subsequent let’s see how macros might help us import completely different information from the web site. We will retailer phrases or numbers quickly in reminiscence utilizing macros. Macros might be native or world. International macros, as soon as outlined, are obtainable anyplace in Stata. Native macros exist solely throughout the program or do-file wherein they’re outlined. You’ll be able to outline an area macro like this:

. native right now = "3-19-2020"

You’ll be able to consult with the macro by typing a left single quote (to the left of the “one key”) on the left aspect of the macro title and a proper single quote (to the left of the “Enter” key) on the fitting aspect of the macro title.

. show "`right now'"
3-19-2020

We will create macros by combining different macros like this:

. native month = "3"

. native day = "19"

. native yr = "2020"

. native right now = "`month'-`day'-`yr'"

. show "`right now'"
3-19-2020

We want to import information with names like 01-29-2020.csv. Word that the month features a “main zero”. Months 1–9 are specified as “01”, “02”, and so forth. Months 10–12 are specified as “10”, “11”, and “12”. We want a approach to assign values to native macros that features a main zero when applicable. The string() perform is a technique to do that.

. native month = string(3, "%02.0f")

. native day = string(19, "%02.0f")

. native yr = "2020"

. native right now = "`month'-`day'-`yr'"

. show "`right now'"
03-19-2020

The string() perform for the macro month saved the quantity 3 to a two-character string with a number one zero. The string() perform for the macro day saved the quantity 19 to a two-character string with a number one zero, however the main zero shouldn’t be obligatory. This may permit us to specify arbitrary dates with the right format for the information we want to import.

Subsequent, let’s contemplate the whole filename. Many of the prolonged filename is identical for each file. Let’s retailer the URL for the filenames to an area macro.

. native URL = "https://uncooked.githubusercontent.com/CSSEGISandData/  
> COVID-19/grasp/csse_covid_19_data/csse_covid_19_daily_reports/"

. show "`URL'"
https://uncooked.githubusercontent.com/CSSEGISandData/COVID-19/grasp/
> csse_covid_19_data/csse_covid_19_daily_reports/

This permits us to specify a filename like this:

. native FileName = "`URL'`right now'.csv"

. import delimited "`FileName'"
(8 vars, 284 obs)

Utilizing loops to import a number of information

We will repeat a command or assortment of instructions by inserting them in a loop. A method to do that is with forvalues.

forvalues month = 1/12 {
   show "month = `month'"
}

Word that month is saved as an area macro, and we will consult with it as `month’ contained in the loop. So the outcomes of our loop appear like this:

. forvalues month = 1/12 {
  2.     show "month = `month'"
  3. }
month = 1
month = 2
month = 3
month = 4
month = 5
month = 6
month = 7
month = 8
month = 9
month = 10
month = 11
month = 12

We will additionally create a loop inside a loop. For instance, we may loop over days inside months like this:

forvalues month = 1/12 {
   forvalues day = 1/31 {
      show "month = `month', day = `day'"
   }
}

The outcomes of this loop are fairly prolonged, so I’ve omitted a lot of the output.

. forvalues month = 1/12 {
  2.     forvalues day = 1/31 {
  3.         show "month = `month', day = `day'"
  4.     }
  5. }
month = 1, day = 1
month = 1, day = 2
month = 1, day = 3
month = 1, day = 4

(Output omitted)

month = 12, day = 28
month = 12, day = 29
month = 12, day = 30
month = 12, day = 31

Let’s use our loop to create names for the information we want to obtain.

forvalues month = 1/12 {
   forvalues day = 1/31 {
      native month = string(`month', "%02.0f")
      native day = string(`day', "%02.0f")
      native yr = "2020"
      native right now = "`month'-`day'-`yr'"
      show "`right now'"
   }
}

This loop additionally returns a variety of output, so I’ve omitted most of it for brevity. Word that the string() perform provides main zeros for month and day the place obligatory.

. forvalues month = 1/12 {
  2.     forvalues day = 1/31 {
  3.         native month = string(`month', "%02.0f")
  4.         native day   = string(`day', "%02.0f")
  5.         native yr  = "2020"
  6.         native right now = "`month'-`day'-`yr'"
  7.         show "`right now'"
  8.     }
  9. }
01-01-2020
01-02-2020
01-03-2020
01-04-2020

(Output omitted)

12-28-2020
12-29-2020
12-30-2020
12-31-2020

We will use the macro URL together with import delimited to import the uncooked information for day-after-day in 2020. We might want to clear Stata’s reminiscence earlier than we import every uncooked information file. Then, we will save a replica of every file as a Stata dataset.

native URL = "https://uncooked.githubusercontent.com/CSSEGISandData/COVID-19/grasp/csse_covid_19_data/csse_covid_19_daily_reports/"
forvalues month = 1/12 {
   forvalues day = 1/31 {
      native month = string(`month', "%02.0f")
      native day = string(`day', "%02.0f")
      native yr = "2020"
      native right now = "`month'-`day'-`yr'"
      native FileName = "`URL'`right now'.csv"
      clear
      import delimited "`FileName'"
      save "`right now'"
   }
}

I’m positive you might have already anticipated the flaw with the code above. It loops over day-after-day of each month of 2020. However the information information start on January 21, 2020, so there isn’t a information file for January 1. Our program returns an error message.

. native URL = "https://uncooked.githubusercontent.com/CSSEGISandData/  
> COVID-19/grasp/csse_covid_19_data/csse_covid_19_daily_reports/"

. forvalues month = 1/12 {
  2.     forvalues day = 1/31 {
  3.         native month    = string(`month', "%02.0f")
  4.         native day      = string(`day', "%02.0f")
  5.         native yr     = "2020"
  6.         native right now    = "`month'-`day'-`yr'"
  7.         native FileName = "`URL'`right now'.csv"
  8.         clear
  9.         import delimited "`FileName'"
 10.         save "`right now'"
 11.     }
 12. }
file
    https://uncooked.githubusercontent.com/CSSEGISandData/COVID-19/grasp/
    > csse_covid_19_data/csse_covid_19_daily_reports/01-01-2020.csv not discovered
couldn't open url
r(603);

New information may even be added day-after-day, and I want to import future information as soon as they’re posted. There are numerous intelligent methods to deal with this example. A method is to easily precede import delimited and save with seize. Information information shall be imported in the event that they exist. If the information file doesn’t exist, this system will proceed. It’s a gradual and crude methodology, however it’ll work for our functions.

native URL = "https://uncooked.githubusercontent.com/CSSEGISandData/COVID-19/grasp/csse_covid_19_data/csse_covid_19_daily_reports/"
forvalues month = 1/12 {
   forvalues day = 1/31 {
      native month = string(`month', "%02.0f")
      native day = string(`day', "%02.0f")
      native yr = "2020"
      native right now = "`month'-`day'-`yr'"
      native FileName = "`URL'`right now'.csv"
      clear
      seize import delimited "`FileName'"
      seize save "`right now'"
   }
}

Our seize trick labored, and I can view a listing of the Stata information information in my native listing.

. ls
     3/19/20 16:23  .
     3/19/20 16:23  ..
   6.0k   3/19/20 16:23  01-22-2020.dta
   6.3k   3/19/20 16:23  01-23-2020.dta
   6.1k   3/19/20 16:23  01-24-2020.dta
   6.2k   3/19/20 16:23  01-25-2020.dta

         (Output omitted)

  30.0k   3/19/20 16:23  03-15-2020.dta
  31.3k   3/19/20 16:23  03-16-2020.dta
  31.7k   3/19/20 16:23  03-17-2020.dta
  32.4k   3/19/20 16:23  03-18-2020.dta

Let’s open the file 01-22-2020.dta and describe the information.

. describe utilizing 01-22-2020.dta

Incorporates information
  obs:            38                          19 Mar 2020 16:45
 vars:             6
------------------------------------------------------------------------
              storage   show    worth
variable title   sort    format     label      variable label
------------------------------------------------------------------------
ïprovincestate  str14   %14s                  Province/State
countryregion   str14   %14s                  Nation/Area
lastupdate      str15   %15s                  Final Replace
confirmed       int     %8.0g                 Confirmed
deaths          byte    %8.0g                 Deaths
recovered       byte    %8.0g                 Recovered
------------------------------------------------------------------------
Sorted by:

That is a kind of information that comprise the variable ïprovincestate. We realized learn how to repair this earlier utilizing verify and if, so let’s put our “repair” in our code and rerun it. Word that I’ve added the choice , exchange to the save instructions as a result of the information exist already in our listing.

native URL = "https://uncooked.githubusercontent.com/CSSEGISandData/COVID-19/grasp/csse_covid_19_data/csse_covid_19_daily_reports/"
forvalues month = 1/12 {
   forvalues day = 1/31 {
      native month = string(`month', "%02.0f")
      native day = string(`day', "%02.0f")
      native yr = "2020"
      native right now = "`month'-`day'-`yr'"
      native FileName = "`URL'`right now'.csv"
      clear
      seize import delimited "`FileName'"
      seize verify variable ïprovincestate
      if _rc == 0 {
         rename ïprovincestate provincestate
         label variable provincestate "Province/State"
      }
   seize save "`right now'", exchange
   }
}

Now we will describe the dataset 01-22-2020.dta and confirm that the variable ïprovincestate has been modified to provincestate.

. describe utilizing 01-22-2020.dta

Incorporates information
  obs:            38                          19 Mar 2020 16:35
 vars:             6
------------------------------------------------------------------------
              storage   show    worth
variable title   sort    format     label      variable label
------------------------------------------------------------------------
provincestate   str14   %14s                  Province/State
countryregion   str14   %14s                  Nation/Area
lastupdate      str15   %15s                  Final Replace
confirmed       int     %8.0g                 Confirmed
deaths          byte    %8.0g                 Deaths
recovered       byte    %8.0g                 Recovered
------------------------------------------------------------------------
Sorted by:

Our native listing now incorporates a set of Stata information information that we want to mix right into a single file. We will do that utilizing append. We may append two information information manually like this:

. clear

. append utilizing 01-22-2020.dta

. append utilizing 01-23-2020.dta

However that might be tedious, and we must frequently replace our code. One other loop would permit us to automate this course of. Word that I’ve used the seize trick earlier than append as a result of we don’t have a file for day-after-day of 2020.

clear
forvalues month = 1/12 {
   forvalues day = 1/31 {
      native month = string(`month', "%02.0f")
      native day = string(`day', "%02.0f")
      native yr = "2020"
      native right now = "`month'-`day'-`yr'"
      seize append utilizing "`right now'"
      }
}

I’ve described our information under and see that there are two new variables. Among the uncooked information information comprise the variables latitude and longitude, and so they have been appended to our full dataset.

. describe

Incorporates information
  obs:         6,722
 vars:             8
------------------------------------------------------------------------
              storage   show    worth
variable title   sort    format     label      variable label
------------------------------------------------------------------------
provincestate   str43   %43s                  Province/State
countryregion   str32   %32s                  Nation/Area
lastupdate      str19   %19s                  Final Replace
confirmed       lengthy    %8.0g                 Confirmed
deaths          int     %8.0g                 Deaths
recovered       lengthy    %8.0g                 Recovered
latitude        float   %9.0g                 Latitude
longitude       float   %9.0g                 Longitude
------------------------------------------------------------------------
Sorted by:
     Word: Dataset has modified since final saved.

A listing of the final 10 observations confirms that the dataset consists of some information for latitude and longitude.

. listing countryregion latitude longitude in -10/l

      +---------------------------------------------+
      |         countryregion   latitude   longit~e |
      |---------------------------------------------|
7917. |                France   -12.8431    45.1383 |
7918. |                France   -21.1351    55.2471 |
7919. |             Greenland         72        -40 |
7920. |                  Guam    13.4443   144.7937 |
7921. |              Guernsey      49.45      -2.58 |
      |---------------------------------------------|
7922. |                Jersey      49.19      -2.11 |
7923. |           Puerto Rico       18.2      -66.5 |
7924. | Republic of the Congo      -1.44     15.556 |
7925. |           The Bahamas      24.25        -76 |
7926. |            The Gambia    13.4667      -16.6 |
      +---------------------------------------------+

Conclusion and objectives for subsequent time

Let’s cease right here and take into consideration what we’ve achieved. We now have efficiently downloaded all of the uncooked information obtainable for COVID-19 from the Johns Hopkins GitHub repository and mixed them right into a single dataset utilizing the next code:

native URL = "https://uncooked.githubusercontent.com/CSSEGISandData/COVID-19/grasp/csse_covid_19_data/csse_covid_19_daily_reports/"
forvalues month = 1/12 {
   forvalues day = 1/31 {
      native month = string(`month', "%02.0f")
      native day = string(`day', "%02.0f")
      native yr = "2020"
      native right now = "`month'-`day'-`yr'"
      native FileName = "`URL'`right now'.csv"
      clear
      seize import delimited "`FileName'"
      seize verify variable ïprovincestate
      if _rc == 0 {
         rename ïprovincestate provincestate
         label variable provincestate "Province/State"
      }
      seize save "`right now'", exchange
   }
}
clear
forvalues month = 1/12 {
   forvalues day = 1/31 {
      native month = string(`month', "%02.0f")
      native day = string(`day', "%02.0f")
      native yr = "2020"
      native right now = "`month'-`day'-`yr'"
      seize append utilizing "`right now'"
   }
}

A lot of you’ll discover that we may accomplish this job with a single loop (trace: use frames!). However I wished to maintain issues so simple as potential on this put up. Let’s save this dataset so we will use it later.

. save covid19_raw
file covid19_raw.dta saved

There are numerous issues we may do with this uncooked dataset. We may create time-series information to look at patterns over time. We may do that by nation and even provinces or states inside nations. And we’ve details about latitude and longitude for a few of the information, so we may discover patterns by time and placement. I’ll talk about these in future posts.

Please be aware that we’ve not checked and cleaned these information. The code above and the ensuing information needs to be used for educational functions solely.



Related Articles

Latest Articles