Saturday, January 31, 2026

Web scraping NFL data into Stata


The nfl2stata command no longer works because of website changes.

Football season is around the corner, and I couldn’t be more excited. We have a pretty competitive StataCorp fantasy football league. I’m always looking for an edge in our league, so I challenged one of our interns, Chris Hassell, to write a command to web scrape http://www.nfl.com for data on the NFL. The new command is nfl2stata. To install the command, type

net install http://www.stata.com/users/kcrow/nfl2stata, replace

With this new command, you can easily find the running backs who had the most touchdowns last season:

. nfl2stata player "running back", season(2017) clear
177 observation(s) loaded

. gsort -touchdowns -yards

. list name team touchdowns in 1/10

     +-------------------------------------+
     |              name   team   touchd~s |
     |-------------------------------------|
  1. |       Todd Gurley     LA         13 |
  2. |       Mark Ingram     NO         12 |
  3. |      Le'Veon Bell    PIT          9 |
  4. |     Jordan Howard    CHI          9 |
  5. | Leonard Fournette    JAX          9 |
     |-------------------------------------|
  6. |       Kareem Hunt     KC          8 |
  7. |     Melvin Gordon    LAC          8 |
  8. |       Carlos Hyde     SF          8 |
  9. |   Latavius Murray    MIN          8 |
 10. |      Alvin Kamara     NO          8 |
     +-------------------------------------+

You can find the top-5 field goal kickers (by field goals made) from last season.

. nfl2stata player "field goal kicker", season(2017) clear
54 observation(s) loaded

. list name team fieldgoalsmade in 1/5

     +--------------------------------------+
     |               name   team   f~lsmade |
     |--------------------------------------|
  1. |       Robbie Gould     SF         39 |
  2. |      Greg Zuerlein     LA         38 |
  3. |    Harrison Butker     KC         38 |
  4. | Stephen Gostkowski     NE         37 |
  5. |        Ryan Succop    TEN         35 |
     +--------------------------------------+

You can generate a graph of the top passing leaders from the last regular season.

. nfl2stata player quarterback, season(2017) seasontype(reg) clear
71 observation(s) loaded

. graph bar (asis) yards if yards >= 4000, exclude0                        ///
over(name, sort(yards) descending label(angle(45) labsize(small))) ///
blabel(bar) title(2017 Passing Yard Leaders)

There is a lot of interesting data to pore through, especially if you’re interested in fantasy football, as I am. Though this seems like a simple command, it really isn’t, because of the time it takes to fetch, parse, and load the data from http://www.nfl.com via web scraping.

Web scraping

You may have heard of the term “web scraping”. A simple definition of web scraping is extracting data from websites. Most of the time, a website’s copyright prevents people from distributing data obtained from scraping the website, but you can use a personal copy of the data on your own personal computer. This is what the NFL’s copyright states. Because of this, users must scrape the website themselves. To do this for the NFL data, you type

        nfl2stata scrape, season(_all)

This command will scrape all data from 2009 to the current year and save the data as Stata datasets to your local computer along your Stata adopath. Specifically, it will save them in your PLUS directory, where subsequent nfl2stata commands will be able to find them. The first year of NFL data stored on http://www.nfl.com is 2009. Currently, there are no data to scrape before this. Web scraping is an expensive and time-consuming process. Depending on several factors (computer speed, computer memory, network connection, etc.), this initial data scrape can take hours to complete. You might want to run the above command overnight. Once you have scraped the historical data, you can just type

        nfl2stata scrape

to update your locally saved datasets with the current week’s data. This update runs faster than the initial scrape.

As of the writing of this blog, the scraping command works, but if the NFL changes the HTML page format, the command will break, and if this happens, we’ll fix it if we can. Also, the data that are scraped will change over time as the NFL updates previous data on its website, so sometimes the data you scraped a few weeks ago will not match what you see on the ESPN or NFL website. In addition, sometimes the data can exist in more than one place and can be inconsistent as one website gets updated stats and another doesn’t. You can rescrape the data by using nfl2stata scrape, season(_all) replace to create new clean datasets. These issues are what make web scraping a volatile process.

Command

The command nfl2stata scrape produces game, game summary, play-by-play, player, player profile, roster, and team Stata datasets for each year. To load these data into Stata, you can use the following commands:

  • To load game-by-game data into Stata, use
            nfl2stata game "position" [, game_options]
    
  • To load game summary data into Stata, use
            nfl2stata gamesummary [, game_summary_options]
    
  • To load play-by-play data into Stata, use
            nfl2stata playbyplay [, playbyplay_options]
    
  • To load player-specific data into Stata, use
            nfl2stata player "position" [, player_options]
    
  • To load player profile data into Stata, use
            nfl2stata profile [, profile_options]
    
  • To load team roster data into Stata, use
            nfl2stata roster [, roster_options]
    
  • To load team game-by-game data into Stata, use
            nfl2stata team [, team_options]
    

These commands each search their respective datasets. Often you will need to use Stata commands like collapse, gsort, and merge to generate the statistics, sort the data, and merge two or more NFL datasets together to examine the data. Let’s look at a few more examples.

Examples

I’ve found that the two Stata commands I use most frequently with these data are gsort, which sorts data in ascending or descending order, and collapse, which makes a dataset of summary statistics. collapse is especially useful when working with multiple games’ or multiple seasons’ data. For example, to find out which wide receiver led the NFL in receiving last year, you’d type

. nfl2stata game "wide receiver", season(2017) seasontype(reg) clear
2764 observation(s) loaded

. collapse (sum) receivingyards, by(name)

. gsort -receivingyards

. list in 1/5

     +----------------------------+
     |            name   receiv~s |
     |----------------------------|
  1. |   Antonio Brown       1533 |
  2. |     Julio Jones       1444 |
  3. |    Keenan Allen       1393 |
  4. | DeAndre Hopkins       1378 |
  5. |    Adam Thielen       1276 |
     +----------------------------+
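
Because nfl2stata can no longer fetch data, here is a minimal sketch of the same collapse-then-gsort pattern on a tiny made-up dataset. The player names, weeks, and yardage values are invented for illustration; only the variable names name and receivingyards follow the output above.

```stata
* Hypothetical per-game data: one row per player per week (values are made up)
clear
input str12 name week receivingyards
"Player A" 1 100
"Player A" 2 50
"Player B" 1 120
"Player B" 2 10
"Player C" 1 80
end

* Sum each player's per-game yards into one season total per player
collapse (sum) receivingyards, by(name)

* Sort in descending order of total yards, then show the leaders
gsort -receivingyards
list, noobs
```

collapse replaces the data in memory with one row per by() group, so save the per-game data first if you still need it.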

Sometimes, you need to merge two or more NFL datasets to answer some questions about the data. For example, to find the average weight of an NFL running back over the last nine years, you can merge the roster data and the profile data to get the player position and player weight variables together in the same dataset. To do this, type

. nfl2stata roster, clear
18299 observation(s) loaded

. duplicates drop playerid, force

Duplicates in terms of playerid

(13,964 observations deleted)

. drop team teamname seasontype

. save temp_roster.dta, replace
file temp_roster.dta saved

. nfl2stata profile, clear
4335 observation(s) loaded

. merge 1:1 playerid using temp_roster.dta

    Result                          # of obs.
    -----------------------------------------
    not matched                             0
    matched                             4,335  (_merge==3)
    -----------------------------------------

. sum weight if position == "RB"

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      weight |        384    215.9036    14.20637        173        269
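
The deduplicate-then-merge steps above can be tried without any scraped data. The sketch below uses two small made-up datasets that stand in for the roster and profile files; the player IDs, positions, seasons, and weights are invented, and a tempfile replaces temp_roster.dta so nothing is left on disk. Run it as one block from a do-file so the local macro created by tempfile stays defined.

```stata
* Hypothetical roster data: one row per player per season (values are made up)
clear
input playerid str2 position season
1 "RB" 2016
1 "RB" 2017
2 "WR" 2017
3 "RB" 2017
end

* Keep a single row per player so playerid uniquely identifies observations
duplicates drop playerid, force
drop season
tempfile roster
save "`roster'"

* Hypothetical profile data: one row per player (values are made up)
clear
input playerid weight
1 215
2 200
3 225
end

* 1:1 merge requires playerid to be unique in both datasets
merge 1:1 playerid using "`roster'"
assert _merge == 3

* Summarize weight for running backs only
summarize weight if position == "RB"
```

The assert after merge stops the do-file if any player fails to match, which is a cheap guard when the two sources may have been scraped at different times.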

To find who led the NFL in receiving or rushing, you need to combine all offensive player data into one dataset. For example, to list the receiving leaders, type

. nfl2stata game "quarterback", season(2017) seasontype(reg) clear
1042 observation(s) loaded

. tempfile tmp 

. qui save "`tmp'", replace

. nfl2stata game "running back", season(2017) seasontype(reg) clear
2018 observation(s) loaded

. qui append using "`tmp'"

. qui save "`tmp'", replace

. nfl2stata game "wide receiver", season(2017) seasontype(reg) clear
2764 observation(s) loaded

. qui append using "`tmp'"

. qui save "`tmp'", replace

. nfl2stata game "tight end", season(2017) seasontype(reg) clear
1554 observation(s) loaded

. qui append using "`tmp'"

. collapse (sum) receivingyards, by(name position)

. gsort -receivingyards

. list name position receivingyards in 1/30

     +-------------------------------------------+
     |                name   position   receiv~s |
     |-------------------------------------------|
  1. |       Antonio Brown         WR       1533 |
  2. |         Julio Jones         WR       1444 |
  3. |        Keenan Allen         WR       1393 |
  4. |     DeAndre Hopkins         WR       1378 |
  5. |        Adam Thielen         WR       1276 |
     |-------------------------------------------|
  6. |      Michael Thomas         WR       1245 |
  7. |         Tyreek Hill         WR       1183 |
  8. |    Larry Fitzgerald         WR       1156 |
  9. |        Marvin Jones         WR       1101 |
 10. |      Rob Gronkowski         TE       1084 |
     |-------------------------------------------|
 11. |       Brandin Cooks         WR       1082 |
 12. |          A.J. Green         WR       1078 |
 13. |        Travis Kelce         TE       1038 |
 14. |         Golden Tate         WR       1003 |
 15. |          Mike Evans         WR       1001 |
     |-------------------------------------------|
 16. |        Doug Baldwin         WR        991 |
 17. |       Jarvis Landry         WR        987 |
 18. |         T.Y. Hilton         WR        966 |
 19. |    Marquise Goodwin         WR        962 |
 20. |    Demaryius Thomas         WR        949 |
     |-------------------------------------------|
 21. |      Robby Anderson         WR        941 |
 22. | JuJu Smith-Schuster         WR        917 |
 23. |       Davante Adams         WR        885 |
 24. |         Cooper Kupp         WR        869 |
 25. |        Stefon Diggs         WR        849 |
     |-------------------------------------------|
 26. |        Kenny Stills         WR        847 |
 27. |      Devin Funchess         WR        840 |
 28. |          Dez Bryant         WR        838 |
 29. |        Alvin Kamara         RB        826 |
 30. |           Zach Ertz         TE        824 |
     +-------------------------------------------+

Implementation

Chris used Stata’s Java plugins to write the majority of the command, along with a few other Java libraries. There are many Java libraries out there for web scraping data; those are just the ones we used.


