Sunday, December 14, 2025

Web scraping NBA data into Stata


As of November 2019, this command no longer works because of restrictions on https://stats.nba.com.

Since our intern, Chris Hassell, finished nfl2stata earlier than expected, he went ahead and created another command to web scrape https://stats.nba.com for data on the NBA. The command is nba2stata. To install the command, type

net install http://www.stata.com/users/kcrow/nba2stata, replace

When Chris first wrote the command, I knew I wanted to look at how the three-point shot has changed the way the game is played. For example, I can find the best three-point shooter from last season.

. nba2stata playerstats _all, season(2017) seasontype(reg) stat(season) clear
Processing x/543 requests
.........x.........x.........x.........x.........50
.........x.........x.........x.........x.........100
.........x.........x.........x.........x.........150
.........x.........x.........x.........x.........200
.........x.........x.........x.........x.........250
.........x.........x.........x.........x.........300
.........x.........x.........x.........x.........350
.........x.........x.........x.........x.........400
.........x.........x.........x.........x.........450
.........x.........x.........x.........x.........500
.........x.........x.........x.........x...
660 observation(s) loaded

. gsort -threepointfieldgoalsmade

. list playername teamname threepointfieldgoalsmade in 1/10

     +----------------------------------------------------+
     |      playername                teamname   three~de |
     |----------------------------------------------------|
  1. |    James Harden         Houston Rockets        265 |
  2. |     Paul George   Oklahoma City Thunder        244 |
  3. |      Kyle Lowry         Toronto Raptors        238 |
  4. |    Kemba Walker       Charlotte Hornets        231 |
  5. |   Klay Thompson   Golden State Warriors        229 |
     |----------------------------------------------------|
  6. | Wayne Ellington              Miami Heat        227 |
  7. |  Damian Lillard   Portland Trailblazers        227 |
  8. |     Eric Gordon         Houston Rockets        218 |
  9. |   Stephen Curry   Golden State Warriors        212 |
 10. |      Joe Ingles               Utah Jazz        204 |
     +----------------------------------------------------+

Or I can check a player’s regular-season three-point percentage for the last five years.

. nba2stata playerstats "Dirk", stat(season) seasontype(reg) clear
27 observation(s) loaded

. gsort -playerage 

. list playername playerage threepointfieldgoalpercentage in 1/5

     +-------------------------------------+
     |    playername   playe~ge   three~ge |
     |-------------------------------------|
  1. | Dirk Nowitzki         40       .409 |
  2. | Dirk Nowitzki         39       .378 |
  3. | Dirk Nowitzki         38       .368 |
  4. | Dirk Nowitzki         37        .38 |
  5. | Dirk Nowitzki         36       .398 |
     +-------------------------------------+

Or I can see how three-point percentage affects your favorite team’s chance of winning.

. nba2stata teamstats "HOU", season(2017) stat(game) seasontype(reg) clear
82 observation(s) loaded

. keep if threepointfieldgoalpercentage > .35
(35 observations deleted)

. tab winloss

 Win / loss |      Freq.     Percent        Cum.
------------+-----------------------------------
          L |          4        8.51        8.51
          W |         43       91.49      100.00
------------+-----------------------------------
      Total |         47      100.00

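Dropping the cooler-shooting games hides the comparison group. As a rough sketch, and assuming nba2stata can still reach the site and returns the same variable names shown above, you could keep all 82 games and cross-tabulate instead. The indicator name hot3 is made up for this illustration.

```stata
* Sketch only: hot3 is a hypothetical variable name, not part of nba2stata
nba2stata teamstats "HOU", season(2017) stat(game) seasontype(reg) clear

* Flag games where the team shot better than 35% from three-point range
generate byte hot3 = threepointfieldgoalpercentage > .35 ///
    if !missing(threepointfieldgoalpercentage)

* Compare win/loss splits for games above and below the cutoff
tabulate hot3 winloss, row
```

The row option reports the win and loss percentages within each group, so the two shooting conditions can be compared side by side.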
nba2stata is great if you are planning on doing pro basketball analysis. Although this command looks identical to nfl2stata, it is not. The command works quite differently.

Web scraping JSON

In our last blog post, we talked about web scraping https://www.nfl.com and extracting the data from the HTML pages. The NBA data are different. You can access the data via JSON objects from https://stats.nba.com. JSON is a lightweight data format. This data format is easy to parse; therefore, we don’t have a scrape command for these data. We scrape and load these data on the fly.

The NBA’s copyright is similar to that of the NFL; you can use a personal copy of the data on your own personal computer. If you “use, display or publish” anything using these data, you must include “a prominent attribution to http://www.nba.com”. Another difference is that the NBA data stored on http://stats.nba.com can go as far back as the 1960s, depending on the team.

Command

There are only four subcommands to nba2stata, though we could have developed more. Chris had to go back to school.

  • To scrape player statistics data into Stata, use
    nba2stata playerstats name_pattern [, playerstats_options]
    
  • To scrape player profile data into Stata, use
    nba2stata playerprofile name_pattern [, playerprofile_options]
    
  • To scrape team statistics data into Stata, use
    nba2stata teamstats team_adv [, teamstats_options]
    
  • To scrape team roster data into Stata, use
    nba2stata teamroster team_adv [, teamroster_options]
    

Just like with nfl2stata, you will need to use Stata commands like collapse, gsort, and merge to generate the statistics, sort the data, and merge two or more NBA datasets together to examine the data.
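To make that workflow concrete, here is a minimal sketch on toy data. The team codes, player names, variable names (teamabbr, threes, wins), and numbers are all invented for illustration and do not come from nba2stata; only the collapse, merge, and gsort steps are the point.

```stata
* Toy player-level data; every value here is made up
clear
input str3 teamabbr str10 playername threes
"AAA" "Player 1" 100
"AAA" "Player 2"  50
"BBB" "Player 3"  80
"BBB" "Player 4"  90
end

* Generate a team-level statistic from the player-level rows
collapse (sum) threes, by(teamabbr)
tempfile teamthrees
save `teamthrees'

* Toy team-level data to combine with the collapsed totals
clear
input str3 teamabbr wins
"AAA" 10
"BBB" 20
end

* Merge the two datasets on the team key, then sort descending
merge 1:1 teamabbr using `teamthrees'
gsort -threes
list teamabbr wins threes
```

Because the sketch uses a tempfile, it should be run as a single do-file rather than line by line.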

Examples

One thing I’m always curious about is which college teams produce the most NBA players. This is easy to find out using nba2stata, collapse, and gsort.

. nba2stata playerprofile "_all", clear
Processing x/4308 requests
.........x.........x.........x.........x.........50
.........x.........x.........x.........x.........100
.........x.........x.........x.........x.........150

.........x.........x.........x.........x.........4250
.........x.........x.........x.........x.........4300
........
4308 observation(s) loaded

. save playerprofile, replace
(note: file playerprofile.dta not found)
file playerprofile.dta saved

. drop if school == ""
(114 observations deleted)

. gen ct = 1

. collapse (count) ct, by(school)

. gsort -ct

. list in 1/10

     +---------------------+
     |         school   ct |
     |---------------------|
  1. |       Kentucky   97 |
  2. |           UCLA   86 |
  3. | North Carolina   80 |
  4. |           Duke   70 |
  5. |         Kansas   69 |
     |---------------------|
  6. |        Indiana   57 |
  7. |     Notre Dame   55 |
  8. |     Louisville   53 |
  9. |        Arizona   51 |
 10. |       Syracuse   50 |
     +---------------------+

Because of the amount of data fetched, you might want to save the player profile data after fetching it, because it does take some time to download. On my machine, it took about an hour. The time largely depends on the amount of data that must be fetched. In the above case, it is all the player profile data from the NBA.
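One way to avoid repeating that download is to check for the saved file before fetching. This is only a sketch: it assumes the dataset was saved as playerprofile.dta in the current working directory, as in the session above, and that nba2stata can still reach the site when a fetch is needed.

```stata
* Reuse the saved copy if it exists; otherwise fetch and save it
capture confirm file "playerprofile.dta"
if _rc == 0 {
    use playerprofile, clear
}
else {
    nba2stata playerprofile "_all", clear
    save playerprofile, replace
}
```

Placing this at the top of a do-file means the hour-long fetch happens at most once per working directory.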

Another interesting example would be to find the oldest and youngest teams in the NBA. You can use the team roster to do this.

. nba2stata teamroster _all, season(2017) clear
Processing x/30 requests
.........x.........x.........x
502 observation(s) loaded

. collapse (mean) age, by(teamname)

. sort age

. list teamname age in 1/5

     +---------------------------------+
     |              teamname       age |
     |---------------------------------|
  1. |          Phoenix Suns   24.4706 |
  2. | Portland Trailblazers   24.8125 |
  3. |         Chicago Bulls   24.8889 |
  4. |         Atlanta Hawks   25.2222 |
  5. |         Brooklyn Nets   25.3529 |
     +---------------------------------+

. list teamname age in -5/l

     +---------------------------------+
     |              teamname       age |
     |---------------------------------|
 26. |    Washington Wizards     27.75 |
 27. |     San Antonio Spurs   28.3529 |
 28. | Golden State Warriors   28.6667 |
 29. |   Cleveland Cavaliers        29 |
 30. |       Houston Rockets   29.1765 |
     +---------------------------------+

Implementation

Again, Chris used Stata’s Java plugins and Gson to write the majority of the command.
