GitHub Repository

The original GitHub repository is at https://github.com/h-tu/cs320final
It contains all the data, the Rmd file, and this HTML page.

Table of Contents:

1. Introduction

  • Background
  • Project Motivation
  • Glossary

2. Prepare Data

  • Data Scraping and Cleaning
  • Library Needed
  • Scraping Data
  • Organize Dataset

3. Data Analysis

  • Attribute vs. Winning percentage over year

4. Machine Learning with Python

  • Model training & result interpretation

5. Conclusion

6. Additional Information


Introduction

1.1 Background

Basketball was invented by Canadian physical education instructor James Naismith in 1891. Over time the rules have kept changing and the sport’s popularity has grown enormously. Today, basketball is one of the most popular sports in the world, and the NBA represents its highest level. For more information about the NBA, see https://en.wikipedia.org/wiki/National_Basketball_Association. The league has seen many all-time great players, such as Bill Russell, Wilt Chamberlain, Magic Johnson, Larry Bird, Michael Jordan, Hakeem Olajuwon, Shaquille O’Neal, Allen Iverson, Kobe Bryant, and LeBron James. Today, however, the NBA is changing and focusing more on three-point shooting.
In the last six seasons, the Golden State Warriors won three championships and reached five Finals, which arguably makes them the most dominant team in the NBA over that span. A big reason for their rise is the “deadly” three-point shooting of the “Splash Brothers”, Stephen Curry and Klay Thompson. If you were watching the NBA in 2000, though, you would not have believed that three-point shooting would become this important. At that time, the league was dominated by great centers like Shaquille O’Neal.
The offensive style of today’s NBA has changed a lot. Back in 1999, the Spurs used 88.6 possessions per 48 minutes, according to Basketball-Reference.com. In 2017, the Golden State Warriors used 102.24 possessions per 48 minutes. Both teams won the title in those respective years. A faster pace means more points are scored across the league, and the three-point shot has a lot to do with that.
Gregg Popovich, one of the greatest coaches of all time, said: “Everything is about understanding it’s about the rules of the league and what you have to do to win. And these days what’s changed it is that everybody can shoot threes.”

1.2 Project Motivation

As described in the background, the NBA has changed a lot on both offense and defense: teams play faster and shoot more threes. It can be said that the NBA has entered the era of three-point shooting. Our team is interested in how this change shows up in the data.

To investigate, we first tried to scrape data from the official NBA website, but the site appears to block unauthorized users from collecting its data. We then searched for the best alternative source of NBA statistics. After some comparison, we decided to scrape the data from https://www.basketball-reference.com/leagues/NBA_2020.html#all_team-stats-base, using the Miscellaneous Stats table. We analyze the relationship between winning percentage and attributes such as three-point attempt rate. We also want to see how categories such as pace changed from 2000 to 2019. Finally, we use data science and machine learning to predict the outcome of NBA games.

For more information about the techniques we used, see https://www.insidescience.org/news/artificial-intelligence-nba-basketball and https://en.wikipedia.org/wiki/Logistic_regression.

1.3 Glossary

Every column has an abbreviated name, so we provide a glossary.
Age – Player’s age on February 1 of the season
W – Wins
L – Losses
PW – Pythagorean wins, i.e., expected wins based on points scored and allowed
PL – Pythagorean losses, i.e., expected losses based on points scored and allowed
MOV – Margin of Victory
SOS – Strength of Schedule; a rating of strength of schedule. The rating is denominated in points above/below average, where zero is average.
SRS – Simple Rating System; a team rating that takes into account average point differential and strength of schedule. The rating is denominated in points above/below average, where zero is average.
ORtg – Offensive Rating. An estimate of points produced (players) or scored (teams) per 100 possessions
DRtg – Defensive Rating. An estimate of points allowed per 100 possessions
NRtg – Net Rating; an estimate of point differential per 100 possessions.
Pace – Pace Factor: An estimate of possessions per 48 minutes
FTr – Free Throw Attempt Rate. Number of FT Attempts Per FG Attempt
X3PAr or 3PFGAR – 3-Point Attempt Rate. Percentage of FG Attempts from 3-Point Range
TS – True Shooting Percentage. A measure of shooting efficiency that takes into account 2-point field goals, 3-point field goals, and free throws.
eFG – Effective Field Goal Percentage. This statistic adjusts for the fact that a 3-point field goal is worth one more point than a 2-point field goal.
TOV – Turnover Percentage. An estimate of turnovers committed per 100 plays.
ORB. – Offensive Rebound Percentage. An estimate of the percentage of available offensive rebounds a player grabbed while he was on the floor.
FT/FGA – Free Throws Per Field Goal Attempt.
DRB. – Defensive Rebound Percentage. An estimate of the percentage of available defensive rebounds a player grabbed while he was on the floor.
DRB – Defensive Rebounds
ORB – Offensive Rebounds
TRB – Total Rebounds
AST – Assists
G – Games
MP – Minutes Played
FG – Field Goals
FGA – Field Goal Attempts
FG. – Field Goal Percentage
X3P or 3PFG – 3-Point Field Goals
X3PA or 3PFGA – 3-Point Field Goal Attempts
X3P. or 3PFGP – 3-Point Field Goal Percentage
X2P or 2PFG – 2-Point Field Goals
X2PA or 2PFGA – 2-Point Field Goal Attempts
X2P. or 2PFGP – 2-Point Field Goal Percentage
Attend. – Attendance
WP – Winning Percentage
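
Several of the rate statistics above can be computed directly from box-score totals. The sketch below uses the commonly published definitions as we understand them (the 0.44 free-throw weight in true shooting is part of that convention). The function name `shooting_rates` and the sample totals are hypothetical and are not taken from the dataset.

```python
def shooting_rates(fg, fga, fg3, fg3a, ft, fta, pts):
    """Return a dict of rate stats from team totals (all arguments are counts)."""
    return {
        "FG%": fg / fga,
        "3PAr": fg3a / fga,                     # share of attempts taken from three
        "FTr": fta / fga,                       # free-throw attempts per field-goal attempt
        "FT/FGA": ft / fga,                     # made free throws per field-goal attempt
        "eFG%": (fg + 0.5 * fg3) / fga,         # a made three counts as 1.5 field goals
        "TS%": pts / (2 * (fga + 0.44 * fta)),  # points per scoring attempt, halved
    }


# Hypothetical season totals, chosen only to keep the arithmetic easy to follow
rates = shooting_rates(fg=3000, fga=6000, fg3=600, fg3a=1500,
                       ft=1500, fta=2000, pts=8100)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

In this made-up example the team makes half of its shots, but its effective field goal percentage is higher because a fifth of those makes are threes.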

Prepare data


Data scraping and cleaning

In [1]:
import requests
import pandas as pd
from bs4 import Comment
from bs4 import BeautifulSoup
In [2]:
df = pd.DataFrame()
df2 = pd.DataFrame()
df3 = pd.DataFrame()

# Set time period --> seasons 2000 through 2019 (the end of range() is exclusive)
for year in range(2000,2020):
    # Get the seasonal stats for each team
    url = 'https://www.basketball-reference.com/leagues/NBA_{}.html#all_team-stats-base'.format(str(year))
    page = requests.get(url)

    soup = BeautifulSoup(page.text, 'html.parser')
    comments = soup.find_all(string=lambda text: isinstance(text, Comment))

    tables = []
    for each in comments:
        if 'table' in each:
            try:
                tables.append(pd.read_html(each)[0])
            except Exception:
                # Skip comments whose markup cannot be parsed as a table
                continue
    
    # Miscellaneous stats, including technical analysis like usage rate and offensive rating
    tmp = tables[7]
    tmp.columns = tmp.columns.droplevel()
    tmp.drop(tmp[tmp['Team'] == 'League Average'].index, inplace = True)
    tmp.insert(1,'yearID',str(year))
    del tmp['Rk']
    del tmp['Arena']
    
    # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
    df = pd.concat([df, tmp])
    
    # Per Game Stats, including points, assist, block, turnover, 3 Pointer
    tmp2 = tables[1]
    tmp2.drop(tmp2[tmp2['Team'] == 'League Average'].index, inplace = True)
    tmp2.insert(0,'yearID',str(year))
    df3 = pd.concat([df3, tmp2])
    
    # Get each game's data, including home team, away team, score
    url = 'https://www.basketball-reference.com/leagues/NBA_{}_games.html'.format(str(year))
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    
    table = soup.find('div', class_='overthrow table_container')
    rows = table.find_all('tr')
    
    data = [[],[],[],[]]
    
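    for r in rows:
        cells = r.find_all('td')
        for idx, val in enumerate(cells):

            text = val.text

            # Pages up to 2000 have one fewer leading column, so shift the index
            if year <= 2000:
                idx = idx + 1

            # Columns 1 and 3 hold team names; columns 2 and 4 hold their scores
            if idx == 1 or idx == 3:
                data[idx-1].append(text)
            elif idx == 2 or idx == 4:
                data[idx-1].append(int(text))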
    
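    # Use the score for each team and convert that into a single categorical attribute.
    # data[0]/data[1] are the first listed team and its score, data[2]/data[3] the second;
    # the first listed team is labeled as the home team here.
    home_win = [0 if (data[1])[i] < (data[3])[i] else 1 for i in range(len(data[2]))]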

    
    d = {'home_team' : data[0],'away_team' : data[2], 'home_win': home_win}
    schedule = pd.DataFrame(d)
    schedule.insert(0,'yearID',str(year))
    df2 = pd.concat([df2, schedule])
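
The scraping loop above depends on one detail worth isolating: the page keeps some of its tables inside HTML comments, so they have to be pulled out of the comment nodes before they can be parsed. The following is a minimal offline sketch of that idea using only the standard library's `html.parser` rather than BeautifulSoup. The class name `CommentTableFinder` and the sample HTML are hypothetical stand-ins, not the real page.

```python
from html.parser import HTMLParser


class CommentTableFinder(HTMLParser):
    """Collect the text of every HTML comment that contains a <table>."""

    def __init__(self):
        super().__init__()
        self.table_comments = []

    def handle_comment(self, data):
        # Comments hold raw markup, so a substring check is enough to filter them
        if "<table" in data:
            self.table_comments.append(data.strip())


# Hypothetical page: one visible paragraph, one table hidden inside a comment
sample_html = """
<div id="all_misc_stats">
  <p>Visible content</p>
  <!-- <table id="misc_stats"><tr><th>Team</th><th>Pace</th></tr>
       <tr><td>Example Team</td><td>93.3</td></tr></table> -->
  <!-- just a note, no table here -->
</div>
"""

finder = CommentTableFinder()
finder.feed(sample_html)
print(len(finder.table_comments))     # number of commented-out tables found
print(finder.table_comments[0][:30])  # start of the recovered table markup
```

Each recovered string could then be parsed with `pd.read_html`, as the loop above does. Recent pandas releases expect the markup to be wrapped in `io.StringIO` rather than passed as a bare string.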
In [3]:
# Join Per Game Stats with Miscellaneous stats on year and team name
df = df.merge(df3, on = ['Team', 'yearID'], suffixes=(False, False))

name = [i.replace('*','') if '*' in i else i for i in df.Team.tolist()]
new_df = pd.DataFrame({'Team': name})
In [4]:
df.update(new_df)
df
Out[4]:
yearID Team Age W L PW PL MOV SOS SRS ... FT% ORB DRB TRB AST STL BLK TOV PF PTS
0 2000 Los Angeles Lakers 29.2 67.0 15.0 64 18 8.55 -0.14 8.41 ... 0.696 13.6 33.4 47.0 23.4 7.5 6.5 13.9 22.5 100.8
1 2000 Portland Trail Blazers 29.6 59.0 23.0 59 23 6.40 -0.04 6.36 ... 0.760 11.8 31.2 43.0 23.5 7.7 4.8 15.2 22.7 97.5
2 2000 San Antonio Spurs 30.9 53.0 29.0 58 24 5.94 -0.02 5.92 ... 0.746 11.3 32.5 43.8 22.2 7.5 6.7 15.0 20.9 96.2
3 2000 Phoenix Suns 28.6 53.0 29.0 56 26 5.22 0.02 5.24 ... 0.759 12.5 31.2 43.7 25.6 9.1 5.3 16.7 24.1 98.9
4 2000 Utah Jazz 31.5 55.0 27.0 54 28 4.46 0.05 4.52 ... 0.773 11.4 29.6 41.0 24.9 7.7 5.4 14.9 24.5 96.5
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
590 2019 Atlanta Hawks 25.1 29.0 53.0 27 55 -6.02 -0.04 -6.06 ... 0.752 11.6 34.5 46.1 25.8 8.2 5.1 17.0 23.6 113.3
591 2019 Chicago Bulls 24.0 22.0 60.0 21 61 -8.41 0.10 -8.32 ... 0.783 8.8 34.1 42.9 21.9 7.4 4.3 14.1 20.3 104.9
592 2019 Phoenix Suns 24.0 19.0 63.0 19 63 -9.34 0.73 -8.61 ... 0.779 9.1 31.3 40.4 23.9 9.0 5.1 15.6 23.6 107.5
593 2019 New York Knicks 23.4 17.0 65.0 19 63 -9.21 0.28 -8.93 ... 0.759 10.5 34.3 44.7 20.1 6.8 5.1 14.0 20.9 104.6
594 2019 Cleveland Cavaliers 25.2 19.0 63.0 19 63 -9.61 0.22 -9.39 ... 0.792 10.7 31.9 42.7 20.7 6.5 2.4 13.5 20.0 104.5

595 rows × 51 columns

In [5]:
# Match the stats for each team in the record for each game
title = df.columns.tolist()[30:50]
h_stats = [[] for x in range(len(title))]
a_stats = [[] for x in range(len(title))]

yr = df2['yearID'].tolist()
hn = df2['home_team'].tolist()
an = df2['away_team'].tolist()

for i in range(len(yr)):
    h_item = df[(df.Team == hn[i]) & (df.yearID == yr[i])]
    a_item = df[(df.Team == an[i]) & (df.yearID == yr[i])]
    for j in range(len(title)):
        h_val = h_item[title[j]].tolist()
        h_stats[j].append(h_val[0])
        
        a_val = a_item[title[j]].tolist()
        a_stats[j].append(a_val[0])
        
for index, item in enumerate(title):
    h_item = 'h_' + item
    a_item = 'a_' + item
    df2[h_item] = h_stats[index]
    df2[a_item] = a_stats[index]
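
The row-by-row lookup above can also be expressed as two joins, which is usually faster on larger tables. The sketch below assumes pandas is available and uses tiny made-up frames. The helper name `attach_team_stats` and the two-column stat subset are illustrative only.

```python
import pandas as pd

# Hypothetical per-team season stats (one row per team and season)
team_stats = pd.DataFrame({
    "yearID": ["2000", "2000"],
    "Team": ["Team A", "Team B"],
    "FG": [38.6, 35.8],
    "AST": [20.8, 24.7],
})

# Hypothetical schedule (one row per game)
games = pd.DataFrame({
    "yearID": ["2000", "2000"],
    "home_team": ["Team A", "Team B"],
    "away_team": ["Team B", "Team A"],
    "home_win": [0, 1],
})


def attach_team_stats(games, team_stats, stat_cols):
    """Add h_<stat> and a_<stat> columns by joining on season and team name."""
    out = games.copy()
    for side, prefix in (("home_team", "h_"), ("away_team", "a_")):
        renamed = team_stats[["yearID", "Team"] + stat_cols].rename(
            columns={"Team": side, **{c: prefix + c for c in stat_cols}}
        )
        # how="left" keeps every game even if a team name fails to match
        out = out.merge(renamed, on=["yearID", side], how="left")
    return out


merged = attach_team_stats(games, team_stats, ["FG", "AST"])
print(merged)
```

With a left join, a game whose team name has no match ends up with missing values instead of raising an error, so it is worth checking the result for NaN before training.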
In [6]:
df2['yearID'] = df2['yearID'].astype('int64')
df2
Out[6]:
yearID home_team away_team home_win h_FG a_FG h_FGA a_FGA h_FG% a_FG% ... h_AST a_AST h_STL a_STL h_BLK a_BLK h_TOV a_TOV h_PF a_PF
0 2000 Orlando Magic Charlotte Hornets 0 38.6 35.8 85.5 79.7 0.452 0.449 ... 20.8 24.7 9.1 8.9 5.7 5.9 17.6 14.7 24.0 20.4
1 2000 Golden State Warriors Dallas Mavericks 0 36.5 39.0 87.1 85.9 0.420 0.453 ... 22.6 22.1 8.9 7.2 4.3 5.1 15.9 13.7 24.9 21.6
2 2000 Phoenix Suns Denver Nuggets 0 37.7 37.3 82.6 84.3 0.457 0.442 ... 25.6 23.3 9.1 6.8 5.3 7.5 16.7 15.6 24.1 23.9
3 2000 Milwaukee Bucks Houston Rockets 1 38.7 36.6 83.3 81.3 0.465 0.450 ... 22.6 21.6 8.2 7.5 4.6 5.3 15.0 17.4 24.6 20.3
4 2000 Seattle SuperSonics Los Angeles Clippers 1 37.9 35.1 84.7 82.4 0.447 0.426 ... 22.9 18.0 8.0 7.0 4.2 6.0 14.0 16.2 21.7 22.2
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
105 2019 Utah Jazz Minnesota Timberwolves 0 40.4 41.6 86.4 91.3 0.468 0.456 ... 26.0 24.6 8.1 8.3 5.9 5.0 15.1 13.1 21.1 20.3
106 2019 Indiana Pacers New York Knicks 1 41.3 38.2 87.0 88.3 0.475 0.433 ... 26.0 20.1 8.7 6.8 4.9 5.1 13.7 14.0 19.4 20.9
107 2019 New Orleans Pelicans Golden State Warriors 0 43.7 44.0 92.2 89.8 0.473 0.491 ... 27.0 29.4 7.4 7.6 5.4 6.4 14.8 14.3 21.1 21.4
108 2019 Dallas Mavericks Los Angeles Lakers 0 38.8 42.6 86.9 90.5 0.447 0.470 ... 23.4 25.6 6.5 7.5 4.3 5.4 14.2 15.7 20.1 20.7
109 2019 San Antonio Spurs Phoenix Suns 1 42.3 40.1 88.4 87.4 0.478 0.459 ... 24.5 23.9 6.1 9.0 4.7 5.1 12.1 15.6 18.1 23.6

1240 rows × 44 columns

In [7]:
# Write the CSV files next to the notebook; adjust the paths to your own setup
df.to_csv('nba_data.csv', index=False)
df2.to_csv('game_data.csv', index=False)

Library

library(dplyr, warn.conflicts = FALSE)
library(ggplot2)

Original Data

path <- "nba_data.csv"  # adjust to wherever the CSV from the repository was saved
data  <- read.csv(path)
data %>% head()
##   yearID                   Team  Age  W  L PW PL  MOV   SOS  SRS  ORtg  DRtg
## 1   2000     Los Angeles Lakers 29.2 67 15 64 18 8.55 -0.14 8.41 107.3  98.2
## 2   2000 Portland Trail Blazers 29.6 59 23 59 23 6.40 -0.04 6.36 107.9 100.8
## 3   2000      San Antonio Spurs 30.9 53 29 58 24 5.94 -0.02 5.92 105.0  98.6
## 4   2000           Phoenix Suns 28.6 53 29 56 26 5.22  0.02 5.24 104.6  99.0
## 5   2000              Utah Jazz 31.5 55 27 54 28 4.46  0.05 4.52 107.3 102.3
## 6   2000         Indiana Pacers 30.4 56 26 54 28 4.60 -0.45 4.15 108.5 103.6
##   NRtg Pace   FTr X3PAr   TS.  eFG. TOV. ORB. FT.FGA eFG..1 TOV..1 DRB.
## 1  9.1 93.3 0.346 0.153 0.525 0.484 12.7 30.6  0.241  0.443   13.4 73.1
## 2  7.1 89.9 0.316 0.175 0.546 0.501 14.5 30.3  0.240  0.461   13.8 72.4
## 3  6.4 90.8 0.346 0.138 0.535 0.488 14.3 27.8  0.258  0.451   13.5 73.0
## 4  5.6 94.0 0.286 0.184 0.532 0.491 15.2 29.3  0.217  0.454   15.7 70.5
## 5  5.0 89.6 0.337 0.134 0.540 0.490 14.3 29.5  0.260  0.477   15.0 73.2
## 6  4.9 93.1 0.302 0.224 0.552 0.503 13.3 24.9  0.245  0.469   12.6 71.5
##   FT.FGA.1 Attend.  G    MP   FG  FGA   FG. X3P X3PA  X3P.  X2P X2PA  X2P.   FT
## 1    0.222  771420 82 19805 3276 7288 0.450 534 1656 0.322 2742 5632 0.487 1521
## 2    0.217  835078 82 19830 3044 6635 0.459 439 1223 0.359 2605 5412 0.481 1956
## 3    0.188  884450 82 19730 3195 7047 0.453 519 1326 0.391 2676 5721 0.468 1407
## 4    0.245  773115 82 19730 3047 6640 0.459 583 1487 0.392 2464 5153 0.478 1629
## 5    0.256  801268 82 19855 3174 6827 0.465 394 1069 0.369 2780 5758 0.483 1558
## 6    0.197  752145 82 19805 3137 6836 0.459 344 1047 0.329 2793 5789 0.482 1649
##    FTA   FT.  ORB  DRB  TRB  AST STL BLK  TOV   PF  PTS
## 1 2016 0.754 1056 2635 3691 1953 787 381 1325 1729 8607
## 2 2506 0.781  917 2458 3375 1707 665 273 1288 2011 8483
## 3 1751 0.804  931 2444 3375 1810 592 416 1124 1770 8316
## 4 2008 0.811  842 2612 3454 1857 559 422 1159 1786 8306
## 5 1982 0.786 1016 2373 3389 1852 671 381 1230 2020 8300
## 6 2368 0.696 1117 2738 3855 1921 613 534 1143 1841 8267

Organize Dataset

To find the relationship between these stats and the winning percentage, we first add a column containing the winning percentage, computed as W / (W + L). We also add a column called year that splits the seasons into five equal-width intervals (roughly four seasons each).

data <- data %>% mutate(WP = W/(W+L))
#cut year into 5 intervals
data <- data %>%
  mutate(year = cut(yearID, breaks = 5))
data %>% head()
##   yearID                   Team  Age  W  L PW PL  MOV   SOS  SRS  ORtg  DRtg
## 1   2000     Los Angeles Lakers 29.2 67 15 64 18 8.55 -0.14 8.41 107.3  98.2
## 2   2000 Portland Trail Blazers 29.6 59 23 59 23 6.40 -0.04 6.36 107.9 100.8
## 3   2000      San Antonio Spurs 30.9 53 29 58 24 5.94 -0.02 5.92 105.0  98.6
## 4   2000           Phoenix Suns 28.6 53 29 56 26 5.22  0.02 5.24 104.6  99.0
## 5   2000              Utah Jazz 31.5 55 27 54 28 4.46  0.05 4.52 107.3 102.3
## 6   2000         Indiana Pacers 30.4 56 26 54 28 4.60 -0.45 4.15 108.5 103.6
##   NRtg Pace   FTr X3PAr   TS.  eFG. TOV. ORB. FT.FGA eFG..1 TOV..1 DRB.
## 1  9.1 93.3 0.346 0.153 0.525 0.484 12.7 30.6  0.241  0.443   13.4 73.1
## 2  7.1 89.9 0.316 0.175 0.546 0.501 14.5 30.3  0.240  0.461   13.8 72.4
## 3  6.4 90.8 0.346 0.138 0.535 0.488 14.3 27.8  0.258  0.451   13.5 73.0
## 4  5.6 94.0 0.286 0.184 0.532 0.491 15.2 29.3  0.217  0.454   15.7 70.5
## 5  5.0 89.6 0.337 0.134 0.540 0.490 14.3 29.5  0.260  0.477   15.0 73.2
## 6  4.9 93.1 0.302 0.224 0.552 0.503 13.3 24.9  0.245  0.469   12.6 71.5
##   FT.FGA.1 Attend.  G    MP   FG  FGA   FG. X3P X3PA  X3P.  X2P X2PA  X2P.   FT
## 1    0.222  771420 82 19805 3276 7288 0.450 534 1656 0.322 2742 5632 0.487 1521
## 2    0.217  835078 82 19830 3044 6635 0.459 439 1223 0.359 2605 5412 0.481 1956
## 3    0.188  884450 82 19730 3195 7047 0.453 519 1326 0.391 2676 5721 0.468 1407
## 4    0.245  773115 82 19730 3047 6640 0.459 583 1487 0.392 2464 5153 0.478 1629
## 5    0.256  801268 82 19855 3174 6827 0.465 394 1069 0.369 2780 5758 0.483 1558
## 6    0.197  752145 82 19805 3137 6836 0.459 344 1047 0.329 2793 5789 0.482 1649
##    FTA   FT.  ORB  DRB  TRB  AST STL BLK  TOV   PF  PTS        WP        year
## 1 2016 0.754 1056 2635 3691 1953 787 381 1325 1729 8607 0.8170732 (2000,2004]
## 2 2506 0.781  917 2458 3375 1707 665 273 1288 2011 8483 0.7195122 (2000,2004]
## 3 1751 0.804  931 2444 3375 1810 592 416 1124 1770 8316 0.6463415 (2000,2004]
## 4 2008 0.811  842 2612 3454 1857 559 422 1159 1786 8306 0.6463415 (2000,2004]
## 5 1982 0.786 1016 2373 3389 1852 671 381 1230 2020 8300 0.6707317 (2000,2004]
## 6 2368 0.696 1117 2738 3855 1921 613 534 1143 1841 8267 0.6829268 (2000,2004]
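
For readers following along in Python rather than R, the same two columns can be sketched with pandas. The three-row frame below is a stand-in for the full table (its win and loss counts are copied from the scraped output shown earlier), and `pd.cut` labels its bins slightly differently from R's `cut`, so this illustrates the idea rather than reproducing the output exactly.

```python
import pandas as pd

# Three rows standing in for the full table; W and L come from the scraped output
teams = pd.DataFrame({
    "yearID": [2000, 2000, 2019],
    "Team": ["Los Angeles Lakers", "Portland Trail Blazers", "Atlanta Hawks"],
    "W": [67, 59, 29],
    "L": [15, 23, 53],
})

# Winning percentage: wins divided by games played
teams["WP"] = teams["W"] / (teams["W"] + teams["L"])

# Five equal-width bins spanning 2000 to 2019; the bin edges are computed by pandas
all_years = pd.Series(range(2000, 2020))
_, edges = pd.cut(all_years, bins=5, retbins=True)
teams["year"] = pd.cut(teams["yearID"], bins=edges, include_lowest=True)

print(teams[["Team", "WP", "year"]])
```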

Data Analysis

3.1 Pace vs. Winning percentage over year

data %>% ggplot(aes(x = Pace, y = WP, color = yearID)) + 
  geom_point() + 
  labs(title = "Winning percentage vs. Pace", 
       x = "Pace",
       y = "Winning percentage") +
  geom_smooth(method=lm)


Pace is an estimate of possessions per 48 minutes. A possession ends when a team’s offense ends and it turns to defense. There are several ways for this to happen: a player scores, a player misses a shot and the other team gets the rebound, or a player turns the ball over. The graph shows that pace in the NBA increased from 2000 to 2019: teams played more and more possessions per 48 minutes. Apart from overtime, every game lasts 48 minutes, which did not change over this period, and each offensive possession is limited by the 24-second shot clock. More possessions in the same time therefore means that teams are shooting earlier in each possession. In addition, since the 2018-19 season the shot clock resets to 14 seconds instead of 24 after an offensive rebound, so we expect pace to keep increasing. However, we cannot conclude from the graph that there is any relationship between pace and winning percentage.
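
To make the definition concrete, here is a rough sketch of how pace can be estimated from box-score totals. It uses a widely quoted simplified possession estimate (field-goal attempts plus 0.44 times free-throw attempts, minus offensive rebounds, plus turnovers). The published pace figures come from a more detailed formula, so the function names and the sample numbers below should be read as illustrative assumptions.

```python
def estimate_possessions(fga, fta, orb, tov):
    """Simplified possession estimate from one team's totals."""
    return fga + 0.44 * fta - orb + tov


def pace_per_48(team_poss, opp_poss, team_minutes):
    """Possessions per 48 minutes, averaging both teams' estimates.

    team_minutes is the sum of player minutes, so a regulation game is 240.
    """
    avg_poss = (team_poss + opp_poss) / 2
    return 48 * avg_poss / (team_minutes / 5)


# Hypothetical single regulation game
home = estimate_possessions(fga=85, fta=25, orb=10, tov=14)
away = estimate_possessions(fga=88, fta=20, orb=12, tov=13)
pace = pace_per_48(home, away, team_minutes=240)
print(round(pace, 1))

# Average length of a possession implied by this pace:
# both teams share the same 48 minutes, so divide by twice the pace
seconds_per_possession = 48 * 60 / (2 * pace)
print(round(seconds_per_possession, 1))
```

A higher pace means a shorter average possession, which is the sense in which teams are shooting earlier in the shot clock.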

3.2 Relationship between Offensive Rating and WP over time

data %>% ggplot(aes(x = ORtg, y = WP, color = yearID)) + 
  geom_point() + 
  labs(title = "Winning percentage vs. Offensive Rating", 
       x = "Offensive Rating",
       y = "Winning percentage") +
  geom_smooth(method=lm)

data %>% ggplot(aes(x = yearID, y = ORtg, color = yearID)) + 
  geom_point() + 
  labs(title = "Offensive Rating vs. Year", 
       x = "Year",
       y = "Offensive Rating") +
  geom_smooth(method=lm)


Offensive Rating is an estimate of points produced (players) or scored (teams) per 100 possessions. The first graph shows a strong positive relationship between offensive rating and winning percentage: throughout 2000 to 2019, a higher offensive rating goes with a higher winning percentage. To win, a team must be able to score, which is true in any sport. The second graph shows that offensive rating has generally risen from 2000 to 2019. We believe the faster pace and the increase in three-point attempts both contribute to this.
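
One way to put a number on a "strong positive relationship" is the Pearson correlation coefficient. The sketch below computes it by hand for a handful of made-up (ORtg, WP) pairs. The values are hypothetical and only illustrate the calculation, not the result for the real data.

```python
import math

# Hypothetical (offensive rating, winning percentage) pairs
ortg = [102.0, 104.5, 106.0, 108.5, 110.0, 112.5]
wp = [0.30, 0.42, 0.45, 0.55, 0.62, 0.70]


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


r = pearson_r(ortg, wp)
print(round(r, 3))
```

A value near 1 indicates a strong positive linear relationship, a value near 0 indicates none, and a value near -1 indicates a strong negative one.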

3.3 How the three-point shot affects the game over time

3.3.0

p1 = ggplot(data = data, aes(x = as.character(yearID), y = X3PA)) + geom_boxplot()
p1 + ggtitle("3-Point Field Goal Attempt Over Time") + xlab("Year") + ylab("3-Point Field Goal  Attempt")


From the above graph, we can see that three-point field goal (3PFG) attempts have been increasing in recent years, which shows that NBA teams are now more inclined to shoot threes. In this section, we discuss why this trend is happening.

3.3.1

data %>% ggplot(aes(x = X3PAr, y = WP, color = yearID)) + 
  geom_point() + 
  labs(title = "Winning percentage vs. Three points attempt rate", 
       x = "Three points attempt rate",
       y = "Winning percentage") +
  geom_smooth(method=lm)


As the above graph shows, even though NBA players have attempted more three-pointers in recent years, the distribution of winning percentage across teams has not changed much. In other words, the three-point attempt rate is increasing year over year, but it does not have a direct relationship with a team’s winning percentage; it reflects a league-wide trend in how the game is played. To understand what caused this trend, we need to look deeper into the data, for example at the relationship between three-point field goal percentage and winning percentage.

3.3.2

data %>% ggplot(aes(x = X3P. , y = WP, color = yearID)) + 
  geom_point() + 
  labs(title = "Winning percentage vs. Three points field goal percentage", 
       x = "Three points field goal percentage",
       y = "Winning percentage") +
  geom_smooth(method=lm)


The graph above shows a regression line for the relationship between three-point field goal percentage and winning percentage for each team. Although the relationship is not very strong, winning percentage tends to be higher when three-point field goal percentage is higher, especially in recent years. In other words, a team with a very high three-point percentage is more likely to win. This may be one factor behind the rise in three-point attempt rate.

However, we cannot conclude that the higher three-point attempt rate is explained by X3P. (3-Point Field Goal Percentage) alone, because, as shown below, a team with a high X2P. (2-Point Field Goal Percentage) also tends to have a high winning percentage.

3.3.3

data %>% ggplot(aes(x = X2P. , y = WP, color = yearID)) + 
  geom_point() + 
  labs(title = "Two points field goal percentage vs. Winning percentage", 
       x = "Two points field goal percentage",
       y = "Winning percentage") +
  geom_smooth(method=lm)


So, we now look deeper into the dataset to examine the relationship between field goal attempts (FGA) and field goal percentage (FGP).

3.3.4

data %>% ggplot(aes(x = X3PA , y = X3P., color = WP)) + 
  geom_point() + 
  labs(title = "Three points field goal percentage vs. Three points field goal attempts", 
       x = "Three points field goal attempts",
       y = "Three points field goal percentage") +
  geom_smooth(method=lm)

data %>% ggplot(aes(x = X2PA , y = X2P., color = WP)) + 
  geom_point() + 
  labs(title = "Two points field goal percentage vs. Two points field goal attempts", 
       x = "Two points field goal attempts",
       y = "Two points field goal percentage") +
  geom_smooth(method=lm)


The two graphs above show regression lines for the relationship between FGA and FGP. 3PFG (3-point field goal) attempts are positively associated with 3PFG percentage, while 2PFG (2-point field goal) attempts are negatively associated with 2PFG percentage. Looking only at the data, more 2PFG attempts go with a lower 2PFG percentage and, based on the graph in 3.3.3, a lower 2PFG percentage goes with a lower winning percentage. The same reasoning applies to 3PFG: more 3PFG attempts go with a slightly higher 3PFG percentage, which, based on 3.3.2, goes with a higher winning percentage. These are associations rather than proof of cause, but they offer one possible explanation for why teams now choose to shoot from three-point range.
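
A complementary way to see why three-point attempts are attractive is expected points per shot: the shooting percentage times the value of the shot. This is standard arithmetic rather than something derived from the plots above, and the percentages below are round illustrative numbers, not league figures.

```python
def expected_points(make_rate, shot_value):
    """Average points produced by one attempt."""
    return make_rate * shot_value


two_pt = expected_points(0.50, 2)    # a 50% two-point shooter
three_pt = expected_points(0.36, 3)  # a 36% three-point shooter

print(f"2PT: {two_pt:.2f} points per attempt")
print(f"3PT: {three_pt:.2f} points per attempt")

# Break-even: the three-point rate that matches a given two-point rate
break_even = 0.50 * 2 / 3
print(f"Break-even 3P% for a 50% two-point shooter: {break_even:.3f}")
```

This is the same adjustment that effective field goal percentage makes when it counts a made three as one and a half field goals.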

In fact, a trend usually has more than one cause. The higher average 3PFG attempt rate may also be driven by what audiences want to see: fans today may be more inclined to watch players decide games with three-point shooting, and this change in taste could change how the game is played. So, we compare attendance with 3PFG attempts to see how the two are related.

3.3.5

data %>% ggplot(aes(x = X3PA , y = Attend., color = yearID)) + 
  geom_point() + 
  labs(title = "Attendance vs. Three points field goal attempts", 
       x = "Three points field goal attempts",
       y = "Attendance") +
  geom_smooth(method=lm)


As the above graph shows, attendance is positively associated with the number of three-point field goal attempts, which suggests that people are more willing to watch teams that shoot a lot of threes. Besides the change in audience taste and the goal of winning, the trend may also be related to changes in NBA rules and playing style. The NBA now encourages a fast-paced game, which may push the 3PFG attempt rate up. The comparison of pace and 3PFG attempts shown below is consistent with this hypothesis.

3.3.6

data %>% ggplot(aes(x = X3PA , y = Pace, color = yearID)) + 
  geom_point() + 
  labs(title = "Pace vs. Three points field goal attempts", 
       x = "Three points field goal attempts",
       y = "Pace") +
  geom_smooth(method=lm)


3.3.7

Based on the analysis so far, the three-point field goal attempt rate in recent years is much higher than before. We tried to find the reasons behind this. Based on the dataset, we suggest that it may be caused by changes in playing style, changes in audience taste, and the goal of winning.

3.4

In this section we look more closely at how the way teams play has changed. We draw several graphs of an attribute against winning percentage, split into the five time periods defined earlier. This makes it easier to see how an attribute contributes to winning during a specific period.

3.4.1

The graph below shows winning percentage vs. total rebounds by time period.

data %>%
  ggplot(aes(x=TRB, y=WP)) +
    geom_point(aes(color = year)) + 
    facet_wrap(~year) +
    xlab("Total Rebounds") + ylab("Winning Percentage") + 
    ggtitle("Winning Percentage vs Total Rebounds") + 
    geom_smooth(method = 'lm') + labs(color = "Time period")

regression <- lm(WP~TRB*year, data = data)
model <- regression  %>% broom::tidy()
model
## # A tibble: 10 x 5
##    term                   estimate std.error statistic  p.value
##    <chr>                     <dbl>     <dbl>     <dbl>    <dbl>
##  1 (Intercept)         -0.729      0.333       -2.19   0.0291  
##  2 TRB                  0.000352   0.0000954    3.69   0.000243
##  3 year(2004,2008]      0.399      0.496        0.804  0.422   
##  4 year(2008,2011]     -0.418      0.490       -0.853  0.394   
##  5 year(2011,2015]      1.06       0.357        2.98   0.00297 
##  6 year(2015,2019]     -0.0587     0.450       -0.130  0.896   
##  7 TRB:year(2004,2008] -0.000109   0.000144    -0.754  0.451   
##  8 TRB:year(2008,2011]  0.000130   0.000142     0.918  0.359   
##  9 TRB:year(2011,2015] -0.000303   0.000103    -2.95   0.00335 
## 10 TRB:year(2015,2019]  0.00000453 0.000127     0.0357 0.972


Based on the above graph and statistics, teams grabbed more rebounds in 2015-2019 than in the other four time periods, possibly because the pace is faster. The line is flatter in 2011-2015 than in the other four panels, because the points are spread out more widely along the horizontal axis. In general, more rebounds go with a higher winning percentage.
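
The interaction terms are easier to read once they are turned into a slope per period: the slope in a given period is the base TRB coefficient plus that period's TRB:year coefficient. The short sketch below does this arithmetic with the estimates from the table above. The dictionary layout is just for illustration.

```python
# Coefficients copied from the regression table above
base_slope = 0.000352  # TRB, i.e. the slope in the reference period (2000,2004]
interaction = {
    "(2004,2008]": -0.000109,
    "(2008,2011]": 0.000130,
    "(2011,2015]": -0.000303,
    "(2015,2019]": 0.00000453,
}

# Slope of WP on TRB within each period
slopes = {"(2000,2004]": base_slope}
for period, delta in interaction.items():
    slopes[period] = base_slope + delta

for period, slope in slopes.items():
    # Change in winning percentage associated with 100 extra rebounds in a season
    print(f"{period}: slope={slope:.6f}, per 100 rebounds={100 * slope:.4f}")
```

The slope for (2011,2015] comes out close to zero, which matches the flatter line in that panel.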

3.4.2

data %>%
  ggplot(aes(x=AST, y=WP, color = year)) +
    geom_point() +
    xlab("Assists") + ylab("Winning Percentage") + 
    ggtitle("Winning Percentage vs Assists") + 
    geom_smooth(method = 'lm') + labs(color = "Time period")

regression <- lm(WP~AST*year, data = data)
model <- regression  %>% broom::tidy()
model
## # A tibble: 10 x 5
##    term                  estimate std.error statistic      p.value
##    <chr>                    <dbl>     <dbl>     <dbl>        <dbl>
##  1 (Intercept)         -0.357     0.153        -2.33  0.0202      
##  2 AST                  0.000478  0.0000852     5.61  0.0000000314
##  3 year(2004,2008]      0.234     0.213         1.10  0.273       
##  4 year(2008,2011]     -0.0848    0.225        -0.377 0.707       
##  5 year(2011,2015]      0.588     0.181         3.24  0.00126     
##  6 year(2015,2019]      0.0744    0.201         0.371 0.711       
##  7 AST:year(2004,2008] -0.000118  0.000121     -0.977 0.329       
##  8 AST:year(2008,2011]  0.0000600 0.000127      0.473 0.636       
##  9 AST:year(2011,2015] -0.000319  0.000102     -3.13  0.00185     
## 10 AST:year(2015,2019] -0.0000659 0.000109     -0.605 0.545


From the above graph and statistics, we conclude that more assists go with a higher winning percentage; to some degree, whether a team wins depends on its number of assists. In 2015-2019 the number of assists is higher than in the other four time periods, which may be because there are more offensive possessions and therefore more opportunities for assists.

3.4.3

data %>%
  ggplot(aes(x=X3P., y=WP)) +
    geom_point(aes(color = year)) + 
    facet_wrap(~year) +
    xlab("3PFG Percentage") + ylab("Winning Percentage") + 
    ggtitle("Winning Percentage vs 3PFG Percentage") + 
    geom_smooth(method = 'lm') + labs(color = "Time period")

regression <- lm(WP~X3P.*year, data = data)
model <- regression  %>% broom::tidy()
model
## # A tibble: 10 x 5
##    term                 estimate std.error statistic    p.value
##    <chr>                   <dbl>     <dbl>     <dbl>      <dbl>
##  1 (Intercept)           -0.391      0.197    -1.99  0.0472    
##  2 X3P.                   2.54       0.560     4.54  0.00000689
##  3 year(2004,2008]        0.0406     0.315     0.129 0.897     
##  4 year(2008,2011]       -0.610      0.306    -1.99  0.0468    
##  5 year(2011,2015]       -0.261      0.297    -0.881 0.379     
##  6 year(2015,2019]       -0.483      0.345    -1.40  0.162     
##  7 X3P.:year(2004,2008]  -0.132      0.893    -0.148 0.882     
##  8 X3P.:year(2008,2011]   1.65       0.860     1.91  0.0563    
##  9 X3P.:year(2011,2015]   0.723      0.841     0.859 0.391     
## 10 X3P.:year(2015,2019]   1.31       0.971     1.35  0.178


As the above graphs show, the cluster of points and the regression line move to the right over time, which means that three-point field goal percentage is improving, alongside the increase in three-point attempts seen earlier. Also, three-point field goal percentage has a positive relationship with winning percentage: a higher three-point percentage goes with more wins.

Machine Learning with Python


Model training & result interpretation

In [23]:
import numpy as np
import sklearn.metrics
import seaborn as sns
import matplotlib.pylab as plt
import statsmodels.formula.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, KFold
In [9]:
sns.set(rc={'figure.figsize':(15,12)})

We tried training with data from before 2000, but because the rules have kept changing, older data no longer suits the model. The best time frame we found is 2000 to 2019, and that is what we went with. The training set contains all the attributes for both teams in each game, and the target y is the categorical column df2.home_win, whose values are 0 or 1 and indicate whether the home team won the game.
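
The target column can be illustrated on a few hand-made rows. This is a small sketch, not the notebook's code: it assumes each schedule row lists the visiting team and its score first and the home team and its score second, and the team names and scores are made up.

```python
# Each tuple: (visitor, visitor_pts, home, home_pts); hypothetical games
games = [
    ("Team A", 98, "Team B", 101),
    ("Team C", 110, "Team D", 104),
    ("Team B", 95, "Team A", 99),
]

# 1 when the home side outscored the visitor, 0 otherwise (NBA games cannot end tied)
home_win = [1 if home_pts > visitor_pts else 0
            for _, visitor_pts, _, home_pts in games]

print(home_win)
print(sum(home_win) / len(home_win))  # share of games won by the home team
```

Whichever column order the source page uses, the label and the team columns need to agree, so it is worth spot-checking a few rows against known results.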

In [24]:
tmp = df2.columns.tolist()[4:50]
train = df2[df2.yearID < 2019]
test = df2[df2.yearID == 2019]

x_train = train[tmp]
x_test = test[tmp]
y_train = train.home_win
y_test = test.home_win
In [25]:
model = LogisticRegression(n_jobs=8)
model.fit(x_train, y_train)
Out[25]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=8, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
In [26]:
y_test_pred = model.predict(x_test)
y_train_pred = model.predict(x_train)
In [27]:
mse_train = mean_squared_error(y_train, y_train_pred)
mse_test = mean_squared_error(y_test, y_test_pred)
score_train = model.score(x_train, y_train)
score_test = model.score(x_test, y_test)

print("mse_train: {}\nmse_test: {}\nscore_train: {}\nscore_test: {}\n".format(mse_train, mse_test, score_train, score_test))
mse_train: 0.3070796460176991
mse_test: 0.2636363636363636
score_train: 0.6929203539823009
score_test: 0.7363636363636363

Because the rule changes limit how much data is usable, and each team plays only around 80 games per season, the best test accuracy I could achieve is about 73.6%. That is not bad, and it is much better than a random guess (50%).
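
For 0/1 labels and 0/1 predictions, the mean squared error is simply the misclassification rate, which is why `mse_test` and `score_test` above add up to 1 (0.2636 + 0.7364). A small sketch with made-up labels (not taken from the dataset) shows the identity, together with a majority-class baseline, which is often a stricter yardstick than a coin flip:

```python
# Toy labels for illustration only; 1 = home win, 0 = home loss.
y_true = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 1, 1, 0, 1, 1]

n = len(y_true)

# Accuracy: share of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n

# MSE on 0/1 values: each wrong prediction contributes exactly 1.
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n

# Baseline: always predict the more common class in y_true.
majority = 1 if sum(y_true) * 2 >= n else 0
baseline_accuracy = sum(t == majority for t in y_true) / n

print(accuracy, mse, baseline_accuracy)
```

Comparing the model's accuracy with the share of home wins in the test season would show how much the attributes add beyond home-court advantage alone.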

In [14]:
output = pd.DataFrame()

a_tmp = [-1 * x for x in model.coef_[0][1::2]]

output['attribute'] = title
output['h_weight'] = model.coef_[0][::2]
output['a_weight'] = a_tmp

output
Out[14]:
   attribute   h_weight   a_weight
0         FG   0.226895   0.194331
1        FGA  -0.186685  -0.153776
2        FG%   0.003272   0.003851
3         3P   0.086181   0.070963
4        3PA  -0.041852  -0.076499
5        3P%   0.003980   0.007332
6         2P   0.153479   0.110015
7        2PA  -0.149417  -0.090220
8        2P%   0.002929   0.002201
9         FT   0.068581   0.088320
10       FTA  -0.069059  -0.074324
11       FT%   0.003845   0.005520
12       ORB   0.185919   0.059421
13       DRB   0.029739   0.093776
14       TRB   0.199245   0.147955
15       AST  -0.002619   0.011860
16       STL   0.142090   0.207814
17       BLK   0.190082   0.000109
18       TOV  -0.331940  -0.277124
19        PF   0.058421  -0.031567
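
The slicing in the cell above only lines up if the feature columns alternate between the home and away version of each statistic. That layout is an assumption here, inferred from how `coef_[0][::2]` and `coef_[0][1::2]` are used. A toy sketch with hypothetical column names and made-up coefficients:

```python
# Hypothetical interleaved layout: home stat, away stat, home stat, ...
columns = ["h_FG", "a_FG", "h_TOV", "a_TOV"]
coef = [0.23, -0.19, -0.33, 0.28]  # made-up weights for predicting home_win

h_weight = coef[::2]                 # positions 0, 2, ... -> home stats
a_weight = [-c for c in coef[1::2]]  # positions 1, 3, ... -> away stats, sign flipped

# The target is home_win, so an away stat that helps the away team has a
# negative raw coefficient; negating it expresses the weight from the away
# team's point of view.
print(h_weight)  # [0.23, -0.33]
print(a_weight)  # [0.19, -0.28]
```

With the sign flipped, a positive `a_weight` means the statistic helps the away team win, so the two columns can be read the same way.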
In [15]:
sns.barplot(x = "attribute", y = "h_weight", data = output.sort_values(by=['h_weight'], ascending=False))
plt.title('2000 to 2019 -- Home team attribute vs. weight')
plt.show()
sns.barplot(x = "attribute", y = "a_weight", data = output.sort_values(by=['a_weight'], ascending=False))
plt.title('2000 to 2019 -- Away team attribute vs. weight')
plt.show()

From the graph, we can see that over the past 20 years as a whole, the most important factors contributing to a win for home teams are field goals made, total rebounds, blocks, offensive rebounds, two-point field goals made, steals and three-point field goals made, while turnovers hurt a team the most.

For away teams, surprisingly, steals contribute the most to road wins. After that the story is much the same: field goals made, total rebounds, two-point field goals made, defensive rebounds, free throws made and three-point field goals made. Unsurprisingly, turnovers again hurt a team the most.
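
The rankings described above can be checked directly against the weight table. The sketch below copies the `h_weight` and `a_weight` columns from the output and sorts them in plain Python. The helper name `rank_of` is ours and is not part of the original notebook.

```python
# (h_weight, a_weight) per attribute, copied from the output table above.
weights = {
    "FG": (0.226895, 0.194331), "FGA": (-0.186685, -0.153776),
    "FG%": (0.003272, 0.003851), "3P": (0.086181, 0.070963),
    "3PA": (-0.041852, -0.076499), "3P%": (0.003980, 0.007332),
    "2P": (0.153479, 0.110015), "2PA": (-0.149417, -0.090220),
    "2P%": (0.002929, 0.002201), "FT": (0.068581, 0.088320),
    "FTA": (-0.069059, -0.074324), "FT%": (0.003845, 0.005520),
    "ORB": (0.185919, 0.059421), "DRB": (0.029739, 0.093776),
    "TRB": (0.199245, 0.147955), "AST": (-0.002619, 0.011860),
    "STL": (0.142090, 0.207814), "BLK": (0.190082, 0.000109),
    "TOV": (-0.331940, -0.277124), "PF": (0.058421, -0.031567),
}

HOME, AWAY = 0, 1

def rank_of(attribute, side):
    """Return the 1-based rank of `attribute` by descending weight."""
    ordered = sorted(weights, key=lambda k: weights[k][side], reverse=True)
    return ordered.index(attribute) + 1

# Attributes ordered from most positive to most negative weight.
home_order = sorted(weights, key=lambda k: weights[k][HOME], reverse=True)
away_order = sorted(weights, key=lambda k: weights[k][AWAY], reverse=True)

print("home top 7:", home_order[:7])
print("away top 7:", away_order[:7])
print("3P rank (home, away):", rank_of("3P", HOME), rank_of("3P", AWAY))
```

Sorting by signed weight puts turnovers last for both sides, which is the "hurts the most" reading used in the text.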

One more thing

Before finishing the study, I wanted to look at the period of the Golden State Warriors' rise. So I retrained the model with data only from 2014 to 2018, and tried to see how well it predicts the 2018-2019 season.

In [16]:
tmp = df2.columns.tolist()[4:50]
train = df2[(df2.yearID < 2019) & (df2.yearID >= 2014)]
test = df2[df2.yearID == 2019]

x_train = train[tmp]
x_test = test[tmp]
y_train = train.home_win
y_test = test.home_win
In [17]:
model = LogisticRegression(n_jobs=8)
model.fit(x_train, y_train)

y_test_pred = model.predict(x_test)
y_train_pred = model.predict(x_train)

mse_train = mean_squared_error(y_train, y_train_pred)
mse_test = mean_squared_error(y_test, y_test_pred)
score_train = model.score(x_train, y_train)
score_test = model.score(x_test, y_test)

print("mse_train: {}\nmse_test: {}\nscore_train: {}\nscore_test: {}\n".format(mse_train, mse_test, score_train, score_test))
mse_train: 0.2974137931034483
mse_test: 0.2909090909090909
score_train: 0.7025862068965517
score_test: 0.7090909090909091

So I got 70.9% accuracy with this reduced model. A drop in accuracy is reasonable, as there is less data. But what I care about most is finding the most important factors from 2014 to 2018.

In [18]:
output = pd.DataFrame()

a_tmp = [-1 * x for x in model.coef_[0][1::2]]

output['attribute'] = title
output['h_weight'] = model.coef_[0][::2]
output['a_weight'] = a_tmp

sns.barplot(x = "attribute", y = "h_weight", data = output.sort_values(by=['h_weight'], ascending=False))
plt.title('2014 to 2018 -- Home team attribute vs. weight')
plt.show()
sns.barplot(x = "attribute", y = "a_weight", data = output.sort_values(by=['a_weight'], ascending=False))
plt.title('2014 to 2018 -- Away team attribute vs. weight')
plt.show()

From the graphs above, I got the result I predicted: three-pointers made are the most important contributor to a home team's win. In the previous model, which used the past 20 years for prediction, 3P ranked 7th among contributors for both home and away teams. For away teams, the importance of 3P also rose, and it now ranks 4th in predicting a win.


Conclusion

After all the analysis and modeling, we can conclude that the playing style in the NBA has changed a lot over the last twenty years.

The most obvious change is that teams have started to shoot more threes. Three-point field goal attempts have increased a lot, and the importance of three-point field goal percentage has kept increasing over these twenty years. We also find that teams have increased their pace of play, which creates more rebounds and assists per game. A higher pace leads to more three-point attempts, which I think is one reason for every team to play faster.

After doing the modeling, we showed again that for both home and away teams, three-pointers have become more and more important to winning a game. Not surprisingly, field goals, rebounds, assists, steals and blocks generally have a positive relationship with wins, while turnovers have a negative one.

In conclusion, the offensive style of the league has changed: every team tends to play faster and shoot more threes, and our results suggest that higher efficiency in three-point shooting brings a team more wins. We believe three-point field goals will keep playing an important role in the NBA for at least five more years.


Additional Information

In this study, since our output data is categorical, we decided to use a logistic regression classifier, as it does well at separating two classes. For more information on logistic regression, here are some helpful links:

Wiki - logistic regression

Scikit-Learn API
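
As a rough illustration of what the classifier computes: logistic regression turns a weighted sum of the features into a probability through the sigmoid function, and predicts the positive class (here, a home win) when that probability exceeds 0.5. The weights and feature values below are invented for the example and are not the fitted model.

```python
import math

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, intercept):
    """Probability of the positive class (here: home win)."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical two-feature example: a positive weight raises the
# probability as the feature grows, a negative weight lowers it.
weights = [0.2, -0.3]
p_low = predict_proba([1.0, 2.0], weights, intercept=0.0)   # z = -0.4
p_high = predict_proba([4.0, 1.0], weights, intercept=0.0)  # z = 0.5

print(p_low < 0.5 < p_high)  # True
```

This is why the sign of each fitted weight can be read as "helps" or "hurts" the chance of a home win, as in the weight tables earlier.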

In the process of this study, we also took inspiration from several YouTube videos that show how to do data analysis with Python and game data. Here are the links for anyone who wants to explore this topic further.

Predicting NFL games

model details