Harrison Pratt

The Value of the Major League Baseball Pitcher

The Problem

Baseball teams are usually divided into two different parts: their offense led by their batters, and their defense led by their pitchers. Every offseason, Major League Baseball teams have to make an important decision: Should they spend their money on their batting or their pitching? In this study, I am going to attempt to form an answer to this question by observing how teams value their pitchers and whether they should pay them more or less. Earlier this semester, we did an investigation of payroll efficiency for each team. This study is going to take this a step further by investigating how a team can achieve payroll efficiency by how much they should allocate to their pitching staff. While we found some interesting information in our moneyball project about how efficient teams have been, this study should provide a whole new level of strategy for general managers on which position to pay in order to achieve efficiency.

My Hypothesis

My hypothesis is that starting pitchers will be valued more than batters, and both will be valued more than relief pitchers. The most sought after statistic is earned run average, since that reflects a pitchers ability to beat the opposition. I also believe that starting pitchers are underpaid and batters are overpaid.

Retrieving and Tidying Data

I am retrieving my data from csv files located in Lahman's Baseball Database. In this cell, I will read in the data from four csv files: Pitching.csv, Batting.csv, Salaries.csv, People.csv. After that, I will do some joins so that the full player name is in the batting and pitching dataframe and also get a pitchers and batters salaries dataframe. During this process, I will standardize the payrolls across years to account for the rise in payroll rate

In [1]:
# read in salaries.csv, pitching.csv, and people.csv into dataframes
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import warnings 
warnings.filterwarnings('ignore')
salaries_df = pd.read_csv("./data/Salaries.csv")
pitching_df = pd.read_csv("./data/Pitching.csv")
batting_df = pd.read_csv("./data/Batting.csv")
people_df = pd.read_csv("./data/People.csv")

# get salary aggregations
year_salary_aggs = salaries_df.groupby(["yearID"]).agg(["mean", "std"])
year_salary_aggs = year_salary_aggs["salary"]

# create new column standard salary
standardized_salaries = []
for row in salaries_df.iterrows():
    yearID = row[1]["yearID"]
    salary = row[1]["salary"]
    mean = year_salary_aggs.loc[yearID]["mean"]
    std = year_salary_aggs.loc[yearID]["std"]
    standardized_salary = (salary - mean)/std
    standardized_salaries = standardized_salaries + [standardized_salary]
salaries_df["standard_salary"] = standardized_salaries


# join pitching and people so that the player name is now part of pitching dataframe
pitching_df_names = pd.merge(pitching_df, people_df[["playerID", "nameFirst", "nameLast"]], on="playerID", how="left")

# get pitcher salary into pitching_df
pitching = pd.merge(pitching_df_names, salaries_df[["playerID", "yearID", "standard_salary"]], on=["playerID", "yearID"], how="left")
# trim pitching salaries to only get years with relevant salary data
pitching_salaries = pitching[pitching["yearID"] >= 1985]
# drop any record that has NaN for a column
pitchers = pitching_salaries.dropna()
print("\nPitcher Statistics")
display(pitchers.head())

# join batting and people so that the player name is part of batting dataframe
batting_df_names = pd.merge(batting_df, people_df[["playerID", "nameFirst", "nameLast"]], on="playerID", how="left")
# get batter salary into batting dataframe
batting = pd.merge(batting_df_names, salaries_df[["playerID", "yearID", "standard_salary"]], on=["playerID", "yearID"], how="inner")
# trim pitching salaries to only get years with relevant salary data
batting_salaries = batting[batting["yearID"] >= 1985]
# trim batters with less than 100 At Bats to get rid of pitchers from that table
batter_salaries = batting_salaries[batting_salaries["AB"] > 100]
# drop any record that has NaN for a column
batters = batter_salaries.dropna()
print("\n\nBatter Statistics")
display(batters.head())
Pitcher Statistics
playerID yearID stint teamID lgID W L G GS CG ... BK BFP GF R SH SF GIDP nameFirst nameLast standard_salary
24986 ackerji01 1985 1 TOR AL 7 2 61 0 0 ... 0 370.0 26 35 1.0 2.0 16.0 Jim Acker -0.892167
24987 agostju01 1985 1 CHA AL 4 3 54 0 0 ... 0 246.0 21 27 3.0 3.0 4.0 Juan Agosto -0.957703
24989 alexado01 1985 1 TOR AL 17 10 36 36 6 ... 0 1090.0 0 105 6.0 3.0 15.0 Doyle Alexander 1.161306
24990 allenne01 1985 1 SLN NL 1 4 23 1 0 ... 1 135.0 13 22 1.0 3.0 3.0 Neil Allen 0.797215
24991 allenne01 1985 2 NYA AL 1 0 17 0 0 ... 0 124.0 10 9 0.0 0.0 3.0 Neil Allen 0.797215

5 rows × 33 columns


Batter Statistics
playerID yearID stint teamID lgID G AB R H 2B ... BB SO IBB HBP SH SF GIDP nameFirst nameLast standard_salary
2 aguaylu01 1985 1 PHI NL 91 165 27 46 7 ... 22 26.0 5.0 6.0 4.0 3.0 7.0 Luis Aguayo -0.697014
6 almonbi01 1985 1 PIT NL 88 244 33 66 17 ... 22 61.0 0.0 1.0 4.0 3.0 6.0 Bill Almon -0.644585
9 armasto01 1985 1 BOS AL 103 385 50 102 17 ... 18 90.0 4.0 2.0 0.0 5.0 14.0 Tony Armas 1.277815
10 ashbyal01 1985 1 HOU NL 65 189 20 53 8 ... 24 27.0 2.0 1.0 1.0 1.0 9.0 Alan Ashby -0.173693
13 backmwa01 1985 1 NYN NL 145 520 77 142 24 ... 36 72.0 1.0 1.0 14.0 3.0 3.0 Wally Backman -0.804785

5 rows × 25 columns

We also want to separate starting pitchers and relief pitchers for future analysis. Since there are six months in the major league baseball season, we will classify pitchers as those who did not have at least one start a month, so any players that have at least six games started is a starting pitcher, any pitcher with fewer than six games started is a relief pitcher

In [2]:
# starting pitchers are all the pitchers that have at least 5 games started
starting_pitchers = pitchers[pitchers["GS"] >= 6]
print("\nStarting Pitcher Statistics")
display(starting_pitchers.head())
# relief pitchers are all the pitchers that have less than 5 games started
relief_pitchers = pitchers[pitchers["GS"] < 6]
print("\n\nRelief Pitcher Statistics")
display(relief_pitchers.head())
Starting Pitcher Statistics
playerID yearID stint teamID lgID W L G GS CG ... BK BFP GF R SH SF GIDP nameFirst nameLast standard_salary
24989 alexado01 1985 1 TOR AL 17 10 36 36 6 ... 0 1090.0 0 105 6.0 3.0 15.0 Doyle Alexander 1.161306
24993 andujjo01 1985 1 SLN NL 21 12 38 38 10 ... 0 1127.0 0 113 11.0 4.0 32.0 Joaquin Andujar 1.612779
24999 bannifl01 1985 1 CHA AL 10 14 34 34 4 ... 0 928.0 0 121 9.0 8.0 20.0 Floyd Bannister 0.975619
25000 barkele01 1985 1 ATL NL 2 9 20 18 0 ... 0 335.0 1 55 4.0 1.0 4.0 Len Barker 1.146742
25004 beattji01 1985 1 SEA AL 5 6 18 15 1 ... 0 335.0 1 61 0.0 5.0 5.0 Jim Beattie -0.018349

5 rows × 33 columns


Relief Pitcher Statistics
playerID yearID stint teamID lgID W L G GS CG ... BK BFP GF R SH SF GIDP nameFirst nameLast standard_salary
24986 ackerji01 1985 1 TOR AL 7 2 61 0 0 ... 0 370.0 26 35 1.0 2.0 16.0 Jim Acker -0.892167
24987 agostju01 1985 1 CHA AL 4 3 54 0 0 ... 0 246.0 21 27 3.0 3.0 4.0 Juan Agosto -0.957703
24990 allenne01 1985 1 SLN NL 1 4 23 1 0 ... 1 135.0 13 22 1.0 3.0 3.0 Neil Allen 0.797215
24991 allenne01 1985 2 NYA AL 1 0 17 0 0 ... 0 124.0 10 9 0.0 0.0 3.0 Neil Allen 0.797215
24992 anderla02 1985 1 PHI NL 3 3 57 0 0 ... 1 318.0 19 41 3.0 1.0 4.0 Larry Andersen -0.657692

5 rows × 33 columns

Examine important statistics over time

To provide a bit of context, I am going to set up a few graphs tracking some important statistics for pitchers such as earned run average and strikeouts, and for batters such as batting average and home runs

Track Pitching Statistics Over Time

Use line graphs to track earned run average and strikeouts over time

In [3]:
# plot total strikeouts over time on line graph
total_strikeouts = pitchers.groupby("yearID")["SO"].sum()
so_plot = total_strikeouts.plot.line(title="Total Strikeouts Over Time") 
so_plot.set_ylabel("Total Strikeouts")
so_plot.set_xlabel("Year")
Out[3]:
Text(0.5, 0, 'Year')

From the graph we can see that since 1985 total strikeouts have steadily increased with a massive rise around 1995 followed by a relatively small decline a couple of years after

In [4]:
# plot earned run average over time on line graph
mean_era = pitchers.groupby("yearID")["ERA"].mean()
era_plot = mean_era.plot.line(title="Earned Run Average Over Time") 
era_plot.set_ylabel("ERA")
era_plot.set_xlabel("Year")
Out[4]:
Text(0.5, 0, 'Year')

From the graph we can see that ERA was increased up until around its peak in 2000, and since then it has steadily decreased, which means that pitching has improved since 2000

Track Batting Statistics Over Time

In [5]:
# create new column Batting Avg
avgs = []
for row in batters.iterrows():
    hits = row[1]["H"]
    atbats = row[1]["AB"]
    avg = hits/atbats
    avgs = avgs + [avg]
batters["AVG"] = avgs

# plot batting average over time on line graph
batting_average = batters.groupby("yearID")["AVG"].mean()
ba_plot = batting_average.plot.line(title="Batting Average Over Time") 
ba_plot.set_ylabel("Batting Average")
ba_plot.set_xlabel("Year")
Out[5]:
Text(0.5, 0, 'Year')

From the graph we can see that batting average had a major increase in the early 1990's and stayed high for a while until the late 2000's in which it had a precipitous drop and has been very low in the recent past

In [6]:
# plot batting average over time on line graph
home_runs = batters.groupby("yearID")["HR"].mean()
hr_plot = home_runs.plot.line(title="Home Runs Over Time") 
hr_plot.set_ylabel("Home Runs")
hr_plot.set_xlabel("Year")
Out[6]:
Text(0.5, 0, 'Year')

From the graph we can see that while home runs were increasing since 1990 they started to decrease starting around 2005 but recently there has been a massive increase in the number of home runs hit per season. The spike in the middle of the graph is most likely a result of the steroid era.

Final Conclusions From Statistics Over Time

From analyzing these graphs, we can see that in the past decade there has been a decrease in batting average and earned run average, but an increase in home runs and strikeouts. These trends can be attributed to two main factors. The first factor is that batters are now attempting to hit for power instead of hitting for contact, which means that while there would be an increase in home runs, batting average would decrease and strikeouts would decrease. The second factor is that pitching has been improving, which would lower batting average and earned run average, while also increasing strikeouts.

Comparing Mean Pitcher Salary and Mean Batter Salary

Use line graphs to compare average pitcher salary to average total player salary from 1985 to 2016

In [7]:
# get mean pitcher salaries
mean_pitcher_salaries = pitchers.groupby("yearID")["standard_salary"].mean()
# get mean batter salaries
mean_batter_salaries = batters.groupby("yearID")["standard_salary"].mean()
# plot mean salaries
mean_plot = mean_pitcher_salaries.plot.line(legend=True)
mean_batter_salaries.plot.line(ax=mean_plot, legend=True)
mean_plot.set_title("Mean Standard Salaries Over Time")
mean_plot.set_ylabel("Mean Standard Salary")
mean_plot.set_xlabel("Year")
mean_plot.legend(["Pitchers", "Batters"])
Out[7]:
<matplotlib.legend.Legend at 0x7f478cd50c50>

From this graph we can see that since 1985 batters have always gotten paid more than pitchers, and as the years have progressed there has not been a very big change in how batters and pitchers have been paid. Interestingly enough, in the 1990's there actually was a time in which the value of batters increased and the value of pitchers decreased, but during the turn of the century this evened back out and now in the present day pitchers and batters are valued the same way that they were in 1985

Compare Starting Pitcher and Relief Pitcher Salary

One of the reasons that pitchers may be getting paid less than batters is that relief pitchers are often valued less than other positions, which would lower the overall mean pitcher salary. Therefore, in this part we want to compare salary for starting pitchers and relief pitchers to see if there is a difference

In [8]:
# get mean starting pitcher salaries
mean_starter_salaries = starting_pitchers.groupby("yearID")["standard_salary"].mean()
# get mean relief pitchers salaries
mean_relief_salaries = relief_pitchers.groupby("yearID")["standard_salary"].mean()

# plot mean starting salaries and mean relief salaries over time
mean_pitchers_plot = mean_starter_salaries.plot.line(legend=True)
mean_relief_salaries.plot.line(ax=mean_pitchers_plot, legend=True)
mean_pitchers_plot.set_title("Mean Pitcher Standard Salaries Over Time")
mean_pitchers_plot.set_ylabel("Mean Standard Salary")
mean_pitchers_plot.set_xlabel("Year")
mean_pitchers_plot.legend(["Starting Pitchers", "Relief Pitchers"])
Out[8]:
<matplotlib.legend.Legend at 0x7f478cd01358>

This plot shows that while in 1985 relief pitchers and starting pitchers were relatively close in how they were paid, since 1985 the value of starting pitchers has trended upwards, while the value of relief pitchers have trended downward, and it shows that starting pitchers are deemed more valuable than relief pitchers by a lot in the present day.

Compare Mean Batter Salary, Mean Starting Pitcher Salary, and Mean Relief Pitcher Salary

Now that we have seen that there is indeed a difference between starting pitchers and relief pitchers, we can compare starting pitchers, relief pitchers and batters to see the complete order in how they are valued

In [9]:
# plot mean starting salaries and mean relief salaries over time
mean_starters_plot = mean_starter_salaries.plot.line(legend=True)
mean_batter_salaries.plot.line(ax=mean_starters_plot, legend=True)
mean_relief_salaries.plot.line(ax=mean_starters_plot, legend=True)
mean_starters_plot.set_title("Mean Standard Salary Over Time")
mean_starters_plot.set_ylabel("Mean Standard Salary")
mean_starters_plot.set_xlabel("Year")
mean_starters_plot.legend(["Starting Pitchers", "Batters", "Relief Pitchers"])
Out[9]:
<matplotlib.legend.Legend at 0x7f47b03fd518>

This plot shows that while over the years the value of starting pitchers and the value of batters have traded places, since the early 200's the value of starting pitchers have trended upward signficantly, thus giving us the conclusion that starting pitchers are now valued more than batters. Relief pitchers, on the other hand, are valued much lower than both starting pitchers and batters.

Tracking Starting Pitching Statistics

Since starting pitchers have recently been getting paid more in the last five years, we want to see what is valued in a pitcher that causes them to get larger contracts.

In [10]:
# plot starting pitchers strikeouts over time
starting_so = starting_pitchers.groupby("yearID")["SO"].sum()
st_so_plot = starting_so.plot.line(title="Starting Pitcher Strikeouts Over Time") 
st_so_plot.set_ylabel("Strikeouts")
st_so_plot.set_xlabel("Year")
Out[10]:
Text(0.5, 0, 'Year')

This graph strongly resembles the graph for all pitchers strikeouts, therefore there are no extra conclusions to be taken from this graph

In [11]:
# plot starting pitchers earned run average over time
starting_era = starting_pitchers.groupby("yearID")["ERA"].mean()
st_era_plot = starting_era.plot.line(title="Starting Pitcher Earned Run Average Over Time") 
st_era_plot.set_ylabel("Earned Run Average")
st_era_plot.set_xlabel("Year")
Out[11]:
Text(0.5, 0, 'Year')

As with strikeouts, this graph also resembles the graph for all pitchers earned run average, therefore there is no conclusion that we can draw from this graph

With starting pitchers, their record is an important statistic so we are going to include that in the study as an extra statistic to track

In [12]:
# Make new column called win percentage
winpct = []
for row in starting_pitchers.iterrows():
    wins = row[1]["W"]
    games = row[1]["GS"]
    pct = wins/games
    winpct = winpct + [pct]
starting_pitchers["win percentage"] = winpct

# plot starting pitchers win percentage over time
starting_winpct = starting_pitchers.groupby("yearID")["win percentage"].mean()
st_winpct_plot = starting_winpct.plot.line(title="Starting Pitcher Win Percentage Over Time") 
st_winpct_plot.set_ylabel("Win Percentage")
st_winpct_plot.set_xlabel("Year")
Out[12]:
Text(0.5, 0, 'Year')

According to this graph, we can see that starting pitcher win percentage has actually decreased over the years, so that is most likely not the reason that pitchers are being paid more, but that fact is not necessarily conclusive so we will do more investigation

In conclusion for this section, since there is no concrete discovery that we can draw from these graphs, we will have to take a different approach in trying to discover why starting pitchers have been so valuable recently

Identifying Statistics Leaders in the Last Five Years

We want to identify the league leaders at each position in the last five years to identify the prime statistic on which pitchers are paid

In [13]:
# get data frame with starting pitchers just from the last five years
starting_pitchers_recent = starting_pitchers[starting_pitchers["yearID"] > 2011]

# take top 100 earned run average pitcher seasons 
era_leaders = starting_pitchers_recent.sort_values(by=["ERA"])[:100]
print ("\nTop 10 Leaders in ERA")
display(era_leaders[:10])
# take the mean salary for these pitchers
mean_era_salary = era_leaders["standard_salary"].mean()

# take top 100 strikeout pitcher seasons 
so_leaders = starting_pitchers_recent.sort_values(by=["SO"], ascending=False)[:100]
print ("\n\nTop 10 Leaders in Strikeouts")
display(so_leaders[:10])
# take the mean salary for these pitchers
mean_so_salary = so_leaders["standard_salary"].mean()

# take top 100 win percentage pitcher seasons 
winpct_leaders = starting_pitchers_recent.sort_values(by=["win percentage"], ascending=False)[:100]
print ("\n\nTop 10 Leaders in Win Percentage")
display(winpct_leaders[:10])
# take the mean salary for these pitchers
mean_winpct_salary = winpct_leaders["standard_salary"].mean()

data = [mean_era_salary, mean_so_salary, mean_winpct_salary]
ind = ["ERA", "Strikeouts", "Win Pct"]
mean_salary = pd.Series(data, index=ind) 
mean_statistic_bar = mean_salary.plot.bar(title="Mean Standard Salary for Statistic Leaders Last Five Years")
mean_statistic_bar.set_ylabel("Standard Salary")
Top 10 Leaders in ERA
playerID yearID stint teamID lgID W L G GS CG ... BFP GF R SH SF GIDP nameFirst nameLast standard_salary win percentage
41666 medlekr01 2012 1 ATL NL 10 1 50 12 2 ... 520.0 7 26 1.0 0.0 13.0 Kris Medlen -0.630196 0.833333
43717 greinza01 2015 1 LAN NL 19 3 32 32 1 ... 843.0 0 43 6.0 2.0 15.0 Zack Greinke 3.759182 0.593750
41738 paulife01 2012 1 KCA AL 3 1 7 7 0 ... 156.0 0 8 1.0 1.0 1.0 Felipe Paulino -0.330853 0.428571
44618 kershcl01 2016 1 LAN NL 12 4 21 21 3 ... 544.0 0 31 4.0 1.0 9.0 Clayton Kershaw 4.848037 0.571429
43318 smylydr01 2014 2 TBA AL 3 1 7 7 1 ... 173.0 0 9 1.0 0.0 3.0 Drew Smyly -0.671235 0.428571
42032 buchhcl01 2013 1 BOS AL 12 1 16 16 1 ... 416.0 0 23 1.0 2.0 14.0 Clay Buchholz 0.408289 0.750000
43033 kershcl01 2014 1 LAN NL 21 3 27 27 6 ... 749.0 0 42 6.0 1.0 15.0 Clayton Kershaw 0.003793 0.777778
43453 arrieja01 2015 1 CHN NL 22 6 33 33 4 ... 870.0 0 52 4.0 1.0 15.0 Jake Arrieta -0.121913 0.666667
44568 hillri01 2016 2 LAN NL 3 2 6 6 0 ... 128.0 0 7 1.0 1.0 0.0 Rich Hill 0.271793 0.500000
42301 kershcl01 2013 1 LAN NL 16 9 33 33 3 ... 908.0 0 55 8.0 3.0 20.0 Clayton Kershaw 1.617047 0.484848

10 rows × 34 columns


Top 10 Leaders in Strikeouts
playerID yearID stint teamID lgID W L G GS CG ... BFP GF R SH SF GIDP nameFirst nameLast standard_salary win percentage
43821 kershcl01 2015 1 LAN NL 16 7 33 33 4 ... 890.0 0 62 4.0 0.0 8.0 Clayton Kershaw 5.134183 0.484848
44916 scherma01 2016 1 WAS NL 20 7 34 34 1 ... 902.0 0 77 7.0 3.0 6.0 Max Scherzer 3.007854 0.588235
42109 darviyu01 2013 1 TEX AL 13 9 32 32 0 ... 841.0 0 68 0.0 5.0 17.0 Yu Darvish 1.163763 0.406250
44108 scherma01 2015 1 WAS NL 14 12 33 33 4 ... 899.0 0 74 11.0 2.0 5.0 Max Scherzer 2.332058 0.424242
44097 salech01 2015 1 CHA AL 13 11 31 31 1 ... 854.0 0 88 2.0 3.0 13.0 Chris Sale 0.308512 0.419355
43040 klubeco01 2014 1 CLE AL 18 9 34 34 3 ... 951.0 0 72 5.0 2.0 16.0 Corey Kluber -0.672399 0.529412
45006 verlaju01 2016 1 DET AL 16 9 34 34 2 ... 903.0 0 81 4.0 7.0 8.0 Justin Verlander 4.000584 0.470588
44460 fernajo02 2016 1 MIA NL 16 8 29 29 0 ... 737.0 0 63 7.0 5.0 14.0 Jose Fernandez -0.270576 0.551724
43450 archech01 2015 1 TBA AL 12 13 34 34 1 ... 868.0 0 85 2.0 2.0 10.0 Chris Archer -0.569289 0.352941
43297 scherma01 2014 1 DET AL 18 5 33 33 1 ... 904.0 0 80 4.0 8.0 15.0 Max Scherzer 2.239339 0.545455

10 rows × 34 columns


Top 10 Leaders in Win Percentage
playerID yearID stint teamID lgID W L G GS CG ... BFP GF R SH SF GIDP nameFirst nameLast standard_salary win percentage
43124 mejiaje01 2014 1 NYN NL 6 6 63 7 0 ... 417.0 49 41 2.0 0.0 10.0 Jenrry Mejia -0.673238 0.857143
44760 nicasju01 2016 1 PIT NL 10 7 52 12 0 ... 513.0 9 64 5.0 7.0 8.0 Juan Nicasio -0.236678 0.833333
41666 medlekr01 2012 1 ATL NL 10 1 50 12 2 ... 520.0 7 26 1.0 0.0 13.0 Kris Medlen -0.630196 0.833333
44017 priceda01 2015 2 TOR AL 9 1 11 11 0 ... 296.0 0 20 0.0 3.0 4.0 David Price 2.805707 0.818182
43305 shoemma01 2014 1 LAA AL 16 4 27 20 0 ... 543.0 5 49 3.0 5.0 9.0 Matt Shoemaker -0.675018 0.800000
43033 kershcl01 2014 1 LAN NL 21 3 27 27 6 ... 749.0 0 42 6.0 1.0 15.0 Clayton Kershaw 0.003793 0.777778
43915 medlekr01 2015 1 KCA AL 6 2 15 8 0 ... 243.0 2 30 0.0 1.0 4.0 Kris Medlen -0.417944 0.750000
42032 buchhcl01 2013 1 BOS AL 12 1 16 16 1 ... 416.0 0 23 1.0 2.0 14.0 Clay Buchholz 0.408289 0.750000
44324 bundydy01 2016 1 BAL AL 10 6 36 14 0 ... 474.0 6 52 1.0 1.0 6.0 Dylan Bundy -0.440067 0.714286
44165 tomlijo01 2015 1 CLE AL 7 2 10 10 2 ... 251.0 0 22 0.0 0.0 3.0 Josh Tomlin -0.508751 0.700000

10 rows × 34 columns

Out[13]:
Text(0, 0.5, 'Standard Salary')

From this bar graph, it is clear to see that teams value strikeouts more than earned run average and win percentage, most likely because strikeouts are one of the main measures of a pitchers ability to complete dominate batters

Retrieve and Tidy Team Data

I need to retrieve the teams data from Teams.csv because I will need data about their wins and losses from the past five years

In [14]:
# read in teams data from csv file
teams = pd.read_csv("./data/Teams.csv")
# cut out all of the records that are in the year 2011 or before
recent_teams = teams[teams["yearID"] > 2011]

# create new column win percentage
winpcts = []
for row in recent_teams.iterrows():
    wins = row[1]["W"]
    games = row[1]["G"]
    pct = wins/games
    winpcts = winpcts + [pct]
recent_teams["Win Percentage"] = winpcts

# display teams dataframe
display(recent_teams.head())
yearID lgID teamID franchID divID Rank G Ghome W L ... FP name park attendance BPF PPF teamIDBR teamIDlahman45 teamIDretro Win Percentage
2685 2012 NL ARI ARI W 3 162 81.0 81 81 ... 0.985 Arizona Diamondbacks Chase Field 2177617.0 105 106 ARI ARI ARI 0.500000
2686 2012 NL ATL ATL E 2 162 81.0 94 68 ... 0.986 Atlanta Braves Turner Field 2420171.0 102 101 ATL ATL ATL 0.580247
2687 2012 AL BAL BAL E 2 162 81.0 93 69 ... 0.983 Baltimore Orioles Oriole Park at Camden Yards 2102240.0 102 103 BAL BAL BAL 0.574074
2688 2012 AL BOS BOS E 5 162 81.0 69 93 ... 0.983 Boston Red Sox Fenway Park II 3043003.0 106 106 BOS BOS BOS 0.425926
2689 2012 AL CHA CHW C 2 162 81.0 85 77 ... 0.988 Chicago White Sox U.S. Cellular Field 1965955.0 106 106 CHW CHA CHA 0.524691

5 rows × 49 columns

View Relationship between Starting Pitcher Salary and Win Percentage

We want to see the amount of money that each team spends on their starting pitching staff in the past five years

In [15]:
# get mean starting pitcher salary for each team from the past five years
team_starter_salaries = starting_pitchers_recent.groupby(["teamID", "yearID"])["standard_salary"].mean()
mean_team_starter_salaries = team_starter_salaries.groupby("teamID").mean()

# plot salaries on a bar graph
team_salary_bar = mean_team_starter_salaries.plot.bar(title = "Mean Starting Pitcher Standard Salary in the Past Five Years")
team_salary_bar.set_ylabel("Mean Standard Salary")
Out[15]:
Text(0, 0.5, 'Mean Standard Salary')

From this bar graph, it is clear than teams such as the San Francisco Giants, the Detroit Tigers, and the Los Angeles Dodgers have all spent a lot of money on starting pitching. However, this information is not useful until we can see this spending as a proportion of their total payroll

In [16]:
# get dataframe with recent relief pitching
recent_relief_pitching = relief_pitchers[relief_pitchers["yearID"] > 2011]

# get dataframe with recent batting
recent_batting = batters[batters["yearID"] > 2011]

# get series with mean relief pitcher salaries
team_reliever_salaries = recent_relief_pitching.groupby(["teamID", "yearID"])["standard_salary"].mean()
mean_team_reliever_salaries = team_reliever_salaries.groupby("teamID").mean()

# get series with mean batting salaries
team_batting_salaries = recent_batting.groupby(["teamID", "yearID"])["standard_salary"].mean()
mean_team_batting_salaries = team_batting_salaries.groupby("teamID").mean()

# get series with mean total salary
mean_total_salaries = mean_team_starter_salaries + mean_team_reliever_salaries + mean_team_batting_salaries

# get series with proportion of payroll spent on starting pitching
proportion_starter_salary = mean_team_starter_salaries/mean_total_salaries

# plot the proportion series
proportion_plot = proportion_starter_salary.plot.bar()
proportion_plot.set_title("Proportion of Payroll Spent on Starting Pitching in Last Five Years")
Out[16]:
Text(0.5, 1.0, 'Proportion of Payroll Spent on Starting Pitching in Last Five Years')

With this graph it now gives an entirely new picture and we can see that it is actually the Seattle Mariners and Chicago Cubs who spend the most on their starting pitching. The Cincinnati Reds and the Kansas City Royals spend the least amount of their payroll on starting pitching

Now that we have seen how much each team spends on starting pitching, we also want to see how this correlates with their winning percentage

In [17]:
# get mean winning percentage for each time in the last five years
team_winning_pct = recent_teams.groupby("teamID")["Win Percentage"].mean()
# combine the two series into one dataframe
percentage_df = pd.concat([proportion_starter_salary, team_winning_pct], axis=1)
# rename win percentage column
percentage_df = percentage_df.rename(columns={'Win Percentage': 'win percentage', 'standard_salary':'starting pitcher salary proportion'})
# make bar plot with winning percentage and salary
percentage_df.plot.bar(title = "Winning Percentage Compared to Proportion of Payroll Spent on Starters",figsize=(8,6))
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f478b0847b8>

This bar graph does not necessarily give us anything conclusive about the relationship between winning percentage and proportion of payroll spent on starting pitchers, so instead we can try to make a scatter plot to see the correlation

Finding Correlation between Proportion of Money Spent on each Position and Win Percentage

In [18]:
plt.figure()
plt.scatter(percentage_df["win percentage"], percentage_df["starting pitcher salary proportion"])
# add team label to each point
for row in percentage_df.iterrows():
    x = row[1]["win percentage"]
    y = row[1]["starting pitcher salary proportion"]
    val = row[0]
    plt.annotate(val, (x,y))
# add labels and title for the plot
plt.ylabel("starting pitcher salary proportion")
plt.xlabel("win percentage")
plt.title("Winning Percentage Compared to Proportion of Payroll Spent on Starters from 2012 to 2016")
Out[18]:
Text(0.5, 1.0, 'Winning Percentage Compared to Proportion of Payroll Spent on Starters from 2012 to 2016')

From the scatter plot, we can now see that there is positive correlation between the proportion of payroll spent on starting pitchers and the winning percentage of a team. We now want to compare this scatter plot to the proportion of payroll spent on relief pitching and the proportion of payroll spent on batters

In [19]:
# get proportion of payroll spent on batting
proportion_batter_salary = mean_team_batting_salaries/mean_total_salaries
# get proportion of payroll spent on relievers
proportion_reliever_salary = mean_team_reliever_salaries/mean_total_salaries

# add two new series to the proportions dataframe
percentage_df["relief pitcher salary proportion"] = proportion_reliever_salary
percentage_df["batter salary proportion"] = proportion_batter_salary

# make a scatter plot with reliever proportions
plt.figure()
plt.scatter(percentage_df["win percentage"], percentage_df["relief pitcher salary proportion"])
# add team label to each point
for row in percentage_df.iterrows():
    x = row[1]["win percentage"]
    y = row[1]["relief pitcher salary proportion"]
    val = row[0]
    plt.annotate(val, (x,y))
# add labels and title for the plot
plt.ylabel("relief pitcher salary proportion")
plt.xlabel("win percentage")
plt.title("Winning Percentage Compared to Proportion of Payroll Spent on Relievers from 2012 to 2016")

# make a scatter plot with batter proportions
plt.figure()
plt.scatter(percentage_df["win percentage"], percentage_df["batter salary proportion"])
# add team label to each point
for row in percentage_df.iterrows():
    x = row[1]["win percentage"]
    y = row[1]["batter salary proportion"]
    val = row[0]
    plt.annotate(val, (x,y))
# add labels and title for the plot
plt.ylabel("batter salary proportion")
plt.xlabel("win percentage")
plt.title("Winning Percentage Compared to Proportion of Payroll Spent on Batters from 2012 t0 2016")
Out[19]:
Text(0.5, 1.0, 'Winning Percentage Compared to Proportion of Payroll Spent on Batters from 2012 t0 2016')

From the relief pitcher and batter graphs, we can see that there is also positive correlation between paying batters and winning percentage. However, we can also see that there is negative correlation between paying relief pitchers and winning percentage, so this suggests that teams should be paying starting pitchers and batters, but should not be paying relief pitchers. A couple of the more interesting points are that the Cincinnati Reds spend a lot of money on relief pitching and not as much on starting pitching and they are not as successful. Meanwhile, two of the most successful teams over the past five years, the Los Angeles Dodgers and the Washington Nationals, spend a large proportion of their payroll on starting pitching and batting, and barely any money on relief pitching. Based on the scatter plot, it is difficult to see which position has more correlation between starting pitching and batting, so we want to get the actual correlation value.

In [20]:
# Get correlation between starting pitcher salary proportion and winning percentage
starter_corr = percentage_df["win percentage"].corr(percentage_df["starting pitcher salary proportion"])

# Get correlation between relief pitcher salary proportion and winning percentage
reliever_corr = percentage_df["win percentage"].corr(percentage_df["relief pitcher salary proportion"])

# Get correlation between batter salary proportion and winning percentage
batter_corr = percentage_df["win percentage"].corr(percentage_df["batter salary proportion"])

# print output statements with the correlations
print("The Correlation between Win Percentage and Starting Pitcher Salary Proportion is " + str(starter_corr))
print("The Correlation between Win Percentage and Relief Pitcher Salary Proportion is " + str(reliever_corr))
print("The Correlation between Win Percentage and Batter Salary Proportion is " + str(batter_corr))
The Correlation between Win Percentage and Starting Pitcher Salary Proportion is 0.29184971077334454
The Correlation between Win Percentage and Relief Pitcher Salary Proportion is -0.32338425998851844
The Correlation between Win Percentage and Batter Salary Proportion is 0.2617699537454118

These results tell us that there is a higher correlation between starting pitching and winning percentage and win percentage than between batting and winning percentage. The relief pitching and winning percentage correlation is negative.

Relief Pitcher Conclusion: At this point we can draw a final conclusion on relief pitchers. Teams do not pay relief pitchers very much of their payroll, which is the right move since paying relief pitchers a lot of money is not a good use of resources due to the negative correlation between proportion of money spent on relief pitchers and winning percentage. Starting pitchers and batters are paid more because they have more of an impact on team success

Due to the fact that we have reached a conclusion on relief pitching, we will now proceed with just comparing starting pitching and batting. While relief pitching might still be included in graphs and tables, I will not spend as much time on that analysis

Should Starting Pitchers or Batters Get Paid More Money?

In this next stage we are going to compare starting pitchers and batters in the past five years and see how the correlation between paying each position and winning percentage has developed recently

In [21]:
team_winning_pct = recent_teams.groupby(["yearID", "teamID"])["Win Percentage"].mean()
team_2016 = team_winning_pct[2016]

starting_pitching_2016 = starting_pitchers[starting_pitchers["yearID"] == 2016]
batters_2016 = batters[batters["yearID"] == 2016]
relief_pitching_2016 = relief_pitchers[relief_pitchers["yearID"] == 2016]

# get series with mean starting pitcher salaries from 2016
mean_team_starters_2016 = starting_pitching_2016.groupby("teamID")["standard_salary"].mean()

# get series with mean batter salaries from 2016
mean_team_batters_2016 = batters_2016.groupby("teamID")["standard_salary"].mean()

# get series with mean batter salaries from 2016
mean_team_relievers_2016 = relief_pitching_2016.groupby("teamID")["standard_salary"].mean()

# get series with mean batter plus starting pitcher salaries
mean_team_batters_starters_2016 = mean_team_starters_2016 + mean_team_batters_2016 + mean_team_relievers_2016

# get series with proportion of payroll spent on starting pitching
prop_starters = mean_team_starters_2016/mean_team_batters_starters_2016

# get series with proportion of payroll spent on starting pitching
prop_batters = mean_team_batters_2016/mean_team_batters_starters_2016

prop_relievers = mean_team_relievers_2016/mean_team_batters_starters_2016

# create dataframe with proportions
prop_df = pd.concat([prop_batters, team_2016], axis=1)
prop_df = prop_df.rename(columns={'Win Percentage': 'win percentage 2016', 'standard_salary':'batter 2016'})
prop_df =  pd.concat([prop_df, prop_starters], axis=1)
prop_df = prop_df.rename(columns={'standard_salary':'starter 2016'})

prop_df =  pd.concat([prop_df, prop_relievers], axis=1)
prop_df = prop_df.rename(columns={'standard_salary':'reliever 2016'})

# do the same for all years 2012 to 2015
for year in range(2012, 2016):
    team_year = team_winning_pct[year]
    starting_pitching_year = starting_pitchers[starting_pitchers["yearID"] == year]
    batters_year = batters[batters["yearID"] == year]
    relief_pitching_year = relief_pitchers[relief_pitchers["yearID"] == year]
    
    # get series with mean starting pitcher salaries from each year
    mean_team_starters_year = starting_pitching_year.groupby("teamID")["standard_salary"].mean()

    # get series with mean batter salaries from each year
    mean_team_batters_year = batters_year.groupby("teamID")["standard_salary"].mean()
    
    # get series with mean batter salaries from each year
    mean_team_batters_year = batters_year.groupby("teamID")["standard_salary"].mean()
    
    mean_team_relievers_year = relief_pitching_year.groupby("teamID")["standard_salary"].mean()

    # get series with mean batter plus starting pitcher salaries
    mean_team_batters_starters_year = mean_team_starters_year + mean_team_batters_year + mean_team_relievers_year

    # get series with proportion of payroll spent on starting pitching
    prop_starters = mean_team_starters_year/mean_team_batters_starters_year

    # get series with proportion of payroll spent on starting pitching
    prop_batters = mean_team_batters_year/mean_team_batters_starters_year
    
    prop_relievers = mean_team_relievers_year/mean_team_batters_starters_year
    
    # add to proportions dataframe
    prop_df = pd.concat([prop_df, prop_batters, team_year], axis=1)
    win_col = "win percentage " + str(year)
    batter_col = "batter " + str(year)
    starter_col = "starter " + str(year)
    reliever_col = "reliever " + str(year)
    prop_df = prop_df.rename(columns={'Win Percentage': win_col, 'standard_salary':batter_col})
    prop_df =  pd.concat([prop_df, prop_starters], axis=1)
    prop_df = prop_df.rename(columns={'standard_salary':starter_col})
    prop_df =  pd.concat([prop_df, prop_relievers], axis=1)
    prop_df = prop_df.rename(columns={'standard_salary':reliever_col})
display(prop_df.head())

# get correlations between win percentage, and starter and batter salary proportion
batter_corrs = []
starter_corrs = []
reliever_corrs = []
ind = []
for year in range(2012,2017):
    win_year = "win percentage " + str(year)
    batter_year = "batter " + str(year)
    starter_year = "starter " + str(year)
    reliever_year = "reliever " + str(year)
    batter_corr = prop_df[win_year].corr(prop_df[batter_year])
    starter_corr = prop_df[win_year].corr(prop_df[starter_year])
    reliever_corr = prop_df[win_year].corr(prop_df[reliever_year])
    batter_corrs = batter_corrs + [batter_corr]
    starter_corrs = starter_corrs + [starter_corr]
    reliever_corrs = reliever_corrs + [reliever_corr]
    ind = ind + [year]

# make a series for each set of correlations indexed by year
batter_correlations = pd.Series(batter_corrs, index=ind)
starter_correlations = pd.Series(starter_corrs, index=ind)
reliever_correlations = pd.Series(reliever_corrs, index=ind)

# plot the correlations on a line graph
correlations_plot = batter_correlations.plot.line(legend=True)
starter_correlations.plot.line(ax=correlations_plot, legend=True)
reliever_correlations.plot.line(ax=correlations_plot, legend=True)
correlations_plot.set_title("Salary Proportion and Winning Percentage Correlations From 2012 to 2016")
correlations_plot.set_ylabel("Correlation")
correlations_plot.set_xlabel("Year")
correlations_plot.legend(["Batters", "Starting Pitchers", "Relief Pitchers"])
batter 2016 win percentage 2016 starter 2016 reliever 2016 batter 2012 win percentage 2012 starter 2012 reliever 2012 batter 2013 win percentage 2013 starter 2013 reliever 2013 batter 2014 win percentage 2014 starter 2014 reliever 2014 batter 2015 win percentage 2015 starter 2015 reliever 2015
teamID
ARI 6.281737 0.425926 -10.539830 5.258093 -0.193050 0.500000 0.564318 0.628732 0.142761 0.500000 0.431176 0.426063 -0.941504 0.395062 0.173635 1.767869 0.213295 0.487654 0.418064 0.368642
ATL -0.197719 0.422360 0.602862 0.594857 -0.658569 0.580247 0.393583 1.264986 -1.187657 0.592593 -0.093550 2.281207 -1.373686 0.487654 0.429272 1.944414 -1.116344 0.413580 2.624507 -0.508164
BAL 1.043369 0.549383 0.579936 -0.623305 -0.033096 0.574074 0.552530 0.480566 -2.335152 0.524691 0.436545 2.898606 -1.816925 0.592593 0.305406 2.511519 0.031172 0.500000 -3.507200 4.476027
BOS 0.320255 0.574074 0.845885 -0.166140 0.989502 0.425926 0.536711 -0.526213 0.312833 0.598765 0.797508 -0.110340 0.493430 0.438272 0.783125 -0.276555 1.069860 0.481481 0.622152 -0.692013
CHA 0.081412 0.481481 0.966871 -0.048284 0.582296 0.524691 0.453693 -0.035989 0.645576 0.388889 0.729358 -0.374934 -0.598662 0.450617 2.858416 -1.259753 0.025301 0.469136 1.146613 -0.171915
Out[21]:
<matplotlib.legend.Legend at 0x7f478ab28198>

This graph shows the correlation between proportion of salary spent on batters and winning percentage over time and correlation between proportion of salary spent on starting pitchers and winning percentage over time. From this graph, it is clear to see that over the past five years, the correlation between paying starting pitchers and winning has trended up, and the correlation between paying batters and winning increased for a couple years but is now trending down. The correlation between relief pitchers and winning is also on this graph, but as we could have guessed from out earlier conclusion, it is trending way down. We do not have any salary data for any year after 2017, but if this was the year 2017 it is clear that it is best to pay starting pitchers more money because recently spending money on starting pitching has paid off.

Build K Nearest Neighbors Predictive Model

We want to have a model that can predict win percentage based on the proportion of payroll a team uses at each position

In [22]:
# create new dataframe with salarie proportions and winning percentage
winning_percentages = []
relief_pitcher_sal = []
starting_pitcher_sal = []
batter_sal = []
for row in prop_df.iterrows():
    for year in range(2012,2017):
        win_year = "win percentage " + str(year)
        batter_year = "batter " + str(year)
        starter_year = "starter " + str(year)
        reliever_year = "reliever " + str(year)
        winning_percentages = winning_percentages + [row[1][win_year]]
        batter_sal = batter_sal + [row[1][batter_year]]
        starting_pitcher_sal = starting_pitcher_sal + [row[1][starter_year]]
        relief_pitcher_sal = relief_pitcher_sal + [row[1][reliever_year]]
data = {"relief_pitcher_proportion":relief_pitcher_sal, "starting_pitcher_proportion":starting_pitcher_sal, 
        "batter_proportion": batter_sal, "winning_percentage":winning_percentages}
salary_proportions = pd.DataFrame(data)
display(salary_proportions.head())
relief_pitcher_proportion starting_pitcher_proportion batter_proportion winning_percentage
0 0.628732 0.564318 -0.193050 0.500000
1 0.426063 0.431176 0.142761 0.500000
2 1.767869 0.173635 -0.941504 0.395062
3 0.368642 0.418064 0.213295 0.487654
4 5.258093 -10.539830 6.281737 0.425926
In [23]:
# Make model using newly created dataframe
salary_proportions

# set features and training data
features = ["relief_pitcher_proportion", "starting_pitcher_proportion", "batter_proportion"]
Xtrain = pd.get_dummies(salary_proportions[features])
Ytrain = salary_proportions["winning_percentage"]

# Create series with the given data
Xnew = pd.Series(index=Xtrain.columns)
Xnew["starting_pitcher_proportion"] = 0.5
Xnew["batter_proportion"] = 0.5
Xnew["relief_pitcher_proportion"] = 0


# Standardize the variables.
Xtrain_mean = Xtrain.mean()
Xtrain_std = Xtrain.std()

Xtrain_sc = (Xtrain - Xtrain_mean) / Xtrain_std
Xnew_sc = (Xnew - Xtrain_mean) / Xtrain_std

# Find index of 30 nearest neighbors.
dists = np.sqrt(((Xtrain_sc - Xnew_sc) ** 2).sum(axis=1))
i_nearest = dists.sort_values()[:30].index

# Average the labels of these 30 nearest neighbors
Ytrain.loc[i_nearest].mean()
Out[23]:
0.5062929734427318

According to this model, if you spend the mean salary for they year on relief pitching, and then split the rest of your salary between starting pitching and batting, you will win approximately half of your games.

Conclusion

My original hypothesis was that starting pitchers are paid the most, followed by batters and then relief pitchers. I was correct in that recently starting pitchers have been getting paid the most, followed closely by batters, with relief pitchers way below them in terms of average salary. I also believed that teams value earned run average over other statistics, but I was wrong and it turns out that actually strikeouts are the most valued statistic by teams when they pay their pitchers. I also went on to discover that paying relief pitchers too much money can really hurt a teams chances of winning, and it is best to pay starting pitchers because that will bring you the most wins. Starting pitchers and batters are both paid fairly, neither are overpaid or underpaid. Therefore, starting pitchers are the most valued position, but batters are still very valuable, so you should try to pay those two positions a good part of your payroll, while trying to minimize the amount of money spent on relief pitching. This is supported by the Washington Nationals, who just won the World Series even though their relief pitchers were ranked at the bottom of the league.

In [ ]: