Baseball teams are usually divided into two different parts: their offense led by their batters, and their defense led by their pitchers. Every offseason, Major League Baseball teams have to make an important decision: Should they spend their money on their batting or their pitching? In this study, I am going to attempt to form an answer to this question by observing how teams value their pitchers and whether they should pay them more or less. Earlier this semester, we did an investigation of payroll efficiency for each team. This study is going to take this a step further by investigating how a team can achieve payroll efficiency by how much they should allocate to their pitching staff. While we found some interesting information in our moneyball project about how efficient teams have been, this study should provide a whole new level of strategy for general managers on which position to pay in order to achieve efficiency.
My hypothesis is that starting pitchers will be valued more than batters, and both will be valued more than relief pitchers. The most sought after statistic is earned run average, since that reflects a pitchers ability to beat the opposition. I also believe that starting pitchers are underpaid and batters are overpaid.
I am retrieving my data from csv files located in Lahman's Baseball Database. In this cell, I will read in the data from four csv files: Pitching.csv, Batting.csv, Salaries.csv, People.csv. After that, I will do some joins so that the full player name is in the batting and pitching dataframe and also get a pitchers and batters salaries dataframe. During this process, I will standardize the payrolls across years to account for the rise in payroll rate
# read in salaries.csv, pitching.csv, and people.csv into dataframes
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
salaries_df = pd.read_csv("./data/Salaries.csv")
pitching_df = pd.read_csv("./data/Pitching.csv")
batting_df = pd.read_csv("./data/Batting.csv")
people_df = pd.read_csv("./data/People.csv")
# get salary aggregations
year_salary_aggs = salaries_df.groupby(["yearID"]).agg(["mean", "std"])
year_salary_aggs = year_salary_aggs["salary"]
# create new column standard salary
standardized_salaries = []
for row in salaries_df.iterrows():
yearID = row[1]["yearID"]
salary = row[1]["salary"]
mean = year_salary_aggs.loc[yearID]["mean"]
std = year_salary_aggs.loc[yearID]["std"]
standardized_salary = (salary - mean)/std
standardized_salaries = standardized_salaries + [standardized_salary]
salaries_df["standard_salary"] = standardized_salaries
# join pitching and people so that the player name is now part of pitching dataframe
pitching_df_names = pd.merge(pitching_df, people_df[["playerID", "nameFirst", "nameLast"]], on="playerID", how="left")
# get pitcher salary into pitching_df
pitching = pd.merge(pitching_df_names, salaries_df[["playerID", "yearID", "standard_salary"]], on=["playerID", "yearID"], how="left")
# trim pitching salaries to only get years with relevant salary data
pitching_salaries = pitching[pitching["yearID"] >= 1985]
# drop any record that has NaN for a column
pitchers = pitching_salaries.dropna()
print("\nPitcher Statistics")
display(pitchers.head())
# join batting and people so that the player name is part of batting dataframe
batting_df_names = pd.merge(batting_df, people_df[["playerID", "nameFirst", "nameLast"]], on="playerID", how="left")
# get batter salary into batting dataframe
batting = pd.merge(batting_df_names, salaries_df[["playerID", "yearID", "standard_salary"]], on=["playerID", "yearID"], how="inner")
# trim pitching salaries to only get years with relevant salary data
batting_salaries = batting[batting["yearID"] >= 1985]
# trim batters with less than 100 At Bats to get rid of pitchers from that table
batter_salaries = batting_salaries[batting_salaries["AB"] > 100]
# drop any record that has NaN for a column
batters = batter_salaries.dropna()
print("\n\nBatter Statistics")
display(batters.head())
We also want to separate starting pitchers and relief pitchers for future analysis. Since there are six months in the major league baseball season, we will classify pitchers as those who did not have at least one start a month, so any players that have at least six games started is a starting pitcher, any pitcher with fewer than six games started is a relief pitcher
# starting pitchers are all the pitchers that have at least 5 games started
starting_pitchers = pitchers[pitchers["GS"] >= 6]
print("\nStarting Pitcher Statistics")
display(starting_pitchers.head())
# relief pitchers are all the pitchers that have less than 5 games started
relief_pitchers = pitchers[pitchers["GS"] < 6]
print("\n\nRelief Pitcher Statistics")
display(relief_pitchers.head())
To provide a bit of context, I am going to set up a few graphs tracking some important statistics for pitchers such as earned run average and strikeouts, and for batters such as batting average and home runs
Use line graphs to track earned run average and strikeouts over time
# plot total strikeouts over time on line graph
total_strikeouts = pitchers.groupby("yearID")["SO"].sum()
so_plot = total_strikeouts.plot.line(title="Total Strikeouts Over Time")
so_plot.set_ylabel("Total Strikeouts")
so_plot.set_xlabel("Year")
From the graph we can see that since 1985 total strikeouts have steadily increased with a massive rise around 1995 followed by a relatively small decline a couple of years after
# plot earned run average over time on line graph
mean_era = pitchers.groupby("yearID")["ERA"].mean()
era_plot = mean_era.plot.line(title="Earned Run Average Over Time")
era_plot.set_ylabel("ERA")
era_plot.set_xlabel("Year")
From the graph we can see that ERA was increased up until around its peak in 2000, and since then it has steadily decreased, which means that pitching has improved since 2000
# create new column Batting Avg
avgs = []
for row in batters.iterrows():
hits = row[1]["H"]
atbats = row[1]["AB"]
avg = hits/atbats
avgs = avgs + [avg]
batters["AVG"] = avgs
# plot batting average over time on line graph
batting_average = batters.groupby("yearID")["AVG"].mean()
ba_plot = batting_average.plot.line(title="Batting Average Over Time")
ba_plot.set_ylabel("Batting Average")
ba_plot.set_xlabel("Year")
From the graph we can see that batting average had a major increase in the early 1990's and stayed high for a while until the late 2000's in which it had a precipitous drop and has been very low in the recent past
# plot batting average over time on line graph
home_runs = batters.groupby("yearID")["HR"].mean()
hr_plot = home_runs.plot.line(title="Home Runs Over Time")
hr_plot.set_ylabel("Home Runs")
hr_plot.set_xlabel("Year")
From the graph we can see that while home runs were increasing since 1990 they started to decrease starting around 2005 but recently there has been a massive increase in the number of home runs hit per season. The spike in the middle of the graph is most likely a result of the steroid era.
From analyzing these graphs, we can see that in the past decade there has been a decrease in batting average and earned run average, but an increase in home runs and strikeouts. These trends can be attributed to two main factors. The first factor is that batters are now attempting to hit for power instead of hitting for contact, which means that while there would be an increase in home runs, batting average would decrease and strikeouts would decrease. The second factor is that pitching has been improving, which would lower batting average and earned run average, while also increasing strikeouts.
Use line graphs to compare average pitcher salary to average total player salary from 1985 to 2016
# get mean pitcher salaries
mean_pitcher_salaries = pitchers.groupby("yearID")["standard_salary"].mean()
# get mean batter salaries
mean_batter_salaries = batters.groupby("yearID")["standard_salary"].mean()
# plot mean salaries
mean_plot = mean_pitcher_salaries.plot.line(legend=True)
mean_batter_salaries.plot.line(ax=mean_plot, legend=True)
mean_plot.set_title("Mean Standard Salaries Over Time")
mean_plot.set_ylabel("Mean Standard Salary")
mean_plot.set_xlabel("Year")
mean_plot.legend(["Pitchers", "Batters"])
From this graph we can see that since 1985 batters have always gotten paid more than pitchers, and as the years have progressed there has not been a very big change in how batters and pitchers have been paid. Interestingly enough, in the 1990's there actually was a time in which the value of batters increased and the value of pitchers decreased, but during the turn of the century this evened back out and now in the present day pitchers and batters are valued the same way that they were in 1985
One of the reasons that pitchers may be getting paid less than batters is that relief pitchers are often valued less than other positions, which would lower the overall mean pitcher salary. Therefore, in this part we want to compare salary for starting pitchers and relief pitchers to see if there is a difference
# get mean starting pitcher salaries
mean_starter_salaries = starting_pitchers.groupby("yearID")["standard_salary"].mean()
# get mean relief pitchers salaries
mean_relief_salaries = relief_pitchers.groupby("yearID")["standard_salary"].mean()
# plot mean starting salaries and mean relief salaries over time
mean_pitchers_plot = mean_starter_salaries.plot.line(legend=True)
mean_relief_salaries.plot.line(ax=mean_pitchers_plot, legend=True)
mean_pitchers_plot.set_title("Mean Pitcher Standard Salaries Over Time")
mean_pitchers_plot.set_ylabel("Mean Standard Salary")
mean_pitchers_plot.set_xlabel("Year")
mean_pitchers_plot.legend(["Starting Pitchers", "Relief Pitchers"])
This plot shows that while in 1985 relief pitchers and starting pitchers were relatively close in how they were paid, since 1985 the value of starting pitchers has trended upwards, while the value of relief pitchers have trended downward, and it shows that starting pitchers are deemed more valuable than relief pitchers by a lot in the present day.
Now that we have seen that there is indeed a difference between starting pitchers and relief pitchers, we can compare starting pitchers, relief pitchers and batters to see the complete order in how they are valued
# plot mean starting salaries and mean relief salaries over time
mean_starters_plot = mean_starter_salaries.plot.line(legend=True)
mean_batter_salaries.plot.line(ax=mean_starters_plot, legend=True)
mean_relief_salaries.plot.line(ax=mean_starters_plot, legend=True)
mean_starters_plot.set_title("Mean Standard Salary Over Time")
mean_starters_plot.set_ylabel("Mean Standard Salary")
mean_starters_plot.set_xlabel("Year")
mean_starters_plot.legend(["Starting Pitchers", "Batters", "Relief Pitchers"])
This plot shows that while over the years the value of starting pitchers and the value of batters have traded places, since the early 200's the value of starting pitchers have trended upward signficantly, thus giving us the conclusion that starting pitchers are now valued more than batters. Relief pitchers, on the other hand, are valued much lower than both starting pitchers and batters.
Since starting pitchers have recently been getting paid more in the last five years, we want to see what is valued in a pitcher that causes them to get larger contracts.
# plot starting pitchers strikeouts over time
starting_so = starting_pitchers.groupby("yearID")["SO"].sum()
st_so_plot = starting_so.plot.line(title="Starting Pitcher Strikeouts Over Time")
st_so_plot.set_ylabel("Strikeouts")
st_so_plot.set_xlabel("Year")
This graph strongly resembles the graph for all pitchers strikeouts, therefore there are no extra conclusions to be taken from this graph
# plot starting pitchers earned run average over time
starting_era = starting_pitchers.groupby("yearID")["ERA"].mean()
st_era_plot = starting_era.plot.line(title="Starting Pitcher Earned Run Average Over Time")
st_era_plot.set_ylabel("Earned Run Average")
st_era_plot.set_xlabel("Year")
As with strikeouts, this graph also resembles the graph for all pitchers earned run average, therefore there is no conclusion that we can draw from this graph
With starting pitchers, their record is an important statistic so we are going to include that in the study as an extra statistic to track
# Make new column called win percentage
winpct = []
for row in starting_pitchers.iterrows():
wins = row[1]["W"]
games = row[1]["GS"]
pct = wins/games
winpct = winpct + [pct]
starting_pitchers["win percentage"] = winpct
# plot starting pitchers win percentage over time
starting_winpct = starting_pitchers.groupby("yearID")["win percentage"].mean()
st_winpct_plot = starting_winpct.plot.line(title="Starting Pitcher Win Percentage Over Time")
st_winpct_plot.set_ylabel("Win Percentage")
st_winpct_plot.set_xlabel("Year")
According to this graph, we can see that starting pitcher win percentage has actually decreased over the years, so that is most likely not the reason that pitchers are being paid more, but that fact is not necessarily conclusive so we will do more investigation
In conclusion for this section, since there is no concrete discovery that we can draw from these graphs, we will have to take a different approach in trying to discover why starting pitchers have been so valuable recently
We want to identify the league leaders at each position in the last five years to identify the prime statistic on which pitchers are paid
# get data frame with starting pitchers just from the last five years
starting_pitchers_recent = starting_pitchers[starting_pitchers["yearID"] > 2011]
# take top 100 earned run average pitcher seasons
era_leaders = starting_pitchers_recent.sort_values(by=["ERA"])[:100]
print ("\nTop 10 Leaders in ERA")
display(era_leaders[:10])
# take the mean salary for these pitchers
mean_era_salary = era_leaders["standard_salary"].mean()
# take top 100 strikeout pitcher seasons
so_leaders = starting_pitchers_recent.sort_values(by=["SO"], ascending=False)[:100]
print ("\n\nTop 10 Leaders in Strikeouts")
display(so_leaders[:10])
# take the mean salary for these pitchers
mean_so_salary = so_leaders["standard_salary"].mean()
# take top 100 win percentage pitcher seasons
winpct_leaders = starting_pitchers_recent.sort_values(by=["win percentage"], ascending=False)[:100]
print ("\n\nTop 10 Leaders in Win Percentage")
display(winpct_leaders[:10])
# take the mean salary for these pitchers
mean_winpct_salary = winpct_leaders["standard_salary"].mean()
data = [mean_era_salary, mean_so_salary, mean_winpct_salary]
ind = ["ERA", "Strikeouts", "Win Pct"]
mean_salary = pd.Series(data, index=ind)
mean_statistic_bar = mean_salary.plot.bar(title="Mean Standard Salary for Statistic Leaders Last Five Years")
mean_statistic_bar.set_ylabel("Standard Salary")
From this bar graph, it is clear to see that teams value strikeouts more than earned run average and win percentage, most likely because strikeouts are one of the main measures of a pitchers ability to complete dominate batters
I need to retrieve the teams data from Teams.csv because I will need data about their wins and losses from the past five years
# read in teams data from csv file
teams = pd.read_csv("./data/Teams.csv")
# cut out all of the records that are in the year 2011 or before
recent_teams = teams[teams["yearID"] > 2011]
# create new column win percentage
winpcts = []
for row in recent_teams.iterrows():
wins = row[1]["W"]
games = row[1]["G"]
pct = wins/games
winpcts = winpcts + [pct]
recent_teams["Win Percentage"] = winpcts
# display teams dataframe
display(recent_teams.head())
We want to see the amount of money that each team spends on their starting pitching staff in the past five years
# get mean starting pitcher salary for each team from the past five years
team_starter_salaries = starting_pitchers_recent.groupby(["teamID", "yearID"])["standard_salary"].mean()
mean_team_starter_salaries = team_starter_salaries.groupby("teamID").mean()
# plot salaries on a bar graph
team_salary_bar = mean_team_starter_salaries.plot.bar(title = "Mean Starting Pitcher Standard Salary in the Past Five Years")
team_salary_bar.set_ylabel("Mean Standard Salary")
From this bar graph, it is clear than teams such as the San Francisco Giants, the Detroit Tigers, and the Los Angeles Dodgers have all spent a lot of money on starting pitching. However, this information is not useful until we can see this spending as a proportion of their total payroll
# get dataframe with recent relief pitching
recent_relief_pitching = relief_pitchers[relief_pitchers["yearID"] > 2011]
# get dataframe with recent batting
recent_batting = batters[batters["yearID"] > 2011]
# get series with mean relief pitcher salaries
team_reliever_salaries = recent_relief_pitching.groupby(["teamID", "yearID"])["standard_salary"].mean()
mean_team_reliever_salaries = team_reliever_salaries.groupby("teamID").mean()
# get series with mean batting salaries
team_batting_salaries = recent_batting.groupby(["teamID", "yearID"])["standard_salary"].mean()
mean_team_batting_salaries = team_batting_salaries.groupby("teamID").mean()
# get series with mean total salary
mean_total_salaries = mean_team_starter_salaries + mean_team_reliever_salaries + mean_team_batting_salaries
# get series with proportion of payroll spent on starting pitching
proportion_starter_salary = mean_team_starter_salaries/mean_total_salaries
# plot the proportion series
proportion_plot = proportion_starter_salary.plot.bar()
proportion_plot.set_title("Proportion of Payroll Spent on Starting Pitching in Last Five Years")
With this graph it now gives an entirely new picture and we can see that it is actually the Seattle Mariners and Chicago Cubs who spend the most on their starting pitching. The Cincinnati Reds and the Kansas City Royals spend the least amount of their payroll on starting pitching
Now that we have seen how much each team spends on starting pitching, we also want to see how this correlates with their winning percentage
# get mean winning percentage for each time in the last five years
team_winning_pct = recent_teams.groupby("teamID")["Win Percentage"].mean()
# combine the two series into one dataframe
percentage_df = pd.concat([proportion_starter_salary, team_winning_pct], axis=1)
# rename win percentage column
percentage_df = percentage_df.rename(columns={'Win Percentage': 'win percentage', 'standard_salary':'starting pitcher salary proportion'})
# make bar plot with winning percentage and salary
percentage_df.plot.bar(title = "Winning Percentage Compared to Proportion of Payroll Spent on Starters",figsize=(8,6))
This bar graph does not necessarily give us anything conclusive about the relationship between winning percentage and proportion of payroll spent on starting pitchers, so instead we can try to make a scatter plot to see the correlation
plt.figure()
plt.scatter(percentage_df["win percentage"], percentage_df["starting pitcher salary proportion"])
# add team label to each point
for row in percentage_df.iterrows():
x = row[1]["win percentage"]
y = row[1]["starting pitcher salary proportion"]
val = row[0]
plt.annotate(val, (x,y))
# add labels and title for the plot
plt.ylabel("starting pitcher salary proportion")
plt.xlabel("win percentage")
plt.title("Winning Percentage Compared to Proportion of Payroll Spent on Starters from 2012 to 2016")
From the scatter plot, we can now see that there is positive correlation between the proportion of payroll spent on starting pitchers and the winning percentage of a team. We now want to compare this scatter plot to the proportion of payroll spent on relief pitching and the proportion of payroll spent on batters
# get proportion of payroll spent on batting
proportion_batter_salary = mean_team_batting_salaries/mean_total_salaries
# get proportion of payroll spent on relievers
proportion_reliever_salary = mean_team_reliever_salaries/mean_total_salaries
# add two new series to the proportions dataframe
percentage_df["relief pitcher salary proportion"] = proportion_reliever_salary
percentage_df["batter salary proportion"] = proportion_batter_salary
# make a scatter plot with reliever proportions
plt.figure()
plt.scatter(percentage_df["win percentage"], percentage_df["relief pitcher salary proportion"])
# add team label to each point
for row in percentage_df.iterrows():
x = row[1]["win percentage"]
y = row[1]["relief pitcher salary proportion"]
val = row[0]
plt.annotate(val, (x,y))
# add labels and title for the plot
plt.ylabel("relief pitcher salary proportion")
plt.xlabel("win percentage")
plt.title("Winning Percentage Compared to Proportion of Payroll Spent on Relievers from 2012 to 2016")
# make a scatter plot with batter proportions
plt.figure()
plt.scatter(percentage_df["win percentage"], percentage_df["batter salary proportion"])
# add team label to each point
for row in percentage_df.iterrows():
x = row[1]["win percentage"]
y = row[1]["batter salary proportion"]
val = row[0]
plt.annotate(val, (x,y))
# add labels and title for the plot
plt.ylabel("batter salary proportion")
plt.xlabel("win percentage")
plt.title("Winning Percentage Compared to Proportion of Payroll Spent on Batters from 2012 t0 2016")
From the relief pitcher and batter graphs, we can see that there is also positive correlation between paying batters and winning percentage. However, we can also see that there is negative correlation between paying relief pitchers and winning percentage, so this suggests that teams should be paying starting pitchers and batters, but should not be paying relief pitchers. A couple of the more interesting points are that the Cincinnati Reds spend a lot of money on relief pitching and not as much on starting pitching and they are not as successful. Meanwhile, two of the most successful teams over the past five years, the Los Angeles Dodgers and the Washington Nationals, spend a large proportion of their payroll on starting pitching and batting, and barely any money on relief pitching. Based on the scatter plot, it is difficult to see which position has more correlation between starting pitching and batting, so we want to get the actual correlation value.
# Get correlation between starting pitcher salary proportion and winning percentage
starter_corr = percentage_df["win percentage"].corr(percentage_df["starting pitcher salary proportion"])
# Get correlation between relief pitcher salary proportion and winning percentage
reliever_corr = percentage_df["win percentage"].corr(percentage_df["relief pitcher salary proportion"])
# Get correlation between batter salary proportion and winning percentage
batter_corr = percentage_df["win percentage"].corr(percentage_df["batter salary proportion"])
# print output statements with the correlations
print("The Correlation between Win Percentage and Starting Pitcher Salary Proportion is " + str(starter_corr))
print("The Correlation between Win Percentage and Relief Pitcher Salary Proportion is " + str(reliever_corr))
print("The Correlation between Win Percentage and Batter Salary Proportion is " + str(batter_corr))
These results tell us that there is a higher correlation between starting pitching and winning percentage and win percentage than between batting and winning percentage. The relief pitching and winning percentage correlation is negative.
Relief Pitcher Conclusion: At this point we can draw a final conclusion on relief pitchers. Teams do not pay relief pitchers very much of their payroll, which is the right move since paying relief pitchers a lot of money is not a good use of resources due to the negative correlation between proportion of money spent on relief pitchers and winning percentage. Starting pitchers and batters are paid more because they have more of an impact on team success
Due to the fact that we have reached a conclusion on relief pitching, we will now proceed with just comparing starting pitching and batting. While relief pitching might still be included in graphs and tables, I will not spend as much time on that analysis
In this next stage we are going to compare starting pitchers and batters in the past five years and see how the correlation between paying each position and winning percentage has developed recently
team_winning_pct = recent_teams.groupby(["yearID", "teamID"])["Win Percentage"].mean()
team_2016 = team_winning_pct[2016]
starting_pitching_2016 = starting_pitchers[starting_pitchers["yearID"] == 2016]
batters_2016 = batters[batters["yearID"] == 2016]
relief_pitching_2016 = relief_pitchers[relief_pitchers["yearID"] == 2016]
# get series with mean starting pitcher salaries from 2016
mean_team_starters_2016 = starting_pitching_2016.groupby("teamID")["standard_salary"].mean()
# get series with mean batter salaries from 2016
mean_team_batters_2016 = batters_2016.groupby("teamID")["standard_salary"].mean()
# get series with mean batter salaries from 2016
mean_team_relievers_2016 = relief_pitching_2016.groupby("teamID")["standard_salary"].mean()
# get series with mean batter plus starting pitcher salaries
mean_team_batters_starters_2016 = mean_team_starters_2016 + mean_team_batters_2016 + mean_team_relievers_2016
# get series with proportion of payroll spent on starting pitching
prop_starters = mean_team_starters_2016/mean_team_batters_starters_2016
# get series with proportion of payroll spent on starting pitching
prop_batters = mean_team_batters_2016/mean_team_batters_starters_2016
prop_relievers = mean_team_relievers_2016/mean_team_batters_starters_2016
# create dataframe with proportions
prop_df = pd.concat([prop_batters, team_2016], axis=1)
prop_df = prop_df.rename(columns={'Win Percentage': 'win percentage 2016', 'standard_salary':'batter 2016'})
prop_df = pd.concat([prop_df, prop_starters], axis=1)
prop_df = prop_df.rename(columns={'standard_salary':'starter 2016'})
prop_df = pd.concat([prop_df, prop_relievers], axis=1)
prop_df = prop_df.rename(columns={'standard_salary':'reliever 2016'})
# do the same for all years 2012 to 2015
for year in range(2012, 2016):
team_year = team_winning_pct[year]
starting_pitching_year = starting_pitchers[starting_pitchers["yearID"] == year]
batters_year = batters[batters["yearID"] == year]
relief_pitching_year = relief_pitchers[relief_pitchers["yearID"] == year]
# get series with mean starting pitcher salaries from each year
mean_team_starters_year = starting_pitching_year.groupby("teamID")["standard_salary"].mean()
# get series with mean batter salaries from each year
mean_team_batters_year = batters_year.groupby("teamID")["standard_salary"].mean()
# get series with mean batter salaries from each year
mean_team_batters_year = batters_year.groupby("teamID")["standard_salary"].mean()
mean_team_relievers_year = relief_pitching_year.groupby("teamID")["standard_salary"].mean()
# get series with mean batter plus starting pitcher salaries
mean_team_batters_starters_year = mean_team_starters_year + mean_team_batters_year + mean_team_relievers_year
# get series with proportion of payroll spent on starting pitching
prop_starters = mean_team_starters_year/mean_team_batters_starters_year
# get series with proportion of payroll spent on starting pitching
prop_batters = mean_team_batters_year/mean_team_batters_starters_year
prop_relievers = mean_team_relievers_year/mean_team_batters_starters_year
# add to proportions dataframe
prop_df = pd.concat([prop_df, prop_batters, team_year], axis=1)
win_col = "win percentage " + str(year)
batter_col = "batter " + str(year)
starter_col = "starter " + str(year)
reliever_col = "reliever " + str(year)
prop_df = prop_df.rename(columns={'Win Percentage': win_col, 'standard_salary':batter_col})
prop_df = pd.concat([prop_df, prop_starters], axis=1)
prop_df = prop_df.rename(columns={'standard_salary':starter_col})
prop_df = pd.concat([prop_df, prop_relievers], axis=1)
prop_df = prop_df.rename(columns={'standard_salary':reliever_col})
display(prop_df.head())
# get correlations between win percentage, and starter and batter salary proportion
batter_corrs = []
starter_corrs = []
reliever_corrs = []
ind = []
for year in range(2012,2017):
win_year = "win percentage " + str(year)
batter_year = "batter " + str(year)
starter_year = "starter " + str(year)
reliever_year = "reliever " + str(year)
batter_corr = prop_df[win_year].corr(prop_df[batter_year])
starter_corr = prop_df[win_year].corr(prop_df[starter_year])
reliever_corr = prop_df[win_year].corr(prop_df[reliever_year])
batter_corrs = batter_corrs + [batter_corr]
starter_corrs = starter_corrs + [starter_corr]
reliever_corrs = reliever_corrs + [reliever_corr]
ind = ind + [year]
# make a series for each set of correlations indexed by year
batter_correlations = pd.Series(batter_corrs, index=ind)
starter_correlations = pd.Series(starter_corrs, index=ind)
reliever_correlations = pd.Series(reliever_corrs, index=ind)
# plot the correlations on a line graph
correlations_plot = batter_correlations.plot.line(legend=True)
starter_correlations.plot.line(ax=correlations_plot, legend=True)
reliever_correlations.plot.line(ax=correlations_plot, legend=True)
correlations_plot.set_title("Salary Proportion and Winning Percentage Correlations From 2012 to 2016")
correlations_plot.set_ylabel("Correlation")
correlations_plot.set_xlabel("Year")
correlations_plot.legend(["Batters", "Starting Pitchers", "Relief Pitchers"])
This graph shows the correlation between proportion of salary spent on batters and winning percentage over time and correlation between proportion of salary spent on starting pitchers and winning percentage over time. From this graph, it is clear to see that over the past five years, the correlation between paying starting pitchers and winning has trended up, and the correlation between paying batters and winning increased for a couple years but is now trending down. The correlation between relief pitchers and winning is also on this graph, but as we could have guessed from out earlier conclusion, it is trending way down. We do not have any salary data for any year after 2017, but if this was the year 2017 it is clear that it is best to pay starting pitchers more money because recently spending money on starting pitching has paid off.
We want to have a model that can predict win percentage based on the proportion of payroll a team uses at each position
# create new dataframe with salarie proportions and winning percentage
winning_percentages = []
relief_pitcher_sal = []
starting_pitcher_sal = []
batter_sal = []
for row in prop_df.iterrows():
for year in range(2012,2017):
win_year = "win percentage " + str(year)
batter_year = "batter " + str(year)
starter_year = "starter " + str(year)
reliever_year = "reliever " + str(year)
winning_percentages = winning_percentages + [row[1][win_year]]
batter_sal = batter_sal + [row[1][batter_year]]
starting_pitcher_sal = starting_pitcher_sal + [row[1][starter_year]]
relief_pitcher_sal = relief_pitcher_sal + [row[1][reliever_year]]
data = {"relief_pitcher_proportion":relief_pitcher_sal, "starting_pitcher_proportion":starting_pitcher_sal,
"batter_proportion": batter_sal, "winning_percentage":winning_percentages}
salary_proportions = pd.DataFrame(data)
display(salary_proportions.head())
# Make model using newly created dataframe
salary_proportions
# set features and training data
features = ["relief_pitcher_proportion", "starting_pitcher_proportion", "batter_proportion"]
Xtrain = pd.get_dummies(salary_proportions[features])
Ytrain = salary_proportions["winning_percentage"]
# Create series with the given data
Xnew = pd.Series(index=Xtrain.columns)
Xnew["starting_pitcher_proportion"] = 0.5
Xnew["batter_proportion"] = 0.5
Xnew["relief_pitcher_proportion"] = 0
# Standardize the variables.
Xtrain_mean = Xtrain.mean()
Xtrain_std = Xtrain.std()
Xtrain_sc = (Xtrain - Xtrain_mean) / Xtrain_std
Xnew_sc = (Xnew - Xtrain_mean) / Xtrain_std
# Find index of 30 nearest neighbors.
dists = np.sqrt(((Xtrain_sc - Xnew_sc) ** 2).sum(axis=1))
i_nearest = dists.sort_values()[:30].index
# Average the labels of these 30 nearest neighbors
Ytrain.loc[i_nearest].mean()
According to this model, if you spend the mean salary for they year on relief pitching, and then split the rest of your salary between starting pitching and batting, you will win approximately half of your games.
My original hypothesis was that starting pitchers are paid the most, followed by batters and then relief pitchers. I was correct in that recently starting pitchers have been getting paid the most, followed closely by batters, with relief pitchers way below them in terms of average salary. I also believed that teams value earned run average over other statistics, but I was wrong and it turns out that actually strikeouts are the most valued statistic by teams when they pay their pitchers. I also went on to discover that paying relief pitchers too much money can really hurt a teams chances of winning, and it is best to pay starting pitchers because that will bring you the most wins. Starting pitchers and batters are both paid fairly, neither are overpaid or underpaid. Therefore, starting pitchers are the most valued position, but batters are still very valuable, so you should try to pay those two positions a good part of your payroll, while trying to minimize the amount of money spent on relief pitching. This is supported by the Washington Nationals, who just won the World Series even though their relief pitchers were ranked at the bottom of the league.