Harrison Pratt¶

The Value of the Major League Baseball Pitcher¶

https://hpratt1.github.io/

The Problem¶

Baseball teams are usually divided into two different parts: their offense led by their batters, and their defense led by their pitchers. Every offseason, Major League Baseball teams have to make an important decision: Should they spend their money on their batting or their pitching? In this study, I am going to attempt to form an answer to this question by observing how teams value their pitchers and whether they should pay them more or less. Earlier this semester, we did an investigation of payroll efficiency for each team. This study is going to take this a step further by investigating how a team can achieve payroll efficiency by how much they should allocate to their pitching staff. While we found some interesting information in our moneyball project about how efficient teams have been, this study should provide a whole new level of strategy for general managers on which position to pay in order to achieve efficiency.

My Hypothesis¶

My hypothesis is that starting pitchers will be valued more than batters, and both will be valued more than relief pitchers. The most sought after statistic is earned run average, since that reflects a pitchers ability to beat the opposition. I also believe that starting pitchers are underpaid and batters are overpaid.

Retrieving and Tidying Data¶

I am retrieving my data from csv files located in Lahman's Baseball Database. In this cell, I will read in the data from four csv files: Pitching.csv, Batting.csv, Salaries.csv, People.csv. After that, I will do some joins so that the full player name is in the batting and pitching dataframe and also get a pitchers and batters salaries dataframe. During this process, I will standardize the payrolls across years to account for the rise in payroll rate

# read in salaries.csv, pitching.csv, and people.csv into dataframes
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import warnings 
warnings.filterwarnings('ignore')
salaries_df = pd.read_csv("./data/Salaries.csv")
pitching_df = pd.read_csv("./data/Pitching.csv")
batting_df = pd.read_csv("./data/Batting.csv")
people_df = pd.read_csv("./data/People.csv")

# get salary aggregations
year_salary_aggs = salaries_df.groupby(["yearID"]).agg(["mean", "std"])
year_salary_aggs = year_salary_aggs["salary"]

# create new column standard salary
standardized_salaries = []
for row in salaries_df.iterrows():
    yearID = row[1]["yearID"]
    salary = row[1]["salary"]
    mean = year_salary_aggs.loc[yearID]["mean"]
    std = year_salary_aggs.loc[yearID]["std"]
    standardized_salary = (salary - mean)/std
    standardized_salaries = standardized_salaries + [standardized_salary]
salaries_df["standard_salary"] = standardized_salaries


# join pitching and people so that the player name is now part of pitching dataframe
pitching_df_names = pd.merge(pitching_df, people_df[["playerID", "nameFirst", "nameLast"]], on="playerID", how="left")

# get pitcher salary into pitching_df
pitching = pd.merge(pitching_df_names, salaries_df[["playerID", "yearID", "standard_salary"]], on=["playerID", "yearID"], how="left")
# trim pitching salaries to only get years with relevant salary data
pitching_salaries = pitching[pitching["yearID"] >= 1985]
# drop any record that has NaN for a column
pitchers = pitching_salaries.dropna()
print("\nPitcher Statistics")
display(pitchers.head())

# join batting and people so that the player name is part of batting dataframe
batting_df_names = pd.merge(batting_df, people_df[["playerID", "nameFirst", "nameLast"]], on="playerID", how="left")
# get batter salary into batting dataframe
batting = pd.merge(batting_df_names, salaries_df[["playerID", "yearID", "standard_salary"]], on=["playerID", "yearID"], how="inner")
# trim pitching salaries to only get years with relevant salary data
batting_salaries = batting[batting["yearID"] >= 1985]
# trim batters with less than 100 At Bats to get rid of pitchers from that table
batter_salaries = batting_salaries[batting_salaries["AB"] > 100]
# drop any record that has NaN for a column
batters = batter_salaries.dropna()
print("\n\nBatter Statistics")
display(batters.head())

Pitcher Statistics


Batter Statistics

We also want to separate starting pitchers and relief pitchers for future analysis. Since there are six months in the major league baseball season, we will classify pitchers as those who did not have at least one start a month, so any players that have at least six games started is a starting pitcher, any pitcher with fewer than six games started is a relief pitcher

# starting pitchers are all the pitchers that have at least 5 games started
starting_pitchers = pitchers[pitchers["GS"] >= 6]
print("\nStarting Pitcher Statistics")
display(starting_pitchers.head())
# relief pitchers are all the pitchers that have less than 5 games started
relief_pitchers = pitchers[pitchers["GS"] < 6]
print("\n\nRelief Pitcher Statistics")
display(relief_pitchers.head())

Starting Pitcher Statistics


Relief Pitcher Statistics

Examine important statistics over time¶

To provide a bit of context, I am going to set up a few graphs tracking some important statistics for pitchers such as earned run average and strikeouts, and for batters such as batting average and home runs

Track Pitching Statistics Over Time¶

Use line graphs to track earned run average and strikeouts over time

# plot total strikeouts over time on line graph
total_strikeouts = pitchers.groupby("yearID")["SO"].sum()
so_plot = total_strikeouts.plot.line(title="Total Strikeouts Over Time") 
so_plot.set_ylabel("Total Strikeouts")
so_plot.set_xlabel("Year")

Text(0.5, 0, 'Year')

From the graph we can see that since 1985 total strikeouts have steadily increased with a massive rise around 1995 followed by a relatively small decline a couple of years after

# plot earned run average over time on line graph
mean_era = pitchers.groupby("yearID")["ERA"].mean()
era_plot = mean_era.plot.line(title="Earned Run Average Over Time") 
era_plot.set_ylabel("ERA")
era_plot.set_xlabel("Year")

Text(0.5, 0, 'Year')

From the graph we can see that ERA was increased up until around its peak in 2000, and since then it has steadily decreased, which means that pitching has improved since 2000

Track Batting Statistics Over Time¶

# create new column Batting Avg
avgs = []
for row in batters.iterrows():
    hits = row[1]["H"]
    atbats = row[1]["AB"]
    avg = hits/atbats
    avgs = avgs + [avg]
batters["AVG"] = avgs

# plot batting average over time on line graph
batting_average = batters.groupby("yearID")["AVG"].mean()
ba_plot = batting_average.plot.line(title="Batting Average Over Time") 
ba_plot.set_ylabel("Batting Average")
ba_plot.set_xlabel("Year")

Text(0.5, 0, 'Year')

From the graph we can see that batting average had a major increase in the early 1990's and stayed high for a while until the late 2000's in which it had a precipitous drop and has been very low in the recent past

# plot batting average over time on line graph
home_runs = batters.groupby("yearID")["HR"].mean()
hr_plot = home_runs.plot.line(title="Home Runs Over Time") 
hr_plot.set_ylabel("Home Runs")
hr_plot.set_xlabel("Year")

Text(0.5, 0, 'Year')

From the graph we can see that while home runs were increasing since 1990 they started to decrease starting around 2005 but recently there has been a massive increase in the number of home runs hit per season. The spike in the middle of the graph is most likely a result of the steroid era.

Final Conclusions From Statistics Over Time¶

From analyzing these graphs, we can see that in the past decade there has been a decrease in batting average and earned run average, but an increase in home runs and strikeouts. These trends can be attributed to two main factors. The first factor is that batters are now attempting to hit for power instead of hitting for contact, which means that while there would be an increase in home runs, batting average would decrease and strikeouts would decrease. The second factor is that pitching has been improving, which would lower batting average and earned run average, while also increasing strikeouts.

Comparing Mean Pitcher Salary and Mean Batter Salary¶

Use line graphs to compare average pitcher salary to average total player salary from 1985 to 2016

# get mean pitcher salaries
mean_pitcher_salaries = pitchers.groupby("yearID")["standard_salary"].mean()
# get mean batter salaries
mean_batter_salaries = batters.groupby("yearID")["standard_salary"].mean()
# plot mean salaries
mean_plot = mean_pitcher_salaries.plot.line(legend=True)
mean_batter_salaries.plot.line(ax=mean_plot, legend=True)
mean_plot.set_title("Mean Standard Salaries Over Time")
mean_plot.set_ylabel("Mean Standard Salary")
mean_plot.set_xlabel("Year")
mean_plot.legend(["Pitchers", "Batters"])

<matplotlib.legend.Legend at 0x7f478cd50c50>

From this graph we can see that since 1985 batters have always gotten paid more than pitchers, and as the years have progressed there has not been a very big change in how batters and pitchers have been paid. Interestingly enough, in the 1990's there actually was a time in which the value of batters increased and the value of pitchers decreased, but during the turn of the century this evened back out and now in the present day pitchers and batters are valued the same way that they were in 1985

Compare Starting Pitcher and Relief Pitcher Salary¶

One of the reasons that pitchers may be getting paid less than batters is that relief pitchers are often valued less than other positions, which would lower the overall mean pitcher salary. Therefore, in this part we want to compare salary for starting pitchers and relief pitchers to see if there is a difference

# get mean starting pitcher salaries
mean_starter_salaries = starting_pitchers.groupby("yearID")["standard_salary"].mean()
# get mean relief pitchers salaries
mean_relief_salaries = relief_pitchers.groupby("yearID")["standard_salary"].mean()

# plot mean starting salaries and mean relief salaries over time
mean_pitchers_plot = mean_starter_salaries.plot.line(legend=True)
mean_relief_salaries.plot.line(ax=mean_pitchers_plot, legend=True)
mean_pitchers_plot.set_title("Mean Pitcher Standard Salaries Over Time")
mean_pitchers_plot.set_ylabel("Mean Standard Salary")
mean_pitchers_plot.set_xlabel("Year")
mean_pitchers_plot.legend(["Starting Pitchers", "Relief Pitchers"])

<matplotlib.legend.Legend at 0x7f478cd01358>

This plot shows that while in 1985 relief pitchers and starting pitchers were relatively close in how they were paid, since 1985 the value of starting pitchers has trended upwards, while the value of relief pitchers have trended downward, and it shows that starting pitchers are deemed more valuable than relief pitchers by a lot in the present day.

Compare Mean Batter Salary, Mean Starting Pitcher Salary, and Mean Relief Pitcher Salary¶

Now that we have seen that there is indeed a difference between starting pitchers and relief pitchers, we can compare starting pitchers, relief pitchers and batters to see the complete order in how they are valued

# plot mean starting salaries and mean relief salaries over time
mean_starters_plot = mean_starter_salaries.plot.line(legend=True)
mean_batter_salaries.plot.line(ax=mean_starters_plot, legend=True)
mean_relief_salaries.plot.line(ax=mean_starters_plot, legend=True)
mean_starters_plot.set_title("Mean Standard Salary Over Time")
mean_starters_plot.set_ylabel("Mean Standard Salary")
mean_starters_plot.set_xlabel("Year")
mean_starters_plot.legend(["Starting Pitchers", "Batters", "Relief Pitchers"])

<matplotlib.legend.Legend at 0x7f47b03fd518>

This plot shows that while over the years the value of starting pitchers and the value of batters have traded places, since the early 200's the value of starting pitchers have trended upward signficantly, thus giving us the conclusion that starting pitchers are now valued more than batters. Relief pitchers, on the other hand, are valued much lower than both starting pitchers and batters.

Tracking Starting Pitching Statistics¶

Since starting pitchers have recently been getting paid more in the last five years, we want to see what is valued in a pitcher that causes them to get larger contracts.

# plot starting pitchers strikeouts over time
starting_so = starting_pitchers.groupby("yearID")["SO"].sum()
st_so_plot = starting_so.plot.line(title="Starting Pitcher Strikeouts Over Time") 
st_so_plot.set_ylabel("Strikeouts")
st_so_plot.set_xlabel("Year")

Text(0.5, 0, 'Year')

This graph strongly resembles the graph for all pitchers strikeouts, therefore there are no extra conclusions to be taken from this graph

# plot starting pitchers earned run average over time
starting_era = starting_pitchers.groupby("yearID")["ERA"].mean()
st_era_plot = starting_era.plot.line(title="Starting Pitcher Earned Run Average Over Time") 
st_era_plot.set_ylabel("Earned Run Average")
st_era_plot.set_xlabel("Year")

Text(0.5, 0, 'Year')

As with strikeouts, this graph also resembles the graph for all pitchers earned run average, therefore there is no conclusion that we can draw from this graph

With starting pitchers, their record is an important statistic so we are going to include that in the study as an extra statistic to track

# Make new column called win percentage
winpct = []
for row in starting_pitchers.iterrows():
    wins = row[1]["W"]
    games = row[1]["GS"]
    pct = wins/games
    winpct = winpct + [pct]
starting_pitchers["win percentage"] = winpct

# plot starting pitchers win percentage over time
starting_winpct = starting_pitchers.groupby("yearID")["win percentage"].mean()
st_winpct_plot = starting_winpct.plot.line(title="Starting Pitcher Win Percentage Over Time") 
st_winpct_plot.set_ylabel("Win Percentage")
st_winpct_plot.set_xlabel("Year")

Text(0.5, 0, 'Year')

According to this graph, we can see that starting pitcher win percentage has actually decreased over the years, so that is most likely not the reason that pitchers are being paid more, but that fact is not necessarily conclusive so we will do more investigation

In conclusion for this section, since there is no concrete discovery that we can draw from these graphs, we will have to take a different approach in trying to discover why starting pitchers have been so valuable recently

Identifying Statistics Leaders in the Last Five Years¶

We want to identify the league leaders at each position in the last five years to identify the prime statistic on which pitchers are paid

# get data frame with starting pitchers just from the last five years
starting_pitchers_recent = starting_pitchers[starting_pitchers["yearID"] > 2011]

# take top 100 earned run average pitcher seasons 
era_leaders = starting_pitchers_recent.sort_values(by=["ERA"])[:100]
print ("\nTop 10 Leaders in ERA")
display(era_leaders[:10])
# take the mean salary for these pitchers
mean_era_salary = era_leaders["standard_salary"].mean()

# take top 100 strikeout pitcher seasons 
so_leaders = starting_pitchers_recent.sort_values(by=["SO"], ascending=False)[:100]
print ("\n\nTop 10 Leaders in Strikeouts")
display(so_leaders[:10])
# take the mean salary for these pitchers
mean_so_salary = so_leaders["standard_salary"].mean()

# take top 100 win percentage pitcher seasons 
winpct_leaders = starting_pitchers_recent.sort_values(by=["win percentage"], ascending=False)[:100]
print ("\n\nTop 10 Leaders in Win Percentage")
display(winpct_leaders[:10])
# take the mean salary for these pitchers
mean_winpct_salary = winpct_leaders["standard_salary"].mean()

data = [mean_era_salary, mean_so_salary, mean_winpct_salary]
ind = ["ERA", "Strikeouts", "Win Pct"]
mean_salary = pd.Series(data, index=ind) 
mean_statistic_bar = mean_salary.plot.bar(title="Mean Standard Salary for Statistic Leaders Last Five Years")
mean_statistic_bar.set_ylabel("Standard Salary")

Top 10 Leaders in ERA


Top 10 Leaders in Strikeouts


Top 10 Leaders in Win Percentage

Text(0, 0.5, 'Standard Salary')

From this bar graph, it is clear to see that teams value strikeouts more than earned run average and win percentage, most likely because strikeouts are one of the main measures of a pitchers ability to complete dominate batters

Retrieve and Tidy Team Data¶

I need to retrieve the teams data from Teams.csv because I will need data about their wins and losses from the past five years

# read in teams data from csv file
teams = pd.read_csv("./data/Teams.csv")
# cut out all of the records that are in the year 2011 or before
recent_teams = teams[teams["yearID"] > 2011]

# create new column win percentage
winpcts = []
for row in recent_teams.iterrows():
    wins = row[1]["W"]
    games = row[1]["G"]
    pct = wins/games
    winpcts = winpcts + [pct]
recent_teams["Win Percentage"] = winpcts

# display teams dataframe
display(recent_teams.head())

View Relationship between Starting Pitcher Salary and Win Percentage¶

We want to see the amount of money that each team spends on their starting pitching staff in the past five years

# get mean starting pitcher salary for each team from the past five years
team_starter_salaries = starting_pitchers_recent.groupby(["teamID", "yearID"])["standard_salary"].mean()
mean_team_starter_salaries = team_starter_salaries.groupby("teamID").mean()

# plot salaries on a bar graph
team_salary_bar = mean_team_starter_salaries.plot.bar(title = "Mean Starting Pitcher Standard Salary in the Past Five Years")
team_salary_bar.set_ylabel("Mean Standard Salary")

Text(0, 0.5, 'Mean Standard Salary')

From this bar graph, it is clear than teams such as the San Francisco Giants, the Detroit Tigers, and the Los Angeles Dodgers have all spent a lot of money on starting pitching. However, this information is not useful until we can see this spending as a proportion of their total payroll

# get dataframe with recent relief pitching
recent_relief_pitching = relief_pitchers[relief_pitchers["yearID"] > 2011]

# get dataframe with recent batting
recent_batting = batters[batters["yearID"] > 2011]

# get series with mean relief pitcher salaries
team_reliever_salaries = recent_relief_pitching.groupby(["teamID", "yearID"])["standard_salary"].mean()
mean_team_reliever_salaries = team_reliever_salaries.groupby("teamID").mean()

# get series with mean batting salaries
team_batting_salaries = recent_batting.groupby(["teamID", "yearID"])["standard_salary"].mean()
mean_team_batting_salaries = team_batting_salaries.groupby("teamID").mean()

# get series with mean total salary
mean_total_salaries = mean_team_starter_salaries + mean_team_reliever_salaries + mean_team_batting_salaries

# get series with proportion of payroll spent on starting pitching
proportion_starter_salary = mean_team_starter_salaries/mean_total_salaries

# plot the proportion series
proportion_plot = proportion_starter_salary.plot.bar()
proportion_plot.set_title("Proportion of Payroll Spent on Starting Pitching in Last Five Years")

Text(0.5, 1.0, 'Proportion of Payroll Spent on Starting Pitching in Last Five Years')

With this graph it now gives an entirely new picture and we can see that it is actually the Seattle Mariners and Chicago Cubs who spend the most on their starting pitching. The Cincinnati Reds and the Kansas City Royals spend the least amount of their payroll on starting pitching

Now that we have seen how much each team spends on starting pitching, we also want to see how this correlates with their winning percentage

# get mean winning percentage for each time in the last five years
team_winning_pct = recent_teams.groupby("teamID")["Win Percentage"].mean()
# combine the two series into one dataframe
percentage_df = pd.concat([proportion_starter_salary, team_winning_pct], axis=1)
# rename win percentage column
percentage_df = percentage_df.rename(columns={'Win Percentage': 'win percentage', 'standard_salary':'starting pitcher salary proportion'})
# make bar plot with winning percentage and salary
percentage_df.plot.bar(title = "Winning Percentage Compared to Proportion of Payroll Spent on Starters",figsize=(8,6))

<matplotlib.axes._subplots.AxesSubplot at 0x7f478b0847b8>

This bar graph does not necessarily give us anything conclusive about the relationship between winning percentage and proportion of payroll spent on starting pitchers, so instead we can try to make a scatter plot to see the correlation

Finding Correlation between Proportion of Money Spent on each Position and Win Percentage¶

plt.figure()
plt.scatter(percentage_df["win percentage"], percentage_df["starting pitcher salary proportion"])
# add team label to each point
for row in percentage_df.iterrows():
    x = row[1]["win percentage"]
    y = row[1]["starting pitcher salary proportion"]
    val = row[0]
    plt.annotate(val, (x,y))
# add labels and title for the plot
plt.ylabel("starting pitcher salary proportion")
plt.xlabel("win percentage")
plt.title("Winning Percentage Compared to Proportion of Payroll Spent on Starters from 2012 to 2016")

Text(0.5, 1.0, 'Winning Percentage Compared to Proportion of Payroll Spent on Starters from 2012 to 2016')

From the scatter plot, we can now see that there is positive correlation between the proportion of payroll spent on starting pitchers and the winning percentage of a team. We now want to compare this scatter plot to the proportion of payroll spent on relief pitching and the proportion of payroll spent on batters

# get proportion of payroll spent on batting
proportion_batter_salary = mean_team_batting_salaries/mean_total_salaries
# get proportion of payroll spent on relievers
proportion_reliever_salary = mean_team_reliever_salaries/mean_total_salaries

# add two new series to the proportions dataframe
percentage_df["relief pitcher salary proportion"] = proportion_reliever_salary
percentage_df["batter salary proportion"] = proportion_batter_salary

# make a scatter plot with reliever proportions
plt.figure()
plt.scatter(percentage_df["win percentage"], percentage_df["relief pitcher salary proportion"])
# add team label to each point
for row in percentage_df.iterrows():
    x = row[1]["win percentage"]
    y = row[1]["relief pitcher salary proportion"]
    val = row[0]
    plt.annotate(val, (x,y))
# add labels and title for the plot
plt.ylabel("relief pitcher salary proportion")
plt.xlabel("win percentage")
plt.title("Winning Percentage Compared to Proportion of Payroll Spent on Relievers from 2012 to 2016")

# make a scatter plot with batter proportions
plt.figure()
plt.scatter(percentage_df["win percentage"], percentage_df["batter salary proportion"])
# add team label to each point
for row in percentage_df.iterrows():
    x = row[1]["win percentage"]
    y = row[1]["batter salary proportion"]
    val = row[0]
    plt.annotate(val, (x,y))
# add labels and title for the plot
plt.ylabel("batter salary proportion")
plt.xlabel("win percentage")
plt.title("Winning Percentage Compared to Proportion of Payroll Spent on Batters from 2012 t0 2016")

Text(0.5, 1.0, 'Winning Percentage Compared to Proportion of Payroll Spent on Batters from 2012 t0 2016')

From the relief pitcher and batter graphs, we can see that there is also positive correlation between paying batters and winning percentage. However, we can also see that there is negative correlation between paying relief pitchers and winning percentage, so this suggests that teams should be paying starting pitchers and batters, but should not be paying relief pitchers. A couple of the more interesting points are that the Cincinnati Reds spend a lot of money on relief pitching and not as much on starting pitching and they are not as successful. Meanwhile, two of the most successful teams over the past five years, the Los Angeles Dodgers and the Washington Nationals, spend a large proportion of their payroll on starting pitching and batting, and barely any money on relief pitching. Based on the scatter plot, it is difficult to see which position has more correlation between starting pitching and batting, so we want to get the actual correlation value.

# Get correlation between starting pitcher salary proportion and winning percentage
starter_corr = percentage_df["win percentage"].corr(percentage_df["starting pitcher salary proportion"])

# Get correlation between relief pitcher salary proportion and winning percentage
reliever_corr = percentage_df["win percentage"].corr(percentage_df["relief pitcher salary proportion"])

# Get correlation between batter salary proportion and winning percentage
batter_corr = percentage_df["win percentage"].corr(percentage_df["batter salary proportion"])

# print output statements with the correlations
print("The Correlation between Win Percentage and Starting Pitcher Salary Proportion is " + str(starter_corr))
print("The Correlation between Win Percentage and Relief Pitcher Salary Proportion is " + str(reliever_corr))
print("The Correlation between Win Percentage and Batter Salary Proportion is " + str(batter_corr))

The Correlation between Win Percentage and Starting Pitcher Salary Proportion is 0.29184971077334454
The Correlation between Win Percentage and Relief Pitcher Salary Proportion is -0.32338425998851844
The Correlation between Win Percentage and Batter Salary Proportion is 0.2617699537454118

These results tell us that there is a higher correlation between starting pitching and winning percentage and win percentage than between batting and winning percentage. The relief pitching and winning percentage correlation is negative.

Relief Pitcher Conclusion: At this point we can draw a final conclusion on relief pitchers. Teams do not pay relief pitchers very much of their payroll, which is the right move since paying relief pitchers a lot of money is not a good use of resources due to the negative correlation between proportion of money spent on relief pitchers and winning percentage. Starting pitchers and batters are paid more because they have more of an impact on team success

Due to the fact that we have reached a conclusion on relief pitching, we will now proceed with just comparing starting pitching and batting. While relief pitching might still be included in graphs and tables, I will not spend as much time on that analysis

Should Starting Pitchers or Batters Get Paid More Money?¶

In this next stage we are going to compare starting pitchers and batters in the past five years and see how the correlation between paying each position and winning percentage has developed recently

team_winning_pct = recent_teams.groupby(["yearID", "teamID"])["Win Percentage"].mean()
team_2016 = team_winning_pct[2016]

starting_pitching_2016 = starting_pitchers[starting_pitchers["yearID"] == 2016]
batters_2016 = batters[batters["yearID"] == 2016]
relief_pitching_2016 = relief_pitchers[relief_pitchers["yearID"] == 2016]

# get series with mean starting pitcher salaries from 2016
mean_team_starters_2016 = starting_pitching_2016.groupby("teamID")["standard_salary"].mean()

# get series with mean batter salaries from 2016
mean_team_batters_2016 = batters_2016.groupby("teamID")["standard_salary"].mean()

# get series with mean batter salaries from 2016
mean_team_relievers_2016 = relief_pitching_2016.groupby("teamID")["standard_salary"].mean()

# get series with mean batter plus starting pitcher salaries
mean_team_batters_starters_2016 = mean_team_starters_2016 + mean_team_batters_2016 + mean_team_relievers_2016

# get series with proportion of payroll spent on starting pitching
prop_starters = mean_team_starters_2016/mean_team_batters_starters_2016

# get series with proportion of payroll spent on starting pitching
prop_batters = mean_team_batters_2016/mean_team_batters_starters_2016

prop_relievers = mean_team_relievers_2016/mean_team_batters_starters_2016

# create dataframe with proportions
prop_df = pd.concat([prop_batters, team_2016], axis=1)
prop_df = prop_df.rename(columns={'Win Percentage': 'win percentage 2016', 'standard_salary':'batter 2016'})
prop_df =  pd.concat([prop_df, prop_starters], axis=1)
prop_df = prop_df.rename(columns={'standard_salary':'starter 2016'})

prop_df =  pd.concat([prop_df, prop_relievers], axis=1)
prop_df = prop_df.rename(columns={'standard_salary':'reliever 2016'})

# do the same for all years 2012 to 2015
for year in range(2012, 2016):
    team_year = team_winning_pct[year]
    starting_pitching_year = starting_pitchers[starting_pitchers["yearID"] == year]
    batters_year = batters[batters["yearID"] == year]
    relief_pitching_year = relief_pitchers[relief_pitchers["yearID"] == year]
    
    # get series with mean starting pitcher salaries from each year
    mean_team_starters_year = starting_pitching_year.groupby("teamID")["standard_salary"].mean()

    # get series with mean batter salaries from each year
    mean_team_batters_year = batters_year.groupby("teamID")["standard_salary"].mean()
    
    # get series with mean batter salaries from each year
    mean_team_batters_year = batters_year.groupby("teamID")["standard_salary"].mean()
    
    mean_team_relievers_year = relief_pitching_year.groupby("teamID")["standard_salary"].mean()

    # get series with mean batter plus starting pitcher salaries
    mean_team_batters_starters_year = mean_team_starters_year + mean_team_batters_year + mean_team_relievers_year

    # get series with proportion of payroll spent on starting pitching
    prop_starters = mean_team_starters_year/mean_team_batters_starters_year

    # get series with proportion of payroll spent on starting pitching
    prop_batters = mean_team_batters_year/mean_team_batters_starters_year
    
    prop_relievers = mean_team_relievers_year/mean_team_batters_starters_year
    
    # add to proportions dataframe
    prop_df = pd.concat([prop_df, prop_batters, team_year], axis=1)
    win_col = "win percentage " + str(year)
    batter_col = "batter " + str(year)
    starter_col = "starter " + str(year)
    reliever_col = "reliever " + str(year)
    prop_df = prop_df.rename(columns={'Win Percentage': win_col, 'standard_salary':batter_col})
    prop_df =  pd.concat([prop_df, prop_starters], axis=1)
    prop_df = prop_df.rename(columns={'standard_salary':starter_col})
    prop_df =  pd.concat([prop_df, prop_relievers], axis=1)
    prop_df = prop_df.rename(columns={'standard_salary':reliever_col})
display(prop_df.head())

# get correlations between win percentage, and starter and batter salary proportion
batter_corrs = []
starter_corrs = []
reliever_corrs = []
ind = []
for year in range(2012,2017):
    win_year = "win percentage " + str(year)
    batter_year = "batter " + str(year)
    starter_year = "starter " + str(year)
    reliever_year = "reliever " + str(year)
    batter_corr = prop_df[win_year].corr(prop_df[batter_year])
    starter_corr = prop_df[win_year].corr(prop_df[starter_year])
    reliever_corr = prop_df[win_year].corr(prop_df[reliever_year])
    batter_corrs = batter_corrs + [batter_corr]
    starter_corrs = starter_corrs + [starter_corr]
    reliever_corrs = reliever_corrs + [reliever_corr]
    ind = ind + [year]

# make a series for each set of correlations indexed by year
batter_correlations = pd.Series(batter_corrs, index=ind)
starter_correlations = pd.Series(starter_corrs, index=ind)
reliever_correlations = pd.Series(reliever_corrs, index=ind)

# plot the correlations on a line graph
correlations_plot = batter_correlations.plot.line(legend=True)
starter_correlations.plot.line(ax=correlations_plot, legend=True)
reliever_correlations.plot.line(ax=correlations_plot, legend=True)
correlations_plot.set_title("Salary Proportion and Winning Percentage Correlations From 2012 to 2016")
correlations_plot.set_ylabel("Correlation")
correlations_plot.set_xlabel("Year")
correlations_plot.legend(["Batters", "Starting Pitchers", "Relief Pitchers"])

<matplotlib.legend.Legend at 0x7f478ab28198>

This graph shows the correlation between proportion of salary spent on batters and winning percentage over time and correlation between proportion of salary spent on starting pitchers and winning percentage over time. From this graph, it is clear to see that over the past five years, the correlation between paying starting pitchers and winning has trended up, and the correlation between paying batters and winning increased for a couple years but is now trending down. The correlation between relief pitchers and winning is also on this graph, but as we could have guessed from out earlier conclusion, it is trending way down. We do not have any salary data for any year after 2017, but if this was the year 2017 it is clear that it is best to pay starting pitchers more money because recently spending money on starting pitching has paid off.

Build K Nearest Neighbors Predictive Model¶

We want to have a model that can predict win percentage based on the proportion of payroll a team uses at each position

# create new dataframe with salarie proportions and winning percentage
winning_percentages = []
relief_pitcher_sal = []
starting_pitcher_sal = []
batter_sal = []
for row in prop_df.iterrows():
    for year in range(2012,2017):
        win_year = "win percentage " + str(year)
        batter_year = "batter " + str(year)
        starter_year = "starter " + str(year)
        reliever_year = "reliever " + str(year)
        winning_percentages = winning_percentages + [row[1][win_year]]
        batter_sal = batter_sal + [row[1][batter_year]]
        starting_pitcher_sal = starting_pitcher_sal + [row[1][starter_year]]
        relief_pitcher_sal = relief_pitcher_sal + [row[1][reliever_year]]
data = {"relief_pitcher_proportion":relief_pitcher_sal, "starting_pitcher_proportion":starting_pitcher_sal, 
        "batter_proportion": batter_sal, "winning_percentage":winning_percentages}
salary_proportions = pd.DataFrame(data)
display(salary_proportions.head())

# Make model using newly created dataframe
salary_proportions

# set features and training data
features = ["relief_pitcher_proportion", "starting_pitcher_proportion", "batter_proportion"]
Xtrain = pd.get_dummies(salary_proportions[features])
Ytrain = salary_proportions["winning_percentage"]

# Create series with the given data
Xnew = pd.Series(index=Xtrain.columns)
Xnew["starting_pitcher_proportion"] = 0.5
Xnew["batter_proportion"] = 0.5
Xnew["relief_pitcher_proportion"] = 0


# Standardize the variables.
Xtrain_mean = Xtrain.mean()
Xtrain_std = Xtrain.std()

Xtrain_sc = (Xtrain - Xtrain_mean) / Xtrain_std
Xnew_sc = (Xnew - Xtrain_mean) / Xtrain_std

# Find index of 30 nearest neighbors.
dists = np.sqrt(((Xtrain_sc - Xnew_sc) ** 2).sum(axis=1))
i_nearest = dists.sort_values()[:30].index

# Average the labels of these 30 nearest neighbors
Ytrain.loc[i_nearest].mean()

0.5062929734427318

According to this model, if you spend the mean salary for they year on relief pitching, and then split the rest of your salary between starting pitching and batting, you will win approximately half of your games.

Conclusion¶

My original hypothesis was that starting pitchers are paid the most, followed by batters and then relief pitchers. I was correct in that recently starting pitchers have been getting paid the most, followed closely by batters, with relief pitchers way below them in terms of average salary. I also believed that teams value earned run average over other statistics, but I was wrong and it turns out that actually strikeouts are the most valued statistic by teams when they pay their pitchers. I also went on to discover that paying relief pitchers too much money can really hurt a teams chances of winning, and it is best to pay starting pitchers because that will bring you the most wins. Starting pitchers and batters are both paid fairly, neither are overpaid or underpaid. Therefore, starting pitchers are the most valued position, but batters are still very valuable, so you should try to pay those two positions a good part of your payroll, while trying to minimize the amount of money spent on relief pitching. This is supported by the Washington Nationals, who just won the World Series even though their relief pitchers were ranked at the bottom of the league.

	playerID	yearID	stint	teamID	lgID	W	L	G	GS	CG	...	BK	BFP	GF	R	SH	SF	GIDP	nameFirst	nameLast	standard_salary
24986	ackerji01	1985	1	TOR	AL	7	2	61	0	0	...	0	370.0	26	35	1.0	2.0	16.0	Jim	Acker	-0.892167
24987	agostju01	1985	1	CHA	AL	4	3	54	0	0	...	0	246.0	21	27	3.0	3.0	4.0	Juan	Agosto	-0.957703
24989	alexado01	1985	1	TOR	AL	17	10	36	36	6	...	0	1090.0	0	105	6.0	3.0	15.0	Doyle	Alexander	1.161306
24990	allenne01	1985	1	SLN	NL	1	4	23	1	0	...	1	135.0	13	22	1.0	3.0	3.0	Neil	Allen	0.797215
24991	allenne01	1985	2	NYA	AL	1	0	17	0	0	...	0	124.0	10	9	0.0	0.0	3.0	Neil	Allen	0.797215

	playerID	yearID	stint	teamID	lgID	G	AB	R	H	2B	...	BB	SO	IBB	HBP	SH	SF	GIDP	nameFirst	nameLast	standard_salary
2	aguaylu01	1985	1	PHI	NL	91	165	27	46	7	...	22	26.0	5.0	6.0	4.0	3.0	7.0	Luis	Aguayo	-0.697014
6	almonbi01	1985	1	PIT	NL	88	244	33	66	17	...	22	61.0	0.0	1.0	4.0	3.0	6.0	Bill	Almon	-0.644585
9	armasto01	1985	1	BOS	AL	103	385	50	102	17	...	18	90.0	4.0	2.0	0.0	5.0	14.0	Tony	Armas	1.277815
10	ashbyal01	1985	1	HOU	NL	65	189	20	53	8	...	24	27.0	2.0	1.0	1.0	1.0	9.0	Alan	Ashby	-0.173693
13	backmwa01	1985	1	NYN	NL	145	520	77	142	24	...	36	72.0	1.0	1.0	14.0	3.0	3.0	Wally	Backman	-0.804785

	playerID	yearID	stint	teamID	lgID	W	L	G	GS	CG	...	BFP	GF	R	SH	SF	GIDP	nameFirst	nameLast	standard_salary
24989	alexado01	1985	1	TOR	AL	17	10	36	36	6	...	1090.0	0	105	6.0	3.0	15.0	Doyle	Alexander	1.161306
24993	andujjo01	1985	1	SLN	NL	21	12	38	38	10	...	1127.0	0	113	11.0	4.0	32.0	Joaquin	Andujar	1.612779
24999	bannifl01	1985	1	CHA	AL	10	14	34	34	4	...	928.0	0	121	9.0	8.0	20.0	Floyd	Bannister	0.975619
25000	barkele01	1985	1	ATL	NL	2	9	20	18	0	...	335.0	1	55	4.0	1.0	4.0	Len	Barker	1.146742
25004	beattji01	1985	1	SEA	AL	5	6	18	15	1	...	335.0	1	61	0.0	5.0	5.0	Jim	Beattie	-0.018349

	playerID	yearID	stint	teamID	lgID	W	L	G	GS	...	BK	BFP	GF	R	SH	SF	GIDP	nameFirst	nameLast	standard_salary
24986	ackerji01	1985	1	TOR	AL	7	2	61	0	...	0	370.0	26	35	1.0	2.0	16.0	Jim	Acker	-0.892167
24987	agostju01	1985	1	CHA	AL	4	3	54	0	...	0	246.0	21	27	3.0	3.0	4.0	Juan	Agosto	-0.957703
24990	allenne01	1985	1	SLN	NL	1	4	23	1	...	1	135.0	13	22	1.0	3.0	3.0	Neil	Allen	0.797215
24991	allenne01	1985	2	NYA	AL	1	0	17	0	...	0	124.0	10	9	0.0	0.0	3.0	Neil	Allen	0.797215
24992	anderla02	1985	1	PHI	NL	3	3	57	0	...	1	318.0	19	41	3.0	1.0	4.0	Larry	Andersen	-0.657692

	playerID	yearID	stint	teamID	lgID	W	L	G	GS	CG	...	BFP	GF	R	SH	SF	GIDP	nameFirst	nameLast	standard_salary	win percentage
41666	medlekr01	2012	1	ATL	NL	10	1	50	12	2	...	520.0	7	26	1.0	0.0	13.0	Kris	Medlen	-0.630196	0.833333
43717	greinza01	2015	1	LAN	NL	19	3	32	32	1	...	843.0	0	43	6.0	2.0	15.0	Zack	Greinke	3.759182	0.593750
41738	paulife01	2012	1	KCA	AL	3	1	7	7	0	...	156.0	0	8	1.0	1.0	1.0	Felipe	Paulino	-0.330853	0.428571
44618	kershcl01	2016	1	LAN	NL	12	4	21	21	3	...	544.0	0	31	4.0	1.0	9.0	Clayton	Kershaw	4.848037	0.571429
43318	smylydr01	2014	2	TBA	AL	3	1	7	7	1	...	173.0	0	9	1.0	0.0	3.0	Drew	Smyly	-0.671235	0.428571
42032	buchhcl01	2013	1	BOS	AL	12	1	16	16	1	...	416.0	0	23	1.0	2.0	14.0	Clay	Buchholz	0.408289	0.750000
43033	kershcl01	2014	1	LAN	NL	21	3	27	27	6	...	749.0	0	42	6.0	1.0	15.0	Clayton	Kershaw	0.003793	0.777778
43453	arrieja01	2015	1	CHN	NL	22	6	33	33	4	...	870.0	0	52	4.0	1.0	15.0	Jake	Arrieta	-0.121913	0.666667
44568	hillri01	2016	2	LAN	NL	3	2	6	6	0	...	128.0	0	7	1.0	1.0	0.0	Rich	Hill	0.271793	0.500000
42301	kershcl01	2013	1	LAN	NL	16	9	33	33	3	...	908.0	0	55	8.0	3.0	20.0	Clayton	Kershaw	1.617047	0.484848

	playerID	yearID	stint	teamID	lgID	W	L	G	GS	CG	...	BFP	R	SH	SF	GIDP	nameFirst	nameLast	standard_salary	win percentage
43821	kershcl01	2015	1	LAN	NL	16	7	33	33	4	...	890.0	62	4.0	0.0	8.0	Clayton	Kershaw	5.134183	0.484848
44916	scherma01	2016	1	WAS	NL	20	7	34	34	1	...	902.0	77	7.0	3.0	6.0	Max	Scherzer	3.007854	0.588235
42109	darviyu01	2013	1	TEX	AL	13	9	32	32	0	...	841.0	68	0.0	5.0	17.0	Yu	Darvish	1.163763	0.406250
44108	scherma01	2015	1	WAS	NL	14	12	33	33	4	...	899.0	74	11.0	2.0	5.0	Max	Scherzer	2.332058	0.424242
44097	salech01	2015	1	CHA	AL	13	11	31	31	1	...	854.0	88	2.0	3.0	13.0	Chris	Sale	0.308512	0.419355
43040	klubeco01	2014	1	CLE	AL	18	9	34	34	3	...	951.0	72	5.0	2.0	16.0	Corey	Kluber	-0.672399	0.529412
45006	verlaju01	2016	1	DET	AL	16	9	34	34	2	...	903.0	81	4.0	7.0	8.0	Justin	Verlander	4.000584	0.470588
44460	fernajo02	2016	1	MIA	NL	16	8	29	29	0	...	737.0	63	7.0	5.0	14.0	Jose	Fernandez	-0.270576	0.551724
43450	archech01	2015	1	TBA	AL	12	13	34	34	1	...	868.0	85	2.0	2.0	10.0	Chris	Archer	-0.569289	0.352941
43297	scherma01	2014	1	DET	AL	18	5	33	33	1	...	904.0	80	4.0	8.0	15.0	Max	Scherzer	2.239339	0.545455

	playerID	yearID	stint	teamID	lgID	W	L	G	GS	CG	...	BFP	GF	R	SH	SF	GIDP	nameFirst	nameLast	standard_salary	win percentage
43124	mejiaje01	2014	1	NYN	NL	6	6	63	7	0	...	417.0	49	41	2.0	0.0	10.0	Jenrry	Mejia	-0.673238	0.857143
44760	nicasju01	2016	1	PIT	NL	10	7	52	12	0	...	513.0	9	64	5.0	7.0	8.0	Juan	Nicasio	-0.236678	0.833333
41666	medlekr01	2012	1	ATL	NL	10	1	50	12	2	...	520.0	7	26	1.0	0.0	13.0	Kris	Medlen	-0.630196	0.833333
44017	priceda01	2015	2	TOR	AL	9	1	11	11	0	...	296.0	0	20	0.0	3.0	4.0	David	Price	2.805707	0.818182
43305	shoemma01	2014	1	LAA	AL	16	4	27	20	0	...	543.0	5	49	3.0	5.0	9.0	Matt	Shoemaker	-0.675018	0.800000
43033	kershcl01	2014	1	LAN	NL	21	3	27	27	6	...	749.0	0	42	6.0	1.0	15.0	Clayton	Kershaw	0.003793	0.777778
43915	medlekr01	2015	1	KCA	AL	6	2	15	8	0	...	243.0	2	30	0.0	1.0	4.0	Kris	Medlen	-0.417944	0.750000
42032	buchhcl01	2013	1	BOS	AL	12	1	16	16	1	...	416.0	0	23	1.0	2.0	14.0	Clay	Buchholz	0.408289	0.750000
44324	bundydy01	2016	1	BAL	AL	10	6	36	14	0	...	474.0	6	52	1.0	1.0	6.0	Dylan	Bundy	-0.440067	0.714286
44165	tomlijo01	2015	1	CLE	AL	7	2	10	10	2	...	251.0	0	22	0.0	0.0	3.0	Josh	Tomlin	-0.508751	0.700000

	yearID	lgID	teamID	franchID	divID	Rank	G	Ghome	W	L	...	FP	name	park	attendance	BPF	PPF	teamIDBR	teamIDlahman45	teamIDretro	Win Percentage
2685	2012	NL	ARI	ARI	W	3	162	81.0	81	81	...	0.985	Arizona Diamondbacks	Chase Field	2177617.0	105	106	ARI	ARI	ARI	0.500000
2686	2012	NL	ATL	ATL	E	2	162	81.0	94	68	...	0.986	Atlanta Braves	Turner Field	2420171.0	102	101	ATL	ATL	ATL	0.580247
2687	2012	AL	BAL	BAL	E	2	162	81.0	93	69	...	0.983	Baltimore Orioles	Oriole Park at Camden Yards	2102240.0	102	103	BAL	BAL	BAL	0.574074
2688	2012	AL	BOS	BOS	E	5	162	81.0	69	93	...	0.983	Boston Red Sox	Fenway Park II	3043003.0	106	106	BOS	BOS	BOS	0.425926
2689	2012	AL	CHA	CHW	C	2	162	81.0	85	77	...	0.988	Chicago White Sox	U.S. Cellular Field	1965955.0	106	106	CHW	CHA	CHA	0.524691

	batter 2016	win percentage 2016	starter 2016	reliever 2016	batter 2012	win percentage 2012	starter 2012	reliever 2012	batter 2013	win percentage 2013	starter 2013	reliever 2013	batter 2014	win percentage 2014	starter 2014	reliever 2014	batter 2015	win percentage 2015	starter 2015	reliever 2015
teamID
ARI	6.281737	0.425926	-10.539830	5.258093	-0.193050	0.500000	0.564318	0.628732	0.142761	0.500000	0.431176	0.426063	-0.941504	0.395062	0.173635	1.767869	0.213295	0.487654	0.418064	0.368642
ATL	-0.197719	0.422360	0.602862	0.594857	-0.658569	0.580247	0.393583	1.264986	-1.187657	0.592593	-0.093550	2.281207	-1.373686	0.487654	0.429272	1.944414	-1.116344	0.413580	2.624507	-0.508164
BAL	1.043369	0.549383	0.579936	-0.623305	-0.033096	0.574074	0.552530	0.480566	-2.335152	0.524691	0.436545	2.898606	-1.816925	0.592593	0.305406	2.511519	0.031172	0.500000	-3.507200	4.476027
BOS	0.320255	0.574074	0.845885	-0.166140	0.989502	0.425926	0.536711	-0.526213	0.312833	0.598765	0.797508	-0.110340	0.493430	0.438272	0.783125	-0.276555	1.069860	0.481481	0.622152	-0.692013
CHA	0.081412	0.481481	0.966871	-0.048284	0.582296	0.524691	0.453693	-0.035989	0.645576	0.388889	0.729358	-0.374934	-0.598662	0.450617	2.858416	-1.259753	0.025301	0.469136	1.146613	-0.171915