### Hey, Baseball Fans: Winning Takes Money

Investing and professional sports have a lot in common--competition, winners and losers, uncertain outcomes, lots of data, and a wide range of opinions among participants, spectators and analysts. During a conversation the other day with a friend, I casually mentioned what I thought to be an accepted truism in the sport--that, just as money is a vitally important determinant in the business world, Major League Baseball teams with higher payroll (hence, better players by presumption) ought to win more often than teams with lower payroll.

To my surprise, my friend, who is a baseball fanatic, retorted that money and winning are not as intimately linked as one might presume, and proceeded to recite from his encyclopedic memory a number of examples of World Series play over the past 10 years--the Arizona Diamondbacks over the New York Yankees in 2001, the Los Angeles Angels over the San Francisco Giants in 2002, and the Florida Marlins over the Yankees in 2003--all cases in which teams with significantly *lower* payroll took the championship from their more generously compensated opponents. All right, I had to admit, I take "strike one" against my follow-the-money presumption.

After getting off the phone, I did a quick web search to check further. The first study I came across stated that "results from the two years of data [2002 and 2003] indicate that there is no real correlation between a team's salary and its win percentage." In other words, higher salaries do not significantly boost win percentage. Hmm--strike two, I mused. . . .

Wanting to avoid striking out, I resolved to find the data and run numbers myself.

**Team Payroll and Win Percentage Data**

The USA Today Salaries Database gives MLB payroll figures for all 30 pro baseball teams in both the American and National leagues going back to 1988. The ESPN MLB standings database shows seasonal win percentages from 2002. Combining the data for the seven years from 2002 to 2008, we can generate the scatter plot shown below.

A least-squares analysis of team payroll versus win percentage gives the "best fit" regression line:

Win Percentage = 0.426 + (Team Payroll in $ Millions) x 0.00097,

indicating that approximately **each one million dollars of team payroll adds about 1 point out of 1,000 (i.e., 0.001) to the win percentage**. The t-statistic for the regression is 6.96, which means that we can state this relationship between payroll and win percentage with an extremely high degree of confidence (in fact, the likelihood of a false positive is less than one in ten billion!).

It is also instructive to look at the data on a team-by-team basis for the same seven-year period from 2002 to 2008. Notice how the New York Yankees and the Boston Red Sox have not only the first and second highest average team payrolls ($181 million and $122 million) but also the first and second highest average win percentages (0.600 and 0.580), respectively. At the other extreme, the three teams with the lowest average win percentages--Kansas City Royals at 0.410, Tampa Bay Rays at 0.423, and Pittsburgh Pirates at 0.431--are among the five Major League teams with the lowest average team payroll (each less than $50 million).
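As a rough illustration of what the fitted line implies, the sketch below plugs the two highest average payrolls quoted above into the regression equation and converts a payroll gap into wins per season. The function names and the 162-game regular season used for the conversion are assumptions added for illustration; they are not part of the original analysis.

```python
# Coefficients of the fitted line quoted in the article (payroll in $ millions).
INTERCEPT = 0.426
SLOPE_PER_MILLION = 0.00097
GAMES_PER_SEASON = 162  # assumption: standard MLB regular-season length


def predicted_win_pct(payroll_millions):
    """Win percentage implied by the regression line for a given payroll."""
    return INTERCEPT + SLOPE_PER_MILLION * payroll_millions


def extra_wins(extra_payroll_millions):
    """Additional wins per season implied by a given payroll increase."""
    return SLOPE_PER_MILLION * extra_payroll_millions * GAMES_PER_SEASON


# Average 2002-2008 payrolls and win percentages as quoted in the article.
teams = [("New York Yankees", 181, 0.600), ("Boston Red Sox", 122, 0.580)]
for name, payroll, actual in teams:
    print(f"{name}: predicted {predicted_win_pct(payroll):.3f}, actual {actual:.3f}")

print(f"Extra wins implied by $10 million more payroll: {extra_wins(10):.1f}")
```

On these numbers the line lands almost exactly on the Yankees' average and somewhat below the Red Sox's, a reminder that individual teams scatter around the fit.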

I also provide a table showing the payroll of baseball teams playing in the World Series over the past 20 years (actually from 1988 through 2008, with the exception of 1994 when, as baseball fans will recall, the Series was cancelled due to a player strike), assisted by data from Baseball Almanac. The results reveal that in 14 out of the 20 years, or 70% of the time, the team with the higher team payroll defeated the team with the lower payroll in the World Series. This result is consistent with the strong relationship between team payroll and win percentage shown in the graphs above.
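To put the 14-of-20 figure in perspective, here is a minimal sketch that asks how often such a lopsided count would arise if payroll had no bearing on the Series, that is, if each year's outcome were a fair coin flip. The coin-flip null model and the variable names are assumptions introduced for illustration; the original analysis does not include this test.

```python
from math import comb

years = 20                 # World Series played, 1988-2008 excluding 1994
higher_payroll_wins = 14   # Series won by the higher-payroll team

# One-sided binomial tail: chance of at least 14 "heads" in 20 fair coin flips.
favorable = sum(comb(years, k) for k in range(higher_payroll_wins, years + 1))
p_at_least = favorable / 2 ** years

print(f"Share won by higher-payroll team: {higher_payroll_wins / years:.0%}")
print(f"Probability of 14 or more by chance alone: {p_at_least:.3f}")
```

Under that assumption the probability comes out a little under 6%, so the World Series tally on its own is suggestive rather than conclusive; it is the regular-season regression, with far more data points, that carries the statistical weight.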

What I conclude is that money *does* matter in professional baseball. Teams that have higher payroll generally *do* win more games, both during the regular season and during the World Series. Suffice it to say: the correlation between performance and pay is surely at least as high in baseball (and, in all likelihood, in other professional sports as well) as it is in the business world. On a related though distinct topic, I would conjecture that, based on the relationship between payroll and win percentages, it is much easier to predict outcomes in Major League Baseball than in the stock market and other financial markets.

**A Note on Statistical Analysis**

In case anyone is wondering why my conclusion differs so radically from the study I mentioned as being my "strike two," I provide an explanation here. Warning: Only those interested in statistical analysis should continue reading, since the discussion becomes somewhat technical. However, I encourage anyone who at least occasionally spends time looking for patterns in data to read on, since an important lesson in applying the right tools to the job at hand will arise from the detail.

The author of the study I cited chose to analyze that data using a multiple regression, in an effort to determine how each of three variables--starting pitchers' salaries (P), fielders' salaries (F) and closing pitcher's salary (C)--affects a baseball team's win percentage. For example, for 2003, the study produced the following regression result,

Win Percentage = 0.406 + 0.0022 x P + 0.0015 x F + 0.0018 x C,

along with corresponding t-statistics of 1.72, 1.46 and 0.41 for the significance of the regression coefficients corresponding to independent variables P, F and C, respectively. With all t-statistics less than 2.00, the study was unable to discern at the standard minimum of 95% confidence any dependence of win percentage on the three payroll variables.

Interestingly enough, when I perform the analysis using the same 2003 data, but formulating the problem as three separate one-variable single regressions (instead of one comprehensive three-variable multiple regression as employed in the study), I arrive at t-statistics of 2.93 for dependence of win percentage on starting pitchers' salaries, 2.77 for dependence on fielders' salaries, and 1.49 for dependence on closing pitcher's salary--all higher than the t-statistics for the multiple regression given above. Further, if I combine starting pitchers', fielders' and closing pitcher's salaries into a single variable (i.e., P+F+C) and again run a one-variable regression, I find an even higher t-statistic, namely, 3.49.

In other words, by "zooming out" and viewing the data using an effectively lower resolution microscope, we actually find a more robust statistical pattern--this is reminiscent of the proverbial necessity of stepping back from the individual trees in order to view the grander forest. But, you might be wondering, how can this be? How is it possible in a regression to see a pattern at a lower resolution that essentially disappears at a higher resolution?

To understand the mechanism behind this paradoxical statistical behavior, consider a very simple regression example. Suppose we are trying to understand the relationship between a dependent variable, z, and two independent variables, x and y, based on five data points:

Data point 1: x = 1, y = 1 and z = 1

Data point 2: x = 2, y = 2 and z = 2

Data point 3: x = 3, y = 3 and z = 3

Data point 4: x = 4, y = 5 and z = 4

Data point 5: x = 5, y = 4 and z = 4.

Graphically, three plots are relevant:

a) Multiple Regression: Three-dimensional plot of x and y versus z,

b) Single Regression: Two-dimensional plot of x versus z (same as y versus z), and

c) Single Regression: Two-dimensional plot of combined variable, x+y, versus z.

In the multiple regression, the t-statistics are 3.3 for each of x and y. Observe the "dispersion" of data points 4 and 5 in the three-dimensional plot, with each of these points offset in a different direction from the straight line that can be drawn through data points 1, 2 and 3. Because x and y otherwise move almost in lockstep (their correlation is 0.9), the regression has little independent information with which to separate the effect of x from that of y, so each coefficient is estimated with a relatively large standard error and a correspondingly low t-statistic.

In the single regression of x versus z (or, symmetrically, y versus z), four of the five data points are collinear, and only the fifth data point introduces error into the otherwise perfect linear fit. This tighter fit of the data to a straight line yields a t-statistic of 6.9, higher than in the multiple regression case.

Still better yet, if we regress on the *combined* variable, x+y, we end up with a t-statistic of 17.9, substantially higher than in either of the other cases. By combining x and y into a single variable, we eliminate the oppositely directed "dispersive meandering" of x and y. The combined variable allows the regression analysis to reveal a closer correspondence between the independent variable (x+y) and the dependent variable (z).

**Back to Baseball . . . and a Lesson**

In an analogous way, the baseball statistics study relying on multiple regression produces a poorer picture of the relationships between variables than does the single regression. Behind the scenes is probably a mechanism akin to the following: Owners and managers of a given baseball team work within budget constraints during any particular season, so that the total amount of money available to pay all players on the team may be viewed effectively as a fixed quantity for that year. If more money is spent paying starting pitchers, then less money is available to hire and pay fielders and closers. Just as, in the simple example above, x is less than y at data point 4 but y is less than x at data point 5, one baseball team may decide to spend less of its budget on *starting pitchers* than fielders, while another team may decide to flip the allocation the other way around, with less of its budget going to fielders than starting pitchers.

When the salaries of all pitchers and fielders are combined, a more meaningful variable results against which to regress the win percentages. For this reason, the single regression using the combined salaries produces a higher t-statistic and better fit to the linear regression model.
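For readers who want to check the five-point example for themselves, the following is a minimal sketch that fits all three regressions and reports the slope t-statistics. It uses only the Python standard library; the helper `ols_t_stats` is a hypothetical name written for this illustration and solves the normal equations directly, which is adequate for a tiny example but is not how a statistics package would do it.

```python
import math


def ols_t_stats(columns, z):
    """Least squares with an intercept; returns (coefficients, t-statistics)."""
    n = len(z)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]  # design matrix
    k = len(rows[0])
    # Normal equations (X'X) b = X'z, with X'X inverted by Gauss-Jordan elimination.
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xtz = [sum(r[a] * zi for r, zi in zip(rows, z)) for a in range(k)]
    aug = [xtx[a] + [1.0 if a == b else 0.0 for b in range(k)] for a in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda i: abs(aug[i][c]))  # partial pivoting
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(k):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    inv = [row[k:] for row in aug]
    beta = [sum(inv[a][b] * xtz[b] for b in range(k)) for a in range(k)]
    resid = [zi - sum(b * v for b, v in zip(beta, r)) for r, zi in zip(rows, z)]
    s2 = sum(e * e for e in resid) / (n - k)  # residual variance
    t = [beta[a] / math.sqrt(s2 * inv[a][a]) for a in range(k)]
    return beta, t


# The five data points from the example above.
x = [1, 2, 3, 4, 5]
y = [1, 2, 3, 5, 4]
z = [1, 2, 3, 4, 4]

_, t_multi = ols_t_stats([x, y], z)                              # z on x and y
_, t_x = ols_t_stats([x], z)                                     # z on x alone
_, t_sum = ols_t_stats([[a + b for a, b in zip(x, y)]], z)       # z on x+y

# Index 0 is the intercept; the article quotes roughly 3.3, 6.9 and 17.9.
print("multiple regression, t for x and y:", round(t_multi[1], 2), round(t_multi[2], 2))
print("single regression on x, t:", round(t_x[1], 2))
print("single regression on x+y, t:", round(t_sum[1], 2))
```

In practice a library routine would replace the hand-rolled solver, but spelling it out shows that the t-statistic depends on both the residual variance and the corresponding diagonal entry of the inverse of X'X, which grows when the independent variables are strongly correlated.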

The basic lesson here is that, when analyzing problems, it helps always to look for simpler relationships, explanations and solutions first, before implementing more sophisticated analytical tools. In working with scientific, financial, economic, sports or any other type of data, we are often warned against fabricating false patterns (artifacts of the analysis) by overfitting data to a model. In a similar vein, our discussion shows how it is also sometimes possible to overlook robust patterns by forcing an overly complicated model onto an intrinsically simpler set of data.

## 33 Comments:

Given the graph you show of a linear regression relating team win percentage to team payroll, it seems likely that a relatively simple nonlinear function would fit the data better. It's also interesting to note the considerable variation about the regression line.

Talent generally does follow the money. Another factor to consider is the endorsement benefit that a player gains by playing in a high profile media city. Large cities have an advantage in earnings potential and endorsement potential which allows more discretionary monies to be available to attract top players. Big city, more money, better players equals on average more wins. Fans intuitively know this. There are exceptions such as the Cubs . It is not always about the money. A team with a history of winning will be able to attract better players and thus perpetuate the trend of winning.

what about the curse? all the money in the world won't help the Cubs.

Hi Mr. Sakazaki,

I have been regularly reading your Investment commentary. It is both very informative and presented in a creative manner. The website I represent, http://www.finpipe.com regularly seeks contributors for its pages to provide viewers with up to date investment information. Your financial knowledge would be a great asset to our website. Your article contribution would feature any personal links you desired. If you are interested or for more information, please email me at finpipe@gmail.com. I look forward to hearing from you.

While I have trouble with all those graphs it is true that on occasion a lower payroll team does slip into the upper tier the teams that spend money certainly do have a greater chance of winning, and usually do win. Just because the Yankees do not win the World Series does not change that fact.

lol el moco!! I'm not really a baseball fan, but I decided to check out your blog cuz it seemed interesting. Thanks for the info! Maybe I'll renew my like for America's favorite past time!

I'm curious as to why you didn't include any R2 values. You cite lots of t-values, but all those tell you is how likely it is that your independent variable has an effect. However, it doesn't do anything to tell you how much variation is being explained by your model. I'd take a model with lower t-values and higher R2 anytime.

I agree that investing and sports have a lot in common. There are winners and losers as well. But aside from that, both of these also employ strategies in order to win. As risks are inevitable in investing, there are various ways that you can minimize and therefore, increasing your chances of gaining. As the article "A Look at Diversification" stated, one of such strategies is by diversification. With it, you are identifying investments that may perform differently under various market conditions.
