### Hey, Baseball Fans: Winning Takes Money

Investing and professional sports have a lot in common--competition, winners and losers, uncertain outcomes, lots of data, and a wide range of opinions among participants, spectators and analysts. During a conversation the other day with a friend, I casually mentioned what I thought to be an accepted truism in the sport--that, just as money is a vitally important determinant in the business world, Major League Baseball teams with higher payroll (hence, better players by presumption) ought to win more often than teams with lower payroll.

To my surprise, my friend, who is a baseball fanatic, retorted that money and winning are not as intimately linked as one might presume, and proceeded to recite from his encyclopedic memory a number of examples of World Series play over the past 10 years--the Arizona Diamondbacks over the New York Yankees in 2001, the Los Angeles Angels over the San Francisco Giants in 2002, and the Florida Marlins over the Yankees in 2003--all cases in which teams with significantly *lower* payroll took the championship from their more generously compensated opponents. All right, I had to admit, I take "strike one" against my follow-the-money presumption.

After getting off the phone, I did a quick web search to check further. The first study I came across stated that "results from the two years of data [2002 and 2003] indicate that there is no real correlation between a team's salary and its win percentage." In other words, higher salaries do not significantly boost win percentage. Hmm--strike two, I mused. . . .

Wanting to avoid striking out, I resolved to find the data and run numbers myself.

**Team Payroll and Win Percentage Data**

The USA Today Salaries Database gives MLB payroll figures for all 30 pro baseball teams in both the American and National leagues going back to 1988. The ESPN MLB standings database shows seasonal win percentages from 2002. Combining the data for the seven years from 2002 to 2008, we can generate the scatter plot shown below.

A least-squares analysis of team payroll versus win percentage gives the "best fit" regression line:

Win Percentage = 0.426 + (Team Payroll in $ Millions) x 0.00097,

indicating that approximately **each one million dollars of team payroll adds about 1 point out of 1,000 (i.e., 0.001) to the win percentage**. The t-statistic for the regression is 6.96, which means that we can state this relationship between payroll and win percentage with an extremely high degree of confidence (in fact, the likelihood of a false positive is less than one in ten billion!).

It is also instructive to look at the data on a team-by-team basis for the same seven-year period from 2002 to 2008. Notice how the New York Yankees and the Boston Red Sox have not only the first and second highest average team payrolls ($181 million and $122 million) but also the first and second highest average win percentages (0.600 and 0.580), respectively. At the other extreme, the three teams with the lowest average win percentages--Kansas City Royals at 0.410, Tampa Bay Rays at 0.423, and Pittsburgh Pirates at 0.431--are among the five Major League teams with the lowest average team payroll (each less than $50 million).
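As a rough check on the regression line quoted earlier, the two average payrolls mentioned here can be plugged into it. The sketch below is only an illustration: the intercept and slope are the rounded values from the article, and the function name `predicted_win_pct` is a hypothetical helper, not part of the original analysis.

```python
# Rounded coefficients of the best-fit line quoted in the article.
INTERCEPT = 0.426
SLOPE_PER_MILLION = 0.00097


def predicted_win_pct(payroll_millions):
    """Win percentage implied by the fitted line for a payroll in $ millions."""
    return INTERCEPT + SLOPE_PER_MILLION * payroll_millions


# Average 2002-2008 payrolls quoted in the article, in $ millions.
yankees_pred = predicted_win_pct(181)
red_sox_pred = predicted_win_pct(122)

print(f"Yankees: predicted {yankees_pred:.3f}, quoted average 0.600")
print(f"Red Sox: predicted {red_sox_pred:.3f}, quoted average 0.580")
```

The Yankees land almost exactly on the line (about 0.602 against 0.600), while the Red Sox sit above it (about 0.544 predicted against 0.580), a reminder that the regression describes an average tendency across teams and not a guarantee for any one of them.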

I also provide a table showing the payroll of baseball teams playing in the World Series over the past 20 years (actually from 1988 through 2008, with the exception of 1994 when, as baseball fans will recall, the Series was cancelled due to a player strike), assisted by data from Baseball Almanac. The results reveal that in 14 out of the 20 years, or 70% of the time, the team with the higher team payroll defeated the team with the lower payroll in the World Series. This result is consistent with the strong relationship between team payroll and win percentage shown in the graphs above.
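One way to gauge how much weight the 14-out-of-20 tally can bear by itself is to ask how often a fair coin would do at least as well. The sketch below assumes, purely for illustration, that each World Series is an independent 50/50 event under the "payroll does not matter" hypothesis; `binomial_tail` is a hypothetical helper name.

```python
from math import comb


def binomial_tail(successes, trials, p=0.5):
    """Probability of at least `successes` wins in `trials` independent tries."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# 14 of the 20 World Series were won by the higher-payroll team.
p_value = binomial_tail(14, 20)
print(f"P(at least 14 of 20 by chance) = {p_value:.3f}")
```

Under these assumptions the probability comes out near 0.058, so the World Series record taken alone is suggestive but falls just short of the usual 95% threshold. That fits the wording above: with only 20 observations, the tally is best read as consistent with the much larger regular-season sample, not as independent proof.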

What I conclude is that money *does* matter in professional baseball. Teams that have higher payroll generally *do* win more games, both during the regular season and during the World Series. Suffice it to say: the correlation between performance and pay is surely at least as high in baseball (and, in all likelihood, in other professional sports as well) as it is in the business world. On a related though distinct topic, I would conjecture that, based on the relationship between payroll and win percentages, it is undoubtedly much easier to predict outcomes in Major League Baseball than in the stock market and other financial markets.

**A Note on Statistical Analysis**

In case anyone is wondering why my conclusion differs so radically from the study I mentioned as being my "strike two," I provide an explanation here. Warning: Only those interested in statistical analysis should continue reading, since the discussion becomes somewhat technical. However, I encourage anyone who at least occasionally spends time looking for patterns in data to read on, since an important lesson in applying the right tools to the job at hand will arise from the detail.

The author of the study I cited chose to analyze that data using a multiple regression, in an effort to determine how each of three variables--starting pitchers' salaries (P), fielders' salaries (F) and closing pitcher's salary (C)--affects a baseball team's win percentage. For example, for 2003, the study produced the following regression result,

Win Percentage = 0.406 + 0.0022 x P + 0.0015 x F + 0.0018 x C,

along with corresponding t-statistics of 1.72, 1.46 and 0.41 for the significance of the regression coefficients corresponding to independent variables P, F and C, respectively. With all t-statistics less than 2.00, the study was unable to discern at the standard minimum of 95% confidence any dependence of win percentage on the three payroll variables.

Interestingly enough, when I perform the analysis using the same 2003 data, but formulating the problem as three separate one-variable single regressions (instead of one comprehensive three-variable multiple regression as employed in the study), I arrive at t-statistics of 2.93 for dependence of win percentage on starting pitchers' salaries, 2.77 for dependence on fielders' salaries, and 1.49 for dependence on closing pitcher's salary--all higher than the t-statistics for the multiple regression given above. Further, if I combine starting pitchers', fielders' and closing pitcher's salaries into a single variable (i.e., P+F+C) and again run a one-variable regression, I find an even higher t-statistic, namely, 3.49.

In other words, by "zooming out" and viewing the data using an effectively lower resolution microscope, we actually find a more robust statistical pattern--this is reminiscent of the proverbial necessity of stepping back from the individual trees in order to view the grander forest. But, you might be wondering, how can this be? How is it possible in a regression to see a pattern at a lower resolution that essentially disappears at a higher resolution?

To understand the mechanism behind this paradoxical statistical behavior, consider a very simple regression example. Suppose we are trying to understand the relationship between a dependent variable, z, and two independent variables, x and y, based on five data points:

Data point 1: x = 1, y = 1 and z = 1

Data point 2: x = 2, y = 2 and z = 2

Data point 3: x = 3, y = 3 and z = 3

Data point 4: x = 4, y = 5 and z = 4

Data point 5: x = 5, y = 4 and z = 4.

Graphically, three plots are relevant:

a) Multiple Regression: Three-dimensional plot of x and y versus z,

b) Single Regression: Two-dimensional plot of x versus z (same as y versus z), and

c) Single Regression: Two-dimensional plot of combined variable, x+y, versus z.
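The three regressions behind these plots can be run directly on the five data points. The sketch below uses plain NumPy ordinary least squares; the helper `ols_t_stats` is a hypothetical name introduced here for illustration, and each t-statistic is computed as the coefficient divided by its standard error, with an intercept included in every fit.

```python
import numpy as np


def ols_t_stats(predictors, z):
    """Fit z on an intercept plus the given predictors by ordinary least
    squares and return the t-statistics of the slopes (intercept excluded)."""
    X = np.column_stack([np.ones(len(z))] + list(predictors))
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)
    residuals = z - X @ beta
    dof = len(z) - X.shape[1]               # observations minus fitted parameters
    sigma2 = residuals @ residuals / dof    # residual variance estimate
    std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return (beta / std_err)[1:]


# The five data points from the example.
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([1, 2, 3, 5, 4], dtype=float)
z = np.array([1, 2, 3, 4, 4], dtype=float)

t_multiple = ols_t_stats([x, y], z)      # (a) z on x and y together
t_single_x = ols_t_stats([x], z)         # (b) z on x alone
t_single_y = ols_t_stats([y], z)         # (b) z on y alone
t_combined = ols_t_stats([x + y], z)     # (c) z on the combined variable x+y

print("multiple (x, y):", np.round(t_multiple, 2))
print("single x:", np.round(t_single_x, 2), " single y:", np.round(t_single_y, 2))
print("combined x+y:", np.round(t_combined, 2))
```

The same slope is being measured more and more precisely as the model gets simpler: roughly 3.3 for each variable in the two-variable fit, 6.9 for either variable alone, and 17.9 for the combined variable.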

In the multiple regression, the t-statistics are 3.3 for each of x and y. Observe the "dispersion" of data points 4 and 5 in the three-dimensional plot, with each of these points offset in a different direction from the straight line that can be drawn through data points 1, 2 and 3. This dispersion adds extra error to the regression, creating a relatively poor regression fit to the data.

In the single regression of x versus z (or, symmetrically, y versus z), four of the five data points are collinear, and only the fifth data point introduces error into the otherwise perfect linear fit. This tighter fit of the data to a straight line yields a t-statistic of 6.9, higher than in the multiple regression case.

Still better yet, if we regress on the *combined* variable, x+y, we end up with a t-statistic of 17.9, substantially higher than in either of the other cases. By combining x and y into a single variable, we eliminate the oppositely directed "dispersive meandering" of x and y. The combined variable allows the regression analysis to reveal a closer correspondence between the independent variable (x+y) and the dependent variable (z).

**Back to Baseball . . . and a Lesson**

In an analogous way, the baseball statistics study relying on multiple regression produces a poorer picture of the relationships between variables than does the single regression. Behind the scenes is probably a mechanism akin to the following: Owners and managers of a given baseball team work within budget constraints during any particular season, so that the total amount of money available to pay all players on the team may be viewed effectively as a fixed quantity for that year. If more money is spent paying starting pitchers, then less money is available to hire and pay fielders and closers. Similar to how in the simple example above, x is less than y at data point 4, but y is less than x at data point 5, a particular baseball team may decide to spend less of its budget on *starting pitchers* than fielders, while another team may decide to flip the allocation the other way around, with less of its budget going to fielders than starting pitchers.

When the salaries of all the pitchers and fielders are combined, a more meaningful variable results against which to regress the win percentages. For this reason, the single regression using the combined salaries produces a higher t-statistic and better fit to the linear regression model.

The basic lesson here is that, when analyzing problems, it always helps to look for simpler relationships, explanations and solutions first, before implementing more sophisticated analytical tools. In working with scientific, financial, economic, sports or any other type of data, we are often warned against fabricating false patterns (artifacts of the analysis) by overfitting data to a model. In a similar vein, our discussion shows how it is also sometimes possible to overlook robust patterns by forcing an overly complicated model onto an intrinsically simpler set of data.