That day has come, courtesy of Baseball Prospectus. Once the home of cutting-edge statistical analysis, they now misinterpret basic statistics to draw inaccurate conclusions.
Here are the article's major points:
- For pitchers with a large discrepancy between FIP and ERA in the first half of the season, the correlation coefficient (r-value) between first-half ERA and second-half ERA is .33, whereas the r-value between first-half FIP and second-half ERA is .35. Thus, ERA is "equally as likely" to indicate performance going forward.
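It's true that there is little difference between .33 and .35. However, two similar r-values alone mean nothing here. A correlation coefficient only tells you whether two sets of numbers rise and fall together, not whether one is actually close to the other. Let's look at a hypothetical group of pitchers: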
| 1H ERA | 1H FIP | 2H ERA |
The correlation between first-half ERA and second-half ERA is a perfect 1, whereas the correlation between first-half FIP and second-half ERA is a completely imperfect -1: a lower FIP actually indicates a higher ERA going forward!
Does this mean ERA is a better predictor than FIP? Of course not. Anyone can look at the above numbers and see that 2H ERA matches up very well with 1H FIP and not at all with 1H ERA. Yes, this example was contrived, but the same effect is at work with the real numbers. The lesson: Don't believe everything an r-value tells you.
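To make the effect concrete, here is a minimal sketch in Python. The three pitchers and all of their numbers are invented for this sketch (they are not the values from the table above), and `pearson_r` and `mean_abs_error` are just illustrative helper names. It computes the r-value for each first-half stat against second-half ERA, then the average size of the miss if you had simply used the first-half number as your guess.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def mean_abs_error(guesses, actuals):
    """Average absolute gap between a guess and what actually happened."""
    return sum(abs(g - a) for g, a in zip(guesses, actuals)) / len(actuals)

# Hypothetical pitchers, made up for illustration only.
era_1h = [2.00, 2.50, 3.00]   # first-half ERA
fip_1h = [4.60, 4.50, 4.40]   # first-half FIP
era_2h = [4.40, 4.50, 4.60]   # second-half ERA

r_era = pearson_r(era_1h, era_2h)   # moves in lockstep with 2H ERA
r_fip = pearson_r(fip_1h, era_2h)   # moves exactly opposite to 2H ERA

mae_era = mean_abs_error(era_1h, era_2h)   # how far off 1H ERA was
mae_fip = mean_abs_error(fip_1h, era_2h)   # how far off 1H FIP was

print(f"r(1H ERA, 2H ERA) = {r_era:+.2f}, average miss = {mae_era:.2f} runs")
print(f"r(1H FIP, 2H ERA) = {r_fip:+.2f}, average miss = {mae_fip:.2f} runs")
```

With these made-up numbers, first-half ERA correlates at +1 and first-half FIP at -1, yet ERA misses second-half ERA by 2.00 runs on average while FIP misses by about 0.13. The r-value and the accuracy of the guess are answering two different questions.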
- Pitchers with a discrepancy between their 1H FIP and ERA, as a group, had a 3.34 1H ERA, a 4.64 1H FIP, and a 4.60 2H ERA. This compares to a control group with a 4.40 1H ERA, 4.34 1H FIP, and 4.35 2H ERA.
Now, you might think that this means FIP is way, way better than ERA at predicting future performance. But wait...
- The 2H ERA sample has a higher standard deviation (1.42) than the 1H ERA (0.83) and the 1H/2H FIPs. This explains everything!
It explains nothing. I can't believe I have to point this out, but as the average ERA of a group increases, the standard deviation of ERAs within the group tends to increase with it.
Seidman reminds us that this is the "SAME group" of pitchers. So let's do it his way and make two groups of the SAME pitchers: Group A is every starter's ten best starts from 2008, and Group B is every starter's ten worst starts. Naturally there is going to be a huge discrepancy in group ERA; it might be something like 1.50 for Group A and 8.00 for Group B.
What about the standard deviations for the groups? Should we expect them to be equal, since these are the SAME pitchers? Of course not. Group A is going to contain a lot of ERAs between 1.00 and 2.00, while Group B will be spread more thinly between 6.00 and 11.00.
Similarly, we simply cannot expect a group with a 4.60 ERA to have the same standard deviation as a group with a 3.34 ERA, even if it is the SAME guys. (Okay, I'll stop with the caps now.)
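The Group A / Group B thought experiment is easy to try on simulated data. What follows is a rough sketch, not real 2008 game logs. It assumes every starter is identical, makes 30 starts of exactly 6 innings, and allows earned runs that follow a Poisson distribution averaging 3 per start (a 4.50 ERA). All of those choices, along with the names `poisson` and `era`, are assumptions made for the sketch.

```python
import math
import random
from statistics import mean, pstdev

INNINGS_PER_START = 6    # assumed: every start lasts exactly 6 innings
RUNS_PER_START = 3.0     # assumed: mean earned runs per start (4.50 ERA)
STARTS = 30              # assumed: starts per pitcher
KEEP = 10                # ten best and ten worst starts, as in the text
PITCHERS = 300           # assumed: size of the simulated league

def poisson(rng, lam):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def era(runs_per_start):
    """ERA over a list of starts, each INNINGS_PER_START innings long."""
    innings = INNINGS_PER_START * len(runs_per_start)
    return 9 * sum(runs_per_start) / innings

rng = random.Random(2008)  # fixed seed so the run is repeatable
group_a, group_b = [], []
for _ in range(PITCHERS):
    runs = sorted(poisson(rng, RUNS_PER_START) for _ in range(STARTS))
    group_a.append(era(runs[:KEEP]))    # this pitcher's ten best starts
    group_b.append(era(runs[-KEEP:]))   # this pitcher's ten worst starts

mean_a, sd_a = mean(group_a), pstdev(group_a)
mean_b, sd_b = mean(group_b), pstdev(group_b)
print(f"Group A: mean ERA {mean_a:.2f}, standard deviation {sd_a:.2f}")
print(f"Group B: mean ERA {mean_b:.2f}, standard deviation {sd_b:.2f}")
```

Under these assumptions, Group B should come out with both the higher ERA and the wider spread, even though both groups are built from the very same simulated pitchers. Runs allowed can't go below zero but have no ceiling, so the bad starts have more room to scatter.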
What about the 2H ERA having a higher standard deviation than either FIP sample? ERA naturally has a higher standard deviation than FIP, because FIP has much of ERA's variance stripped from it. The reason 1H ERA has a standard deviation similar to the FIP samples is that the average 1H ERA is much lower than the average of either FIP sample, which shrinks its standard deviation as we saw above.
Mason Malmuth once wrote that the real handicap of a bad poker book is that the reader cannot distinguish between good advice and bad, and as a result will develop bad habits without knowing it. If BP doesn't screen its content better than this, it's going to suffer from the same problem.