hello! on april 11th 2017, @jon_bois uploaded a video called "What if Barry Bonds had played without a baseball bat? | Chart Party." It's a great video, but it has a stats issue -- one i've decided to address in this thread.
firstly, if you haven't watched the video, you should watch it before reading further.
second, i call this a "stats issue," but it's really not a flaw in the video. it's really not. it is an excellent story told with admirable rigor.
HOWEVER, i am me. so let's continue.
the video concerns the single-season On Base Percentage record. OBP is, roughly, how often you get on base. (roughly, because, baseball.) Here are the three hundred best single-season OBP numbers an MLB player has reached since 1920. In bold: Barry Bonds.
Here's just the top of that chart. In red is the single-season OBP record: 2004 Barry Bonds, .6094. Below it in blue is the second-best: 2002 Barry Bonds, .5817. And there, in orange, is Jon's simulated BBw/oaB 2004: .6078.
The "point" of Jon's video is that, according to his simulation, 2004 Barry Bonds without a Bat would still beat out 2002 Barry Bonds for the all-time single-season OBP record.
But at the end of the video, Jon wonders if his simulation has a flaw. And it does.
Before we continue
-- again, you should really watch the video to (re)familiarize yourself with the constraints that Jon's (and my) simulations use
-- baseballreference's OBP data is given as a four-digit decimal, which i use as well, but multiplied by 1000 for readability.
ANYWAY: as many commenters on the video have noted, Jon uses a monte carlo simulation of BB's 2004, where pitches are resimulated depending on whether contact was involved. The problem: Jon only did one simulation.
I first learned about monte carlo simulations as a way to throw hot dogs at a circle to approximate pi. They're real cool. But you have to do them a lot of times. Like, at least hundreds. Preferably thousands.
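For anyone who hasn't seen one, here's a minimal sketch of the idea. It's the points-in-a-square version rather than actual hot dogs, and the function name, seed, and throw counts are just illustrative choices: throw random points at a unit square, count how many land inside the quarter circle, and multiply that fraction by 4.

```python
import random

def estimate_pi(n_throws, rng):
    """Throw n_throws random points at the unit square and count
    how many land inside the quarter circle of radius 1."""
    inside = 0
    for _ in range(n_throws):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # The quarter circle covers pi/4 of the square, so scale by 4.
    return 4.0 * inside / n_throws

rng = random.Random(2004)          # fixed seed so the run is repeatable
few = estimate_pi(10, rng)         # a handful of throws: can be way off
many = estimate_pi(200_000, rng)   # lots of throws: settles near pi
print(few, many)
```

With only ten throws the estimate can land almost anywhere; with a couple hundred thousand it sits close to 3.14. Same method, very different reliability.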
Just one simulation is a data point, but in the monte carlo method, you take the *average* of all your data points, and *that's* your result. Which raises some questions:
what if Jon's result was an outlier?
what if 2004 BBw/oaB wasn't as great as we thought?
what if 2004 BBw/oaB wasn't (excluding 2004 BBw/aB) the greatest single-season OBP of all time?
i set out to answer these questions.
so i did my duty: downloaded all of 2004 baseball as a 380k line text file; lead-fistedly found all the lines with barry bonds in them; parsed those lines into at-bats; separated them into walks, strikeouts, and balls in play; wrote some "code" to simulate pitches; and then, sim.
(you will later have an opportunity to fully understand how absolutely cork-brained it is to do this without a scripting language, which i did, out of bull-hearted weakness or weak-hearted bullishness, i can't tell. really going for the object-bodypart metaphors this morning)
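For the curious, here's roughly what that loop could look like in a scripting language. This is a sketch under loud assumptions, not my spreadsheet and not Jon's exact rules: the pitch codes (`B`, `C`, `S`), the `p_strike` value, the helper names, and the toy data are all hypothetical. The only thing taken from the video's premise is that he never swings, so any pitch he swung at in real life gets re-rolled as a ball or a called strike.

```python
import random

def simulate_pa(pitches, p_strike, rng):
    """Replay one plate appearance with no swings; True means a walk.

    pitches:  recorded pitch codes (hypothetical encoding):
              'B' = ball, 'C' = called strike,
              'S' = a pitch he swung at in real life (re-rolled here).
    p_strike: ASSUMED chance that a re-rolled pitch is a called strike.
    """
    balls = strikes = 0
    i = 0
    while True:
        # Past the end of the recorded sequence, keep re-rolling pitches.
        code = pitches[i] if i < len(pitches) else "S"
        i += 1
        if code == "S":
            code = "C" if rng.random() < p_strike else "B"
        if code == "B":
            balls += 1
            if balls == 4:
                return True    # walk: on base
        else:
            strikes += 1
            if strikes == 3:
                return False   # called strikeout

def simulate_season(plate_appearances, p_strike, rng):
    """OBP for one simulated season, multiplied by 1000."""
    on_base = sum(simulate_pa(pa, p_strike, rng) for pa in plate_appearances)
    return 1000.0 * on_base / len(plate_appearances)

# Made-up toy data, NOT Bonds's real pitch sequences.
toy_season = ["BBBB", "BCSB", "SSC", "BBSCS", "CBBSB"] * 20
rng = random.Random(25)
runs = [simulate_season(toy_season, 0.4, rng) for _ in range(500)]
mean_obp = sum(runs) / len(runs)   # the Monte Carlo answer is the average
print(round(mean_obp, 1))
```

The part that matters is the last few lines: one call to `simulate_season` is a single data point, and the result is the mean over many of them.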
The result? Jon's result was 1.5 standard deviations above the mean: a higher-than-normal result.
But it didn't matter.
2004 Barry Bonds without a Bat posted a simulated OBP of 586.6 -- less than 5 points ahead of 2002 Barry Bonds -- to take home the all-time OBP crown.
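A quick bit of arithmetic on the numbers quoted in this thread, as a sanity check. The variable names are mine, and the implied standard deviation is only approximate, since it's backed out of the rounded "1.5 standard deviations" figure rather than taken from the raw results.

```python
# Numbers quoted in the thread, as OBP x 1000.
bonds_2004 = 609.4       # the real 2004 record
bonds_2002 = 581.7       # the real 2002 season, second best
jon_single_run = 607.8   # Jon's one simulated season
sim_mean = 586.6         # result of the repeated simulations
z_jon = 1.5              # Jon's run: 1.5 standard deviations above the mean

margin = sim_mean - bonds_2002
implied_sd = (jon_single_run - sim_mean) / z_jon

print(f"margin over 2002: {margin:.1f} points")            # 4.9
print(f"implied standard deviation: {implied_sd:.1f} points")  # 14.1
```

If that implied spread is in the right neighborhood, a single simulated season wobbles by more than the 4.9-point margin, which is why one run can't settle the question and the average of many runs can.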
Here's some more details on those results, including the final all-time chart, a fit-curve of a normal distribution that looks okay, a results timeline, and a little table with more info. It worked!
Also, i forgot the most important part of the result: confidence interval 99.9%.
Anyway, that's it. If you want to see my work, the spreadsheet's here: https://docs.google.com/spreadsheets/d/16pNgA_dVmC9Y1Pf5LE8BdodR3qphMHbjs4sHjwquObQ/edit. It includes my horrible methodology and a funny story i found in the data, so check it out.
I'd recommend following @secretbase for more video content.
That's all, thanks!