hello! on april 11th 2017, @jon_bois uploaded a video called "What if Barry Bonds had played without a baseball bat? | Chart Party." It's a great video, but it has a stats issue -- one i've decided to address in this thread.
firstly, if you haven't watched the video, you should watch it before reading further.

second, i call this a "stats issue," but it's really not a flaw in the video. it's really not. it is an excellent story told with admirable rigor.

HOWEVER, i am me. so let's continue.
the video concerns the single-season On Base Percentage record. OBP is, roughly, how often you get on base. (roughly, because, baseball.) Here are the three hundred best single-season OBP numbers an MLB player has reached since 1920. In bold: Barry Bonds.
Here's just the top of that chart. In red is the single-season OBP record: 2004 Barry Bonds, .6094. Below it in blue is the second-best: 2002 Barry Bonds, .5817. And there, in orange, is Jon's simulated BBw/oaB 2004: .6078.
The "point" of Jon's video is that, according to his simulation, 2004 Barry Bonds without a Bat would still beat out 2002 Barry Bonds for the all-time single-season OBP record.

But at the end of the video, Jon wonders if his simulation has a flaw. And it does.
Before we continue:
-- again, you should really watch the video to (re)familiarize yourself with the constraints that Jon's (and my) simulations use
-- baseballreference's OBP data is given as a four-digit decimal, which i use as well, but multiplied by 1000 for readability.
ANYWAY: as many commenters on the video have noted, Jon uses a monte carlo simulation of BB's 2004, where pitches are resimulated depending on whether contact was involved. The problem: Jon only did one simulation.
I first learned about monte carlo simulations as a way to throw hot dogs at a circle to approximate pi. They're real cool. But you have to do them a lot of times. Like, at least hundreds. Preferably thousands.
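To make the "do it a lot of times" point concrete, here is a minimal Python sketch of the pi trick. It swaps the hot dogs for random points thrown at a unit square and counts how many land inside the quarter circle; the function name `estimate_pi` and the seed are just illustrative choices, not anything from the video.

```python
import math
import random


def estimate_pi(num_throws, seed=None):
    """Estimate pi by throwing random points at the unit square.

    The fraction of points that land inside the quarter circle of
    radius 1 approximates pi / 4.
    """
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_throws):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / num_throws


if __name__ == "__main__":
    # Few throws give a rough guess; many throws settle toward pi.
    for n in (10, 1_000, 100_000):
        estimate = estimate_pi(n, seed=2004)
        print(n, estimate, abs(estimate - math.pi))
```

With only a handful of throws the estimate can land far from pi; the error tends to shrink as the number of throws grows, which is why a single run is not much to go on.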
Just one simulation is a data point, but in the monte carlo method, you take the *average* of all your data points, and *that's* your result. Which raises some questions:
what if Jon's result was an outlier?

what if 2004 BBw/oaB wasn't as great as we thought?

what if 2004 BBw/oaB wasn't (excluding 2004 BBw/aB) the greatest single-season OBP of all time?

i set out to answer these questions.
so i did my duty: downloaded all of 2004 baseball as a 380k line text file; lead-fistedly found all the lines with barry bonds in them; parsed those lines into at-bats; separated them into walks, strikeouts, and balls in play; wrote some "code" to simulate pitches; and then, sim.
(you will later have an opportunity to fully understand how absolutely cork-brained it is to do this without a scripting language, which i did, out of bull-hearted weakness or weak-hearted bullishness, i can't tell. really going for the object-bodypart metaphors this morning)
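For anyone who would rather use a scripting language, here is a rough Python sketch of the general shape of a "sim it many times and average" run. It is not how the numbers in this thread were produced, and it ignores the real pitch-by-pitch data and Jon's rules entirely: `STRIKE_PROB` and `PLATE_APPEARANCES` are made-up placeholders, every pitch is an independent coin flip, and OBP is simplified to walks per plate appearance.

```python
import random
import statistics

# HYPOTHETICAL placeholders, not numbers from the video or this thread.
STRIKE_PROB = 0.40        # chance a taken pitch is called a strike
PLATE_APPEARANCES = 600   # plate appearances per simulated season


def batless_plate_appearance(rng, strike_prob=STRIKE_PROB):
    """Simulate one plate appearance where the batter never swings.

    Returns True for a walk (4 balls), False for a strikeout (3 strikes).
    """
    balls = strikes = 0
    while balls < 4 and strikes < 3:
        if rng.random() < strike_prob:
            strikes += 1
        else:
            balls += 1
    return balls == 4


def simulate_season(rng, plate_appearances=PLATE_APPEARANCES,
                    strike_prob=STRIKE_PROB):
    """Return one simulated season's OBP, multiplied by 1000."""
    walks = sum(
        batless_plate_appearance(rng, strike_prob)
        for _ in range(plate_appearances)
    )
    return 1000.0 * walks / plate_appearances


def monte_carlo(num_seasons, seed=None):
    """Run many seasons; return (mean, standard deviation) of OBP."""
    rng = random.Random(seed)
    results = [simulate_season(rng) for _ in range(num_seasons)]
    return statistics.mean(results), statistics.stdev(results)


if __name__ == "__main__":
    mean_obp, std_obp = monte_carlo(num_seasons=1_000, seed=2004)
    print(f"mean OBP: {mean_obp:.1f}, standard deviation: {std_obp:.1f}")
```

Any one call to `simulate_season` is a single data point, like Jon's one run; `monte_carlo` is the part that turns a pile of those into an average and a spread.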
The result? Jon's result was 1.5 standard deviations above the mean: a higher-than-normal result.

But it didn't matter.

2004 Barry Bonds without a Bat finished with an OBP of 586.6 -- less than 5 points ahead of 2002 Barry Bonds -- to take home the all-time OBP crown.
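As a back-of-the-envelope check on those figures, the short Python snippet below redoes the arithmetic using only numbers quoted in this thread. The "implied" standard deviation is an inference from the rounded 1.5 figure, so treat it as approximate rather than as a reported result.

```python
# All figures are quoted in this thread (OBP multiplied by 1000).
JON_SINGLE_RUN = 607.8   # Jon's one simulated season
SIM_MEAN = 586.6         # mean of the repeated simulations
BONDS_2002 = 581.7       # 2002 Barry Bonds
Z_SCORE = 1.5            # "1.5 standard deviations above the mean"

# How far the simulated mean sits above 2002 Barry Bonds.
margin_over_2002 = SIM_MEAN - BONDS_2002

# ASSUMPTION: the 1.5 is a rounded value, so this is only a rough figure.
implied_std_dev = (JON_SINGLE_RUN - SIM_MEAN) / Z_SCORE

print(f"margin over 2002 Bonds: {margin_over_2002:.1f} points")
print(f"implied standard deviation: roughly {implied_std_dev:.1f} points")
```

So Jon's single run overshot the mean by about 21 points, but the mean itself still clears 2002 Barry Bonds by just under 5.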
Here are some more details on those results, including the final all-time chart, a fit-curve of a normal distribution that looks okay, a results timeline, and a little table with more info. It worked!

Also, i forgot the most important part of the result: confidence interval 99.9%.
You can follow @lyskoi.