Thread by @stalebread14, I've uncovered that DC calculates 7-day average positivity in an incredibly dumb [...]

seven costanza

stalebread14

I've uncovered that DC calculates 7-day average positivity in an incredibly dumb way, which has led to an overestimation in the positivity rate by an average of 0.35% [in my preliminary analysis] during the pandemic

I think the easiest way to explain the mistake is through a baseball analogy (sorry non-sports people)

Batting average is calculated as (# of hits)/(# of at bats), so if during a week you have 5 hits in 10 at bats, you should be batting .500 (50%)

Let's say you play 4 games that week. Your hits/at bats are:
- 1/3
- 1/3
- 2/3
- 1/1
So in total for that week, you've batted 5/10. Note that in the last game you batted 1.000 (100%), but that eye-popping one-day stat doesn't distort your weekly average, which is 5/10 (.500)

Now imagine you averaged *each individual day's* batting average. The result would be:
(.333 + .333 + .667 + 1.0) / 4 games = 0.583

That doesn't seem right! You only batted once in the last game, yet it counts as much as any other game and drags the weekly average up!
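Here's a minimal Python sketch of the two calculations, using the four games from the example. The function names `pooled_average` and `mean_of_daily_averages` are just labels I made up for illustration.

```python
# (hits, at_bats) for the four games in the example
games = [(1, 3), (1, 3), (2, 3), (1, 1)]

def pooled_average(games):
    """Total hits divided by total at bats across all games."""
    hits = sum(h for h, _ in games)
    at_bats = sum(ab for _, ab in games)
    return hits / at_bats

def mean_of_daily_averages(games):
    """Average of each game's own batting average (every game weighted equally)."""
    return sum(h / ab for h, ab in games) / len(games)

print(round(pooled_average(games), 3))          # 0.5
print(round(mean_of_daily_averages(games), 3))  # 0.583
```

The second function gives the 1-for-1 game the same weight as a 3-at-bat game, which is where the inflation comes from.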

See where I'm going with this?

DC is doing the latter calculation: an average of each individual day's positivity rate, rather than (# of new cases in a week)/(# of new tests in a week).

Testing is lower on the weekends, which sometimes leads to higher one-day positivity rates. Those low-volume days count just as much as high-volume days in DC's calculation, which skews the result

To be fair, DC does say that their measure is a "7-Day Average", which is technically true.

But does that seem like a good metric for positivity rates? I don't think it takes too much knowledge of statistics or epidemiology to say "probably not"

The next question you may have is: How did I figure this out?

You can actually download the data behind the positivity chart by clicking on the x-axis of their chart

The variable names are unhelpful and somewhat obscure, but you can begin to see that DC has a database with a row for each test: "Number of Rows (Aggregated)" corresponds to the # of tests done on that day, and "sample_results" is the % of those tests that were positive

You can sanity check that assumption: multiply the two columns and the implied number of positive cases from that day's testing batch comes out (with small rounding error) as an integer
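A rough sketch of that integer check in Python. The function name, the `tolerance` value, and the sample rows are my own assumptions for illustration, not values from DC's download.

```python
def implied_positives(num_tests, positivity_rate, tolerance=0.05):
    """Return the implied positive count if tests * rate is close to a
    whole number, else None. `tolerance` is an assumed allowance for
    rounding in the published rate."""
    implied = num_tests * positivity_rate
    nearest = round(implied)
    return nearest if abs(implied - nearest) <= tolerance else None

# Hypothetical rows: (tests that day, reported positivity as a fraction)
print(implied_positives(4000, 0.0525))  # 210
print(implied_positives(3, 0.4))        # None (1.2 is not near an integer)
```

If nearly every row passes a check like this, it supports reading the two columns as "tests that day" and "share positive that day".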

You can further verify this by creating a "Literal 7-day average" positivity stat, which gives you the exact same 5.413% stat for 11/23 (and matches exactly with the rest of DC's data too)

From there, it's simple to calculate a rolling window positivity, which is what I based my "overestimation" on.
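To make the comparison concrete, here's a sketch that computes both numbers for every 7-day window: the pooled rolling-window positivity and the "literal" average of daily rates. The daily counts are invented to mimic a weekend dip in testing, not DC's actual data, and the code assumes every day has at least one test.

```python
def rolling_positivity(daily_tests, daily_positives, window=7):
    """For each trailing window, return (pooled, literal) where
    pooled  = total positives / total tests in the window, and
    literal = plain average of each day's positives / tests."""
    results = []
    for end in range(window, len(daily_tests) + 1):
        tests = daily_tests[end - window:end]
        positives = daily_positives[end - window:end]
        pooled = sum(positives) / sum(tests)
        literal = sum(p / t for p, t in zip(positives, tests)) / window
        results.append((pooled, literal))
    return results

# Invented week: five weekdays at 4% positivity, two low-volume weekend days at 8%
daily_tests = [5000, 5000, 5000, 5000, 5000, 1000, 1000]
daily_positives = [200, 200, 200, 200, 200, 80, 80]

pooled, literal = rolling_positivity(daily_tests, daily_positives)[0]
print(f"pooled: {pooled:.4f}, literal: {literal:.4f}")  # pooled: 0.0430, literal: 0.0514
```

In this made-up week the literal average comes out higher purely because the two weekend days get the same weight as the weekdays despite having a fifth of the tests.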

As I said, DC isn't doing something *wrong* to get a "7-day average", it just isn't a useful measurement for the statistic.

I hope what I've made clear is that you need a rolling window of tests/cases to get a straightforward measure of positivity in a week

Lastly, check out my website, http://dccovid.com, which visualizes DC's raw case/testing data and *doesn't* use misleading statistics!
