"Problems with Evidence Assessment in COVID-19 Health Policy Impact Evaluation (PEACHPIE): A systematic strength of methods review" is finally available as a pre-print!

https://doi.org/10.1101/2021.01.21.21250243

THREAD!
One of the most important questions for policy right now is how well past COVID-19 policies reduced the spread and impact of SARS-CoV-2 and COVID-19.

Unfortunately, estimating the causal impact of specific policies is always hard(tm), and way harder for COVID-19.
There are LOTS of ways that these things can go wrong. Last fall, we developed review guidance and a checklist for how to "sniff test" the designs of these kinds of studies. Check that out here:

https://arxiv.org/abs/2009.01940 
This study takes the guidance above, and systematically applies it to the (nearly) full peer-reviewed literature.

Our main question: How many COVID-19 policy impact evaluation papers meet basic design criteria for estimating the impact of specific policies?
What methods are used, what policies measured, etc?

Both the review guidance/tool and the process are a bit unusual, so as a secondary objective we wanted to explore how well everything worked, particularly the review tool.
We searched PubMed for COVID-19 policy impact papers published up to Nov 26, and screened titles and abstracts for studies that appeared to be primarily about estimating the quantitative impact of COVID-19 policies on direct COVID-19 outcomes.
We found 102 studies that looked like they fit the bill, and sent them on to the full article review phase.

For each article, 3 qualified reviewer/coauthors independently assessed eligibility, applied the review tool, and then had a discussion to generate a consensus opinion.
Turns out that only 36 studies met our inclusion criteria in the end. Many of the rejected papers were modelling studies and/or did not directly estimate quantitative policy impact, so they didn't qualify.

First step of strength review: figure out what methods were used.
We set aside the cross-sectional and pre/post studies as being broadly inappropriate for COVID-19 policy impact evaluation. No RCTs were identified.

We then assessed the 27 remaining interrupted time series (ITS; the most common), difference-in-differences (DiD), and comparative interrupted time series (CITS) studies against our four key design criteria:
1) Graphical representation. Showing the outcome over time is critical for determining the plausibility of assumptions and appropriateness of the methods.

Are the data even shown graphically in a way that lets us assess things properly?
2) Functional form. Infectious diseases are famously tricky to model, and simple linear projections are rarely appropriate. Having an inappropriate form can wreak havoc on estimates, so it's really important to get this right. Is the functional form justified and justifiable? (See the sketch below for a toy example covering this and the timing issue.)
3) Timing of policy impact. It takes time between the implementation of a policy and when its impact starts to show up in the data. Just like the above, getting this wrong can easily ruin an otherwise good estimate. Did the analysts address this, and how well?
4) Concurrent changes. These methods all critically rely on there being little to nothing else of interest [differentially] impacting the outcome concurrent with the policy(ies) of interest.

That includes policies, social/behavioral changes, infectious disease dynamics, etc.
No concurrent changes is a big hurdle, but it's inherent and fundamental to these methods. If there's anything substantial happening to the outcome at the same time as the policy, and it can't be adjusted for, we can't isolate the impact of the policy itself.
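To make criteria 2 and 3 a bit more concrete, here's a minimal toy sketch (mine, not from the paper, and not any specific study's model) of a segmented ITS regression that at least attempts both: the outcome is modelled on the log scale, since epidemic curves are roughly exponential over short windows and a straight line through raw counts is usually a poor fit, and the policy indicator is lagged by an assumed delay for incubation plus reporting. The simulated data, the 10-day lag, and all variable names are illustrative assumptions.

```python
# Toy sketch of a segmented log-linear interrupted time series (ITS) model.
# Everything here (data, 10-day lag, variable names) is an illustrative
# assumption, not the specification used by the paper or any reviewed study.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

days = np.arange(60)
policy_day, impact_lag = 30, 10  # policy on day 30; assume ~10 days to show up

# Simulate daily log-growth: +8%/day before the lagged policy effect, -5%/day after.
growth = np.where(days < policy_day + impact_lag, 0.08, -0.05)
log_cases = np.log(50) + np.cumsum(growth) + rng.normal(0, 0.05, size=days.size)

df = pd.DataFrame({
    "day": days,
    "log_cases": log_cases,
    # Criterion 3 (timing): lag the policy indicator by the assumed delay
    # before any effect could plausibly appear in reported cases.
    "post": (days >= policy_day + impact_lag).astype(int),
})
df["days_since_impact"] = np.maximum(df["day"] - (policy_day + impact_lag), 0)

# Criterion 2 (functional form): segmented regression on LOG cases, with a
# pre-period trend plus a level change and a slope change at the lagged date.
its = smf.ols("log_cases ~ day + post + days_since_impact", data=df).fit()
print(its.params)  # 'days_since_impact' ~ estimated change in daily log-growth

# Criterion 1 (graphical representation): before trusting any of this, plot
# log_cases against day with the policy date and assumed lag marked, and
# check the fit visually.
```

Even a specification like this only "passes" if the lag and the log-linear form are actually justified for the setting, and criterion 4 (no concurrent changes) can't be fixed in code at all: if something else moved the outcome at the same time, no regression term rescues the estimate.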
So, how did these studies do? Well, not great.

Most of these studies tended to do a pretty good job with the graphical representation and the timing bit, so that's good!

But when we get into functional form and concurrent changes, things take a turn.
The consensus review among the three reviewers for each study found that only 5/27 studies passed the functional form check (i.e. "yes" or "mostly yes" for meeting the criterion), and only 3/27 passed for concurrent changes.

These two issues tended to be the Achilles' heels.
Only one of the studies was found to have a "passing" rating for all 4 key questions.

When asked directly whether the study was appropriate overall, reviewers identified 4 studies as meeting our criteria.
Importantly, "passing" does NOT mean that these 4 are necessarily useful/actionable estimates; any number of other things we didn't check for could go wrong (other design issues, statistics, generalizability, etc.).

This is just a sniff test for key design issues.
Many of the studies given low ratings might also be useful for reasons other than direct policy impact evaluation.

We're not declaring studies "good" or "bad" here; we're just assessing whether they meet a limited set of design criteria for identifying the causal impact of COVID-19 policies.
As for the review process, even with guidance, highly skilled reviewers etc, there was a LOT of disagreement between the independent reviews.

After discussion, consensus ratings were much worse than the independent ones. Problems add up, discussion reveals issues.
So, what does it all mean?

The problem here is largely that you need the right circumstances for these methods to give you rigorous estimates. We just didn't have those circumstances. Too many unknowns, and too much stuff going on all at once (i.e. concurrent changes).
As a result, almost all of these studies yielded weak inference. Many may have been nearly the best that could have been done, but the best that can be done is often not very informative given the circumstances they were working with.
Importantly, weak evidence DOES NOT imply in any way that these policies had no or weak effects, just that we lacked the circumstances needed to evaluate them!

Weak evidence just isn't informative, and strong/direct evidence is hard.
Of course, our inference is limited too! For example, we didn't look at preprints, some of which might be pretty good!

This kind of review is also inherently subjective, both in what we chose to look at and the opinions of the reviewers. Other reviewers might choose differently.
We also demonstrated here how our targeted and guided peer review process revealed a lot of weaknesses in studies that passed through the traditional peer review process, many of which are highly cited in big-name journals but rate very poorly on basic design criteria.
At the end of the day, what we are left with is having to face making big decisions with very little direct evidence to go on.

But that's ok! And it's ALWAYS better to know how reliable the evidence we have is for making those decisions.
This was a HUGE effort, combining the talents and time of 25 people across the world. I certainly learned a BUNCH in the process, and am truly grateful to have had such amazing collaborators on this.

We're not done yet though.
We really and truly want your criticism, suggestions, concerns, etc. There are for sure typos, maybe even errors, things we can clarify, etc., and we'd LOVE your feedback!

Please brutally tear this paper apart and e-mail me the tattered remains (e-mail address is in the draft).
To re-emphasize here:

This study does not mean that policies were ineffective, nor does it tell us much beyond the state of the policy impact evaluation literature itself.

What this tells us is that there are serious limitations on what is knowable given the circumstances, as discussed here.
A few more things you'll find in the full paper:

* What journals were these published in?
* How prominent/cited are these studies?
* What kinds of policies were evaluated?
* What kinds of methods were used?
* What specific studies did well/poorly in our review?