In science, people tend to be most interested in positive results — a manipulation changes what you are measuring, two groups differ in meaningful ways, a drug treatment works, that sort of thing.
Journals preferentially publish positive results that are statistically significant, meaning they would be unlikely to have arisen by chance if there weren't something real going on. (Here p < 0.05 indicates statistical significance.)

Negative results, meanwhile, are uncommon.
Knowing that journals are unlikely to publish negative results, scientists often don't bother to write them up and submit them. Instead, those results end up buried in file drawers, or these days, on file systems.

This is known as the file drawer effect.

I was taken by a figure that @urbancic sent my way today, from a paper posted recently on the arXiv. https://arxiv.org/abs/2009.09440 

These are the values of a statistic called the "z value," drawn from over a million biomedical research papers.

What a weird distribution. Let's look a bit closer.
Without going into a lot of detail, we can view these scores as a measure of statistical support for a positive result. Values near zero indicate little or no support; values with magnitude greater than about 2 indicate statistical significance by conventional thresholds (p < 0.05).
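If you want to see how that threshold works numerically, here's a minimal sketch (my own illustration using SciPy; the two_sided_p helper is just a name I've made up) of the relationship between a z value and its two-sided p value, assuming a roughly normal test statistic:

```python
from scipy.stats import norm

def two_sided_p(z):
    """Two-sided p-value for a standard-normal test statistic."""
    return 2 * norm.sf(abs(z))

print(two_sided_p(1.96))  # ~0.05: right at the conventional threshold
print(two_sided_p(2.50))  # ~0.012: comfortably "significant"
print(two_sided_p(0.50))  # ~0.62: little or no evidence of an effect
```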
We can reasonably conclude from the shape of this distribution that there are a lot of studies sitting in file drawers. If everything were published, positive or negative, you might expect to see something rather like this. The shaded area represents the missing studies.
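To make that intuition concrete, here's a toy simulation (my own sketch, not from the paper; the 30% effect rate and the publication probabilities are made-up assumptions) of how selective publication of significant results hollows out the middle of the z distribution:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy model: a mix of true null effects (z centered at 0) and real effects
# (z centered away from zero, in either direction).
n = 200_000
has_effect = rng.random(n) < 0.3
sign = rng.choice([-1.0, 1.0], size=n)
z = rng.normal(loc=np.where(has_effect, 3.0 * sign, 0.0), scale=1.0)

# Selective publication: significant results (|z| > 1.96) are very likely
# to be written up; non-significant ones mostly stay in the file drawer.
publish_prob = np.where(np.abs(z) > 1.96, 0.9, 0.1)
published = rng.random(n) < publish_prob

# Compare the full distribution of results with the published subset.
bins = np.linspace(-6, 6, 61)
all_counts, _ = np.histogram(z, bins=bins)
pub_counts, _ = np.histogram(z[published], bins=bins)

near_zero = slice(25, 35)  # bins covering roughly -1 < z < 1
print("Share of near-zero results that reach print:",
      round(pub_counts[near_zero].sum() / all_counts[near_zero].sum(), 2))
```

Plotting pub_counts next to all_counts gives the same qualitative picture as the figure: steep walls near |z| = 2 and a crater where the non-significant results ought to be.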
So what?

A bunch of boring stuff that didn't work didn't get published.

Who cares?
The problem is, these missing results can bias our view of what works and what doesn't.

If, in reading the literature, we only see the successes and not the failures, we may be drawn to incorrect conclusions about important scientific questions.
In one of my favorite studies, @eturnermd1 and colleagues looked at this phenomenon for studies of antidepressant efficacy.

Before I go on, I believe that antidepressants can work well for severe depression. Not all the time, and not without some tinkering. But they save lives.
Turner and his colleagues asked what you would see if you went to the medical literature looking for clinical trials of antidepressants.

Studies above the line found statistically significant benefits; studies below the line did not.
Looks great, right?

But the problem is, you're missing the studies that found no benefit.

Erick was able to get access to these studies through the FDA regulatory process.

Adding those in, you get a really different picture.
I liken this to an iceberg. You normally see only the part above the waterline, but it can be a deadly mistake to assume there's nothing beneath.

(Photo: Wikimedia Commons)

BTW, here's the study in question: https://www.nejm.org/doi/full/10.1056/nejmsa065779
What happened to all those missing trials below the waterline? Many of them (the ones shown in yellow below) simply didn't result in publications. They ended up in file drawers, so to speak.
What is perhaps more remarkable is what happened to the other trials below the waterline. By "outcome shifting" (changing the success criteria one is looking for after the results come in), the studies shown in blue were reframed as positive results and published.
None of this is to say that science is broken, corrupt, or anything like that. There are legitimate reasons not to fill the journals with reports of things that didn't work.

People are thinking hard about where we can be misled by these missing results, and about what we can do about it.
In my view, this is an important area in what we call the "Science of Science," "Metascience," or "Metaresearch."

When the pandemic relaxes its grip on my research attention, I look forward to returning to this area.
Addendum: A more technical, but important, note about the z value figure. @Lakens points out that these data are mined from the literature. People may be reporting, but not quantifying, their negative results. That's true, at least in part. https://twitter.com/lakens/status/1335122795257327616
But note that these z values are computed from confidence intervals (and other related data), not reported directly in the form of p or z values.

So I wouldn't expect the same selection bias in terms of what is reported quantitatively within a paper.
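For concreteness, here's roughly how a z value can be backed out of a reported effect estimate and its 95% confidence interval. This is a generic sketch of the idea (the z_from_ci helper and the example numbers are mine), not the paper's exact pipeline:

```python
def z_from_ci(estimate, lower, upper, level_z=1.96):
    """Approximate z value implied by a symmetric 95% confidence interval.

    The half-width of the interval divided by 1.96 recovers the standard
    error; the estimate divided by that standard error is the z value.
    """
    se = (upper - lower) / (2 * level_z)
    return estimate / se

# Hypothetical effects reported on, say, a log odds-ratio scale:
print(z_from_ci(estimate=0.40, lower=0.05, upper=0.75))   # ~2.2, "significant"
print(z_from_ci(estimate=0.10, lower=-0.25, upper=0.45))  # ~0.6, not significant
```

Because confidence intervals are routinely reported whether or not a result is significant, this kind of conversion can recover z values even for results the authors didn't choose to emphasize.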
There's also the issue of which hypotheses get tested in the first place. Chances are, most people don't spend their time testing things they expect won't work. So if researchers have good intuition about what their results will be, we wouldn't expect many hypothesis tests to land near z = 0.
Maybe if everything were published, we would see something more like this.

Maybe. I'm not convinced, but it's not a trivial issue I think, and I need to give it more careful thought.

[updated to be asymmetrical, per several useful comments]
For me the real take-home is that the "walls of the volcano" are extremely steep, even though the data were inferred from confidence intervals rather than collected as directly reported z scores or p values.
I wouldn't fit this to a Gaussian and use that to estimate the exact magnitude of publication bias. But as a general illustration of the principles underlying publication bias in science, I think it's powerful.