1) Why do we run group-sequential trials in drug development?

2) How does the effect we power at relate to the effect(s) needed to stop?

3) Is stopping such a trial early (like the one for the Pfizer vaccine) "cheating"?

4) If we stop early, the effect estimate may be biased. Is this an issue?

5) What happens operationally if a trial stops early?
A thread from a pharma statistician who has developed, run, analyzed, and taught courses about trials with such designs.

@statsepi @lakens @MaartenvSmeden @stevesphd @ADAlthousePhD @DominicMagirr @thomas_jaki
1) Why?

Assume a time-to-event endpoint, alpha = 0.05, power = 80%, and a hazard ratio to detect of 0.75.

Number of events needed for a single-stage trial: 380. In a single-stage trial you wait for this number of events IN ANY CASE, i.e. even if your initial guess of HR = 0.75 was off.
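Those 380 events can be reproduced with the standard Schoenfeld approximation for the required number of events (a minimal sketch in Python; I'm assuming 1:1 allocation and a two-sided test, which the thread doesn't state explicitly):

```python
# Schoenfeld approximation: d = 4 * (z_alpha + z_beta)^2 / (log HR)^2
# for a two-arm trial with 1:1 allocation (an assumption, not stated
# in the thread) and a two-sided level-alpha test.
from math import ceil, log
from statistics import NormalDist

def schoenfeld_events(hr, alpha=0.05, power=0.80):
    """Required number of events to detect hazard ratio `hr`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96
    z_beta = NormalDist().inv_cdf(power)           # ~0.84
    return ceil(4 * (z_alpha + z_beta) ** 2 / log(hr) ** 2)

print(schoenfeld_events(0.75))  # → 380
```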
Assume you add a futility interim (stop the trial if the observed HR ≥ 1) after 30% of events and an efficacy interim (O'Brien-Fleming alpha spending) after 66% of events. This increases the maximum number of events needed from 380 to 408.

Interims are performed after 123 and 270 events.
Now if we run 100 such trials, some of them will actually stop at the 1st or 2nd interim. The probabilities of that happening, under H0 and H1, are:

futility: 0.50 / 0.06
efficacy: 0.006 / 0.43

So e.g. if the drug is useless, half of all trials will stop at the futility interim.
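These stopping probabilities can be roughly reproduced by simulating the canonical joint distribution of the group-sequential test statistics (a sketch under my sign convention, larger z favoring the drug; the boundary 2.512 is the z-value for two-sided p = 0.012, and the futility rule "observed HR ≥ 1" becomes z ≤ 0):

```python
# Monte Carlo check of the stopping probabilities quoted above, using
# independent increments of the score process with drift -log(HR)/2
# per event (standard approximation; boundaries are my reconstruction).
import random
from math import log, sqrt

def stop_probs(true_hr, n_sims=100_000, seed=1):
    """Monte Carlo probabilities of stopping at the two interims."""
    looks = (123, 270)        # events at futility / efficacy interim
    z_eff = 2.512             # efficacy boundary, two-sided p = 0.012
    theta = -log(true_hr) / 2 # drift per event of the score process
    rng = random.Random(seed)
    fut = eff = 0
    for _ in range(n_sims):
        b1 = rng.gauss(theta * looks[0], sqrt(looks[0]))
        if b1 / sqrt(looks[0]) <= 0:          # observed HR >= 1: futility
            fut += 1
            continue
        b2 = b1 + rng.gauss(theta * (looks[1] - looks[0]),
                            sqrt(looks[1] - looks[0]))
        if b2 / sqrt(looks[1]) >= z_eff:      # p <= 0.012: efficacy
            eff += 1
    return fut / n_sims, eff / n_sims

print(stop_probs(1.0))   # under H0 → ≈ (0.50, 0.006)
print(stop_probs(0.75))  # under H1 → ≈ (0.06, 0.43)
```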
Stopping at an interim of course means we need to collect far fewer events. The *expected* numbers of events are thus:

Under H0: 0.50 ⋅ 123 + 0.006 ⋅ 270 + 0.494 ⋅ 408 ≈ 265.
Under H1: 0.06 ⋅ 123 + 0.43 ⋅ 270 + 0.51 ⋅ 408 ≈ 332.
So in both cases the expected number of events is *much smaller* than the 380 we need to collect in any case in a single-stage design. That is the main advantage of such designs.
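The expected-events arithmetic above can be checked with a short snippet (stopping probabilities and event counts as quoted in the thread; whatever probability mass is left over is assigned to the final analysis at 408 events):

```python
# Expected number of events for the group-sequential design, given the
# probabilities of stopping at the futility and efficacy interims.
def expected_events(p_fut, p_eff, looks=(123, 270, 408)):
    """Probability-weighted average of the event counts at the looks."""
    p_final = 1 - p_fut - p_eff
    return p_fut * looks[0] + p_eff * looks[1] + p_final * looks[2]

print(round(expected_events(0.50, 0.006)))  # under H0 → 265
print(round(expected_events(0.06, 0.43)))   # under H1 → 332
```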

2) How does the effect we power at, 0.75, relate to the effect size needed to stop for efficacy?
At the efficacy interim, to stop early the p-value must be ≤ 0.012, and for the trial to be significant at the final analysis it must be ≤ 0.046. These significance levels correspond to hazard ratios of 0.735 and 0.821, respectively.

Sometimes the latter are called minimal detectable differences, MDDs.
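Those MDDs can be backed out from the nominal significance levels using the usual normal approximation for the log hazard ratio, Var(log HR) ≈ 4/d with d events and 1:1 allocation (an assumption on my part; it reproduces the thread's numbers to within rounding):

```python
# Convert a nominal two-sided significance level into the hazard ratio
# that sits exactly on the boundary, via log(HR) = -2 * z / sqrt(d).
# The 4/d variance assumes 1:1 allocation (not stated in the thread).
from math import exp, sqrt
from statistics import NormalDist

def mdd_hazard_ratio(p_nominal, events):
    """Boundary (minimal detectable) hazard ratio at a given look."""
    z = NormalDist().inv_cdf(1 - p_nominal / 2)
    return exp(-2 * z / sqrt(events))

print(round(mdd_hazard_ratio(0.012, 270), 3))  # → 0.737 (thread: 0.735)
print(round(mdd_hazard_ratio(0.046, 408), 3))  # → 0.821
```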
Often, people believe that in order to stop a trial early, the effect seen at the interim must be *much larger* than what we assumed for powering. Comparing 0.75 to 0.735, it is clear that this is not the case. That the MDD at the interim and the effect we power at are about the same is typical for an OBF-type boundary and an interim after about 2/3 of the information.

Another common belief is that in order for the trial to be significant we need to observe a hazard ratio ≤ 0.75. Again, not true: the MDD at the final analysis is actually 0.821, i.e. this is the hazard ratio we need to beat in order to get a p-value of 0.046 or lower.

3) "Cheating"? The methodology for group-sequential designs is developed such that the familywise error rate over *all looks at the data* is controlled. This is why at the final analysis the p-value needs to be ≤ 0.046, not ≤ 0.05. This is the price to pay for the interim look.

But why is 0.012 + 0.046 > 0.05? Isn't that cheating? No: by exploiting the correlation between the test statistics at the interim and final analyses, you can "gain" a bit of alpha. Again, no cheating; the FWER is always protected.
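A quick way to see this alpha "gain" is to simulate the two correlated test statistics under H0 (a sketch with the nominal boundaries above; the correlation sqrt(270/408) follows from the information fractions, and I ignore the futility look, which can only make the test more conservative):

```python
# Monte Carlo check that rejecting at nominal two-sided levels 0.012
# (interim) and 0.046 (final) still gives an overall type I error of
# about 5%, because the two statistics are highly correlated under H0.
import random
from math import sqrt
from statistics import NormalDist

def simulate_fwer(n_sims=200_000, seed=1):
    nd = NormalDist()
    z1_crit = nd.inv_cdf(1 - 0.012 / 2)  # interim boundary, ~2.51
    z2_crit = nd.inv_cdf(1 - 0.046 / 2)  # final boundary,   ~2.00
    rho = sqrt(270 / 408)                # information-fraction correlation
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        z1 = rng.gauss(0, 1)                                 # interim statistic
        z2 = rho * z1 + sqrt(1 - rho**2) * rng.gauss(0, 1)   # correlated final statistic
        if abs(z1) >= z1_crit or abs(z2) >= z2_crit:
            rejections += 1
    return rejections / n_sims

print(simulate_fwer())  # ≈ 0.05
```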
4) If we stop early, the effect estimate may be biased. Is this an issue?

Lots has been written about inference adjusted for the fact that a trial stopped early. I'd just like to give a median unbiased estimate of the hazard ratio in our example. Assume at the futility interim we observe HR = 0.69 and at the efficacy interim HR = 0.66, with *conventional* 95% CI from 0.51 to 0.85. Since 0.66 ≤ 0.735, the trial stopped for efficacy.

The median unbiased estimate accounting for early stopping amounts to 0.68, with an adjusted CI from 0.53 to 0.86. So the conventional and adjusted analyses are close.
This is in line with literature, see

https://twitter.com/numbersman77/status/1325881963920625666?s=20

5) What happens operationally if a trial stops early? Statistically, we "stopped" the trial at the efficacy interim and rejected H0: hazard ratio = 1 under full type I error control. We would thus proceed with filing the drug.

But of course, operationally the trial would continue: more follow-up data on primary and secondary endpoints (e.g. OS), safety, biomarkers, etc. would be collected, typically for years.

Also, often one would still do an analysis at 408 events, the initially planned final analysis, to make sure the results persist over time.

Note that stopping at the efficacy interim typically leads to unblinding, so the analysis and interpretation of follow-up data need caution and expertise.
So group-sequential designs reduce expected number of events needed and provide valid inference in reasonable cases.

All of this is within the framework of hypothesis testing, as required by Health Authority guidelines.

I hope this thread is useful. Comments welcome!

The end.
You can follow @numbersman77.