Some thoughts on optional stopping methods, inspired by a tweet from @Lakens https://twitter.com/lakens/status/1354427078368784388?s=20. The cast (in order of appearance): Sequential Bayes Factors (SBF), Sequential Probability Ratio Test (SPRT), and Group Sequential testing (GS).
A caveat. I am a psychiatrist. I have little formal statistical training and my main research focus is not methods. I'm likely to be wrong and happy to be corrected. In my defense, at least I'm not a surgeon (*cough* @statsepi).
Under the hood, both SPRT and SBF are likelihood ratios: the likelihood of the alternative hypothesis (H1) divided by the likelihood of the null hypothesis (H0), given the data. The difference is that SPRT puts all its eggs in one basket, specifying H1 as a single point, while the BF averages the likelihood under H1 over a distribution of effect sizes.
SPRT can therefore be thought of as a special case of SBF. Under many circumstances SBF and SPRT behave similarly with regard to efficiency and error rates (as shown by @ephemeralidea: https://twitter.com/ephemeralidea/status/1329061102819438595?s=20).
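To make that concrete, here is a minimal sketch in Python (not the R we actually used), assuming one-sample normal data with known sigma = 1 and, purely for illustration, a N(0, 1) prior on the effect size d under H1:

```python
# Minimal sketch: SPRT as a point-hypothesis likelihood ratio vs SBF as the same
# likelihood ratio averaged over a prior. Data setup and prior are illustrative.
import numpy as np
from scipy import stats, integrate

rng = np.random.default_rng(1)
x = rng.normal(0.5, 1.0, size=30)        # simulated data: true effect d = 0.5, sigma = 1

def loglik(d):
    """Log-likelihood of the data if the true effect size is d (known sigma = 1)."""
    return stats.norm.logpdf(x, loc=d, scale=1.0).sum()

ll0 = loglik(0.0)
lr = lambda d: np.exp(loglik(d) - ll0)   # likelihood ratio at effect d vs the null

# SPRT: H1 is a single point (here d = 1), so the evidence is one likelihood ratio.
lr_sprt = lr(1.0)

# SBF: H1 is a distribution over d (here a N(0, 1) prior, an illustrative choice),
# so the likelihood ratio is averaged over that prior.
prior = stats.norm(0.0, 1.0)
bf_10, _ = integrate.quad(lambda d: lr(d) * prior.pdf(d), -5, 5)

print(f"SPRT LR (H1: d = 1):       {lr_sprt:.3f}")
print(f"SBF BF10 (H1: d ~ N(0,1)): {bf_10:.3f}")
```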
Based on this thread on sequential LR testing https://twitter.com/StatEvidence/status/1009121250525040646?s=20, I had a long crush on the SPRT approach. After lengthy discussions and simulations, @ppsigray convinced me to run with SBF instead.
The main argument (imo) against making H1 a point (i.e., SPRT) is poor control of the rate of "false negatives" when the population effect is non-zero but smaller than the chosen H1. Simplified example: if H1: d = 1 and the true population effect is d = 0.5, it becomes a coin toss whether we stop for H0 or H1.
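A quick simulation of that coin toss, using my own illustrative setup (one-sample normal data with sigma = 1, Wald's SPRT boundaries for alpha = beta = 0.05):

```python
# Sketch of the "coin toss" claim: SPRT with a misspecified point H1 (d = 1)
# when the true effect is d = 0.5. Boundaries follow Wald's approximations for
# alpha = beta = 0.05 (an illustrative choice, not a recommendation).
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
alpha = beta = 0.05
upper = np.log((1 - beta) / alpha)   # stop for H1 when the log LR crosses this
lower = np.log(beta / (1 - alpha))   # stop for H0 when the log LR crosses this

def sprt_decision(true_d, h1_d=1.0, max_n=1000):
    """Add one observation at a time and stop when the log LR crosses a boundary."""
    log_lr = 0.0
    for _ in range(max_n):
        x = rng.normal(true_d, 1.0)
        log_lr += stats.norm.logpdf(x, h1_d, 1.0) - stats.norm.logpdf(x, 0.0, 1.0)
        if log_lr >= upper:
            return "H1"
        if log_lr <= lower:
            return "H0"
    return "undecided"

decisions = [sprt_decision(true_d=0.5) for _ in range(2000)]
print("stopped for H1:", decisions.count("H1") / len(decisions))
print("stopped for H0:", decisions.count("H0") / len(decisions))
# With the true effect halfway between H0 and the point H1, roughly half of the
# runs end up stopping for H0.
```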
In most clinical studies this will make it difficult to apply SPRT in any way other than to decide on the smallest effect size of interest and use that as H1. The trouble is that, as H1 nears H0, the efficiency of SPRT drops fast.
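Wald's back-of-the-envelope formula for the expected sample size shows how quickly this bites (again my illustrative numbers, one-sample normal data with sigma = 1 and alpha = beta = 0.05):

```python
# Sketch of why SPRT efficiency drops as H1 approaches H0, using Wald's
# approximation for the expected sample size when H1 is true.
import numpy as np

alpha = beta = 0.05
upper = np.log((1 - beta) / alpha)
lower = np.log(beta / (1 - alpha))

for d1 in (1.0, 0.5, 0.2, 0.1):
    # Expected log-likelihood-ratio increment per observation under H1 is d1^2 / 2.
    expected_n = ((1 - beta) * upper + beta * lower) / (d1**2 / 2)
    print(f"H1: d = {d1:<4} -> expected N to stop ~ {expected_n:.0f}")
# Halving the H1 effect size roughly quadruples the expected sample size.
```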
SBF suffers from the same problem, but using a distribution (centered on 0) as H1 makes things less volatile.
Also, with a larger sample the ratio of true positives to false negatives increases for the BF (here non-sequential, with a non-zero population effect). In the scenario with H1: d = 1 and a population effect of 0.5, a likelihood ratio test will be a coin toss whether N = 10 or 10^6. That just feels wrong.
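A rough fixed-N simulation of that contrast, with the same illustrative setup as above (true d = 0.5, point H1 at d = 1, and a N(0, 1) prior on d for the BF):

```python
# Sketch of the fixed-N point: the point-H1 likelihood ratio stays a coin toss as
# N grows, while the BF favours H1 more and more often. All choices illustrative.
import numpy as np
from scipy import stats, integrate

rng = np.random.default_rng(3)
prior = stats.norm(0.0, 1.0)

def lr_and_bf(n, true_d=0.5):
    """One simulated study of size n: return (point-H1 LR at d = 1, BF10 with a N(0,1) prior)."""
    x = rng.normal(true_d, 1.0, size=n)
    ll = lambda d: stats.norm.logpdf(x, d, 1.0).sum()
    rel = lambda d: np.exp(ll(d) - ll(0.0))   # likelihood ratio at effect d vs the null
    bf10, _ = integrate.quad(lambda d: rel(d) * prior.pdf(d), -5, 5, points=[x.mean()])
    return rel(1.0), bf10

for n in (10, 50, 200):
    sims = [lr_and_bf(n) for _ in range(300)]
    p_lr = np.mean([lr > 1 for lr, _ in sims])
    p_bf = np.mean([bf > 1 for _, bf in sims])
    print(f"N = {n:>3}: P(LR favours H1) = {p_lr:.2f}, P(BF10 favours H1) = {p_bf:.2f}")
```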
The difficulty with error control in SBF is solvable. Though no analytical approach is available, it is easy to get precise thresholds using simulations. E.g., see our R script calling on saved simulations here: https://github.com/pontusps/Early_stopping_in_PET/tree/master/R/earlystopBF
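The gist of the simulation approach, sketched in Python rather than the R in the repo (one-sample z-test, N(0, 1) prior on d, a look after every observation from n = 10 to 100; all illustrative choices, not what the linked script does line for line):

```python
# Sketch of simulation-based threshold calibration for SBF: simulate under H0 and
# find the BF10 stopping threshold whose sequential false positive rate is near
# the target alpha.
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)

def bf10(x, prior_sd=1.0):
    """Closed-form BF10 for a one-sample z-test (sigma = 1) with a N(0, prior_sd^2) prior on the mean."""
    n, xbar = len(x), x.mean()
    # The sample mean is sufficient: xbar ~ N(0, prior_sd^2 + 1/n) under H1 and N(0, 1/n) under H0.
    return (stats.norm.pdf(xbar, 0.0, np.sqrt(prior_sd**2 + 1 / n))
            / stats.norm.pdf(xbar, 0.0, np.sqrt(1 / n)))

def max_bf_under_h0(n_min=10, n_max=100):
    """Largest BF10 seen over all interim looks for one dataset simulated under H0."""
    x = rng.normal(0.0, 1.0, size=n_max)
    return max(bf10(x[:n]) for n in range(n_min, n_max + 1))

max_bfs = np.array([max_bf_under_h0() for _ in range(2000)])
for threshold in (3, 6, 10):
    print(f"stop-for-H1 threshold BF10 >= {threshold}: "
          f"simulated false positive rate = {np.mean(max_bfs >= threshold):.3f}")
# Pick the smallest threshold whose simulated rate stays below the target alpha.
```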
Efficiency (under perfect/oracle conditions): 1. SPRT, 2. SBF, 3. GS. Risk of messing things up: same order (imo). SPRT is attractive, but the consequences of the choice of H1 make me nervous. Perhaps unjustifiably so?