This paper was huge fun to write. It addresses a problem I've contemplated since my PhD. How do instantaneous rate models of insertion-deletion processes relate to distributions over pairwise sequence alignments? Both look like HMMs. The only exactly-solved model is TKF91... https://twitter.com/biorxivpreprint/status/1278029629685665792
The TKF91 model involves single-residue indels and can be solved as a linear birth-death process. All well & good, but single-residue indels are unrealistic and produce crappy alignments. We know from data that real sequences evolve via longer indel events https://pubmed.ncbi.nlm.nih.gov/1920447/
There've been a few attempts to solve long indel models. In 2004 Istvan Miklós, @GertonLunter, and I tried enumerating indel trajectories directly, up to 3 events. It gives an exact lower bound but is slow, can only do 30-40 bases & fails for longer times: https://pubmed.ncbi.nlm.nih.gov/14694074/
Most other attempts at solving this process have involved introducing artificial assumptions, like indivisible fragment boundaries in the sequence, that are hard to connect systematically to the underlying continuous-time process. Such as the TKF92 model: https://pubmed.ncbi.nlm.nih.gov/1556741/
This paper by Elena Rivas and Sean Eddy is another that uses this approach and guesses at closed form solutions. These approximations are analytically tractable, but they introduce additional assumptions & aren't really justified from first principles: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-015-0832-5
Recently @Nicola_De_Maio developed a breakthrough approach, developing ODEs for how the mean indel length evolves. His paper is in press but he has talked about it publicly including at ProbGen 2019. I was 100% inspired by this work, Nicola deserves credit https://probgen2019.sciencesconf.org/data/pages/ProbGen2019_abstracts_book.pdf
My approach is to represent the alignment distribution as an HMM 𝔽(t) and the infinitesimal generator as another HMM 𝔾(Δt). Each has 3 states. Multiplying them together (using automata algebra) gives you a 9-state HMM 𝔽(t)𝔾(Δt) which we map back to 𝔽(t+Δt) by coarse-graining
Taking the infinitesimal limit Δt→0 leads to differential equations for the transition probabilities of the approximating HMM 𝔽(t). These ODEs can be solved numerically... for the long indel case. For the special case of single-residue indels, TKF91 emerges as an exact solution
I show with simulations that this is a better fit than Nicola's model (which was already quite good) & often beats our 2004 approximation too. I also conjecture a parallel set of ODEs for sufficient statistics to use Expectation Maximization to train the rate parameters.
This is a problem I've speculated about for 25 years and a personal itch I'm satisfied to scratch. I think this approach (multiplying an Pair HMM with an infinitesimal generator, then matching transition probabilities by renormalizing) might be usable for more complex models
All the code implementing these methods, including the simulation benchmark and a clean implementation of the Miklós et al 2004 algorithm for calculating trajectory likelihoods in a continuous-time Markov chain, is available at https://github.com/ihh/trajectory-likelihood