This paper was huge fun to write. It addresses a problem I've contemplated since my PhD. How do instantaneous rate models of insertion-deletion processes relate to distributions over pairwise sequence alignments? Both look like HMMs. The only exactly-solved model is TKF91... https://twitter.com/biorxivpreprint/status/1278029629685665792
The TKF91 model involves single-residue indels and can be solved as a linear birth-death process. All well & good, but single-residue indels are unrealistic and produce crappy alignments. We know from data that real sequences evolve via longer indel events https://pubmed.ncbi.nlm.nih.gov/1920447/ 
Most other attempts at solving this process have involved introducing artificial assumptions, like indivisible fragment boundaries in the sequence, that are hard to connect systematically to the underlying continuous-time process. Such as the TKF92 model: https://pubmed.ncbi.nlm.nih.gov/1556741/ 
Recently @Nicola_De_Maio developed a breakthrough approach, developing ODEs for how the mean indel length evolves. His paper is in press but he has talked about it publicly including at ProbGen 2019. I was 100% inspired by this work, Nicola deserves credit https://probgen2019.sciencesconf.org/data/pages/ProbGen2019_abstracts_book.pdf
My approach is to represent the alignment distribution as an HMM 𝔽(t) and the infinitesimal generator as another HMM 𝔾(Δt). Each has 3 states. Multiplying them together (using automata algebra) gives you a 9-state HMM 𝔽(t)𝔾(Δt) which we map back to 𝔽(t+Δt) by coarse-graining
Taking the infinitesimal limit Δt→0 leads to differential equations for the transition probabilities of the approximating HMM 𝔽(t). These ODEs can be solved numerically... for the long indel case. For the special case of single-residue indels, TKF91 emerges as an exact solution
I show with simulations that this is a better fit than Nicola's model (which was already quite good) & often beats our 2004 approximation too. I also conjecture a parallel set of ODEs for sufficient statistics to use Expectation Maximization to train the rate parameters.
This is a problem I've speculated about for 25 years and a personal itch I'm satisfied to scratch. I think this approach (multiplying an Pair HMM with an infinitesimal generator, then matching transition probabilities by renormalizing) might be usable for more complex models
All the code implementing these methods, including the simulation benchmark and a clean implementation of the Miklós et al 2004 algorithm for calculating trajectory likelihoods in a continuous-time Markov chain, is available at https://github.com/ihh/trajectory-likelihood
You can follow @ianholmes.
Tip: mention @twtextapp on a Twitter thread with the keyword “unroll” to get a link to it.

Latest Threads Unrolled:

By continuing to use the site, you are consenting to the use of cookies as explained in our Cookie Policy to improve your experience.