Thread by @ianholmes, This paper was huge fun to write. It addresses a problem I've [...]

This paper was huge fun to write. It addresses a problem I've contemplated since my PhD. How do instantaneous rate models of insertion-deletion processes relate to distributions over pairwise sequence alignments? Both look like HMMs. The only exactly-solved model is TKF91... https://twitter.com/biorxivpreprint/status/1278029629685665792

https://twitter.com/biorxivpreprint/status/1278029629685665792

The TKF91 model involves single-residue indels and can be solved as a linear birth-death process. All well & good, but single-residue indels are unrealistic and produce crappy alignments. We know from data that real sequences evolve via longer indel events https://pubmed.ncbi.nlm.nih.gov/1920447/

An Evolutionary Model for Maximum Likelihood Alignment of DNA Sequences - PubMed

Most algorithms for the alignment of biological sequences are not derived from an evolutionary model. Consequently, these alignment algorithms lack a strong statistical basis. A maximum likelihood...

https://pubmed.ncbi.nlm.nih.gov/1920447/

There've been a few attempts to solve long indel models. In 2004 Istvan Miklós, @GertonLunter, and I tried enumerating indel trajectories directly, up to 3 events. It gives an exact lower bound but is slow, can only do 30-40 bases & fails for longer times: https://pubmed.ncbi.nlm.nih.gov/14694074/

A "Long Indel" Model for Evolutionary Sequence Alignment - PubMed

We present a new probabilistic model of sequence evolution, allowing indels of arbitrary length, and give sequence alignment algorithms for our model. Previously implemented evolutionary models have...

https://pubmed.ncbi.nlm.nih.gov/14694074/

Most other attempts at solving this process have involved introducing artificial assumptions, like indivisible fragment boundaries in the sequence, that are hard to connect systematically to the underlying continuous-time process. Such as the TKF92 model: https://pubmed.ncbi.nlm.nih.gov/1556741/

Inching Toward Reality: An Improved Likelihood Model of Sequence Evolution - PubMed

Our previous evolutionary model is generalized to permit approximate treatment of multiple-base insertions and deletions as well as regional heterogeneity of substitution rates. Parameter estimation...

https://pubmed.ncbi.nlm.nih.gov/1556741/

This paper by Elena Rivas and Sean Eddy is another that uses this approach and guesses at closed form solutions. These approximations are analytically tractable, but they introduce additional assumptions & aren't really justified from first principles: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-015-0832-5

Parameterizing sequence alignment with an explicit evolutionary model

Inference of sequence homology is inherently an evolutionary question, dependent upon evolutionary divergence. However, the insertion and deletion penalties in the most widely used methods for...

https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-015-0832-5

Recently @Nicola_De_Maio developed a breakthrough approach, developing ODEs for how the mean indel length evolves. His paper is in press but he has talked about it publicly including at ProbGen 2019. I was 100% inspired by this work, Nicola deserves credit https://probgen2019.sciencesconf.org/data/pages/ProbGen2019_abstracts_book.pdf

My approach is to represent the alignment distribution as an HMM 𝔽(t) and the infinitesimal generator as another HMM 𝔾(Δt). Each has 3 states. Multiplying them together (using automata algebra) gives you a 9-state HMM 𝔽(t)𝔾(Δt) which we map back to 𝔽(t+Δt) by coarse-graining

Taking the infinitesimal limit Δt→0 leads to differential equations for the transition probabilities of the approximating HMM 𝔽(t). These ODEs can be solved numerically... for the long indel case. For the special case of single-residue indels, TKF91 emerges as an exact solution

I show with simulations that this is a better fit than Nicola's model (which was already quite good) & often beats our 2004 approximation too. I also conjecture a parallel set of ODEs for sufficient statistics to use Expectation Maximization to train the rate parameters.

This is a problem I've speculated about for 25 years and a personal itch I'm satisfied to scratch. I think this approach (multiplying an Pair HMM with an infinitesimal generator, then matching transition probabilities by renormalizing) might be usable for more complex models

All the code implementing these methods, including the simulation benchmark and a clean implementation of the Miklós et al 2004 algorithm for calculating trajectory likelihoods in a continuous-time Markov chain, is available at https://github.com/ihh/trajectory-likelihood

ihh/trajectory-likelihood

Calculate indel trajectory likelihoods using method of Miklós, Lunter & Holmes (2004) - ihh/trajectory-likelihood

https://github.com/ihh/trajectory-likelihood

Latest Threads Unrolled: