Long branch attraction explained! Ed Susko and I have been investigating the reasons why long branches tend to attract in phylogenetic analyses. This paper provides new insights into why this occurs even when the phylogenetic model is 'correct' (thread) http://tinyurl.com/kybmkvw7
We've known about the long branch attraction (LBA) artefact in phylogenetics since Felsenstein's original paper in 1978. However, for maximum likelihood and Bayesian methods, LBA was thought to be a concern mainly in cases where the model doesn't fit the data adequately.
LBA has proved to be a major concern for larger-scale phylogenomic analyses. However, early simulation studies (e.g. Huelsenbeck 1995) showed that LBA also occurred when the model of evolution was correct but the number of sites was small.
The reasons why LBA occurred under a correct model were unclear. Furthermore, it was unclear why LBA was such a common kind of artefact in phylogenetics.
In this paper, we show that LBA is a common bias in tree estimation because trees with long branches together are more flexible models than alternative trees where long branches are apart.
In other words, the model 'space' is larger for the LBA tree than for its alternatives and so, with small samples, it has a higher chance of being selected as the ML tree.
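A toy analogy (not the calculation from the paper) may help make the 'larger model space' idea concrete. Assume a 2D Gaussian mean is being estimated from n observations with identity covariance, and two hypothetical models with the same number of parameters compete: a 'narrow' model that restricts the mean to the first quadrant, and a 'flexible' model that allows the other three quadrants. The true mean (delta, delta) sits inside the narrow model. The sketch below computes how often the flexible model nonetheless has the higher maximised likelihood; all names and numbers are illustrative assumptions.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_flexible_selected(delta, n):
    """Probability that the 'flexible' toy model beats the 'narrow' one.

    Toy setup (an assumption, not the paper's model): the sample mean of
    n observations is N((delta, delta), I/n). The narrow model (first
    quadrant) has the higher maximised likelihood only when both
    coordinates of the sample mean are positive; otherwise the flexible
    model (the remaining three quadrants) contains the sample mean and wins.
    """
    p_narrow = normal_cdf(delta * math.sqrt(n)) ** 2
    return 1.0 - p_narrow


if __name__ == "__main__":
    # The truth lies in the narrow model, yet with few observations the
    # flexible model is chosen more often than not.
    for n in (1, 10, 100):
        print(n, round(prob_flexible_selected(0.3, n), 3))
```

In this toy case the bias towards the flexible model fades as n grows, which mirrors the small-sample nature of the effect described above.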
This LBA bias under the correct model can cause problems in concatenated phylogenomic analyses if the data are partitioned by gene, each with its own branch-length parameters. Here the small-sample bias compounds into a strongly misleading LBA 'signal' http://tinyurl.com/4hawm8d7
A similar LBA problem is expected to occur for 'coalescent aware' methods (e.g. ASTRAL) that utilize individual gene/protein tree estimates to get the final topology as shown in: http://tinyurl.com/4m2ou8qm
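To see how a modest per-gene bias could add up across many genes, here is a minimal simulation sketch. It is not ASTRAL and not the analysis in the linked paper: it assumes a four-taxon case with three possible topologies, made-up per-gene probabilities that slightly favour the LBA tree, and a simple plurality vote over gene trees as a stand-in for a summary method.

```python
import random
from collections import Counter


def lba_plurality_rate(n_genes, probs, n_reps=200, seed=1):
    """Fraction of replicates in which the LBA topology wins a plurality vote.

    probs gives hypothetical per-gene probabilities of estimating the
    ("LBA", "true", "other") topology; these values are illustrative
    assumptions, not estimates from any data set.
    """
    rng = random.Random(seed)
    topologies = ["LBA", "true", "other"]
    wins = 0
    for _ in range(n_reps):
        # Draw one estimated topology per gene, then tally the votes.
        counts = Counter(rng.choices(topologies, weights=probs, k=n_genes))
        # Count only strict wins for the LBA topology (ties do not count).
        if counts["LBA"] > max(counts["true"], counts["other"]):
            wins += 1
    return wins / n_reps


if __name__ == "__main__":
    probs = (0.40, 0.35, 0.25)  # assumed small per-gene bias towards LBA
    for n_genes in (10, 100, 1000):
        print(n_genes, lba_plurality_rate(n_genes, probs))
```

Under these assumptions, adding more genes makes the wrong topology win more often rather than less, because the per-gene bias does not average out.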
An important lesson here is that model complexity isn't just about the number of estimated parameters in the model. Model complexity is about how 'flexible' it is. LBA trees are frequently estimated because they are more flexible models than non-LBA trees.
(end thread)
Note: most of the conceptual and mathematical/statistical 'heavy-lifting' in this paper was done by the ever-brilliant Ed Susko: https://www.dal.ca/faculty/science/math-stats/faculty-staff/our-faculty/statistics/ed_susko.html