Thread by @andrewjroger, Long branch attraction explained! Ed Susko and I have been investigating the [...]

Long branch attraction explained! Ed Susko and I have been investigating the reasons why long branches tend to attract in phylogenetic analyses. This paper provides new insights into why this occurs even when the phylogenetic model is 'correct' (thread) http://tinyurl.com/kybmkvw7

Long Branch Attraction Biases in Phylogenetics

Abstract. Long branch attraction is a prevalent form of bias in phylogenetic estimation but the reasons for it are only partially understood. We argue here that


We've known about the long branch attraction (LBA) artefact in phylogenetics since Felsenstein's original paper in 1978. However, for maximum likelihood and Bayesian methods, LBA was thought to be a concern mainly in cases where the model doesn't fit the data adequately.

LBA has proved to be a major concern for larger-scale phylogenomic analyses. However, early simulation studies (e.g. Huelsenbeck 1995) showed that LBA also occurred when the model of evolution was correct but the number of sites was not large.

The reasons why LBA occurred under a correct model were unclear. Furthermore, it was unclear why LBA was such a common kind of artefact in phylogenetics.

In this paper, we show that LBA is a common bias in tree estimation because trees with long branches together are more flexible models than alternative trees where long branches are apart.

In other words, the model 'space' is larger for the LBA tree than for its alternatives and so, with small samples, it has a higher chance of being selected as the ML tree.
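To make the small-sample setting concrete, here is a minimal simulation sketch. It is not the analysis from the paper. Everything in it is an assumption made for illustration: the two-state symmetric (CFN) model, the four-taxon tree AB|CD with long branches on the non-sister taxa A and C, the branch lengths, and the crude coordinate-wise optimiser. It simulates short alignments under the true tree, fits all three topologies by maximum likelihood under the same (correct) model, and counts how often each one wins.

```python
import collections
import itertools
import math
import random

# The three unrooted topologies for taxa A,B,C,D (indices 0..3), as two cherries.
TOPOLOGIES = {
    "AB|CD": ((0, 1), (2, 3)),  # true tree in this sketch
    "AC|BD": ((0, 2), (1, 3)),  # the 'LBA tree': long branches A and C together
    "AD|BC": ((0, 3), (1, 2)),
}
PATTERNS = list(itertools.product((0, 1), repeat=4))


def edge_prob(theta, x, y):
    # CFN transition probability, with theta = exp(-2 * branch_length).
    return 0.5 * (1.0 + theta) if x == y else 0.5 * (1.0 - theta)


def pattern_probs(topology, theta, patterns=PATTERNS):
    """Site-pattern probabilities. theta[0..3] are the terminal edges of
    taxa 0..3 and theta[4] is the internal edge."""
    (i, j), (k, l) = topology
    probs = {}
    for pat in patterns:
        total = 0.0
        for u in (0, 1):          # state at the internal node joining i and j
            for v in (0, 1):      # state at the internal node joining k and l
                total += (0.5
                          * edge_prob(theta[i], u, pat[i])
                          * edge_prob(theta[j], u, pat[j])
                          * edge_prob(theta[4], u, v)
                          * edge_prob(theta[k], v, pat[k])
                          * edge_prob(theta[l], v, pat[l]))
        probs[pat] = total
    return probs


def max_loglik(topology, counts, sweeps=6, steps=15):
    """Crude maximised log-likelihood: coordinate ascent with a ternary
    search per edge. The log-likelihood is concave in each theta separately,
    but the result may still be only a local optimum."""
    theta = [0.5] * 5
    observed = list(counts)

    def loglik(th):
        probs = pattern_probs(topology, th, observed)
        return sum(n * math.log(probs[pat]) for pat, n in counts.items())

    def with_value(idx, value):
        trial = list(theta)
        trial[idx] = value
        return trial

    for _ in range(sweeps):
        for idx in range(5):
            lo, hi = 0.0, 0.999  # upper bound < 1 keeps all probabilities > 0
            for _ in range(steps):
                m1 = lo + (hi - lo) / 3.0
                m2 = hi - (hi - lo) / 3.0
                if loglik(with_value(idx, m1)) < loglik(with_value(idx, m2)):
                    lo = m1
                else:
                    hi = m2
            theta[idx] = 0.5 * (lo + hi)
    return loglik(theta)


def run_simulation(reps=20, n_sites=100, long_len=1.0, short_len=0.05, seed=0):
    """Count how often each topology is the ML tree across replicates."""
    rng = random.Random(seed)
    long_t, short_t = math.exp(-2.0 * long_len), math.exp(-2.0 * short_len)
    # Long terminal branches on A and C, short on B, D and the internal edge.
    true_theta = [long_t, short_t, long_t, short_t, short_t]
    probs = pattern_probs(TOPOLOGIES["AB|CD"], true_theta)
    weights = [probs[pat] for pat in PATTERNS]
    wins = dict.fromkeys(TOPOLOGIES, 0)
    for _ in range(reps):
        counts = collections.Counter(rng.choices(PATTERNS, weights=weights, k=n_sites))
        # Near-ties are broken by dictionary order in this sketch.
        best = max(TOPOLOGIES, key=lambda name: max_loglik(TOPOLOGIES[name], counts))
        wins[best] += 1
    return wins


if __name__ == "__main__":
    print(run_simulation())
```

If the argument in this thread holds, the incorrect estimates should fall on AC|BD more often than on AD|BC when `n_sites` is small, and the bias should fade as `n_sites` grows. The optimiser here is rough, so treat the counts as a toy illustration rather than a result.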

This LBA bias under the correct model can cause problems in concatenated phylogenomic analyses if the data are partitioned by gene, each partition with its own branch-length parameters. Here the small-sample bias compounds into a strongly misleading LBA 'signal': http://tinyurl.com/4hawm8d7

Relative Importance of Modeling Site Pattern Heterogeneity Versus Partition-Wise Heterotachy in...

Abstract. Large taxa-rich genome-scale data sets are often necessary for resolving ancient phylogenetic relationships. But accurate phylogenetic inference requi


A similar LBA problem is expected to occur for 'coalescent aware' methods (e.g. ASTRAL) that utilize individual gene/protein tree estimates to get the final topology as shown in: http://tinyurl.com/4m2ou8qm

Long-Branch Attraction in Species Tree Estimation: Inconsistency of Partitioned Likelihood and...

Abstract. With advances in sequencing technologies, there are now massive amounts of genomic data from across all life, leading to the possibility that a robust


An important lesson here is that model complexity isn't just about the number of estimated parameters in the model. Model complexity is about how 'flexible' it is. LBA trees are frequently estimated because they are more flexible models than non-LBA trees.
(end thread)

Note: most of the conceptual and mathematical/statistical 'heavy-lifting' in this paper was done by the ever-brilliant Ed Susko: https://www.dal.ca/faculty/science/math-stats/faculty-staff/our-faculty/statistics/ed_susko.html

