Data from: Modeling site heterogeneity with posterior mean site frequency profiles accelerates accurate phylogenomic estimation
-
- Description:
- Proteins have distinct structural and functional constraints at different sites that lead to site-specific preferences for particular amino acid residues as the sequences evolve. Heterogeneity in the amino acid substitution process between sites is not modeled by commonly used empirical amino acid exchange matrices. Such model misspecification can lead to artefacts in phylogenetic estimation such as long-branch attraction. Although sophisticated site-heterogeneous mixture models have been developed to address this problem in both Bayesian and maximum likelihood (ML) frameworks, their formidable computational time and memory usage severely limit their use in large phylogenomic analyses. Here we propose a posterior mean site frequency (PMSF) method as a rapid and efficient approximation to full empirical profile mixture models for ML analysis. The PMSF approach assigns a conditional mean amino acid frequency profile to each site, calculated from a mixture model fitted to the data using a preliminary guide tree. These PMSF profiles can then be used for in-depth tree-searching in place of the full mixture model. Compared with widely used empirical mixture models with k classes, our implementation of PMSF in IQ-TREE (http://www.iqtree.org) speeds up the computation by approximately k/1.5-fold and requires a small fraction of the RAM. Furthermore, this speedup allows, for the first time, full nonparametric bootstrap analyses to be conducted under complex site-heterogeneous models on large concatenated data matrices. Our simulations and empirical data analyses demonstrate that PMSF can effectively ameliorate long-branch attraction artefacts. In some empirical and simulation settings PMSF provided more accurate estimates of phylogenies than the mixture models from which they derive.
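As an illustration of the profile calculation described above, the following is a minimal sketch, not the IQ-TREE implementation. It assumes that the mixture model has already been fitted on the guide tree, so that the class weights, the class frequency profiles, and the likelihood of each site under each class are available. The function name, argument names, and toy numbers are hypothetical; real profiles have 20 amino acid states, and a practical implementation would work with log-likelihoods to avoid underflow.

```python
def posterior_mean_site_frequencies(site_class_likelihoods, class_weights, class_profiles):
    """Return one posterior mean frequency profile per site.

    site_class_likelihoods[i][k]: likelihood of site i under mixture class k
    class_weights[k]:             fitted weight of class k
    class_profiles[k][a]:         frequency of state a in the profile of class k
    (All names are illustrative; none come from IQ-TREE.)
    """
    n_states = len(class_profiles[0])
    site_profiles = []
    for likelihoods in site_class_likelihoods:
        # Posterior probability of each class at this site (Bayes' rule).
        joint = [w * lik for w, lik in zip(class_weights, likelihoods)]
        total = sum(joint)
        posteriors = [j / total for j in joint]
        # Average the class profiles, weighted by the class posteriors.
        profile = [
            sum(p * class_profiles[k][a] for k, p in enumerate(posteriors))
            for a in range(n_states)
        ]
        site_profiles.append(profile)
    return site_profiles


# Toy input: 2 classes, 3 states, 2 sites (made-up numbers).
toy_profiles = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
toy_weights = [0.5, 0.5]
toy_likelihoods = [[0.03, 0.01], [0.002, 0.002]]
pmsf = posterior_mean_site_frequencies(toy_likelihoods, toy_weights, toy_profiles)
```

In the toy input, the first site is three times more likely under the first class, so its profile is pulled toward that class; the second site is equally likely under both classes, so its profile is the plain average of the two class profiles.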
Usage Notes:
simuLBA.C20F.fourtaxa.tar: 4 taxa, 20K sites
The sequence data (20K sites) were simulated under LG+C20+F+G for four-taxon trees under LBA-inducing conditions. The tree files are also included.
File: simuLBA.C20F.fourtaxa.tar.gz
simuLBA.C60F.fourtaxa.tar: 4 taxa, 20K sites
The sequence data (20K sites) were simulated under LG+C60+F+G for four-taxon trees under LBA-inducing conditions. The tree files are also included.
File: simuLBA.C60F.fourtaxa.tar.gz
simuLBR.C20F.fourtaxa.tar: 4 taxa, 20K sites
The sequence data were simulated under LG+C20+F+G for four-taxon trees under LBR-inducing conditions. The tree files are also included.
File: simuLBR.C20F.fourtaxa.tar.gz
simuLBR.C60F.fourtaxa.tar: 4 taxa, 20K sites
The sequence data were simulated under LG+C60+F+G for four-taxon trees under LBR-inducing conditions. The tree files are also included.
File: simuLBR.C60F.fourtaxa.tar.gz
simuLBR.8taxa.tre.seq
The sequence data were simulated under LG+C20+F+G for an 8-taxon tree under an LBR-inducing condition. The tree file is also included.
simuLBR.12taxa.tre.seq
The sequence data were simulated under LG+C20+F+G for a 12-taxon tree under an LBR-inducing condition. The tree file is also included.
simuLBR.16taxa.tre.seq
The sequence data were simulated under LG+C20+F+G for a 16-taxon tree under an LBR-inducing condition. The tree file is also included.
simuLBR.20taxa.tre.seq
The sequence data were simulated under LG+C20+F+G for a 20-taxon tree under an LBR-inducing condition. The tree file is also included.
Supplementary Materials: main file
File: PMSF.Sup.Materials.pdf
Supplementary Materials: file 2
File: PMSF.Sup.Materials.2.pdf
simuLBA.LGFG.tar
Simulation under LG+F+G for 4 taxa and 1000 sites under the LBA setting.
simuLBR.LGFG.tar
Simulation under LG+F+G for 4 taxa and 1000 sites under the LBR setting.
simuLBR.EXEHO.tar
Simulation under EX_EHO for 4 taxa and 6000 sites under the LBR setting.
simuLBA.EXEHO.tar
Simulation under EX_EHO for 4 taxa and 6000 sites under the LBA setting.
simu.amborella.JTT.tar
Simulated under JTT+F+G based on an Amborella/Angiosperm tree for 12549 sites; used for Fig. S25.
simu.Ord0245.tar
Bootstrap alignment files for fitting PMSF based on the ML tree estimated from Ord0245, one of the 300 proteins in the HSSP test datasets. These data were used for producing Fig. S1.
simuLBA.C20F.1050sites.tar
Simulation under LG+C20+F+G for 4 taxa and 1050 sites under the LBA setting.
simuLBR.C20F.1050sites.tar
Simulation under LG+C20+F+G for 4 taxa and 1050 sites under the LBR setting.
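This record does not describe the internal layout of the archives, so it may help to list their contents before extracting anything. The following is a minimal sketch using Python's standard tarfile module; the helper name is hypothetical, and the example path in the comment assumes the archive has been downloaded to the working directory.

```python
import tarfile


def list_archive_members(archive_path):
    """Return the names of the regular files stored in a tar archive."""
    # Mode "r:*" lets tarfile detect plain or compressed archives transparently.
    with tarfile.open(archive_path, "r:*") as archive:
        return [member.name for member in archive.getmembers() if member.isfile()]


# Example call (assumes the file is present locally):
# names = list_archive_members("simuLBA.C20F.fourtaxa.tar.gz")
```

The same helper works for the uncompressed .tar files listed above, since the compression is detected when the archive is opened.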
- Author(s):
- Wang, Huai-Chun (Dalhousie University); Minh, Bui Quang (University of Vienna); Susko, Edward (Dalhousie University); and Roger, Andrew J. (Dalhousie University)
-
- Source Repository:
- Dryad
- Publisher(s):
- Dryad
-
- Access:
- Public
-
- URL:
- http://datadryad.org/stash/dataset/doi:10.5061/dryad.gv1q5
-
- Publication date:
- 2017-08-04
-
- Keywords:
-
- Identifier:
- https://doi.org/10.5061/dryad.gv1q5
Geospatial information
There is no geographic information available for this record.
Citation
-
- APA Citation:
-
Wang, H.-C., Minh, B. Q., Susko, E., & Roger, A. J. (2017). Data from: Modeling site heterogeneity with posterior mean site frequency profiles accelerates accurate phylogenomic estimation [Data set]. Dryad. http://datadryad.org/stash/dataset/doi:10.5061/dryad.gv1q5