Book contents
- Frontmatter
- Dedication
- Contents
- Preface
- Glossary
- Notation
- PART I BASIC TECHNIQUES
- PART II MOLECULAR PHYLOGENETICS
- 8 Statistical Gene Tree Estimation Methods
- 9 Multiple Sequence Alignment
- 10 Phylogenomics: Constructing Species Phylogenies from Multi-Locus Data
- 11 Designing Methods for Large-Scale Phylogeny Estimation
- Appendix A Primer on Biological Data and Evolution
- Appendix B Algorithm Design and Analysis
- Appendix C Guidelines for Writing Papers About Computational Methods
- Appendix D Projects
- References
- Index
9 - Multiple Sequence Alignment
from PART II - MOLECULAR PHYLOGENETICS
Published online by Cambridge University Press: 26 October 2017
Summary
Introduction
Phylogeny estimation generally begins by estimating a multiple sequence alignment on the set of sequences. Once the multiple sequence alignment is computed, a tree can then be computed on the alignment (Figure 9.1). Not surprisingly, errors in multiple sequence alignment estimation tend to produce errors in estimated trees (Ogden and Rosenberg, 2006; Nelesen et al., 2008; Liu et al., 2009a; Wang et al., 2012) and other downstream analyses. Hence, multiple sequence alignment is an important part of phylogeny estimation.
As we have seen, there are many methods for estimating trees from gap-free data. However, because multiple sequence alignments almost always contain gaps, represented as dashes, phylogeny estimation methods must be modified to analyze alignments with dashes. Typically this is done by treating the dashes as missing data (i.e., an actual nucleotide or amino acid is assumed to occupy the position, but its identity is not known). Alternatively, the dashes are sometimes treated as an additional state in the sequence evolution model, thus producing five states for nucleotide alignments or 21 states for amino acid alignments. Finally, sites (i.e., columns in the multiple sequence alignment) containing dashes are sometimes eliminated from the alignment before a tree is computed. These different treatments of gaps can result in quite different theoretical and empirical performance.
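The three treatments of gaps described above can be illustrated with a minimal Python sketch. It assumes a toy representation in which an alignment is a non-empty list of equal-length strings over the nucleotide alphabet plus the dash; the function names, the `?` symbol for missing data, and the integer encoding are hypothetical choices made for illustration and are not taken from any particular phylogenetics package.

```python
GAP = "-"

def gaps_as_missing(alignment, missing="?"):
    """Treatment 1: recode each dash as a missing-data symbol."""
    return [row.replace(GAP, missing) for row in alignment]

def gaps_as_state(alignment, alphabet="ACGT"):
    """Treatment 2: treat the dash as an extra character state.

    Each row is returned as a list of integer state indices; for
    nucleotides this gives five states (A, C, G, T, and the gap).
    """
    states = {symbol: index for index, symbol in enumerate(alphabet + GAP)}
    return [[states[symbol] for symbol in row] for row in alignment]

def drop_gapped_columns(alignment):
    """Treatment 3: remove every site (column) containing a dash."""
    n_sites = len(alignment[0])
    keep = [j for j in range(n_sites)
            if all(row[j] != GAP for row in alignment)]
    return ["".join(row[j] for j in keep) for row in alignment]

# A toy alignment of three nucleotide sequences (illustrative data only).
alignment = ["AC-GT",
             "A-CGT",
             "ACCGT"]

print(gaps_as_missing(alignment))
print(gaps_as_state(alignment))
print(drop_gapped_columns(alignment))
```

In this toy alignment, removing gapped sites discards two of the five columns, which shows how much signal the third treatment can throw away even when gaps are sparse.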
Multiple sequence alignments are computed for different purposes, including phylogeny estimation and protein structure prediction, and the definition of what constitutes a correct alignment depends, at least in part, on the purpose of the alignment. For some biological datasets, curated alignments, typically based on experimentally confirmed structural features of the molecules (e.g., secondary structures or tertiary structures of RNAs and proteins), are used as benchmarks for evaluating alignment methods. Examples of such benchmarks for evaluating large amino acid alignments include HomFam (Sievers et al., 2011), BAliBASE (Thompson et al., 1999), and the 10AA collection (Nguyen et al., 2015b), while the Comparative Ribosomal Website (CRW) provides benchmarks for RNA alignment (Cannone et al., 2002). Evolutionary alignments, on the other hand, are defined by the evolutionary history relating the sequences.
- Type: Chapter
- Information: Computational Phylogenetics: An Introduction to Designing Methods for Phylogeny Estimation, pp. 178-233. Publisher: Cambridge University Press. Print publication year: 2017