Multiple Sequence Alignment

Tandy Warnow

doi:10.1017/9781316882313.011

Introduction

Phylogeny estimation generally begins by estimating a multiple sequence alignment on the set of sequences. Once the multiple sequence alignment is computed, a tree can then be computed on the alignment (Figure 9.1). Not surprisingly, errors in multiple sequence alignment estimation tend to produce errors in estimated trees (Ogden and Rosenberg, 2006; Nelesen et al., 2008; Liu et al., 2009a; Wang et al., 2012) and other downstream analyses. Hence, multiple sequence alignment is an important part of phylogeny estimation.

As we have seen, there are many methods for estimating trees from gap-free data. However, because multiple sequence alignments almost always contain gaps, represented as dashes, phylogeny estimation methods must be modified to analyze alignments with dashes. Typically the dashes are treated as missing data (i.e., the position is assumed to hold an actual nucleotide or amino acid whose identity is not known). Alternatively, the dashes are sometimes treated as an additional state in the sequence evolution model, thus producing five states for nucleotide alignments or 21 states for amino acid alignments. Finally, sometimes sites (i.e., columns in the multiple sequence alignment) containing dashes are eliminated from the alignment before a tree is computed. These different treatments of gapped alignments can result in quite different theoretical and empirical performance.
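The three treatments of dashes described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the alignment is held as a dictionary mapping taxon names to equal-length strings, the function names (drop_gapped_columns, gaps_as_missing, gaps_as_state) and the toy alignment are hypothetical, and the use of '?' for missing data is a common convention rather than something specified here. It does not reproduce how any particular phylogeny package handles gaps internally.

```python
from typing import Dict, List

GAP = "-"


def _check_rectangular(alignment: Dict[str, str]) -> None:
    """Raise ValueError unless every row of the alignment has the same length."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) > 1:
        raise ValueError("all aligned sequences must have the same length")


def gaps_as_missing(alignment: Dict[str, str]) -> Dict[str, str]:
    """Treatment 1: recode each dash as '?', i.e. an unknown but real character."""
    _check_rectangular(alignment)
    return {name: seq.replace(GAP, "?") for name, seq in alignment.items()}


def gaps_as_state(alignment: Dict[str, str], alphabet: str = "ACGT") -> Dict[str, List[int]]:
    """Treatment 2: encode characters as integers, with the dash as one extra state.

    With the default nucleotide alphabet this yields five states (0..4).
    """
    _check_rectangular(alignment)
    states = {char: index for index, char in enumerate(alphabet + GAP)}
    return {name: [states[char] for char in seq] for name, seq in alignment.items()}


def drop_gapped_columns(alignment: Dict[str, str]) -> Dict[str, str]:
    """Treatment 3: delete every site (column) that contains at least one dash."""
    _check_rectangular(alignment)
    names = list(alignment)
    columns = zip(*(alignment[name] for name in names))
    keep = [i for i, column in enumerate(columns) if GAP not in column]
    return {name: "".join(alignment[name][i] for i in keep) for name in names}


# Hypothetical toy alignment: three taxa, five sites.
toy = {
    "s1": "AC-GT",
    "s2": "ACAGT",
    "s3": "A--GT",
}
```

On the toy alignment, the second and third sites each contain a dash, so removing gapped columns leaves only three of the five sites. This shows why that treatment can discard a large share of the data when gaps are common.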

Multiple sequence alignments are computed for different purposes, including phylogeny estimation and protein structure prediction, and the definition of what constitutes a correct alignment depends, at least in part, on the purpose of the alignment. For some biological datasets, curated alignments, typically based on experimentally confirmed structural features of the molecules (e.g., secondary structures or tertiary structures of RNAs and proteins), are used as benchmarks for evaluating alignment methods. Examples of such benchmarks for evaluating large amino acid alignments include HomFam (Sievers et al., 2011), BAliBASE (Thompson et al., 1999), and the 10AA collection (Nguyen et al., 2015b), while the Comparative RNA Web (CRW) Site provides benchmarks for RNA alignment (Cannone et al., 2002). Evolutionary alignments, on the other hand, are defined by the evolutionary history relating the sequences.
