Book contents
- Frontmatter
- Contents
- Preface
- Acknowledgments
- 1 The Central Dogma
- 2 RNA Secondary Structure
- 3 Comparing DNA Sequences
- 4 Predicting Species: Statistical Models
- 5 Substitution Matrices for Amino Acids
- 6 Sequence Databases
- 7 Local Alignment and the BLAST Heuristic
- 8 Statistics of BLAST Database Searches
- 9 Multiple Sequence Alignment I
- 10 Multiple Sequence Alignment II
- 11 Phylogeny Reconstruction
- 12 Protein Motifs and PROSITE
- 13 Fragment Assembly
- 14 Coding Sequence Prediction with Dicodons
- 15 Satellite Identification
- 16 Restriction Mapping
- 17 Rearranging Genomes: Gates and Hurdles
- A Drawing RNA Cloverleaves
- B Space-Saving Strategies for Alignment
- C A Data Structure for Disjoint Sets
- D Suggestions for Further Reading
- Bibliography
- Index
4 - Predicting Species: Statistical Models
Published online by Cambridge University Press: 05 June 2012
Summary
Suppose we are given a strand of DNA and asked to determine whether it comes from corn (Zea mays) or from fruit flies (Drosophila melanogaster). One very simple way to attack this problem is to analyze the relative frequencies of the nucleotides in the strand. Even before the double-helix structure of DNA was determined, researchers had observed that, while the numbers of Gs and Cs in a DNA were roughly equal (and likewise for As and Ts), the relative numbers of G + C and A + T differed from species to species. This relationship is usually expressed as percent GC, and species are said to be GC-rich or GC-poor. Corn is slightly GC-poor, with 49% GC. Fruit fly is GC-rich, with 55% GC.
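Percent GC is straightforward to compute from a sequence string. The following is a minimal sketch in Python; the function name `gc_percent` and the example string are illustrative choices and are not taken from the book.

```python
def gc_percent(seq: str) -> float:
    """Return the percentage of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    # Count only G and C; every other character counts toward the length.
    gc = sum(1 for base in seq if base in "GC")
    return 100.0 * gc / len(seq)


# Made-up example: 4 of the 8 bases are G or C.
print(gc_percent("GGCCAATT"))  # 50.0
```

By this measure, a strand scoring below 50 would be called GC-poor and one scoring above 50 GC-rich.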
We examine the first ten bases of our DNA and see: GATGTCGTAT. Is this DNA from corn or fruit fly?
First of all, it should be clear that we cannot get a definitive answer to the question by observing bases, especially just a few. Corn's protein-coding sequences are distinctly GC-rich, while its noncoding DNA is GC-poor. Such variations in GC content within a single genome are sometimes exploited to find the starting point of genes in the genome (near so-called CpG islands). In the absence of additional information, the best we can hope for is to learn whether it's “more likely” that we have corn or fly DNA, and how much more likely.
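One way to put a number on "how much more likely" is a likelihood ratio. The sketch below is an illustration under stated assumptions, not necessarily the chapter's own method: it assumes each base is drawn independently, that each species' GC fraction (0.49 for corn, 0.55 for fruit fly) is split evenly between G and C, and that the remainder is split evenly between A and T. The names `base_probs`, `log_likelihood`, and `likelihood_ratio` are hypothetical.

```python
import math


def base_probs(gc_fraction: float) -> dict:
    """Assumed model: G and C share gc_fraction equally; A and T share the rest."""
    return {
        "G": gc_fraction / 2,
        "C": gc_fraction / 2,
        "A": (1 - gc_fraction) / 2,
        "T": (1 - gc_fraction) / 2,
    }


CORN = base_probs(0.49)  # 49% GC
FLY = base_probs(0.55)   # 55% GC


def log_likelihood(seq: str, probs: dict) -> float:
    """Log-probability of seq when bases are independent draws from probs."""
    return sum(math.log(probs[base]) for base in seq.upper())


def likelihood_ratio(seq: str, model_a: dict, model_b: dict) -> float:
    """How many times more likely seq is under model_a than under model_b."""
    return math.exp(log_likelihood(seq, model_a) - log_likelihood(seq, model_b))


ratio = likelihood_ratio("GATGTCGTAT", CORN, FLY)
print(f"corn-to-fly likelihood ratio: {ratio:.1f}")
```

Under these assumptions the ten observed bases, four of which are G or C, come out roughly 1.3 times more likely under the corn model than under the fly model. That is a mild preference rather than a definitive answer, which is all so short a sample can support.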
- Type: Chapter
- Information: Genomic Perl: From Bioinformatics Basics to Working Code, pp. 44-54
- Publisher: Cambridge University Press
- Print publication year: 2002