Statistics on Words with Applications to Biological Sequences

M. Lothaire

doi:10.1017/CBO9781107341005.007

Chapter 6 - Statistics on Words with Applications to Biological Sequences

Published online by Cambridge University Press: 05 June 2013

M. Lothaire

Book contents

Get access

Summary

Introduction

Statistical and probabilistic properties of words in sequences have been of considerable interest in many fields, such as coding theory and reliability theory, and most recently in the analysis of biological sequences. The latter will serve as the key example in this chapter. We only consider finite words.

Two main aspects of word occurrences in biological sequences are: where do they occur and how many times do they occurfi An important problem, for instance, was to determine the statistical significance of a word frequency in a DNA sequence. The naive idea is the following: a word may be significantly rare in a DNA sequence because it disrupts replication or gene expression, (perhaps a negative selection factor), whereas a significantly frequent word may have a fundamental activity with regard to genome stability. Well-known examples of words with exceptional frequencies in DNA sequences are certain biological palindromes corresponding to restriction sites avoided, for instance in E. coli, and the Cross-over Hotspot Instigator sites in several bacteria. Identifying over- and underrepresented words in a particular genome is a very common task in genome analysis.

Statistical methods of studying the distribution of the word locations along a sequence and word frequencies have also been an active field of research; the goal of this chapter is to provide an overview of the state of this research.

Information

Type: Chapter
Information: Applied Combinatorics on Words , pp. 268 - 352

DOI: https://doi.org/10.1017/CBO9781107341005.007 [Opens in a new window]

Publisher: Cambridge University Press

Print publication year: 2005

Access options

Get access to the full version of this content by using one of the access options below. (Log in options will check for institutional or personal access. Content may require purchase if you do not have access.)

Book purchase

Temporarily unavailable

Accessibility standard: Unknown

Why this information is here

This section outlines the accessibility features of this content - including support for screen readers, full keyboard navigation and high-contrast display options. This may not be relevant for you.

Accessibility Information

Accessibility compliance for the PDF of this book is currently unknown and may be updated in the future.