On the First k Moments of the Random Count of a Pattern in a Multistate Sequence Generated by a Markov Source

G. Nuel

doi:10.1239/jap/1294170523

Part of: Distribution theory; Markov processes

Published online by Cambridge University Press: 14 July 2016

G. Nuel*
Affiliation: Paris Descartes University
* Postal address: MAP5, Department of Applied Mathematics, CNRS 8145, Paris Descartes University, 49 rue des Saints-Pères, F-75006 Paris, France. Email address: gregory.nuel@parisdescartes.fr

Abstract

In this paper we develop an explicit formula that allows us to compute the first k moments of the random count of a pattern in a multistate sequence generated by a Markov source. We derive efficient algorithms that allow us to deal with any pattern (low or high complexity) in any Markov model (homogeneous or not). We then apply these results to the distribution of DNA patterns in genomic sequences, and we show that moment-based developments (namely Edgeworth's expansion and Gram-Charlier type-B series) allow us to improve the reliability of common asymptotic approximations, such as Gaussian or Poisson approximations.
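To make the quantity studied in the abstract concrete, the following is a minimal Python sketch that computes the first k raw moments of the (overlapping) count of a single word in a sequence generated by a homogeneous first-order Markov chain. It is not the explicit formula or the optimized algorithm of the paper: it simply embeds the pattern in a deterministic finite automaton and propagates the exact count distribution by dynamic programming, which is practical only for short sequences. The function names (`build_dfa`, `count_pmf`, `raw_moments`) and the toy two-letter model are assumptions made for illustration.

```python
def build_dfa(pattern, alphabet):
    """Hypothetical helper: automaton whose state q is the length of the
    longest prefix of `pattern` that is a suffix of the text read so far."""
    m = len(pattern)
    delta = []
    for q in range(m + 1):
        row = {}
        for a in alphabet:
            s = pattern[:q] + a
            k = min(m, len(s))
            # shrink k until pattern[:k] is a suffix of s (k = 0 always works)
            while k > 0 and not s.endswith(pattern[:k]):
                k -= 1
            row[a] = k
        delta.append(row)
    return delta


def count_pmf(pattern, alphabet, init, trans, n):
    """Exact distribution of the number of overlapping occurrences of
    `pattern` in X_1..X_n (n >= 1), where X is a first-order Markov chain
    with initial law `init[a]` and transitions `trans[a][b]`."""
    m = len(pattern)
    delta = build_dfa(pattern, alphabet)
    max_count = max(n - m + 1, 0)
    # dist[(last_letter, dfa_state)][c] = P(last letter, DFA state, count = c)
    dist = {}
    for a in alphabet:
        q = delta[0][a]
        vec = [0.0] * (max_count + 1)
        vec[1 if q == m else 0] += init[a]
        dist[(a, q)] = vec
    for _ in range(n - 1):
        new = {}
        for (a, q), vec in dist.items():
            for b in alphabet:
                p = trans[a][b]
                if p == 0.0:
                    continue
                r = delta[q][b]
                hit = 1 if r == m else 0  # reaching state m means one occurrence
                tgt = new.setdefault((b, r), [0.0] * (max_count + 1))
                for c, w in enumerate(vec):
                    if w:
                        tgt[c + hit] += p * w
        dist = new
    # marginalize over (last letter, DFA state)
    pmf = [0.0] * (max_count + 1)
    for vec in dist.values():
        for c, w in enumerate(vec):
            pmf[c] += w
    return pmf


def raw_moments(pmf, k):
    """First k raw moments E[N^j], j = 1..k, of a count with distribution pmf."""
    return [sum((c ** j) * p for c, p in enumerate(pmf)) for j in range(1, k + 1)]


if __name__ == "__main__":
    # Toy model (assumed): uniform i.i.d. letters written as a Markov chain.
    alphabet = "ab"
    init = {"a": 0.5, "b": 0.5}
    trans = {"a": {"a": 0.5, "b": 0.5}, "b": {"a": 0.5, "b": 0.5}}
    pmf = count_pmf("aa", alphabet, init, trans, 3)
    print(pmf)                  # [0.625, 0.25, 0.125]
    print(raw_moments(pmf, 2))  # [0.5, 0.75]
```

The dynamic program stores a full count vector per state, so its cost grows with the sequence length times the maximal count; avoiding that cost is precisely what a direct moment formula is for. Moments obtained this way could then feed a moment-based approximation such as an Edgeworth expansion.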

Keywords

Optimal Markov chain embedding; deterministic finite automaton; moment generating function; Edgeworth's expansion; Gram-Charlier series

MSC classification

Secondary: 60J10: Markov chains (discrete-time Markov processes on discrete state spaces); 62E15: Exact distribution theory; 62E17: Approximations to distributions (nonasymptotic)

Information

Type: Research Article
Information: Journal of Applied Probability, Volume 47, Issue 4, December 2010, pp. 1105–1123

DOI: https://doi.org/10.1239/jap/1294170523
Copyright: Copyright © Applied Probability Trust 2010
