
Statistical tools for discovering pseudo-periodicities in biological sequences

Published online by Cambridge University Press:  15 August 2002

Bernard Prum
Affiliation:
Laboratoire Statistique et Génome, URA 8071 du CNRS, La Génopole, Université d'Evry, France; prum@genopole.cnrs.fr.
Élisabeth de Turckheim
Affiliation:
Institut National de la Recherche Agronomique, BIA, 78352 Jouy-en-Josas, France; et@jouy.inra.fr.
Martin Vingron
Affiliation:
Max-Planck-Institut für Molekulare Genetik, Ihnestr. 73, 14195 Berlin, Germany; vingron@molgem.mpg.de.

Abstract


Many protein sequences present nontrivial periodicities, such as cysteine signatures and leucine heptads. These known periodicities probably represent only a small percentage of the periodic structures present in sequences, and it is useful to have general tools to detect such sequences and their periods in large sequence databases. We compare three statistics adapted from those used in time series analysis: a generalisation of the simple autocovariance based on a similarity score, and two statistics intended to increase the power of the method. The theoretical behaviour of these statistics is derived, and the corresponding tests are then described. We also present an application of these tests to a protein known to have sequence periodicity.
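As an illustration of the first statistic mentioned in the abstract, the following is a minimal sketch of an autocovariance generalised through a similarity score. It is not the authors' definition: the abstract does not give the formula, so the centring by the score expected under independence, the identity score, and the names `identity_score` and `similarity_autocovariance` are all assumptions made for this example. A substitution matrix could be passed in place of the identity score.

```python
from collections import Counter


def identity_score(a, b):
    """Hypothetical similarity score: 1 for identical residues, 0 otherwise."""
    return 1.0 if a == b else 0.0


def similarity_autocovariance(seq, max_lag, score=identity_score):
    """Return {lag: mean score at that lag minus the score expected by chance}.

    Assumption: the chance level is the mean score of two residues drawn
    independently from the empirical residue frequencies of `seq`.
    """
    n = len(seq)
    if not 1 <= max_lag < n:
        raise ValueError("max_lag must satisfy 1 <= max_lag < len(seq)")

    # Empirical residue frequencies of the sequence.
    freqs = {residue: count / n for residue, count in Counter(seq).items()}

    # Expected score of a pair of independent residues.
    expected = sum(
        pa * pb * score(a, b)
        for a, pa in freqs.items()
        for b, pb in freqs.items()
    )

    result = {}
    for lag in range(1, max_lag + 1):
        # Mean similarity between residues that are `lag` positions apart.
        observed = sum(score(seq[i], seq[i + lag]) for i in range(n - lag)) / (n - lag)
        result[lag] = observed - expected
    return result


if __name__ == "__main__":
    # Toy sequence: an exact heptad repeated ten times (illustrative only).
    toy = "LAEKTQR" * 10
    acov = similarity_autocovariance(toy, max_lag=10)
    best_lag = max(acov, key=acov.get)
    print(best_lag, acov[best_lag])
```

On a real protein the peaks would be far less sharp than in this toy repeat, which is why a formal test of the kind the paper describes is needed before a lag is declared a period.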

Type
Research Article
Copyright
© EDP Sciences, SMAI, 2001

