
Large deviations-based upper bounds on the expected relative length of longest common subsequences

Published online by Cambridge University Press:  01 July 2016

Raphael Hauser*
Affiliation:
University of Oxford
Servet Martínez∗∗
Affiliation:
Universidad de Chile
Heinrich Matzinger∗∗∗
Affiliation:
Universität Bielefeld and Georgia Institute of Technology
* Postal address: Oxford University Computing Laboratory, Wolfson Building, Parks Road, Oxford OX1 3QD, UK. Email address: hauser@comlab.ox.ac.uk
∗∗ Postal address: CMM-DIM-CNRS 2071, Universidad de Chile, Casilla 170-3 Correo 3, Santiago, Chile. Email address: smartine@dim.uchile.cl
∗∗∗ Postal address: Fakultät für Mathematik, Universität Bielefeld, D-33501 Bielefeld, Germany. Email address: matzing@mathematik.uni-bielefeld.de

Abstract

Consider the random variable $L_n$, defined as the length of a longest common subsequence of two random strings of length $n$ whose characters are independent and identically distributed over a finite alphabet. Chvátal and Sankoff showed that the limit $\gamma = \lim_{n\to\infty} \mathrm{E}[L_n]/n$ is well defined. The exact value of this constant is not known, but various methods for computing upper and lower bounds have been discussed in the literature. Even so, high-precision bounds are hard to come by. In this paper we discuss how large deviation theory can be used to derive a consistent sequence of upper bounds, $(q_m)_{m\in\mathbb{N}}$, on $\gamma$, and how Monte Carlo simulation can be used in theory to compute estimates, $\hat{q}_m$, of the $q_m$ such that, for given $\Xi > 0$ and $\Lambda \in (0,1)$, we have $\mathrm{P}[\gamma < \hat{q}_m < \gamma + \Xi] \ge \Lambda$. In other words, with high probability the result is an upper bound that approximates $\gamma$ to high precision. We establish $O((1-\Lambda)^{-1}\,\Xi^{-(4+\varepsilon)})$ as a theoretical upper bound on the complexity of computing $\hat{q}_m$ to the given level of accuracy and confidence. Finally, we discuss a practical heuristic based on our theoretical approach and discuss its empirical behavior.
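To make the quantities in the abstract concrete, the following is a minimal Python sketch, not the method of the paper. It computes the longest-common-subsequence length with the standard dynamic programme and averages it over independent uniform random strings to approximate E[L_n]/n for a fixed n. Because E[L_n] is superadditive, E[L_n]/n is at most γ, so this naive estimate targets a quantity below γ, whereas the paper's q_m are upper bounds. The function names, the uniform letter distribution, and the default parameters are illustrative assumptions.

```python
import random


def lcs_length(x, y):
    """Length of a longest common subsequence of x and y.

    Standard dynamic programme, O(len(x) * len(y)) time, O(len(y)) memory.
    """
    prev = [0] * (len(y) + 1)
    for a in x:
        curr = [0]
        for j, b in enumerate(y, start=1):
            if a == b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def estimate_mean_ratio(n, alphabet_size=2, trials=100, seed=0):
    """Monte Carlo estimate of E[L_n] / n for uniform i.i.d. letters.

    Assumption: letters are uniform on {0, ..., alphabet_size - 1}; the
    abstract only requires i.i.d. letters over a finite alphabet.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        x = [rng.randrange(alphabet_size) for _ in range(n)]
        y = [rng.randrange(alphabet_size) for _ in range(n)]
        total += lcs_length(x, y)
    return total / (trials * n)


if __name__ == "__main__":
    # The sample mean should tend to grow with n, approaching gamma from below.
    for n in (50, 100):
        print(n, estimate_mean_ratio(n))
```

The sample size and string lengths here are kept small so the sketch runs quickly. They are far too small to give the precision discussed in the abstract.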

Type
General Applied Probability
Copyright
Copyright © Applied Probability Trust 2006 

References

Aldous, D. and Diaconis, P. (1999). Longest increasing subsequences: from patience sorting to the Baik–Deift–Johansson theorem. Bull. Amer. Math. Soc. 36, 413–432.
Alexander, K. S. (1994). The rate of convergence of the mean length of the longest common subsequence. Ann. Appl. Prob. 4, 1074–1082.
Apostolico, A., Crochemore, M., Galil, Z. and Manber, U. (eds) (1993). Combinatorial Pattern Matching (Lecture Notes Comput. Sci. 684). Springer, Berlin.
Arratia, R. and Waterman, M. S. (1989). The Erdős–Rényi strong law for pattern matching with a given proportion of mismatches. Ann. Prob. 17, 1152–1169.
Arratia, R. and Waterman, M. S. (1994). A phase transition for the score in matching random sequences allowing deletions. Ann. Appl. Prob. 4, 200–225.
Arratia, R., Goldstein, L. and Gordon, L. (1989). Two moments suffice for Poisson approximations: the Chen–Stein method. Ann. Prob. 17, 9–25.
Arratia, R., Gordon, L. and Waterman, M. S. (1990). The Erdős–Rényi law in distribution, for coin tossing and sequence matching. Ann. Statist. 18, 539–570.
Azuma, K. (1967). Weighted sums of certain dependent random variables. Tôhoku Math. J. 19, 357–367.
Baeza-Yates, R. A., Gavaldà, R., Navarro, G. and Scheihing, R. (1999). Bounding the expected length of longest common subsequences and forests. Theory Comput. Systems 32, 435–452.
Baik, J., Deift, P. and Johansson, K. (1999). On the distribution of the length of the longest increasing subsequence of random permutations. J. Amer. Math. Soc. 12, 1119–1178.
Capocelli, R. M. (ed.) (1990). Sequences. Springer, New York.
Capocelli, R., De Santis, A. and Vaccaro, U. (eds) (1993). Sequences. II. Springer, New York.
Chvátal, V. and Sankoff, D. (1975). Longest common subsequences of two random sequences. J. Appl. Prob. 12, 306–315.
Dančík, V. and Paterson, M. (1995). Upper bounds for the expected length of a longest common subsequence of two binary sequences. Random Structures Algorithms 6, 449–458.
Decouvelaere, Q. (2003). Upper bounds for the LCS problem. Computing Laboratory, University of Oxford.
Deken, J. G. (1979). Some limit results for longest common subsequences. Discrete Math. 26, 17–31.
Hauser, R. and Matzinger, H. (2005). Local uniqueness of alignments with a fixed proportion of gaps. Res. Rep. NA-05/08, Numerical Analysis Group, Computing Laboratory, University of Oxford. Available at http://web.comlab.ox.ac.uk/oucl/publications/natr/na-05-08.html.
Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc. 58, 13–30.
Kiwi, M., Loebl, M. and Matoušek, J. (2004). Expected length of the longest common subsequence for large alphabets. In LATIN 2004: Theoretical Informatics (Lecture Notes Comput. Sci. 2976), Springer, Berlin, pp. 302–311.
Krengel, U. (1985). Ergodic Theorems. De Gruyter, Berlin.
Kruskal, J. B. (1983). An overview of sequence comparison: time warps, string edits, and macromolecules. SIAM Rev. 25, 201–237.
Lember, J. and Matzinger, H. (2005). Fluctuation of the LCS-score when letters are not equiprobable. Preprint.
Neuhauser, C. (1994). A Poisson approximation for sequence comparisons with insertions and deletions. Ann. Statist. 22, 1603–1629.
Paterson, M. and Dančík, V. (1994). Longest common subsequences. In Mathematical Foundations of Computer Science (Lecture Notes Comput. Sci. 841), Springer, Berlin, pp. 127–142.
Sankoff, D. and Kruskal, J. B. (eds) (1983). Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, Reading, MA.
Steele, J. M. (1986). An Efron–Stein inequality for nonsymmetric statistics. Ann. Statist. 14, 753–758.
Wagner, R. A. and Fischer, M. J. (1974). The string-to-string correction problem. J. Assoc. Comput. Mach. 21, 168–173.
Waterman, M. (1995). Introduction to Computational Biology. Chapman and Hall, London.
Waterman, M. S. (1984). General methods of sequence comparison. Bull. Math. Biol. 46, 473–500.
Waterman, M. S. (1994). Estimating statistical significance of sequence alignments. Phil. Trans. R. Soc. London B 344, 383–390.
Waterman, M. S. and Vingron, M. (1994). Sequence comparison significance and Poisson approximation. Statist. Sci. 9, 367–381.