Large deviations-based upper bounds on the expected relative length of longest common subsequences

Raphael Hauser; Servet Martínez; Heinrich Matzinger

doi:10.1239/aap/1158685004

Part of: Combinatorics; Parametric inference; Chemistry

Published online by Cambridge University Press: 01 July 2016


Raphael Hauser∗ (University of Oxford)
Servet Martínez∗∗ (Universidad de Chile)
Heinrich Matzinger∗∗∗ (Universität Bielefeld and Georgia Institute of Technology)

∗ Postal address: Oxford University Computing Laboratory, Wolfson Building, Parks Road, Oxford OX1 3QD, UK. Email address: hauser@comlab.ox.ac.uk
∗∗ Postal address: CMM-DIM-CNRS 2071, Universidad de Chile, Casilla 170-3 Correo 3, Santiago, Chile. Email address: smartine@dim.uchile.cl
∗∗∗ Postal address: Fakultät für Mathematik, Universität Bielefeld, D-33501 Bielefeld, Germany. Email address: matzing@mathematik.uni-bielefeld.de

Abstract

Consider the random variable $L_n$ defined as the length of a longest common subsequence of two random strings of length $n$ whose characters are independent and identically distributed over a finite alphabet. Chvátal and Sankoff showed that the limit $\gamma = \lim_{n\to\infty} \mathrm{E}[L_n]/n$ is well defined. The exact value of this constant is not known, but various methods for the computation of upper and lower bounds have been discussed in the literature. Even so, high-precision bounds are hard to come by. In this paper we discuss how large deviation theory can be used to derive a consistent sequence of upper bounds, $(q_m)_{m\in\mathbb{N}}$, on $\gamma$, and how Monte Carlo simulation can be used in theory to compute estimates, $\hat{q}_m$, of the $q_m$ such that, for given $\Xi > 0$ and $\Lambda \in (0,1)$, we have $\mathrm{P}[\gamma < \hat{q}_m < \gamma + \Xi] \ge \Lambda$. In other words, with high probability the result is an upper bound that approximates $\gamma$ to high precision. We establish $O((1-\Lambda)^{-1}\,\Xi^{-(4+\varepsilon)})$ as a theoretical upper bound on the complexity of computing $\hat{q}_m$ to the given level of accuracy and confidence. Finally, we discuss a practical heuristic based on our theoretical approach and discuss its empirical behavior.
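To make the quantities in the abstract concrete, the following minimal Python sketch computes the length of a longest common subsequence with the classical dynamic program and averages it over random string pairs to estimate E[L_n]/n. It is an illustration only and is not the large-deviations procedure of the paper: because E[L_n] is superadditive in n, E[L_n]/n never exceeds γ, so a naive simulation of this kind approaches γ from below and does not by itself yield an upper bound. The function names `lcs_length` and `estimate_mean_ratio`, the uniform character distribution, and the default binary alphabet are assumptions made for this example.

```python
import random


def lcs_length(x, y):
    """Length of a longest common subsequence of x and y.

    Classical dynamic program, O(len(x) * len(y)) time, O(len(y)) memory.
    """
    prev = [0] * (len(y) + 1)  # row for the prefix of x processed so far
    for a in x:
        curr = [0]
        for j, b in enumerate(y, start=1):
            if a == b:
                curr.append(prev[j - 1] + 1)  # extend a common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))  # drop a or drop b
        prev = curr
    return prev[-1]


def estimate_mean_ratio(n, trials, alphabet="01", seed=0):
    """Monte Carlo estimate of E[L_n]/n for i.i.d. uniform characters.

    Hypothetical helper for illustration; the uniform distribution over
    `alphabet` is an assumption of this sketch.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        x = rng.choices(alphabet, k=n)
        y = rng.choices(alphabet, k=n)
        total += lcs_length(x, y)
    return total / (trials * n)


if __name__ == "__main__":
    # Sample mean of L_n / n for binary strings of length 200.
    print(estimate_mean_ratio(n=200, trials=50))
```

Increasing `n` moves the sample mean closer to γ, at quadratic cost per pair of strings, while increasing `trials` only reduces the sampling noise around E[L_n]/n.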

Keywords

Longest common subsequence problem; Chvátal-Sankoff constant; upper bound; large deviation theory; Monte Carlo simulation

MSC classification

Primary: 05A16: Asymptotic enumeration; 62F10: Point estimation

Secondary: 92E10: Molecular structure (graph-theoretic methods, methods of differential topology, etc.)

Type: General Applied Probability
Information: Advances in Applied Probability, Volume 38, Issue 3, September 2006, pp. 827–852

DOI: https://doi.org/10.1239/aap/1158685004
Copyright: Copyright © Applied Probability Trust 2006

References

Aldous, D. and Diaconis, P. (1999). Longest increasing subsequences: from patience sorting to the Baik–Deift–Johansson theorem. Bull. Amer. Math. Soc. 36, 413–432.

Alexander, K. S. (1994). The rate of convergence of the mean length of the longest common subsequence. Ann. Appl. Prob. 4, 1074–1082.

Apostolico, A., Crochemore, M., Galil, Z. and Manber, U. (eds) (1993). Combinatorial Pattern Matching (Lecture Notes Comput. Sci. 684). Springer, Berlin.

Arratia, R. and Waterman, M. S. (1989). The Erdős–Rényi strong law for pattern matching with a given proportion of mismatches. Ann. Prob. 17, 1152–1169.

Arratia, R. and Waterman, M. S. (1994). A phase transition for the score in matching random sequences allowing deletions. Ann. Appl. Prob. 4, 200–225.

Arratia, R., Goldstein, L. and Gordon, L. (1989). Two moments suffice for Poisson approximations: the Chen–Stein method. Ann. Prob. 17, 9–25.

Arratia, R., Gordon, L. and Waterman, M. S. (1990). The Erdős–Rényi law in distribution, for coin tossing and sequence matching. Ann. Statist. 18, 539–570.

Azuma, K. (1967). Weighted sums of certain dependent random variables. Tôhoku Math. J. 19, 357–367.

Baeza-Yates, R. A., Gavaldà, R., Navarro, G. and Scheihing, R. (1999). Bounding the expected length of longest common subsequences and forests. Theory Comput. Systems 32, 435–452.

Baik, J., Deift, P. and Johansson, K. (1999). On the distribution of the length of the longest increasing subsequence of random permutations. J. Amer. Math. Soc. 12, 1119–1178.

Capocelli, R. M. (ed.) (1990). Sequences. Springer, New York.

Capocelli, R., De Santis, A. and Vaccaro, U. (eds) (1993). Sequences. II. Springer, New York.

Chvátal, V. and Sankoff, D. (1975). Longest common subsequences of two random sequences. J. Appl. Prob. 12, 306–315.

Dančík, V. and Paterson, M. (1995). Upper bounds for the expected length of a longest common subsequence of two binary sequences. Random Structures Algorithms 6, 449–458.

Decouvelaere, Q. (2003). Upper bounds for the LCS problem. Computing Laboratory, University of Oxford.

Deken, J. G. (1979). Some limit results for longest common subsequences. Discrete Math. 26, 17–31.

Hauser, R. and Matzinger, H. (2005). Local uniqueness of alignments with a fixed proportion of gaps. Res. Rep. NA-05/08, Numerical Analysis Group, Computing Laboratory, University of Oxford. Available at http://web.comlab.ox.ac.uk/oucl/publications/natr/na-05-08.html.

Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc. 58, 13–30.

Kiwi, M., Loebl, M. and Matoušek, J. (2004). Expected length of the longest common subsequence for large alphabets. In LATIN 2004: Theoretical Informatics (Lecture Notes Comput. Sci. 2976), Springer, Berlin, pp. 302–311.

Krengel, U. (1985). Ergodic Theorems. De Gruyter, Berlin.

Kruskal, J. B. (1983). An overview of sequence comparison: time warps, string edits, and macromolecules. SIAM Rev. 25, 201–237.

Lember, J. and Matzinger, H. (2005). Fluctuation of the LCS-score when letters are not equiprobable. Preprint.

Neuhauser, C. (1994). A Poisson approximation for sequence comparisons with insertions and deletions. Ann. Statist. 22, 1603–1629.

Paterson, M. and Dančík, V. (1994). Longest common subsequences. In Mathematical Foundations of Computer Science (Lecture Notes Comput. Sci. 841), Springer, Berlin, pp. 127–142.

Sankoff, D. and Kruskal, J. B. (eds) (1983). Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, Reading, MA.

Steele, J. M. (1986). An Efron–Stein inequality for nonsymmetric statistics. Ann. Statist. 14, 753–758.

Wagner, R. A. and Fischer, M. J. (1974). The string-to-string correction problem. J. Assoc. Comput. Mach. 21, 168–173.

Waterman, M. (1995). Introduction to Computational Biology. Chapman and Hall, London.

Waterman, M. S. (1984). General methods of sequence comparison. Bull. Math. Biol. 46, 473–500.

Waterman, M. S. (1994). Estimating statistical significance of sequence alignments. Phil. Trans. R. Soc. London B 344, 383–390.

Waterman, M. S. and Vingron, M. (1994). Sequence comparison significance and Poisson approximation. Statist. Sci. 9, 367–381.
