
Approximate Sampling Formulae for General Finite-Alleles Models of Mutation

Published online by Cambridge University Press:  04 January 2016

Anand Bhaskar*
Affiliation:
University of California, Berkeley
John A. Kamm**
Affiliation:
University of California, Berkeley
Yun S. Song***
Affiliation:
University of California, Berkeley
* Postal address: Computer Science Division, University of California, Berkeley, CA 94720, USA.
** Postal address: Department of Statistics, University of California, Berkeley, CA 94720, USA.
*** Postal address: Computer Science Division and Department of Statistics, University of California, Berkeley, CA 94720, USA. Email address: yss@stat.berkeley.edu

Abstract


Many applications in genetic analysis use sampling distributions, which describe the probability of observing a sample of DNA sequences randomly drawn from a population. In the one-locus case with special models of mutation, such as the infinite-alleles model or the finite-alleles parent-independent mutation model, closed-form sampling distributions under the coalescent have been known for many decades. However, no exact formula is currently known for more general models of mutation that are of biological interest. In this paper, models with finitely many alleles are considered, and an urn construction related to the coalescent is used to derive approximate closed-form sampling formulae for an arbitrary irreducible recurrent mutation model or for a reversible recurrent mutation model, depending on whether the number of distinct observed allele types is at most three or four, respectively. It is demonstrated empirically that the formulae derived here are highly accurate when the per-base mutation rate is low, which holds for many biological organisms.
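
As an illustration of the kind of closed-form sampling distribution that is already known, the following minimal sketch evaluates the classical formula for the finite-alleles parent-independent mutation (PIM) model, in which the sample counts follow a Dirichlet-multinomial distribution. It is not the approximate formula derived in this paper. The function names, the argument names, and the parametrization (a population-scaled mutation rate `theta` and a mutation matrix with entries P_ij = pi_j) are assumptions made for this example; other scaling conventions for `theta` exist.

```python
from math import factorial, prod


def rising_factorial(x, n):
    """Return x * (x + 1) * ... * (x + n - 1); equals 1 when n == 0."""
    result = 1.0
    for i in range(n):
        result *= x + i
    return result


def pim_sampling_probability(counts, theta, pi):
    """Probability of an unordered sample configuration under the PIM model.

    counts : number of sampled copies of each allele type (hypothetical name)
    theta  : population-scaled mutation rate (assumed parametrization)
    pi     : stationary allele probabilities, with P_ij = pi_j (assumed)
    """
    n = sum(counts)
    # Multinomial coefficient n! / (n_1! * ... * n_K!), kept as an exact integer.
    multinomial = factorial(n)
    for c in counts:
        multinomial //= factorial(c)
    # Product of rising factorials (theta * pi_j)_(n_j) over allele types.
    numerator = prod(rising_factorial(theta * p, c) for p, c in zip(pi, counts))
    return multinomial * numerator / rising_factorial(theta, n)
```

For example, `pim_sampling_probability((1, 1), 1.0, (0.5, 0.5))` evaluates to 0.25, and for a fixed sample size the probabilities of all configurations sum to one.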

Type
General Applied Probability
Copyright
© Applied Probability Trust 
