
Viterbi training in PRISM

Published online by Cambridge University Press: 28 January 2014

TAISUKE SATO
Affiliation:
Tokyo Institute of Technology, 2-12-1 Ookayama, Meguro, Tokyo, Japan (e-mail: sato@mi.cs.titech.ac.jp)
KEIICHI KUBOTA
Affiliation:
Tokyo Institute of Technology, 2-12-1 Ookayama, Meguro, Tokyo, Japan (e-mail: kubota@mi.cs.titech.ac.jp)

Abstract

VT (Viterbi training), or hard expectation maximization (EM), is an efficient method of parameter learning for probabilistic models with hidden variables. Given an observation y, it searches for a state of the hidden variables x that maximizes p(x, y | θ) by coordinate ascent on the parameters θ and on x. In this paper we introduce VT to PRISM (PRogramming In Statistical Modeling), a logic-based probabilistic modeling system for generative models. VT improves PRISM in three ways. First, VT in PRISM converges faster than EM in PRISM because of VT's termination condition. Second, parameters learned by VT often show good prediction performance compared with those learned by EM. We conducted two parsing experiments with probabilistic grammars, learning parameters by a variety of inference methods: VT, EM, MAP (maximum a posteriori) and VB (variational Bayes). VT achieved the best parsing accuracy among them in both experiments. We also conducted a similar experiment on classification tasks in which, unlike with probabilistic grammars, the hidden variable is not the prediction target. We found that in such cases VT does not necessarily yield superior performance. Third, since VT always deals with the single probability of a single explanation, the Viterbi explanation, the exclusiveness condition imposed on PRISM programs is no longer required when parameters are learned by VT. Last but not least, because VT in PRISM is general and applicable to any PRISM program, it largely reduces the need for the user to develop a specific VT algorithm for a specific model. Furthermore, since VT in PRISM can be used simply by setting a PRISM flag appropriately, it makes VT easily accessible to (probabilistic) logic programmers.
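
The coordinate ascent described in the abstract can be illustrated outside PRISM with a minimal Python sketch. It is not the paper's algorithm or PRISM's implementation. It assumes a toy model, a mixture of biased coins, in which the hidden variable x is the coin that produced each observation y (a head count out of a fixed number of flips) and θ consists of the mixing weights and coin biases. The function names, the add-one smoothing constant and the stopping rule (halt when the hidden assignments no longer change) are all choices made for this illustration.

```python
import math


def log_joint(heads, flips, coin, mix, bias):
    """log p(x=coin, y=heads | theta), up to the binomial coefficient.

    The coefficient does not depend on the coin, so dropping it
    leaves the argmax over coins unchanged.
    """
    p = bias[coin]
    return (math.log(mix[coin])
            + heads * math.log(p)
            + (flips - heads) * math.log(1.0 - p))


def viterbi_training(data, flips, mix, bias, max_iter=100, smooth=1.0):
    """Hard EM for a mixture of coins (illustrative toy model).

    data: head counts, each out of `flips` tosses.
    mix, bias: initial mixing weights and head probabilities.
    smooth: pseudo-count (an assumption of this sketch) that keeps
            every probability strictly between 0 and 1.
    Returns (mix, bias, assignment).
    """
    k = len(mix)
    assignment = None
    for _ in range(max_iter):
        # Ascent in x: choose the most probable hidden coin per observation.
        new_assignment = [
            max(range(k), key=lambda c: log_joint(h, flips, c, mix, bias))
            for h in data
        ]
        # Stop as soon as the hidden assignments are stable.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Ascent in theta: re-estimate parameters from the hard counts.
        mix, bias = [], []
        for c in range(k):
            members = [h for h, a in zip(data, assignment) if a == c]
            mix.append((len(members) + smooth) / (len(data) + k * smooth))
            bias.append((sum(members) + smooth)
                        / (len(members) * flips + 2 * smooth))
    return mix, bias, assignment


data = [9, 8, 9, 1, 2, 1]
mix, bias, assignment = viterbi_training(data, 10, [0.5, 0.5], [0.6, 0.4])
print(assignment)  # [0, 0, 0, 1, 1, 1]
print(mix, bias)
```

Each pass updates only counts from the single best assignment rather than expectations over all assignments, which is what distinguishes hard EM from ordinary EM in this sketch.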

Type
Regular Papers
Copyright
Copyright © Cambridge University Press 2014 

