Bibliography

Published online by Cambridge University Press:  18 November 2021

Hui Jiang
Affiliation: York University, Toronto

Type: Chapter
Information: Machine Learning Fundamentals: A Concise Introduction, pp. 381–396
Publisher: Cambridge University Press
Print publication year: 2021
Chapter DOI: https://doi.org/10.1017/9781108938051.022


References

Abramowitz, Milton and Stegun, Irene A. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. Mineola, NY: Dover, 1964 (cited on pages 331, 379).
Arjovsky, Martin, Chintala, Soumith, and Bottou, Léon. ‘Wasserstein Generative Adversarial Networks’. In: Proceedings of the 34th International Conference on Machine Learning. Ed. by Precup, Doina and Teh, Yee Whye. Vol. 70. Sydney, Australia: PMLR, 2017, pp. 214–223 (cited on page 295).
Asadi, Behnam and Jiang, Hui. ‘On Approximation Capabilities of ReLU Activation and Softmax Output Layer in Neural Networks’. In: CoRR abs/2002.04060 (2020) (cited on page 155).
Attias, Hagai. ‘Independent Factor Analysis’. In: Neural Computation 11.4 (1999), pp. 803–851. doi: 10.1162/089976699300016458 (cited on pages 293, 294, 301, 302).
Attias, Hagai. ‘A Variational Bayesian Framework for Graphical Models’. In: Advances in Neural Information Processing Systems 12. Cambridge, MA: MIT Press, 2000, pp. 209–215 (cited on pages 324, 326, 357).
Azevedo-Filho, Adriano. ‘Laplace's Method Approximations for Probabilistic Inference in Belief Networks with Continuous Variables’. In: Uncertainty in Artificial Intelligence. Ed. by de Mantaras, Ramon Lopez and Poole, David. San Francisco, CA: Morgan Kaufmann, 1994, pp. 28–36 (cited on page 324).
Ba, Lei Jimmy, Kiros, Jamie Ryan, and Hinton, Geoffrey E. ‘Layer Normalization’. In: CoRR abs/1607.06450 (2016) (cited on page 160).
Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. ‘Neural Machine Translation by Jointly Learning to Align and Translate’. In: 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, May 7–9, 2015, Conference Track Proceedings. ICLR, 2015 (cited on page 163).
Baker, James. ‘The DRAGON System—An Overview’. In: IEEE Transactions on Acoustics, Speech, and Signal Processing 23.1 (1975), pp. 24–29 (cited on pages 2, 3).
Bakir, Gökhan H. et al. Predicting Structured Data (Neural Information Processing). Cambridge, MA: MIT Press, 2007 (cited on page 4).
Baldi, P. and Hornik, K. ‘Neural Networks and Principal Component Analysis: Learning from Examples without Local Minima’. In: Neural Networks 2.1 (Jan. 1989), pp. 53–58. doi: 10.1016/0893-6080(89)90014-2 (cited on page 91).
Banerjee, Arindam et al. ‘Clustering on the Unit Hypersphere Using von Mises-Fisher Distributions’. In: Journal of Machine Learning Research 6 (Dec. 2005), pp. 1345–1382 (cited on page 379).
Barber, David. Bayesian Reasoning and Machine Learning. Cambridge, England: Cambridge University Press, 2012 (cited on pages 343, 357).
Bartholomew, David. Latent Variable Models and Factor Analysis: A Unified Approach. Chichester, England: Wiley, 2011 (cited on page 299).
Baum, Leonard E. ‘An Inequality and Associated Maximization Technique in Statistical Estimation for Probabilistic Functions of Markov Processes’. In: Inequalities 3 (1972), pp. 1–8 (cited on pages 276, 281).
Baum, Leonard E. and Petrie, Ted. ‘Statistical Inference for Probabilistic Functions of Finite State Markov Chains’. In: Annals of Mathematical Statistics 37.6 (Dec. 1966), pp. 1554–1563. doi: 10.1214/aoms/1177699147 (cited on page 276).
Baum, Leonard E. et al. ‘A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains’. In: Annals of Mathematical Statistics 41.1 (Feb. 1970), pp. 164–171. doi: 10.1214/aoms/1177697196 (cited on pages 276, 281).
Bell, A. J. and Sejnowski, T. J. ‘An Information Maximization Approach to Blind Separation and Blind Deconvolution’. In: Neural Computation 7 (1995), pp. 1129–1159 (cited on pages 293, 294).
Ben-David, Shai et al. ‘A Theory of Learning from Different Domains’. In: Machine Learning 79.1–2 (May 2010), pp. 151–175. doi: 10.1007/s10994-009-5152-4 (cited on page 16).
Berger, Adam L., Pietra, Stephen A. Della, and Pietra, Vincent J. Della. ‘A Maximum Entropy Approach to Natural Language Processing’. In: Computational Linguistics 22 (1996), pp. 39–71 (cited on page 254).
Bertsekas, Dimitri and Tsitsiklis, John. Introduction to Probability. Nashua, NH: Athena Scientific, 2002 (cited on page 40).
Bishop, Christopher M. Pattern Recognition and Machine Learning (Information Science and Statistics). 1st ed. New York, NY: Springer, 2007 (cited on pages 343, 344, 350, 357, 368).
Blei, David M., Ng, Andrew Y., and Jordan, Michael I. ‘Latent Dirichlet Allocation’. In: Journal of Machine Learning Research 3 (Mar. 2003), pp. 993–1022 (cited on pages 363, 365, 366).
Bottou, Léon. ‘On-Line Learning and Stochastic Approximations’. In: On-Line Learning in Neural Networks. Ed. by Saad, D. Cambridge, England: Cambridge University Press, 1998, pp. 9–42 (cited on page 61).
Bousquet, Olivier, Boucheron, Stéphane, and Lugosi, Gábor. ‘Introduction to Statistical Learning Theory’. In: Advanced Lectures on Machine Learning. Ed. by Bousquet, Olivier, von Luxburg, Ulrike, and Rätsch, Gunnar. Vol. 3176. Springer, 2003, pp. 169–207 (cited on pages 102, 103).
Box, G. E. P. and Tiao, G. C. Bayesian Inference in Statistical Analysis. Reading, MA: Addison-Wesley, 1973 (cited on page 318).
Box, M. J., Davies, D., and Swann, W. H. Non-Linear Optimisation Techniques. Edinburgh, Scotland: Oliver & Boyd, 1969 (cited on page 71).
Boyd, Stephen and Vandenberghe, Lieven. Convex Optimization. Cambridge, England: Cambridge University Press, 2004 (cited on page 50).
Boyd, Stephen et al. ‘Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers’. In: Foundations and Trends in Machine Learning 3.1 (Jan. 2011), pp. 1–122. doi: 10.1561/2200000016 (cited on page 71).
Breiman, Leo. ‘Bagging Predictors’. In: Machine Learning 24.2 (1996), pp. 123–140 (cited on pages 204, 208).
Breiman, Leo. ‘Stacked Regressions’. In: Machine Learning 24.1 (July 1996), pp. 49–64. doi: 10.1023/A:1018046112532 (cited on page 204).
Breiman, Leo. ‘Prediction Games and Arcing Algorithms’. In: Neural Computation 11.7 (Oct. 1999), pp. 1493–1517. doi: 10.1162/089976699300016106 (cited on page 210).
Breiman, Leo. ‘Random Forests’. In: Machine Learning 45.1 (2001), pp. 5–32. doi: 10.1023/A:1010933404324 (cited on pages 208, 209).
Breiman, Leo et al. Classification and Regression Trees. Monterey, CA: Wadsworth and Brooks, 1984 (cited on pages 7, 205).
Bridle, John S. ‘Probabilistic Interpretation of Feedforward Classification Network Outputs, with Relationships to Statistical Pattern Recognition’. In: Neurocomputing. Ed. by Soulié, Françoise Fogelman and Hérault, Jeanny. Berlin, Germany: Springer, 1990, pp. 227–236 (cited on pages 115, 159).
Bridle, John S. ‘Training Stochastic Model Recognition Algorithms as Networks Can Lead to Maximum Mutual Information Estimation of Parameters’. In: Advances in Neural Information Processing Systems (NIPS). Vol. 2. San Mateo, CA: Morgan Kaufmann, 1990, pp. 211–217 (cited on pages 115, 159).
Brown, Peter, Lee, Chin-Hui, and Spohrer, J. ‘Bayesian Adaptation in Speech Recognition’. In: ICASSP ’83. IEEE International Conference on Acoustics, Speech, and Signal Processing. Vol. 8. Washington, D.C.: IEEE Computer Society, 1983, pp. 761–764 (cited on page 16).
Brown, Peter et al. ‘A Statistical Approach to Language Translation’. In: Proceedings of the 12th Conference on Computational Linguistics—Volume 1. COLING ’88. Budapest, Hungary: Association for Computational Linguistics, 1988, pp. 71–76. doi: 10.3115/991635.991651 (cited on pages 2, 3).
Candès, E. J. and Wakin, M. B. ‘An Introduction to Compressive Sampling’. In: IEEE Signal Processing Magazine 25.2 (2008), pp. 21–30 (cited on page 146).
Chaikin, P. M. and Lubensky, T. C. Principles of Condensed Matter Physics. Cambridge, England: Cambridge University Press, 1995 (cited on page 327).
Chang, Chih-Chung and Lin, Chih-Jen. ‘LIBSVM: A Library for Support Vector Machines’. In: ACM Transactions on Intelligent Systems and Technology 2.3 (2011). Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm, 27:1–27:27 (cited on page 125).
Chen, Tianqi and Guestrin, Carlos. ‘XGBoost: A Scalable Tree Boosting System’. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Ed. by Krishnapuram, Balaji. New York, NY: Association for Computing Machinery, Aug. 2016. doi: 10.1145/2939672.2939785 (cited on page 215).
Cho, Kyunghyun et al. ‘Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation’. In: EMNLP. Ed. by Moschitti, Alessandro, Pang, Bo, and Daelemans, Walter. Stroudsburg, PA: Association for Computational Linguistics, 2014, pp. 1724–1734 (cited on page 171).
De Cock, Dean. ‘Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project’. In: Journal of Statistics Education 19 (Nov. 2011). doi: 10.1080/10691898.2011.11889627 (cited on page 216).
Cortes, Corinna and Vapnik, Vladimir. ‘Support-Vector Networks’. In: Machine Learning 20.3 (Sept. 1995), pp. 273–297. doi: 10.1023/A:1022627411411 (cited on page 124).
Crammer, Koby and Singer, Yoram. ‘On the Algorithmic Implementation of Multiclass Kernel-Based Vector Machines’. In: Journal of Machine Learning Research 2 (Mar. 2002), pp. 265–292 (cited on page 127).
Cybenko, G. ‘Approximation by Superpositions of a Sigmoidal Function’. In: Mathematics of Control, Signals, and Systems (MCSS) 2.4 (Dec. 1989), pp. 303–314. doi: 10.1007/BF02551274 (cited on page 154).
Dasarathy, B. V. and Sheela, B. V. ‘A Composite Classifier System Design: Concepts and Methodology’. In: Proceedings of the IEEE. Vol. 67. Washington, D.C.: IEEE Computer Society, 1979, pp. 708–713 (cited on page 203).
Davis, Steven B. and Mermelstein, Paul. ‘Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences’. In: IEEE Transactions on Acoustics, Speech and Signal Processing 28.4 (1980), pp. 357–366 (cited on page 77).
Deerwester, Scott et al. ‘Indexing by Latent Semantic Analysis’. In: Journal of the American Society for Information Science 41.6 (1990), pp. 391–407 (cited on page 142).
DeGroot, M. H. Optimal Statistical Decisions. New York, NY: McGraw-Hill, 1970 (cited on page 318).
Dempster, A. P., Laird, N. M., and Rubin, D. B. ‘Maximum Likelihood from Incomplete Data via the EM Algorithm’. In: Journal of the Royal Statistical Society, Series B 39.1 (1977), pp. 1–38 (cited on pages 265, 315).
Dharmadhikari, S. W. and Jogdeo, Kumar. ‘Multivariate Unimodality’. In: Annals of Statistics 4.3 (May 1976), pp. 607–613. doi: 10.1214/aos/1176343466 (cited on page 239).
Domingos, Pedro. ‘A Few Useful Things to Know about Machine Learning’. In: Communications of the ACM 55.10 (Oct. 2012), pp. 78–87. doi: 10.1145/2347736.2347755 (cited on pages 14, 15).
Duchi, John, Hazan, Elad, and Singer, Yoram. ‘Adaptive Subgradient Methods for Online Learning and Stochastic Optimization’. In: Journal of Machine Learning Research 12 (July 2011), pp. 2121–2159 (cited on page 192).
Duda, Richard O. and Hart, Peter E. Pattern Classification and Scene Analysis. New York, NY: John Wiley & Sons, 1973 (cited on page 2).
Duda, Richard O., Hart, Peter E., and Stork, David G. Pattern Classification. 2nd ed. New York, NY: Wiley, 2001 (cited on pages 7, 11, 226).
Elahi, Mehdi, Ricci, Francesco, and Rubens, Neil. ‘A Survey of Active Learning in Collaborative Filtering Recommender Systems’. In: Computer Science Review 20.C (May 2016), pp. 29–50. doi: 10.1016/j.cosrev.2016.05.002 (cited on page 17).
Everitt, B. and Hand, D. J. Finite Mixture Distributions. Monographs on Applied Probability and Statistics. New York, NY: Springer, 1981 (cited on page 257).
Fahlman, Scott E. An Empirical Study of Learning Speed in Back-Propagation Networks. Tech. rep. CMU-CS-88-162. Pittsburgh, PA: Computer Science Department, Carnegie Mellon University, 1988 (cited on page 63).
Ferguson, Thomas S. ‘A Bayesian Analysis of Some Nonparametric Problems’. In: The Annals of Statistics 1 (1973), pp. 209–230 (cited on page 333).
Finkelstein, Lev et al. ‘Placing Search in Context: The Concept Revisited’. In: Proceedings of the 10th International Conference on World Wide Web. New York, NY: Association for Computing Machinery, 2001, pp. 406–414. doi: 10.1145/503104.503110 (cited on page 149).
Fiscus, Jonathan. ‘A Post-Processing System to Yield Reduced Word Error Rates: Recognizer Output Voting Error Reduction (ROVER)’. In: IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings. Washington, D.C.: IEEE Computer Society, Aug. 1997, pp. 347–354 (cited on page 203).
Fisher, R. A. ‘The Use of Multiple Measurements in Taxonomic Problems’. In: Annals of Eugenics 7.7 (1936), pp. 179–188 (cited on page 85).
Fletcher, R. Practical Methods of Optimization. 2nd ed. Hoboken, NJ: Wiley-Interscience, 1987 (cited on page 63).
Forgy, E. ‘Cluster Analysis of Multivariate Data: Efficiency versus Interpretability of Classification’. In: Biometrics 21.3 (1965), pp. 768–769 (cited on pages 5, 270).
Foucart, Simon and Rauhut, Holger. A Mathematical Introduction to Compressive Sensing. Basel, Switzerland: Birkhäuser, 2013 (cited on page 146).
Freund, Yoav and Schapire, Robert E. ‘A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting’. In: Journal of Computer and System Sciences 55.1 (Aug. 1997), pp. 119–139. doi: 10.1006/jcss.1997.1504 (cited on pages 204, 210, 214).
Freund, Yoav and Schapire, Robert E. ‘Large Margin Classification Using the Perceptron Algorithm’. In: Proceedings of the Eleventh Annual Conference on Computational Learning Theory. COLT ’98. Madison, Wisconsin: ACM, 1998, pp. 209–217. doi: 10.1145/279943.279985 (cited on page 111).
Frey, Brendan J. Graphical Models for Machine Learning and Digital Communication. Cambridge, MA: MIT Press, 1998 (cited on page 357).
Frey, Brendan J. and MacKay, David J. C. ‘A Revolution: Belief Propagation in Graphs with Cycles’. In: Advances in Neural Information Processing Systems 10. Ed. by Jordan, M. I., Kearns, M. J., and Solla, S. A. Cambridge, MA: MIT Press, 1998, pp. 479–485 (cited on page 357).
Friedman, Jerome H. ‘Greedy Function Approximation: A Gradient Boosting Machine’. In: Annals of Statistics 29 (2000), pp. 1189–1232 (cited on pages 210, 211, 215).
Friedman, Jerome H. ‘Stochastic Gradient Boosting’. In: Computational Statistics and Data Analysis 38.4 (Feb. 2002), pp. 367–378. doi: 10.1016/S0167-9473(01)00065-2 (cited on pages 211, 215).
Friedman, Jerome, Hastie, Trevor, and Tibshirani, Rob. ‘Additive Logistic Regression: A Statistical View of Boosting’. In: The Annals of Statistics 28.2 (2000) (cited on pages 211, 212, 215).
Friedman, Jerome, Hastie, Trevor, and Tibshirani, Rob. ‘Regularization Paths for Generalized Linear Models via Coordinate Descent’. In: Journal of Statistical Software 33.1 (2010), pp. 1–22. doi: 10.18637/jss.v033.i01 (cited on page 140).
Fukushima, Kunihiko. ‘Neocognitron: A Self-Organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position’. In: Biological Cybernetics 36 (1980), pp. 193–202 (cited on page 157).
Gauvain, J. and Lee, Chin-Hui. ‘Maximum a Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains’. In: IEEE Transactions on Speech and Audio Processing 2.2 (1994), pp. 291–298 (cited on page 16).
Geisser, S. Predictive Inference: An Introduction. New York, NY: Chapman & Hall, 1993 (cited on page 314).
Ghahramani, Zoubin. Non-Parametric Bayesian Methods. 2005. url: http://mlg.eng.cam.ac.uk/zoubin/talks/uai05tutorial-b.pdf (visited on 03/10/2020) (cited on page 335).
Glick, Ned. ‘Sample-Based Classification Procedures Derived from Density Estimators’. In: Journal of the American Statistical Association 67 (1972), pp. 116–122 (cited on pages 229, 230).
Glick, Ned. ‘Sample-Based Classification Procedures Related to Empiric Distributions’. In: IEEE Transactions on Information Theory 22 (1976), pp. 454–461 (cited on page 229).
Glorot, Xavier and Bengio, Yoshua. ‘Understanding the Difficulty of Training Deep Feedforward Neural Networks’. In: Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS’10). Society for Artificial Intelligence and Statistics, 2010, pp. 249–256 (cited on pages 153, 190).
Good, I. J. ‘The Population Frequencies of Species and the Estimation of Population Parameters’. In: Biometrika 40.3–4 (Dec. 1953), pp. 237–264. doi: 10.1093/biomet/40.3-4.237 (cited on page 250).
Goodfellow, Ian et al. ‘Generative Adversarial Nets’. In: Advances in Neural Information Processing Systems 27. Ed. by Ghahramani, Z. et al. Red Hook, NY: Curran Associates, Inc., 2014, pp. 2672–2680 (cited on pages 293–295, 307, 308).
Gregor, Karol et al. ‘DRAW: A Recurrent Neural Network for Image Generation’. In: Proceedings of the 32nd International Conference on Machine Learning. Ed. by Bach, Francis and Blei, David. Vol. 37. Proceedings of Machine Learning Research. Lille, France: PMLR, July 2015, pp. 1462–1471 (cited on page 295).
Grezl, F. et al. ‘Probabilistic and Bottle-Neck Features for LVCSR of Meetings’. In: 2007 IEEE International Conference on Acoustics, Speech and Signal Processing. Vol. 4. Washington, D.C.: IEEE Computer Society, 2007, pp. 757–760 (cited on page 91).
Gruber, M. H. J. Improving Efficiency by Shrinkage: The James–Stein and Ridge Regression Estimators. Boca Raton, FL: CRC Press, 1998, pp. 7–15 (cited on page 139).
Guyon, Isabelle and Elisseeff, André. ‘An Introduction to Variable and Feature Selection’. In: Journal of Machine Learning Research 3 (Mar. 2003), pp. 1157–1182 (cited on page 78).
Haff, L. R. ‘An Identity for the Wishart Distribution with Applications’. In: Journal of Multivariate Analysis 9.4 (Dec. 1979), pp. 531–544 (cited on page 322).
Hansen, L. K. and Salamon, P. ‘Neural Network Ensembles’. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 12.10 (Oct. 1990), pp. 993–1001. doi: 10.1109/34.58871 (cited on page 203).
Harris, Zellig. ‘Distributional Structure’. In: Word 10.2–3 (1954), pp. 146–162 (cited on pages 5, 77, 142).
Hastie, Trevor, Tibshirani, Robert, and Friedman, Jerome. The Elements of Statistical Learning. Springer Series in Statistics. New York, NY: Springer, 2001 (cited on pages 138, 205, 207).
Hellman, Martin E. and Raviv, Josef. ‘Probability of Error, Equivocation and the Chernoff Bound’. In: IEEE Transactions on Information Theory 16 (1970), pp. 368–372 (cited on page 226).
Hermansky, H., Ellis, D. P. W., and Sharma, S. ‘Tandem Connectionist Feature Extraction for Conventional HMM Systems’. In: 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. Vol. 3. Washington, D.C.: IEEE Computer Society, 2000, pp. 1635–1638 (cited on page 91).
Hihi, Salah El and Bengio, Yoshua. ‘Hierarchical Recurrent Neural Networks for Long-Term Dependencies’. In: Advances in Neural Information Processing Systems 8. Ed. by Touretzky, D. S., Mozer, M. C., and Hasselmo, M. E. Cambridge, MA: MIT Press, 1996, pp. 493–499 (cited on page 171).
Hinton, Geoffrey E. ‘Training Products of Experts by Minimizing Contrastive Divergence’. In: Neural Computation 14.8 (2002), pp. 1771–1800. doi: 10.1162/089976602760128018 (cited on page 371).
Hinton, Geoffrey E. ‘A Practical Guide to Training Restricted Boltzmann Machines’. In: Neural Networks: Tricks of the Trade. Ed. by Montavon, Grégoire, Orr, Genevieve B., and Müller, Klaus-Robert. 2nd ed. Vol. 7700. New York, NY: Springer, 2012, pp. 599–619 (cited on pages 366, 370).
Hinton, Geoffrey and Roweis, Sam. ‘Stochastic Neighbor Embedding’. In: Advances in Neural Information Processing Systems. Ed. by Becker, S., Thrun, S., and Obermayer, K. Vol. 15. Cambridge, MA: MIT Press, 2003, pp. 833–840 (cited on page 89).
Ho, Tin Kam. ‘Random Decision Forests’. In: Proceedings of the Third International Conference on Document Analysis and Recognition (Volume 1). ICDAR ’95. Washington, D.C.: IEEE Computer Society, 1995, p. 278 (cited on pages 208, 209).
Ho, Tin Kam, Hull, Jonathan J., and Srihari, Sargur N. ‘Decision Combination in Multiple Classifier Systems’. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 16.1 (Jan. 1994), pp. 66–75. doi: 10.1109/34.273716 (cited on page 203).
Hochreiter, Sepp and Schmidhuber, Jürgen. ‘Long Short-Term Memory’. In: Neural Computation 9.8 (Nov. 1997), pp. 1735–1780. doi: 10.1162/neco.1997.9.8.1735 (cited on page 171).
Hornik, Kurt. ‘Approximation Capabilities of Multilayer Feedforward Networks’. In: Neural Networks 4.2 (Mar. 1991), pp. 251–257. doi: 10.1016/0893-6080(91)90009-T (cited on pages 154, 155).
Hotelling, H. ‘Analysis of a Complex of Statistical Variables into Principal Components’. In: Journal of Educational Psychology 24.6 (1933), pp. 417–441. doi: 10.1037/h0071325 (cited on page 80).
Huo, Qiang. ‘An Introduction to Decision Rules for Automatic Speech Recognition’. Technical Report TR-99-07. Hong Kong: Department of Computer Science and Information Systems, University of Hong Kong, 1999 (cited on page 229).
Huo, Qiang and Lee, Chin-Hui. ‘On-Line Adaptive Learning of the Continuous Density Hidden Markov Model Based on Approximate Recursive Bayes Estimate’. In: IEEE Transactions on Speech and Audio Processing 5.2 (1997), pp. 161–172 (cited on page 17).
Hussein, Ahmed et al. ‘Imitation Learning: A Survey of Learning Methods’. In: ACM Computing Surveys 50.2 (Apr. 2017). doi: 10.1145/3054912 (cited on page 17).
Hyvärinen, Aapo and Oja, Erkki. ‘Independent Component Analysis: Algorithms and Applications’. In: Neural Networks 13 (2000), pp. 411–430 (cited on pages 293, 294, 301).
Ioffe, Sergey and Szegedy, Christian. ‘Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift’. In: Proceedings of the 32nd International Conference on International Conference on Machine Learning—Volume 37. ICML’15. Lille, France: Journal of Machine Learning Research, 2015, pp. 448–456 (cited on page 160).
Jaakkola, Tommi S. and Jordan, Michael I. A Variational Approach to Bayesian Logistic Regression Models and Their Extensions. 1996. url: https://people.csail.mit.edu/tommi/papers/aistat96.ps (visited on 11/10/2019) (cited on page 326).
Jackson, Peter. Introduction to Expert Systems. 2nd ed. USA: Addison-Wesley Longman Publishing Co., Inc., 1990 (cited on page 2).
Jarrett, Kevin et al. ‘What Is the Best Multi-Stage Architecture for Object Recognition?’ In: 2009 IEEE 12th International Conference on Computer Vision. Washington, D.C.: IEEE Computer Society, 2009, pp. 2146–2153 (cited on page 153).
Jelinek, F., Bahl, L. R., and Mercer, R. L. ‘Design of a Linguistic Statistical Decoder for the Recognition of Continuous Speech’. In: IEEE Transactions on Information Theory 21 (1975), pp. 250–256 (cited on pages 2, 3).
Jensen, Finn V. Introduction to Bayesian Networks. 1st ed. Berlin, Germany: Springer-Verlag, 1996 (cited on page 343).
Jensen, J. L. W. V. ‘Sur les fonctions convexes et les inégalités entre les valeurs moyennes’. In: Acta Mathematica 30.1 (1906), pp. 175–193 (cited on page 46).
Jiang, Hui. ‘A New Perspective on Machine Learning: How to Do Perfect Supervised Learning’. In: CoRR abs/1901.02046 (2019) (cited on page 13).
Johnson, Richard Arnold and Wichern, Dean W. Applied Multivariate Statistical Analysis. 5th ed. Upper Saddle River, NJ: Prentice Hall, 2002 (cited on page 378).
Jones, Karen Spärck. ‘A Statistical Interpretation of Term Specificity and Its Application in Retrieval’. In: Journal of Documentation 28 (1972), pp. 11–21 (cited on page 78).
Jordan, Michael I., ed. Learning in Graphical Models. Cambridge, MA: MIT Press, 1999 (cited on page 343).
Jordan, Michael I. et al. ‘An Introduction to Variational Methods for Graphical Models’. In: Learning in Graphical Models. Ed. by Jordan, Michael I. Dordrecht, Netherlands: Springer, 1998, pp. 105–161. doi: 10.1007/978-94-011-5014-9_5 (cited on page 357).
Juang, B. H. ‘Maximum-Likelihood Estimation for Mixture Multivariate Stochastic Observations of Markov Chains’. In: AT&T Technical Journal 64.6 (July 1985), pp. 1235–1249. doi: 10.1002/j.1538-7305.1985.tb00273.x (cited on page 284).
Juang, B. H. and Rabiner, L. R. ‘The Segmental K-Means Algorithm for Estimating Parameters of Hidden Markov Models’. In: IEEE Transactions on Acoustics, Speech, and Signal Processing 38.9 (Sept. 1990), pp. 1639–1641. doi: 10.1109/29.60082 (cited on page 286).
Kalman, Rudolph Emil. ‘A New Approach to Linear Filtering and Prediction Problems’. In: Journal of Basic Engineering 82.1 (1960), pp. 35–45 (cited on page 69).
Karras, Tero, Laine, Samuli, and Aila, Timo. ‘A Style-Based Generator Architecture for Generative Adversarial Networks’. In: CoRR abs/1812.04948 (2018) (cited on page 295).
Karush, William. ‘Minima of Functions of Several Variables with Inequalities as Side Conditions’. MA thesis. Chicago, IL: Department of Mathematics, University of Chicago, 1939 (cited on page 57).
Katz, Slava M. ‘Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer’. In: IEEE Transactions on Acoustics, Speech and Signal Processing. 1987, pp. 400–401 (cited on page 250).
Kechris, Alexander S. Classical Descriptive Set Theory. Berlin, Germany: Springer-Verlag, 1995 (cited on page 291).
Kendall, M. G., Stuart, A., and Ord, J. K. Kendall's Advanced Theory of Statistics. Oxford, England: Oxford University Press, 1987 (cited on page 323).
Kinderman, R. and Snell, S. L. Markov Random Fields and Their Applications. Ann Arbor, MI: American Mathematical Society, 1980 (cited on pages 344, 366).
Kingma, Diederik P. and Ba, Jimmy. ‘Adam: A Method for Stochastic Optimization’. In: CoRR abs/1412.6980 (2014) (cited on page 192).
Kingma, Diederik P. and Welling, Max. ‘Auto-Encoding Variational Bayes’. In: 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14–16, 2014, Conference Track Proceedings. ICLR, 2014 (cited on pages 293, 294, 305, 306).
Koren, Yehuda, Bell, Robert, and Volinsky, Chris. ‘Matrix Factorization Techniques for Recommender Systems’. In: Computer 42.8 (Aug. 2009), pp. 30–37. doi: 10.1109/MC.2009.263 (cited on page 143).
Kramer, Mark A. ‘Nonlinear Principal Component Analysis Using Autoassociative Neural Networks’. In: AIChE Journal 37.2 (1991), pp. 233–243. doi: 10.1002/aic.690370209 (cited on page 90).
Krogh, Anders and Hertz, John A. ‘A Simple Weight Decay Can Improve Generalization’. In: Advances in Neural Information Processing Systems 4. Ed. by Moody, J. E., Hanson, S. J., and Lippmann, R. P. Burlington, MA: Morgan-Kaufmann, 1992, pp. 950–957 (cited on page 194).
Kschischang, F. R., Frey, B. J., and Loeliger, H. A. ‘Factor Graphs and the Sum-Product Algorithm’. In: IEEE Transactions on Information Theory 47.2 (Feb. 2001), pp. 498–519. doi: 10.1109/18.910572 (cited on pages 357, 360).
Kuhn, H. W. and Tucker, A. W. ‘Nonlinear Programming’. In: Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability. Berkeley, CA: University of California Press, 1951, pp. 481–492 (cited on page 57).
Kulis, Brian. ‘Metric Learning: A Survey’. In: Foundations and Trends in Machine Learning 5.4 (2013), pp. 287–364. doi: 10.1561/2200000019 (cited on page 13).
Kullback, S. and Leibler, R. A. ‘On Information and Sufficiency’. In: Annals of Mathematical Statistics 22.1 (1951), pp. 79–86 (cited on page 41).
Lafferty, John D., McCallum, Andrew, and Pereira, Fernando C. N. ‘Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data’. In: Proceedings of the Eighteenth International Conference on Machine Learning. ICML ’01. San Francisco, CA: Morgan Kaufmann Publishers Inc., 2001, pp. 282–289 (cited on pages 366, 368, 369).
Laplace, Pierre Simon. ‘Memoir on the Probability of the Causes of Events’. In: Statistical Science 1.3 (1986), pp. 364–378 (cited on page 324).
Lauritzen, S. L. and Spiegelhalter, D. J. ‘Local Computations with Probabilities on Graphical Structures and Their Application to Expert Systems’. In: Journal of the Royal Statistical Society. Series B (Methodological) 50.2 (1988), pp. 157–224 (cited on pages 357, 361).
LeCun, Yann and Bengio, Yoshua. ‘Convolutional Networks for Images, Speech, and Time Series’. In: The Handbook of Brain Theory and Neural Networks. Ed. by Arbib, Michael A. Cambridge, MA: MIT Press, 1998, pp. 255–258 (cited on page 157).
LeCun, Yann et al. ‘Gradient-Based Learning Applied to Document Recognition’. In: Proceedings of the IEEE 86.11 (1998), pp. 2278–2324 (cited on pages 92, 129, 200).
Lee, Chin-Hui and Huo, Qiang. ‘On Adaptive Decision Rules and Decision Parameter Adaptation for Automatic Speech Recognition’. In: Proceedings of the IEEE 88.8 (2000), pp. 1241–1269 (cited on page 16).
Leggetter, C. J. and Woodland, P. C. ‘Maximum Likelihood Linear Regression for Speaker Adaptation of Continuous Density Hidden Markov Models’. In: Computer Speech & Language 9.2 (1995), pp. 171–185. doi: 10.1006/csla.1995.0010 (cited on page 16).
Linnainmaa, Seppo. ‘Taylor Expansion of the Accumulated Rounding Error’. In: BIT Numerical Mathematics 16.2 (June 1976), pp. 146–160. doi: 10.1007/BF01931367 (cited on page 176).
Liu, Quan et al. ‘Learning Semantic Word Embeddings Based on Ordinal Knowledge Constraints’. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Beijing, China: Association for Computational Linguistics, July 2015, pp. 1501–1511. doi: 10.3115/v1/P15-1145 (cited on page 149).
Lloyd, Stuart P. ‘Least Squares Quantization in PCM’. In: IEEE Transactions on Information Theory 28 (1982), pp. 129–137 (cited on page 270).
Long, Jonathan, Shelhamer, Evan, and Darrell, Trevor. ‘Fully Convolutional Networks for Semantic Segmentation’. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Washington, D.C.: IEEE Computer Society, June 2015 (cited on pages 198, 309).
Lowe, David G. ‘Object Recognition from Local Scale-Invariant Features’. In: Proceedings of the International Conference on Computer Vision. ICCV ’99. Washington, D.C.: IEEE Computer Society, 1999, p. 1150 (cited on page 77).
van der Maaten, Laurens and Hinton, Geoffrey. ‘Visualizing Data Using t-SNE’. In: Journal of Machine Learning Research 9 (2008), pp. 2579–2605 (cited on page 89).
MacKay, David J. C. ‘The Evidence Framework Applied to Classification Networks’. In: Neural Computation 4.5 (1992), pp. 720–736. doi: 10.1162/neco.1992.4.5.720 (cited on page 326).
MacKay, David J. C. ‘Introduction to Gaussian Processes’. In: Neural Networks and Machine Learning. Ed. by Bishop, C. M. NATO ASI Series. Amsterdam, Netherlands: Kluwer Academic Press, 1998, pp. 133–166 (cited on page 333).
MacKay, David J. C. Information Theory, Inference, and Learning Algorithms. Cambridge, England: Cambridge University Press, 2003 (cited on page 324).
MacKay, David J. C. ‘Good Error-Correcting Codes Based on Very Sparse Matrices’. In: IEEE Transactions on Information Theory 45.2 (Mar. 1999), pp. 399–431. doi: 10.1109/18.748992 (cited on page 357).
MacKay, David J. C. ‘Introduction to Monte Carlo Methods’. In: Learning in Graphical Models. Ed. by Jordan, Michael I. Dordrecht, Netherlands: Springer, 1998, pp. 175–204. doi: 10.1007/978-94-011-5014-9_7 (cited on pages 357, 361).
Mahoney, Matt. Large Text Compression Benchmark. 2011. url: http://mattmahoney.net/dc/textdata.html (visited on 11/10/2019) (cited on page 149).
Mairal, Julien et al. ‘Online Learning for Matrix Factorization and Sparse Coding’. In: Journal of Machine Learning Research 11 (Mar. 2010), pp. 19–60 (cited on page 145).
Maritz, J. S. and Lwin, T. Empirical Bayes Methods. London, England: Chapman & Hall, 1989 (cited on page 323).
Maron, M. E. ‘Automatic Indexing: An Experimental Inquiry’. In: Journal of the ACM 8.3 (July 1961), pp. 404–417. doi: 10.1145/321075.321084 (cited on page 362).
Martens, James. ‘Deep Learning via Hessian-Free Optimization’. In: Proceedings of the 27th International Conference on International Conference on Machine Learning. ICML’10. Haifa, Israel: Omnipress, 2010, pp. 735–742 (cited on page 63).
Mason, Llew et al. ‘Boosting Algorithms as Gradient Descent’. In: Proceedings of the 12th International Conference on Neural Information Processing Systems. NIPS’99. Denver, CO: MIT Press, 1999, pp. 512–518 (cited on pages 210, 212).
McLachlan, G. J. and Peel, D. Finite Mixture Models. New York, NY: Wiley, 2000 (cited on page 257).
Mead, A. ‘Review of the Development of Multidimensional Scaling Methods’. In: Journal of the Royal Statistical Society. Series D (The Statistician) 41.1 (1992), pp. 27–39 (cited on page 88).
Minka, T. P. ‘Expectation Propagation for Approximate Bayesian Inference’. In: Uncertainty in Artificial Intelligence. Vol. 17. Association for Uncertainty in Artificial Intelligence, 2001, pp. 362–369 (cited on page 357).
Mitchell, Tom M. Machine Learning. New York, NY: McGraw-Hill, 1997 (cited on page 2).
Mnih, Volodymyr et al. ‘Playing Atari with Deep Reinforcement Learning’. In: arXiv (2013). arXiv:1312.5602 (cited on page 15).
Mnih, Volodymyr et al. ‘Human-Level Control through Deep Reinforcement Learning’. In: Nature 518.7540 (Feb. 2015), pp. 529–533 (cited on page 16).
Nair, Vinod and Hinton, Geoffrey E. ‘Rectified Linear Units Improve Restricted Boltzmann Machines’. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10). ICML, 2010, pp. 807–814 (cited on page 153).
Neal, Radford M. ‘Bayesian Mixture Modeling’. In: Maximum Entropy and Bayesian Methods: Seattle, 1991. Ed. by Smith, C. Ray, Erickson, Gary J., and Neudorfer, Paul O. Dordrecht, Netherlands: Springer, 1992, pp. 197–211. doi: 10.1007/978-94-017-2219-3_14 (cited on page 333).
Neal, Radford M. and Hinton, Geoffrey E. ‘A View of the EM Algorithm That Justifies Incremental, Sparse, and Other Variants’. In: Learning in Graphical Models. Ed. by Jordan, Michael I. Dordrecht, Netherlands: Springer, 1998, pp. 355–368. doi: 10.1007/978-94-011-5014-9_12 (cited on page 327).
Nelder, J. A. and Wedderburn, R. W. M. ‘Generalized Linear Models’. In: Journal of the Royal Statistical Society, Series A, General 135 (1972), pp. 370–384 (cited on pages 239, 250).
Nesterov, Yurii. Introductory Lectures on Convex Optimization: A Basic Course. 1st ed. New York, NY: Springer, 2014 (cited on pages 49, 50).
Ney, H. and Ortmanns, S. ‘Progress in Dynamic Programming Search for LVCSR’. In: Proceedings of the IEEE 88.8 (Aug. 2000), pp. 1224–1240. doi: 10.1109/5.880081 (cited on pages 276, 280).
Ng, Andrew. Machine Learning Yearning. 2018. url: http://www.deeplearning.ai/machine-learning-yearning/ (visited on 12/10/2019) (cited on page 196).
Nocedal, Jorge and Wright, Stephen J. Numerical Optimization. 2nd ed. Springer Series in Operations Research and Financial Engineering. New York, NY: Springer, 2006, pp. XXII, 664 (cited on page 63).
Novikoff, A. B. ‘On Convergence Proofs on Perceptrons’. In: Proceedings of the Symposium on the Mathematical Theory of Automata. Vol. 12. New York, NY: Polytechnic Institute of Brooklyn, 1962, pp. 615–622 (cited on page 108).
Olah, Christopher. Understanding LSTM Networks. 2015. url: http://colah.github.io/posts/2015-08-Understanding-LSTMs/ (visited on 11/10/2019) (cited on page 171).
van den Oord, Aäron et al. ‘WaveNet: A Generative Model for Raw Audio’. In: CoRR abs/1609.03499 (2016) (cited on page 198).
Opitz, David and Maclin, Richard. ‘Popular Ensemble Methods: An Empirical Study’. In: Journal of Artificial Intelligence Research 11.1 (July 1999), pp. 169–198 (cited on page 203).
Pearl, Judea. ‘Reverend Bayes on Inference Engines: A Distributed Hierarchical Approach’. In: Proceedings of the National Conference on Artificial Intelligence. Menlo Park, CA: Association for the Advancement of Artificial Intelligence, 1982, pp. 133–136 (cited on page 357).
Pearl, Judea. ‘Bayesian Networks: A Model of Self-Activated Memory for Evidential Reasoning’. In: Proceedings of the Cognitive Science Society (CSS-7). 1985 (cited on page 343).
Pearl, Judea. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Francisco, CA: Morgan Kaufmann Publishers Inc., 1988 (cited on pages 343, 350, 357).
Pearl, Judea. ‘Causal Inference in Statistics: An Overview’. In: Statistics Surveys 3 (Jan. 2009), pp. 96–146. doi: 10.1214/09-SS057 (cited on pages 16, 347).
Pearl, Judea. Causality: Models, Reasoning and Inference. 2nd ed. Cambridge, England: Cambridge University Press, 2009 (cited on pages 16, 347).
Pearson, Karl. ‘On Lines and Planes of Closest Fit to Systems of Points in Space’. In: Philosophical Magazine 2 (1901), pp. 559–572 (cited on page 80).
Peters, Jonas, Janzing, Dominik, and Schölkopf, Bernhard. Elements of Causal Inference: Foundations and Learning Algorithms. Cambridge, MA: MIT Press, 2017 (cited on pages 16, 347).
Plataniotis, K. N. and Hatzinakos, D. ‘Gaussian Mixtures and Their Applications to Signal Processing’. In: Advanced Signal Processing Handbook: Theory and Implementation for Radar, Sonar, and Medical Imaging Real Time Systems. Ed. by Stergiopoulos, Stergios. Boca Raton, FL: CRC Press, 2000, Chapter 3 (cited on page 268).
Platt, John C. ‘Fast Training of Support Vector Machines Using Sequential Minimal Optimization’. In: Advances in Kernel Methods. Ed. by Schölkopf, Bernhard, Burges, Christopher J. C., and Smola, Alexander J. Cambridge, MA: MIT Press, 1999, pp. 185–208 (cited on page 127).
Platt, John C., Cristianini, Nello, and Shawe-Taylor, John. ‘Large Margin DAGs for Multiclass Classification’. In: Advances in Neural Information Processing Systems 12. Ed. by Solla, S. A., Leen, T. K., and Müller, K. Cambridge, MA: MIT Press, 2000, pp. 547–553 (cited on page 127).
Pratt, L. Y. ‘Discriminability-Based Transfer between Neural Networks’. In: Advances in Neural Information Processing Systems 5. Ed. by Hanson, S. J., Cowan, J. D., and Giles, C. L. Burlington, MA: Morgan-Kaufmann, 1993, pp. 204–211 (cited on page 16).
Press, S. James. Applied Multivariate Analysis. 2nd ed. Malabar, FL: R. E. Krieger, 1982 (cited on page 378).
Qian, Ning. ‘On the Momentum Term in Gradient Descent Learning Algorithms’. In: Neural Networks 12.1 (Jan. 1999), pp. 145–151. doi: 10.1016/S0893-6080(98)00116-6 (cited on page 192).
Quinlan, J. R. ‘Induction of Decision Trees’. In: Machine Learning 1.1 (Mar. 1986), pp. 81–106. doi: 10.1023/A:1022643204877 (cited on page 205).
Rabiner, Lawrence R. ‘A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition’. In: Proceedings of the IEEE 77.2 (1989), pp. 257–286 (cited on pages 276, 357).
Rai, Piyush. Matrix Factorization and Matrix Completion. 2016. url: https://cse.iitk.ac.in/users/piyush/courses/ml_autumn16/771A_lec14_slides.pdf (visited on 11/10/2019) (cited on page 144).
Rasmussen, Carl Edward and Williams, Christopher K. I. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). Cambridge, MA: MIT Press, 2005 (cited on pages 333, 339).
Ricci, Francesco, Rokach, Lior, and Shapira, Bracha. ‘Introduction to Recommender Systems Handbook’. In: Recommender Systems Handbook. Ed. by Ricci, Francesco et al. Boston, MA: Springer, 2011, pp. 1–35. doi: 10.1007/978-0-387-85820-3_1 (cited on page 141).
Rissanen, Jorma. ‘Modeling by Shortest Data Description’. In: Automatica 14.5 (1978), pp. 465–471 (cited on page 11).
Rocca, Joseph. Understanding Variational Autoencoders (VAEs). 2019. url: https://towardsdatascience.com/understanding-variational-autoencoders-vaes-f70510919f73 (visited on 03/03/2020) (cited on page 306).
Rosenblatt, F. ‘The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain’. In: Psychological Review 65.6 (1958), pp. 386–408 (cited on pages 2, 108).
Roweis, Sam T. and Saul, Lawrence K. ‘Nonlinear Dimensionality Reduction by Locally Linear Embedding’. In: Science 290.5500 (2000), pp. 2323–2326. doi: 10.1126/science.290.5500.2323 (cited on page 87).
Rubinstein, R., Bruckstein, A. M., and Elad, M. ‘Dictionaries for Sparse Representation Modeling’. In: Proceedings of the IEEE 98.6 (June 2010), pp. 1045–1057. doi: 10.1109/JPROC.2010.2040551 (cited on page 145).
Rue, Havard and Held, Leonhard. Gaussian Markov Random Fields: Theory and Applications (Monographs on Statistics and Applied Probability). Boca Raton, FL: Chapman & Hall/CRC, 2005 (cited on pages 344, 366).
Rumelhart, David E., Hinton, Geoffrey E., and Williams, Ronald J. ‘Learning Representations by Back-Propagating Errors’. In: Nature 323.6088 (1986), pp. 533–536. doi: 10.1038/323533a0 (cited on pages 153, 176).
Rumelhart, David E., McClelland, James L., et al., eds. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 2: Psychological and Biological Models. Cambridge, MA: MIT Press, 1986 (cited on page 2).
Rumelhart, David E., McClelland, James L., and PDP Research Group, eds. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations. Cambridge, MA: MIT Press, 1986 (cited on page 2).
Russell, Stuart and Norvig, Peter. Artificial Intelligence: A Modern Approach. 3rd ed. Upper Saddle River, NJ: Prentice Hall, 2010 (cited on pages 1, 2).
Saha, Sumit. A Comprehensive Guide to Convolutional Neural Networks. 2018. url: http://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53 (visited on 11/10/2019) (cited on page 169).
Salimans, Tim and Kingma, Diederik P. ‘Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks’. In: Proceedings of the 30th International Conference on Neural Information Processing Systems. NIPS’16. Barcelona, Spain: Curran Associates Inc., 2016, pp. 901–909 (cited on pages 194, 195).
Samir, Mostafa. Machine Learning Theory—Part 2: Generalization Bounds. 2016. url: https://mostafa-samir.github.io/ml-theory-pt2/ (visited on 11/10/2019) (cited on page 103).
Sammon, John W. ‘A Nonlinear Mapping for Data Structure Analysis’. In: IEEE Transactions on Computers 18.5 (1969), pp. 401–409 (cited on page 88).
Samuel, A. L. ‘Some Studies in Machine Learning Using the Game of Checkers’. In: IBM Journal of Research and Development 3.3 (July 1959), pp. 210–229. doi: 10.1147/rd.33.0210 (cited on page 2).
Saul, Lawrence K., Jaakkola, Tommi, and Jordan, Michael I. ‘Mean Field Theory for Sigmoid Belief Networks’. In: Journal of Artificial Intelligence Research 4 (1996), pp. 61–76 (cited on page 326).
Schapire, Robert E. ‘The Strength of Weak Learnability’. In: Machine Learning 5.2 (1990), pp. 197–227. doi: 10.1023/A:1022648800760 (cited on pages 204, 209, 210).
Schapire, Robert E. et al. ‘Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods’. In: Proceedings of the Fourteenth International Conference on Machine Learning. ICML ’97. San Francisco, CA: Morgan Kaufmann Publishers Inc., 1997, pp. 322–330 (cited on pages 204, 214).
Schölkopf, Bernhard, Smola, Alexander, and Müller, Klaus-Robert. ‘Nonlinear Component Analysis as a Kernel Eigenvalue Problem’. In: Neural Computation 10.5 (July 1998), pp. 1299–1319. doi: 10.1162/089976698300017467 (cited on page 125).
Schuster, M. and Paliwal, K. K. ‘Bidirectional Recurrent Neural Networks’. In: IEEE Transactions on Signal Processing 45.11 (Nov. 1997), pp. 2673–2681. doi: 10.1109/78.650093 (cited on page 171).
Seide, Frank, Li, Gang, and Yu, Dong. ‘Conversational Speech Transcription Using Context-Dependent Deep Neural Networks’. In: Proceedings of Interspeech. Baixas, France: International Speech Communication Association, 2011, pp. 437–440 (cited on page 276).
Settles, Burr. Active Learning Literature Survey. Computer Sciences Technical Report 1648. Madison, WI: University of Wisconsin–Madison, 2009 (cited on page 17).
Shalev-Shwartz, Shai and Ben-David, Shai. Understanding Machine Learning: From Theory to Algorithms. Cambridge, England: Cambridge University Press, 2014 (cited on pages 11, 14).
Shalev-Shwartz, Shai and Singer, Yoram. ‘A New Perspective on an Old Perceptron Algorithm’. In: International Conference on Computational Learning Theory. New York, NY: Springer, 2005, pp. 264–278 (cited on page 111).
Shannon, C. E. ‘A Mathematical Theory of Communication’. In: Bell System Technical Journal 27.3 (1948), pp. 379–423. doi: 10.1002/j.1538-7305.1948.tb01338.x (cited on page 41).
Shor, N. Z., Kiwiel, Krzysztof C., and Ruszczyński, Andrzej. Minimization Methods for Non-Differentiable Functions. Berlin, Germany: Springer-Verlag, 1985 (cited on page 71).
Silver, David et al. ‘Mastering the Game of Go with Deep Neural Networks and Tree Search’. In: Nature 529.7587 (Jan. 2016), pp. 484–489. doi: 10.1038/nature16961 (cited on page 16).
Slater, Morton. Lagrange Multipliers Revisited. Cowles Foundation Discussion Papers 80. New Haven, CT: Cowles Foundation for Research in Economics, Yale University, 1959 (cited on page 57).
Smolensky, P. ‘Information Processing in Dynamical Systems: Foundations of Harmony Theory’. In: Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations. Ed. by Rumelhart, David E., McClelland, James L., and PDP Research Group. Cambridge, MA: MIT Press, 1986, pp. 194–281 (cited on pages 366, 370).
Sollich, Peter and Krogh, Anders. ‘Learning with Ensembles: How Overfitting Can Be Useful’. In: Advances in Neural Information Processing Systems 7. Ed. by Touretzky, David S., Mozer, Michael, and Hasselmo, Michael E. Cambridge, MA: MIT Press, 1995, pp. 190–196 (cited on page 203).
Soltani, Rohollah and Jiang, Hui. ‘Higher Order Recurrent Neural Networks’. In: CoRR abs/1605.00064 (2016) (cited on pages 171, 201).
Sorenson, H. W. and Alspach, D. L. ‘Recursive Bayesian Estimation Using Gaussian Sums’. In: Automatica 7.4 (1971), pp. 465–479. doi: 10.1016/0005-1098(71)90097-5 (cited on page 268).
Srivastava, Nitish et al. ‘Dropout: A Simple Way to Prevent Neural Networks from Overfitting’. In: Journal of Machine Learning Research 15.1 (Jan. 2014), pp. 1929–1958 (cited on page 195).
Stephenson, W. ‘Technique of Factor Analysis’. In: Nature 136 (1935), p. 297. doi: 10.1038/136297b0 (cited on pages 293, 294, 296, 298).
Sutskever, Ilya, Vinyals, Oriol, and Le, Quoc V. ‘Sequence to Sequence Learning with Neural Networks’. In: Advances in Neural Information Processing Systems 27. Ed. by Ghahramani, Z. et al. Red Hook, NY: Curran Associates, Inc., 2014, pp. 3104–3112 (cited on page 198).
Sutton, C. and McCallum, A. ‘An Introduction to Conditional Random Fields for Relational Learning’. In: Introduction to Statistical Relational Learning. Ed. by Getoor, Lise and Taskar, Ben. Cambridge, MA: MIT Press, 2007 (cited on pages 366, 369).
Sutton, Richard S. and Barto, Andrew G. Reinforcement Learning: An Introduction. 2nd ed. Cambridge, MA: MIT Press, 2018 (cited on page 15).
Tenenbaum, Joshua B., de Silva, Vin, and Langford, John C. ‘A Global Geometric Framework for Nonlinear Dimensionality Reduction’. In: Science 290.5500 (2000), p. 2319 (cited on page 88).
Tibshirani, Robert. ‘Regression Shrinkage and Selection via the Lasso’. In: Journal of the Royal Statistical Society, Series B 58 (1996), pp. 267–288 (cited on page 140).
Tipping, M. E. and Bishop, Christopher. ‘Mixtures of Probabilistic Principal Component Analyzers’. In: Neural Computation 11 (Jan. 1999), pp. 443–482 (cited on pages 297, 298).
Tipping, Michael E. and Bishop, Chris M. ‘Probabilistic Principal Component Analysis’. In: Journal of the Royal Statistical Society, Series B 61.3 (1999), pp. 611–622 (cited on pages 293, 294, 296).
Titterington, D. M., Smith, A. F. M., and Makov, U. E. Statistical Analysis of Finite Mixture Distributions. New York, NY: Wiley, 1985 (cited on page 257).
Turney, Peter D. and Pantel, Patrick. ‘From Frequency to Meaning: Vector Space Models of Semantics’. In: Journal of Artificial Intelligence Research 37.1 (Jan. 2010), pp. 141–188 (cited on pages 142, 149).
Vanschoren, Joaquin. ‘Meta-Learning’. In: Automated Machine Learning: Methods, Systems, Challenges. Ed. by Hutter, Frank, Kotthoff, Lars, and Vanschoren, Joaquin. Cham, Switzerland: Springer International Publishing, 2019, pp. 35–61. doi: 10.1007/978-3-030-05318-5_2 (cited on page 16).
Vapnik, Vladimir N. The Nature of Statistical Learning Theory. Berlin, Germany: Springer-Verlag, 1995 (cited on pages 102, 103).
Vapnik, Vladimir N. Statistical Learning Theory. Hoboken, NJ: Wiley-Interscience, 1998 (cited on pages 102, 103).
Vaswani, Ashish et al. ‘Attention Is All You Need’. In: Advances in Neural Information Processing Systems 30. Ed. by Von Luxburg, U. Red Hook, NY: Curran Associates, Inc., 2017, pp. 5998–6008 (cited on pages 164, 172, 173, 199).
Viterbi, Andrew J. ‘Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm’. In: IEEE Transactions on Information Theory 13.2 (1967), pp. 260–269 (cited on pages 279, 357).
Waibel, Alexander et al. ‘Phoneme Recognition Using Time-Delay Neural Networks’. In: IEEE Transactions on Acoustics, Speech, and Signal Processing 37.3 (1989), pp. 328–339 (cited on page 161).
Waterhouse, Steve R., MacKay, David, and Robinson, Anthony J. ‘Bayesian Methods for Mixtures of Experts’. In: Advances in Neural Information Processing Systems 8. Ed. by Touretzky, D. S., Mozer, M. C., and Hasselmo, M. E. Cambridge, MA: MIT Press, 1996, pp. 351–357 (cited on page 326).
Watkins, C. J. C. H. ‘Learning from Delayed Rewards’. PhD thesis. Cambridge, England: King's College, University of Cambridge, 1989 (cited on page 15).
Werbos, P. J. ‘Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences’. PhD thesis. Cambridge, MA: Harvard University, 1974 (cited on pages 153, 176).
Weston, J. and Watkins, C. ‘Support Vector Machines for Multiclass Pattern Recognition’. In: Proceedings of the Seventh European Symposium on Artificial Neural Networks. European Symposium on Artificial Neural Networks, Apr. 1999 (cited on page 127).
Williams, C. K. I. and Barber, D. ‘Bayesian Classification with Gaussian Processes’. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 20.12 (1998), pp. 1342–1351 (cited on page 339).
Wolpert, David H. ‘Stacked Generalization’. In: Neural Networks 5.2 (1992), pp. 241–259. doi: 10.1016/S0893-6080(05)80023-1 (cited on page 204).
Wolpert, David H. ‘The Lack of a Priori Distinctions between Learning Algorithms’. In: Neural Computation 8.7 (Oct. 1996), pp. 1341–1390. doi: 10.1162/neco.1996.8.7.1341 (cited on page 11).
Yamaguchi, Kouichi et al. ‘A Neural Network for Speaker-Independent Isolated Word Recognition’. In: First International Conference on Spoken Language Processing (ICSLP 90). International Speech Communication Association, 1990, pp. 1077–1080 (cited on page 159).
Yang, Liu and Jin, Rong. Distance Metric Learning: A Comprehensive Survey. 2006. url: https://www.cs.cmu.edu/~liuy/frame_survey_v2.pdf (cited on page 13).
Young, Steve. ‘A Review of Large Vocabulary Continuous Speech Recognition’. In: IEEE Signal Processing Magazine 13.5 (Sept. 1996), pp. 45–57. doi: 10.1109/79.536824 (cited on page 276).
Young, Steve J., Russell, N. H., and Thornton, J. H. S. Token Passing: A Simple Conceptual Model for Connected Speech Recognition Systems. Tech. rep. Cambridge, England: Cambridge University Engineering Department, 1989 (cited on page 280).
Young, Steve et al. The HTK Book. Tech. rep. Cambridge, England: Cambridge University Engineering Department, 2002 (cited on page 286).
Zakka, Kevin. Deriving the Gradient for the Backward Pass of Batch Normalization. 2016. url: http://kevinzakka.github.io/2016/09/14/batch_normalization/ (visited on 11/20/2019) (cited on page 183).
Zeiler, Matthew D. ‘ADADELTA: An Adaptive Learning Rate Method’. In: CoRR abs/1212.5701 (2012) (cited on page 192).
Zhang, Shiliang, Jiang, Hui, and Dai, Lirong. ‘Hybrid Orthogonal Projection and Estimation (HOPE): A New Framework to Learn Neural Networks’. In: Journal of Machine Learning Research 17.37 (2016), pp. 1–33. url: http://jmlr.org/papers/v17/15-335.html (cited on pages 293, 294, 302, 303, 379).
Zhang, Shiliang et al. ‘Feedforward Sequential Memory Networks: A New Structure to Learn Long-Term Dependency’. In: CoRR abs/1512.08301 (2015) (cited on pages 161, 202).
Zhang, Shiliang et al. ‘Rectified Linear Neural Networks with Tied-Scalar Regularization for LVCSR’. In: INTERSPEECH 2015, 16th Annual Conference of the International Speech Communication Association, Dresden, Germany, September 6–10, 2015. International Speech Communication Association, 2015, pp. 2635–2639 (cited on page 194).
Zhang, Shiliang et al. ‘The Fixed-Size Ordinally-Forgetting Encoding Method for Neural Network Language Models’. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing. Beijing, China: Association for Computational Linguistics, July 2015, pp. 495–500. doi: 10.3115/v1/P15-2081 (cited on page 78).
Zhang, Shiliang et al. ‘Nonrecurrent Neural Structure for Long-Term Dependence’. In: IEEE/ACM Transactions on Audio, Speech, and Language Processing 25.4 (2017), pp. 871–884 (cited on page 161).
