Bibliography

Published online by Cambridge University Press:  18 November 2021

Hui Jiang
Affiliation: York University, Toronto

Type: Chapter
Information: Machine Learning Fundamentals: A Concise Introduction, pp. 381–396
Publisher: Cambridge University Press
Print publication year: 2021
Chapter DOI: https://doi.org/10.1017/9781108938051.022


References

Abramowitz, Milton and Stegun, Irene A. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. Mineola, NY: Dover, 1964 (cited on pages 331, 379).
Arjovsky, Martin, Chintala, Soumith, and Bottou, Léon. ‘Wasserstein Generative Adversarial Networks’. In: Proceedings of the 34th International Conference on Machine Learning. Ed. by Precup, Doina and Teh, Yee Whye. Vol. 70. Sydney, Australia: PMLR, 2017, pp. 214–223 (cited on page 295).
Asadi, Behnam and Jiang, Hui. ‘On Approximation Capabilities of ReLU Activation and Softmax Output Layer in Neural Networks’. In: CoRR abs/2002.04060 (2020) (cited on page 155).
Attias, Hagai. ‘Independent Factor Analysis’. In: Neural Computation 11.4 (1999), pp. 803–851. doi: 10.1162/089976699300016458 (cited on pages 293, 294, 301, 302).
Attias, Hagai. ‘A Variational Bayesian Framework for Graphical Models’. In: Advances in Neural Information Processing Systems 12. Cambridge, MA: MIT Press, 2000, pp. 209–215 (cited on pages 324, 326, 357).
Azevedo-Filho, Adriano. ‘Laplace's Method Approximations for Probabilistic Inference in Belief Networks with Continuous Variables’. In: Uncertainty in Artificial Intelligence. Ed. by de Mantaras, Ramon Lopez and Poole, David. San Francisco, CA: Morgan Kaufmann, 1994, pp. 28–36 (cited on page 324).
Ba, Lei Jimmy, Kiros, Jamie Ryan, and Hinton, Geoffrey E. ‘Layer Normalization’. In: CoRR abs/1607.06450 (2016) (cited on page 160).
Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. ‘Neural Machine Translation by Jointly Learning to Align and Translate’. In: 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, May 7–9, 2015, Conference Track Proceedings. ICLR, 2015 (cited on page 163).
Baker, James. ‘The DRAGON System—An Overview’. In: IEEE Transactions on Acoustics, Speech, and Signal Processing 23.1 (1975), pp. 24–29 (cited on pages 2, 3).
Bakir, Gökhan H. et al. Predicting Structured Data (Neural Information Processing). Cambridge, MA: MIT Press, 2007 (cited on page 4).
Baldi, P. and Hornik, K. ‘Neural Networks and Principal Component Analysis: Learning from Examples without Local Minima’. In: Neural Networks 2.1 (Jan. 1989), pp. 53–58. doi: 10.1016/0893-6080(89)90014-2 (cited on page 91).
Banerjee, Arindam et al. ‘Clustering on the Unit Hypersphere Using von Mises-Fisher Distributions’. In: Journal of Machine Learning Research 6 (Dec. 2005), pp. 1345–1382 (cited on page 379).
Barber, David. Bayesian Reasoning and Machine Learning. Cambridge, England: Cambridge University Press, 2012 (cited on pages 343, 357).
Bartholomew, David. Latent Variable Models and Factor Analysis: A Unified Approach. Chichester, England: Wiley, 2011 (cited on page 299).
Baum, Leonard E. ‘An Inequality and Associated Maximization Technique in Statistical Estimation for Probabilistic Functions of Markov Processes’. In: Inequalities 3 (1972), pp. 1–8 (cited on pages 276, 281).
Baum, Leonard E. and Petrie, Ted. ‘Statistical Inference for Probabilistic Functions of Finite State Markov Chains’. In: Annals of Mathematical Statistics 37.6 (Dec. 1966), pp. 1554–1563. doi: 10.1214/aoms/1177699147 (cited on page 276).
Baum, Leonard E. et al. ‘A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains’. In: Annals of Mathematical Statistics 41.1 (Feb. 1970), pp. 164–171. doi: 10.1214/aoms/1177697196 (cited on pages 276, 281).
Bell, A. J. and Sejnowski, T. J. ‘An Information Maximization Approach to Blind Separation and Blind Deconvolution’. In: Neural Computation 7 (1995), pp. 1129–1159 (cited on pages 293, 294).
Ben-David, Shai et al. ‘A Theory of Learning from Different Domains’. In: Machine Learning 79.1–2 (May 2010), pp. 151–175. doi: 10.1007/s10994-009-5152-4 (cited on page 16).
Berger, Adam L., Pietra, Stephen A. Della, and Pietra, Vincent J. Della. ‘A Maximum Entropy Approach to Natural Language Processing’. In: Computational Linguistics 22 (1996), pp. 39–71 (cited on page 254).
Bertsekas, Dimitri and Tsitsiklis, John. Introduction to Probability. Nashua, NH: Athena Scientific, 2002 (cited on page 40).
Bishop, Christopher M. Pattern Recognition and Machine Learning (Information Science and Statistics). 1st ed. New York, NY: Springer, 2007 (cited on pages 343, 344, 350, 357, 368).
Blei, David M., Ng, Andrew Y., and Jordan, Michael I. ‘Latent Dirichlet Allocation’. In: Journal of Machine Learning Research 3 (Mar. 2003), pp. 993–1022 (cited on pages 363, 365, 366).
Bottou, Léon. ‘On-Line Learning and Stochastic Approximations’. In: On-Line Learning in Neural Networks. Ed. by Saad, D. Cambridge, England: Cambridge University Press, 1998, pp. 9–42 (cited on page 61).
Bousquet, Olivier, Boucheron, Stéphane, and Lugosi, Gábor. ‘Introduction to Statistical Learning Theory’. In: Advanced Lectures on Machine Learning. Ed. by Bousquet, Olivier, von Luxburg, Ulrike, and Rätsch, Gunnar. Vol. 3176. Springer, 2003, pp. 169–207 (cited on pages 102, 103).
Box, G. E. P. and Tiao, G. C. Bayesian Inference in Statistical Analysis. Reading, MA: Addison-Wesley, 1973 (cited on page 318).
Box, M. J., Davies, D., and Swann, W. H. Non-Linear Optimisation Techniques. Edinburgh, Scotland: Oliver & Boyd, 1969 (cited on page 71).
Boyd, Stephen and Vandenberghe, Lieven. Convex Optimization. Cambridge, England: Cambridge University Press, 2004 (cited on page 50).
Boyd, Stephen et al. ‘Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers’. In: Foundations and Trends in Machine Learning 3.1 (Jan. 2011), pp. 1–122. doi: 10.1561/2200000016 (cited on page 71).
Breiman, Leo. ‘Bagging Predictors’. In: Machine Learning 24.2 (1996), pp. 123–140 (cited on pages 204, 208).
Breiman, Leo. ‘Stacked Regressions’. In: Machine Learning 24.1 (July 1996), pp. 49–64. doi: 10.1023/A:1018046112532 (cited on page 204).
Breiman, Leo. ‘Prediction Games and Arcing Algorithms’. In: Neural Computation 11.7 (Oct. 1999), pp. 1493–1517. doi: 10.1162/089976699300016106 (cited on page 210).
Breiman, Leo. ‘Random Forests’. In: Machine Learning 45.1 (2001), pp. 5–32. doi: 10.1023/A:1010933404324 (cited on pages 208, 209).
Breiman, Leo et al. Classification and Regression Trees. Monterey, CA: Wadsworth and Brooks, 1984 (cited on pages 7, 205).
Bridle, John S. ‘Probabilistic Interpretation of Feedforward Classification Network Outputs, with Relationships to Statistical Pattern Recognition’. In: Neurocomputing. Ed. by Soulié, Françoise Fogelman and Hérault, Jeanny. Berlin, Germany: Springer, 1990, pp. 227–236 (cited on pages 115, 159).
Bridle, John S. ‘Training Stochastic Model Recognition Algorithms as Networks Can Lead to Maximum Mutual Information Estimation of Parameters’. In: Advances in Neural Information Processing Systems (NIPS). Vol. 2. San Mateo, CA: Morgan Kaufmann, 1990, pp. 211–217 (cited on pages 115, 159).
Brown, Peter, Lee, Chin-Hui, and Spohrer, J. ‘Bayesian Adaptation in Speech Recognition’. In: ICASSP ’83. IEEE International Conference on Acoustics, Speech, and Signal Processing. Vol. 8. Washington, D.C.: IEEE Computer Society, 1983, pp. 761–764 (cited on page 16).
Brown, Peter et al. ‘A Statistical Approach to Language Translation’. In: Proceedings of the 12th Conference on Computational Linguistics—Volume 1. COLING ’88. Budapest, Hungary: Association for Computational Linguistics, 1988, pp. 71–76. doi: 10.3115/991635.991651 (cited on pages 2, 3).
Candès, E. J. and Wakin, M. B. ‘An Introduction to Compressive Sampling’. In: IEEE Signal Processing Magazine 25.2 (2008), pp. 21–30 (cited on page 146).
Chaikin, P. M. and Lubensky, T. C. Principles of Condensed Matter Physics. Cambridge, England: Cambridge University Press, 1995 (cited on page 327).
Chang, Chih-Chung and Lin, Chih-Jen. ‘LIBSVM: A Library for Support Vector Machines’. In: ACM Transactions on Intelligent Systems and Technology 2.3 (2011). Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm, 27:1–27:27 (cited on page 125).
Chen, Tianqi and Guestrin, Carlos. ‘XGBoost: A Scalable Tree Boosting System’. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Ed. by Krishnapuram, Balaji. New York, NY: Association for Computing Machinery, Aug. 2016. doi: 10.1145/2939672.2939785 (cited on page 215).
Cho, Kyunghyun et al. ‘Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation’. In: EMNLP. Ed. by Moschitti, Alessandro, Pang, Bo, and Daelemans, Walter. Stroudsburg, PA: Association for Computational Linguistics, 2014, pp. 1724–1734 (cited on page 171).
De Cock, Dean. ‘Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project’. In: Journal of Statistics Education 19 (Nov. 2011). doi: 10.1080/10691898.2011.11889627 (cited on page 216).
Cortes, Corinna and Vapnik, Vladimir. ‘Support-Vector Networks’. In: Machine Learning 20.3 (Sept. 1995), pp. 273–297. doi: 10.1023/A:1022627411411 (cited on page 124).
Crammer, Koby and Singer, Yoram. ‘On the Algorithmic Implementation of Multiclass Kernel-Based Vector Machines’. In: Journal of Machine Learning Research 2 (Mar. 2002), pp. 265–292 (cited on page 127).
Cybenko, G. ‘Approximation by Superpositions of a Sigmoidal Function’. In: Mathematics of Control, Signals, and Systems (MCSS) 2.4 (Dec. 1989), pp. 303–314. doi: 10.1007/BF02551274 (cited on page 154).
Dasarathy, B. V. and Sheela, B. V. ‘A Composite Classifier System Design: Concepts and Methodology’. In: Proceedings of the IEEE. Vol. 67. Washington, D.C.: IEEE Computer Society, 1979, pp. 708–713 (cited on page 203).
Davis, Steven B. and Mermelstein, Paul. ‘Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences’. In: IEEE Transactions on Acoustics, Speech and Signal Processing 28.4 (1980), pp. 357–366 (cited on page 77).
Deerwester, Scott et al. ‘Indexing by Latent Semantic Analysis’. In: Journal of the American Society for Information Science 41.6 (1990), pp. 391–407 (cited on page 142).
DeGroot, M. H. Optimal Statistical Decisions. New York, NY: McGraw-Hill, 1970 (cited on page 318).
Dempster, A. P., Laird, N. M., and Rubin, D. B. ‘Maximum Likelihood from Incomplete Data via the EM Algorithm’. In: Journal of the Royal Statistical Society, Series B 39.1 (1977), pp. 1–38 (cited on pages 265, 315).
Dharmadhikari, S. W. and Jogdeo, Kumar. ‘Multivariate Unimodality’. In: Annals of Statistics 4.3 (May 1976), pp. 607–613. doi: 10.1214/aos/1176343466 (cited on page 239).
Domingos, Pedro. ‘A Few Useful Things to Know about Machine Learning’. In: Communications of the ACM 55.10 (Oct. 2012), pp. 78–87. doi: 10.1145/2347736.2347755 (cited on pages 14, 15).
Duchi, John, Hazan, Elad, and Singer, Yoram. ‘Adaptive Subgradient Methods for Online Learning and Stochastic Optimization’. In: Journal of Machine Learning Research 12 (July 2011), pp. 2121–2159 (cited on page 192).
Duda, Richard O. and Hart, Peter E. Pattern Classification and Scene Analysis. New York, NY: John Wiley & Sons, 1973 (cited on page 2).
Duda, Richard O., Hart, Peter E., and Stork, David G. Pattern Classification. 2nd ed. New York, NY: Wiley, 2001 (cited on pages 7, 11, 226).
Elahi, Mehdi, Ricci, Francesco, and Rubens, Neil. ‘A Survey of Active Learning in Collaborative Filtering Recommender Systems’. In: Computer Science Review 20.C (May 2016), pp. 29–50. doi: 10.1016/j.cosrev.2016.05.002 (cited on page 17).
Everitt, B. and Hand, D. J. Finite Mixture Distributions. Monographs on Applied Probability and Statistics. New York, NY: Springer, 1981 (cited on page 257).
Fahlman, Scott E. An Empirical Study of Learning Speed in Back-Propagation Networks. Tech. rep. CMU-CS-88-162. Pittsburgh, PA: Computer Science Department, Carnegie Mellon University, 1988 (cited on page 63).
Ferguson, Thomas S. ‘A Bayesian Analysis of Some Nonparametric Problems’. In: The Annals of Statistics 1 (1973), pp. 209–230 (cited on page 333).
Finkelstein, Lev et al. ‘Placing Search in Context: The Concept Revisited’. In: Proceedings of the 10th International Conference on World Wide Web. New York, NY: Association for Computing Machinery, 2001, pp. 406–414. doi: 10.1145/503104.503110 (cited on page 149).
Fiscus, Jonathan. ‘A Post-Processing System to Yield Reduced Word Error Rates: Recognizer Output Voting Error Reduction (ROVER)’. In: IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings. Washington, D.C.: IEEE Computer Society, Aug. 1997, pp. 347–354 (cited on page 203).
Fisher, R. A. ‘The Use of Multiple Measurements in Taxonomic Problems’. In: Annals of Eugenics 7.7 (1936), pp. 179–188 (cited on page 85).
Fletcher, R. Practical Methods of Optimization. 2nd ed. Hoboken, NJ: Wiley-Interscience, 1987 (cited on page 63).
Forgy, E. ‘Cluster Analysis of Multivariate Data: Efficiency versus Interpretability of Classification’. In: Biometrics 21.3 (1965), pp. 768–769 (cited on pages 5, 270).
Foucart, Simon and Rauhut, Holger. A Mathematical Introduction to Compressive Sensing. Basel, Switzerland: Birkhäuser, 2013 (cited on page 146).
Freund, Yoav and Schapire, Robert E. ‘A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting’. In: Journal of Computer and System Sciences 55.1 (Aug. 1997), pp. 119–139. doi: 10.1006/jcss.1997.1504 (cited on pages 204, 210, 214).
Freund, Yoav and Schapire, Robert E. ‘Large Margin Classification Using the Perceptron Algorithm’. In: Proceedings of the Eleventh Annual Conference on Computational Learning Theory. COLT ’98. Madison, Wisconsin: ACM, 1998, pp. 209–217. doi: 10.1145/279943.279985 (cited on page 111).
Frey, Brendan J. Graphical Models for Machine Learning and Digital Communication. Cambridge, MA: MIT Press, 1998 (cited on page 357).
Frey, Brendan J. and MacKay, David J. C. ‘A Revolution: Belief Propagation in Graphs with Cycles’. In: Advances in Neural Information Processing Systems 10. Ed. by Jordan, M. I., Kearns, M. J., and Solla, S. A. Cambridge, MA: MIT Press, 1998, pp. 479–485 (cited on page 357).
Friedman, Jerome H. ‘Greedy Function Approximation: A Gradient Boosting Machine’. In: Annals of Statistics 29 (2000), pp. 1189–1232 (cited on pages 210, 211, 215).
Friedman, Jerome H. ‘Stochastic Gradient Boosting’. In: Computational Statistics and Data Analysis 38.4 (Feb. 2002), pp. 367–378. doi: 10.1016/S0167-9473(01)00065-2 (cited on pages 211, 215).
Friedman, Jerome, Hastie, Trevor, and Tibshirani, Rob. ‘Additive Logistic Regression: A Statistical View of Boosting’. In: The Annals of Statistics 28.2 (2000) (cited on pages 211, 212, 215).
Friedman, Jerome, Hastie, Trevor, and Tibshirani, Rob. ‘Regularization Paths for Generalized Linear Models via Coordinate Descent’. In: Journal of Statistical Software 33.1 (2010), pp. 1–22. doi: 10.18637/jss.v033.i01 (cited on page 140).
Fukushima, Kunihiko. ‘Neocognitron: A Self-Organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position’. In: Biological Cybernetics 36 (1980), pp. 193–202 (cited on page 157).
Gauvain, J. and Lee, Chin-Hui. ‘Maximum a Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains’. In: IEEE Transactions on Speech and Audio Processing 2.2 (1994), pp. 291–298 (cited on page 16).
Geisser, S. Predictive Inference: An Introduction. New York, NY: Chapman & Hall, 1993 (cited on page 314).
Ghahramani, Zoubin. Non-Parametric Bayesian Methods. 2005. url: http://mlg.eng.cam.ac.uk/zoubin/talks/uai05tutorial-b.pdf (visited on 03/10/2020) (cited on page 335).
Glick, Ned. ‘Sample-Based Classification Procedures Derived from Density Estimators’. In: Journal of the American Statistical Association 67 (1972), pp. 116–122 (cited on pages 229, 230).
Glick, Ned. ‘Sample-Based Classification Procedures Related to Empiric Distributions’. In: IEEE Transactions on Information Theory 22 (1976), pp. 454–461 (cited on page 229).
Glorot, Xavier and Bengio, Yoshua. ‘Understanding the Difficulty of Training Deep Feedforward Neural Networks’. In: Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS’10). Society for Artificial Intelligence and Statistics, 2010, pp. 249–256 (cited on pages 153, 190).
Good, I. J. ‘The Population Frequencies of Species and the Estimation of Population Parameters’. In: Biometrika 40.3–4 (Dec. 1953), pp. 237–264. doi: 10.1093/biomet/40.3-4.237 (cited on page 250).
Goodfellow, Ian et al. ‘Generative Adversarial Nets’. In: Advances in Neural Information Processing Systems 27. Ed. by Ghahramani, Z. et al. Red Hook, NY: Curran Associates, Inc., 2014, pp. 2672–2680 (cited on pages 293–295, 307, 308).
Gregor, Karol et al. ‘DRAW: A Recurrent Neural Network for Image Generation’. In: Proceedings of the 32nd International Conference on Machine Learning. Ed. by Bach, Francis and Blei, David. Vol. 37. Proceedings of Machine Learning Research. Lille, France: PMLR, July 2015, pp. 1462–1471 (cited on page 295).
Grezl, F. et al. ‘Probabilistic and Bottle-Neck Features for LVCSR of Meetings’. In: 2007 IEEE International Conference on Acoustics, Speech and Signal Processing. Vol. 4. Washington, D.C.: IEEE Computer Society, 2007, pp. 757–760 (cited on page 91).
Gruber, M. H. J. Improving Efficiency by Shrinkage: The James–Stein and Ridge Regression Estimators. Boca Raton, FL: CRC Press, 1998, pp. 7–15 (cited on page 139).
Guyon, Isabelle and Elisseeff, André. ‘An Introduction to Variable and Feature Selection’. In: Journal of Machine Learning Research 3 (Mar. 2003), pp. 1157–1182 (cited on page 78).
Haff, L. R. ‘An Identity for the Wishart Distribution with Applications’. In: Journal of Multivariate Analysis 9.4 (Dec. 1979), pp. 531–544 (cited on page 322).
Hansen, L. K. and Salamon, P. ‘Neural Network Ensembles’. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 12.10 (Oct. 1990), pp. 993–1001. doi: 10.1109/34.58871 (cited on page 203).
Harris, Zellig. ‘Distributional Structure’. In: Word 10.2–3 (1954), pp. 146–162 (cited on pages 5, 77, 142).
Hastie, Trevor, Tibshirani, Robert, and Friedman, Jerome. The Elements of Statistical Learning. Springer Series in Statistics. New York, NY: Springer, 2001 (cited on pages 138, 205, 207).
Hellman, Martin E. and Raviv, Josef. ‘Probability of Error, Equivocation and the Chernoff Bound’. In: IEEE Transactions on Information Theory 16 (1970), pp. 368–372 (cited on page 226).
Hermansky, H., Ellis, D. P. W., and Sharma, S. ‘Tandem Connectionist Feature Extraction for Conventional HMM Systems’. In: 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. Vol. 3. Washington, D.C.: IEEE Computer Society, 2000, pp. 1635–1638 (cited on page 91).
Hihi, Salah El and Bengio, Yoshua. ‘Hierarchical Recurrent Neural Networks for Long-Term Dependencies’. In: Advances in Neural Information Processing Systems 8. Ed. by Touretzky, D. S., Mozer, M. C., and Hasselmo, M. E. Cambridge, MA: MIT Press, 1996, pp. 493–499 (cited on page 171).
Hinton, Geoffrey E. ‘Training Products of Experts by Minimizing Contrastive Divergence’. In: Neural Computation 14.8 (2002), pp. 1771–1800. doi: 10.1162/089976602760128018 (cited on page 371).
Hinton, Geoffrey E. ‘A Practical Guide to Training Restricted Boltzmann Machines’. In: Neural Networks: Tricks of the Trade. Ed. by Montavon, Grégoire, Orr, Genevieve B., and Müller, Klaus-Robert. 2nd ed. Vol. 7700. New York, NY: Springer, 2012, pp. 599–619 (cited on pages 366, 370).
Hinton, Geoffrey and Roweis, Sam. ‘Stochastic Neighbor Embedding’. In: Advances in Neural Information Processing Systems. Ed. by Becker, S., Thrun, S., and Obermayer, K. Vol. 15. Cambridge, MA: MIT Press, 2003, pp. 833–840 (cited on page 89).
Ho, Tin Kam. ‘Random Decision Forests’. In: Proceedings of the Third International Conference on Document Analysis and Recognition (Volume 1). ICDAR ’95. Washington, D.C.: IEEE Computer Society, 1995, p. 278 (cited on pages 208, 209).
Ho, Tin Kam, Hull, Jonathan J., and Srihari, Sargur N. ‘Decision Combination in Multiple Classifier Systems’. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 16.1 (Jan. 1994), pp. 66–75. doi: 10.1109/34.273716 (cited on page 203).
Hochreiter, Sepp and Schmidhuber, Jürgen. ‘Long Short-Term Memory’. In: Neural Computation 9.8 (Nov. 1997), pp. 1735–1780. doi: 10.1162/neco.1997.9.8.1735 (cited on page 171).
Hornik, Kurt. ‘Approximation Capabilities of Multilayer Feedforward Networks’. In: Neural Networks 4.2 (Mar. 1991), pp. 251–257. doi: 10.1016/0893-6080(91)90009-T (cited on pages 154, 155).
Hotelling, H. ‘Analysis of a Complex of Statistical Variables into Principal Components’. In: Journal of Educational Psychology 24.6 (1933), pp. 417–441. doi: 10.1037/h0071325 (cited on page 80).
Huo, Qiang. ‘An Introduction to Decision Rules for Automatic Speech Recognition’. Technical Report TR-99-07. Hong Kong: Department of Computer Science and Information Systems, University of Hong Kong, 1999 (cited on page 229).
Huo, Qiang and Lee, Chin-Hui. ‘On-Line Adaptive Learning of the Continuous Density Hidden Markov Model Based on Approximate Recursive Bayes Estimate’. In: IEEE Transactions on Speech and Audio Processing 5.2 (1997), pp. 161–172 (cited on page 17).
Hussein, Ahmed et al. ‘Imitation Learning: A Survey of Learning Methods’. In: ACM Computing Surveys 50.2 (Apr. 2017). doi: 10.1145/3054912 (cited on page 17).
Hyvärinen, Aapo and Oja, Erkki. ‘Independent Component Analysis: Algorithms and Applications’. In: Neural Networks 13 (2000), pp. 411–430 (cited on pages 293, 294, 301).
Ioffe, Sergey and Szegedy, Christian. ‘Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift’. In: Proceedings of the 32nd International Conference on International Conference on Machine Learning—Volume 37. ICML’15. Lille, France: Journal of Machine Learning Research, 2015, pp. 448–456 (cited on page 160).
Jaakkola, Tommi S. and Jordan, Michael I. A Variational Approach to Bayesian Logistic Regression Models and Their Extensions. 1996. url: https://people.csail.mit.edu/tommi/papers/aistat96.ps (visited on 11/10/2019) (cited on page 326).
Jackson, Peter. Introduction to Expert Systems. 2nd ed. USA: Addison-Wesley Longman Publishing Co., Inc., 1990 (cited on page 2).
Jarrett, Kevin et al. ‘What Is the Best Multi-Stage Architecture for Object Recognition?’ In: 2009 IEEE 12th International Conference on Computer Vision. Washington, D.C.: IEEE Computer Society, 2009, pp. 2146–2153 (cited on page 153).
Jelinek, F., Bahl, L. R., and Mercer, R. L. ‘Design of a Linguistic Statistical Decoder for the Recognition of Continuous Speech’. In: IEEE Transactions on Information Theory 21 (1975), pp. 250–256 (cited on pages 2, 3).
Jensen, Finn V. Introduction to Bayesian Networks. 1st ed. Berlin, Germany: Springer-Verlag, 1996 (cited on page 343).
Jensen, J. L. W. V. ‘Sur les fonctions convexes et les inégalités entre les valeurs moyennes’. In: Acta Mathematica 30.1 (1906), pp. 175–193 (cited on page 46).
Jiang, Hui. ‘A New Perspective on Machine Learning: How to Do Perfect Supervised Learning’. In: CoRR abs/1901.02046 (2019) (cited on page 13).
Johnson, Richard Arnold and Wichern, Dean W. Applied Multivariate Statistical Analysis. 5th ed. Upper Saddle River, NJ: Prentice Hall, 2002 (cited on page 378).
Jones, Karen Spärck. ‘A Statistical Interpretation of Term Specificity and Its Application in Retrieval’. In: Journal of Documentation 28 (1972), pp. 11–21 (cited on page 78).
Jordan, Michael I., ed. Learning in Graphical Models. Cambridge, MA: MIT Press, 1999 (cited on page 343).
Jordan, Michael I. et al. ‘An Introduction to Variational Methods for Graphical Models’. In: Learning in Graphical Models. Ed. by Jordan, Michael I. Dordrecht, Netherlands: Springer, 1998, pp. 105–161. doi: 10.1007/978-94-011-5014-9_5 (cited on page 357).
Juang, B. H. ‘Maximum-Likelihood Estimation for Mixture Multivariate Stochastic Observations of Markov Chains’. In: AT&T Technical Journal 64.6 (July 1985), pp. 1235–1249. doi: 10.1002/j.1538-7305.1985.tb00273.x (cited on page 284).
Juang, B. H. and Rabiner, L. R. ‘The Segmental K-Means Algorithm for Estimating Parameters of Hidden Markov Models’. In: IEEE Transactions on Acoustics, Speech, and Signal Processing 38.9 (Sept. 1990), pp. 1639–1641. doi: 10.1109/29.60082 (cited on page 286).
Kalman, Rudolph Emil. ‘A New Approach to Linear Filtering and Prediction Problems’. In: Journal of Basic Engineering 82.1 (1960), pp. 35–45 (cited on page 69).
Karras, Tero, Laine, Samuli, and Aila, Timo. ‘A Style-Based Generator Architecture for Generative Adversarial Networks’. In: CoRR abs/1812.04948 (2018) (cited on page 295).
Karush, William. ‘Minima of Functions of Several Variables with Inequalities as Side Conditions’. MA thesis. Chicago, IL: Department of Mathematics, University of Chicago, 1939 (cited on page 57).
Katz, Slava M. ‘Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer’. In: IEEE Transactions on Acoustics, Speech and Signal Processing. 1987, pp. 400–401 (cited on page 250).
Kechris, Alexander S. Classical Descriptive Set Theory. Berlin, Germany: Springer-Verlag, 1995 (cited on page 291).
Kendall, M. G., Stuart, A., and Ord, J. K. Kendall's Advanced Theory of Statistics. Oxford, England: Oxford University Press, 1987 (cited on page 323).
Kinderman, R. and Snell, S. L. Markov Random Fields and Their Applications. Ann Arbor, MI: American Mathematical Society, 1980 (cited on pages 344, 366).
Kingma, Diederik P. and Ba, Jimmy. ‘Adam: A Method for Stochastic Optimization’. In: CoRR abs/1412.6980 (2014) (cited on page 192).
Kingma, Diederik P. and Welling, Max. ‘Auto-Encoding Variational Bayes’. In: 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14–16, 2014, Conference Track Proceedings. ICLR, 2014 (cited on pages 293, 294, 305, 306).
Koren, Yehuda, Bell, Robert, and Volinsky, Chris. ‘Matrix Factorization Techniques for Recommender Systems’. In: Computer 42.8 (Aug. 2009), pp. 30–37. doi: 10.1109/MC.2009.263 (cited on page 143).
Kramer, Mark A. ‘Nonlinear Principal Component Analysis Using Autoassociative Neural Networks’. In: AIChE Journal 37.2 (1991), pp. 233–243. doi: 10.1002/aic.690370209 (cited on page 90).
Krogh, Anders and Hertz, John A. ‘A Simple Weight Decay Can Improve Generalization’. In: Advances in Neural Information Processing Systems 4. Ed. by Moody, J. E., Hanson, S. J., and Lippmann, R. P. Burlington, MA: Morgan-Kaufmann, 1992, pp. 950–957 (cited on page 194).
Kschischang, F. R., Frey, B. J., and Loeliger, H. A. ‘Factor Graphs and the Sum-Product Algorithm’. In: IEEE Transactions on Information Theory 47.2 (Feb. 2001), pp. 498–519. doi: 10.1109/18.910572 (cited on pages 357, 360).
Kuhn, H. W. and Tucker, A. W. ‘Nonlinear Programming’. In: Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability. Berkeley, CA: University of California Press, 1951, pp. 481–492 (cited on page 57).
Kulis, Brian. ‘Metric Learning: A Survey’. In: Foundations and Trends in Machine Learning 5.4 (2013), pp. 287–364. doi: 10.1561/2200000019 (cited on page 13).
Kullback, S. and Leibler, R. A. ‘On Information and Sufficiency’. In: Annals of Mathematical Statistics 22.1 (1951), pp. 79–86 (cited on page 41).
Lafferty, John D., McCallum, Andrew, and Pereira, Fernando C. N. ‘Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data’. In: Proceedings of the Eighteenth International Conference on Machine Learning. ICML ’01. San Francisco, CA: Morgan Kaufmann Publishers Inc., 2001, pp. 282–289 (cited on pages 366, 368, 369).
Laplace, Pierre Simon. ‘Memoir on the Probability of the Causes of Events’. In: Statistical Science 1.3 (1986), pp. 364–378 (cited on page 324).
Lauritzen, S. L. and Spiegelhalter, D. J. ‘Local Computations with Probabilities on Graphical Structures and Their Application to Expert Systems’. In: Journal of the Royal Statistical Society. Series B (Methodological) 50.2 (1988), pp. 157–224 (cited on pages 357, 361).
LeCun, Yann and Bengio, Yoshua. ‘Convolutional Networks for Images, Speech, and Time Series’. In: The Handbook of Brain Theory and Neural Networks. Ed. by Arbib, Michael A. Cambridge, MA: MIT Press, 1998, pp. 255–258 (cited on page 157).
LeCun, Yann et al. ‘Gradient-Based Learning Applied to Document Recognition’. In: Proceedings of the IEEE 86.11 (1998), pp. 2278–2324 (cited on pages 92, 129, 200).
Lee, Chin-Hui and Huo, Qiang. ‘On Adaptive Decision Rules and Decision Parameter Adaptation for Automatic Speech Recognition’. In: Proceedings of the IEEE 88.8 (2000), pp. 1241–1269 (cited on page 16).
Leggetter, C. J. and Woodland, P. C. ‘Maximum Likelihood Linear Regression for Speaker Adaptation of Continuous Density Hidden Markov Models’. In: Computer Speech & Language 9.2 (1995), pp. 171–185. doi: 10.1006/csla.1995.0010 (cited on page 16).
Linnainmaa, Seppo. ‘Taylor Expansion of the Accumulated Rounding Error’. In: BIT Numerical Mathematics 16.2 (June 1976), pp. 146–160. doi: 10.1007/BF01931367 (cited on page 176).
Liu, Quan et al. ‘Learning Semantic Word Embeddings Based on Ordinal Knowledge Constraints’. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Beijing, China: Association for Computational Linguistics, July 2015, pp. 1501–1511. doi: 10.3115/v1/P15-1145 (cited on page 149).
Lloyd, Stuart P. ‘Least Squares Quantization in PCM’. In: IEEE Transactions on Information Theory 28 (1982), pp. 129–137 (cited on page 270).
Long, Jonathan, Shelhamer, Evan, and Darrell, Trevor. ‘Fully Convolutional Networks for Semantic Segmentation’. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Washington, D.C.: IEEE Computer Society, June 2015 (cited on pages 198, 309).
Lowe, David G. ‘Object Recognition from Local Scale-Invariant Features’. In: Proceedings of the International Conference on Computer Vision. ICCV ’99. Washington, D.C.: IEEE Computer Society, 1999, p. 1150 (cited on page 77).
van der Maaten, Laurens and Hinton, Geoffrey. ‘Visualizing Data Using t-SNE’. In: Journal of Machine Learning Research 9 (2008), pp. 2579–2605 (cited on page 89).
MacKay, David J. C. ‘The Evidence Framework Applied to Classification Networks’. In: Neural Computation 4.5 (1992), pp. 720–736. doi: 10.1162/neco.1992.4.5.720 (cited on page 326).
MacKay, David J. C. ‘Introduction to Gaussian Processes’. In: Neural Networks and Machine Learning. Ed. by Bishop, C. M. NATO ASI Series. Amsterdam, Netherlands: Kluwer Academic Press, 1998, pp. 133–166 (cited on page 333).
MacKay, David J. C. Information Theory, Inference, and Learning Algorithms. Cambridge, England: Cambridge University Press, 2003 (cited on page 324).
MacKay, David J. C. ‘Good Error-Correcting Codes Based on Very Sparse Matrices’. In: IEEE Transactions on Information Theory 45.2 (Mar. 1999), pp. 399–431. doi: 10.1109/18.748992 (cited on page 357).
MacKay, David J. C. ‘Introduction to Monte Carlo Methods’. In: Learning in Graphical Models. Ed. by Jordan, Michael I. Dordrecht, Netherlands: Springer, 1998, pp. 175–204. doi: 10.1007/978-94-011-5014-9_7 (cited on pages 357, 361).
Mahoney, Matt. Large Text Compression Benchmark. 2011. url: http://mattmahoney.net/dc/textdata.html (visited on 11/10/2019) (cited on page 149).
Mairal, Julien et al. ‘Online Learning for Matrix Factorization and Sparse Coding’. In: Journal of Machine Learning Research 11 (Mar. 2010), pp. 19–60 (cited on page 145).
Maritz, J. S. and Lwin, T. Empirical Bayes Methods. London, England: Chapman & Hall, 1989 (cited on page 323).
Maron, M. E. ‘Automatic Indexing: An Experimental Inquiry’. In: Journal of the ACM 8.3 (July 1961), pp. 404–417. doi: 10.1145/321075.321084 (cited on page 362).
Martens, James. ‘Deep Learning via Hessian-Free Optimization’. In: Proceedings of the 27th International Conference on International Conference on Machine Learning. ICML’10. Haifa, Israel: Omnipress, 2010, pp. 735–742 (cited on page 63).
Mason, Llew et al. ‘Boosting Algorithms as Gradient Descent’. In: Proceedings of the 12th International Conference on Neural Information Processing Systems. NIPS’99. Denver, CO: MIT Press, 1999, pp. 512–518 (cited on pages 210, 212).
McLachlan, G. J. and Peel, D. Finite Mixture Models. New York, NY: Wiley, 2000 (cited on page 257).
Mead, A. ‘Review of the Development of Multidimensional Scaling Methods’. In: Journal of the Royal Statistical Society. Series D (The Statistician) 41.1 (1992), pp. 27–39 (cited on page 88).
Minka, T. P. ‘Expectation Propagation for Approximate Bayesian Inference’. In: Uncertainty in Artificial Intelligence. Vol. 17. Association for Uncertainty in Artificial Intelligence, 2001, pp. 362–369 (cited on page 357).
Mitchell, Tom M. Machine Learning. New York, NY: McGraw-Hill, 1997 (cited on page 2).
Mnih, Volodymyr et al. ‘Playing Atari with Deep Reinforcement Learning’. In: arXiv (2013). arXiv:1312.5602 (cited on page 15).
Mnih, Volodymyr et al. ‘Human-Level Control through Deep Reinforcement Learning’. In: Nature 518.7540 (Feb. 2015), pp. 529–533 (cited on page 16).
Nair, Vinod and Hinton, Geoffrey E. ‘Rectified Linear Units Improve Restricted Boltzmann Machines’. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10). ICML, 2010, pp. 807–814 (cited on page 153).
Neal, Radford M. ‘Bayesian Mixture Modeling’. In: Maximum Entropy and Bayesian Methods: Seattle, 1991. Ed. by Smith, C. Ray, Erickson, Gary J., and Neudorfer, Paul O. Dordrecht, Netherlands: Springer, 1992, pp. 197–211. doi: 10.1007/978-94-017-2219-3_14 (cited on page 333).
Neal, Radford M. and Hinton, Geoffrey E. ‘A View of the EM Algorithm That Justifies Incremental, Sparse, and Other Variants’. In: Learning in Graphical Models. Ed. by Jordan, Michael I. Dordrecht, Netherlands: Springer, 1998, pp. 355–368. doi: 10.1007/978-94-011-5014-9_12 (cited on page 327).
Nelder, J. A. and Wedderburn, R. W. M. ‘Generalized Linear Models’. In: Journal of the Royal Statistical Society, Series A, General 135 (1972), pp. 370–384 (cited on pages 239, 250).
Nesterov, Yurii. Introductory Lectures on Convex Optimization: A Basic Course. 1st ed. New York, NY: Springer, 2014 (cited on pages 49, 50).
Ney, H. and Ortmanns, S. ‘Progress in Dynamic Programming Search for LVCSR’. In: Proceedings of the IEEE 88.8 (Aug. 2000), pp. 1224–1240. doi: 10.1109/5.880081 (cited on pages 276, 280).
Ng, Andrew. Machine Learning Yearning. 2018. url: http://www.deeplearning.ai/machine-learning-yearning/ (visited on 12/10/2019) (cited on page 196).
Nocedal, Jorge and Wright, Stephen J. Numerical Optimization. 2nd ed. Springer Series in Operations Research and Financial Engineering. New York, NY: Springer, 2006, pp. XXII, 664 (cited on page 63).
Novikoff, A. B. ‘On Convergence Proofs on Perceptrons’. In: Proceedings of the Symposium on the Mathematical Theory of Automata. Vol. 12. New York, NY: Polytechnic Institute of Brooklyn, 1962, pp. 615–622 (cited on page 108).
Olah, Christopher. Understanding LSTM Networks. 2015. url: http://colah.github.io/posts/2015-08-Understanding-LSTMs/ (visited on 11/10/2019) (cited on page 171).
van den Oord, Aäron et al. ‘WaveNet: A Generative Model for Raw Audio’. In: CoRR abs/1609.03499 (2016) (cited on page 198).
Opitz, David and Maclin, Richard. ‘Popular Ensemble Methods: An Empirical Study’. In: Journal of Artificial Intelligence Research 11.1 (July 1999), pp. 169–198 (cited on page 203).
Pearl, Judea. ‘Reverend Bayes on Inference Engines: A Distributed Hierarchical Approach’. In: Proceedings of the National Conference on Artificial Intelligence. Menlo Park, CA: Association for the Advancement of Artificial Intelligence, 1982, pp. 133–136 (cited on page 357).
Pearl, Judea. ‘Bayesian Networks: A Model of Self-Activated Memory for Evidential Reasoning’. In: Proceedings of the Cognitive Science Society (CSS-7). 1985 (cited on page 343).
Pearl, Judea. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Francisco, CA: Morgan Kaufmann Publishers Inc., 1988 (cited on pages 343, 350, 357).
Pearl, Judea. ‘Causal Inference in Statistics: An Overview’. In: Statistics Surveys 3 (Jan. 2009), pp. 96–146. doi: 10.1214/09-SS057 (cited on pages 16, 347).
Pearl, Judea. Causality: Models, Reasoning and Inference. 2nd ed. Cambridge, England: Cambridge University Press, 2009 (cited on pages 16, 347).
Pearson, Karl. ‘On Lines and Planes of Closest Fit to Systems of Points in Space’. In: Philosophical Magazine 2 (1901), pp. 559–572 (cited on page 80).
Peters, Jonas, Janzing, Dominik, and Schölkopf, Bernhard. Elements of Causal Inference: Foundations and Learning Algorithms. Cambridge, MA: MIT Press, 2017 (cited on pages 16, 347).
Plataniotis, K. N. and Hatzinakos, D. ‘Gaussian Mixtures and Their Applications to Signal Processing’. In: Advanced Signal Processing Handbook: Theory and Implementation for Radar, Sonar, and Medical Imaging Real Time Systems. Ed. by Stergiopoulos, Stergios. Boca Raton, FL: CRC Press, 2000, Chapter 3 (cited on page 268).
Platt, John C. ‘Fast Training of Support Vector Machines Using Sequential Minimal Optimization’. In: Advances in Kernel Methods. Ed. by Schölkopf, Bernhard, Burges, Christopher J. C., and Smola, Alexander J. Cambridge, MA: MIT Press, 1999, pp. 185–208 (cited on page 127).
Platt, John C., Cristianini, Nello, and Shawe-Taylor, John. ‘Large Margin DAGs for Multiclass Classification’. In: Advances in Neural Information Processing Systems 12. Ed. by Solla, S. A., Leen, T. K., and Müller, K. Cambridge, MA: MIT Press, 2000, pp. 547–553 (cited on page 127).
Pratt, L. Y. ‘Discriminability-Based Transfer between Neural Networks’. In: Advances in Neural Information Processing Systems 5. Ed. by Hanson, S. J., Cowan, J. D., and Giles, C. L. Burlington, MA: Morgan-Kaufmann, 1993, pp. 204–211 (cited on page 16).
Press, S. James. Applied Multivariate Analysis. 2nd ed. Malabar, FL: R. E. Krieger, 1982 (cited on page 378).
Qian, Ning. ‘On the Momentum Term in Gradient Descent Learning Algorithms’. In: Neural Networks 12.1 (Jan. 1999), pp. 145–151. doi: 10.1016/S0893-6080(98)00116-6 (cited on page 192).
Quinlan, J. R. ‘Induction of Decision Trees’. In: Machine Learning 1.1 (Mar. 1986), pp. 81–106. doi: 10.1023/A:1022643204877 (cited on page 205).
Rabiner, Lawrence R. ‘A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition’. In: Proceedings of the IEEE 77.2 (1989), pp. 257–286 (cited on pages 276, 357).
Rai, Piyush. Matrix Factorization and Matrix Completion. 2016. url: https://cse.iitk.ac.in/users/piyush/courses/ml_autumn16/771A_lec14_slides.pdf (visited on 11/10/2019) (cited on page 144).
Rasmussen, Carl Edward and Williams, Christopher K. I. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). Cambridge, MA: MIT Press, 2005 (cited on pages 333, 339).
Ricci, Francesco, Rokach, Lior, and Shapira, Bracha. ‘Introduction to Recommender Systems Handbook’. In: Recommender Systems Handbook. Ed. by Ricci, Francesco et al. Boston, MA: Springer, 2011, pp. 1–35. doi: 10.1007/978-0-387-85820-3_1 (cited on page 141).
Rissanen, Jorma. ‘Modeling by Shortest Data Description’. In: Automatica 14.5 (1978), pp. 465–471 (cited on page 11).
Rocca, Joseph. Understanding Variational Autoencoders (VAEs). 2019. url: https://towardsdatascience.com/understanding-variational-autoencoders-vaes-f70510919f73 (visited on 03/03/2020) (cited on page 306).
Rosenblatt, F. ‘The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain’. In: Psychological Review 65.6 (1958), pp. 386–408 (cited on pages 2, 108).
Roweis, Sam T. and Saul, Lawrence K. ‘Nonlinear Dimensionality Reduction by Locally Linear Embedding’. In: Science 290.5500 (2000), pp. 2323–2326. doi: 10.1126/science.290.5500.2323 (cited on page 87).
Rubinstein, R., Bruckstein, A. M., and Elad, M. ‘Dictionaries for Sparse Representation Modeling’. In: Proceedings of the IEEE 98.6 (June 2010), pp. 1045–1057. doi: 10.1109/JPROC.2010.2040551 (cited on page 145).
Rue, Havard and Held, Leonhard. Gaussian Markov Random Fields: Theory and Applications (Monographs on Statistics and Applied Probability). Boca Raton, FL: Chapman & Hall/CRC, 2005 (cited on pages 344, 366).
Rumelhart, David E., Hinton, Geoffrey E., and Williams, Ronald J. ‘Learning Representations by Back-Propagating Errors’. In: Nature 323.6088 (1986), pp. 533–536. doi: 10.1038/323533a0 (cited on pages 153, 176).
Rumelhart, David E., McClelland, James L., et al., eds. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 2: Psychological and Biological Models. Cambridge, MA: MIT Press, 1986 (cited on page 2).
Rumelhart, David E., McClelland, James L., and PDP Research Group, eds. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations. Cambridge, MA: MIT Press, 1986 (cited on page 2).
Russell, Stuart and Norvig, Peter. Artificial Intelligence: A Modern Approach. 3rd ed. Upper Saddle River, NJ: Prentice Hall, 2010 (cited on pages 1, 2).
Saha, Sumit. A Comprehensive Guide to Convolutional Neural Networks. 2018. url: http://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53 (visited on 11/10/2019) (cited on page 169).
Salimans, Tim and Kingma, Diederik P. ‘Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks’. In: Proceedings of the 30th International Conference on Neural Information Processing Systems. NIPS’16. Barcelona, Spain: Curran Associates Inc., 2016, pp. 901–909 (cited on pages 194, 195).
Samir, Mostafa. Machine Learning Theory—Part 2: Generalization Bounds. 2016. url: https://mostafa-samir.github.io/ml-theory-pt2/ (visited on 11/10/2019) (cited on page 103).
Sammon, John W. ‘A Nonlinear Mapping for Data Structure Analysis’. In: IEEE Transactions on Computers 18.5 (1969), pp. 401–409 (cited on page 88).
Samuel, A. L. ‘Some Studies in Machine Learning Using the Game of Checkers’. In: IBM Journal of Research and Development 3.3 (July 1959), pp. 210–229. doi: 10.1147/rd.33.0210 (cited on page 2).
Saul, Lawrence K., Jaakkola, Tommi, and Jordan, Michael I. ‘Mean Field Theory for Sigmoid Belief Networks’. In: Journal of Artificial Intelligence Research 4 (1996), pp. 61–76 (cited on page 326).
Schapire, Robert E. ‘The Strength of Weak Learnability’. In: Machine Learning 5.2 (1990), pp. 197–227. doi: 10.1023/A:1022648800760 (cited on pages 204, 209, 210).
Schapire, Robert E. et al. ‘Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods’. In: Proceedings of the Fourteenth International Conference on Machine Learning. ICML ’97. San Francisco, CA: Morgan Kaufmann Publishers Inc., 1997, pp. 322–330 (cited on pages 204, 214).
Schölkopf, Bernhard, Smola, Alexander, and Müller, Klaus-Robert. ‘Nonlinear Component Analysis as a Kernel Eigenvalue Problem’. In: Neural Computation 10.5 (July 1998), pp. 1299–1319. doi: 10.1162/089976698300017467 (cited on page 125).
Schuster, M. and Paliwal, K. K. ‘Bidirectional Recurrent Neural Networks’. In: IEEE Transactions on Signal Processing 45.11 (Nov. 1997), pp. 2673–2681. doi: 10.1109/78.650093 (cited on page 171).
Seide, Frank, Li, Gang, and Yu, Dong. ‘Conversational Speech Transcription Using Context-Dependent Deep Neural Networks’. In: Proceedings of Interspeech. Baixas, France: International Speech Communication Association, 2011, pp. 437–440 (cited on page 276).
Settles, Burr. Active Learning Literature Survey. Computer Sciences Technical Report 1648. Madison, WI: University of Wisconsin–Madison, 2009 (cited on page 17).
Shalev-Shwartz, Shai and Ben-David, Shai. Understanding Machine Learning: From Theory to Algorithms. Cambridge, England: Cambridge University Press, 2014 (cited on pages 11, 14).
Shalev-Shwartz, Shai and Singer, Yoram. ‘A New Perspective on an Old Perceptron Algorithm’. In: International Conference on Computational Learning Theory. New York, NY: Springer, 2005, pp. 264–278 (cited on page 111).
Shannon, C. E. ‘A Mathematical Theory of Communication’. In: Bell System Technical Journal 27.3 (1948), pp. 379–423. doi: 10.1002/j.1538-7305.1948.tb01338.x (cited on page 41).
Shor, N. Z., Kiwiel, Krzysztof C., and Ruszczyński, Andrzej. Minimization Methods for Non-Differentiable Functions. Berlin, Germany: Springer-Verlag, 1985 (cited on page 71).
Silver, David et al. ‘Mastering the Game of Go with Deep Neural Networks and Tree Search’. In: Nature 529.7587 (Jan. 2016), pp. 484–489. doi: 10.1038/nature16961 (cited on page 16).
Slater, Morton. Lagrange Multipliers Revisited. Cowles Foundation Discussion Papers 80. New Haven, CT: Cowles Foundation for Research in Economics, Yale University, 1959 (cited on page 57).
Smolensky, P. ‘Information Processing in Dynamical Systems: Foundations of Harmony Theory’. In: Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations. Ed. by Rumelhart, David E., McClelland, James L., and PDP Research Group. Cambridge, MA: MIT Press, 1986, pp. 194–281 (cited on pages 366, 370).
Sollich, Peter and Krogh, Anders. ‘Learning with Ensembles: How Overfitting Can Be Useful’. In: Advances in Neural Information Processing Systems 7. Ed. by Touretzky, David S., Mozer, Michael, and Hasselmo, Michael E. Cambridge, MA: MIT Press, 1995, pp. 190–196 (cited on page 203).
Soltani, Rohollah and Jiang, Hui. ‘Higher Order Recurrent Neural Networks’. In: CoRR abs/1605.00064 (2016) (cited on pages 171, 201).
Sorenson, H. W. and Alspach, D. L. ‘Recursive Bayesian Estimation Using Gaussian Sums’. In: Automatica 7.4 (1971), pp. 465–479. doi: 10.1016/0005-1098(71)90097-5 (cited on page 268).
Srivastava, Nitish et al. ‘Dropout: A Simple Way to Prevent Neural Networks from Overfitting’. In: Journal of Machine Learning Research 15.1 (Jan. 2014), pp. 1929–1958 (cited on page 195).
Stephenson, W. ‘Technique of Factor Analysis’. In: Nature 136 (1935), p. 297. doi: 10.1038/136297b0 (cited on pages 293, 294, 296, 298).
Sutskever, Ilya, Vinyals, Oriol, and Le, Quoc V. ‘Sequence to Sequence Learning with Neural Networks’. In: Advances in Neural Information Processing Systems 27. Ed. by Ghahramani, Z. et al. Red Hook, NY: Curran Associates, Inc., 2014, pp. 3104–3112 (cited on page 198).
Sutton, C. and McCallum, A. ‘An Introduction to Conditional Random Fields for Relational Learning’. In: Introduction to Statistical Relational Learning. Ed. by Getoor, Lise and Taskar, Ben. Cambridge, MA: MIT Press, 2007 (cited on pages 366, 369).
Sutton, Richard S. and Barto, Andrew G. Reinforcement Learning: An Introduction. 2nd ed. Cambridge, MA: MIT Press, 2018 (cited on page 15).
Tenenbaum, Joshua B., de Silva, Vin, and Langford, John C. ‘A Global Geometric Framework for Nonlinear Dimensionality Reduction’. In: Science 290.5500 (2000), p. 2319 (cited on page 88).
Tibshirani, Robert. ‘Regression Shrinkage and Selection via the Lasso’. In: Journal of the Royal Statistical Society, Series B 58 (1996), pp. 267–288 (cited on page 140).
Tipping, M. E. and Bishop, Christopher. ‘Mixtures of Probabilistic Principal Component Analyzers’. In: Neural Computation 11 (Jan. 1999), pp. 443–482 (cited on pages 297, 298).
Tipping, Michael E. and Bishop, Chris M. ‘Probabilistic Principal Component Analysis’. In: Journal of the Royal Statistical Society, Series B 61.3 (1999), pp. 611–622 (cited on pages 293, 294, 296).
Titterington, D. M., Smith, A. F. M., and Makov, U. E. Statistical Analysis of Finite Mixture Distributions. New York, NY: Wiley, 1985 (cited on page 257).
Turney, Peter D. and Pantel, Patrick. ‘From Frequency to Meaning: Vector Space Models of Semantics’. In: Journal of Artificial Intelligence Research 37.1 (Jan. 2010), pp. 141–188 (cited on pages 142, 149).
Vanschoren, Joaquin. ‘Meta-Learning’. In: Automated Machine Learning: Methods, Systems, Challenges. Ed. by Hutter, Frank, Kotthoff, Lars, and Vanschoren, Joaquin. Cham, Switzerland: Springer International Publishing, 2019, pp. 35–61. doi: 10.1007/978-3-030-05318-5_2 (cited on page 16).
Vapnik, Vladimir N. The Nature of Statistical Learning Theory. Berlin, Germany: Springer-Verlag, 1995 (cited on pages 102, 103).
Vapnik, Vladimir N. Statistical Learning Theory. Hoboken, NJ: Wiley-Interscience, 1998 (cited on pages 102, 103).
Vaswani, Ashish et al. ‘Attention Is All You Need’. In: Advances in Neural Information Processing Systems 30. Ed. by Von Luxburg, U. Red Hook, NY: Curran Associates, Inc., 2017, pp. 5998–6008 (cited on pages 164, 172, 173, 199).
Viterbi, Andrew J. ‘Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm’. In: IEEE Transactions on Information Theory 13.2 (1967), pp. 260–269 (cited on pages 279, 357).
Waibel, Alexander et al. ‘Phoneme Recognition Using Time-Delay Neural Networks’. In: IEEE Transactions on Acoustics, Speech, and Signal Processing 37.3 (1989), pp. 328–339 (cited on page 161).
Waterhouse, Steve R., MacKay, David, and Robinson, Anthony J. ‘Bayesian Methods for Mixtures of Experts’. In: Advances in Neural Information Processing Systems 8. Ed. by Touretzky, D. S., Mozer, M. C., and Hasselmo, M. E. Cambridge, MA: MIT Press, 1996, pp. 351–357 (cited on page 326).
Watkins, C. J. C. H. ‘Learning from Delayed Rewards’. PhD thesis. Cambridge, England: King's College, University of Cambridge, 1989 (cited on page 15).
Werbos, P. J. ‘Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences’. PhD thesis. Cambridge, MA: Harvard University, 1974 (cited on pages 153, 176).
Weston, J. and Watkins, C. ‘Support Vector Machines for Multiclass Pattern Recognition’. In: Proceedings of the Seventh European Symposium on Artificial Neural Networks. European Symposium on Artificial Neural Networks, Apr. 1999 (cited on page 127).
Williams, C. K. I. and Barber, D. ‘Bayesian Classification with Gaussian Processes’. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 20.12 (1998), pp. 1342–1351 (cited on page 339).
Wolpert, David H. ‘Stacked Generalization’. In: Neural Networks 5.2 (1992), pp. 241–259. doi: 10.1016/S0893-6080(05)80023-1 (cited on page 204).
Wolpert, David H. ‘The Lack of a Priori Distinctions between Learning Algorithms’. In: Neural Computation 8.7 (Oct. 1996), pp. 1341–1390. doi: 10.1162/neco.1996.8.7.1341 (cited on page 11).
Yamaguchi, Kouichi et al. ‘A Neural Network for Speaker-Independent Isolated Word Recognition’. In: First International Conference on Spoken Language Processing (ICSLP 90). International Speech Communication Association, 1990, pp. 1077–1080 (cited on page 159).
Yang, Liu and Jin, Rong. Distance Metric Learning: A Comprehensive Survey. 2006. url: https://www.cs.cmu.edu/~liuy/frame_survey_v2.pdf (cited on page 13).
Young, Steve. ‘A Review of Large Vocabulary Continuous Speech Recognition’. In: IEEE Signal Processing Magazine 13.5 (Sept. 1996), pp. 45–57. doi: 10.1109/79.536824 (cited on page 276).
Young, Steve J., Russell, N. H., and Thornton, J. H. S. Token Passing: A Simple Conceptual Model for Connected Speech Recognition Systems. Tech. rep. Cambridge, England: Cambridge University Engineering Department, 1989 (cited on page 280).
Young, Steve et al. The HTK Book. Tech. rep. Cambridge, England: Cambridge University Engineering Department, 2002 (cited on page 286).
Zakka, Kevin. Deriving the Gradient for the Backward Pass of Batch Normalization. 2016. url: http://kevinzakka.github.io/2016/09/14/batch_normalization/ (visited on 11/20/2019) (cited on page 183).
Zeiler, Matthew D. ‘ADADELTA: An Adaptive Learning Rate Method’. In: CoRR abs/1212.5701 (2012) (cited on page 192).
Zhang, Shiliang, Jiang, Hui, and Dai, Lirong. ‘Hybrid Orthogonal Projection and Estimation (HOPE): A New Framework to Learn Neural Networks’. In: Journal of Machine Learning Research 17.37 (2016), pp. 1–33. url: http://jmlr.org/papers/v17/15-335.html (cited on pages 293, 294, 302, 303, 379).
Zhang, Shiliang et al. ‘Feedforward Sequential Memory Networks: A New Structure to Learn Long-Term Dependency’. In: CoRR abs/1512.08301 (2015) (cited on pages 161, 202).
Zhang, Shiliang et al. ‘Rectified Linear Neural Networks with Tied-Scalar Regularization for LVCSR’. In: INTERSPEECH 2015, 16th Annual Conference of the International Speech Communication Association, Dresden, Germany, September 6–10, 2015. International Speech Communication Association, 2015, pp. 2635–2639 (cited on page 194).
Zhang, Shiliang et al. ‘The Fixed-Size Ordinally-Forgetting Encoding Method for Neural Network Language Models’. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing. Beijing, China: Association for Computational Linguistics, July 2015, pp. 495–500. doi: 10.3115/v1/P15-2081 (cited on page 78).
Zhang, Shiliang et al. ‘Nonrecurrent Neural Structure for Long-Term Dependence’. In: IEEE/ACM Transactions on Audio, Speech, and Language Processing 25.4 (2017), pp. 871–884 (cited on page 161).
