
References

Published online by Cambridge University Press:  17 May 2022

Sean Meyn, University of Florida

Publisher: Cambridge University Press
Print publication year: 2022

Access options

Get access to the full version of this content by using one of the access options below. (Log in options will check for institutional or personal access. Content may require purchase if you do not have access.)


Agarwal, A. and Dekel, O. Optimal algorithms for online convex optimization with multi-point bandit feedback. In Proc. COLT, pages 28–40, 2010.
Agarwal, A., Kakade, S. M., Lee, J. D., and Mahajan, G. Optimality and approximation with policy gradient methods in Markov decision processes. In Proc. COLT, pages 64–66, 2020.
Agrawal, R. Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, pages 1054–1078, 1995.
Alekseev, V. M. An estimate for the perturbations of the solutions of ordinary differential equations (Russian). Vestnik Moskov. Univ. Ser., 1:28–36, 1961.
Alvarez, F., Attouch, H., Bolte, J., and Redont, P. A second-order gradient-like dissipative dynamical system with Hessian-driven damping: application to optimization and mechanics. Journal de mathématiques pures et appliquées, 81(8):747–779, 2002.
Amari, S.-I. and Douglas, S. C. Why natural gradient? In ICASSP'98, volume 2, pages 1213–1216. IEEE, 1998.
Anderson, B. D. O. and Moore, J. B. Optimal Control: Linear Quadratic Methods. Prentice Hall, Englewood Cliffs, NJ, 1990.
Andrew, L. L., Lin, M., and Wierman, A. Optimality, fairness, and robustness in speed scaling designs. SIGMETRICS Perform. Eval. Rev., 38(1):37–48, June 2010.
Anschel, O., Baram, N., and Shimkin, N. Averaged-DQN: Variance reduction and stabilization for deep reinforcement learning. In Proc. ICML, pages 176–185. JMLR.org, 2017.
Arapostathis, A., Borkar, V. S., Fernandez-Gaucherand, E., Ghosh, M. K., and Marcus, S. I. Discrete-time controlled Markov processes with average cost criterion: a survey. SIAM J. Control Optim., 31:282–344, 1993.
Ariyur, K. B. and Krstić, M. Real Time Optimization by Extremum Seeking Control. John Wiley & Sons, Inc., New York, NY, 2003.
Asmussen, S. and Glynn, P. W. Stochastic Simulation: Algorithms and Analysis, volume 57 of Stochastic Modelling and Applied Probability. Springer-Verlag, New York, NY, 2007.
Åström, K. J. Optimal control of Markov processes with incomplete state information I. J. of Mathematical Analysis and Applications, 10:174–205, 1965.
Åström, K. J. and Furuta, K. Swinging up a pendulum by energy control. Automatica, 36(2):287–295, 2000.
Åström, K. J. and Murray, R. M. Feedback Systems: An Introduction for Scientists and Engineers. Princeton University Press, Princeton, NJ, 2nd ed., 2020.
Attouch, H., Goudou, X., and Redont, P. The heavy ball with friction method, I. The continuous dynamical system: global exploration of the local minima of a real-valued function by asymptotic analysis of a dissipative dynamical system. Communications in Contemporary Mathematics, 2(1):1–34, 2000.
Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2–3):235–256, 2002.
Azar, M. G., Munos, R., Ghavamzadeh, M., and Kappen, H. Speedy Q-learning. In Proc. Advances in Neural Information Processing Systems, pages 2411–2419, 2011.
Bach, F. Learning Theory from First Principles. www.di.ens.fr/~fbach/ltfp_book.pdf, 2021.
Bach, F. and Moulines, E. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Proc. Advances in Neural Information Processing Systems, volume 26, pages 773–781, 2013.
Baird, L. Residual algorithms: reinforcement learning with function approximation. In Prieditis, A. and Russell, S., editors, Proc. Machine Learning, pages 30–37. Morgan Kaufmann, San Francisco, CA, 1995.
Baird, L. C. Reinforcement learning in continuous time: advantage updating. In Proc. of Intl. Conference on Neural Networks, volume 4, pages 2448–2453. IEEE, 1994.
Baird III, L. C. Reinforcement Learning through Gradient Descent. PhD thesis, US Air Force Academy, 1999.
Ball, F., Larédo, C., Sirl, D., and Tran, V. C. Stochastic Epidemic Models with Inference, volume 2255. Springer Nature, Cham, 2019.
Bansal, N., Kimbrel, T., and Pruhs, K. Speed scaling to manage energy and temperature. J. ACM, 54(1):1–39, March 2007.
Barto, A., Sutton, R., and Anderson, C. Neuron-like adaptive elements that can solve difficult learning control problems. IEEE Trans. on Systems, Man and Cybernetics, 13(5):835–846, 1983.
Barto, A. G., Sutton, R. S., and Watkins, C. J. C. H. Learning and sequential decision making. In Gabriel, M. and Moore, J. W., editors, Learning and Computational Neuroscience: Foundations of Adaptive Networks, pages 539–602. MIT Press, Cambridge, MA, 1989.
Bas Serrano, J., Curi, S., Krause, A., and Neu, G. Logistic Q-learning. In Banerjee, A. and Fukumizu, K., editors, Proc. of the Intl. Conference on Artificial Intelligence and Statistics, volume 130, pages 3610–3618, April 13–15, 2021.
Basar, T., Meyn, S., and Perkins, W. R. Lecture notes on control system theory and design. arXiv e-print 2007.01367, 2010.
Baumann, N. Too fast to fail: is high-speed trading the next Wall Street disaster? Mother Jones, January/February 2013.
Baxter, J. and Bartlett, P. L. Direct gradient-based reinforcement learning: I. Gradient estimation algorithms. Technical report, Australian National University, 1999.
Baxter, J. and Bartlett, P. L. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319–350, 2001.
Beck, J. Strong Uniformity and Large Dynamical Systems. World Scientific, Hackensack, NJ, 2017.
Bellman, R. The stability of solutions of linear differential equations. Duke Math. J., 10(4):643–647, 1943.
Bellman, R. Dynamic Programming. Princeton University Press, Princeton, NJ, 1957.
Bellman, R., Bentsman, J., and Meerkov, S. M. Stability of fast periodic systems. In Proc. of the American Control Conf., volume 3, pages 1319–1320. IEEE, 1984.
Benaïm, M. Dynamics of stochastic approximation algorithms. In Séminaire de Probabilités, XXXIII, pages 1–68. Springer, Berlin, 1999.
Benveniste, A., Métivier, M., and Priouret, P. Adaptive Algorithms and Stochastic Approximations, volume 22 of Applications of Mathematics (New York). Springer-Verlag, Berlin, 1990. Translated from the French by Stephen S. Wilson.
Benveniste, A., Métivier, M., and Priouret, P. Adaptive Algorithms and Stochastic Approximations, volume 22. Springer Science & Business Media, Berlin, Heidelberg, 2012.
Bernstein, A., Chen, Y., Colombino, M., Dall'Anese, E., Mehta, P., and Meyn, S. Optimal rate of convergence for quasi-stochastic approximation. arXiv:1903.07228, 2019.
Bernstein, A., Chen, Y., Colombino, M., Dall'Anese, E., Mehta, P., and Meyn, S. Quasi-stochastic approximation and off-policy reinforcement learning. In Proc. of the Conf. on Dec. and Control, pages 5244–5251, March 2019.
Bertsekas, D. Multiagent rollout algorithms and reinforcement learning. arXiv preprint arXiv:1910.00120, 2019.
Bertsekas, D. and Shreve, S. Stochastic Optimal Control: The Discrete-Time Case. Athena Scientific, Belmont, MA, 1996.
Bertsekas, D. and Tsitsiklis, J. N. Neuro-Dynamic Programming. Athena Scientific, Cambridge, MA, 1996.
Bertsekas, D. P. Dynamic Programming and Optimal Control, volume II. Athena Scientific, Belmont, MA, 4th ed., 2012.
Bertsekas, D. P. Dynamic Programming and Optimal Control, volume 1. Athena Scientific, Belmont, MA, 4th ed., 2017.
Bertsekas, D. P. Reinforcement Learning and Optimal Control. Athena Scientific, Belmont, MA, 2019.
Bhandari, J. and Russo, D. Global optimality guarantees for policy gradient methods. arXiv preprint arXiv:1906.01786, 2019.
Bhandari, J., Russo, D., and Singal, R. A finite time analysis of temporal difference learning with linear function approximation. In Proc. COLT, pages 1691–1692, 2018.
Bhatnagar, S. Simultaneous perturbation and finite difference methods. Wiley Encyclopedia of Operations Research and Management Science, https://onlinelibrary.wiley.com/doi/10.1002/9780470400531.eorms0784, 2010.
Bhatnagar, S. and Borkar, V. S. Multiscale chaotic SPSA and smoothed functional algorithms for simulation optimization. Simulation, 79(10):568–580, 2003.
Bhatnagar, S., Fu, M. C., Marcus, S. I., and Wang, I.-J. Two-timescale simultaneous perturbation stochastic approximation using deterministic perturbation sequences. ACM Transactions on Modeling and Computer Simulation (TOMACS), 13(2):180–209, 2003.
Bhatnagar, S., Ghavamzadeh, M., Lee, M., and Sutton, R. S. Incremental natural actor-critic algorithms. In Proc. Advances in Neural Information Processing Systems, pages 105–112, 2008.
Bhatnagar, S., Prasad, H., and Prashanth, L. Stochastic Recursive Algorithms for Optimization: Simultaneous Perturbation Methods. Lecture Notes in Control and Information Sciences. Springer, London, 2013.
Bhatnagar, S., Precup, D., Silver, D., Sutton, R. S., Maei, H. R., and Szepesvári, C. Convergent temporal-difference learning with arbitrary smooth function approximation. In Proc. Advances in Neural Information Processing Systems, pages 1204–1212, 2009.
Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M., and Lee, M. Natural actor–critic algorithms. Automatica, 45(11):2471–2482, 2009.
Bishop, C. M. Pattern Recognition and Machine Learning. Springer, 2006.
Blum, J. R. Multidimensional stochastic approximation methods. The Annals of Mathematical Statistics, 25(4):737–744, 1954.
Borkar, V. and Meyn, S. P. Oja's algorithm for graph clustering, Markov spectral decomposition, and risk sensitive control. Automatica, 48(10):2512–2519, 2012.
Borkar, V. and Varaiya, P. Adaptive control of Markov chains, I: finite parameter set. IEEE Trans. Automat. Control, 24(6):953–957, 1979.
Borkar, V. and Varaiya, P. Identification and adaptive control of Markov chains. SIAM J. Control Optim., 20(4):470–489, 1982.
Borkar, V. S. Identification and Adaptive Control of Markov Chains. PhD thesis, University of California, Berkeley, 1980.
Borkar, V. S. Convex analytic methods in Markov decision processes. In Handbook of Markov Decision Processes, volume 40 of Internat. Ser. Oper. Res. Management Sci., pages 347–375. Kluwer Acad. Publ., Boston, MA, 2002.
Borkar, V. S. Reinforcement learning – a bridge between numerical methods and Markov chain Monte Carlo. In Sastry, N. S. N., Rajeev, B., Delampady, M., and Rao, T. S. S. R. K., editors, Perspectives in Mathematical Sciences, pages 71–91. World Scientific, Singapore, 2009.
Borkar, V. S. Stochastic Approximation: A Dynamical Systems Viewpoint. Hindustan Book Agency and Cambridge University Press (jointly), Delhi, India, and Cambridge, UK, 2008.
Borkar, V. S. Stochastic Approximation: A Dynamical Systems Viewpoint (2nd ed., to appear). Hindustan Book Agency, Delhi, India, and Cambridge, UK, 2020.
Borkar, V. S. and Gaitsgory, V. Linear programming formulation of long-run average optimal control problem. Journal of Optimization Theory and Applications, 181(1):101–125, 2019.
Borkar, V. S., Gaitsgory, V., and Shvartsman, I. LP formulations of discrete time long-run average optimal control problems: the non-ergodic case. SIAM Journal on Control and Optimization, 57(3):1783–1817, 2019.
Borkar, V. S. and Meyn, S. P. The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM J. Control Optim., 38(2):447–469, 2000.
Boyan, J. A. Technical update: least-squares temporal difference learning. Mach. Learn., 49(2–3):233–246, 2002.
Boyd, S., El Ghaoui, L., Feron, E., and Balakrishnan, V. Linear Matrix Inequalities in System and Control Theory, volume 15. SIAM, 1994.
Boyd, S., Parikh, N., and Chu, E. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Now Publishers Inc., Norwell, MA, 2011.
Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, New York, NY, 1st ed., 2004.
Bradtke, S., Ydstie, B., and Barto, A. Adaptive linear quadratic control using policy iteration. In Proc. of the American Control Conf., volume 3, pages 3475–3479, 1994.
Bradtke, S. J. and Barto, A. G. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1–3):33–57, 1996.
Brogan, W. L. Modern Control Theory. Pearson, 3rd ed., 1990.
Bu, J., Mesbahi, A., Fazel, M., and Mesbahi, M. LQR through the lens of first order methods: discrete-time case. arXiv e-prints, arXiv:1907.08921, 2019.
Bubeck, S. and Cesa-Bianchi, N. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
Butcher, J. C. Numerical Methods for Ordinary Differential Equations. John Wiley & Sons, New York, NY, 2016.
Caines, P. E. Linear Stochastic Systems. John Wiley & Sons, New York, NY, 1988.
Caines, P. E. Mean field games. In Baillieul, J. and Samad, T., editors, Encyclopedia of Systems and Control, pages 706–712. Springer London, London, UK, 2015.
Chatterjee, D., Patra, A., and Joglekar, H. K. Swing-up and stabilization of a cart–pendulum system under restricted cart track length. Systems & Control Letters, 47(4):355–364, 2002.
Chen, H. and Guo, L. Identification and Stochastic Adaptive Control. Birkhäuser, Boston, MA, 1991.
Chen, R. T., Rubanova, Y., Bettencourt, J., and Duvenaud, D. Neural ordinary differential equations. In Proc. Advances in Neural Information Processing Systems, volume 32, pages 6572–6583, 2018.
Chen, S., Bernstein, A., Devraj, A., and Meyn, S. Stability and acceleration for quasi stochastic approximation. arXiv:2009.14431, 2020.
Chen, S., Devraj, A., Bernstein, A., and Meyn, S. Accelerating optimization and reinforcement learning with quasi stochastic approximation. In Proc. of the American Control Conf., pages 1965–1972, May 2021.
Chen, S., Devraj, A., Bernstein, A., and Meyn, S. Revisiting the ODE method for recursive algorithms: fast convergence using quasi stochastic approximation. Journal of Systems Science and Complexity, special issue on Advances on Fundamental Problems in Control Systems, in honor of Prof. Lei Guo's 60th birthday, 34(5):1681–1702, 2021.
Chen, S., Devraj, A., Borkar, V., Kontoyiannis, I., and Meyn, S. The ODE method for asymptotic statistics in stochastic approximation and reinforcement learning. Submitted for publication, 2021.
Chen, S., Devraj, A. M., Bušić, A., and Meyn, S. Explicit mean-square error bounds for Monte-Carlo and linear stochastic approximation. In Chiappa, S. and Calandra, R., editors, Proc. of AISTATS, volume 108, pages 4173–4183, 2020.
Chen, S., Devraj, A. M., Lu, F., Busic, A., and Meyn, S. Zap Q-learning with nonlinear function approximation. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., and Lin, H., editors, Advances in Neural Information Processing Systems, and arXiv e-prints 1910.05405, volume 33, pages 16879–16890, 2020.
Chen, T., Hua, Y., and Yan, W.-Y. Global convergence of Oja's subspace algorithm for principal component extraction. IEEE Trans. Neural Networks, 9(1):58–67, January 1998.
Chen, W., Huang, D., Kulkarni, A. A., et al. Approximate dynamic programming using fluid and diffusion approximations with applications to power management. In Proc. of the 48th IEEE Conf. on Dec. and Control, held jointly with the 2009 28th Chinese Control Conference, pages 3575–3580, 2009.
Chen, Y., Bernstein, A., Devraj, A., and Meyn, S. Model-free primal-dual methods for network optimization with application to real-time optimal power flow. In Proc. of the American Control Conf., pages 3140–3147, September 2019.
Chow, Y., Nachum, O., Duenez-Guzman, E., and Ghavamzadeh, M. A Lyapunov-based approach to safe reinforcement learning. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R., editors, Proc. Advances in Neural Information Processing Systems, pages 8092–8101, 2018.
Chung, K. L. et al. On a stochastic approximation method. The Annals of Mathematical Statistics, 25(3):463–483, 1954.
Colombino, M., Dall'Anese, E., and Bernstein, A. Online optimization as a feedback controller: stability and tracking. Trans. on Control of Network Systems, 7(1):422–432, 2020.
Cover, T. M. and Thomas, J. A. Elements of Information Theory. John Wiley & Sons Inc., New York, NY, 1991.
Dai, J. G. On positive Harris recurrence of multiclass queueing networks: a unified approach via fluid limit models. Ann. Appl. Probab., 5(1):49–77, 1995.
Dai, J. G. and Meyn, S. P. Stability and convergence of moments for multiclass queueing networks via fluid limit models. IEEE Trans. Automat. Control, 40:1889–1904, 1995.
Dai, J. G. and Vande Vate, J. H. The stability of two-station multi-type fluid networks. Operations Res., 48:721–744, 2000.
Dalal, G., Szörényi, B., Thoppe, G., and Mannor, S. Concentration bounds for two timescale stochastic approximation with applications to reinforcement learning. Proc. of the Conference on Computational Learning Theory, pages 1–35, 2017.
de Farias, D. P. and Van Roy, B. The linear programming approach to approximate dynamic programming. Operations Res., 51(6):850–865, 2003.
de Farias, D. P. and Van Roy, B. On constraint sampling in the linear programming approach to approximate dynamic programming. Math. Oper. Res., 29(3):462–478, 2004.
de Farias, D. P. and Van Roy, B. A cost-shaping linear program for average-cost approximate dynamic programming with performance guarantees. Math. Oper. Res., 31(3):597–620, 2006.
Dembo, A. and Zeitouni, O. Large Deviations Techniques and Applications. Springer-Verlag, New York, NY, 2nd ed., 1998.
Derman, C. Finite State Markovian Decision Processes, volume 67 of Mathematics in Science and Engineering. Academic Press, Inc., Orlando, FL, 1970.
Devraj, A. M. Reinforcement Learning Design with Optimal Learning Rate. PhD thesis, University of Florida, 2019.
Devraj, A. M., Bušić, A., and Meyn, S. On matrix momentum stochastic approximation and applications to Q-learning. In Allerton Conference on Communication, Control, and Computing, pages 749–756, September 2019.
Devraj, A. M., Bušić, A., and Meyn, S. Zap Q-learning – a user's guide. In Proc. of the Fifth Indian Control Conference, https://par.nsf.gov/servlets/purl/10211835, January 9–11, 2019.
Devraj, A. M., Bušić, A., and Meyn, S. Fundamental design principles for reinforcement learning algorithms. In Vamvoudakis, K. G., Wan, Y., Lewis, F. L., and Cansever, D., editors, Handbook on Reinforcement Learning and Control, Studies in Systems, Decision and Control (SSDC) series, volume 325. Springer, 2021.
Devraj, A. M., Kontoyiannis, I., and Meyn, S. P. Differential temporal difference learning. IEEE Trans. Automat. Control, 66(10):4652–4667, doi: 10.1109/TAC.2020.3033417, October 2021.
Devraj, A. M. and Meyn, S. P. Fastest convergence for Q-learning. arXiv e-prints, July 2017.
Devraj, A. M. and Meyn, S. P. Zap Q-learning. In Proc. of the Intl. Conference on Neural Information Processing Systems, pages 2232–2241, 2017.
Devraj, A. M. and Meyn, S. P. Q-learning with uniformly bounded variance: large discounting is not a barrier to fast learning. arXiv e-prints, arXiv:2002.10301 (and to appear, IEEE Trans. Automat. Control), February 2020.
Diaconis, P. The Markov chain Monte Carlo revolution. Bull. Amer. Math. Soc. (N.S.), 46(2):179–205, 2009.
Ding, D. and Jovanović, M. R. Global exponential stability of primal-dual gradient flow dynamics based on the proximal augmented Lagrangian. In Proc. of the American Control Conf., pages 3414–3419. IEEE, 2019.
Douc, R., Moulines, E., Priouret, P., and Soulier, P. Markov Chains. Springer, Cham, 2018.
Douc, R., Moulines, É., and Stoffer, D. Nonlinear Time Series: Theory, Methods and Applications with R Examples. Texts in Statistical Science. Chapman & Hall/CRC Press, 2014.
Duffy, K. and Meyn, S. Large deviation asymptotics for busy periods. Stochastic Systems, 4(1):300–319, 2014.
Duffy, K. R. and Meyn, S. P. Most likely paths to error when estimating the mean of a reflected random walk. Performance Evaluation, 67(12):1290–1303, 2010.
Dupree, K., Patre, P. M., Johnson, M., and Dixon, W. E. Inverse optimal adaptive control of a nonlinear Euler–Lagrange system, Part I: full state feedback. In Proc. of the Conference on Decision and Control, held jointly with Chinese Control Conference, pages 321–326, 2009.
Durrett, R. Stochastic spatial models. SIAM Review, 41(4):677–718, 1999.
Dynkin, E. B. and Yushkevich, A. A. Controlled Markov Processes, volume 235 of Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences]. Springer-Verlag, Berlin, 1979. Translated from the Russian original by Danskin, J. M. and Holland, C.
Even-Dar, E. and Mansour, Y. Learning rates for Q-learning. J. of Machine Learning Research, 5:1–25, 2003.
Fabian, V. et al. On asymptotic normality in stochastic approximation. The Annals of Mathematical Statistics, 39(4):1327–1332, 1968.
Farahmand, A.-M. and Ghavamzadeh, M. PID accelerated value iteration algorithm. In Meila, M. and Zhang, T., editors, Proc. ICML, volume 139, pages 3143–3153, July 18–24, 2021.
Fazlyab, M., Ribeiro, A., Morari, M., and Preciado, V. M. Analysis of optimization algorithms via integral quadratic constraints: nonstrongly convex problems. SIAM Journal on Optimization, 28(3):2654–2689, 2018.
Feinberg, E. and Shwartz, A., editors. Markov Decision Processes: Models, Methods, Directions, and Open Problems. Kluwer Acad. Publ., Holland, 2001.
Feinberg, E. A. and Shwartz, A., editors. Handbook of Markov Decision Processes: Methods and Applications. Intl. Series in Operations Research & Management Science, 40. Kluwer Academic Publishers, Boston, MA, 2002.
Feintuch, A. and Francis, B. Infinite chains of kinematic points. Automatica, 48(5):901–908, 2012.
Feng, Y., Li, L., and Liu, Q. A kernel loss for solving the Bellman equation. In Proc. Advances in Neural Information Processing Systems, pages 15456–15467, 2019.
Finlay, L., Gaitsgory, V., and Lebedev, I. Duality in linear programming problems related to deterministic long run average problems of optimal control. SIAM Journal on Control and Optimization, 47(4):1667–1700, 2008.
Flegal, J. M. and Jones, G. L. Batch means and spectral variance estimators in Markov chain Monte Carlo. Annals of Statistics, 38(2):1034–1070, April 2010.
Fort, G., Moulines, E., Meyn, S. P., and Priouret, P. ODE methods for Markov chain stability with applications to MCMC. In Valuetools '06: Proceedings of the 1st International Conference on Performance Evaluation Methodologies and Tools, page 42. ACM Press, New York, NY, 2006.
Foster, F. G. On Markoff chains with an enumerable infinity of states. Proc. Cambridge Phil. Soc., 47:587–591, 1952.
Fradkov, A. and Polyak, B. T. Adaptive and robust control in the USSR. IFAC–PapersOnLine, 53(2):1373–1378, 2020. 21st IFAC World Congress.
Furuta, K., Yamakita, M., and Kobayashi, S. Swing up control of inverted pendulum. In Proc. Intl. Conference on Industrial Electronics, Control and Instrumentation, pages 2193–2198. IEEE, 1991.
Gagniuc, P. A. Markov Chains: From Theory to Implementation and Experimentation. John Wiley & Sons, New York, NY, 2017.
Gaitsgory, V., Parkinson, A., and Shvartsman, I. Linear programming formulations of deterministic infinite horizon optimal control problems in discrete time. Discrete and Continuous Dynamical Systems – Series B, 22(10):3821–3838, 2017.
Gaitsgory, V. and Quincampoix, M. On sets of occupational measures generated by a deterministic control system on an infinite time horizon. Nonlinear Analysis: Theory, Methods and Applications, 88:27–41, 2013.
George, J. M. and Harrison, J. M. Dynamic control of a queue with adjustable service rate. Operations Res., 49(5):720–731, September 2001.
Glynn, P. W. Stochastic approximation for Monte Carlo optimization. In Proc. of the 18th Conference on Winter Simulation, pages 356–365, 1986.
Glynn, P. W. Likelihood ratio gradient estimation: an overview. In Proc. of the Winter Simulation Conference, pages 366–375, 1987.
Glynn, P. W. and Meyn, S. P. A Liapounov bound for solutions of the Poisson equation. Ann. Probab., 24(2):916–931, 1996.
Goodwin, G. C. and Sin, K. S. Adaptive Filtering Prediction and Control. Prentice Hall, Englewood Cliffs, NJ, 1984.
Gordon, G. J. Stable function approximation in dynamic programming. In Proc. ICML (see also the full-length technical report, CMU-CS-95-103), pages 261–268. Elsevier, Netherlands, 1995.
Gordon, G. J. Reinforcement learning with function approximation converges to a region. In Proc. of the 13th Intl. Conference on Neural Information Processing Systems, pages 996–1002, Cambridge, MA, 2000.
Gosavi, A. Simulation-Based Optimization. Springer, Berlin, 2015.
Graham, R. L., Knuth, D. E., and Patashnik, O. Concrete Mathematics: A Foundation for Computer Science. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, 2nd ed., 1994.
Greenemeier, L. AI versus AI: self-taught AlphaGo Zero vanquishes its predecessor. Scientific American, 371(4), www.scientificamerican.com/article/ai-versus-ai-self-taught-alphago-zero-vanquishes-its-predecessor/, October 2017.
Greensmith, E., Bartlett, P. L., and Baxter, J. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5:1471–1530, 2004.
Guan, P., Raginsky, M., and Willett, R. Online Markov decision processes with Kullback–Leibler control cost. IEEE Trans. Automat. Control, 59(6):1423–1438, June 2014.
Gupta, A., Jain, R., and Glynn, P. W. An empirical algorithm for relative value iteration for average-cost MDPs. In Proc. of the Conf. on Dec. and Control, pages 5079–5084, 2015.
Hajek, B. Random Processes for Engineers. Cambridge University Press, Cambridge, UK, 2015.
Hartman, P. On functions representable as a difference of convex functions. Pacific Journal of Mathematics, 9(3):707–713, 1959.
Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning. Springer Series in Statistics. Springer-Verlag, New York, NY, 2nd ed., 2001. Corr. 3rd printing, 2003.
Henderson, S. Variance Reduction via an Approximating Markov Process. PhD thesis, Stanford University, 1997.
Henderson, S. G. and Glynn, P. W. Regenerative steady-state simulation of discrete event systems. ACM Trans. on Modeling and Computer Simulation, 11:313–345, 2001.
Henderson, S. G., Meyn, S. P., and Tadić, V. B. Performance evaluation and policy selection in multiclass networks. Discrete Event Dynamic Systems: Theory and Applications, 13(1–2):149–189, 2003. Special issue on learning, optimization and decision making (invited).
Hernández-Hernández, D., Hernández-Lerma, O., and Taksar, M. The linear programming approach to deterministic optimal control problems. Applicationes Mathematicae, 24(1):17–33, 1996.
Hernández-Lerma, O. and Lasserre, J. B. The linear programming approach. In Handbook of Markov Decision Processes, volume 40 of Internat. Ser. Oper. Res. Management Sci., pages 377–407. Kluwer Acad. Publ., Boston, MA, 2002.
Hernández-Lerma, O. and Lasserre, J. B. Discrete-Time Markov Control Processes: Basic Optimality Criteria, volume 30. Springer Science & Business Media, New York, NY, 2012.
Hu, B. and Lessard, L. Dissipativity theory for Nesterov's accelerated method. In Proc. ICML, pages 1549–1557, 2017.
Hu, B., Wright, S., and Lessard, L. Dissipativity theory for accelerating stochastic variance reduction: a unified analysis of SVRG and Katyusha using semidefinite programs. In Proc. ICML, pages 2038–2047, 2018.
Huang, D., Chen, W., Mehta, P., Meyn, S., and Surana, A. Feature selection for neuro-dynamic programming. In Lewis, F., editor, Reinforcement Learning and Approximate Dynamic Programming for Feedback Control. Wiley, Hoboken, NJ, 2011.
Huang, M., Caines, P. E., and Malhame, R. P. Large-population cost-coupled LQG problems with nonuniform agents: individual-mass behavior and decentralized ε-Nash equilibria. IEEE Trans. Automat. Control, 52(9):1560–1571, 2007.
Huang, M., Malhame, R. P., and Caines, P. E. Large population stochastic dynamic games: closed-loop McKean–Vlasov systems and the Nash certainty equivalence principle. Communications in Information and Systems, 6(3):221–251, 2006.
Iserles, A. A First Course in the Numerical Analysis of Differential Equations, volume 44. Cambridge University Press, 2009.
Jaakkola, T., Jordan, M., and Singh, S. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6:1185–1201, 1994.
Jamieson, K. G., Nowak, R., and Recht, B. Query complexity of derivative-free optimization. In Proc. Advances in Neural Information Processing Systems, pages 2672–2680, 2012.
Jin, C., Allen-Zhu, Z., Bubeck, S., and Jordan, M. I. Is Q-learning provably efficient? Proc. Advances in Neural Information Processing Systems, 31:4863–4873, 2018.
Kakade, S. and Langford, J. Approximately optimal approximate reinforcement learning. In Proc. ICML, pages 267–274, 2002.
Kakade, S. M. A natural policy gradient. In Proc. Advances in Neural Information Processing Systems, pages 1531–1538, 2002.
Kalathil, D., Borkar, V. S., and Jain, R. Empirical Q-value iteration. Stochastic Systems, 11(1):1–18, 2021.
Kalman, R. E. Contributions to the theory of optimal control. Bol. Soc. Mat. Mexicana, 5:102–119, 1960.
Kalman, R. E. When is a linear control system optimal? Journal of Basic Engineering, 86:51, 1964.
Kamoutsi, A., Sutter, T., Mohajerin Esfahani, P., and Lygeros, J. On infinite linear programming and the moment approach to deterministic infinite horizon discounted optimal control problems. IEEE Control Systems Letters, 1(1):134–139, July 2017.
Kara, A. D. and Yuksel, S. Convergence of finite memory Q-learning for POMDPs and near optimality of learned policies under filter stability. arXiv preprint arXiv:2103.12158, 2021.
Karimi, H., Nutini, J., and Schmidt, M. Linear convergence of gradient and proximal-gradient methods under the Polyak–Łojasiewicz condition. In European Conference on Machine Learning and Knowledge Discovery in Databases, volume 9851, pages 795–811. Springer-Verlag, Berlin, Heidelberg, 2016.
Karmakar, P. and Bhatnagar, S. Two time-scale stochastic approximation with controlled Markov noise and off-policy temporal-difference learning. Math. Oper. Res., 43(1):130–151, 2018.
Khalil, H. K. Nonlinear Systems. Prentice Hall, Upper Saddle River, NJ, 3rd ed., 2002.
Kiefer, J. and Wolfowitz, J. Stochastic estimation of the maximum of a regression function. Ann. Math. Statist., 23(3):462–466, September 1952.
Kim, Y. H. and Lewis, F. L. High-Level Feedback Control with Neural Networks, volume 21. World Scientific, Hackensack, NJ, 1998.
Kiumarsi, B., Vamvoudakis, K. G., Modares, H., and Lewis, F. L. Optimal and autonomous control using reinforcement learning: a survey. Transactions on Neural Networks and Learning Systems, 29(6):2042–2062, 2017.
Kohs, G. AlphaGo. Ro*co Films, 2017.
Kokotović, P., Khalil, H. K., and O'Reilly, J. Singular Perturbation Methods in Control: Analysis and Design. Society for Industrial and Applied Mathematics, Philadelphia, PA, 1999.
Koller, D. and Parr, R. Policy iteration for factored MDPs. In Proc. of the 16th Conference on Uncertainty in Artificial Intelligence, pages 326–334, 2000.
Konda, V. Actor-Critic Algorithms. PhD thesis, Massachusetts Institute of Technology, 2002.
Konda, V. R. Learning algorithms for Markov decision processes. Master's thesis, Indian Institute of Science, Dept. of Computer Science and Automation, 1997.
Konda, V. R. and Borkar, V. S. Actor-critic–type learning algorithms for Markov decision processes. SIAM Journal on Control and Optimization, 38(1):94–123, 1999.
Konda, V. R. and Tsitsiklis, J. N. Actor-critic algorithms. In Proc. Advances in Neural Information Processing Systems, pages 1008–1014, 2000.
Konda, V. R. and Tsitsiklis, J. N. On actor-critic algorithms. SIAM J. Control Optim., 42(4):1143–1166 (electronic), 2003.
Konda, V. R. and Tsitsiklis, J. N. Convergence rate of linear two-time-scale stochastic approximation. Ann. Appl. Probab., 14(2):796–819, 2004.
Kontoyiannis, I., Lastras-Montaño, L. A., and Meyn, S. P. Relative entropy and exponential deviation bounds for general Markov chains. In Proc. of the IEEE Intl. Symposium on Information Theory, pages 1563–1567, September 2005.
Kontoyiannis, I., Lastras-Montaño, L. A., and Meyn, S. P. Exponential bounds and stopping rules for MCMC and general Markov chains. In Proc. of the 1st Intl. Conference on Performance Evaluation Methodologies and Tools, Valuetools '06, pages 1563–1567. Association for Computing Machinery, New York, NY, 2006.
Kontoyiannis, I. and Meyn, S. P. Spectral theory and limit theorems for geometrically ergodic Markov processes. Ann. Appl. Probab., 13:304–362, 2003.
Kontoyiannis, I. and Meyn, S. P. Large deviations asymptotics and the spectral theory of multiplicatively regular Markov processes. Electron. J. Probab., 10(3):61–123 (electronic), 2005.
Kovachki, N. B. and Stuart, A. M. Continuous time analysis of momentum methods. J. of Machine Learning Research, 22(17):1–40, 2021.
Krener, A. Feedback linearization. In Baillieul, J. and Willems, J. C., editors, Mathematical Control Theory, pages 66–98. Springer, 1999.
Krichene, W. and Bartlett, P. L. Acceleration and averaging in stochastic descent dynamics. Proc. Advances in Neural Information Processing Systems, 30:6796–6806, 2017.
Krishnamurthy, V. Structural results for partially observed Markov decision processes. arXiv e-prints, arXiv:1512.03873, 2015.
Krstic, M., Kokotovic, P. V., and Kanellakopoulos, I. Nonlinear and Adaptive Control Design. John Wiley & Sons, Inc., New York, NY, 1995.
Kumar, P. R. and Seidman, T. I. Dynamic instabilities and stabilization methods in distributed real-time scheduling of manufacturing systems. IEEE Trans. Automat. Control, AC-35(3):289–298, March 1990.
Kushner, H. J. and Yin, G. G. Stochastic Approximation Algorithms and Applications, volume 35 of Applications of Mathematics (New York). Springer-Verlag, New York, NY, 1997.
Kwakernaak, H. and Sivan, R. Linear Optimal Control Systems. Wiley-Interscience, New York, NY, 1972.
Lagoudakis, M. G. and Parr, R. Model-free least-squares policy iteration. In Proc. Advances in Neural Information Processing Systems, pages 1547–1554, 2002.
Lai, T. L. Information bounds, certainty equivalence and learning in asymptotically efficient adaptive control of time-invariant stochastic systems. In Gerencsér, L. and Caines, P. E., editors, Topics in Stochastic Systems: Modelling, Estimation and Adaptive Control, pages 335–368. Springer Verlag, Heidelberg, Germany, 1991.
Lai, T. L. and Robbins, H. Asymptotically efficient adaptive allocation rules. Adv. in Appl. Math., 6(1):4–22, 1985.
Lakshminarayanan, C. and Bhatnagar, S. A stability criterion for two timescale stochastic approximation schemes. Automatica, 79:108–114, 2017.
Lakshminarayanan, C. and Szepesvari, C. Linear stochastic approximation: how far does constant step-size and iterate averaging go? In Intl. Conference on Artificial Intelligence and Statistics, pages 1347–1355, 2018.
Lange, S., Gabel, T., and Riedmiller, M. Batch reinforcement learning. In Reinforcement Learning, pages 45–73. Springer, Freiberg, Germany, 2012.
Lapeyre, B., Pagès, G., and Sab, K. Sequences with low discrepancy: generalisation and application to Robbins–Monro algorithm. Statistics, 21(2):251–272, 1990.
Laruelle, S. and Pagès, G. Stochastic approximation with averaging innovation applied to finance. Monte Carlo Methods and Applications, 18(1):1–51, 2012.
Lasry, J. M. and Lions, P. L. Mean field games. Japan. J. Math., 2:229–260, 2007.
Lasserre, J.-B. Moments, Positive Polynomials and Their Applications, volume 1. World Scientific, Hackensack, NJ, 2010.
Lattimore, T. and Szepesvari, C. Bandit Algorithms. Cambridge University Press, Cambridge, UK, 2020.
Le Blanc, M. Sur l'electrification des chemins de fer au moyen de courants alternatifs de frequence elevee [On the electrification of railways by means of alternating currents of high frequency]. Revue Generale de l'Electricite, 12(8):275–277, 1922.
Lee, D. and He, N. Stochastic primal-dual Q-learning algorithm for discounted MDPs. In Proc. of the American Control Conf., pages 4897–4902, July 2019.
Lee, D. and He, N. A unified switching system perspective and ODE analysis of Q-learning algorithms. arXiv e-prints, arXiv:1912.02270, 2019.
Lee, J. and Sutton, R. S. Policy iterations for reinforcement learning problems in continuous time and space – fundamental theory and methods. Automatica, 126:109421, 2021.
Lewis, F. L. and Liu, D. Reinforcement Learning and Approximate Dynamic Programming for Feedback Control, volume 17. Wiley-IEEE Press, Hoboken, NJ, 2013.
Lewis, F. L., Vrabie, D., and Vamvoudakis, K. G. Reinforcement learning and feedback control: using natural decision methods to design optimal adaptive controllers. Control Systems Magazine, 32(6):76–105, December 2012.
Lewis, M. Flash Boys: A Wall Street Revolt. W. W. Norton & Company, New York, NY, 2014.
Li, L. and Fu, J. Topological approximate dynamic programming under temporal logic constraints. In Proc. of the Conf. on Dec. and Control, pages 5330–5337, 2019.
Liggett, T. M. Stochastic Interacting Systems: Contact, Voter and Exclusion Processes, volume 324. Springer Science & Business Media, New York, NY, 2013.
Lipp, T. and Boyd, S. Variations and extension of the convex–concave procedure. Optimization and Engineering, 17(2):263–287, 2016.
Littman, M. L. and Szepesvári, C. A generalized reinforcement-learning model: convergence and applications. In Proc. ICML, volume 96, pages 310–318, 1996.
Liu, S. and Krstic, M. Introduction to extremum seeking. In Stochastic Averaging and Stochastic Extremum Seeking, Communications and Control Engineering. Springer, London, UK, 2012.
Ljung, L. Analysis of recursive stochastic algorithms. Trans. on Automatic Control, 22(4):551–575, 1977.
Luenberger, D. Linear and Nonlinear Programming. Kluwer Academic Publishers, Norwell, MA, 2nd ed., 2003.
Luenberger, D. G. Optimization by Vector Space Methods. John Wiley & Sons Inc., New York, NY, 1969. Reprinted 1997.
Lund, R. B., Meyn, S. P., and Tweedie, R. L. Computable exponential convergence rates for stochastically ordered Markov processes. Ann. Appl. Probab., 6(1):218–237, 1996.
MacKay, D. J. C. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, Cambridge, UK, 2003. Available from www.inference.phy.cam.ac.uk/mackay/itila/.
Maei, H. R., Szepesvári, C., Bhatnagar, S., and Sutton, R. S. Toward off-policy learning control with function approximation. In Proc. ICML, pages 719–726. Omnipress, Madison, WI, 2010.
Mandl, P. Estimation and control in Markov chains. Advances in Applied Probability, 6(1):40–60, 1974.
Mania, H., Guy, A., and Recht, B. Simple random search provides a competitive approach to reinforcement learning. In Proc. Advances in Neural Information Processing Systems, pages 1800–1809, 2018.
Manne, A. S. Linear programming and sequential decisions. Management Sci., 6(3):259–267, 1960.
Marbach, P. and Tsitsiklis, J. N. Simulation-based optimization of Markov reward processes: implementation issues. In Proc. of the Conf. on Dec. and Control, volume 2, pages 1769–1774. IEEE, 1999.
Marbach, P. and Tsitsiklis, J. N. Simulation-based optimization of Markov reward processes. IEEE Trans. Automat. Control, 46(2):191–209, 2001.
Mareels, I. M., Anderson, B. D., Bitmead, R. R., Bodson, M., and Sastry, S. S. Revisiting the MIT rule for adaptive control. In Åström, K. J. and Wittenmark, B., editors, Adaptive Systems in Control and Signal Processing 1986, pages 161–166. Elsevier, Netherlands, 1987.
Matni, N., Proutiere, A., Rantzer, A., and Tu, S. From self-tuning regulators to reinforcement learning and back again. In Proc. of the Conf. on Dec. and Control, pages 3724–3740, 2019.
Mayne, D., Rawlings, J., Rao, C., and Scokaert, P. Constrained model predictive control: stability and optimality. Automatica, 36(6):789–814, 2000.
Mayne, D. Q. Model predictive control: recent developments and future promise. Automatica, 50(12):2967–2986, 2014.
Mazumdar, E., Pacchiano, A., Ma, Y.-A., Bartlett, P. L., and Jordan, M. I. On Thompson sampling with Langevin algorithms. arXiv e-prints, 2020.
Mehta, P. G. and Meyn, S. P. Q-learning and Pontryagin's minimum principle. In Proc. of the Conf. on Dec. and Control, pages 3598–3605, December 2009.
Mehta, P. G. and Meyn, S. P. Convex Q-learning, part 1: deterministic optimal control. arXiv e-prints, arXiv:2008.03559, 2020.
Mehta, P. G., Meyn, S. P., Neu, G., and Lu, F. Convex Q-learning. In Proc. of the American Control Conf., pages 4749–4756, 2021.
Mei, J., Xiao, C., Szepesvari, C., and Schuurmans, D. On the global convergence rates of softmax policy gradient methods. arXiv e-print 2005.06392, 2020.
Melo, F. S., Meyn, S. P., and Ribeiro, M. I. An analysis of reinforcement learning with function approximation. In Proc. ICML, pages 664–671. ACM, New York, NY, 2008.
Metivier, M. and Priouret, P. Théorèmes de convergence presque sûre pour une classe d'algorithmes stochastiques à pas décroissants [Almost sure convergence theorems for a class of stochastic algorithms with decreasing step sizes]. Prob. Theory Related Fields, 74:403–428, 1987.
Meyer, C. D., Jr. The role of the group generalized inverse in the theory of finite Markov chains. SIAM Review, 17(3):443–464, 1975.
Meyn, S. P. Workload models for stochastic networks: value functions and performance evaluation. IEEE Trans. Automat. Control, 50(8):1106–1122, August 2005.
Meyn, S. P. Large deviation asymptotics and control variates for simulating large functions. Ann. Appl. Probab., 16(1):310–339, 2006.
Meyn, S. P. Control Techniques for Complex Networks. Cambridge University Press, 2007. Pre-publication ed. available online.
Meyn, S. P. and Mathew, G. Shannon meets Bellman: feature based Markovian models for detection and optimization. In Proc. of the Conf. on Dec. and Control, pages 5558–5564, 2008.
Meyn, S. P. and Tweedie, R. L. Computable bounds for convergence rates of Markov chains. Ann. Appl. Probab., 4:981–1011, 1994.
Meyn, S. P. and Tweedie, R. L. Markov Chains and Stochastic Stability. Cambridge University Press, Cambridge, UK, 2nd ed., 2009. Published in the Cambridge Mathematical Library; 1993 ed. online.
Michie, D. and Chambers, R. A. Boxes: an experiment in adaptive control. Machine Intelligence, 2(2):137–152, 1968.
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. CoRR, abs/1602.01783, 2016.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. A. Playing Atari with deep reinforcement learning. arXiv, abs/1312.5602, 2013.
Mnih, V., Kavukcuoglu, K., Silver, D., et al. Human-level control through deep reinforcement learning. Nature, 518:529–533, 2015.
Mohri, M., Rostamizadeh, A., and Talwalkar, A. Foundations of Machine Learning. MIT Press, Cambridge, MA, 2018.
Molzahn, D. K., Dörfler, F., Sandberg, H., Low, S. H., Chakrabarti, S., Baldick, R., and Lavaei, J. A survey of distributed optimization and control algorithms for electric power systems. Trans. on Smart Grid, 8(6):2941–2962, November 2017.
Moore, A. W. Efficient Memory-Based Learning for Robot Control. PhD thesis, University of Cambridge, Computer Laboratory, 1990.
Mou, W., Junchi Li, C., Wainwright, M. J., Bartlett, P. L., and Jordan, M. I. On linear stochastic approximation: fine-grained Polyak–Ruppert and non-asymptotic concentration. arXiv e-prints, arXiv:2004.04719, April 2020.
Moulines, E. and Bach, F. R. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems 24, pages 451–459, 2011.
Murphy, K. P. Machine Learning: A Probabilistic Perspective. MIT Press, Cambridge, MA, 2012.
Murray, R. Feedback control theory: architectures and tools for real-time decision making. Tutorial series at the Simons Institute Program on Real-Time Decision Making. https://simons.berkeley.edu/talks/murray-control-1, January 2018.
Nachum, O. and Dai, B. Reinforcement learning via Fenchel–Rockafellar duality. arXiv preprint arXiv:2001.01866, 2020.
Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. Bridging the gap between value and policy based reinforcement learning. In Proc. Advances in Neural Information Processing Systems, volume 10, page 8, 2017.
Nedic, A. and Bertsekas, D. Least squares policy evaluation algorithms with linear function approximation. Discrete Event Dynamic Systems: Theory and Applications, 13(1–2):79–110, 2003.
Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
Nesterov, Y. Lectures on Convex Optimization. Springer Optimization and Its Applications 137. Springer Intl. Publishing, New York, NY, 2018.
Nesterov, Y. and Spokoiny, V. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2):527–566, 2017.
Norris, J. Markov Chains. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, UK, 1997.
Nowak, M. A. Evolutionary Dynamics: Exploring the Equations of Life. Harvard University Press, Cambridge, MA, 2006.
Nummelin, E. General Irreducible Markov Chains and Nonnegative Operators. Cambridge University Press, Cambridge, UK, 1984.
Oja, E. A simplified neuron model as a principal component analyzer. J. Math. Biol., 15(3):267–273, 1982.
Ormoneit, D. and Glynn, P. Kernel-based reinforcement learning in average-cost problems. Trans. on Automatic Control, 47(10):1624–1636, October 2002.
Orr, J. S. and Dennehy, C. J. Analysis of the X-15 flight 3-65-97 divergent limit-cycle oscillation. Journal of Aircraft, 54(1):135–148, 2017.
Osband, I., Van Roy, B., and Wen, Z. Generalization and exploration via randomized value functions. In Proc. ICML, pages 2377–2386, 2016.
Parikh, N. and Boyd, S. Proximal Algorithms. Foundations and Trends in Optimization. Now Publishers, Norwell, MA, 2013.
Park, J. B. and Lee, J. Y. Nonlinear adaptive control based on Lyapunov analysis: overview and survey. Journal of Institute of Control, Robotics and Systems, 20(3):261–269, 2014.
Perkins, T. J. and Barto, A. G. Lyapunov design for safe reinforcement learning. J. Mach. Learn. Res., 3:803–832, 2003.
Peters, J., Vijayakumar, S., and Schaal, S. Reinforcement learning for humanoid robotics. In Proc. of the IEEE-RAS International Conference on Humanoid Robots, pages 1–20, 2003.
Polyak, B. T. Gradient methods for minimizing functionals. Zhurnal Vychislitel'noi Matematiki i Matematicheskoi Fiziki, 3(4):643–653, 1963.
Polyak, B. T. A new method of stochastic approximation type. Avtomatika i Telemekhanika (in Russian), 1990; translated in Automat. Remote Control, 51 (1991), pages 98–107.
Polyak, B. T. and Juditsky, A. B. Acceleration of stochastic approximation by averaging. SIAM J. Control Optim., 30(4):838–855, 1992.
Powell, W. B. Reinforcement Learning and Stochastic Optimization. John Wiley & Sons, Hoboken, NJ, 2021.
Principe, J. C. Information Theory, Machine Learning, and Reproducing Kernel Hilbert Spaces, pages 1–45. Springer New York, New York, NY, 2010.
Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, New York, NY, 2014.
Qu, G. and Li, N. On the exponential stability of primal-dual gradient dynamics. Control Systems Letters, 3(1):43–48, 2018.
Raginsky, M. Divergence-based characterization of fundamental limitations of adaptive dynamical systems. In Conference on Communication, Control, and Computing, pages 107–114, 2010.
Raginsky, M. and Bouvrie, J. Continuous-time stochastic mirror descent on a network: variance reduction, consensus, convergence. In Proc. of the Conf. on Dec. and Control, pages 6793–6800, 2012.
Raginsky, M. and Rakhlin, A. Information-based complexity, feedback and dynamics in convex programming. Transactions on Information Theory, 57(10):7036–7056, 2011.
Ramaswamy, A. and Bhatnagar, S. A generalization of the Borkar–Meyn theorem for stochastic recursive inclusions. Math. Oper. Res., 42(3):648–661, 2017.
Ramaswamy, A. and Bhatnagar, S. Stability of stochastic approximations with "controlled Markov" noise and temporal difference learning. Trans. on Automatic Control, 64:2614–2620, 2019.
Rastrigin, L. Extremum control by means of random scan. Avtomat. i Telemekh, 21(9):1264–1271, 1960.
Rastrigin, L. A. Random search in problems of optimization, identification and training of control systems. Journal of Cybernetics, 3(3):93–103, 1973.
Research Staff. Experience with the X-15 adaptive flight control system. TN D-6208, NASA Flight Research Center, Edwards, CA, 1971.
Robbins, H. and Monro, S. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951.
Rosenthal, J. S. Correction: "Minorization conditions and convergence rates for Markov chain Monte Carlo." J. Amer. Statist. Assoc., 90(431):1136, 1995.
Rosenthal, J. S. Minorization conditions and convergence rates for Markov chain Monte Carlo. J. Amer. Statist. Assoc., 90(430):558–566, 1995.
Rudin, W. Real and Complex Analysis. McGraw-Hill, New York, NY, 2nd ed., 1974.
Ruppert, D. A Newton–Raphson version of the multivariate Robbins–Monro procedure. The Annals of Statistics, 13(1):236–245, 1985.
Ruppert, D. Efficient estimators from a slowly convergent Robbins–Monro process. Technical Report No. 781, Cornell University, School of Operations Research and Industrial Engineering, Ithaca, NY, 1988.
Russo, D. J., Van Roy, B., Kazerouni, A., Osband, I., and Wen, Z. A Tutorial on Thompson Sampling. Now Publishers Inc., Norwell, MA, 2018.
Rybko, A. N. and Stolyar, A. L. On the ergodicity of random processes that describe the functioning of open queueing networks. Problemy Peredachi Informatsii, 28(3):3–26, 1992.
Sacks, J. Asymptotic distribution of stochastic approximation procedures. The Annals of Mathematical Statistics, 29(2):373–405, 1958.
Schrittwieser, J., Antonoglou, I., Hubert, T., et al. Mastering Atari, Go, chess and Shogi by planning with a learned model. arXiv, abs/1911.08265, 2019.
Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In Intl. Conference on Machine Learning, pages 1889–1897, 2015.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv, abs/1707.06347, 2017.
Schweitzer, P. J. Perturbation theory and finite Markov chains. J. Appl. Prob., 5:401–403, 1968.
Schweitzer, P. J. and Seidmann, A. Generalized polynomial approximations in Markovian decision processes. Journal of Mathematical Analysis and Applications, 110(2):568–582, 1985.
Seneta, E. Non-Negative Matrices and Markov Chains. Springer, New York, NY, 2nd ed., 1981.
Shannon, C. A mathematical theory of communication. Bell System Tech. J., 27:379–423, 623–656, 1948.
Sharma, H., Jain, R., and Gupta, A. An empirical relative value learning algorithm for non-parametric MDPs with continuous state space. In European Control Conference, pages 1368–1373. IEEE, 2019.
Shi, B., Du, S. S., Su, W., and Jordan, M. I. Acceleration via symplectic discretization of high-resolution differential equations. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R., editors, Proc. Advances in Neural Information Processing Systems, pages 5744–5752, 2019.
Shirodkar, S. and Meyn, S. Quasi stochastic approximation. In Proc. of the American Control Conf., pages 2429–2435, July 2011.
Shivam, S., Buckley, I., Wardi, Y., Seatzu, C., and Egerstedt, M. Tracking control by the Newton–Raphson flow: applications to autonomous vehicles. CoRR, abs/1811.08033, 2018.
Sikora, R. and Skarbek, W. On stability of Oja algorithm. In Polkowski, L. and Skowron, A., editors, Rough Sets and Current Trends in Computing, volume 1424 of Lecture Notes in Computer Science, pages 354–360. Springer Verlag, Berlin, 2009.
Silver, D., Hubert, T., Schrittwieser, J., et al. A general reinforcement learning algorithm that masters chess, Shogi, and Go through self-play. Science, 362(6419):1140–1144, 2018.
Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M.. Deterministic policy gradient algorithms. In Proc. ICML, pages 387395, 2014.Google Scholar
Singh, S. P., Jaakkola, T., and Jordan, M.. Reinforcement learning with soft state aggregation. Proc. Advances in Neural Information Processing Systems, 7:361, 1995.Google Scholar
Smale, S.. A convergent process of price adjustment and global Newton methods. Journal of Mathematical Economics, 3(2):107120, July 1976.Google Scholar
Smallwood, R. D. and Sondik, E. J.. The optimal control of partially observable Markov processes over a finite horizon. Oper. Res., 21(5):10711088, October 1973.Google Scholar
Spall, J. C.. Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control, 37(3):332341, 1992.Google Scholar
Spall, J. C.. A stochastic approximation technique for generating maximum likelihood parameter estimates. In Proc. of the American Control Conf., pages 11611167. IEEE, 1987.Google Scholar
Spall, J. C.. A one-measurement form of simultaneous perturbation stochastic approximation. Automatica, 33(1):109112, 1997.CrossRefGoogle Scholar
Spong, M. W. and Block, D. J.. The pendubot: a mechatronic system for control research and education. In Proc. of the Conf. on Dec. and Control, pages 555556. IEEE, 1995.Google Scholar
Spong, M. W. and Praly, L.. Control of underactuated mechanical systems using switching and saturation. In Morse, A. S., editor, Control Using Logic-Based Switching, pages 162172. Springer, Berlin, Heidelberg 1997.CrossRefGoogle Scholar
Spong, M. W. and Vidyasagar, M.. Robot Dynamics and Control. John Wiley & Sons, Chichester, UK, 2008.Google Scholar
Srikant, R. and Ying, L.. Finite-time error bounds for linear stochastic approximation and TD learning. In Proc. COLT, pages 28032830, 2019.Google Scholar
Stratonovich, R. L.. Conditional Markov processes. SIAM J. Theory Probab. and Appl., 5:156178, 1960.Google Scholar
Su, W., Boyd, S., and Candes, E.. A differential equation for modeling Nesterov's accelerated gradient method: theory and insights. In Proc. Advances in Neural Information Processing Systems, pages 2510–2518, 2014.
Subramanian, J. and Mahajan, A.. Approximate information state for partially observed systems. In Proc. of the Conf. on Dec. and Control, pages 1629–1636. IEEE, 2019.
Subramanian, J., Sinha, A., Seraj, R., and Mahajan, A.. Approximate information state for approximate planning and reinforcement learning in partially observed systems. arXiv:2010.08843, 2020.
Sutton, R. and Barto, A.. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 2nd ed., 2018. Online edition at www.cs.ualberta.ca/~sutton/book/the-book.html.
Sutton, R. S.. Temporal Credit Assignment in Reinforcement Learning. PhD thesis, University of Massachusetts, Amherst, 1984.
Sutton, R. S.. Learning to predict by the methods of temporal differences. Mach. Learn., 3(1):9–44, 1988.
Sutton, R. S.. Generalization in reinforcement learning: successful examples using sparse coarse coding. In Proc. of the Intl. Conference on Neural Information Processing Systems, pages 1038–1044, 1995.
Sutton, R. S. and Barto, A. G.. Toward a modern theory of adaptive networks: expectation and prediction. Psychological Review, 88(2):135, 1981.
Sutton, R. S., Barto, A. G., and Williams, R. J.. Reinforcement learning is direct adaptive optimal control. Control Systems Magazine, 12(2):19–22, 1992.
Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y.. Policy gradient methods for reinforcement learning with function approximation. In Proc. Advances in Neural Information Processing Systems, pages 1057–1063, 2000.
Sutton, R. S., Szepesvári, C., and Maei, H. R.. A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. In Proc. of the Intl. Conference on Neural Information Processing Systems, pages 1609–1616, 2008.
Szepesvári, C.. The asymptotic convergence-rate of Q-learning. In Proc. of the Intl. Conference on Neural Information Processing Systems, pages 1064–1070, 1997.
Szepesvári, C.. Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, San Rafael, CA, 2010.
Tan, Y., Moase, W. H., Manzie, C., Nešić, D., and Mareels, I.. Extremum seeking from 1922 to 2010. In Proc. of the 29th Chinese Control Conference, pages 14–26. IEEE, 2010.
Tanzanakis, A. and Lygeros, J.. Data-driven control of unknown systems: a linear programming approach. ArXiv, abs/2003.00779, 2020.
Tesauro, G.. TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6(2):215–219, 1994.
Thoppe, G. and Borkar, V.. A concentration bound for stochastic approximation via Alekseev's formula. Stochastic Systems, 9(1):1–26, 2019.
Tsitsiklis, J.. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16:185–202, 1994.
Tsitsiklis, J. and Van Roy, B.. Optimal stopping of Markov processes: Hilbert space theory, approximation algorithms, and an application to pricing high-dimensional financial derivatives. IEEE Trans. Automat. Control, 44(10):1840–1851, 1999.
Tsitsiklis, J. N. and Van Roy, B.. Average cost temporal-difference learning. Automatica, 35(11):1799–1808, 1999.
Tsitsiklis, J. N. and Van Roy, B.. Feature-based methods for large scale dynamic programming. Machine Learning, 22(1–3):59–94, 1996.
Tsitsiklis, J. N. and Van Roy, B.. An analysis of temporal-difference learning with function approximation. IEEE Trans. Automat. Control, 42(5):674–690, 1997.
Tsypkin, Y. Z. and Nikolic, Z. J.. Adaptation and Learning in Automatic Systems. Academic Press, New York, NY, 1971.
Tzen, B. and Raginsky, M.. Theoretical guarantees for sampling and inference in generative models with latent diffusions. In Beygelzimer, A. and Hsu, D., editors, Proc. COLT, volume 99, pages 3084–3114, 2019.
Vamvoudakis, K. G., Lewis, F. L., and Vrabie, D.. Reinforcement learning with applications in autonomous control and game theory. In Angelov, P., editor, Handbook on Computer Learning and Intelligence. World Scientific, Hackensack, NJ, 2nd ed., 2021.
Vamvoudakis, K. G., Wan, Y., Lewis, F. L., and Cansever, D., editors. Handbook on Reinforcement Learning and Control. Studies in Systems, Decision and Control (SSDC), volume 325. Springer, Princeton, NJ, 2021.
van der Vaart, A. W.. Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, UK, 1998.
van Handel, R.. Lecture notes on hidden Markov models. https://web.math.princeton.edu/~rvan/, 2008.
Van Roy, B.. Learning and Value Function Approximation in Complex Decision Processes. PhD thesis, Massachusetts Institute of Technology, 1998. AAI0599623.
Vandenberghe, L. and Boyd, S.. Applications of semidefinite programming. Applied Numerical Mathematics, 29(3):283–299, 1999.
Vapnik, V.. Estimation of Dependences Based on Empirical Data. Springer Science & Business Media, New York, NY, 2006.
Venter, J. et al. An extension of the Robbins–Monro procedure. The Annals of Mathematical Statistics, 38(1):181–190, 1967.
Vinter, R.. Convex duality and nonlinear optimal control. SIAM Journal on Control and Optimization, 31(2):518–521, March 1993.
Walton, N.. A short note on soft-max and policy gradients in bandits problems. arXiv preprint arXiv:2007.10297, 2020.
Wang, Y. and Boyd, S.. Performance bounds for linear stochastic control. Systems Control Lett., 58(3):178–182, 2009.
Wardi, Y., Seatzu, C., Egerstedt, M., and Buckley, I.. Performance regulation and tracking via lookahead simulation: preliminary results and validation. In Proc. of the Conf. on Dec. and Control, pages 6462–6468, 2017.
Watkins, C. J. C. H.. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, UK, 1989.
Watkins, C. J. C. H. and Dayan, P.. Q-learning. Machine Learning, 8(3–4):279–292, 1992.
Weber, B.. Swift and slashing, computer topples Kasparov. New York Times, 12:262, 1997.
Whittle, P.. Risk-Sensitive Optimal Control. John Wiley and Sons, Chichester, UK, 1990.
Wibisono, A., Wilson, A. C., and Jordan, M. I.. A variational perspective on accelerated methods in optimization. Proc. of the National Academy of Sciences, 113:E7351–E7358, 2016.
Williams, R. J.. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4):229–256, 1992.
Witten, I. H.. An adaptive optimal controller for discrete-time Markov environments. Information and Control, 34(4):286–295, 1977.
Wu, L.. Essential spectral radius for Markov semigroups. I. Discrete time case. Prob. Theory Related Fields, 128(2):255–321, 2004.
Yaji, V. G. and Bhatnagar, S.. Stochastic recursive inclusions with non-additive iterate-dependent Markov noise. Stochastics, 90(3):330–363, 2018.
Yin, H., Mehta, P., Meyn, S., and Shanbhag, U.. Synchronization of coupled oscillators is a game. IEEE Transactions on Automatic Control, 57(4):920–935, 2012.
Yin, H., Mehta, P., Meyn, S., and Shanbhag, U.. Learning in mean-field games. IEEE Transactions on Automatic Control, 59(3):629–644, March 2014.
Yin, H., Mehta, P. G., Meyn, S. P., and Shanbhag, U. V.. On the efficiency of equilibria in mean-field oscillator games. Dynamic Games and Applications, 4(2):177–207, 2014.
Zhang, J., Koppel, A., Bedi, A. S., Szepesvari, C., and Wang, M.. Variational policy gradient method for reinforcement learning with general utilities. Proc. Advances in Neural Information Processing Systems, 33:4572–4583, 2020.
Zhang, J., Mokhtari, A., Sra, S., and Jadbabaie, A.. Direct Runge–Kutta discretization achieves acceleration. In Proc. of the Intl. Conference on Neural Information Processing Systems, pages 3904–3913, 2018.
Zhao, J. and Spong, M.. Hybrid control for global stabilization of the cart–pendulum system. Automatica, 37(12):1941–1951, 2001.
Zhou, K., Doyle, J. C., and Glover, K.. Robust and Optimal Control. Prentice Hall, Englewood Cliffs, NJ, 1996.
