
References

Published online by Cambridge University Press:  17 May 2022

Sean Meyn, University of Florida

Publisher: Cambridge University Press
Print publication year: 2022

Access options

Get access to the full version of this content by using one of the access options below. (Log in options will check for institutional or personal access. Content may require purchase if you do not have access.)


Agarwal, A. and Dekel, O. Optimal algorithms for online convex optimization with multi-point bandit feedback. In Proc. COLT, pages 28–40, 2010.
Agarwal, A., Kakade, S. M., Lee, J. D., and Mahajan, G. Optimality and approximation with policy gradient methods in Markov decision processes. In Proc. COLT, pages 64–66, 2020.
Agrawal, R. Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, pages 1054–1078, 1995.
Alekseev, V. M. An estimate for the perturbations of the solutions of ordinary differential equations (Russian). Vestnik Moskov. Univ. Ser., 1:28–36, 1961.
Alvarez, F., Attouch, H., Bolte, J., and Redont, P. A second-order gradient-like dissipative dynamical system with Hessian-driven damping: application to optimization and mechanics. Journal de mathématiques pures et appliquées, 81(8):747–779, 2002.
Amari, S.-I. and Douglas, S. C. Why natural gradient? In ICASSP'98, volume 2, pages 1213–1216. IEEE, 1998.
Anderson, B. D. O. and Moore, J. B. Optimal Control: Linear Quadratic Methods. Prentice Hall, Englewood Cliffs, NJ, 1990.
Andrew, L. L., Lin, M., and Wierman, A. Optimality, fairness, and robustness in speed scaling designs. SIGMETRICS Perform. Eval. Rev., 38(1):37–48, June 2010.
Anschel, O., Baram, N., and Shimkin, N. Averaged-DQN: Variance reduction and stabilization for deep reinforcement learning. In Proc. ICML, pages 176–185. JMLR.org, 2017.
Arapostathis, A., Borkar, V. S., Fernandez-Gaucherand, E., Ghosh, M. K., and Marcus, S. I. Discrete-time controlled Markov processes with average cost criterion: a survey. SIAM J. Control Optim., 31:282–344, 1993.
Ariyur, K. B. and Krstić, M. Real Time Optimization by Extremum Seeking Control. John Wiley & Sons, Inc., New York, NY, 2003.
Asmussen, S. and Glynn, P. W. Stochastic Simulation: Algorithms and Analysis, volume 57 of Stochastic Modelling and Applied Probability. Springer-Verlag, New York, NY, 2007.
Åström, K. J. Optimal control of Markov processes with incomplete state information I. J. of Mathematical Analysis and Applications, 10:174–205, 1965.
Åström, K. J. and Furuta, K. Swinging up a pendulum by energy control. Automatica, 36(2):287–295, 2000.
Åström, K. J. and Murray, R. M. Feedback Systems: An Introduction for Scientists and Engineers. Princeton University Press, Princeton, NJ, 2nd ed., 2020.
Attouch, H., Goudou, X., and Redont, P. The heavy ball with friction method, I. The continuous dynamical system: global exploration of the local minima of a real-valued function by asymptotic analysis of a dissipative dynamical system. Communications in Contemporary Mathematics, 2(1):1–34, 2000.
Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2–3):235–256, 2002.
Azar, M. G., Munos, R., Ghavamzadeh, M., and Kappen, H. Speedy Q-learning. In Proc. Advances in Neural Information Processing Systems, pages 2411–2419, 2011.
Bach, F. Learning Theory from First Principles. www.di.ens.fr/~fbach/ltfp_book.pdf, 2021.
Bach, F. and Moulines, E. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Proc. Advances in Neural Information Processing Systems, volume 26, pages 773–781, 2013.
Baird, L. Residual algorithms: reinforcement learning with function approximation. In Prieditis, A. and Russell, S., editors, Proc. Machine Learning, pages 30–37. Morgan Kaufmann, San Francisco, CA, 1995.
Baird, L. C. Reinforcement learning in continuous time: advantage updating. In Proc. of Intl. Conference on Neural Networks, volume 4, pages 2448–2453. IEEE, 1994.
Baird III, L. C. Reinforcement Learning through Gradient Descent. PhD thesis, US Air Force Academy, 1999.
Ball, F., Larédo, C., Sirl, D., and Tran, V. C. Stochastic Epidemic Models with Inference, volume 2255. Springer Nature, Cham, 2019.
Bansal, N., Kimbrel, T., and Pruhs, K. Speed scaling to manage energy and temperature. J. ACM, 54(1):1–39, March 2007.
Barto, A., Sutton, R., and Anderson, C. Neuron-like adaptive elements that can solve difficult learning control problems. IEEE Trans. on Systems, Man and Cybernetics, 13(5):835–846, 1983.
Barto, A. G., Sutton, R. S., and Watkins, C. J. C. H. Learning and sequential decision making. In Gabriel, M. and Moore, J. W., editors, Learning and Computational Neuroscience: Foundations of Adaptive Networks, pages 539–602. MIT Press, Cambridge, MA, 1989.
Bas Serrano, J., Curi, S., Krause, A., and Neu, G. Logistic Q-learning. In Banerjee, A. and Fukumizu, K., editors, Proc. of the Intl. Conference on Artificial Intelligence and Statistics, volume 130, pages 3610–3618, April 13–15, 2021.
Basar, T., Meyn, S., and Perkins, W. R. Lecture notes on control system theory and design. arXiv e-print 2007.01367, 2010.
Baumann, N. Too fast to fail: is high-speed trading the next Wall Street disaster? Mother Jones, January/February 2013.
Baxter, J. and Bartlett, P. L. Direct gradient-based reinforcement learning: I. Gradient estimation algorithms. Technical report, Australian National University, 1999.
Baxter, J. and Bartlett, P. L. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319–350, 2001.
Beck, J. Strong Uniformity and Large Dynamical Systems. World Scientific, Hackensack, NJ, 2017.
Bellman, R. The stability of solutions of linear differential equations. Duke Math. J., 10(4):643–647, 1943.
Bellman, R. Dynamic Programming. Princeton University Press, Princeton, NJ, 1957.
Bellman, R., Bentsman, J., and Meerkov, S. M. Stability of fast periodic systems. In Proc. of the American Control Conf., volume 3, pages 1319–1320. IEEE, 1984.
Benaïm, M. Dynamics of stochastic approximation algorithms. In Séminaire de Probabilités, XXXIII, pages 1–68. Springer, Berlin, 1999.
Benveniste, A., Métivier, M., and Priouret, P. Adaptive Algorithms and Stochastic Approximations, volume 22 of Applications of Mathematics (New York). Springer-Verlag, Berlin, 1990. Translated from the French by Stephen S. Wilson.
Benveniste, A., Métivier, M., and Priouret, P. Adaptive Algorithms and Stochastic Approximations, volume 22. Springer Science & Business Media, Berlin, Heidelberg, 2012.
Bernstein, A., Chen, Y., Colombino, M., Dall'Anese, E., Mehta, P., and Meyn, S. Optimal rate of convergence for quasi-stochastic approximation. arXiv:1903.07228, 2019.
Bernstein, A., Chen, Y., Colombino, M., Dall'Anese, E., Mehta, P., and Meyn, S. Quasi-stochastic approximation and off-policy reinforcement learning. In Proc. of the Conf. on Dec. and Control, pages 5244–5251, March 2019.
Bertsekas, D. Multiagent rollout algorithms and reinforcement learning. arXiv preprint arXiv:1910.00120, 2019.
Bertsekas, D. and Shreve, S. Stochastic Optimal Control: The Discrete-Time Case. Athena Scientific, Belmont, MA, 1996.
Bertsekas, D. and Tsitsiklis, J. N. Neuro-Dynamic Programming. Athena Scientific, Cambridge, MA, 1996.
Bertsekas, D. P. Dynamic Programming and Optimal Control, volume II. Athena Scientific, Belmont, MA, 4th ed., 2012.
Bertsekas, D. P. Dynamic Programming and Optimal Control, volume 1. Athena Scientific, Belmont, MA, 4th ed., 2017.
Bertsekas, D. P. Reinforcement Learning and Optimal Control. Athena Scientific, Belmont, MA, 2019.
Bhandari, J. and Russo, D. Global optimality guarantees for policy gradient methods. arXiv preprint arXiv:1906.01786, 2019.
Bhandari, J., Russo, D., and Singal, R. A finite time analysis of temporal difference learning with linear function approximation. In Proc. COLT, pages 1691–1692, 2018.
Bhatnagar, S. Simultaneous perturbation and finite difference methods. Wiley Encyclopedia of Operations Research and Management Science, https://onlinelibrary.wiley.com/doi/10.1002/9780470400531.eorms0784, 2010.
Bhatnagar, S. and Borkar, V. S. Multiscale chaotic SPSA and smoothed functional algorithms for simulation optimization. Simulation, 79(10):568–580, 2003.
Bhatnagar, S., Fu, M. C., Marcus, S. I., and Wang, I.-J. Two-timescale simultaneous perturbation stochastic approximation using deterministic perturbation sequences. ACM Transactions on Modeling and Computer Simulation (TOMACS), 13(2):180–209, 2003.
Bhatnagar, S., Ghavamzadeh, M., Lee, M., and Sutton, R. S. Incremental natural actor-critic algorithms. In Proc. Advances in Neural Information Processing Systems, pages 105–112, 2008.
Bhatnagar, S., Prasad, H., and Prashanth, L. Stochastic Recursive Algorithms for Optimization: Simultaneous Perturbation Methods. Lecture Notes in Control and Information Sciences. Springer, London, 2013.
Bhatnagar, S., Precup, D., Silver, D., Sutton, R. S., Maei, H. R., and Szepesvári, C. Convergent temporal-difference learning with arbitrary smooth function approximation. In Proc. Advances in Neural Information Processing Systems, pages 1204–1212, 2009.
Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M., and Lee, M. Natural actor–critic algorithms. Automatica, 45(11):2471–2482, 2009.
Bishop, C. M. Pattern Recognition and Machine Learning. Springer, 2006.
Blum, J. R. Multidimensional stochastic approximation methods. The Annals of Mathematical Statistics, 25(4):737–744, 1954.
Borkar, V. and Meyn, S. P. Oja's algorithm for graph clustering, Markov spectral decomposition, and risk sensitive control. Automatica, 48(10):2512–2519, 2012.
Borkar, V. and Varaiya, P. Adaptive control of Markov chains, I: finite parameter set. IEEE Trans. Automat. Control, 24(6):953–957, 1979.
Borkar, V. and Varaiya, P. Identification and adaptive control of Markov chains. SIAM J. Control Optim., 20(4):470–489, 1982.
Borkar, V. S. Identification and Adaptive Control of Markov Chains. PhD thesis, University of California, Berkeley, 1980.
Borkar, V. S. Convex analytic methods in Markov decision processes. In Handbook of Markov Decision Processes, volume 40 of Internat. Ser. Oper. Res. Management Sci., pages 347–375. Kluwer Acad. Publ., Boston, MA, 2002.
Borkar, V. S. Reinforcement learning – a bridge between numerical methods and Markov chain Monte Carlo. In Sastry, N. S. N., Rajeev, B., Delampady, M., and Rao, T. S. S. R. K., editors, Perspectives in Mathematical Sciences, pages 71–91. World Scientific, Singapore, 2009.
Borkar, V. S. Stochastic Approximation: A Dynamical Systems Viewpoint. Hindustan Book Agency and Cambridge University Press (jointly), Delhi, India, and Cambridge, UK, 2008.
Borkar, V. S. Stochastic Approximation: A Dynamical Systems Viewpoint (2nd ed., to appear). Hindustan Book Agency, Delhi, India, and Cambridge, UK, 2020.
Borkar, V. S. and Gaitsgory, V. Linear programming formulation of long-run average optimal control problem. Journal of Optimization Theory and Applications, 181(1):101–125, 2019.
Borkar, V. S., Gaitsgory, V., and Shvartsman, I. LP formulations of discrete time long-run average optimal control problems: the non-ergodic case. SIAM Journal on Control and Optimization, 57(3):1783–1817, 2019.
Borkar, V. S. and Meyn, S. P. The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM J. Control Optim., 38(2):447–469, 2000.
Boyan, J. A. Technical update: least-squares temporal difference learning. Mach. Learn., 49(2–3):233–246, 2002.
Boyd, S., El Ghaoui, L., Feron, E., and Balakrishnan, V. Linear Matrix Inequalities in System and Control Theory, volume 15. SIAM, 1994.
Boyd, S., Parikh, N., and Chu, E. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Now Publishers Inc., Norwell, MA, 2011.
Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, New York, NY, 1st ed., 2004.
Bradtke, S., Ydstie, B., and Barto, A. Adaptive linear quadratic control using policy iteration. In Proc. of the American Control Conf., volume 3, pages 3475–3479, 1994.
Bradtke, S. J. and Barto, A. G. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1–3):33–57, 1996.
Brogan, W. L. Modern Control Theory. Pearson, 3rd ed., 1990.
Bu, J., Mesbahi, A., Fazel, M., and Mesbahi, M. LQR through the lens of first order methods: discrete-time case. arXiv e-prints, arXiv:1907.08921, 2019.
Bubeck, S. and Cesa-Bianchi, N. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
Butcher, J. C. Numerical Methods for Ordinary Differential Equations. John Wiley & Sons, New York, NY, 2016.
Caines, P. E. Linear Stochastic Systems. John Wiley & Sons, New York, NY, 1988.
Caines, P. E. Mean field games. In Baillieul, J. and Samad, T., editors, Encyclopedia of Systems and Control, pages 706–712. Springer London, London, UK, 2015.
Chatterjee, D., Patra, A., and Joglekar, H. K. Swing-up and stabilization of a cart–pendulum system under restricted cart track length. Systems & Control Letters, 47(4):355–364, 2002.
Chen, H. and Guo, L. Identification and Stochastic Adaptive Control. Birkhäuser, Boston, MA, 1991.
Chen, R. T., Rubanova, Y., Bettencourt, J., and Duvenaud, D. Neural ordinary differential equations. In Proc. Advances in Neural Information Processing Systems, volume 32, pages 6572–6583, 2018.
Chen, S., Bernstein, A., Devraj, A., and Meyn, S. Stability and acceleration for quasi stochastic approximation. arXiv:2009.14431, 2020.
Chen, S., Devraj, A., Bernstein, A., and Meyn, S. Accelerating optimization and reinforcement learning with quasi stochastic approximation. In Proc. of the American Control Conf., pages 1965–1972, May 2021.
Chen, S., Devraj, A., Bernstein, A., and Meyn, S. Revisiting the ODE method for recursive algorithms: fast convergence using quasi stochastic approximation. Journal of Systems Science and Complexity, special issue on Advances on Fundamental Problems in Control Systems, in honor of Prof. Lei Guo's 60th birthday, 34(5):1681–1702, 2021.
Chen, S., Devraj, A., Borkar, V., Kontoyiannis, I., and Meyn, S. The ODE method for asymptotic statistics in stochastic approximation and reinforcement learning. Submitted for publication, 2021.
Chen, S., Devraj, A. M., Bušić, A., and Meyn, S. Explicit mean-square error bounds for Monte-Carlo and linear stochastic approximation. In Chiappa, S. and Calandra, R., editors, Proc. of AISTATS, volume 108, pages 4173–4183, 2020.
Chen, S., Devraj, A. M., Lu, F., Busic, A., and Meyn, S. Zap Q-learning with nonlinear function approximation. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., and Lin, H., editors, Advances in Neural Information Processing Systems, and arXiv e-prints 1910.05405, volume 33, pages 16879–16890, 2020.
Chen, T., Hua, Y., and Yan, W.-Y. Global convergence of Oja's subspace algorithm for principal component extraction. IEEE Trans. Neural Networks, 9(1):58–67, January 1998.
Chen, W., Huang, D., Kulkarni, A. A., et al. Approximate dynamic programming using fluid and diffusion approximations with applications to power management. In Proc. of the 48th IEEE Conf. on Dec. and Control, held jointly with the 2009 28th Chinese Control Conference, pages 3575–3580, 2009.
Chen, Y., Bernstein, A., Devraj, A., and Meyn, S. Model-free primal-dual methods for network optimization with application to real-time optimal power flow. In Proc. of the American Control Conf., pages 3140–3147, September 2019.
Chow, Y., Nachum, O., Duenez-Guzman, E., and Ghavamzadeh, M. A Lyapunov-based approach to safe reinforcement learning. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R., editors, Proc. Advances in Neural Information Processing Systems, pages 8092–8101, 2018.
Chung, K. L. et al. On a stochastic approximation method. The Annals of Mathematical Statistics, 25(3):463–483, 1954.
Colombino, M., Dall'Anese, E., and Bernstein, A. Online optimization as a feedback controller: stability and tracking. Trans. on Control of Network Systems, 7(1):422–432, 2020.
Cover, T. M. and Thomas, J. A. Elements of Information Theory. John Wiley & Sons Inc., New York, NY, 1991.
Dai, J. G. On positive Harris recurrence of multiclass queueing networks: a unified approach via fluid limit models. Ann. Appl. Probab., 5(1):49–77, 1995.
Dai, J. G. and Meyn, S. P. Stability and convergence of moments for multiclass queueing networks via fluid limit models. IEEE Trans. Automat. Control, 40:1889–1904, 1995.
Dai, J. G. and Vande Vate, J. H. The stability of two-station multi-type fluid networks. Operations Res., 48:721–744, 2000.
Dalal, G., Szörényi, B., Thoppe, G., and Mannor, S. Concentration bounds for two timescale stochastic approximation with applications to reinforcement learning. Proc. of the Conference on Computational Learning Theory, pages 1–35, 2017.
de Farias, D. P. and Van Roy, B. The linear programming approach to approximate dynamic programming. Operations Res., 51(6):850–865, 2003.
de Farias, D. P. and Van Roy, B. On constraint sampling in the linear programming approach to approximate dynamic programming. Math. Oper. Res., 29(3):462–478, 2004.
de Farias, D. P. and Van Roy, B. A cost-shaping linear program for average-cost approximate dynamic programming with performance guarantees. Math. Oper. Res., 31(3):597–620, 2006.
Dembo, A. and Zeitouni, O. Large Deviations Techniques and Applications. Springer-Verlag, New York, NY, 2nd ed., 1998.
Derman, C. Finite State Markovian Decision Processes, volume 67 of Mathematics in Science and Engineering. Academic Press, Inc., Orlando, FL, 1970.
Devraj, A. M. Reinforcement Learning Design with Optimal Learning Rate. PhD thesis, University of Florida, 2019.
Devraj, A. M., Bušić, A., and Meyn, S. On matrix momentum stochastic approximation and applications to Q-learning. In Allerton Conference on Communication, Control, and Computing, pages 749–756, September 2019.
Devraj, A. M., Bušić, A., and Meyn, S. Zap Q-learning – a user's guide. In Proc. of the Fifth Indian Control Conference, https://par.nsf.gov/servlets/purl/10211835, January 9–11, 2019.
Devraj, A. M., Bušić, A., and Meyn, S. Fundamental design principles for reinforcement learning algorithms. In Vamvoudakis, K. G., Wan, Y., Lewis, F. L., and Cansever, D., editors, Handbook on Reinforcement Learning and Control, Studies in Systems, Decision and Control (SSDC) series, volume 325. Springer, 2021.
Devraj, A. M., Kontoyiannis, I., and Meyn, S. P. Differential temporal difference learning. IEEE Trans. Automat. Control, 66(10):4652–4667, doi: 10.1109/TAC.2020.3033417, October 2021.
Devraj, A. M. and Meyn, S. P. Fastest convergence for Q-learning. arXiv e-prints, July 2017.
Devraj, A. M. and Meyn, S. P. Zap Q-learning. In Proc. of the Intl. Conference on Neural Information Processing Systems, pages 2232–2241, 2017.
Devraj, A. M. and Meyn, S. P. Q-learning with uniformly bounded variance: large discounting is not a barrier to fast learning. arXiv e-prints, arXiv:2002.10301 (and to appear, IEEE Trans. Automat. Control), February 2020.
Diaconis, P. The Markov chain Monte Carlo revolution. Bull. Amer. Math. Soc. (N.S.), 46(2):179–205, 2009.
Ding, D. and Jovanović, M. R. Global exponential stability of primal-dual gradient flow dynamics based on the proximal augmented Lagrangian. In Proc. of the American Control Conf., pages 3414–3419. IEEE, 2019.
Douc, R., Moulines, E., Priouret, P., and Soulier, P. Markov Chains. Springer, Cham, 2018.
Douc, R., Moulines, É., and Stoffer, D. Nonlinear Time Series: Theory, Methods and Applications with R Examples. Texts in Statistical Science. Chapman & Hall/CRC Press, 2014.
Duffy, K. and Meyn, S. Large deviation asymptotics for busy periods. Stochastic Systems, 4(1):300–319, 2014.
Duffy, K. R. and Meyn, S. P. Most likely paths to error when estimating the mean of a reflected random walk. Performance Evaluation, 67(12):1290–1303, 2010.
Dupree, K., Patre, P. M., Johnson, M., and Dixon, W. E. Inverse optimal adaptive control of a nonlinear Euler–Lagrange system, Part I: full state feedback. In Proc. of the Conference on Decision and Control, held jointly with Chinese Control Conference, pages 321–326, 2009.
Durrett, R. Stochastic spatial models. SIAM Review, 41(4):677–718, 1999.
Dynkin, E. B. and Yushkevich, A. A. Controlled Markov Processes, volume 235 of Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences]. Springer-Verlag, Berlin, 1979. Translated from the Russian original by Danskin, J. M. and Holland, C.
Even-Dar, E. and Mansour, Y. Learning rates for Q-learning. J. of Machine Learning Research, 5:1–25, 2003.
Fabian, V. et al. On asymptotic normality in stochastic approximation. The Annals of Mathematical Statistics, 39(4):1327–1332, 1968.
Farahmand, A.-M. and Ghavamzadeh, M. PID accelerated value iteration algorithm. In Meila, M. and Zhang, T., editors, Proc. ICML, volume 139, pages 3143–3153, July 18–24, 2021.
Fazlyab, M., Ribeiro, A., Morari, M., and Preciado, V. M. Analysis of optimization algorithms via integral quadratic constraints: nonstrongly convex problems. SIAM Journal on Optimization, 28(3):2654–2689, 2018.
Feinberg, E. and Shwartz, A., editors. Markov Decision Processes: Models, Methods, Directions, and Open Problems. Kluwer Acad. Publ., Holland, 2001.
Feinberg, E. A. and Shwartz, A., editors. Handbook of Markov Decision Processes: Methods and Applications. Intl. Series in Operations Research & Management Science, 40. Kluwer Academic Publishers, Boston, MA, 2002.
Feintuch, A. and Francis, B. Infinite chains of kinematic points. Automatica, 48(5):901–908, 2012.
Feng, Y., Li, L., and Liu, Q. A kernel loss for solving the Bellman equation. In Proc. Advances in Neural Information Processing Systems, pages 15456–15467, 2019.
Finlay, L., Gaitsgory, V., and Lebedev, I. Duality in linear programming problems related to deterministic long run average problems of optimal control. SIAM Journal on Control and Optimization, 47(4):1667–1700, 2008.
Flegal, J. M. and Jones, G. L. Batch means and spectral variance estimators in Markov chain Monte Carlo. Annals of Statistics, 38(2):1034–1070, April 2010.
Fort, G., Moulines, E., Meyn, S. P., and Priouret, P. ODE methods for Markov chain stability with applications to MCMC. In Valuetools '06: Proceedings of the 1st International Conference on Performance Evaluation Methodologies and Tools, page 42. ACM Press, New York, NY, 2006.
Foster, F. G. On Markoff chains with an enumerable infinity of states. Proc. Cambridge Phil. Soc., 47:587–591, 1952.
Fradkov, A. and Polyak, B. T. Adaptive and robust control in the USSR. IFAC–PapersOnLine, 53(2):1373–1378, 2020. 21st IFAC World Congress.
Furuta, K., Yamakita, M., and Kobayashi, S. Swing up control of inverted pendulum. In Proc. Intl. Conference on Industrial Electronics, Control and Instrumentation, pages 2193–2198. IEEE, 1991.
Gagniuc, P. A. Markov Chains: From Theory to Implementation and Experimentation. John Wiley & Sons, New York, NY, 2017.
Gaitsgory, V., Parkinson, A., and Shvartsman, I. Linear programming formulations of deterministic infinite horizon optimal control problems in discrete time. Discrete and Continuous Dynamical Systems – Series B, 22(10):3821–3838, 2017.
Gaitsgory, V. and Quincampoix, M. On sets of occupational measures generated by a deterministic control system on an infinite time horizon. Nonlinear Analysis: Theory, Methods and Applications, 88:27–41, 2013.
George, J. M. and Harrison, J. M. Dynamic control of a queue with adjustable service rate. Operations Res., 49(5):720–731, September 2001.
Glynn, P. W. Stochastic approximation for Monte Carlo optimization. In Proc. of the 18th Conference on Winter Simulation, pages 356–365, 1986.
Glynn, P. W. Likelihood ratio gradient estimation: an overview. In Proc. of the Winter Simulation Conference, pages 366–375, 1987.
Glynn, P. W. and Meyn, S. P. A Liapounov bound for solutions of the Poisson equation. Ann. Probab., 24(2):916–931, 1996.
Goodwin, G. C. and Sin, K. S. Adaptive Filtering Prediction and Control. Prentice Hall, Englewood Cliffs, NJ, 1984.
Gordon, G. J. Stable function approximation in dynamic programming. In Proc. ICML (see also the full-length technical report, CMU-CS-95-103), pages 261–268. Elsevier, Netherlands, 1995.
Gordon, G. J. Reinforcement learning with function approximation converges to a region. In Proc. of the 13th Intl. Conference on Neural Information Processing Systems, pages 996–1002, Cambridge, MA, 2000.
Gosavi, A. Simulation-Based Optimization. Springer, Berlin, 2015.
Graham, R. L., Knuth, D. E., and Patashnik, O. Concrete Mathematics: A Foundation for Computer Science. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, 2nd ed., 1994.
Greenemeier, L. AI versus AI: self-taught AlphaGo Zero vanquishes its predecessor. Scientific American, 371(4), www.scientificamerican.com/article/ai-versus-ai-self-taught-alphago-zero-vanquishes-its-predecessor/, October 2017.
Greensmith, E., Bartlett, P. L., and Baxter, J. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5:1471–1530, 2004.
Guan, P., Raginsky, M., and Willett, R. Online Markov decision processes with Kullback–Leibler control cost. IEEE Trans. Automat. Control, 59(6):1423–1438, June 2014.
Gupta, A., Jain, R., and Glynn, P. W. An empirical algorithm for relative value iteration for average-cost MDPs. In Proc. of the Conf. on Dec. and Control, pages 5079–5084, 2015.
Hajek, B. Random Processes for Engineers. Cambridge University Press, Cambridge, UK, 2015.
Hartman, P. On functions representable as a difference of convex functions. Pacific Journal of Mathematics, 9(3):707–713, 1959.
Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning. Springer Series in Statistics. Springer-Verlag, New York, NY, 2nd ed., 2001. Corr. 3rd printing, 2003.
Henderson, S. Variance Reduction via an Approximating Markov Process. PhD thesis, Stanford University, 1997.
Henderson, S. G. and Glynn, P. W. Regenerative steady-state simulation of discrete event systems. ACM Trans. on Modeling and Computer Simulation, 11:313–345, 2001.
Henderson, S. G., Meyn, S. P., and Tadić, V. B. Performance evaluation and policy selection in multiclass networks. Discrete Event Dynamic Systems: Theory and Applications, 13(1–2):149–189, 2003. Special issue on learning, optimization and decision making (invited).
Hernández-Hernández, D., Hernández-Lerma, O., and Taksar, M. The linear programming approach to deterministic optimal control problems. Applicationes Mathematicae, 24(1):17–33, 1996.
Hernández-Lerma, O. and Lasserre, J. B. The linear programming approach. In Handbook of Markov Decision Processes, volume 40 of Internat. Ser. Oper. Res. Management Sci., pages 377–407. Kluwer Acad. Publ., Boston, MA, 2002.
Hernández-Lerma, O. and Lasserre, J. B. Discrete-Time Markov Control Processes: Basic Optimality Criteria, volume 30. Springer Science & Business Media, New York, NY, 2012.
Hu, B. and Lessard, L. Dissipativity theory for Nesterov's accelerated method. In Proc. ICML, pages 1549–1557, 2017.
Hu, B., Wright, S., and Lessard, L. Dissipativity theory for accelerating stochastic variance reduction: a unified analysis of SVRG and Katyusha using semidefinite programs. In Proc. ICML, pages 2038–2047, 2018.
Huang, D., Chen, W., Mehta, P., Meyn, S., and Surana, A. Feature selection for neuro-dynamic programming. In Lewis, F., editor, Reinforcement Learning and Approximate Dynamic Programming for Feedback Control. Wiley, Hoboken, NJ, 2011.
Huang, M., Caines, P. E., and Malhame, R. P. Large-population cost-coupled LQG problems with nonuniform agents: individual-mass behavior and decentralized ε-Nash equilibria. IEEE Trans. Automat. Control, 52(9):1560–1571, 2007.
Huang, M., Malhame, R. P., and Caines, P. E. Large population stochastic dynamic games: closed-loop McKean–Vlasov systems and the Nash certainty equivalence principle. Communications in Information and Systems, 6(3):221–251, 2006.
Iserles, A. A First Course in the Numerical Analysis of Differential Equations, volume 44. Cambridge University Press, 2009.
Jaakkola, T., Jordan, M., and Singh, S. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6:1185–1201, 1994.
Jamieson, K. G., Nowak, R., and Recht, B. Query complexity of derivative-free optimization. In Proc. Advances in Neural Information Processing Systems, pages 2672–2680, 2012.
Jin, C., Allen-Zhu, Z., Bubeck, S., and Jordan, M. I. Is Q-learning provably efficient? Proc. Advances in Neural Information Processing Systems, 31:4863–4873, 2018.
Kakade, S. and Langford, J. Approximately optimal approximate reinforcement learning. In Proc. ICML, pages 267–274, 2002.
Kakade, S. M. A natural policy gradient. In Proc. Advances in Neural Information Processing Systems, pages 1531–1538, 2002.
Kalathil, D., Borkar, V. S., and Jain, R. Empirical Q-value iteration. Stochastic Systems, 11(1):1–18, 2021.
Kalman, R. E. Contributions to the theory of optimal control. Bol. Soc. Mat. Mexicana, 5:102–119, 1960.
Kalman, R. E. When is a linear control system optimal? Journal of Basic Engineering, 86:51, 1964.
Kamoutsi, A., Sutter, T., Mohajerin Esfahani, P., and Lygeros, J. On infinite linear programming and the moment approach to deterministic infinite horizon discounted optimal control problems. IEEE Control Systems Letters, 1(1):134–139, July 2017.
Kara, A. D. and Yuksel, S. Convergence of finite memory Q-learning for POMDPs and near optimality of learned policies under filter stability. arXiv preprint arXiv:2103.12158, 2021.
Karimi, H., Nutini, J., and Schmidt, M. Linear convergence of gradient and proximal-gradient methods under the Polyak–Łojasiewicz condition. In European Conference on Machine Learning and Knowledge Discovery in Databases, volume 9851, pages 795–811. Springer-Verlag, Berlin, Heidelberg, 2016.
Karmakar, P. and Bhatnagar, S. Two time-scale stochastic approximation with controlled Markov noise and off-policy temporal-difference learning. Math. Oper. Res., 43(1):130–151, 2018.
Khalil, H. K. Nonlinear Systems. Prentice Hall, Upper Saddle River, NJ, 3rd ed., 2002.
Kiefer, J. and Wolfowitz, J. Stochastic estimation of the maximum of a regression function. Ann. Math. Statist., 23(3):462–466, September 1952.
Kim, Y. H. and Lewis, F. L. High-Level Feedback Control with Neural Networks, volume 21. World Scientific, Hackensack, NJ, 1998.
Kiumarsi, B., Vamvoudakis, K. G., Modares, H., and Lewis, F. L. Optimal and autonomous control using reinforcement learning: a survey. Transactions on Neural Networks and Learning Systems, 29(6):2042–2062, 2017.
Kohs, G. AlphaGo. Ro*co Films, 2017.
Kokotović, P., Khalil, H. K., and O'Reilly, J. Singular Perturbation Methods in Control: Analysis and Design. Society for Industrial and Applied Mathematics, Philadelphia, PA, 1999.
Koller, D. and Parr, R. Policy iteration for factored MDPs. In Proc. of the 16th Conference on Uncertainty in Artificial Intelligence, pages 326–334, 2000.
Konda, V. Actor-Critic Algorithms. PhD thesis, Massachusetts Institute of Technology, 2002.
Konda, V. R. Learning algorithms for Markov decision processes. Master's thesis, Indian Institute of Science, Dept. of Computer Science and Automation, 1997.
Konda, V. R. and Borkar, V. S. Actor-critic–type learning algorithms for Markov decision processes. SIAM Journal on Control and Optimization, 38(1):94–123, 1999.
Konda, V. R. and Tsitsiklis, J. N. Actor-critic algorithms. In Proc. Advances in Neural Information Processing Systems, pages 1008–1014, 2000.
Konda, V. R. and Tsitsiklis, J. N. On actor-critic algorithms. SIAM J. Control Optim., 42(4):1143–1166 (electronic), 2003.
Konda, V. R. and Tsitsiklis, J. N. Convergence rate of linear two-time-scale stochastic approximation. Ann. Appl. Probab., 14(2):796–819, 2004.
Kontoyiannis, I., Lastras-Montaño, L. A., and Meyn, S. P. Relative entropy and exponential deviation bounds for general Markov chains. In Proc. of the IEEE Intl. Symposium on Information Theory, pages 1563–1567, September 2005.
Kontoyiannis, I., Lastras-Montaño, L. A., and Meyn, S. P. Exponential bounds and stopping rules for MCMC and general Markov chains. In Proc. of the 1st Intl. Conference on Performance Evaluation Methodologies and Tools, Valuetools '06, pages 1563–1567. Association for Computing Machinery, New York, NY, 2006.
Kontoyiannis, I. and Meyn, S. P. Spectral theory and limit theorems for geometrically ergodic Markov processes. Ann. Appl. Probab., 13:304–362, 2003.
Kontoyiannis, I. and Meyn, S. P. Large deviations asymptotics and the spectral theory of multiplicatively regular Markov processes. Electron. J. Probab., 10(3):61–123 (electronic), 2005.
Kovachki, N. B. and Stuart, A. M. Continuous time analysis of momentum methods. J. of Machine Learning Research, 22(17):1–40, 2021.
Krener, A. Feedback linearization. In Baillieul, J. and Willems, J. C., editors, Mathematical Control Theory, pages 66–98. Springer, 1999.
Krichene, W. and Bartlett, P. L. Acceleration and averaging in stochastic descent dynamics. Proc. Advances in Neural Information Processing Systems, 30:6796–6806, 2017.
Krishnamurthy, V. Structural results for partially observed Markov decision processes. arXiv e-prints, arXiv:1512.03873, 2015.
Krstic, M., Kokotovic, P. V., and Kanellakopoulos, I. Nonlinear and Adaptive Control Design. John Wiley & Sons, Inc., New York, NY, 1995.
Kumar, P. R. and Seidman, T. I. Dynamic instabilities and stabilization methods in distributed real-time scheduling of manufacturing systems. IEEE Trans. Automat. Control, AC-35(3):289–298, March 1990.
Kushner, H. J. and Yin, G. G. Stochastic Approximation Algorithms and Applications, volume 35 of Applications of Mathematics (New York). Springer-Verlag, New York, NY, 1997.
Kwakernaak, H. and Sivan, R. Linear Optimal Control Systems. Wiley-Interscience, New York, NY, 1972.
Lagoudakis, M. G. and Parr, R. Model-free least-squares policy iteration. In Proc. Advances in Neural Information Processing Systems, pages 1547–1554, 2002.
Lai, T. L. Information bounds, certainty equivalence and learning in asymptotically efficient adaptive control of time-invariant stochastic systems. In Gerencsér, L. and Caines, P. E., editors, Topics in Stochastic Systems: Modelling, Estimation and Adaptive Control, pages 335–368. Springer Verlag, Heidelberg, Germany, 1991.
Lai, T. L. and Robbins, H. Asymptotically efficient adaptive allocation rules. Adv. in Appl. Math., 6(1):4–22, 1985.
Lakshminarayanan, C. and Bhatnagar, S. A stability criterion for two timescale stochastic approximation schemes. Automatica, 79:108–114, 2017.
Lakshminarayanan, C. and Szepesvari, C. Linear stochastic approximation: how far does constant step-size and iterate averaging go? In Intl. Conference on Artificial Intelligence and Statistics, pages 1347–1355, 2018.
Lange, S., Gabel, T., and Riedmiller, M. Batch reinforcement learning. In Reinforcement Learning, pages 45–73. Springer, Freiberg, Germany, 2012.
Lapeyre, B., Pagès, G., and Sab, K. Sequences with low discrepancy: generalisation and application to Robbins–Monro algorithm. Statistics, 21(2):251–272, 1990.
Laruelle, S. and Pagès, G. Stochastic approximation with averaging innovation applied to finance. Monte Carlo Methods and Applications, 18(1):1–51, 2012.
Lasry, J. M. and Lions, P. L. Mean field games. Japan. J. Math., 2:229–260, 2007.
Lasserre, J.-B. Moments, Positive Polynomials and Their Applications, volume 1. World Scientific, Hackensack, NJ, 2010.
Lattimore, T. and Szepesvari, C. Bandit Algorithms. Cambridge University Press, Cambridge, UK, 2020.
Le Blanc, M. Sur l'electrification des chemins de fer au moyen de courants alternatifs de frequence elevee [On the electrification of railways by means of alternating currents of high frequency]. Revue Generale de l'Electricite, 12(8):275–277, 1922.
Lee, D. and He, N. Stochastic primal-dual Q-learning algorithm for discounted MDPs. In Proc. of the American Control Conf., pages 4897–4902, July 2019.
Lee, D. and He, N. A unified switching system perspective and ODE analysis of Q-learning algorithms. arXiv e-prints, arXiv:1912.02270, 2019.
Lee, J. and Sutton, R. S. Policy iterations for reinforcement learning problems in continuous time and space – fundamental theory and methods. Automatica, 126:109421, 2021.
Lewis, F. L. and Liu, D. Reinforcement Learning and Approximate Dynamic Programming for Feedback Control, volume 17. Wiley-IEEE Press, Hoboken, NJ, 2013.
Lewis, F. L., Vrabie, D., and Vamvoudakis, K. G. Reinforcement learning and feedback control: using natural decision methods to design optimal adaptive controllers. Control Systems Magazine, 32(6):76–105, December 2012.
Lewis, M. Flash Boys: A Wall Street Revolt. W. W. Norton & Company, New York, NY, 2014.
Li, L. and Fu, J. Topological approximate dynamic programming under temporal logic constraints. In Proc. of the Conf. on Dec. and Control, pages 5330–5337, 2019.
Liggett, T. M. Stochastic Interacting Systems: Contact, Voter and Exclusion Processes, volume 324. Springer Science & Business Media, New York, NY, 2013.
Lipp, T. and Boyd, S. Variations and extension of the convex–concave procedure. Optimization and Engineering, 17(2):263–287, 2016.
Littman, M. L. and Szepesvári, C. A generalized reinforcement-learning model: convergence and applications. In Proc. ICML, volume 96, pages 310–318, 1996.
Liu, S. and Krstic, M. Introduction to extremum seeking. In Stochastic Averaging and Stochastic Extremum Seeking, Communications and Control Engineering. Springer, London, UK, 2012.
Ljung, L. Analysis of recursive stochastic algorithms. Trans. on Automatic Control, 22(4):551–575, 1977.
Luenberger, D. Linear and Nonlinear Programming. Kluwer Academic Publishers, Norwell, MA, 2nd ed., 2003.
Luenberger, D. G. Optimization by Vector Space Methods. John Wiley & Sons Inc., New York, NY, 1969. Reprinted 1997.
Lund, R. B., Meyn, S. P., and Tweedie, R. L. Computable exponential convergence rates for stochastically ordered Markov processes. Ann. Appl. Probab., 6(1):218–237, 1996.
MacKay, D. J. C. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, Cambridge, UK, 2003. Available from www.inference.phy.cam.ac.uk/mackay/itila/.
Maei, H. R., Szepesvári, C., Bhatnagar, S., and Sutton, R. S. Toward off-policy learning control with function approximation. In Proc. ICML, pages 719–726. Omnipress, Madison, WI, 2010.
Mandl, P. Estimation and control in Markov chains. Advances in Applied Probability, 6(1):40–60, 1974.
Mania, H., Guy, A., and Recht, B. Simple random search provides a competitive approach to reinforcement learning. In Proc. Advances in Neural Information Processing Systems, pages 1800–1809, 2018.
Manne, A. S. Linear programming and sequential decisions. Management Sci., 6(3):259–267, 1960.
Marbach, P. and Tsitsiklis, J. N. Simulation-based optimization of Markov reward processes: implementation issues. In Proc. of the Conf. on Dec. and Control, volume 2, pages 1769–1774. IEEE, 1999.
Marbach, P. and Tsitsiklis, J. N. Simulation-based optimization of Markov reward processes. IEEE Trans. Automat. Control, 46(2):191–209, 2001.
Mareels, I. M., Anderson, B. D., Bitmead, R. R., Bodson, M., and Sastry, S. S. Revisiting the MIT rule for adaptive control. In Åström, K. J. and Wittenmark, B., editors, Adaptive Systems in Control and Signal Processing 1986, pages 161–166. Elsevier, Netherlands, 1987.
Matni, N., Proutiere, A., Rantzer, A., and Tu, S. From self-tuning regulators to reinforcement learning and back again. In Proc. of the Conf. on Dec. and Control, pages 3724–3740, 2019.
Mayne, D., Rawlings, J., Rao, C., and Scokaert, P. Constrained model predictive control: stability and optimality. Automatica, 36(6):789–814, 2000.
Mayne, D. Q. Model predictive control: recent developments and future promise. Automatica, 50(12):2967–2986, 2014.
Mazumdar, E., Pacchiano, A., Ma, Y.-A., Bartlett, P. L., and Jordan, M. I. On Thompson sampling with Langevin algorithms. arXiv e-prints, 2020.
Mehta, P. G. and Meyn, S. P. Q-learning and Pontryagin's minimum principle. In Proc. of the Conf. on Dec. and Control, pages 3598–3605, December 2009.
Mehta, P. G. and Meyn, S. P. Convex Q-learning, part 1: deterministic optimal control. arXiv e-prints, arXiv:2008.03559, 2020.
Mehta, P. G., Meyn, S. P., Neu, G., and Lu, F. Convex Q-learning. In Proc. of the American Control Conf., pages 4749–4756, 2021.
Mei, J., Xiao, C., Szepesvari, C., and Schuurmans, D. On the global convergence rates of softmax policy gradient methods. arXiv e-print 2005.06392, 2020.
Melo, F. S., Meyn, S. P., and Ribeiro, M. I. An analysis of reinforcement learning with function approximation. In Proc. ICML, pages 664–671. ACM, New York, NY, 2008.
Metivier, M. and Priouret, P. Théorèmes de convergence presque sûre pour une classe d'algorithmes stochastiques à pas décroissants [Almost sure convergence theorems for a class of stochastic algorithms with decreasing step sizes]. Prob. Theory Related Fields, 74:403–428, 1987.
Meyer, C. D., Jr. The role of the group generalized inverse in the theory of finite Markov chains. SIAM Review, 17(3):443–464, 1975.
Meyn, S. P. Workload models for stochastic networks: value functions and performance evaluation. IEEE Trans. Automat. Control, 50(8):1106–1122, August 2005.
Meyn, S. P. Large deviation asymptotics and control variates for simulating large functions. Ann. Appl. Probab., 16(1):310–339, 2006.
Meyn, S. P. Control Techniques for Complex Networks. Cambridge University Press, 2007. Pre-publication ed. available online.
Meyn, S. P. and Mathew, G. Shannon meets Bellman: feature based Markovian models for detection and optimization. In Proc. of the Conf. on Dec. and Control, pages 5558–5564, 2008.
Meyn, S. P. and Tweedie, R. L. Computable bounds for convergence rates of Markov chains. Ann. Appl. Probab., 4:981–1011, 1994.
Meyn, S. P. and Tweedie, R. L. Markov Chains and Stochastic Stability. Cambridge University Press, Cambridge, UK, 2nd ed., 2009. Published in the Cambridge Mathematical Library; 1993 ed. online.
Michie, D. and Chambers, R. A. Boxes: an experiment in adaptive control. Machine Intelligence, 2(2):137–152, 1968.
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. CoRR, abs/1602.01783, 2016.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. A. Playing Atari with deep reinforcement learning. arXiv, abs/1312.5602, 2013.
Mnih, V., Kavukcuoglu, K., Silver, D., et al. Human-level control through deep reinforcement learning. Nature, 518:529–533, 2015.
Mohri, M., Rostamizadeh, A., and Talwalkar, A. Foundations of Machine Learning. MIT Press, Cambridge, MA, 2018.
Molzahn, D. K., Dörfler, F., Sandberg, H., Low, S. H., Chakrabarti, S., Baldick, R., and Lavaei, J. A survey of distributed optimization and control algorithms for electric power systems. Trans. on Smart Grid, 8(6):2941–2962, November 2017.
Moore, A. W. Efficient Memory-Based Learning for Robot Control. PhD thesis, University of Cambridge, Computer Laboratory, 1990.
Mou, W., Junchi Li, C., Wainwright, M. J., Bartlett, P. L., and Jordan, M. I. On linear stochastic approximation: fine-grained Polyak–Ruppert and non-asymptotic concentration. arXiv e-prints, arXiv:2004.04719, April 2020.
Moulines, E. and Bach, F. R. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems 24, pages 451–459, 2011.
Murphy, K. P. Machine Learning: A Probabilistic Perspective. MIT Press, Cambridge, MA, 2012.
Murray, R. Feedback control theory: architectures and tools for real-time decision making. Tutorial series at the Simons Institute Program on Real-Time Decision Making. https://simons.berkeley.edu/talks/murray-control-1, January 2018.
Nachum, O. and Dai, B. Reinforcement learning via Fenchel–Rockafellar duality. arXiv preprint arXiv:2001.01866, 2020.
Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. Bridging the gap between value and policy based reinforcement learning. In Proc. Advances in Neural Information Processing Systems, volume 10, page 8, 2017.
Nedic, A. and Bertsekas, D. Least squares policy evaluation algorithms with linear function approximation. Discrete Event Dynamic Systems: Theory and Applications, 13(1–2):79–110, 2003.
Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
Nesterov, Y. Lectures on Convex Optimization. Springer Optimization and Its Applications 137. Springer Intl. Publishing, New York, NY, 2018.
Nesterov, Y. and Spokoiny, V. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2):527–566, 2017.
Norris, J. Markov Chains. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, UK, 1997.
Nowak, M. A. Evolutionary Dynamics: Exploring the Equations of Life. Harvard University Press, Cambridge, MA, 2006.
Nummelin, E. General Irreducible Markov Chains and Nonnegative Operators. Cambridge University Press, Cambridge, UK, 1984.
Oja, E. A simplified neuron model as a principal component analyzer. J. Math. Biol., 15(3):267–273, 1982.
Ormoneit, D. and Glynn, P. Kernel-based reinforcement learning in average-cost problems. Trans. on Automatic Control, 47(10):1624–1636, October 2002.
Orr, J. S. and Dennehy, C. J. Analysis of the X-15 flight 3-65-97 divergent limit-cycle oscillation. Journal of Aircraft, 54(1):135–148, 2017.
Osband, I., Van Roy, B., and Wen, Z. Generalization and exploration via randomized value functions. In Proc. ICML, pages 2377–2386, 2016.
Parikh, N. and Boyd, S. Proximal Algorithms. Foundations and Trends in Optimization. Now Publishers, Norwell, MA, 2013.
Park, J. B. and Lee, J. Y. Nonlinear adaptive control based on Lyapunov analysis: overview and survey. Journal of Institute of Control, Robotics and Systems, 20(3):261–269, 2014.
Perkins, T. J. and Barto, A. G. Lyapunov design for safe reinforcement learning. J. Mach. Learn. Res., 3:803–832, 2003.
Peters, J., Vijayakumar, S., and Schaal, S. Reinforcement learning for humanoid robotics. In Proc. of the IEEE-RAS International Conference on Humanoid Robots, pages 1–20, 2003.
Polyak, B. T. Gradient methods for minimizing functionals. Zhurnal Vychislitel'noi Matematiki i Matematicheskoi Fiziki, 3(4):643–653, 1963.
Polyak, B. T. A new method of stochastic approximation type. Avtomatika i Telemekhanika (in Russian), 1990; translated in Automat. Remote Control, 51 (1991), pages 98–107.
Polyak, B. T. and Juditsky, A. B. Acceleration of stochastic approximation by averaging. SIAM J. Control Optim., 30(4):838–855, 1992.
Powell, W. B. Reinforcement Learning and Stochastic Optimization. John Wiley & Sons, Hoboken, NJ, 2021.
Principe, J. C. Information Theory, Machine Learning, and Reproducing Kernel Hilbert Spaces, pages 1–45. Springer New York, New York, NY, 2010.
Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, New York, NY, 2014.
Qu, G. and Li, N. On the exponential stability of primal-dual gradient dynamics. Control Systems Letters, 3(1):43–48, 2018.
Raginsky, M. Divergence-based characterization of fundamental limitations of adaptive dynamical systems. In Conference on Communication, Control, and Computing, pages 107–114, 2010.
Raginsky, M. and Bouvrie, J. Continuous-time stochastic mirror descent on a network: variance reduction, consensus, convergence. In Proc. of the Conf. on Dec. and Control, pages 6793–6800, 2012.
Raginsky, M. and Rakhlin, A. Information-based complexity, feedback and dynamics in convex programming. Transactions on Information Theory, 57(10):7036–7056, 2011.
Ramaswamy, A. and Bhatnagar, S. A generalization of the Borkar–Meyn theorem for stochastic recursive inclusions. Math. Oper. Res., 42(3):648–661, 2017.
Ramaswamy, A. and Bhatnagar, S. Stability of stochastic approximations with "controlled Markov" noise and temporal difference learning. Trans. on Automatic Control, 64:2614–2620, 2019.
Rastrigin, L. Extremum control by means of random scan. Avtomat. i Telemekh, 21(9):1264–1271, 1960.
Rastrigin, L. A. Random search in problems of optimization, identification and training of control systems. Journal of Cybernetics, 3(3):93–103, 1973.
Research Staff. Experience with the X-15 adaptive flight control system. TN D-6208, NASA Flight Research Center, Edwards, CA, 1971.
Robbins, H. and Monro, S. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951.
Rosenthal, J. S. Correction: "Minorization conditions and convergence rates for Markov chain Monte Carlo." J. Amer. Statist. Assoc., 90(431):1136, 1995.
Rosenthal, J. S. Minorization conditions and convergence rates for Markov chain Monte Carlo. J. Amer. Statist. Assoc., 90(430):558–566, 1995.
Rudin, W. Real and Complex Analysis. McGraw-Hill, New York, NY, 2nd ed., 1974.
Ruppert, D. A Newton–Raphson version of the multivariate Robbins–Monro procedure. The Annals of Statistics, 13(1):236–245, 1985.
Ruppert, D. Efficient estimators from a slowly convergent Robbins–Monro process. Technical Report No. 781, Cornell University, School of Operations Research and Industrial Engineering, Ithaca, NY, 1988.
Russo, D. J., Van Roy, B., Kazerouni, A., Osband, I., and Wen, Z. A Tutorial on Thompson Sampling. Now Publishers Inc., Norwell, MA, 2018.
Rybko, A. N. and Stolyar, A. L. On the ergodicity of random processes that describe the functioning of open queueing networks. Problemy Peredachi Informatsii, 28(3):3–26, 1992.
Sacks, J. Asymptotic distribution of stochastic approximation procedures. The Annals of Mathematical Statistics, 29(2):373–405, 1958.
Schrittwieser, J., Antonoglou, I., Hubert, T., et al. Mastering Atari, Go, chess and Shogi by planning with a learned model. arXiv, abs/1911.08265, 2019.
Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In Intl. Conference on Machine Learning, pages 1889–1897, 2015.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv, abs/1707.06347, 2017.
Schweitzer, P. J. Perturbation theory and finite Markov chains. J. Appl. Prob., 5:401–403, 1968.
Schweitzer, P. J. and Seidmann, A. Generalized polynomial approximations in Markovian decision processes. Journal of Mathematical Analysis and Applications, 110(2):568–582, 1985.
Seneta, E. Non-Negative Matrices and Markov Chains. Springer, New York, NY, 2nd ed., 1981.
Shannon, C. A mathematical theory of communication. Bell System Tech. J., 27:379–423, 623–656, 1948.
Sharma, H., Jain, R., and Gupta, A. An empirical relative value learning algorithm for non-parametric MDPs with continuous state space. In European Control Conference, pages 1368–1373. IEEE, 2019.
Shi, B., Du, S. S., Su, W., and Jordan, M. I. Acceleration via symplectic discretization of high-resolution differential equations. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R., editors, Proc. Advances in Neural Information Processing Systems, pages 5744–5752, 2019.
Shirodkar, S. and Meyn, S. Quasi stochastic approximation. In Proc. of the American Control Conf., pages 2429–2435, July 2011.
Shivam, S., Buckley, I., Wardi, Y., Seatzu, C., and Egerstedt, M. Tracking control by the Newton–Raphson flow: applications to autonomous vehicles. CoRR, abs/1811.08033, 2018.
Sikora, R. and Skarbek, W. On stability of Oja algorithm. In Polkowski, L. and Skowron, A., editors, Rough Sets and Current Trends in Computing, volume 1424 of Lecture Notes in Computer Science, pages 354–360. Springer Verlag, Berlin, 2009.
Silver, D., Hubert, T., Schrittwieser, J., et al. A general reinforcement learning algorithm that masters chess, Shogi, and Go through self-play. Science, 362(6419):1140–1144, 2018.
Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M.. Deterministic policy gradient algorithms. In Proc. ICML, pages 387395, 2014.Google Scholar
Singh, S. P., Jaakkola, T., and Jordan, M.. Reinforcement learning with soft state aggregation. Proc. Advances in Neural Information Processing Systems, 7:361, 1995.Google Scholar
Smale, S.. A convergent process of price adjustment and global Newton methods. Journal of Mathematical Economics, 3(2):107120, July 1976.Google Scholar
Smallwood, R. D. and Sondik, E. J.. The optimal control of partially observable Markov processes over a finite horizon. Oper. Res., 21(5):10711088, October 1973.Google Scholar
Spall, J. C.. Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control, 37(3):332341, 1992.Google Scholar
Spall, J. C.. A stochastic approximation technique for generating maximum likelihood parameter estimates. In Proc. of the American Control Conf., pages 11611167. IEEE, 1987.Google Scholar
Spall, J. C.. A one-measurement form of simultaneous perturbation stochastic approximation. Automatica, 33(1):109112, 1997.CrossRefGoogle Scholar
Spong, M. W. and Block, D. J.. The pendubot: a mechatronic system for control research and education. In Proc. of the Conf. on Dec. and Control, pages 555556. IEEE, 1995.Google Scholar
Spong, M. W. and Praly, L.. Control of underactuated mechanical systems using switching and saturation. In Morse, A. S., editor, Control Using Logic-Based Switching, pages 162172. Springer, Berlin, Heidelberg 1997.CrossRefGoogle Scholar
Spong, M. W. and Vidyasagar, M.. Robot Dynamics and Control. John Wiley & Sons, Chichester, UK, 2008.Google Scholar
Srikant, R. and Ying, L.. Finite-time error bounds for linear stochastic approximation and TD learning. In Proc. COLT, pages 28032830, 2019.Google Scholar
Stratonovich, R. L.. Conditional Markov processes. SIAM J. Theory Probab. and Appl., 5:156178, 1960.Google Scholar
Su, W., Boyd, S., and Candes, E.. A differential equation for modeling Nesterov's accelerated gradient method: theory and insights. In Proc. Advances in Neural Information Processing Systems, pages 2510–2518, 2014.
Subramanian, J. and Mahajan, A.. Approximate information state for partially observed systems. In Proc. of the Conf. on Dec. and Control, pages 1629–1636. IEEE, 2019.
Subramanian, J., Sinha, A., Seraj, R., and Mahajan, A.. Approximate information state for approximate planning and reinforcement learning in partially observed systems. arXiv:2010.08843, 2020.
Sutton, R. and Barto, A.. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 2nd ed., 2018. Online edition at www.cs.ualberta.ca/~sutton/book/the-book.html.
Sutton, R. S.. Temporal Credit Assignment in Reinforcement Learning. PhD thesis, University of Massachusetts, Amherst, 1984.
Sutton, R. S.. Learning to predict by the methods of temporal differences. Mach. Learn., 3(1):9–44, 1988.
Sutton, R. S.. Generalization in reinforcement learning: successful examples using sparse coarse coding. In Proc. of the Intl. Conference on Neural Information Processing Systems, pages 1038–1044, 1995.
Sutton, R. S. and Barto, A. G.. Toward a modern theory of adaptive networks: expectation and prediction. Psychological Review, 88(2):135, 1981.
Sutton, R. S., Barto, A. G., and Williams, R. J.. Reinforcement learning is direct adaptive optimal control. Control Systems Magazine, 12(2):19–22, 1992.
Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y.. Policy gradient methods for reinforcement learning with function approximation. In Proc. Advances in Neural Information Processing Systems, pages 1057–1063, 2000.
Sutton, R. S., Szepesvári, C., and Maei, H. R.. A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. In Proc. of the Intl. Conference on Neural Information Processing Systems, pages 1609–1616, 2008.
Szepesvári, C.. The asymptotic convergence-rate of Q-learning. In Proc. of the Intl. Conference on Neural Information Processing Systems, pages 1064–1070, 1997.
Szepesvári, C.. Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, San Rafael, CA, 2010.
Tan, Y., Moase, W. H., Manzie, C., Nešić, D., and Mareels, I.. Extremum seeking from 1922 to 2010. In Proc. of the 29th Chinese Control Conference, pages 14–26. IEEE, 2010.
Tanzanakis, A. and Lygeros, J.. Data-driven control of unknown systems: a linear programming approach. ArXiv, abs/2003.00779, 2020.
Tesauro, G.. TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6(2):215–219, 1994.
Thoppe, G. and Borkar, V.. A concentration bound for stochastic approximation via Alekseev's formula. Stochastic Systems, 9(1):1–26, 2019.
Tsitsiklis, J.. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16:185–202, 1994.
Tsitsiklis, J. and Van Roy, B.. Optimal stopping of Markov processes: Hilbert space theory, approximation algorithms, and an application to pricing high-dimensional financial derivatives. IEEE Trans. Automat. Control, 44(10):1840–1851, 1999.
Tsitsiklis, J. N. and Van Roy, B.. Average cost temporal-difference learning. Automatica, 35(11):1799–1808, 1999.
Tsitsiklis, J. N. and Van Roy, B.. Feature-based methods for large scale dynamic programming. Machine Learning, 22(1–3):59–94, 1996.
Tsitsiklis, J. N. and Van Roy, B.. An analysis of temporal-difference learning with function approximation. IEEE Trans. Automat. Control, 42(5):674–690, 1997.
Tsypkin, Y. Z. and Nikolic, Z. J.. Adaptation and Learning in Automatic Systems. Academic Press, New York, NY, 1971.
Tzen, B. and Raginsky, M.. Theoretical guarantees for sampling and inference in generative models with latent diffusions. In Beygelzimer, A. and Hsu, D., editors, Proc. COLT, volume 99, pages 3084–3114, 2019.
Vamvoudakis, K. G., Lewis, F. L., and Vrabie, D.. Reinforcement learning with applications in autonomous control and game theory. In Angelov, P., editor, Handbook on Computer Learning and Intelligence. World Scientific, Hackensack, NJ, 2nd ed., 2021.
Vamvoudakis, K. G., Wan, Y., Lewis, F. L., and Cansever, D., editors. Handbook on Reinforcement Learning and Control. Studies in Systems, Decision and Control (SSDC), volume 325. Springer, Princeton, NJ, 2021.
van der Vaart, A. W.. Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, UK, 1998.
van Handel, R.. Lecture notes on hidden Markov models. https://web.math.princeton.edu/~rvan/, 2008.
Van Roy, B.. Learning and Value Function Approximation in Complex Decision Processes. PhD thesis, Massachusetts Institute of Technology, 1998. AAI0599623.
Vandenberghe, L. and Boyd, S.. Applications of semidefinite programming. Applied Numerical Mathematics, 29(3):283–299, 1999.
Vapnik, V.. Estimation of Dependences Based on Empirical Data. Springer Science & Business Media, New York, NY, 2006.
Venter, J. et al. An extension of the Robbins–Monro procedure. The Annals of Mathematical Statistics, 38(1):181–190, 1967.
Vinter, R.. Convex duality and nonlinear optimal control. SIAM Journal on Control and Optimization, 31(2):518–521, March 1993.
Walton, N.. A short note on soft-max and policy gradients in bandits problems. arXiv preprint arXiv:2007.10297, 2020.
Wang, Y. and Boyd, S.. Performance bounds for linear stochastic control. Systems Control Lett., 58(3):178–182, 2009.
Wardi, Y., Seatzu, C., Egerstedt, M., and Buckley, I.. Performance regulation and tracking via lookahead simulation: preliminary results and validation. In Proc. of the Conf. on Dec. and Control, pages 6462–6468, 2017.
Watkins, C. J. C. H.. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, UK, 1989.
Watkins, C. J. C. H. and Dayan, P.. Q-learning. Machine Learning, 8(3–4):279–292, 1992.
Weber, B.. Swift and slashing, computer topples Kasparov. New York Times, 12:262, 1997.
Whittle, P.. Risk-Sensitive Optimal Control. John Wiley and Sons, Chichester, UK, 1990.
Wibisono, A., Wilson, A. C., and Jordan, M. I.. A variational perspective on accelerated methods in optimization. Proc. of the National Academy of Sciences, 113:E7351–E7358, 2016.
Williams, R. J.. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4):229–256, 1992.
Witten, I. H.. An adaptive optimal controller for discrete-time Markov environments. Information and Control, 34(4):286–295, 1977.
Wu, L.. Essential spectral radius for Markov semigroups. I. Discrete time case. Prob. Theory Related Fields, 128(2):255–321, 2004.
Yaji, V. G. and Bhatnagar, S.. Stochastic recursive inclusions with non-additive iterate-dependent Markov noise. Stochastics, 90(3):330–363, 2018.
Yin, H., Mehta, P., Meyn, S., and Shanbhag, U.. Synchronization of coupled oscillators is a game. IEEE Transactions on Automatic Control, 57(4):920–935, 2012.
Yin, H., Mehta, P., Meyn, S., and Shanbhag, U.. Learning in mean-field games. IEEE Transactions on Automatic Control, 59(3):629–644, March 2014.
Yin, H., Mehta, P. G., Meyn, S. P., and Shanbhag, U. V.. On the efficiency of equilibria in mean-field oscillator games. Dynamic Games and Applications, 4(2):177–207, 2014.
Zhang, J., Koppel, A., Bedi, A. S., Szepesvari, C., and Wang, M.. Variational policy gradient method for reinforcement learning with general utilities. Proc. Advances in Neural Information Processing Systems, 33:4572–4583, 2020.
Zhang, J., Mokhtari, A., Sra, S., and Jadbabaie, A.. Direct Runge–Kutta discretization achieves acceleration. In Proc. of the Intl. Conference on Neural Information Processing Systems, pages 3904–3913, 2018.
Zhao, J. and Spong, M.. Hybrid control for global stabilization of the cart–pendulum system. Automatica, 37(12):1941–1951, 2001.
Zhou, K., Doyle, J. C., and Glover, K.. Robust and Optimal Control. Prentice Hall, Englewood Cliffs, NJ, 1996.
