
Bibliography

Published online by Cambridge University Press:  04 July 2020

Tor Lattimore, University of Alberta
Csaba Szepesvári, University of Alberta

Type: Chapter
Information: Bandit Algorithms, pp. 484–512
Publisher: Cambridge University Press
Print publication year: 2020


References

Abbasi-Yadkori, Y.. Forced-exploration based algorithms for playing in bandits with large action sets. PhD thesis, University of Alberta, 2009a. [213]
Abbasi-Yadkori, Y.. Forced-exploration based algorithms for playing in bandits with large action sets. Master's thesis, University of Alberta, Department of Computing Science, 2009b. [79]
Abbasi-Yadkori, Y.. Online Learning for Linearly Parametrized Control Problems. PhD thesis, University of Alberta, 2012. [214, 475]
Abbasi-Yadkori, Y. and Szepesvári, Cs.. Regret bounds for the adaptive control of linear quadratic systems. In Proceedings of the 24th Conference on Learning Theory, pages 1–26, Budapest, Hungary, 2011. JMLR.org. [475]
Abbasi-Yadkori, Y. and Szepesvári, Cs.. Bayesian optimal control of smoothly parameterized systems. In Proceedings of the 31st Conference on Uncertainty in Artificial Intelligence, pages 2–11, Arlington, VA, United States, 2015. AUAI Press. [475]
Abbasi-Yadkori, Y., Antos, A., and Szepesvári, Cs.. Forced-exploration based algorithms for playing in stochastic linear bandits. In COLT Workshop on On-line Learning with Limited Feedback, 2009. [79, 213]
Abbasi-Yadkori, Y., Pál, D., and Szepesvári, Cs.. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320. Curran Associates, Inc., 2011. [213]
Abbasi-Yadkori, Y., Pál, D., and Szepesvári, Cs.. Online-to-confidence-set conversions and application to sparse stochastic bandits. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics, pages 1–9, La Palma, Canary Islands, 2012. JMLR.org. [249]
Abbasi-Yadkori, Y., Bartlett, P. L., Kanade, V., Seldin, Y., and Szepesvári, Cs.. Online learning in Markov decision processes with adversarially chosen transition probability distributions. In Advances in Neural Information Processing Systems, pages 2508–2516, USA, 2013. Curran Associates Inc. [475]
Abbasi-Yadkori, Y., Bartlett, P., Gabillon, V., Malek, A., and Valko, M.. Best of both worlds: Stochastic & adversarial best-arm identification. In Proceedings of the 31st Conference on Learning Theory, 2018. [365]
Abe, N. and Long, P. M.. Associative reinforcement learning using linear probabilistic concepts. In Proceedings of the 16th International Conference on Machine Learning, pages 3–11, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc. [213]
Abeille, M. and Lazaric, A.. Linear Thompson sampling revisited. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pages 176–184, Fort Lauderdale, FL, USA, 2017a. JMLR.org. [417]
Abeille, M. and Lazaric, A.. Thompson sampling for linear-quadratic control problems. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pages 1246–1254, Fort Lauderdale, FL, USA, 2017b. JMLR.org. [475]
Abernethy, J. D. and Rakhlin, A.. Beating the adaptive bandit with high probability. In Proceedings of the 22nd Conference on Learning Theory, 2009. [149, 301]
Abernethy, J. D., Hazan, E., and Rakhlin, A.. Competing in the dark: An efficient algorithm for bandit linear optimization. In Proceedings of the 21st Conference on Learning Theory, pages 263–274. Omnipress, 2008. [301]
Abernethy, J. D., Hazan, E., and Rakhlin, A.. Interior-point methods for full-information and bandit online learning. IEEE Transactions on Information Theory, 58(7):4164–4175, 2012. [148, 299]
Abernethy, J. D., Lee, C., Sinha, A., and Tewari, A.. Online linear optimization via smoothing. In Proceedings of the 27th Conference on Learning Theory, pages 807–823, Barcelona, Spain, 2014. JMLR.org. [328]
Abernethy, J. D., Lee, C., and Tewari, A.. Fighting bandits with a new kind of smoothness. In Advances in Neural Information Processing Systems, pages 2197–2205. Curran Associates, Inc., 2015. [301, 328]
Abramowitz, M. and Stegun, I. A.. Handbook of mathematical functions: with formulas, graphs, and mathematical tables, volume 55. Courier Corporation, 1964. [158, 418]
Achab, M., Clémençon, S., Garivier, A., Sabourin, A., and Vernade, C.. Max k-armed bandit: On the ExtremeHunter algorithm and beyond. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 389–404. Springer, 2017. [365]
Adelman, L.. Choice theory. In Gass, Saul I. and Fu, Michael C., editors, Encyclopedia of Operations Research and Management Science, pages 164–168. Springer US, Boston, MA, 2013. [54]
Agarwal, A., Foster, D. P., Hsu, D. J., Kakade, S. M., and Rakhlin, A.. Stochastic convex optimization with bandit feedback. In Advances in Neural Information Processing Systems, pages 1035–1043. Curran Associates, Inc., 2011. [315]
Agarwal, A., Foster, D. P., Hsu, D., Kakade, S. M., and Rakhlin, A.. Stochastic convex optimization with bandit feedback. SIAM Journal on Optimization, 23(1):213–240, 2013. [364]
Agarwal, A., Hsu, D., Kale, S., Langford, J., Li, L., and Schapire, R.. Taming the monster: A fast and simple algorithm for contextual bandits. In Proceedings of the 31st International Conference on Machine Learning, pages 1638–1646, Beijing, China, 2014. JMLR.org. [201, 202]
Agarwal, A., Bird, S., Cozowicz, M., Hoang, L., Langford, J., Lee, S., Li, J., Melamed, D., Oshri, G., and Ribas, O.. Making contextual decisions with low technical debt. arXiv:1606.03966, 2016. [11]
Agrawal, R.. Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, pages 1054–1078, 1995. [92, 100]
Agrawal, S. and Devanur, N. R.. Bandits with concave rewards and convex knapsacks. In Proceedings of the 15th ACM Conference on Economics and Computation, pages 989–1006. ACM, 2014. [315]
Agrawal, S. and Devanur, N. R.. Linear contextual bandits with knapsacks. In Advances in Neural Information Processing Systems, pages 3458–3467. Curran Associates Inc., 2016. [315]
Agrawal, S. and Goyal, N.. Analysis of Thompson sampling for the multi-armed bandit problem. In Proceedings of the 25th Conference on Learning Theory, 2012. [416]
Agrawal, S. and Goyal, N.. Further optimal regret bounds for Thompson sampling. In Proceedings of the 16th International Conference on Artificial Intelligence and Statistics, pages 99–107, Scottsdale, Arizona, USA, 2013a. JMLR.org. [415, 417]
Agrawal, S. and Goyal, N.. Thompson sampling for contextual bandits with linear payoffs. In Proceedings of the 30th International Conference on Machine Learning, pages 127–135, Atlanta, GA, USA, 2013b. JMLR.org. [417]
Agrawal, S. and Jia, R.. Optimistic posterior sampling for reinforcement learning: worst-case regret bounds. In Advances in Neural Information Processing Systems, pages 1184–1194. Curran Associates, Inc., 2017. [474]
Agrawal, S., Avadhanula, V., Goyal, V., and Zeevi, A.. Thompson sampling for the MNL-bandit. In Proceedings of the 2017 Conference on Learning Theory, pages 76–78, Amsterdam, Netherlands, 2017. JMLR.org. [417]
Ailon, N., Karnin, Z., and Joachims, T.. Reducing dueling bandits to cardinal bandits. In Proceedings of the 31st International Conference on Machine Learning, pages II–856–II–864. JMLR.org, 2014. [315]
Aldrich, J.. "But you have to remember P. J. Daniell of Sheffield". Electronic Journal for History of Probability and Statistics, 3(2), 2007. [42]
Allenberg, C., Auer, P., Györfi, L., and Ottucsák, G.. Hannan consistency in on-line learning in case of unbounded losses under partial monitoring. In Proceedings of the 17th International Conference on Algorithmic Learning Theory, pages 229–243, Berlin, Heidelberg, 2006. Springer-Verlag. [135, 148, 299]
Alon, N., Matias, Y., and Szegedy, M.. The space complexity of approximating the frequency moments. In Proceedings of the 28th Annual ACM Symposium on Theory of Computing, pages 20–29. ACM, 1996. [96]
Alon, N., Cesa-Bianchi, N., Gentile, C., and Mansour, Y.. From bandits to experts: A tale of domination and independence. In Advances in Neural Information Processing Systems, pages 1610–1618. Curran Associates, Inc., 2013. [316, 448]
Alon, N., Cesa-Bianchi, N., Dekel, O., and Koren, T.. Online learning with feedback graphs: Beyond bandits. In Proceedings of the 28th Conference on Learning Theory, pages 23–35, Paris, France, 2015. JMLR.org. [316]
Anantharam, V., Varaiya, P., and Walrand, J.. Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays - Part I: i.i.d. rewards. IEEE Transactions on Automatic Control, 32(11):968–976, 1987. [214]
Anderson, J. R., Dillon, J. L., and Hardaker, J. E.. Agricultural decision analysis. Monographs: Applied Economics. Iowa State University Press, 1977. [xiii]
Anscombe, F. J.. Sequential medical trials. Journal of the American Statistical Association, 58(302):365–383, 1963. [79]
Antos, A., Bartók, G., Pál, D., and Szepesvári, Cs.. Toward a classification of finite partial-monitoring games. Theoretical Computer Science, 473:77–99, 2013. [448]
Arapostathis, A., Borkar, V. S., Fernandez-Gaucherand, E., Ghosh, M. K., and Marcus, S. I.. Discrete-time controlled Markov processes with average cost criterion: a survey. SIAM Journal on Control and Optimization, 31(2):282–344, 1993. [474]
Arora, R., Dekel, O., and Tewari, A.. Online bandit learning against an adaptive adversary: From regret to policy regret. In Proceedings of the 29th International Conference on Machine Learning, Madison, WI, USA, 2012. Omnipress. [136]
Ashwinkumar, B., Langford, J., and Slivkins, A.. Resourceful contextual bandits. In Proceedings of the 27th Conference on Learning Theory, pages 1109–1134, Barcelona, Spain, 2014. JMLR.org. [315]
Audibert, J.-Y. and Bubeck, S.. Regret bounds and minimax policies under partial monitoring. Journal of Machine Learning Research, 11:2785–2836, 2010a. [136]
Audibert, J.-Y. and Bubeck, S.. Minimax policies for adversarial and stochastic bandits. In Proceedings of the 22nd Conference on Learning Theory, pages 217–226, 2009. [108, 136, 305]
Audibert, J.-Y. and Bubeck, S.. Best arm identification in multi-armed bandits. In Proceedings of the 23rd Conference on Learning Theory, 2010b. [338, 364]
Audibert, J.-Y., Munos, R., and Szepesvári, Cs.. Tuning bandit algorithms in stochastic environments. In Proceedings of the 18th International Conference on Algorithmic Learning Theory, pages 150–165, Berlin, Heidelberg, 2007. Springer Berlin Heidelberg. [56, 70, 92, 95, 183]
Audibert, J.-Y., Munos, R., and Szepesvári, Cs.. Exploration-exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science, 410(19):1876–1902, 2009. [56]
Audibert, J.-Y., Bubeck, S., and Lugosi, G.. Regret in online combinatorial optimization. Mathematics of Operations Research, 39(1):31–45, 2013. [301]
Auer, P.. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3:397–422, 2002. [213, 238]
Auer, P. and Chiang, C.. An algorithm with nearly optimal pseudo-regret for both stochastic and adversarial bandits. In Proceedings of the 29th Annual Conference on Learning Theory, pages 116–120, New York, NY, USA, 2016. JMLR.org. [136]
Auer, P. and Ortner, R.. Logarithmic online regret bounds for undiscounted reinforcement learning. In Advances in Neural Information Processing Systems, pages 49–56. MIT Press, 2007. [475]
Auer, P. and Ortner, R.. UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1–2):55–65, 2010. [82, 108]
Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E.. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of the 36th Annual Symposium on Foundations of Computer Science, pages 322–331. IEEE, 1995. [81, 125, 137, 174]
Auer, P., Cesa-Bianchi, N., and Fischer, P.. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235–256, 2002a. [79, 92]
Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E.. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002b. [148, 149, 175, 201, 338]
Auer, P., Ortner, R., and Szepesvári, Cs.. Improved rates for the stochastic continuum-armed bandit problem. In International Conference on Computational Learning Theory, pages 454–468. Springer, 2007. [314]
Auer, P., Jaksch, T., and Ortner, R.. Near-optimal regret bounds for reinforcement learning. In Advances in Neural Information Processing Systems, pages 89–96, 2009. [474]
Auer, P., Gajane, P., and Ortner, R.. Adaptively tracking the best arm with an unknown number of distribution changes. In European Workshop on Reinforcement Learning 14, 2018. [338]
Auer, P., Gajane, P., and Ortner, R.. Adaptively tracking the best bandit arm with an unknown number of distribution changes. In Proceedings of the 32nd Conference on Learning Theory, 2019. [337, 338]
Awerbuch, B. and Kleinberg, R.. Adaptive routing with end-to-end feedback: Distributed learning and geometric approaches. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing, pages 45–53. ACM, 2004. [328]
Axler, S. J.. Linear algebra done right, volume 2. Springer, 1997. [445]
Azar, M. G., Osband, I., and Munos, R.. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pages 263–272, Sydney, Australia, 06–11 Aug 2017. JMLR.org. [474]
Badanidiyuru, A., Kleinberg, R., and Slivkins, A.. Bandits with knapsacks. In Foundations of Computer Science (FOCS), 2013 IEEE 54th Annual Symposium on, pages 207–216. IEEE, 2013. [315]
Bartlett, P. L. and Tewari, A.. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, pages 35–42, Arlington, VA, United States, 2009. AUAI Press. [479]
Bartók, G.. A near-optimal algorithm for finite partial-monitoring games against adversarial opponents. In Proceedings of the 26th Conference on Learning Theory, pages 696–710. JMLR.org, 2013. [447, 448]
Bartók, G. and Szepesvári, Cs.. Partial monitoring with side information. In Proceedings of the 23rd International Conference on Algorithmic Learning Theory, pages 305–319, 2012. [448]
Bartók, G., Pál, D., and Szepesvári, Cs.. Toward a classification of finite partial-monitoring games. In Proceedings of the 21st International Conference on Algorithmic Learning Theory, pages 224–238. Springer, 2010. [448]
Bartók, G., Zolghadr, N., and Szepesvári, Cs.. An adaptive algorithm for finite stochastic partial monitoring. In Proceedings of the 29th International Conference on Machine Learning, pages 1779–1786, USA, 2012. Omnipress. [448]
Bartók, G., Foster, D. P., Pál, D., Rakhlin, A., and Szepesvári, Cs.. Partial monitoring—classification, regret bounds, and algorithms. Mathematics of Operations Research, 39(4):967–997, 2014. [448]
Bastani, H. and Bayati, M.. Online decision making with high-dimensional covariates. Operations Research, 68(1):276–294, 2020. [249]
Bather, J. A. and Chernoff, H.. Sequential decisions in the control of a spaceship. In Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 3, pages 181–207, 1967. [10]
Bayes, T.. LII. An essay towards solving a problem in the doctrine of chances. By the late Rev. Mr. Bayes, FRS, communicated by Mr. Price, in a letter to John Canton, AMFRS. Philosophical Transactions of the Royal Society of London, 53:370–418, 1763. [382]
Bellman, R.. The theory of dynamic programming. Technical report, RAND Corporation, Santa Monica, CA, 1954. [473]
Bellman, R. E.. Eye of the Hurricane. World Scientific, 1984. [473]
Berend, D. and Kontorovich, A.. On the concentration of the missing mass. Electronic Communications in Probability, 18(3):1–7, 2013. [69]
Berger, J. O.. Statistical Decision Theory and Bayesian Analysis. Springer Science & Business Media, 1985. [380]
Bernoulli, D.. Exposition of a new theory on the measurement of risk. Econometrica: Journal of the Econometric Society, pages 23–36, 1954. [54]
Berry, A. C.. The accuracy of the Gaussian approximation to the sum of independent variates. Transactions of the American Mathematical Society, 49(1):122–136, 1941. [64]
Berry, D. and Fristedt, B.. Bandit problems: sequential allocation of experiments. Chapman and Hall, London; New York, 1985. [11, 400, 401]
Berry, D. A., Chen, R. W., Zame, A., Heath, D. C., and Shepp, L. A.. Bandit problems with infinitely many arms. The Annals of Statistics, 25(5):2103–2116, 1997. [314]
Bertsekas, D. and Tsitsiklis, J. N.. Neuro-Dynamic Programming. Athena Scientific, 1st edition, 1996. [474]
Bertsekas, D. P.. Dynamic Programming and Optimal Control, volumes 1–2. Athena Scientific, Belmont, MA, 4th edition, 2012. [472, 473, 474]
Bertsekas, D. P.. Convex optimization algorithms. Athena Scientific, Belmont, 2015. [329]
Bertsimas, D. and Tsitsiklis, J. N.. Introduction to linear optimization, volume 6. Athena Scientific, Belmont, MA, 1997. [474]
Besbes, O., Gur, Y., and Zeevi, A.. Stochastic multi-armed-bandit problem with non-stationary rewards. In Advances in Neural Information Processing Systems, pages 199–207. Curran Associates, Inc., 2014. [338]
Besson, L. and Kaufmann, E.. What doubling tricks can and can't do for multi-armed bandits. arXiv:1803.06971, 2018. [81]
Beygelzimer, A., Langford, J., Li, L., Reyzin, L., and Schapire, R. E.. An optimal high probability algorithm for the contextual bandit problem. arXiv:1002.4058, 2010. [148]
Beygelzimer, A., Langford, J., Li, L., Reyzin, L., and Schapire, R.. Contextual bandit algorithms with supervised learning guarantees. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, pages 19–26, Fort Lauderdale, FL, USA, 2011. JMLR.org. [201, 204]
Billingsley, P.. Probability and measure. John Wiley & Sons, 2008. [32, 42]
Blackwell, D.. Controlled random walks. In Proceedings of the International Congress of Mathematicians, volume 3, pages 336–338, 1954. [125]
Bogachev, V. I.. Measure theory, volume 2. Springer Science & Business Media, 2007. [33, 277]
Bonald, T. and Proutiere, A.. Two-target algorithms for infinite-armed bandits with Bernoulli rewards. In Advances in Neural Information Processing Systems, pages 2184–2192, 2013. [314]
Bottou, L., Peters, J., Quiñonero-Candela, J., Charles, D. X., Chickering, D. M., Portugaly, E., Ray, D., Simard, P., and Snelson, E.. Counterfactual reasoning and learning systems: The example of computational advertising. The Journal of Machine Learning Research, 14(1):3207–3260, 2013. [149]
Boucheron, S., Lugosi, G., and Massart, P.. Concentration inequalities: A nonasymptotic theory of independence. OUP Oxford, 2013. [66, 300]
Bouneffouf, D. and Rish, I.. A survey on practical applications of multi-armed and contextual bandits. arXiv:1904.10040, 2019. [11]
Box, G. E. P.. Science and statistics. Journal of the American Statistical Association, 71(356):791–799, 1976. [125]
Box, G. E. P.. Robustness in the strategy of scientific model building. Robustness in Statistics, 1:201–236, 1979. [125]
Boyd, S. and Vandenberghe, L.. Convex optimization. Cambridge University Press, 2004. [275]
Bradt, R. N., Johnson, S. M., and Karlin, S.. On sequential designs for maximizing the sum of n observations. The Annals of Mathematical Statistics, pages 1060–1074, 1956. [401]
Brafman, R. and Tennenholtz, M.. R-MAX – a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3:213–231, 2003. [475]
Bretagnolle, J. and Huber, C.. Estimation des densités: risque minimax. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 47(2):119–137, 1979. [167]
Bubeck, S. and Cesa-Bianchi, N.. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012. [10, 92, 136, 301, 327]
Bubeck, S. and Eldan, R.. The entropic barrier: a simple and optimal universal self-concordant barrier. In Proceedings of the 28th Conference on Learning Theory, pages 279–279, Paris, France, 2015. JMLR.org. [284]
Bubeck, S. and Eldan, R.. Multi-scale exploration of convex functions and bandit convex optimization. In Proceedings of the 29th Conference on Learning Theory, pages 583–589, New York, NY, USA, 2016. JMLR.org. [315, 416, 417]
Bubeck, S. and Liu, C.. Prior-free and prior-dependent regret bounds for Thompson sampling. In Advances in Neural Information Processing Systems, pages 638–646. Curran Associates, Inc., 2013. [417]
Bubeck, S. and Slivkins, A.. The best of both worlds: Stochastic and adversarial bandits. In Proceedings of the 25th Conference on Learning Theory, pages 42.1–42.23, 2012. [136]
Bubeck, S., Munos, R., and Stoltz, G.. Pure exploration in multi-armed bandits problems. In International Conference on Algorithmic Learning Theory, pages 23–37. Springer, 2009. [364]
Bubeck, S., Munos, R., Stoltz, G., and Szepesvári, Cs.. X-armed bandits. Journal of Machine Learning Research, 12:1655–1695, 2011. [314]
Bubeck, S., Cesa-Bianchi, N., and Kakade, S.. Towards minimax policies for online linear optimization with bandit feedback. In Proceedings of the 25th Conference on Learning Theory, pages 41–1. Microtome, 2012. [238, 283, 301]
Bubeck, S., Cesa-Bianchi, N., and Lugosi, G.. Bandits with heavy tail. IEEE Transactions on Information Theory, 59(11):7711–7717, 2013a. [96]
Bubeck, S., Perchet, V., and Rigollet, P.. Bounded regret in stochastic multi-armed bandits. In Proceedings of the 26th Annual Conference on Learning Theory, pages 122–134, Princeton, NJ, USA, 2013b. JMLR.org. [174]
Bubeck, S., Dekel, O., Koren, T., and Peres, Y.. Bandit convex optimization: √T regret in one dimension. In Proceedings of the 28th Conference on Learning Theory, pages 266–278, Paris, France, 2015a. JMLR.org. [315, 416, 417]
Bubeck, S., Eldan, R., and Lehec, J.. Finite-time analysis of projected Langevin Monte Carlo. In Advances in Neural Information Processing Systems, pages 1243–1251. Curran Associates, Inc., 2015b. [284, 299]
Bubeck, S., Lee, Y. T., and Eldan, R.. Kernel-based methods for bandit convex optimization. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, pages 72–85, New York, NY, USA, 2017. ACM. [315]
Bubeck, S., Cohen, M., and Li, Y.. Sparsity, variance and curvature in multi-armed bandits. In Proceedings of the 29th International Conference on Algorithmic Learning Theory, pages 111–127. JMLR.org, 07–09 Apr 2018. [148, 298, 301]
Burnetas, A. N. and Katehakis, M. N.. Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics, 17(2):122–142, 1996. [100, 119, 181]
Burnetas, A. N. and Katehakis, M. N.. Optimal adaptive policies for Markov decision processes. Mathematics of Operations Research, 22(1):222–255, 1997a. [475]
Burnetas, A. N. and Katehakis, M. N.. On the finite horizon one-armed bandit problem. Stochastic Analysis and Applications, 16(1):845–859, 1997b. [401]
Burnetas, A. N. and Katehakis, M. N.. Asymptotic Bayes analysis for the finite-horizon one-armed-bandit problem. Probability in the Engineering and Informational Sciences, 17(1):53–82, 2003. [401]
Bush, R. R. and Mosteller, F.. A stochastic model with applications to learning. The Annals of Mathematical Statistics, pages 559–585, 1953. [10]
Cappé, O., Garivier, A., Maillard, O., Munos, R., and Stoltz, G.. Kullback–Leibler upper confidence bounds for optimal sequential allocation. The Annals of Statistics, 41(3):1516–1541, 2013. [100, 119, 120, 181]
Carpentier, A. and Locatelli, A.. Tight (lower) bounds for the fixed budget best arm identification bandit problem. In Proceedings of the 29th Conference on Learning Theory, pages 590–604, New York, NY, USA, 2016. JMLR.org. [364]
Carpentier, A. and Munos, R.. Bandit theory meets compressed sensing for high dimensional stochastic linear bandit. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics, pages 190–198, La Palma, Canary Islands, 2012. JMLR.org. [249, 311]
Carpentier, A. and Valko, M.. Extreme bandits. In Advances in Neural Information Processing Systems, pages 1089–1097. Curran Associates, Inc., 2014. [365]
Carpentier, A. and Valko, M.. Simple regret for infinitely many armed bandits. In Proceedings of the 32nd International Conference on Machine Learning, pages 1133–1141, Lille, France, 2015. PMLR. [314]
Catoni, O.. Challenging the empirical mean and empirical variance: a deviation study. Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, 48(4):1148–1185, 2012. [96]
Cesa-Bianchi, N. and Lugosi, G.. Prediction, learning, and games. Cambridge University Press, 2006. [10, 136, 301, 338, 381, 448]
Cesa-Bianchi, N. and Lugosi, G.. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404–1422, 2012. [327]
Cesa-Bianchi, N., Lugosi, G., and Stoltz, G.. Regret minimization under partial monitoring. Mathematics of Operations Research, 31:562–580, 2006. [448]
Cesa-Bianchi, N., Gentile, C., Mansour, Y., and Minora, A.. Delay and cooperation in nonstochastic bandits. In Proceedings of the 29th Conference on Learning Theory, pages 605–622, New York, NY, USA, 2016. JMLR.org. [316]
Cesa-Bianchi, N., Gentile, C., Lugosi, G., and Neu, G.. Boltzmann exploration done right. In Advances in Neural Information Processing Systems, pages 6284–6293. Curran Associates, Inc., 2017. [79]
Chakravorty, J. and Mahajan, A.. Multi-armed bandits, Gittins index, and its calculation. Methods and Applications of Statistics in Clinical Trials: Planning, Analysis, and Inferential Methods, 2:416–435, 2013. [402]
Chakravorty, J. and Mahajan, A.. Multi-armed bandits, Gittins index, and its calculation. Methods and Applications of Statistics in Clinical Trials: Planning, Analysis, and Inferential Methods, 2:416–435, 2014. [402]
Chan, H. P. and Lai, T. L.. Sequential generalized likelihood ratios and adaptive treatment allocation for optimal sequential selection. Sequential Analysis, 25:179–201, 2006. [365]
Chang, J. T. and Pollard, D.. Conditioning as disintegration. Statistica Neerlandica, 51(3):287–317, 1997. [382]
Chapelle, O. and Li, L.. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, pages 2249–2257. Curran Associates, Inc., 2011. [416]
Chaudhuri, S. and Tewari, A.. Phased exploration with greedy exploitation in stochastic combinatorial partial monitoring games. In Advances in Neural Information Processing Systems, pages 2433–2441, 2016. [448]
Chen, C-H., Lin, J., Yücesan, E., and Chick, S. E.. Simulation budget allocation for further enhancing the efficiency of ordinal optimization. Discrete Event Dynamic Systems, 10(3):251–270, 2000. [365]
Chen, S., Lin, T., King, I., Lyu, M. R., and Chen, W.. Combinatorial pure exploration of multi-armed bandits. In Advances in Neural Information Processing Systems, pages 379–387. Curran Associates, Inc., 2014. [364]
Chen, W., Wang, Y., and Yuan, Y.. Combinatorial multi-armed bandit: General framework and applications. In Proceedings of the 30th International Conference on Machine Learning, pages 151–159, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR. [328]
Chen, W., Hu, W., Li, F., Li, J., Liu, Y., and Lu, P.. Combinatorial multi-armed bandit with general reward functions. In Advances in Neural Information Processing Systems, pages 1659–1667. Curran Associates, Inc., 2016a. [328]
Chen, W., Wang, Y., Yuan, Y., and Wang, Q.. Combinatorial multi-armed bandit and its extension to probabilistically triggered arms. Journal of Machine Learning Research, 17(50):1–33, 2016b. URL http://jmlr.org/papers/v17/14-298.html. [328]
Chen, Y., Lee, C-W., Luo, H., and Wei, C-Y.. A new algorithm for non-stationary contextual bandits: Efficient, optimal, and parameter-free. arXiv:1902.00980, 2019. [337, 338]
Chen, Y. R. and Katehakis, M. N.. Linear programming for finite state multi-armed bandit problems. Mathematics of Operations Research, 11(1):180–183, 1986. [402]
Chernoff, H.. Sequential design of experiments. The Annals of Mathematical Statistics, 30(3):755–770, 1959. [11, 364]
Chernoff, H.. A career in statistics. Past, Present, and Future of Statistical Science, page 29, 2014. [119]
Cheung, W., Simchi-Levi, D., and Zhu, R.. Learning to optimize under non-stationarity. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, pages 1079–1087. PMLR, 16–18 Apr 2019. [338]
Chu, W., Li, L., Reyzin, L., and Schapire, R.. Contextual bandits with linear payoff functions. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, pages 208–214, Fort Lauderdale, FL, USA, 2011. JMLR.org. [238]
Chuklin, A., Markov, I., and de Rijke, M.. Click Models for Web Search. Morgan & Claypool Publishers, 2015. [351]
Cicirello, V. A. and Smith, S. F.. The max k-armed bandit: A new model of exploration applied to search heuristic selection. In AAAI, pages 1355–1361, 2005. [365]
Cohen, A. and Hazan, T.. Following the perturbed leader for online structured learning. In Proceedings of the 32nd International Conference on Machine Learning, pages 1034–1042, Lille, France, 07–09 Jul 2015. JMLR.org. [328, 330]
Cohen, A., Hazan, T., and Koren, T.. Tight bounds for bandit combinatorial optimization. In Proceedings of the 2017 Conference on Learning Theory, pages 629–642, Amsterdam, Netherlands, 2017. JMLR.org. [326]
Combes, R., Magureanu, S., Proutiere, A., and Laroche, C.. Learning to rank: Regret lower bounds and efficient algorithms. In Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pages 231–244. ACM, 2015a. [351]
Combes, R., Shahi, M., Proutiere, A., and Lelarge, M.. Combinatorial bandits revisited. In Advances in Neural Information Processing Systems, pages 2116–2124. Curran Associates, Inc., 2015b. [327, 328]
Combes, R., Magureanu, S., and Proutière, A.. Minimal exploration in structured stochastic bandits. In Advances in Neural Information Processing Systems, pages 1761–1769, 2017. [213, 215, 264, 314]
Conn, A. R., Scheinberg, K., and Vicente, L. N.. Introduction to Derivative-Free Optimization. SIAM, 2009. [364]
Cover, T. M.. Universal portfolios. Mathematical Finance, 1(1):1–29, 1991. [284]
Cover, T. M. and Thomas, J. A.. Elements of information theory. John Wiley & Sons, 2012. [167]
Cowan, W. and Katehakis, M. N.. An asymptotically optimal policy for uniform bandits of unknown support. arXiv:1505.01918, 2015. [181]
Cowan, W., Honda, J., and Katehakis, M. N.. Normal bandits of unknown means and variances. Journal of Machine Learning Research, 18(154):1–28, 2018. [181]
Crammer, K. and Gentile, C.. Multiclass classification with bandit feedback using adaptive regularization. Machine Learning, 90(3):347–383, 2013. [249]
Craswell, N., Zoeter, O., Taylor, M., and Ramsey, B.. An experimental comparison of click position-bias models. In Proceedings of the 2008 International Conference on Web Search and Data Mining, pages 87–94. ACM, 2008. [351]
Dani, V. and Hayes, T. P.. Robbing the bandit: Less regret in online geometric optimization against an adaptive adversary. In 17th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 937–943, 2006. [328]
Dani, V., Hayes, T. P., and Kakade, S. M.. Stochastic linear optimization under bandit feedback. In Proceedings of the 21st Conference on Learning Theory, pages 355–366, 2008. [213, 257]
de la Peña, V. H., Lai, T. L., and Shao, Q.. Self-normalized processes: Limit theory and statistical applications. Springer Science & Business Media, 2008. [66, 71, 226]
Degenne, R. and Koolen, W. M.. Pure exploration with multiple correct answers. In Advances in Neural Information Processing Systems, pages 14591–14600. Curran Associates, Inc., 2019. [364]
Degenne, R. and Perchet, V.. Anytime optimal algorithms in stochastic multi-armed bandits. In Proceedings of the 33rd International Conference on Machine Learning, pages 1587–1595, New York, NY, USA, 20–22 Jun 2016. JMLR.org. [108]
Degenne, R., Koolen, W. M., and Ménard, P.. Non-asymptotic pure exploration by solving games. In Advances in Neural Information Processing Systems, pages 14492–14501. Curran Associates, Inc., 2019. [364]
Dekel, O., Gentile, C., and Sridharan, K.. Robust selective sampling from single and multiple teachers. In Proceedings of the 23rd Conference on Learning Theory, pages 346–358, 2010. [249]
Dekel, O., Gentile, C., and Sridharan, K.. Selective sampling and active learning from single and multiple teachers. Journal of Machine Learning Research, 13:2655–2697, 2012. [249]
Dembo, A. and Zeitouni, O.. Large deviations techniques and applications, volume 38. Springer Science & Business Media, 2009. [68]
Denardo, E. V., Park, H., and Rothblum, U. G.. Risk-sensitive and risk-neutral multiarmed bandits. Mathematics of Operations Research, 32(2):374–394, 2007. [56]
Desautels, T., Krause, A., and Burdick, J. W.. Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization. Journal of Machine Learning Research, 15:4053–4103, 2014. [316]
Dobrushin, R. L.. Eine allgemeine Formulierung des Fundamentalsatzes von Shannon in der Informationstheorie. Usp. Mat. Nauk, 14(6(90)):3–104, 1959. [167]
Dong, S. and Van Roy, B.. An information-theoretic analysis for Thompson sampling with many actions. In Advances in Neural Information Processing Systems, Red Hook, NY, USA, 2018. Curran Associates Inc. [415]
Doob, J. L.. Stochastic processes. Wiley, 1953. [43]
Dudík, M., Hsu, D., Kale, S., Karampatziakis, N., Langford, J., Reyzin, L., and Zhang, T.. Efficient optimal learning for contextual bandits. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence, pages 169–178. AUAI Press, 2011. [202]
Dudík, M., Hofmann, K., Schapire, R. E., Slivkins, A., and Zoghi, M.. Contextual dueling bandits. In Proceedings of the 28th Conference on Learning Theory, pages 563–587, Paris, France, 2015. JMLR.org. [315]
Dudley, R. M.. Uniform central limit theorems, volume 142. Cambridge University Press, 2014. [66, 300]
Esseen, C. G.. On the Liapounoff limit of error in the theory of probability. Almqvist & Wiksell, 1942. [64]
Even-Dar, E., Mannor, S., and Mansour, Y.. PAC bounds for multi-armed bandit and Markov decision processes. In Computational Learning Theory, pages 255–270. Springer, 2002. [364, 368]
Even-Dar, E., Kakade, S. M., and Mansour, Y.. Experts in a Markov decision process. In Advances in Neural Information Processing Systems, pages 401–408, Cambridge, MA, USA, 2004. MIT Press. [475]
Even-Dar, E., Mannor, S., and Mansour, Y.. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7:1079–1105, 2006. [364]
Fedorov, V. V.. Theory of optimal experiments. Academic Press, New York, 1972. [235]
Filippi, S., Cappé, O., Garivier, A., and Szepesvári, Cs.. Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems, pages 586–594. Curran Associates, Inc., 2010. [213]
Fink, D.. A compendium of conjugate priors, 1997. [382]
Foster, D. and Rakhlin, A.. No internal regret via neighborhood watch. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics, pages 382–390, La Palma, Canary Islands, 2012. JMLR.org. [448]
Foster, D. J. and Rakhlin, A.. Beyond UCB: Optimal and efficient contextual bandits with regression oracles. arXiv:2002.04926, 2020. [256]
Frank, M. and Wolfe, P.. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1–2):95–110, 1956. [235]
Frederick, S., Loewenstein, G., and O'Donoghue, T.. Time discounting and time preference: A critical review. Journal of Economic Literature, 40(2):351–401, 2002. [400]
Frostig, E. and Weiss, G.. Four proofs of Gittins' multiarmed bandit theorem. Applied Probability Trust, 70, 1999. [401]
Fruit, R., Pirotta, M., and Lazaric, A.. Near optimal exploration-exploitation in non-communicating Markov decision processes. In Advances in Neural Information Processing Systems, pages 2997–3007, 2018. [474, 475, 482, 483]
Gai, Y., Krishnamachari, B., and Jain, R.. Combinatorial network optimization with unknown variables: Multi-armed bandits with linear rewards and individual observations. IEEE/ACM Transactions on Networking, 20(5):1466–1478, 2012. [328]
Gajane, P., Ortner, R., and Auer, P.. A sliding-window algorithm for Markov decision processes with arbitrarily changing rewards and transitions. arXiv:1805.10066, 2018. [338]
Garivier, A.. Informational confidence bounds for self-normalized averages and applications. arXiv:1309.3376, 2013. [93, 108]
Garivier, A. and Cappé, O.. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proceedings of the 24th Conference on Learning Theory, 2011. [118, 119]
Garivier, A. and Kaufmann, E.. Optimal best arm identification with fixed confidence. In Proceedings of the 29th Conference on Learning Theory, pages 998–1027, New York, NY, USA, 2016. JMLR.org. [364]
Garivier, A. and Moulines, E.. On upper-confidence bound policies for switching bandit problems. In Proceedings of the 22nd International Conference on Algorithmic Learning Theory, pages 174–188, Berlin, Heidelberg, 2011. Springer Berlin Heidelberg. [338]
Garivier, A., Kaufmann, E., and Koolen, W. M.. Maximin action identification: A new bandit framework for games. In Proceedings of the 29th Conference on Learning Theory, pages 1028–1050, New York, NY, USA, 2016a. JMLR.org. [364]
Garivier, A., Lattimore, T., and Kaufmann, E.. On explore-then-commit strategies. In Advances in Neural Information Processing Systems, pages 784–792. Curran Associates, Inc., 2016b. [79, 100, 181]
Garivier, A., Ménard, P., and Stoltz, G.. Explore first, exploit next: The true shape of regret in bandit problems. Mathematics of Operations Research, 44(2):377–399, 2019. [181]
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B.. Bayesian data analysis, volume 2. CRC Press, Boca Raton, FL, 2014. [382]
Gentile, C. and Orabona, F.. On multilabel classification and ranking with partial feedback. In Advances in Neural Information Processing Systems, pages 1151–1159. Curran Associates, Inc., 2012. [249]
Gentile, C. and Orabona, F.. On multilabel classification and ranking with bandit feedback. Journal of Machine Learning Research, 15(1):2451–2487, 2014. [249]
Gerchinovitz, S.. Sparsity regret bounds for individual sequences in online linear regression. Journal of Machine Learning Research, 14(Mar):729–769, 2013. [249]
Gerchinovitz, S. and Lattimore, T.. Refined lower bounds for adversarial bandits. In Advances in Neural Information Processing Systems, pages 1198–1206. Curran Associates, Inc., 2016. [174, 190]
Ghosal, S. and van der Vaart, A.. Fundamentals of nonparametric Bayesian inference, volume 44. Cambridge University Press, 2017. [382]
Ghosh, A., Chowdhury, S. R., and Gopalan, A.. Misspecified linear bandits. In 31st AAAI Conference on Artificial Intelligence, 2017. [238]
Gittins, J.. Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society. Series B (Methodological), 41(2):148–177, 1979. [337, 401]
Gittins, J., Glazebrook, K., and Weber, R.. Multi-armed bandit allocation indices. John Wiley & Sons, 2011. [11, 337, 401]
Glowacka, D.. Bandit algorithms in information retrieval. Foundations and Trends® in Information Retrieval, 13:299–424, 2019. [351]
Glynn, P. and Juneja, S.. Ordinal optimization – empirical large deviations rate estimators, and stochastic multi-armed bandits. arXiv:1507.04564, 2015. [365]
Goldsman, D.. Ranking and selection in simulation. In 15th Conference on Winter Simulation, pages 387–394, 1983. [365]
Gopalan, A. and Mannor, S.. Thompson sampling for learning parameterized Markov decision processes. In Proceedings of the 28th Conference on Learning Theory, pages 861–898, Paris, France, 2015. JMLR.org. [417]
Gordon, G. J.. Regret bounds for prediction problems. In Proceedings of the 12th Conference on Learning Theory, pages 29–40, 1999. [301]
Graepel, T., Candela, J. Q., Borchert, T., and Herbrich, R.. Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft's Bing search engine. In Proceedings of the 27th International Conference on Machine Learning, pages 13–20, USA, 2010. Omnipress. [416]
Granmo, O.. Solving two-armed Bernoulli bandit problems using a Bayesian learning automaton. International Journal of Intelligent Computing and Cybernetics, 3(2):207–234, 2010. [416]
Gray, R. M.. Entropy and information theory. Springer Science & Business Media, 2011. [167]
Greenewald, K., Tewari, A., Murphy, S., and Klasnja, P.. Action centered contextual bandits. In Advances in Neural Information Processing Systems, pages 5977–5985. Curran Associates, Inc., 2017. [11]
Grötschel, M., Lovász, L., and Schrijver, A.. Geometric algorithms and combinatorial optimization, volume 2. Springer Science & Business Media, 2012. [235, 327, 474]
Guo, F., Liu, C., and Wang, Y. M.. Efficient multiple-click models in web search. In Proceedings of the 2nd ACM International Conference on Web Search and Data Mining, pages 124–131. ACM, 2009. [351]
György, A. and Szepesvári, Cs.. Shifting regret, mirror descent, and matrices. In Proceedings of the 33rd International Conference on Machine Learning, pages 2943–2951, New York, NY, USA, 20–22 Jun 2016. JMLR.org. [338]
György, A., Linder, T., Lugosi, G., and Ottucsák, G.. The on-line shortest path problem under partial monitoring. Journal of Machine Learning Research, 8(Oct):2369–2403, 2007. [328, 329]
György, A., Pál, D., and Szepesvári, Cs.. Online learning: Algorithms for Big Data. 2019. [338]
Halmos, P. R.. Measure Theory. Graduate Texts in Mathematics. Springer New York, 1976. [42]
Hamidi, N. and Bayati, M.. A general framework to analyze stochastic linear bandit. arXiv:2002.05152, 2020. [415]
Hanawal, M., Saligrama, V., Valko, M., and Munos, R.. Cheap bandits. In Proceedings of the 32nd International Conference on Machine Learning, pages 2133–2142, Lille, France, 07–09 Jul 2015. JMLR.org. [315]
Hannan, J.. Approximation to Bayes risk in repeated play. Contributions to the Theory of Games, 3:97–139, 1957. [125, 301, 328]
Hao, B., Lattimore, T., and Szepesvári, Cs.. Adaptive exploration in linear contextual bandit. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics, 2020. [213, 264]
Hardy, G. H.. Divergent Series. Oxford University Press, 1973. [472]
Hazan, E.. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3–4):157–325, 2016. [300, 301]
Hazan, E. and Kale, S.. A simple multi-armed bandit algorithm with optimal variation-bounded regret. In Proceedings of the 24th Conference on Learning Theory, pages 817–820. JMLR.org, 2011. [148]
Hazan, E., Karnin, Z., and Meka, R.. Volumetric spanners: an efficient exploration basis for learning. Journal of Machine Learning Research, 17(119):1–34, 2016. [235, 284]
Helmbold, D. P., Littlestone, N., and Long, P. M.. Apple tasting. Information and Computation, 161(2):85–139, 2000. [448]
Herbster, M. and Warmuth, M. K.. Tracking the best expert. Machine Learning, 32(2):151–178, 1998. [338]
Herbster, M. and Warmuth, M. K.. Tracking the best linear predictor. Journal of Machine Learning Research, 1(Sep):281–309, 2001. [338]
Ho, Y-C., Sreenivas, R. S., and Vakili, P.. Ordinal optimization of DEDS. Discrete Event Dynamic Systems, 1992. [365]
Honda, J. and Takemura, A.. An asymptotically optimal bandit algorithm for bounded support models. In Proceedings of the 23rd Conference on Learning Theory, pages 67–79, 2010. [100, 109, 119, 181]
Honda, J. and Takemura, A.. An asymptotically optimal policy for finite support models in the multiarmed bandit problem. Machine Learning, 85(3):361–391, 2011. [100]
Honda, J. and Takemura, A.. Optimality of Thompson sampling for Gaussian bandits depends on priors. In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics, pages 375–383, Reykjavik, Iceland, 2014. JMLR.org. [417]
Honda, J. and Takemura, A.. Non-asymptotic analysis of a new bandit algorithm for semi-bounded rewards. Journal of Machine Learning Research, 16:3721–3756, 2015. [119, 181]
Hu, X., Prashanth, L. A., György, A., and Szepesvári, Cs.. (Bandit) convex optimization with biased noisy gradient oracles. In AISTATS, pages 819–828, 2016. [315, 364]
Huang, R., Ajallooeian, M. M., Szepesvári, Cs., and Müller, M.. Structured best arm identification with fixed confidence. In Proceedings of the 28th International Conference on Algorithmic Learning Theory, pages 593–616, Kyoto, Japan, 2017a. JMLR.org. [364]
Huang, R., Lattimore, T., György, A., and Szepesvári, Cs.. Following the leader and fast rates in online linear prediction: Curved constraint sets and other regularities. Journal of Machine Learning Research, 18:1–31, 2017b. [300]
Huang, W., Ok, J., Li, L., and Chen, W.. Combinatorial pure exploration with continuous and separable reward functions and its applications. In IJCAI, pages 2291–2297, 2018. [364]
Hutter, M.. Universal artificial intelligence: Sequential decisions based on algorithmic probability. Springer Science & Business Media, 2004. [381]
Hutter, M. and Poland, J.. Adaptive online prediction by following the perturbed leader. Journal of Machine Learning Research, 6:639–660, 2005. [328]
Ionides, E. L.. Truncated importance sampling. Journal of Computational and Graphical Statistics, 17(2):295–311, 2008. [149]
Ivanenko, V. I. and Labkovsky, V. A.. On regularities of mass random phenomena. arXiv:1204.4440, 2013. [125]
Jaksch, T., Auer, P., and Ortner, R.. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 99:1563–1600, 2010. [474, 476]
Jamieson, K. and Nowak, R.. Best-arm identification algorithms for multi-armed bandits in the fixed confidence setting. In Information Sciences and Systems (CISS), 2014 48th Annual Conference on, pages 1–6. IEEE, 2014. [364]
Jamieson, K. and Talwalkar, A.. Non-stochastic best arm identification and hyperparameter optimization. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pages 240–248, 2016. [365]
Jamieson, K., Katariya, S., Deshpande, A., and Nowak, R.. Sparse dueling bandits. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, pages 416–424, San Diego, CA, USA, 2015. JMLR.org. [315]
Jaynes, E. T.. Probability theory: the logic of science. Cambridge University Press, 2003. [381, 382]
Jefferson, A., Bortolotti, L., and Kuzmanovic, B.. What is unrealistic optimism? Consciousness and Cognition, 50:3–11, 2017. [92]
Joulani, P., György, A., and Szepesvári, Cs.. Online learning under delayed feedback. In Proceedings of the 30th International Conference on Machine Learning, pages 1453–1461, Atlanta, GA, USA, 2013. JMLR.org. [316]
Joulani, P., György, A., and Szepesvári, Cs.. A modular analysis of adaptive (non-)convex optimization: Optimism, composite objectives, and variational bounds. In Proceedings of the 28th International Conference on Algorithmic Learning Theory, pages 681–720, Kyoto University, Kyoto, Japan, 2017. JMLR.org. [298]
Jun, K., Bhargava, A., Nowak, R., and Willett, R.. Scalable generalized linear bandits: Online computation and hashing. In Advances in Neural Information Processing Systems, pages 99–109. Curran Associates, Inc., 2017. [213]
Kaelbling, L. P.. Learning in embedded systems. MIT Press, 1993. [92]
Kahneman, D. and Tversky, A.. Prospect theory: An analysis of decision under risk. Econometrica, 47(2):263–291, 1979. [54]
Kakade, S.. On The Sample Complexity Of Reinforcement Learning. PhD thesis, University College London, 2003. [475]
Kakade, S. M., Shalev-Shwartz, S., and Tewari, A.. Efficient bandit algorithms for online multiclass prediction. In Proceedings of the 25th International Conference on Machine Learning, pages 440–447, 2008. [202]
Kalai, A. and Vempala, S.. Geometric algorithms for online optimization. Technical Report MIT-LCS-TR-861, MIT, 2002. [301, 328]
Kalai, A. and Vempala, S.. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005. [328]
Kallenberg, L.. A note on M. N. Katehakis' and Y.-R. Chen's computation of the Gittins index. Mathematics of Operations Research, 11(1):184–186, 1986. [402]
Kallenberg, L.. Markov decision processes: Lecture notes, 2016. [474]
Kallenberg, O.. Foundations of modern probability. Springer-Verlag, 2002. [32, 33, 41, 42, 43, 168, 228, 383]
Karnin, Z., Koren, T., and Somekh, O.. Almost optimal exploration in multi-armed bandits. In Proceedings of the 30th International Conference on Machine Learning, pages 1238–1246, Atlanta, GA, USA, 2013. JMLR.org. [364]
El Karoui, N. and Karatzas, I.. Dynamic allocation problems in continuous time. The Annals of Applied Probability, pages 255–286, 1994. [402]
Katariya, S., Kveton, B., Szepesvári, Cs., and Wen, Z.. DCM bandits: Learning to rank with multiple clicks. In Proceedings of the 33rd International Conference on Machine Learning, pages 1215–1224, 2016. [351]
Katariya, S., Kveton, B., Szepesvári, Cs., Vernade, C., and Wen, Z.. Bernoulli rank-1 bandits for click feedback. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, 2017a. [351]
Katariya, S., Kveton, B., Szepesvári, Cs., Vernade, C., and Wen, Z.. Stochastic rank-1 bandits. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017b. [351]
Katehakis, M. N. and Robbins, H.. Sequential choice from several populations. Proceedings of the National Academy of Sciences of the United States of America, 92(19):8584, 1995. [92, 100]
Katkovnik, V. Ya. and Kulchitsky, Yu.. Convergence of a class of random search algorithms. Automation Remote Control, 8:1321–1326, 1972. [364]
Kaufmann, E.. On Bayesian index policies for sequential resource allocation. The Annals of Statistics, 46(2):842865, 04 2018. [100, 109, 415, 417]Google Scholar
Kaufmann, E., Cappé, O., and Garivier, A.. On Bayesian upper confidence bounds for bandit problems. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics, pages 592–600, La Palma, Canary Islands, 2012a. JMLR.org. [415, 417]Google Scholar
Kaufmann, E., Korda, N., and Munos, R.. Thompson sampling: An asymptotically optimal finite-time analysis. In Proceedings of the 23rd International Conference on Algorithmic Learning Theory, volume 7568 of Lecture Notes in Computer Science, pages 199–213. Springer Berlin Heidelberg, 2012b. ISBN 978-3-642-34105-2. [100, 415, 417]Google Scholar
Kawale, J., Bui, H. H., Kveton, B., Tran-Thanh, L., and Chawla, S.. Efficient Thompson sampling for online matrix-factorization recommendation. In Advances in Neural Information Processing Systems, pages 12971305. Curran Associates, Inc., 2015. [417]Google Scholar
Kazerouni, A., Ghavamzadeh, M., Abbasi, Y., and Van Roy, B.. Conservative contextual linear bandits. In Advances in Neural Information Processing Systems, pages 39103919. Curran Associates, Inc., 2017. [315]Google Scholar
Kearns, M. and Saul, L.. Large deviation methods for approximate probabilistic inference. In Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence, page 319. Morgan Kaufmann Publishers Inc., 1998. [69]Google Scholar
Kearns, M. and Singh, S.. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209232, 2002. [475]Google Scholar
Kearns, M. J. and Vazirani, U. V.. An introduction to computational learning theory. MIT Press, 1994. [202]Google Scholar
Kiefer, J. and Wolfowitz, J.. The equivalence of two extremum problems. Canadian Journal of Mathematics, 12(5):363365, 1960. [235]Google Scholar
Kim, G-S. and Paik, M. C.. Doubly-robust lasso bandit. In Advances in Neural Information Processing Systems, pages 58775887. Curran Associates, Inc., 2019. [249]Google Scholar
Kim, M. J.. Thompson sampling for stochastic control: The finite parameter case. IEEE Transactions on Automatic Control, 62(12):64156422, 2017. [417]Google Scholar
Kirschner, J. and Krause, A.. Information directed sampling and bandits with heteroscedastic noise. In Proceedings of the 31st Conference On Learning Theory, pages 358–384. PMLR, 06–09 Jul 2018. [72, 213]Google Scholar
Kirschner, J., Lattimore, T., and Krause, A.. Information directed sampling for linear partial monitoring. arXiv preprint arXiv:2002.11182, 2020. [448]Google Scholar
Kleinberg, R.. Nearly tight bounds for the continuum-armed bandit problem. In Advances in Neural Information Processing Systems, pages 697–704. MIT Press, 2005. [314]Google Scholar
Kleinberg, R., Slivkins, A., and Upfal, E.. Multi-armed bandits in metric spaces. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, pages 681–690. ACM, 2008. [314]Google Scholar
Kocák, T., Neu, G., Valko, M., and Munos, R.. Efficient learning by implicit exploration in bandit problems with side observations. In Advances in Neural Information Processing Systems, pages 613–621. Curran Associates, Inc., 2014. [148, 149, 316]Google Scholar
Kocák, T., Valko, M., Munos, R., and Agrawal, S.. Spectral Thompson sampling. In AAAI, pages 1911–1917, 2014. [417]Google Scholar
Kocsis, L. and Szepesvári, Cs.. Discounted UCB. In 2nd PASCAL Challenges Workshop, pages 784–791, 2006. [11, 338]Google Scholar
Komiya, H.. Elementary proof for Sion’s minimax theorem. Kodai Mathematical Journal, 11(1):5–7, 1988. [301]Google Scholar
Komiyama, J., Honda, J., Kashima, H., and Nakagawa, H.. Regret lower bound and optimal algorithm in dueling bandit problem. In Proceedings of the 28th Conference on Learning Theory, pages 1141–1154, Paris, France, 2015a. JMLR.org. [315]Google Scholar
Komiyama, J., Honda, J., and Nakagawa, H.. Regret lower bound and optimal algorithm in finite stochastic partial monitoring. In Advances in Neural Information Processing Systems, pages 1792–1800. Curran Associates, Inc., 2015b. [448]Google Scholar
Koolen, W. M., Warmuth, M. K., and Kivinen, J.. Hedging structured concepts. In Proceedings of the 23rd Conference on Learning Theory, pages 93–105. Omnipress, 2010. [328]Google Scholar
Korda, N., Kaufmann, E., and Munos, R.. Thompson sampling for 1-dimensional exponential family bandits. In Advances in Neural Information Processing Systems, pages 1448–1456. Curran Associates, Inc., 2013. [100, 120, 415, 417]Google Scholar
Kujala, J. and Elomaa, T.. On following the perturbed leader in the bandit setting. In Proceedings of the 16th International Conference on Algorithmic Learning Theory, pages 371–385, 2005. [328]Google Scholar
Kujala, J. and Elomaa, T.. Following the perturbed leader to gamble at multi-armed bandits. In Proceedings of the 18th International Conference on Algorithmic Learning Theory, pages 166–180. Springer, 2007. [328]Google Scholar
Kulkarni, S. R. and Lugosi, G.. Finite-time lower bounds for the two-armed bandit problem. IEEE Transactions on Automatic Control, 45(4):711–714, 2000. [181]Google Scholar
Kveton, B., Szepesvári, Cs., Wen, Z., and Ashkan, A.. Cascading bandits: Learning to rank in the cascade model. In Proceedings of the 32nd International Conference on Machine Learning, pages 767–776. JMLR.org, 2015a. [351]Google Scholar
Kveton, B., Wen, Z., Ashkan, A., and Szepesvári, Cs.. Tight regret bounds for stochastic combinatorial semi-bandits. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, pages 535–543, San Diego, CA, USA, 2015b. JMLR.org. [328]Google Scholar
Kveton, B., Wen, Z., Ashkan, A., and Szepesvári, Cs.. Combinatorial cascading bandits. In Advances in Neural Information Processing Systems, pages 1450–1458. Curran Associates Inc., 2015c. [351]Google Scholar
Kveton, B., Szepesvári, Cs., Vaswani, S., Wen, Z., Lattimore, T., and Ghavamzadeh, M.. Garbage in, reward out: Bootstrapping exploration in multi-armed bandits. In Proceedings of the 36th International Conference on Machine Learning, pages 3601–3610, Long Beach, California, USA, 09–15 Jun 2019. PMLR. [417]Google Scholar
Lagrée, P., Vernade, C., and Cappé, O.. Multiple-play bandits in the position-based model. In Advances in Neural Information Processing Systems, pages 1597–1605. Curran Associates Inc., 2016. [351]Google Scholar
Lai, T. L.. Adaptive treatment allocation and the multi-armed bandit problem. The Annals of Statistics, pages 1091–1114, 1987. [92, 100, 109, 119, 401]Google Scholar
Lai, T. L.. Martingales in sequential analysis and time series, 1945–1985. Electronic Journal for History of Probability and Statistics, 5(1), 2009. [226]Google Scholar
Lai, T. L. and Graves, T.. Asymptotically efficient adaptive choice of control laws in controlled Markov chains. SIAM Journal on Control and Optimization, 35(3):715–743, 1997. [475]Google Scholar
Lai, T. L. and Robbins, H.. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985. [56, 92, 100, 119, 181, 230]Google Scholar
Langford, J. and Zhang, T.. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems, pages 817–824. Curran Associates, Inc., 2008. [204]Google Scholar
Laplace, P.. Pierre-Simon Laplace Philosophical Essay on Probabilities: Translated from the fifth French edition of 1825 With Notes by the Translator, volume 13. Springer Science & Business Media, 2012. [33]Google Scholar
Lattimore, T.. The Pareto regret frontier for bandits. In Advances in Neural Information Processing Systems, pages 208–216. Curran Associates, Inc., 2015a. [136, 257]Google Scholar
Lattimore, T.. Optimally confident UCB: Improved regret for finite-armed bandits. arXiv:1507.07880, 2015b. [108]Google Scholar
Lattimore, T.. Regret analysis of the finite-horizon Gittins index strategy for multi-armed bandits. In Proceedings of the 29th Annual Conference on Learning Theory, pages 1214–1245, New York, NY, USA, 2016a. JMLR.org. [100, 401]Google Scholar
Lattimore, T.. Regret analysis of the anytime optimally confident UCB algorithm. arXiv:1603.08661, 2016b. [108]Google Scholar
Lattimore, T.. Regret analysis of the finite-horizon Gittins index strategy for multi-armed bandits. In Proceedings of the 29th Conference on Learning Theory, pages 1214–1245, 2016c. [400]Google Scholar
Lattimore, T.. A scale free algorithm for stochastic bandits with bounded kurtosis. In Advances in Neural Information Processing Systems, pages 1584–1593. Curran Associates, Inc., 2017. [96, 181]Google Scholar
Lattimore, T.. Refining the confidence level for optimistic bandit strategies. Journal of Machine Learning Research, 2018. [82, 108, 110, 181]Google Scholar
Lattimore, T. and Hutter, M.. PAC bounds for discounted MDPs. In Proceedings of the 23rd International Conference on Algorithmic Learning Theory, volume 7568 of Lecture Notes in Computer Science, pages 320–334. Springer Berlin / Heidelberg, 2012. [475]Google Scholar
Lattimore, T. and Munos, R.. Bounded regret for finite-armed structured bandits. In Advances in Neural Information Processing Systems, pages 550–558. Curran Associates, Inc., 2014. [214]Google Scholar
Lattimore, T. and Szepesvári, Cs.. The end of optimism? An asymptotic analysis of finite-armed linear bandits. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pages 728–737, Fort Lauderdale, FL, USA, 2017. JMLR.org. [213, 264]Google Scholar
Lattimore, T. and Szepesvári, Cs.. Cleaning up the neighbourhood: A full classification for adversarial partial monitoring. In Proceedings of the 30th International Conference on Algorithmic Learning Theory, 2019a. [447, 448, 450]Google Scholar
Lattimore, T. and Szepesvári, Cs.. Learning with good feature representations in bandits and in RL with a generative model. arXiv:1911.07676, 2019b. [238, 256]Google Scholar
Lattimore, T. and Szepesvári, Cs.. An information-theoretic approach to minimax regret in partial monitoring. In Proceedings of the 32nd Conference on Learning Theory, pages 2111–2139, Phoenix, USA, 2019c. PMLR. [416, 417, 419, 448]Google Scholar
Lattimore, T. and Szepesvári, Cs.. Exploration by optimisation in partial monitoring. arXiv:1907.05772, 2019d. [446, 448, 451]Google Scholar
Lattimore, T., Crammer, K., and Szepesvári, Cs.. Linear multi-resource allocation with semi-bandit feedback. In Advances in Neural Information Processing Systems, pages 964–972. Curran Associates, Inc., 2015. [249]Google Scholar
Lattimore, T., Kveton, B., Li, S., and Szepesvári, Cs.. TopRank: A practical algorithm for online stochastic ranking. In Advances in Neural Information Processing Systems, pages 3949–3958. Curran Associates, Inc., 2018. [230, 330, 351]Google Scholar
Laurent, B. and Massart, P.. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pages 1302–1338, 2000. [66]Google Scholar
Lazaric, A. and Munos, R.. Hybrid stochastic-adversarial on-line learning. In Proceedings of the 22nd Conference on Learning Theory, 2009. [202]Google Scholar
Le, T., Szepesvári, Cs., and Zheng, R.. Sequential learning for multi-channel wireless network monitoring with channel switching costs. IEEE Transactions on Signal Processing, 62(22):5919–5929, 2014. [11]Google Scholar
Le Cam, L.. Convergence of estimates under dimensionality restrictions. The Annals of Statistics, 1(1):38–53, 1973. [174]Google Scholar
Lee, Y. T., Sidford, A., and Vempala, S. S.. Efficient convex optimization with membership oracles. In Proceedings of the 31st Conference On Learning Theory, pages 1292–1294. JMLR.org, 06–09 Jul 2018. [327]Google Scholar
Lehmann, E. L. and Casella, G.. Theory of point estimation. Springer Science & Business Media, 2006. [382]Google Scholar
Lei, H., Tewari, A., and Murphy, S. A.. An actor-critic contextual bandit algorithm for personalized mobile health interventions. arXiv:1706.09090, 2017. [11]Google Scholar
Leike, J., Lattimore, T., Orseau, L., and Hutter, M.. Thompson sampling is asymptotically optimal in general environments. In Proceedings of the 32nd Conference on Uncertainty in Artificial Intelligence, pages 417–426. AUAI Press, 2016. [417]Google Scholar
Lerche, H. R.. Boundary crossing of Brownian motion: Its relation to the law of the iterated logarithm and to sequential analysis. Springer, 1986. [110]Google Scholar
Levin, D. A. and Peres, Y.. Markov chains and mixing times, volume 107. American Mathematical Soc., 2017. [42]Google Scholar
Levin, L. A.. On the notion of a random sequence. Soviet Mathematics Doklady, 14(5):1413–1416, 1973. [125]Google Scholar
Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., and Talwalkar, A.. Hyperband: A novel bandit-based approach to hyperparameter optimization. Journal of Machine Learning Research, 18(185):1–52, 2018. [365]Google Scholar
Li, S., Wang, B., Zhang, S., and Chen, W.. Contextual combinatorial cascading bandits. In Proceedings of the 33rd International Conference on Machine Learning, pages 1245–1253, 2016. [351]Google Scholar
Li, S., Lattimore, T., and Szepesvári, Cs.. Online learning to rank with features. In Proceedings of the 36th International Conference on Machine Learning, pages 3856–3865, Long Beach, California, USA, 09–15 Jun 2019a. PMLR. [350]Google Scholar
Li, Y., Wang, Y., and Zhou, Y.. Nearly minimax-optimal regret for linearly parameterized bandits. In Proceedings of the 32nd Conference on Learning Theory, pages 2173–2174, Phoenix, USA, 2019b. JMLR.org. [212]Google Scholar
Liang, T., Narayanan, H., and Rakhlin, A.. On zeroth-order stochastic convex optimization via random walks. arXiv:1402.2667, 2014. [364]Google Scholar
Lin, T., Abrahao, B., Kleinberg, R., Lui, J., and Chen, W.. Combinatorial partial monitoring game with linear feedback and its applications. In Proceedings of the 31st International Conference on Machine Learning, pages 901–909, Beijing, China, 22–24 Jun 2014. PMLR. [448]Google Scholar
Lin, T., Li, J., and Chen, W.. Stochastic online greedy learning with semi-bandit feedbacks. In Advances in Neural Information Processing Systems, pages 352–360. Curran Associates, Inc., 2015. [328]Google Scholar
Littlestone, N. and Warmuth, M. K.. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994. [125, 137]Google Scholar
Lovász, L. and Vempala, S.. The geometry of logconcave functions and sampling algorithms. Random Structures & Algorithms, 30(3):307–358, 2007. [284]Google Scholar
Luo, H., Wei, C-Y., Agarwal, A., and Langford, J.. Efficient contextual bandits in non-stationary worlds. In Proceedings of the 31st Conference On Learning Theory, pages 1739–1776. JMLR.org, 06–09 Jul 2018. [338]Google Scholar
MacKay, D.. Information theory, inference and learning algorithms. Cambridge University Press, 2003. [167]Google Scholar
Magureanu, S., Combes, R., and Proutière, A.. Lipschitz bandits: Regret lower bound and optimal algorithms. In Proceedings of the 27th Conference on Learning Theory, pages 975–999, 2014. [215, 314]Google Scholar
Maillard, O.. Robust risk-averse stochastic multi-armed bandits. In Proceedings of the 24th International Conference on Algorithmic Learning Theory, pages 218–233. Springer, Berlin, Heidelberg, 2013. [56]Google Scholar
Maillard, O., Munos, R., and Stoltz, G.. Finite-time analysis of multi-armed bandits problems with Kullback-Leibler divergences. In Proceedings of the 24th Conference on Learning Theory, 2011. [119]Google Scholar
Mannor, S. and Shamir, O.. From bandits to experts: On the value of side-observations. In Advances in Neural Information Processing Systems, pages 684–692. Curran Associates, Inc., 2011. [316, 448]Google Scholar
Mannor, S. and Shimkin, N.. On-line learning with imperfect monitoring. In Learning Theory and Kernel Machines, pages 552–566. Springer, 2003. [448]Google Scholar
Mannor, S. and Tsitsiklis, J. N.. The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research, 5:623–648, December 2004. [364]Google Scholar
Mannor, S., Perchet, V., and Stoltz, G.. Set-valued approachability and online learning with partial monitoring. The Journal of Machine Learning Research, 15(1):3247–3295, 2014. [448]Google Scholar
Markowitz, H.. Portfolio selection. The Journal of Finance, 7(1):77–91, 1952. [55]Google Scholar
Maron, M. E. and Kuhns, J. L.. On relevance, probabilistic indexing and information retrieval. Journal of the ACM, 7(3):216–244, 1960. [352]Google Scholar
Martin-Löf, P.. The definition of random sequences. Information and Control, 9(6):602–619, 1966. [125]Google Scholar
Maurer, A. and Pontil, M.. Empirical Bernstein bounds and sample variance penalization. arXiv:0907.3740, 2009. [70, 95]Google Scholar
May, B. C., Korda, N., Lee, A., and Leslie, D. S.. Optimistic Bayesian sampling in contextual-bandit problems. The Journal of Machine Learning Research, 13(1):2069–2106, 2012. [416]Google Scholar
McDiarmid, C.. Concentration. In Probabilistic methods for algorithmic discrete mathematics, pages 195–248. Springer, 1998. [66, 71, 228]Google Scholar
McMahan, H. B. and Blum, A.. Online geometric optimization in the bandit setting against an adaptive adversary. In Proceedings of the 17th Conference on Learning Theory, volume 3120, pages 109–123. Springer, 2004. [328]Google Scholar
McMahan, H. B. and Streeter, M. J.. Tighter bounds for multi-armed bandits with expert advice. In Proceedings of the 22nd Conference on Learning Theory, 2009. [201]Google Scholar
Ménard, P. and Garivier, A.. A minimax and asymptotically optimal algorithm for stochastic bandits. In Proceedings of the 28th International Conference on Algorithmic Learning Theory, pages 223–237, Kyoto University, Kyoto, Japan, 15–17 Oct 2017. JMLR.org. [100, 108, 119]Google Scholar
Meyn, S. P. and Tweedie, R. L.. Markov chains and stochastic stability. Springer Science & Business Media, 2012. [41, 42]Google Scholar
Mnih, V., Szepesvári, Cs., and Audibert, J.-Y.. Empirical Bernstein stopping. In Proceedings of the 25th International Conference on Machine Learning, pages 672–679, New York, NY, USA, 2008. ACM. [70, 95]Google Scholar
Mukherjee, S., Naveen, KP., Sudarsanam, N., and Ravindran, B.. Efficient-UCBV: An almost optimal algorithm using variance estimates. In 32nd AAAI Conference on Artificial Intelligence, 2018. [108]Google Scholar
Nelder, J. A. and Wedderburn, R. W. M.. Generalized linear models. Journal of the Royal Statistical Society. Series A (General), 135(3):370–384, 1972. [213]Google Scholar
Nemirovsky, A. S.. Efficient methods for large-scale convex optimization problems. Ekonomika i Matematicheskie Metody, 15, 1979. [301]Google Scholar
Nemirovsky, A. S. and Yudin, D. B.. Problem Complexity and Method Efficiency in Optimization. Wiley, 1983. [301, 364, 365]Google Scholar
Neu, G.. Explore no more: Improved high-probability regret bounds for non-stochastic bandits. In Advances in Neural Information Processing Systems, pages 3168–3176. Curran Associates, Inc., 2015a. [148, 149, 201, 328]Google Scholar
Neu, G.. First-order regret bounds for combinatorial semi-bandits. In Proceedings of the 28th Conference on Learning Theory, pages 1360–1375, Paris, France, 2015b. JMLR.org. [148, 299]Google Scholar
Neu, G., György, A., Szepesvári, Cs., and Antos, A.. Online Markov decision processes under bandit feedback. IEEE Transactions on Automatic Control, 59(3):676–691, December 2014. [475]Google Scholar
Von Neumann, J. and Morgenstern, O.. Theory of Games and Economic Behavior. Princeton University Press, Princeton, 1944. [55]Google Scholar
Niño-Mora, J.. Computing a classic index for finite-horizon bandits. INFORMS Journal on Computing, 23(2):254–267, 2011. [402]Google Scholar
O’Donoghue, B., Chu, E., Parikh, N., and Boyd, S.. Conic optimization via operator splitting and homogeneous self-dual embedding. Journal of Optimization Theory and Applications, 169(3):1042–1068, 2016. [450]Google Scholar
O’Donoghue, B., Chu, E., Parikh, N., and Boyd, S.. SCS: Splitting conic solver, version 2.1.1. https://github.com/cvxgrp/scs, November 2017. [450]Google Scholar
Ok, J., Proutiere, A., and Tranos, D.. Exploration in structured reinforcement learning. In Advances in Neural Information Processing Systems, Red Hook, NY, USA, 2018. Curran Associates Inc. [213, 264]Google Scholar
Ortega, P. A. and Braun, D. A.. A minimum relative entropy principle for learning and acting. Journal of Artificial Intelligence Research, pages 475–511, 2010. [416]Google Scholar
Ortner, R. and Ryabko, D.. Online regret bounds for undiscounted continuous reinforcement learning. In Advances in Neural Information Processing Systems, pages 1763–1771, USA, 2012. Curran Associates Inc. [475]Google Scholar
Ortner, R., Ryabko, D., Auer, P., and Munos, R.. Regret bounds for restless Markov bandits. In Proceedings of the 23rd International Conference on Algorithmic Learning Theory, pages 214–228, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg. [337]Google Scholar
Osband, I. and Van Roy, B.. Why is posterior sampling better than optimism for reinforcement learning? In Proceedings of the 34th International Conference on Machine Learning, pages 2701–2710, Sydney, Australia, 06–11 Aug 2017. JMLR.org. [475]Google Scholar
Osband, I., Russo, D., and Van Roy, B.. (More) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pages 3003–3011. Curran Associates, Inc., 2013. [417, 475]Google Scholar
Ostrovsky, E. and Sirota, L.. Exact value for subgaussian norm of centered indicator random variable. arXiv:1405.6749, 2014. [69]Google Scholar
Pandelis, D. G. and Teneketzis, D.. On the optimality of the Gittins index rule for multi-armed bandits with multiple plays. Mathematical Methods of Operations Research, 50(3):449–461, 1999. [400]Google Scholar
Papadimitriou, C. H. and Tsitsiklis, J. N.. The complexity of Markov decision processes. Mathematics of Operations Research, 12(3):441–450, 1987. [471]Google Scholar
Papadimitriou, C. H. and Vempala, S.. On the approximability of the traveling salesman problem. Combinatorica, 26(1):101–120, 2006. [328]Google Scholar
Perchet, V.. Approachability of convex sets in games with partial monitoring. Journal of Optimization Theory and Applications, 149(3):665–677, 2011. [448]Google Scholar
Perchet, V. and Rigollet, P.. The multi-armed bandit problem with covariates. The Annals of Statistics, 41(2):693–721, 04 2013. [215]Google Scholar
Peskir, G. and Shiryaev, A.. Optimal stopping and free-boundary problems. Springer, 2006. [43, 401, 402]Google Scholar
Piccolboni, A. and Schindelhauer, C.. Discrete prediction games with arbitrary feedback and loss. In Computational Learning Theory, pages 208–223. Springer, 2001. [448]Google Scholar
Pike-Burke, C., Agrawal, S., Szepesvári, Cs., and Grünewälder, S.. Bandits with delayed, aggregated anonymous feedback. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 4102–4110. JMLR.org, 10–15 Jul 2018. [316]Google Scholar
Poland, J.. FPL analysis for adaptive bandits. In Lupanov, O. B., Kasim-Zade, O. M., Chaskin, A. V., and Steinhöfel, K., editors, Stochastic Algorithms: Foundations and Applications, pages 58–69, Berlin, Heidelberg, 2005. Springer Berlin Heidelberg. [328]Google Scholar
Pollard, D.. A user’s guide to measure theoretic probability, volume 8. Cambridge University Press, 2002. [32]Google Scholar
Presman, E. L. and Sonin, I. N.. Sequential control with incomplete information. The Bayesian approach to multi-armed bandit problems. Academic Press, 1990. [11, 401]Google Scholar
Puterman, M.. Markov decision processes: discrete stochastic dynamic programming, volume 414. Wiley, 2009. [473, 474, 476]Google Scholar
Qin, C., Klabjan, D., and Russo, D.. Improving the expected improvement algorithm. In Advances in Neural Information Processing Systems, pages 5381–5391. Curran Associates, Inc., 2017. [364]Google Scholar
Radlinski, F., Kleinberg, R., and Joachims, T.. Learning diverse rankings with multi-armed bandits. In Proceedings of the 25th International Conference on Machine Learning, pages 784–791. ACM, 2008. [350, 351, 352]Google Scholar
Rafferty, A. N., Ying, H., and Williams, J. J.. Bandit assignment for educational experiments: Benefits to students versus statistical power. In Artificial Intelligence in Education, pages 286–290. Springer, 2018. [11]Google Scholar
Rakhlin, A. and Sridharan, K.. BISTRO: An efficient relaxation-based method for contextual bandits. In Proceedings of the 33rd International Conference on Machine Learning, pages 1977–1985, 2016. [202]Google Scholar
Rakhlin, A. and Sridharan, K.. On equivalence of martingale tail bounds and deterministic regret inequalities. In Proceedings of the 30th Conference on Learning Theory, pages 1704–1722, Amsterdam, Netherlands, 2017. JMLR.org. [249]Google Scholar
Rakhlin, A., Shamir, O., and Sridharan, K.. Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Conference on Machine Learning, 2012. [365]Google Scholar
Rios, L. M. and Sahinidis, N. V.. Derivative-free optimization: a review of algorithms and comparison of software implementations. Journal of Global Optimization, 56(3):1247–1293, Jul 2013. [364]Google Scholar
Robbins, H.. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535, 1952. [10, 56, 78, 79]Google Scholar
Robbins, H. and Siegmund, D.. Boundary crossing probabilities for the Wiener process and sample sums. The Annals of Mathematical Statistics, pages 1410–1429, 1970. [226]Google Scholar
Robbins, H. and Siegmund, D.. A class of stopping rules for testing parametric hypotheses. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, pages 37–41. University of California Press, 1972. [230]Google Scholar
Robbins, H., Siegmund, D., and Chow, Y.. Great expectations: the theory of optimal stopping. Houghton Mifflin, 7:631–640, 1971. [401]Google Scholar
Robertson, S.. The probability ranking principle in IR. Journal of Documentation, 33(4):294–304, 1977. [352]Google Scholar
Rockafellar, R. T.. Convex analysis. Princeton University Press, 2015. [275, 329]Google Scholar
Rockafellar, R. T. and Uryasev, S.. Optimization of conditional value-at-risk. Journal of Risk, 2:21–42, 2000. [55]Google Scholar
Rogers, C. A.. Packing and covering. Cambridge University Press, 1964. [226]Google Scholar
Ross, S. M.. Introduction to Stochastic Dynamic Programming. Academic Press, New York, 1983. [474]Google Scholar
Rusmevichientong, P. and Tsitsiklis, J. N.. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010. [79, 213, 257]Google Scholar
Russo, D.. Simple Bayesian algorithms for best arm identification. In Proceedings of the 29th Annual Conference on Learning Theory, pages 1417–1418, New York, NY, USA, 2016. JMLR.org. [364]Google Scholar
Russo, D. and Van Roy, B.. Eluder dimension and the sample complexity of optimistic exploration. In Advances in Neural Information Processing Systems, pages 2256–2264. Curran Associates, Inc., 2013. [214]Google Scholar
Russo, D. and Van Roy, B.. Learning to optimize via information-directed sampling. In Advances in Neural Information Processing Systems, pages 1583–1591. Curran Associates, Inc., 2014a. [213, 416, 417]Google Scholar
Russo, D. and Van Roy, B.. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014b. [417]Google Scholar
Russo, D. and Van Roy, B.. An information-theoretic analysis of Thompson sampling. Journal of Machine Learning Research, 17(1):2442–2471, 2016. ISSN 1532-4435. [328, 415, 417]Google Scholar
Russo, D. J., Van Roy, B., Kazerouni, A., Osband, I., and Wen, Z.. A tutorial on Thompson sampling. Foundations and Trends in Machine Learning, 11(1):1–96, 2018. [417]Google Scholar
Rustichini, A.. Minimizing regret: The general case. Games and Economic Behavior, 29(1):224–243, 1999. [447, 448]Google Scholar
Salomon, A., Audibert, J., and Alaoui, I.. Lower bounds and selectivity of weak-consistent policies in stochastic multi-armed bandit problem. Journal of Machine Learning Research, 14(Jan):187–207, 2013. [181]Google Scholar
Samuelson, P.. A note on measurement of utility. The Review of Economic Studies, 4(2):155–161, 1937. [400]Google Scholar
Sani, A., Lazaric, A., and Munos, R.. Risk-aversion in multi-armed bandits. In Advances in Neural Information Processing Systems, pages 3275–3283. Curran Associates, Inc., 2012. [56]Google Scholar
Seldin, Y. and Lugosi, G.. An improved parametrization and analysis of the EXP3++ algorithm for stochastic and adversarial bandits. In Proceedings of the 2017 Conference on Learning Theory, pages 1743–1759, Amsterdam, Netherlands, 2017. JMLR.org. [136]Google Scholar
Seldin, Y. and Slivkins, A.. One practical algorithm for both stochastic and adversarial bandits. In Proceedings of the 31st International Conference on Machine Learning, pages 1287–1295, Beijing, China, 2014. JMLR.org. [136]Google Scholar
Shalev-Shwartz, S.. Online learning: Theory, algorithms, and applications. PhD thesis, The Hebrew University of Jerusalem, 2007. [301]Google Scholar
Shalev-Shwartz, S.. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012. [300, 301]Google Scholar
Shalev-Shwartz, S. and Ben-David, S.. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014. [201, 202, 204]Google Scholar
Shalev-Shwartz, S. and Singer, Y.. A primal-dual perspective of online learning algorithms. Machine Learning, 69(2-3):115–142, 2007. [301]Google Scholar
Shamir, O.. On the complexity of bandit and derivative-free stochastic convex optimization. In Proceedings of the 26th Conference on Learning Theory, pages 3–24. JMLR.org, 2013. [315, 364]Google Scholar
Shamir, O.. On the complexity of bandit linear optimization. In Proceedings of the 28th Conference on Learning Theory, pages 1523–1551, Paris, France, 2015. JMLR.org. [257, 311]Google Scholar
Sharot, T.. The optimism bias. Current Biology, 21(23):R941–R945, 2011a. [91, 92]Google Scholar
Sharot, T.. The optimism bias: A tour of the irrationally positive brain. Pantheon/Random House, 2011b. [92]Google Scholar
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., and Lanctot, M.. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016. [11]Google Scholar
Silvey, S. D. and Sibson, B.. Discussion of Dr. Wynn’s and of Dr. Laycock’s papers. Journal of Royal Statistical Society (B), 34:174–175, 1972. [235]Google Scholar
Sion, M.. On general minimax theorems. Pacific Journal of Mathematics, 8(1):171–176, 1958. [300]Google Scholar
Slivkins, A.. Contextual bandits with similarity information. Journal of Machine Learning Research, 15(1):2533–2568, 2014. [314]Google Scholar
Slivkins, A.. Introduction to multi-armed bandits. Foundations and Trends in Machine Learning, 12 (1-2):1–286, 2019. ISSN 1935-8237. [10, 314]Google Scholar
Slivkins, A. and Upfal, E.. Adapting to a changing environment: the Brownian restless bandits. In Proceedings of the 21st Conference on Learning Theory, pages 343–354, 2008. [338]Google Scholar
Soare, M., Lazaric, A., and Munos, R.. Best-arm identification in linear bandits. In Advances in Neural Information Processing Systems, pages 828–836. Curran Associates, Inc., 2014. [264, 364]Google Scholar
Sonin, I. M.. A generalized Gittins index for a Markov chain and its recursive calculation. Statistics and Probability Letters, 78(12):1526–1533, 2008. [402]Google Scholar
Srebro, N., Sridharan, K., and Tewari, A.. On the universality of online mirror descent. In Advances in Neural Information Processing Systems, pages 2645–2653, 2011. [299]Google Scholar
Sridharan, K. and Tewari, A.. Convex games in Banach spaces. In Proceedings of the 23rd Conference on Learning Theory, pages 1–13. Omnipress, 2010. [301]Google Scholar
Srinivas, N., Krause, A., Kakade, S., and Seeger, M.. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning, pages 1015–1022, Madison, WI, USA, 2010. Omnipress. [214]Google Scholar
Stoltz, G.. Incomplete information and internal regret in prediction of individual sequences. PhD thesis, Université Paris Sud-Paris XI, 2005. [137]Google Scholar
Strasser, H.. Mathematical theory of statistics: statistical experiments and asymptotic decision theory, volume 7. Walter de Gruyter, 2011. [382]Google Scholar
Strauch, R. E.. Negative dynamic programming. The Annals of Mathematical Statistics, 37(4):871–890, 08 1966. [476]Google Scholar
Streeter, M. J. and Smith, S. F.. A simple distribution-free approach to the max k-armed bandit problem. In International Conference on Principles and Practice of Constraint Programming, pages 560–574. Springer, 2006a. [365]Google Scholar
Streeter, M. J. and Smith, S. F.. An asymptotically optimal algorithm for the max k-armed bandit problem. In Proceedings of the National Conference on Artificial Intelligence, pages 135–142, 2006b. [365]Google Scholar
Strehl, A. and Littman, M.. A theoretical analysis of model-based interval estimation. In Proceedings of the 22nd International Conference on Machine Learning, pages 856–863, New York, NY, USA, 2005. ACM. [475]Google Scholar
Strehl, A. and Littman, M.. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008. [475, 481]Google Scholar
Strehl, A., Li, L., Wiewiora, E., Langford, J., and Littman, M.. PAC model-free reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 881–888, New York, NY, USA, 2006. ACM. [475]Google Scholar
Strens, M. J. A.. A Bayesian framework for reinforcement learning. In Proceedings of the 17th International Conference on Machine Learning, pages 943–950, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc. [474]Google Scholar
Sui, Y., Gotovos, A., Burdick, J., and Krause, A.. Safe exploration for optimization with Gaussian processes. In Proceedings of the 32nd International Conference on Machine Learning, pages 997–1005, Lille, France, 07–09 Jul 2015. JMLR.org. [315]Google Scholar
Sun, Q., Zhou, W., and Fan, J.. Adaptive Huber regression: Optimality and phase transition. arXiv:1706.06991, 2017. [96]Google Scholar
Sutton, R. and Barto, A.. Reinforcement Learning: An Introduction. MIT Press, 1998. [79, 400]Google Scholar
Sutton, R. and Barto, A.. Reinforcement Learning: An Introduction. MIT Press, second edition, 2018. [474]Google Scholar
Swart, J.M.. Large deviation theory, January 2017. URL http://staff.utia.cas.cz/swart/lecture_notes/LDP8.pdf. [68]Google Scholar
Syrgkanis, V., Krishnamurthy, A., and Schapire, R.. Efficient algorithms for adversarial contextual learning. In Proceedings of the 33rd International Conference on Machine Learning, pages 2159–2168, New York, NY, USA, 2016. JMLR.org. [202]Google Scholar
Szepesvári, Cs.. Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2010. [474]Google Scholar
Szita, I. and Lőrincz, A.. Optimistic initialization and greediness lead to polynomial time learning in factored MDPs. In Proceedings of the 26th International Conference on Machine Learning, pages 1001–1008, New York, USA, 2009. ACM. [475]Google Scholar
Szita, I. and Szepesvári, Cs.. Model-based reinforcement learning with nearly tight exploration complexity bounds. In Proceedings of the 27th International Conference on Machine Learning, pages 1031–1038, USA, 2010. Omnipress. [475]Google Scholar
Takimoto, E. and Warmuth, M. K.. Path kernels and multiplicative updates. Journal of Machine Learning Research, 4:773–818, 2003. [329]Google Scholar
Talagrand, M.. The missing factor in Hoeffding’s inequalities. Annales de l’IHP Probabilités et Statistiques, 31(4):689–702, 1995. [65]Google Scholar
Taraldsen, G.. Optimal learning from the Doob-Dynkin lemma. arXiv:1801.00974, 2018. [30]Google Scholar
Teevan, J., Dumais, S. T., and Horvitz, E.. Characterizing the value of personalizing search. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 757–758, New York, NY, USA, 2007. ACM. [352]Google Scholar
Tewari, A. and Bartlett, P. L.. Optimistic linear programming gives logarithmic regret for irreducible MDPs. In Advances in Neural Information Processing Systems, pages 1505–1512. Curran Associates, Inc., 2008. [474]Google Scholar
Tewari, A. and Murphy, S. A.. From ads to interventions: Contextual bandits in mobile health. In Mobile Health, pages 495–517. Springer, 2017. [201]Google Scholar
Theocharous, G., Wen, Z., Abbasi-Yadkori, Y., and Vlassis, N.. Posterior sampling for large scale reinforcement learning. arXiv:1711.07979, 2017. [475]Google Scholar
Thompson, W.. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933. [10, 55, 79, 404, 414, 416]Google Scholar
Thompson, W. R.. On the theory of apportionment. American Journal of Mathematics, 57(2):450–456, 1935. [473]Google Scholar
Todd, M. J.. Minimum-volume ellipsoids: Theory and algorithms. SIAM, 2016. [235]Google Scholar
Tolkien, J. R. R.. The Hobbit. Ballantine Books, 1937. [404]Google Scholar
Tran-Thanh, L., Chapman, A., Munoz de Cote, E., Rogers, A., and Jennings, N. R.. Epsilon-first policies for budget-limited multi-armed bandits. In Proceedings of the 24th AAAI Conference on Artificial Intelligence, AAAI, pages 1211–1216, 2010. [315]Google Scholar
Tran-Thanh, L., Chapman, A., Rogers, A., and Jennings, N. R.. Knapsack based optimal policies for budget-limited multi-armed bandits. In Proceedings of the 26th AAAI Conference on Artificial Intelligence, AAAI’12, pages 1134–1140. AAAI Press, 2012. [315]Google Scholar
Tropp, J. A.. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 8(1-2):1–230, 2015. [66]Google Scholar
Tsitsiklis, J. N.. A short proof of the Gittins index theorem. The Annals of Applied Probability, pages 194–199, 1994. [401]Google Scholar
Tsybakov, A. B.. Introduction to nonparametric estimation. Springer Science & Business Media, 2008. [167]Google Scholar
Ionescu Tulcea, C.. Mesures dans les espaces produits. Atti Accademia Nazionale Lincei Rend., 7:208–211, 1949–50. [43]Google Scholar
Uchibe, E. and Doya, K.. Competitive-cooperative-concurrent reinforcement learning with importance sampling. In Proceedings of the International Conference on Simulation of Adaptive Behavior: From Animals and Animats, pages 287–296, 2004. [149]Google Scholar
Valko, M.. Bandits on graphs and structures, 2016. [217, 316]Google Scholar
Valko, M., Carpentier, A., and Munos, R.. Stochastic simultaneous optimistic optimization. In Proceedings of the 30th International Conference on Machine Learning, pages 19–27, Atlanta, GA, USA, 2013a. JMLR.org. [364]Google Scholar
Valko, M., Korda, N., Munos, R., Flaounas, I., and Cristianini, N.. Finite-time analysis of kernelised contextual bandits. In Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence, pages 654–663, Arlington, VA, USA, 2013b. AUAI Press. [214]Google Scholar
Valko, M., Munos, R., Kveton, B., and Kocák, T.. Spectral bandits for smooth graph functions. In Proceedings of the 31st International Conference on Machine Learning, pages 46–54, Beijing, China, 2014. JMLR.org. [214, 217, 238]Google Scholar
van de Geer, S.. Empirical Processes in M-estimation, volume 6. Cambridge University Press, 2000. [66, 226, 300]Google Scholar
van der Hoeven, D., van Erven, T., and Kotłowski, W.. The many faces of exponential weights in online learning. In Proceedings of the 31st Conference on Learning Theory, pages 2067–2092, 2018. [284]Google Scholar
van der Vaart, A. W. and Wellner, J. A.. Weak Convergence and Empirical Processes. Springer, New York, 1996. [300]Google Scholar
Vanchinathan, H. P., Bartók, G., and Krause, A.. Efficient partial monitoring with prior information. In Advances in Neural Information Processing Systems, pages 1691–1699. Curran Associates, Inc., 2014. [448]Google Scholar
Vapnik, V.. Statistical learning theory, volume 3. Wiley, New York, 1998. [204]Google Scholar
Varaiya, P., Walrand, J., and Buyukkoc, C.. Extensions of the multiarmed bandit problem: The discounted case. IEEE Transactions on Automatic Control, 30(5):426–439, 1985. [402]Google Scholar
Vernade, C., Cappé, O., and Perchet, V.. Stochastic bandit models for delayed conversions. In Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence. AUAI Press, 2017. [316]Google Scholar
Vernade, C., Carpentier, A., Zappella, G., Ermis, B., and Brueckner, M.. Contextual bandits under delayed feedback. arXiv:1807.02089, 2018. [316]Google Scholar
Villar, S., Bowden, J., and Wason, J.. Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges. Statistical Science: a review journal of the Institute of Mathematical Statistics, 30(2):199–215, 2015. [11]Google Scholar
Vogel, W.. An asymptotic minimax theorem for the two armed bandit problem. The Annals of Mathematical Statistics, 31(2):444–451, 1960. [174]Google Scholar
von Neumann, J.. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100(1):295–320, 1928. [301]Google Scholar
Vovk, V. G.. Aggregating strategies. Proceedings of Computational Learning Theory, 1990. [125, 137]Google Scholar
Wang, S. and Chen, W.. Thompson sampling for combinatorial semi-bandits. In Proceedings of the 35th International Conference on Machine Learning, pages 5114–5122, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. JMLR.org. [328, 417]Google Scholar
Wang, Y., Audibert, J-Y., and Munos, R.. Algorithms for infinitely many-armed bandits. In Advances in Neural Information Processing Systems, pages 1729–1736, 2009. [314]Google Scholar
Warmuth, M. K. and Jagota, A.. Continuous and discrete-time nonlinear gradient descent: Relative loss bounds and convergence. In Electronic Proceedings of the 5th International Symposium on Artificial Intelligence and Mathematics, 1997. [301]Google Scholar
Wawrzynski, P. L. and Pacut, A.. Truncated importance sampling for reinforcement learning with experience replay. In Proceedings of the International Multiconference on Computer Science and Information Technology, pages 305–315, 2007. [149]Google Scholar
Weber, R.. On the Gittins index for multiarmed bandits. The Annals of Applied Probability, 2(4):1024–1033, 1992. [401]Google Scholar
Weber, R. and Weiss, G.. On an index policy for restless bandits. Journal of Applied Probability, 27(3):637–648, 1990. [402]Google Scholar
Wei, C-Y. and Luo, H.. More adaptive algorithms for adversarial bandits. In Proceedings of the 31st Conference On Learning Theory, pages 1263–1291. JMLR.org, 06–09 Jul 2018. [299, 301, 304]Google Scholar
Weinberger, M. J. and Ordentlich, E.. On delayed prediction of individual sequences. In Proceedings of the 2002 IEEE International Symposium on Information Theory, page 148. IEEE, 2002. [316]Google Scholar
Wen, Z., Kveton, B., and Ashkan, A.. Efficient learning in large-scale combinatorial semi-bandits. In Proceedings of the 32nd International Conference on Machine Learning, volume 37, pages 1113–1122, Lille, France, 2015. JMLR.org. [328]Google Scholar
Whittle, P.. Multi-armed bandits and the Gittins index. Journal of the Royal Statistical Society (B), pages 143–149, 1980. [401]Google Scholar
Whittle, P.. Restless bandits: Activity allocation in a changing world. Journal of Applied Probability, 25(A):287–298, 1988. [337, 402]Google Scholar
Williams, D.. Probability with martingales. Cambridge University Press, 1991. [32]Google Scholar
Wu, H. and Liu, X.. Double Thompson sampling for dueling bandits. In Advances in Neural Information Processing Systems, pages 649–657. Curran Associates, Inc., 2016. [315]Google Scholar
Wu, Y., György, A., and Szepesvári, Cs.. Online learning with Gaussian payoffs and side observations. In Advances in Neural Information Processing Systems, pages 1360–1368. Curran Associates Inc., 2015. [448]Google Scholar
Wu, Y., Shariff, R., Lattimore, T., and Szepesvári, Cs.. Conservative bandits. In Proceedings of the 33rd International Conference on Machine Learning, pages 1254–1262, New York, NY, USA, 20–22 Jun 2016. JMLR.org. [315]Google Scholar
Wynn, H. P.. The sequential generation of D-optimum experimental designs. The Annals of Mathematical Statistics, pages 1655–1664, 1970. [235]Google Scholar
Xia, Y., Li, H., Qin, T., Yu, N., and Liu, T.-Y.. Thompson sampling for budgeted multi-armed bandits. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI, pages 3960–3966. AAAI Press, 2015. [315]Google Scholar
Yao, Y.. Some results on the Gittins index for a normal reward process. In Time Series and Related Topics, pages 284–294. Institute of Mathematical Statistics, 2006. [402]Google Scholar
Yu, B.. Assouad, Fano, and Le Cam. In D. Pollard, E. Torgersen, and G. L. Yang, editors, Festschrift for Lucien Le Cam: Research Papers in Probability and Statistics, pages 423–435. Springer, 1997. [174, 175]Google Scholar
Yue, Y. and Joachims, T.. Interactively optimizing information retrieval systems as a dueling bandits problem. In Proceedings of the 26th International Conference on Machine Learning, pages 1201–1208. ACM, 2009. [315]Google Scholar
Yue, Y. and Joachims, T.. Beat the mean bandit. In Proceedings of the 28th International Conference on Machine Learning, pages 241–248, New York, NY, USA, June 2011. ACM. [315]Google Scholar
Yue, Y., Broder, J., Kleinberg, R., and Joachims, T.. The k-armed dueling bandits problem. In Proceedings of the 22nd Conference on Learning Theory, 2009. [315]Google Scholar
Zimmert, J. and Lattimore, T.. Connections between mirror descent, Thompson sampling and the information ratio. In Advances in Neural Information Processing Systems, pages 11973–11982. Curran Associates, Inc., 2019. [416]Google Scholar
Zimmert, J. and Seldin, Y.. An optimal algorithm for stochastic and adversarial bandits. In AISTATS, pages 467–475, 2019. [136, 305, 315]Google Scholar
Zimmert, J., Luo, H., and Wei, C-Y.. Beating stochastic and adversarial semi-bandits optimally and simultaneously. In Proceedings of the 36th International Conference on Machine Learning, pages 7683–7692, Long Beach, California, USA, 09–15 Jun 2019. JMLR.org. [305]Google Scholar
Zinkevich, M.. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, pages 928–935. AAAI Press, 2003. [301]Google Scholar
Zoghi, M., Whiteson, S., Munos, R., and Rijke, M.. Relative upper confidence bound for the k-armed dueling bandit problem. In Proceedings of the 31st International Conference on Machine Learning, pages 10–18, Beijing, China, 2014. JMLR.org. [315]Google Scholar
Zoghi, M., Karnin, Z., Whiteson, S., and Rijke, M.. Copeland dueling bandits. In Advances in Neural Information Processing Systems, pages 307–315. Curran Associates, Inc., 2015. [315]Google Scholar
Zoghi, M., Tunys, T., Ghavamzadeh, M., Kveton, B., Szepesvári, Cs., and Wen, Z.. Online learning to rank in stochastic click models. In Proceedings of the 34th International Conference on Machine Learning, pages 4199–4208. JMLR.org, 2017. [351]Google Scholar
Zong, S., Ni, H., Sung, K., Ke, R. N., Wen, Z., and Kveton, B.. Cascading bandits for large-scale recommendation problems. In Proceedings of the 32nd Conference on Uncertainty in Artificial Intelligence, 2016. [350, 351]Google Scholar
