
Bibliography

Published online by Cambridge University Press:  04 July 2020

Tor Lattimore, University of Alberta
Csaba Szepesvári, University of Alberta

Type: Chapter
Information: Bandit Algorithms, pp. 484–512
Publisher: Cambridge University Press
Print publication year: 2020


References

Abbasi-Yadkori, Y.. Forced-exploration based algorithms for playing in bandits with large action sets. PhD thesis, University of Alberta, 2009a. [213]
Abbasi-Yadkori, Y.. Forced-exploration based algorithms for playing in bandits with large action sets. Master's thesis, University of Alberta, Department of Computing Science, 2009b. [79]
Abbasi-Yadkori, Y.. Online Learning for Linearly Parametrized Control Problems. PhD thesis, University of Alberta, 2012. [214, 475]
Abbasi-Yadkori, Y. and Szepesvári, Cs.. Regret bounds for the adaptive control of linear quadratic systems. In Proceedings of the 24th Conference on Learning Theory, pages 1–26, Budapest, Hungary, 2011. JMLR.org. [475]
Abbasi-Yadkori, Y. and Szepesvári, Cs.. Bayesian optimal control of smoothly parameterized systems. In Proceedings of the 31st Conference on Uncertainty in Artificial Intelligence, pages 2–11, Arlington, VA, United States, 2015. AUAI Press. [475]
Abbasi-Yadkori, Y., Antos, A., and Szepesvári, Cs.. Forced-exploration based algorithms for playing in stochastic linear bandits. In COLT Workshop on On-line Learning with Limited Feedback, 2009. [79, 213]
Abbasi-Yadkori, Y., Pál, D., and Szepesvári, Cs.. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320. Curran Associates, Inc., 2011. [213]
Abbasi-Yadkori, Y., Pál, D., and Szepesvári, Cs.. Online-to-confidence-set conversions and application to sparse stochastic bandits. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics, pages 1–9, La Palma, Canary Islands, 2012. JMLR.org. [249]
Abbasi-Yadkori, Y., Bartlett, P. L., Kanade, V., Seldin, Y., and Szepesvári, Cs.. Online learning in Markov decision processes with adversarially chosen transition probability distributions. In Advances in Neural Information Processing Systems, pages 2508–2516, USA, 2013. Curran Associates Inc. [475]
Abbasi-Yadkori, Y., Bartlett, P., Gabillon, V., Malek, A., and Valko, M.. Best of both worlds: Stochastic & adversarial best-arm identification. In Proceedings of the 31st Conference on Learning Theory, 2018. [365]
Abe, N. and Long, P. M.. Associative reinforcement learning using linear probabilistic concepts. In Proceedings of the 16th International Conference on Machine Learning, pages 3–11, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc. [213]
Abeille, M. and Lazaric, A.. Linear Thompson sampling revisited. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pages 176–184, Fort Lauderdale, FL, USA, 2017a. JMLR.org. [417]
Abeille, M. and Lazaric, A.. Thompson sampling for linear-quadratic control problems. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pages 1246–1254, Fort Lauderdale, FL, USA, 2017b. JMLR.org. [475]
Abernethy, J. D. and Rakhlin, A.. Beating the adaptive bandit with high probability. In Proceedings of the 22nd Conference on Learning Theory, 2009. [149, 301]
Abernethy, J. D., Hazan, E., and Rakhlin, A.. Competing in the dark: An efficient algorithm for bandit linear optimization. In Proceedings of the 21st Conference on Learning Theory, pages 263–274. Omnipress, 2008. [301]
Abernethy, J. D., Hazan, E., and Rakhlin, A.. Interior-point methods for full-information and bandit online learning. IEEE Transactions on Information Theory, 58(7):4164–4175, 2012. [148, 299]
Abernethy, J. D., Lee, C., Sinha, A., and Tewari, A.. Online linear optimization via smoothing. In Proceedings of the 27th Conference on Learning Theory, pages 807–823, Barcelona, Spain, 2014. JMLR.org. [328]
Abernethy, J. D., Lee, C., and Tewari, A.. Fighting bandits with a new kind of smoothness. In Advances in Neural Information Processing Systems, pages 2197–2205. Curran Associates, Inc., 2015. [301, 328]
Abramowitz, M. and Stegun, I. A.. Handbook of mathematical functions: with formulas, graphs, and mathematical tables, volume 55. Courier Corporation, 1964. [158, 418]
Achab, M., Clémençon, S., Garivier, A., Sabourin, A., and Vernade, C.. Max k-armed bandit: On the ExtremeHunter algorithm and beyond. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 389–404. Springer, 2017. [365]
Adelman, L.. Choice theory. In Gass, Saul I. and Fu, Michael C., editors, Encyclopedia of Operations Research and Management Science, pages 164–168. Springer US, Boston, MA, 2013. [54]
Agarwal, A., Foster, D. P., Hsu, D. J., Kakade, S. M., and Rakhlin, A.. Stochastic convex optimization with bandit feedback. In Advances in Neural Information Processing Systems, pages 1035–1043. Curran Associates, Inc., 2011. [315]
Agarwal, A., Foster, D. P., Hsu, D., Kakade, S. M., and Rakhlin, A.. Stochastic convex optimization with bandit feedback. SIAM Journal on Optimization, 23(1):213–240, 2013. [364]
Agarwal, A., Hsu, D., Kale, S., Langford, J., Li, L., and Schapire, R.. Taming the monster: A fast and simple algorithm for contextual bandits. In Proceedings of the 31st International Conference on Machine Learning, pages 1638–1646, Beijing, China, 2014. JMLR.org. [201, 202]
Agarwal, A., Bird, S., Cozowicz, M., Hoang, L., Langford, J., Lee, S., Li, J., Melamed, D., Oshri, G., and Ribas, O.. Making contextual decisions with low technical debt. arXiv:1606.03966, 2016. [11]
Agrawal, R.. Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, pages 1054–1078, 1995. [92, 100]
Agrawal, S. and Devanur, N. R.. Bandits with concave rewards and convex knapsacks. In Proceedings of the 15th ACM Conference on Economics and Computation, pages 989–1006. ACM, 2014. [315]
Agrawal, S. and Devanur, N. R.. Linear contextual bandits with knapsacks. In Advances in Neural Information Processing Systems, pages 3458–3467. Curran Associates Inc., 2016. [315]
Agrawal, S. and Goyal, N.. Analysis of Thompson sampling for the multi-armed bandit problem. In Proceedings of the 25th Conference on Learning Theory, 2012. [416]
Agrawal, S. and Goyal, N.. Further optimal regret bounds for Thompson sampling. In Proceedings of the 16th International Conference on Artificial Intelligence and Statistics, pages 99–107, Scottsdale, Arizona, USA, 2013a. JMLR.org. [415, 417]
Agrawal, S. and Goyal, N.. Thompson sampling for contextual bandits with linear payoffs. In Proceedings of the 30th International Conference on Machine Learning, pages 127–135, Atlanta, GA, USA, 2013b. JMLR.org. [417]
Agrawal, S. and Jia, R.. Optimistic posterior sampling for reinforcement learning: worst-case regret bounds. In Advances in Neural Information Processing Systems, pages 1184–1194. Curran Associates, Inc., 2017. [474]
Agrawal, S., Avadhanula, V., Goyal, V., and Zeevi, A.. Thompson sampling for the MNL-bandit. In Proceedings of the 2017 Conference on Learning Theory, pages 76–78, Amsterdam, Netherlands, 2017. JMLR.org. [417]
Ailon, N., Karnin, Z., and Joachims, T.. Reducing dueling bandits to cardinal bandits. In Proceedings of the 31st International Conference on Machine Learning, pages II–856–II–864. JMLR.org, 2014. [315]
Aldrich, J.. "But you have to remember P. J. Daniell of Sheffield". Electronic Journal for History of Probability and Statistics, 3(2), 2007. [42]
Allenberg, C., Auer, P., Györfi, L., and Ottucsák, G.. Hannan consistency in on-line learning in case of unbounded losses under partial monitoring. In Proceedings of the 17th International Conference on Algorithmic Learning Theory, pages 229–243, Berlin, Heidelberg, 2006. Springer-Verlag. [135, 148, 299]
Alon, N., Matias, Y., and Szegedy, M.. The space complexity of approximating the frequency moments. In Proceedings of the 28th Annual ACM Symposium on Theory of Computing, pages 20–29. ACM, 1996. [96]
Alon, N., Cesa-Bianchi, N., Gentile, C., and Mansour, Y.. From bandits to experts: A tale of domination and independence. In Advances in Neural Information Processing Systems, pages 1610–1618. Curran Associates, Inc., 2013. [316, 448]
Alon, N., Cesa-Bianchi, N., Dekel, O., and Koren, T.. Online learning with feedback graphs: Beyond bandits. In Proceedings of the 28th Conference on Learning Theory, pages 23–35, Paris, France, 2015. JMLR.org. [316]
Anantharam, V., Varaiya, P., and Walrand, J.. Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays - Part I: i.i.d. rewards. IEEE Transactions on Automatic Control, 32(11):968–976, 1987. [214]
Anderson, J. R., Dillon, J. L., and Hardaker, J. E.. Agricultural decision analysis. Monographs: Applied Economics. Iowa State University Press, 1977. [xiii]
Anscombe, F. J.. Sequential medical trials. Journal of the American Statistical Association, 58(302):365–383, 1963. [79]
Antos, A., Bartók, G., Pál, D., and Szepesvári, Cs.. Toward a classification of finite partial-monitoring games. Theoretical Computer Science, 473:77–99, 2013. [448]
Arapostathis, A., Borkar, V. S., Fernandez-Gaucherand, E., Ghosh, M. K., and Marcus, S. I.. Discrete-time controlled Markov processes with average cost criterion: a survey. SIAM Journal on Control and Optimization, 31(2):282–344, 1993. [474]
Arora, R., Dekel, O., and Tewari, A.. Online bandit learning against an adaptive adversary: From regret to policy regret. In Proceedings of the 29th International Conference on Machine Learning, Madison, WI, USA, 2012. Omnipress. [136]
Ashwinkumar, B., Langford, J., and Slivkins, A.. Resourceful contextual bandits. In Proceedings of the 27th Conference on Learning Theory, pages 1109–1134, Barcelona, Spain, 2014. JMLR.org. [315]
Audibert, J.-Y. and Bubeck, S.. Regret bounds and minimax policies under partial monitoring. Journal of Machine Learning Research, 11:2785–2836, 2010a. [136]
Audibert, J.-Y. and Bubeck, S.. Minimax policies for adversarial and stochastic bandits. In Proceedings of the 22nd Conference on Learning Theory, pages 217–226, 2009. [108, 136, 305]
Audibert, J.-Y. and Bubeck, S.. Best arm identification in multi-armed bandits. In Proceedings of the 23rd Conference on Learning Theory, 2010b. [338, 364]
Audibert, J.-Y., Munos, R., and Szepesvári, Cs.. Tuning bandit algorithms in stochastic environments. In Proceedings of the 18th International Conference on Algorithmic Learning Theory, pages 150–165, Berlin, Heidelberg, 2007. Springer Berlin Heidelberg. [56, 70, 92, 95, 183]
Audibert, J.-Y., Munos, R., and Szepesvári, Cs.. Exploration-exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science, 410(19):1876–1902, 2009. [56]
Audibert, J.-Y., Bubeck, S., and Lugosi, G.. Regret in online combinatorial optimization. Mathematics of Operations Research, 39(1):31–45, 2013. [301]
Auer, P.. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3:397–422, 2002. [213, 238]
Auer, P. and Chiang, C.. An algorithm with nearly optimal pseudo-regret for both stochastic and adversarial bandits. In Proceedings of the 29th Annual Conference on Learning Theory, pages 116–120, New York, NY, USA, 2016. JMLR.org. [136]
Auer, P. and Ortner, R.. Logarithmic online regret bounds for undiscounted reinforcement learning. In Advances in Neural Information Processing Systems, pages 49–56. MIT Press, 2007. [475]
Auer, P. and Ortner, R.. UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1–2):55–65, 2010. [82, 108]
Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E.. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of the 36th Annual Symposium on Foundations of Computer Science, pages 322–331. IEEE, 1995. [81, 125, 137, 174]
Auer, P., Cesa-Bianchi, N., and Fischer, P.. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235–256, 2002a. [79, 92]
Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E.. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002b. [148, 149, 175, 201, 338]
Auer, P., Ortner, R., and Szepesvári, Cs.. Improved rates for the stochastic continuum-armed bandit problem. In International Conference on Computational Learning Theory, pages 454–468. Springer, 2007. [314]
Auer, P., Jaksch, T., and Ortner, R.. Near-optimal regret bounds for reinforcement learning. In Advances in Neural Information Processing Systems, pages 89–96, 2009. [474]
Auer, P., Gajane, P., and Ortner, R.. Adaptively tracking the best arm with an unknown number of distribution changes. In European Workshop on Reinforcement Learning 14, 2018. [338]
Auer, P., Gajane, P., and Ortner, R.. Adaptively tracking the best bandit arm with an unknown number of distribution changes. In Proceedings of the 32nd Conference on Learning Theory, 2019. [337, 338]
Awerbuch, B. and Kleinberg, R.. Adaptive routing with end-to-end feedback: Distributed learning and geometric approaches. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing, pages 45–53. ACM, 2004. [328]
Axler, S. J.. Linear algebra done right, volume 2. Springer, 1997. [445]
Azar, M. G., Osband, I., and Munos, R.. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pages 263–272, Sydney, Australia, 06–11 Aug 2017. JMLR.org. [474]
Badanidiyuru, A., Kleinberg, R., and Slivkins, A.. Bandits with knapsacks. In Foundations of Computer Science (FOCS), 2013 IEEE 54th Annual Symposium on, pages 207–216. IEEE, 2013. [315]
Bartlett, P. L. and Tewari, A.. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, pages 35–42, Arlington, VA, United States, 2009. AUAI Press. [479]
Bartók, G.. A near-optimal algorithm for finite partial-monitoring games against adversarial opponents. In Proceedings of the 26th Conference on Learning Theory, pages 696–710. JMLR.org, 2013. [447, 448]
Bartók, G. and Szepesvári, Cs.. Partial monitoring with side information. In Proceedings of the 23rd International Conference on Algorithmic Learning Theory, pages 305–319, 2012. [448]
Bartók, G., Pál, D., and Szepesvári, Cs.. Toward a classification of finite partial-monitoring games. In Proceedings of the 21st International Conference on Algorithmic Learning Theory, pages 224–238. Springer, 2010. [448]
Bartók, G., Zolghadr, N., and Szepesvári, Cs.. An adaptive algorithm for finite stochastic partial monitoring. In Proceedings of the 29th International Conference on Machine Learning, pages 1779–1786, USA, 2012. Omnipress. [448]
Bartók, G., Foster, D. P., Pál, D., Rakhlin, A., and Szepesvári, Cs.. Partial monitoring—classification, regret bounds, and algorithms. Mathematics of Operations Research, 39(4):967–997, 2014. [448]
Bastani, H. and Bayati, M.. Online decision making with high-dimensional covariates. Operations Research, 68(1):276–294, 2020. [249]
Bather, J. A. and Chernoff, H.. Sequential decisions in the control of a spaceship. In Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 3, pages 181–207, 1967. [10]
Bayes, T.. LII. An essay towards solving a problem in the doctrine of chances. By the late Rev. Mr. Bayes, FRS, communicated by Mr. Price, in a letter to John Canton, AMFRS. Philosophical Transactions of the Royal Society of London, 53:370–418, 1763. [382]
Bellman, R.. The theory of dynamic programming. Technical report, RAND Corporation, Santa Monica, CA, 1954. [473]
Bellman, R. E.. Eye of the Hurricane. World Scientific, 1984. [473]
Berend, D. and Kontorovich, A.. On the concentration of the missing mass. Electronic Communications in Probability, 18(3):1–7, 2013. [69]
Berger, J. O.. Statistical Decision Theory and Bayesian Analysis. Springer Science & Business Media, 1985. [380]
Bernoulli, D.. Exposition of a new theory on the measurement of risk. Econometrica: Journal of the Econometric Society, pages 23–36, 1954. [54]
Berry, A. C.. The accuracy of the Gaussian approximation to the sum of independent variates. Transactions of the American Mathematical Society, 49(1):122–136, 1941. [64]
Berry, D. and Fristedt, B.. Bandit problems: sequential allocation of experiments. Chapman and Hall, London; New York, 1985. [11, 400, 401]
Berry, D. A., Chen, R. W., Zame, A., Heath, D. C., and Shepp, L. A.. Bandit problems with infinitely many arms. The Annals of Statistics, 25(5):2103–2116, 1997. [314]
Bertsekas, D. and Tsitsiklis, J. N.. Neuro-Dynamic Programming. Athena Scientific, 1st edition, 1996. [474]
Bertsekas, D. P.. Dynamic Programming and Optimal Control, volumes 1–2. Athena Scientific, Belmont, MA, 4th edition, 2012. [472, 473, 474]
Bertsekas, D. P.. Convex optimization algorithms. Athena Scientific, Belmont, 2015. [329]
Bertsimas, D. and Tsitsiklis, J. N.. Introduction to linear optimization, volume 6. Athena Scientific, Belmont, MA, 1997. [474]
Besbes, O., Gur, Y., and Zeevi, A.. Stochastic multi-armed-bandit problem with non-stationary rewards. In Advances in Neural Information Processing Systems, pages 199–207. Curran Associates, Inc., 2014. [338]
Besson, L. and Kaufmann, E.. What doubling tricks can and can't do for multi-armed bandits. arXiv:1803.06971, 2018. [81]
Beygelzimer, A., Langford, J., Li, L., Reyzin, L., and Schapire, R. E.. An optimal high probability algorithm for the contextual bandit problem. arXiv:1002.4058, 2010. [148]
Beygelzimer, A., Langford, J., Li, L., Reyzin, L., and Schapire, R.. Contextual bandit algorithms with supervised learning guarantees. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, pages 19–26, Fort Lauderdale, FL, USA, 2011. JMLR.org. [201, 204]
Billingsley, P.. Probability and measure. John Wiley & Sons, 2008. [32, 42]
Blackwell, D.. Controlled random walks. In Proceedings of the International Congress of Mathematicians, volume 3, pages 336–338, 1954. [125]
Bogachev, V. I.. Measure theory, volume 2. Springer Science & Business Media, 2007. [33, 277]
Bonald, T. and Proutiere, A.. Two-target algorithms for infinite-armed bandits with Bernoulli rewards. In Advances in Neural Information Processing Systems, pages 2184–2192, 2013. [314]
Bottou, L., Peters, J., Quiñonero-Candela, J., Charles, D. X., Chickering, D. M., Portugaly, E., Ray, D., Simard, P., and Snelson, E.. Counterfactual reasoning and learning systems: The example of computational advertising. The Journal of Machine Learning Research, 14(1):3207–3260, 2013. [149]
Boucheron, S., Lugosi, G., and Massart, P.. Concentration inequalities: A nonasymptotic theory of independence. OUP Oxford, 2013. [66, 300]
Bouneffouf, D. and Rish, I.. A survey on practical applications of multi-armed and contextual bandits. arXiv:1904.10040, 2019. [11]
Box, G. E. P.. Science and statistics. Journal of the American Statistical Association, 71(356):791–799, 1976. [125]
Box, G. E. P.. Robustness in the strategy of scientific model building. Robustness in Statistics, 1:201–236, 1979. [125]
Boyd, S. and Vandenberghe, L.. Convex optimization. Cambridge University Press, 2004. [275]
Bradt, R. N., Johnson, S. M., and Karlin, S.. On sequential designs for maximizing the sum of n observations. The Annals of Mathematical Statistics, pages 1060–1074, 1956. [401]
Brafman, R. and Tennenholtz, M.. R-MAX – a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3:213–231, 2003. [475]
Bretagnolle, J. and Huber, C.. Estimation des densités: risque minimax. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 47(2):119–137, 1979. [167]
Bubeck, S. and Cesa-Bianchi, N.. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012. [10, 92, 136, 301, 327]
Bubeck, S. and Eldan, R.. The entropic barrier: a simple and optimal universal self-concordant barrier. In Proceedings of the 28th Conference on Learning Theory, pages 279–279, Paris, France, 2015. JMLR.org. [284]
Bubeck, S. and Eldan, R.. Multi-scale exploration of convex functions and bandit convex optimization. In Proceedings of the 29th Conference on Learning Theory, pages 583–589, New York, NY, USA, 2016. JMLR.org. [315, 416, 417]
Bubeck, S. and Liu, C.. Prior-free and prior-dependent regret bounds for Thompson sampling. In Advances in Neural Information Processing Systems, pages 638–646. Curran Associates, Inc., 2013. [417]
Bubeck, S. and Slivkins, A.. The best of both worlds: Stochastic and adversarial bandits. In Proceedings of the 25th Conference on Learning Theory, pages 42.1–42.23, 2012. [136]
Bubeck, S., Munos, R., and Stoltz, G.. Pure exploration in multi-armed bandits problems. In International Conference on Algorithmic Learning Theory, pages 23–37. Springer, 2009. [364]
Bubeck, S., Munos, R., Stoltz, G., and Szepesvári, Cs.. X-armed bandits. Journal of Machine Learning Research, 12:1655–1695, 2011. [314]
Bubeck, S., Cesa-Bianchi, N., and Kakade, S.. Towards minimax policies for online linear optimization with bandit feedback. In Proceedings of the 25th Conference on Learning Theory, pages 41–1. Microtome, 2012. [238, 283, 301]
Bubeck, S., Cesa-Bianchi, N., and Lugosi, G.. Bandits with heavy tail. IEEE Transactions on Information Theory, 59(11):7711–7717, 2013a. [96]
Bubeck, S., Perchet, V., and Rigollet, P.. Bounded regret in stochastic multi-armed bandits. In Proceedings of the 26th Annual Conference on Learning Theory, pages 122–134, Princeton, NJ, USA, 2013b. JMLR.org. [174]
Bubeck, S., Dekel, O., Koren, T., and Peres, Y.. Bandit convex optimization: √T regret in one dimension. In Proceedings of the 28th Conference on Learning Theory, pages 266–278, Paris, France, 2015a. JMLR.org. [315, 416, 417]
Bubeck, S., Eldan, R., and Lehec, J.. Finite-time analysis of projected Langevin Monte Carlo. In Advances in Neural Information Processing Systems, pages 1243–1251. Curran Associates, Inc., 2015b. [284, 299]
Bubeck, S., Lee, Y. T., and Eldan, R.. Kernel-based methods for bandit convex optimization. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, pages 72–85, New York, NY, USA, 2017. ACM. [315]
Bubeck, S., Cohen, M., and Li, Y.. Sparsity, variance and curvature in multi-armed bandits. In Proceedings of the 29th International Conference on Algorithmic Learning Theory, pages 111–127. JMLR.org, 07–09 Apr 2018. [148, 298, 301]
Burnetas, A. N. and Katehakis, M. N.. Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics, 17(2):122–142, 1996. [100, 119, 181]
Burnetas, A. N. and Katehakis, M. N.. Optimal adaptive policies for Markov decision processes. Mathematics of Operations Research, 22(1):222–255, 1997a. [475]
Burnetas, A. N. and Katehakis, M. N.. On the finite horizon one-armed bandit problem. Stochastic Analysis and Applications, 16(1):845–859, 1997b. [401]
Burnetas, A. N. and Katehakis, M. N.. Asymptotic Bayes analysis for the finite-horizon one-armed-bandit problem. Probability in the Engineering and Informational Sciences, 17(1):53–82, 2003. [401]
Bush, R. R. and Mosteller, F.. A stochastic model with applications to learning. The Annals of Mathematical Statistics, pages 559–585, 1953. [10]
Cappé, O., Garivier, A., Maillard, O., Munos, R., and Stoltz, G.. Kullback–Leibler upper confidence bounds for optimal sequential allocation. The Annals of Statistics, 41(3):1516–1541, 2013. [100, 119, 120, 181]
Carpentier, A. and Locatelli, A.. Tight (lower) bounds for the fixed budget best arm identification bandit problem. In Proceedings of the 29th Conference on Learning Theory, pages 590–604, New York, NY, USA, 2016. JMLR.org. [364]
Carpentier, A. and Munos, R.. Bandit theory meets compressed sensing for high dimensional stochastic linear bandit. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics, pages 190–198, La Palma, Canary Islands, 2012. JMLR.org. [249, 311]
Carpentier, A. and Valko, M.. Extreme bandits. In Advances in Neural Information Processing Systems, pages 1089–1097. Curran Associates, Inc., 2014. [365]
Carpentier, A. and Valko, M.. Simple regret for infinitely many armed bandits. In Proceedings of the 32nd International Conference on Machine Learning, pages 1133–1141, Lille, France, 2015. PMLR. [314]
Catoni, O.. Challenging the empirical mean and empirical variance: a deviation study. Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, 48(4):1148–1185, 2012. [96]
Cesa-Bianchi, N. and Lugosi, G.. Prediction, learning, and games. Cambridge University Press, 2006. [10, 136, 301, 338, 381, 448]
Cesa-Bianchi, N. and Lugosi, G.. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404–1422, 2012. [327]
Cesa-Bianchi, N., Lugosi, G., and Stoltz, G.. Regret minimization under partial monitoring. Mathematics of Operations Research, 31:562–580, 2006. [448]
Cesa-Bianchi, N., Gentile, C., Mansour, Y., and Minora, A.. Delay and cooperation in nonstochastic bandits. In Proceedings of the 29th Conference on Learning Theory, pages 605–622, New York, NY, USA, 2016. JMLR.org. [316]
Cesa-Bianchi, N., Gentile, C., Lugosi, G., and Neu, G.. Boltzmann exploration done right. In Advances in Neural Information Processing Systems, pages 6284–6293. Curran Associates, Inc., 2017. [79]
Chakravorty, J. and Mahajan, A.. Multi-armed bandits, Gittins index, and its calculation. Methods and Applications of Statistics in Clinical Trials: Planning, Analysis, and Inferential Methods, 2:416–435, 2013. [402]
Chakravorty, J. and Mahajan, A.. Multi-armed bandits, Gittins index, and its calculation. Methods and Applications of Statistics in Clinical Trials: Planning, Analysis, and Inferential Methods, 2:416–435, 2014. [402]
Chan, H. P. and Lai, T. L.. Sequential generalized likelihood ratios and adaptive treatment allocation for optimal sequential selection. Sequential Analysis, 25:179–201, 2006. [365]
Chang, J. T. and Pollard, D.. Conditioning as disintegration. Statistica Neerlandica, 51(3):287–317, 1997. [382]
Chapelle, O. and Li, L.. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, pages 2249–2257. Curran Associates, Inc., 2011. [416]
Chaudhuri, S. and Tewari, A.. Phased exploration with greedy exploitation in stochastic combinatorial partial monitoring games. In Advances in Neural Information Processing Systems, pages 2433–2441, 2016. [448]
Chen, C-H., Lin, J., Yücesan, E., and Chick, S. E.. Simulation budget allocation for further enhancing the efficiency of ordinal optimization. Discrete Event Dynamic Systems, 10(3):251–270, 2000. [365]
Chen, S., Lin, T., King, I., Lyu, M. R., and Chen, W.. Combinatorial pure exploration of multi-armed bandits. In Advances in Neural Information Processing Systems, pages 379–387. Curran Associates, Inc., 2014. [364]
Chen, W., Wang, Y., and Yuan, Y.. Combinatorial multi-armed bandit: General framework and applications. In Proceedings of the 30th International Conference on Machine Learning, pages 151–159, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR. [328]
Chen, W., Hu, W., Li, F., Li, J., Liu, Y., and Lu, P.. Combinatorial multi-armed bandit with general reward functions. In Advances in Neural Information Processing Systems, pages 1659–1667. Curran Associates, Inc., 2016a. [328]
Chen, W., Wang, Y., Yuan, Y., and Wang, Q.. Combinatorial multi-armed bandit and its extension to probabilistically triggered arms. Journal of Machine Learning Research, 17(50):1–33, 2016b. URL http://jmlr.org/papers/v17/14-298.html. [328]
Chen, Y., Lee, C-W., Luo, H., and Wei, C-Y.. A new algorithm for non-stationary contextual bandits: Efficient, optimal, and parameter-free. arXiv:1902.00980, 2019. [337, 338]
Chen, Y. R. and Katehakis, M. N.. Linear programming for finite state multi-armed bandit problems. Mathematics of Operations Research, 11(1):180–183, 1986. [402]
Chernoff, H.. Sequential design of experiments. The Annals of Mathematical Statistics, 30(3):755–770, 1959. [11, 364]
Chernoff, H.. A career in statistics. Past, Present, and Future of Statistical Science, page 29, 2014. [119]
Cheung, W., Simchi-Levi, D., and Zhu, R.. Learning to optimize under non-stationarity. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, pages 1079–1087. PMLR, 16–18 Apr 2019. [338]
Chu, W., Li, L., Reyzin, L., and Schapire, R.. Contextual bandits with linear payoff functions. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, pages 208–214, Fort Lauderdale, FL, USA, 2011. JMLR.org. [238]
Chuklin, A., Markov, I., and de Rijke, M.. Click Models for Web Search. Morgan & Claypool Publishers, 2015. [351]
Cicirello, V. A. and Smith, S. F.. The max k-armed bandit: A new model of exploration applied to search heuristic selection. In AAAI, pages 1355–1361, 2005. [365]
Cohen, A. and Hazan, T.. Following the perturbed leader for online structured learning. In Proceedings of the 32nd International Conference on Machine Learning, pages 1034–1042, Lille, France, 07–09 Jul 2015. JMLR.org. [328, 330]
Cohen, A., Hazan, T., and Koren, T.. Tight bounds for bandit combinatorial optimization. In Proceedings of the 2017 Conference on Learning Theory, pages 629–642, Amsterdam, Netherlands, 2017. JMLR.org. [326]
Combes, R., Magureanu, S., Proutiere, A., and Laroche, C.. Learning to rank: Regret lower bounds and efficient algorithms. In Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pages 231–244. ACM, 2015a. [351]
Combes, R., Shahi, M., Proutiere, A., and Lelarge, M.. Combinatorial bandits revisited. In Advances in Neural Information Processing Systems, pages 2116–2124. Curran Associates, Inc., 2015b. [327, 328]
Combes, R., Magureanu, S., and Proutière, A.. Minimal exploration in structured stochastic bandits. In Advances in Neural Information Processing Systems, pages 1761–1769, 2017. [213, 215, 264, 314]
Conn, A. R., Scheinberg, K., and Vicente, L. N.. Introduction to Derivative-Free Optimization. SIAM, 2009. [364]
Cover, T. M.. Universal portfolios. Mathematical Finance, 1(1):1–29, 1991. [284]
Cover, T. M. and Thomas, J. A.. Elements of information theory. John Wiley & Sons, 2012. [167]
Cowan, W. and Katehakis, M. N.. An asymptotically optimal policy for uniform bandits of unknown support. arXiv:1505.01918, 2015. [181]
Cowan, W., Honda, J., and Katehakis, M. N.. Normal bandits of unknown means and variances. Journal of Machine Learning Research, 18(154):1–28, 2018. [181]
Crammer, K. and Gentile, C.. Multiclass classification with bandit feedback using adaptive regularization. Machine Learning, 90(3):347–383, 2013. [249]
Craswell, N., Zoeter, O., Taylor, M., and Ramsey, B.. An experimental comparison of click position-bias models. In Proceedings of the 2008 International Conference on Web Search and Data Mining, pages 87–94. ACM, 2008. [351]
Dani, V. and Hayes, T. P.. Robbing the bandit: Less regret in online geometric optimization against an adaptive adversary. In 17th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 937–943, 2006. [328]
Dani, V., Hayes, T. P., and Kakade, S. M.. Stochastic linear optimization under bandit feedback. In Proceedings of the 21st Conference on Learning Theory, pages 355–366, 2008. [213, 257]
de la Peña, V. H., Lai, T. L., and Shao, Q.. Self-normalized processes: Limit theory and statistical applications. Springer Science & Business Media, 2008. [66, 71, 226]
Degenne, R. and Koolen, W. M.. Pure exploration with multiple correct answers. In Advances in Neural Information Processing Systems, pages 14591–14600. Curran Associates, Inc., 2019. [364]
Degenne, R. and Perchet, V.. Anytime optimal algorithms in stochastic multi-armed bandits. In Proceedings of the 33rd International Conference on Machine Learning, pages 1587–1595, New York, NY, USA, 20–22 Jun 2016. JMLR.org. [108]
Degenne, R., Koolen, W. M., and Ménard, P.. Non-asymptotic pure exploration by solving games. In Advances in Neural Information Processing Systems, pages 14492–14501. Curran Associates, Inc., 2019. [364]
Dekel, O., Gentile, C., and Sridharan, K.. Robust selective sampling from single and multiple teachers. In Proceedings of the 23rd Conference on Learning Theory, pages 346–358, 2010. [249]
Dekel, O., Gentile, C., and Sridharan, K.. Selective sampling and active learning from single and multiple teachers. Journal of Machine Learning Research, 13:2655–2697, 2012. [249]
Dembo, A. and Zeitouni, O.. Large deviations techniques and applications, volume 38. Springer Science & Business Media, 2009. [68]
Denardo, E. V., Park, H., and Rothblum, U. G.. Risk-sensitive and risk-neutral multiarmed bandits. Mathematics of Operations Research, 32(2):374–394, 2007. [56]
Desautels, T., Krause, A., and Burdick, J. W.. Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization. Journal of Machine Learning Research, 15:4053–4103, 2014. [316]
Dobrushin, R. L.. Eine allgemeine Formulierung des Fundamentalsatzes von Shannon in der Informationstheorie. Usp. Mat. Nauk, 14(6(90)):3–104, 1959. [167]
Dong, S. and Van Roy, B.. An information-theoretic analysis for Thompson sampling with many actions. In Advances in Neural Information Processing Systems, Red Hook, NY, USA, 2018. Curran Associates Inc. [415]
Doob, J. L.. Stochastic processes. Wiley, 1953. [43]
Dudík, M., Hsu, D., Kale, S., Karampatziakis, N., Langford, J., Reyzin, L., and Zhang, T.. Efficient optimal learning for contextual bandits. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence, pages 169–178. AUAI Press, 2011. [202]
Dudík, M., Hofmann, K., Schapire, R. E., Slivkins, A., and Zoghi, M.. Contextual dueling bandits. In Proceedings of the 28th Conference on Learning Theory, pages 563–587, Paris, France, 2015. JMLR.org. [315]
Dudley, R. M.. Uniform central limit theorems, volume 142. Cambridge University Press, 2014. [66, 300]
Esseen, C. G.. On the Liapounoff limit of error in the theory of probability. Almqvist & Wiksell, 1942. [64]
Even-Dar, E., Mannor, S., and Mansour, Y.. PAC bounds for multi-armed bandit and Markov decision processes. In Computational Learning Theory, pages 255–270. Springer, 2002. [364, 368]
Even-Dar, E., Kakade, S. M., and Mansour, Y.. Experts in a Markov decision process. In Advances in Neural Information Processing Systems, pages 401–408, Cambridge, MA, USA, 2004. MIT Press. [475]
Even-Dar, E., Mannor, S., and Mansour, Y.. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7:1079–1105, 2006. [364]
Fedorov, V. V.. Theory of optimal experiments. Academic Press, New York, 1972. [235]
Filippi, S., Cappé, O., Garivier, A., and Szepesvári, Cs.. Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems, pages 586–594. Curran Associates, Inc., 2010. [213]
Fink, D.. A compendium of conjugate priors, 1997. [382]
Foster, D. and Rakhlin, A.. No internal regret via neighborhood watch. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics, pages 382–390, La Palma, Canary Islands, 2012. JMLR.org. [448]
Foster, D. J. and Rakhlin, A.. Beyond UCB: Optimal and efficient contextual bandits with regression oracles. arXiv:2002.04926, 2020. [256]
Frank, M. and Wolfe, P.. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1–2):95–110, 1956. [235]
Frederick, S., Loewenstein, G., and O'Donoghue, T.. Time discounting and time preference: A critical review. Journal of Economic Literature, 40(2):351–401, 2002. [400]
Frostig, E. and Weiss, G.. Four proofs of Gittins' multiarmed bandit theorem. Applied Probability Trust, 70, 1999. [401]
Fruit, R., Pirotta, M., and Lazaric, A.. Near optimal exploration-exploitation in non-communicating Markov decision processes. In Advances in Neural Information Processing Systems, pages 2997–3007, 2018. [474, 475, 482, 483]
Gai, Y., Krishnamachari, B., and Jain, R.. Combinatorial network optimization with unknown variables: Multi-armed bandits with linear rewards and individual observations. IEEE/ACM Transactions on Networking, 20(5):1466–1478, 2012. [328]
Gajane, P., Ortner, R., and Auer, P.. A sliding-window algorithm for Markov decision processes with arbitrarily changing rewards and transitions. arXiv:1805.10066, 2018. [338]
Garivier, A.. Informational confidence bounds for self-normalized averages and applications. arXiv:1309.3376, 2013. [93, 108]
Garivier, A. and Cappé, O.. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proceedings of the 24th Conference on Learning Theory, 2011. [118, 119]
Garivier, A. and Kaufmann, E.. Optimal best arm identification with fixed confidence. In Proceedings of the 29th Conference on Learning Theory, pages 998–1027, New York, NY, USA, 2016. JMLR.org. [364]
Garivier, A. and Moulines, E.. On upper-confidence bound policies for switching bandit problems. In Proceedings of the 22nd International Conference on Algorithmic Learning Theory, pages 174–188, Berlin, Heidelberg, 2011. Springer Berlin Heidelberg. [338]
Garivier, A., Kaufmann, E., and Koolen, W. M.. Maximin action identification: A new bandit framework for games. In Proceedings of the 29th Conference on Learning Theory, pages 1028–1050, New York, NY, USA, 2016a. JMLR.org. [364]
Garivier, A., Lattimore, T., and Kaufmann, E.. On explore-then-commit strategies. In Advances in Neural Information Processing Systems, pages 784–792. Curran Associates, Inc., 2016b. [79, 100, 181]
Garivier, A., Ménard, P., and Stoltz, G.. Explore first, exploit next: The true shape of regret in bandit problems. Mathematics of Operations Research, 44(2):377–399, 2019. [181]
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B.. Bayesian data analysis, volume 2. CRC Press, Boca Raton, FL, 2014. [382]
Gentile, C. and Orabona, F.. On multilabel classification and ranking with partial feedback. In Advances in Neural Information Processing Systems, pages 1151–1159. Curran Associates, Inc., 2012. [249]
Gentile, C. and Orabona, F.. On multilabel classification and ranking with bandit feedback. Journal of Machine Learning Research, 15(1):2451–2487, 2014. [249]
Gerchinovitz, S.. Sparsity regret bounds for individual sequences in online linear regression. Journal of Machine Learning Research, 14(Mar):729–769, 2013. [249]
Gerchinovitz, S. and Lattimore, T.. Refined lower bounds for adversarial bandits. In Advances in Neural Information Processing Systems, pages 1198–1206. Curran Associates, Inc., 2016. [174, 190]
Ghosal, S. and van der Vaart, A.. Fundamentals of nonparametric Bayesian inference, volume 44. Cambridge University Press, 2017. [382]
Ghosh, A., Chowdhury, S. R., and Gopalan, A.. Misspecified linear bandits. In 31st AAAI Conference on Artificial Intelligence, 2017. [238]
Gittins, J.. Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society. Series B (Methodological), 41(2):148–177, 1979. [337, 401]
Gittins, J., Glazebrook, K., and Weber, R.. Multi-armed bandit allocation indices. John Wiley & Sons, 2011. [11, 337, 401]
Glowacka, D.. Bandit algorithms in information retrieval. Foundations and Trends® in Information Retrieval, 13:299–424, 2019. [351]
Glynn, P. and Juneja, S.. Ordinal optimization – empirical large deviations rate estimators, and stochastic multi-armed bandits. arXiv:1507.04564, 2015. [365]
Goldsman, D.. Ranking and selection in simulation. In 15th Conference on Winter Simulation, pages 387–394, 1983. [365]
Gopalan, A. and Mannor, S.. Thompson sampling for learning parameterized Markov decision processes. In Proceedings of the 28th Conference on Learning Theory, pages 861–898, Paris, France, 2015. JMLR.org. [417]
Gordon, G. J.. Regret bounds for prediction problems. In Proceedings of the 12th Conference on Learning Theory, pages 29–40, 1999. [301]
Graepel, T., Candela, J. Q., Borchert, T., and Herbrich, R.. Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft's Bing search engine. In Proceedings of the 27th International Conference on Machine Learning, pages 13–20, USA, 2010. Omnipress. [416]
Granmo, O.. Solving two-armed Bernoulli bandit problems using a Bayesian learning automaton. International Journal of Intelligent Computing and Cybernetics, 3(2):207–234, 2010. [416]
Gray, R. M.. Entropy and information theory. Springer Science & Business Media, 2011. [167]
Greenewald, K., Tewari, A., Murphy, S., and Klasnja, P.. Action centered contextual bandits. In Advances in Neural Information Processing Systems, pages 5977–5985. Curran Associates, Inc., 2017. [11]
Grötschel, M., Lovász, L., and Schrijver, A.. Geometric algorithms and combinatorial optimization, volume 2. Springer Science & Business Media, 2012. [235, 327, 474]
Guo, F., Liu, C., and Wang, Y. M.. Efficient multiple-click models in web search. In Proceedings of the 2nd ACM International Conference on Web Search and Data Mining, pages 124–131. ACM, 2009. [351]
György, A. and Szepesvári, Cs.. Shifting regret, mirror descent, and matrices. In Proceedings of the 33rd International Conference on Machine Learning, pages 2943–2951, New York, NY, USA, 20–22 Jun 2016. JMLR.org. [338]
György, A., Linder, T., Lugosi, G., and Ottucsák, G.. The on-line shortest path problem under partial monitoring. Journal of Machine Learning Research, 8(Oct):2369–2403, 2007. [328, 329]
György, A., Pál, D., and Szepesvári, Cs.. Online learning: Algorithms for Big Data. 2019. [338]
Halmos, P. R.. Measure Theory. Graduate Texts in Mathematics. Springer New York, 1976. [42]
Hamidi, N. and Bayati, M.. A general framework to analyze stochastic linear bandit. arXiv:2002.05152, 2020. [415]
Hanawal, M., Saligrama, V., Valko, M., and Munos, R.. Cheap bandits. In Proceedings of the 32nd International Conference on Machine Learning, pages 2133–2142, Lille, France, 07–09 Jul 2015. JMLR.org. [315]
Hannan, J.. Approximation to Bayes risk in repeated play. Contributions to the Theory of Games, 3:97–139, 1957. [125, 301, 328]
Hao, B., Lattimore, T., and Szepesvári, Cs.. Adaptive exploration in linear contextual bandit. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics, 2020. [213, 264]
Hardy, G. H.. Divergent Series. Oxford University Press, 1973. [472]
Hazan, E.. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3–4):157–325, 2016. [300, 301]
Hazan, E. and Kale, S.. A simple multi-armed bandit algorithm with optimal variation-bounded regret. In Proceedings of the 24th Conference on Learning Theory, pages 817–820. JMLR.org, 2011. [148]
Hazan, E., Karnin, Z., and Meka, R.. Volumetric spanners: an efficient exploration basis for learning. Journal of Machine Learning Research, 17(119):1–34, 2016. [235, 284]
Helmbold, D. P., Littlestone, N., and Long, P. M.. Apple tasting. Information and Computation, 161(2):85–139, 2000. [448]
Herbster, M. and Warmuth, M. K.. Tracking the best expert. Machine Learning, 32(2):151–178, 1998. [338]
Herbster, M. and Warmuth, M. K.. Tracking the best linear predictor. Journal of Machine Learning Research, 1(Sep):281–309, 2001. [338]
Ho, Y-C., Sreenivas, R. S., and Vakili, P.. Ordinal optimization of DEDS. Discrete Event Dynamic Systems, 1992. [365]
Honda, J. and Takemura, A.. An asymptotically optimal bandit algorithm for bounded support models. In Proceedings of the 23rd Conference on Learning Theory, pages 67–79, 2010. [100, 109, 119, 181]
Honda, J. and Takemura, A.. An asymptotically optimal policy for finite support models in the multiarmed bandit problem. Machine Learning, 85(3):361–391, 2011. [100]
Honda, J. and Takemura, A.. Optimality of Thompson sampling for Gaussian bandits depends on priors. In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics, pages 375–383, Reykjavik, Iceland, 2014. JMLR.org. [417]
Honda, J. and Takemura, A.. Non-asymptotic analysis of a new bandit algorithm for semi-bounded rewards. Journal of Machine Learning Research, 16:3721–3756, 2015. [119, 181]
Hu, X., Prashanth, L. A., György, A., and Szepesvári, Cs.. (Bandit) convex optimization with biased noisy gradient oracles. In AISTATS, pages 819–828, 2016. [315, 364]
Huang, R., Ajallooeian, M. M., Szepesvári, Cs., and Müller, M.. Structured best arm identification with fixed confidence. In Proceedings of the 28th International Conference on Algorithmic Learning Theory, pages 593–616, Kyoto, Japan, 2017a. JMLR.org. [364]
Huang, R., Lattimore, T., György, A., and Szepesvári, Cs.. Following the leader and fast rates in online linear prediction: Curved constraint sets and other regularities. Journal of Machine Learning Research, 18:1–31, 2017b. [300]
Huang, W., Ok, J., Li, L., and Chen, W.. Combinatorial pure exploration with continuous and separable reward functions and its applications. In IJCAI, pages 2291–2297, 2018. [364]
Hutter, M.. Universal artificial intelligence: Sequential decisions based on algorithmic probability. Springer Science & Business Media, 2004. [381]
Hutter, M. and Poland, J.. Adaptive online prediction by following the perturbed leader. Journal of Machine Learning Research, 6:639–660, 2005. [328]
Ionides, E. L.. Truncated importance sampling. Journal of Computational and Graphical Statistics, 17(2):295–311, 2008. [149]
Ivanenko, V. I. and Labkovsky, V. A.. On regularities of mass random phenomena. arXiv:1204.4440, 2013. [125]
Jaksch, T., Auer, P., and Ortner, R.. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 99:1563–1600, 2010. [474, 476]
Jamieson, K. and Nowak, R.. Best-arm identification algorithms for multi-armed bandits in the fixed confidence setting. In Information Sciences and Systems (CISS), 2014 48th Annual Conference on, pages 1–6. IEEE, 2014. [364]
Jamieson, K. and Talwalkar, A.. Non-stochastic best arm identification and hyperparameter optimization. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pages 240–248, 2016. [365]
Jamieson, K., Katariya, S., Deshpande, A., and Nowak, R.. Sparse dueling bandits. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, pages 416–424, San Diego, CA, USA, 2015. JMLR.org. [315]
Jaynes, E. T.. Probability theory: the logic of science. Cambridge University Press, 2003. [381, 382]
Jefferson, A., Bortolotti, L., and Kuzmanovic, B.. What is unrealistic optimism? Consciousness and Cognition, 50:3–11, 2017. [92]
Joulani, P., György, A., and Szepesvári, Cs.. Online learning under delayed feedback. In Proceedings of the 30th International Conference on Machine Learning, pages 1453–1461, Atlanta, GA, USA, 2013. JMLR.org. [316]
Joulani, P., György, A., and Szepesvári, Cs.. A modular analysis of adaptive (non-)convex optimization: Optimism, composite objectives, and variational bounds. In Proceedings of the 28th International Conference on Algorithmic Learning Theory, pages 681–720, Kyoto University, Kyoto, Japan, 2017. JMLR.org. [298]
Jun, K., Bhargava, A., Nowak, R., and Willett, R.. Scalable generalized linear bandits: Online computation and hashing. In Advances in Neural Information Processing Systems, pages 99–109. Curran Associates, Inc., 2017. [213]
Kaelbling, L. P.. Learning in embedded systems. MIT Press, 1993. [92]
Kahneman, D. and Tversky, A.. Prospect theory: An analysis of decision under risk. Econometrica, 47(2):263–291, 1979. [54]
Kakade, S.. On The Sample Complexity Of Reinforcement Learning. PhD thesis, University College London, 2003. [475]
Kakade, S. M., Shalev-Shwartz, S., and Tewari, A.. Efficient bandit algorithms for online multiclass prediction. In Proceedings of the 25th International Conference on Machine Learning, pages 440–447, 2008. [202]
Kalai, A. and Vempala, S.. Geometric algorithms for online optimization. Technical Report MIT-LCS-TR-861, MIT, 2002. [301, 328]
Kalai, A. and Vempala, S.. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005. [328]
Kallenberg, L.. A note on M. N. Katehakis' and Y.-R. Chen's computation of the Gittins index. Mathematics of Operations Research, 11(1):184–186, 1986. [402]
Kallenberg, L.. Markov decision processes: Lecture notes, 2016. [474]
Kallenberg, O.. Foundations of modern probability. Springer-Verlag, 2002. [32, 33, 41, 42, 43, 168, 228, 383]
Karnin, Z., Koren, T., and Somekh, O.. Almost optimal exploration in multi-armed bandits. In Proceedings of the 30th International Conference on Machine Learning, pages 1238–1246, Atlanta, GA, USA, 2013. JMLR.org. [364]
El Karoui, N. and Karatzas, I.. Dynamic allocation problems in continuous time. The Annals of Applied Probability, pages 255–286, 1994. [402]
Katariya, S., Kveton, B., Szepesvári, Cs., and Wen, Z.. DCM bandits: Learning to rank with multiple clicks. In Proceedings of the 33rd International Conference on Machine Learning, pages 1215–1224, 2016. [351]
Katariya, S., Kveton, B., Szepesvári, Cs., Vernade, C., and Wen, Z.. Bernoulli rank-1 bandits for click feedback. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, 2017a. [351]
Katariya, S., Kveton, B., Szepesvári, Cs., Vernade, C., and Wen, Z.. Stochastic rank-1 bandits. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017b. [351]
Katehakis, M. N. and Robbins, H.. Sequential choice from several populations. Proceedings of the National Academy of Sciences of the United States of America, 92(19):8584, 1995. [92, 100]
Katkovnik, V. Ya. and Kulchitsky, Yu.. Convergence of a class of random search algorithms. Automation Remote Control, 8:1321–1326, 1972. [364]
Kaufmann, E.. On Bayesian index policies for sequential resource allocation. The Annals of Statistics, 46(2):842865, 04 2018. [100, 109, 415, 417]Google Scholar
Kaufmann, E., Cappé, O., and Garivier, A.. On Bayesian upper confidence bounds for bandit problems. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics, pages 592–600, La Palma, Canary Islands, 2012a. JMLR.org. [415, 417]Google Scholar
Kaufmann, E., Korda, N., and Munos, R.. Thompson sampling: An asymptotically optimal finite-time analysis. In Proceedings of the 23rd International Conference on Algorithmic Learning Theory, volume 7568 of Lecture Notes in Computer Science, pages 199–213. Springer Berlin Heidelberg, 2012b. ISBN 978-3-642-34105-2. [100, 415, 417]Google Scholar
Kawale, J., Bui, H. H., Kveton, B., Tran-Thanh, L., and Chawla, S.. Efficient Thompson sampling for online matrix-factorization recommendation. In Advances in Neural Information Processing Systems, pages 12971305. Curran Associates, Inc., 2015. [417]Google Scholar
Kazerouni, A., Ghavamzadeh, M., Abbasi, Y., and Van Roy, B.. Conservative contextual linear bandits. In Advances in Neural Information Processing Systems, pages 39103919. Curran Associates, Inc., 2017. [315]Google Scholar
Kearns, M. and Saul, L.. Large deviation methods for approximate probabilistic inference. In Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence, page 319. Morgan Kaufmann Publishers Inc., 1998. [69]Google Scholar
Kearns, M. and Singh, S.. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209232, 2002. [475]Google Scholar
Kearns, M. J. and Vazirani, U. V.. An introduction to computational learning theory. MIT Press, 1994. [202]Google Scholar
Kiefer, J. and Wolfowitz, J.. The equivalence of two extremum problems. Canadian Journal of Mathematics, 12(5):363365, 1960. [235]Google Scholar
Kim, G-S. and Paik, M. C.. Doubly-robust lasso bandit. In Advances in Neural Information Processing Systems, pages 58775887. Curran Associates, Inc., 2019. [249]Google Scholar
Kim, M. J.. Thompson sampling for stochastic control: The finite parameter case. IEEE Transactions on Automatic Control, 62(12):64156422, 2017. [417]Google Scholar
Kirschner, J. and Krause, A.. Information directed sampling and bandits with heteroscedastic noise. In Proceedings of the 31st Conference On Learning Theory, pages 358–384. PMLR, 06–09 Jul 2018. [72, 213]Google Scholar
Kirschner, J., Lattimore, T., and Krause, A.. Information directed sampling for linear partial monitoring. arXiv preprint arXiv:2002.11182, 2020. [448]Google Scholar
Kleinberg, R.. Nearly tight bounds for the continuum-armed bandit problem. In Advances in Neural Information Processing Systems, pages 697–704. MIT Press, 2005. [314]Google Scholar
Kleinberg, R., Slivkins, A., and Upfal, E.. Multi-armed bandits in metric spaces. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, pages 681–690. ACM, 2008. [314]Google Scholar
Kocák, T., Neu, G., Valko, M., and Munos, R.. Efficient learning by implicit exploration in bandit problems with side observations. In Advances in Neural Information Processing Systems, pages 613–621. Curran Associates, Inc., 2014. [148, 149, 316]Google Scholar
Kocák, T., Valko, M., Munos, R., and Agrawal, S.. Spectral Thompson sampling. In AAAI, pages 1911–1917, 2014. [417]Google Scholar
Kocsis, L. and Szepesvári, Cs.. Discounted UCB. In 2nd PASCAL Challenges Workshop, pages 784–791, 2006. [11, 338]Google Scholar
Komiya, H.. Elementary proof for Sion’s minimax theorem. Kodai Mathematical Journal, 11(1):5–7, 1988. [301]Google Scholar
Komiyama, J., Honda, J., Kashima, H., and Nakagawa, H.. Regret lower bound and optimal algorithm in dueling bandit problem. In Proceedings of the 28th Conference on Learning Theory, pages 1141–1154, Paris, France, 2015a. JMLR.org. [315]Google Scholar
Komiyama, J., Honda, J., and Nakagawa, H.. Regret lower bound and optimal algorithm in finite stochastic partial monitoring. In Advances in Neural Information Processing Systems, pages 1792–1800. Curran Associates, Inc., 2015b. [448]Google Scholar
Koolen, W. M., Warmuth, M. K., and Kivinen, J.. Hedging structured concepts. In Proceedings of the 23rd Conference on Learning Theory, pages 93–105. Omnipress, 2010. [328]Google Scholar
Korda, N., Kaufmann, E., and Munos, R.. Thompson sampling for 1-dimensional exponential family bandits. In Advances in Neural Information Processing Systems, pages 1448–1456. Curran Associates, Inc., 2013. [100, 120, 415, 417]Google Scholar
Kujala, J. and Elomaa, T.. On following the perturbed leader in the bandit setting. In Proceedings of the 16th International Conference on Algorithmic Learning Theory, pages 371–385, 2005. [328]Google Scholar
Kujala, J. and Elomaa, T.. Following the perturbed leader to gamble at multi-armed bandits. In Proceedings of the 18th International Conference on Algorithmic Learning Theory, pages 166–180. Springer, 2007. [328]Google Scholar
Kulkarni, S. R. and Lugosi, G.. Finite-time lower bounds for the two-armed bandit problem. IEEE Transactions on Automatic Control, 45(4):711–714, 2000. [181]Google Scholar
Kveton, B., Szepesvári, Cs., Wen, Z., and Ashkan, A.. Cascading bandits: Learning to rank in the cascade model. In Proceedings of the 32nd International Conference on Machine Learning, pages 767–776. JMLR.org, 2015a. [351]Google Scholar
Kveton, B., Wen, Z., Ashkan, A., and Szepesvári, Cs.. Tight regret bounds for stochastic combinatorial semi-bandits. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, pages 535–543, San Diego, CA, USA, 2015b. JMLR.org. [328]Google Scholar
Kveton, B., Wen, Z., Ashkan, A., and Szepesvári, Cs.. Combinatorial cascading bandits. In Advances in Neural Information Processing Systems, pages 1450–1458. Curran Associates Inc., 2015c. [351]Google Scholar
Kveton, B., Szepesvári, Cs., Vaswani, S., Wen, Z., Lattimore, T., and Ghavamzadeh, M.. Garbage in, reward out: Bootstrapping exploration in multi-armed bandits. In Proceedings of the 36th International Conference on Machine Learning, pages 3601–3610, Long Beach, California, USA, 09–15 Jun 2019. PMLR. [417]Google Scholar
Lagrée, P., Vernade, C., and Cappé, O.. Multiple-play bandits in the position-based model. In Advances in Neural Information Processing Systems, pages 1597–1605. Curran Associates Inc., 2016. [351]Google Scholar
Lai, T. L.. Adaptive treatment allocation and the multi-armed bandit problem. The Annals of Statistics, pages 1091–1114, 1987. [92, 100, 109, 119, 401]Google Scholar
Lai, T. L.. Martingales in sequential analysis and time series, 1945–1985. Electronic Journal for History of Probability and Statistics, 5(1), 2009. [226]Google Scholar
Lai, T. L. and Graves, T.. Asymptotically efficient adaptive choice of control laws in controlled Markov chains. SIAM Journal on Control and Optimization, 35(3):715–743, 1997. [475]Google Scholar
Lai, T. L. and Robbins, H.. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985. [56, 92, 100, 119, 181, 230]Google Scholar
Langford, J. and Zhang, T.. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems, pages 817–824. Curran Associates, Inc., 2008. [204]Google Scholar
Laplace, P.. Pierre-Simon Laplace Philosophical Essay on Probabilities: Translated from the fifth French edition of 1825 With Notes by the Translator, volume 13. Springer Science & Business Media, 2012. [33]Google Scholar
Lattimore, T.. The Pareto regret frontier for bandits. In Advances in Neural Information Processing Systems, pages 208–216. Curran Associates, Inc., 2015a. [136, 257]Google Scholar
Lattimore, T.. Optimally confident UCB: Improved regret for finite-armed bandits. arXiv:1507.07880, 2015b. [108]Google Scholar
Lattimore, T.. Regret analysis of the finite-horizon Gittins index strategy for multi-armed bandits. In Proceedings of the 29th Annual Conference on Learning Theory, pages 1214–1245, New York, NY, USA, 2016a. JMLR.org. [100, 401]Google Scholar
Lattimore, T.. Regret analysis of the anytime optimally confident UCB algorithm. arXiv:1603.08661, 2016b. [108]Google Scholar
Lattimore, T.. Regret analysis of the finite-horizon Gittins index strategy for multi-armed bandits. In Proceedings of the 29th Conference on Learning Theory, pages 1214–1245, 2016c. [400]Google Scholar
Lattimore, T.. A scale free algorithm for stochastic bandits with bounded kurtosis. In Advances in Neural Information Processing Systems, pages 1584–1593. Curran Associates, Inc., 2017. [96, 181]Google Scholar
Lattimore, T.. Refining the confidence level for optimistic bandit strategies. Journal of Machine Learning Research, 2018. [82, 108, 110, 181]Google Scholar
Lattimore, T. and Hutter, M.. PAC bounds for discounted MDPs. In Proceedings of the 23rd International Conference on Algorithmic Learning Theory, volume 7568 of Lecture Notes in Computer Science, pages 320–334. Springer Berlin / Heidelberg, 2012. [475]Google Scholar
Lattimore, T. and Munos, R.. Bounded regret for finite-armed structured bandits. In Advances in Neural Information Processing Systems, pages 550–558. Curran Associates, Inc., 2014. [214]Google Scholar
Lattimore, T. and Szepesvári, Cs.. The end of optimism? An asymptotic analysis of finite-armed linear bandits. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pages 728–737, Fort Lauderdale, FL, USA, 2017. JMLR.org. [213, 264]Google Scholar
Lattimore, T. and Szepesvári, Cs.. Cleaning up the neighbourhood: A full classification for adversarial partial monitoring. In Proceedings of the 30th International Conference on Algorithmic Learning Theory, 2019a. [447, 448, 450]Google Scholar
Lattimore, T. and Szepesvári, Cs.. Learning with good feature representations in bandits and in RL with a generative model. arXiv:1911.07676, 2019b. [238, 256]Google Scholar
Lattimore, T. and Szepesvári, Cs.. An information-theoretic approach to minimax regret in partial monitoring. In Proceedings of the 32nd Conference on Learning Theory, pages 2111–2139, Phoenix, USA, 2019c. PMLR. [416, 417, 419, 448]Google Scholar
Lattimore, T. and Szepesvári, Cs.. Exploration by optimisation in partial monitoring. arXiv:1907.05772, 2019d. [446, 448, 451]Google Scholar
Lattimore, T., Crammer, K., and Szepesvári, Cs.. Linear multi-resource allocation with semi-bandit feedback. In Advances in Neural Information Processing Systems, pages 964–972. Curran Associates, Inc., 2015. [249]Google Scholar
Lattimore, T., Kveton, B., Li, S., and Szepesvári, Cs.. TopRank: A practical algorithm for online stochastic ranking. In Advances in Neural Information Processing Systems, pages 3949–3958. Curran Associates, Inc., 2018. [230, 330, 351]Google Scholar
Laurent, B. and Massart, P.. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pages 1302–1338, 2000. [66]Google Scholar
Lazaric, A. and Munos, R.. Hybrid stochastic-adversarial on-line learning. In Proceedings of the 22nd Conference on Learning Theory, 2009. [202]Google Scholar
Le, T., Szepesvári, Cs., and Zheng, R.. Sequential learning for multi-channel wireless network monitoring with channel switching costs. IEEE Transactions on Signal Processing, 62(22):5919–5929, 2014. [11]Google Scholar
Le Cam, L.. Convergence of estimates under dimensionality restrictions. The Annals of Statistics, 1(1):38–53, 1973. [174]Google Scholar
Lee, Y. T., Sidford, A., and Vempala, S. S.. Efficient convex optimization with membership oracles. In Proceedings of the 31st Conference On Learning Theory, pages 1292–1294. JMLR.org, 06–09 Jul 2018. [327]Google Scholar
Lehmann, E. L. and Casella, G.. Theory of point estimation. Springer Science & Business Media, 2006. [382]Google Scholar
Lei, H., Tewari, A., and Murphy, S. A.. An actor-critic contextual bandit algorithm for personalized mobile health interventions. arXiv:1706.09090, 2017. [11]Google Scholar
Leike, J., Lattimore, T., Orseau, L., and Hutter, M.. Thompson sampling is asymptotically optimal in general environments. In Proceedings of the 32nd Conference on Uncertainty in Artificial Intelligence, pages 417–426. AUAI Press, 2016. [417]Google Scholar
Lerche, H. R.. Boundary crossing of Brownian motion: Its relation to the law of the iterated logarithm and to sequential analysis. Springer, 1986. [110]Google Scholar
Levin, D. A. and Peres, Y.. Markov chains and mixing times, volume 107. American Mathematical Soc., 2017. [42]Google Scholar
Levin, L. A.. On the notion of a random sequence. Soviet Mathematics Doklady, 14(5):1413–1416, 1973. [125]Google Scholar
Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., and Talwalkar, A.. Hyperband: A novel bandit-based approach to hyperparameter optimization. Journal of Machine Learning Research, 18(185):1–52, 2018. [365]Google Scholar
Li, S., Wang, B., Zhang, S., and Chen, W.. Contextual combinatorial cascading bandits. In Proceedings of the 33rd International Conference on Machine Learning, pages 1245–1253, 2016. [351]Google Scholar
Li, S., Lattimore, T., and Szepesvári, Cs.. Online learning to rank with features. In Proceedings of the 36th International Conference on Machine Learning, pages 3856–3865, Long Beach, California, USA, 09–15 Jun 2019a. PMLR. [350]Google Scholar
Li, Y., Wang, Y., and Zhou, Y.. Nearly minimax-optimal regret for linearly parameterized bandits. In Proceedings of the 32nd Conference on Learning Theory, pages 2173–2174, Phoenix, USA, 2019b. JMLR.org. [212]Google Scholar
Liang, T., Narayanan, H., and Rakhlin, A.. On zeroth-order stochastic convex optimization via random walks. arXiv:1402.2667, 2014. [364]Google Scholar
Lin, T., Abrahao, B., Kleinberg, R., Lui, J., and Chen, W.. Combinatorial partial monitoring game with linear feedback and its applications. In Proceedings of the 31st International Conference on Machine Learning, pages 901–909, Beijing, China, 22–24 Jun 2014. PMLR. [448]Google Scholar
Lin, T., Li, J., and Chen, W.. Stochastic online greedy learning with semi-bandit feedbacks. In Advances in Neural Information Processing Systems, pages 352–360. Curran Associates, Inc., 2015. [328]Google Scholar
Littlestone, N. and Warmuth, M. K.. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994. [125, 137]Google Scholar
Lovász, L. and Vempala, S.. The geometry of logconcave functions and sampling algorithms. Random Structures & Algorithms, 30(3):307–358, 2007. [284]Google Scholar
Luo, H., Wei, C-Y., Agarwal, A., and Langford, J.. Efficient contextual bandits in non-stationary worlds. In Proceedings of the 31st Conference On Learning Theory, pages 1739–1776. JMLR.org, 06–09 Jul 2018. [338]Google Scholar
MacKay, D.. Information theory, inference and learning algorithms. Cambridge University Press, 2003. [167]Google Scholar
Magureanu, S., Combes, R., and Proutière, A.. Lipschitz bandits: Regret lower bound and optimal algorithms. In Proceedings of the 27th Conference on Learning Theory, pages 975–999, 2014. [215, 314]Google Scholar
Maillard, O.. Robust risk-averse stochastic multi-armed bandits. In Proceedings of the 24th International Conference on Algorithmic Learning Theory, pages 218–233. Springer, Berlin, Heidelberg, 2013. [56]Google Scholar
Maillard, O., Munos, R., and Stoltz, G.. Finite-time analysis of multi-armed bandits problems with Kullback-Leibler divergences. In Proceedings of the 24th Conference on Learning Theory, 2011. [119]Google Scholar
Mannor, S. and Shamir, O.. From bandits to experts: On the value of side-observations. In Advances in Neural Information Processing Systems, pages 684–692. Curran Associates, Inc., 2011. [316, 448]Google Scholar
Mannor, S. and Shimkin, N.. On-line learning with imperfect monitoring. In Learning Theory and Kernel Machines, pages 552–566. Springer, 2003. [448]Google Scholar
Mannor, S. and Tsitsiklis, J. N.. The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research, 5:623–648, December 2004. [364]Google Scholar
Mannor, S., Perchet, V., and Stoltz, G.. Set-valued approachability and online learning with partial monitoring. The Journal of Machine Learning Research, 15(1):3247–3295, 2014. [448]Google Scholar
Markowitz, H.. Portfolio selection. The Journal of Finance, 7(1):77–91, 1952. [55]Google Scholar
Maron, M. E. and Kuhns, J. L.. On relevance, probabilistic indexing and information retrieval. Journal of the ACM, 7(3):216–244, 1960. [352]Google Scholar
Martin-Löf, P.. The definition of random sequences. Information and Control, 9(6):602–619, 1966. [125]Google Scholar
Maurer, A. and Pontil, M.. Empirical Bernstein bounds and sample variance penalization. arXiv:0907.3740, 2009. [70, 95]Google Scholar
May, B. C., Korda, N., Lee, A., and Leslie, D. S.. Optimistic Bayesian sampling in contextual-bandit problems. The Journal of Machine Learning Research, 13(1):2069–2106, 2012. [416]Google Scholar
McDiarmid, C.. Concentration. In Probabilistic methods for algorithmic discrete mathematics, pages 195–248. Springer, 1998. [66, 71, 228]Google Scholar
McMahan, H. B. and Blum, A.. Online geometric optimization in the bandit setting against an adaptive adversary. In Proceedings of the 17th Conference on Learning Theory, volume 3120, pages 109–123. Springer, 2004. [328]Google Scholar
McMahan, H. B. and Streeter, M. J.. Tighter bounds for multi-armed bandits with expert advice. In Proceedings of the 22nd Conference on Learning Theory, 2009. [201]Google Scholar
Ménard, P. and Garivier, A.. A minimax and asymptotically optimal algorithm for stochastic bandits. In Proceedings of the 28th International Conference on Algorithmic Learning Theory, pages 223–237, Kyoto University, Kyoto, Japan, 15–17 Oct 2017. JMLR.org. [100, 108, 119]Google Scholar
Meyn, S. P. and Tweedie, R. L.. Markov chains and stochastic stability. Springer Science & Business Media, 2012. [41, 42]Google Scholar
Mnih, V., Szepesvári, Cs., and Audibert, J.-Y.. Empirical Bernstein stopping. In Proceedings of the 25th International Conference on Machine Learning, pages 672–679, New York, NY, USA, 2008. ACM. [70, 95]Google Scholar
Mukherjee, S., Naveen, KP., Sudarsanam, N., and Ravindran, B.. Efficient-UCBV: An almost optimal algorithm using variance estimates. In 32nd AAAI Conference on Artificial Intelligence, 2018. [108]Google Scholar
Nelder, J. A. and Wedderburn, R. W. M.. Generalized linear models. Journal of the Royal Statistical Society. Series A (General), 135(3):370–384, 1972. [213]Google Scholar
Nemirovsky, A. S.. Efficient methods for large-scale convex optimization problems. Ekonomika i Matematicheskie Metody, 15, 1979. [301]Google Scholar
Nemirovsky, A. S. and Yudin, D. B.. Problem Complexity and Method Efficiency in Optimization. Wiley, 1983. [301, 364, 365]Google Scholar
Neu, G.. Explore no more: Improved high-probability regret bounds for non-stochastic bandits. In Advances in Neural Information Processing Systems, pages 3168–3176. Curran Associates, Inc., 2015a. [148, 149, 201, 328]Google Scholar
Neu, G.. First-order regret bounds for combinatorial semi-bandits. In Proceedings of the 28th Conference on Learning Theory, pages 1360–1375, Paris, France, 2015b. JMLR.org. [148, 299]Google Scholar
Neu, G., György, A., Szepesvári, Cs., and Antos, A.. Online Markov decision processes under bandit feedback. IEEE Transactions on Automatic Control, 59(3):676–691, December 2014. [475]Google Scholar
Von Neumann, J. and Morgenstern, O.. Theory of Games and Economic Behavior. Princeton University Press, Princeton, 1944. [55]Google Scholar
Niño-Mora, J.. Computing a classic index for finite-horizon bandits. INFORMS Journal on Computing, 23(2):254–267, 2011. [402]Google Scholar
O’Donoghue, B., Chu, E., Parikh, N., and Boyd, S.. Conic optimization via operator splitting and homogeneous self-dual embedding. Journal of Optimization Theory and Applications, 169(3):1042–1068, 2016. [450]Google Scholar
O’Donoghue, B., Chu, E., Parikh, N., and Boyd, S.. SCS: Splitting conic solver, version 2.1.1. https://github.com/cvxgrp/scs, November 2017. [450]Google Scholar
Ok, J., Proutiere, A., and Tranos, D.. Exploration in structured reinforcement learning. In Advances in Neural Information Processing Systems, Red Hook, NY, USA, 2018. Curran Associates Inc. [213, 264]Google Scholar
Ortega, P. A. and Braun, D. A.. A minimum relative entropy principle for learning and acting. Journal of Artificial Intelligence Research, pages 475–511, 2010. [416]Google Scholar
Ortner, R. and Ryabko, D.. Online regret bounds for undiscounted continuous reinforcement learning. In Advances in Neural Information Processing Systems, pages 1763–1771, USA, 2012. Curran Associates Inc. [475]Google Scholar
Ortner, R., Ryabko, D., Auer, P., and Munos, R.. Regret bounds for restless Markov bandits. In Proceedings of the 23rd International Conference on Algorithmic Learning Theory, pages 214–228, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg. [337]Google Scholar
Osband, I. and Van Roy, B.. Why is posterior sampling better than optimism for reinforcement learning? In Proceedings of the 34th International Conference on Machine Learning, pages 2701–2710, Sydney, Australia, 06–11 Aug 2017. JMLR.org. [475]Google Scholar
Osband, I., Russo, D., and Van Roy, B.. (More) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pages 3003–3011. Curran Associates, Inc., 2013. [417, 475]Google Scholar
Ostrovsky, E. and Sirota, L.. Exact value for subgaussian norm of centered indicator random variable. arXiv:1405.6749, 2014. [69]Google Scholar
Pandelis, D. G. and Teneketzis, D.. On the optimality of the Gittins index rule for multi-armed bandits with multiple plays. Mathematical Methods of Operations Research, 50(3):449–461, 1999. [400]Google Scholar
Papadimitriou, C. H. and Tsitsiklis, J. N.. The complexity of Markov decision processes. Mathematics of Operations Research, 12(3):441–450, 1987. [471]Google Scholar
Papadimitriou, C. H. and Vempala, S.. On the approximability of the traveling salesman problem. Combinatorica, 26(1):101–120, 2006. [328]Google Scholar
Perchet, V.. Approachability of convex sets in games with partial monitoring. Journal of Optimization Theory and Applications, 149(3):665–677, 2011. [448]Google Scholar
Perchet, V. and Rigollet, P.. The multi-armed bandit problem with covariates. The Annals of Statistics, 41(2):693–721, 04 2013. [215]Google Scholar
Peskir, G. and Shiryaev, A.. Optimal stopping and free-boundary problems. Springer, 2006. [43, 401, 402]Google Scholar
Piccolboni, A. and Schindelhauer, C.. Discrete prediction games with arbitrary feedback and loss. In Computational Learning Theory, pages 208–223. Springer, 2001. [448]Google Scholar
Pike-Burke, C., Agrawal, S., Szepesvári, Cs., and Grünewälder, S.. Bandits with delayed, aggregated anonymous feedback. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 4102–4110. JMLR.org, 10–15 Jul 2018. [316]Google Scholar
Poland, J.. FPL analysis for adaptive bandits. In Lupanov, O. B., Kasim-Zade, O. M., Chaskin, A. V., and Steinhöfel, K., editors, Stochastic Algorithms: Foundations and Applications, pages 58–69, Berlin, Heidelberg, 2005. Springer Berlin Heidelberg. [328]Google Scholar
Pollard, D.. A user’s guide to measure theoretic probability, volume 8. Cambridge University Press, 2002. [32]Google Scholar
Presman, E. L. and Sonin, I. N.. Sequential control with incomplete information. The Bayesian approach to multi-armed bandit problems. Academic Press, 1990. [11, 401]Google Scholar
Puterman, M.. Markov decision processes: discrete stochastic dynamic programming, volume 414. Wiley, 2009. [473, 474, 476]Google Scholar
Qin, C., Klabjan, D., and Russo, D.. Improving the expected improvement algorithm. In Advances in Neural Information Processing Systems, pages 5381–5391. Curran Associates, Inc., 2017. [364]Google Scholar
Radlinski, F., Kleinberg, R., and Joachims, T.. Learning diverse rankings with multi-armed bandits. In Proceedings of the 25th International Conference on Machine Learning, pages 784–791. ACM, 2008. [350, 351, 352]Google Scholar
Rafferty, A. N., Ying, H., and Williams, J. J.. Bandit assignment for educational experiments: Benefits to students versus statistical power. In Artificial Intelligence in Education, pages 286–290. Springer, 2018. [11]Google Scholar
Rakhlin, A. and Sridharan, K.. BISTRO: An efficient relaxation-based method for contextual bandits. In Proceedings of the 33rd International Conference on Machine Learning, pages 1977–1985, 2016. [202]Google Scholar
Rakhlin, A. and Sridharan, K.. On equivalence of martingale tail bounds and deterministic regret inequalities. In Proceedings of the 30th Conference on Learning Theory, pages 1704–1722, Amsterdam, Netherlands, 2017. JMLR.org. [249]Google Scholar
Rakhlin, A., Shamir, O., and Sridharan, K.. Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Conference on Machine Learning, 2012. [365]Google Scholar
Rios, L. M. and Sahinidis, N. V.. Derivative-free optimization: a review of algorithms and comparison of software implementations. Journal of Global Optimization, 56(3):1247–1293, Jul 2013. [364]Google Scholar
Robbins, H.. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535, 1952. [10, 56, 78, 79]Google Scholar
Robbins, H. and Siegmund, D.. Boundary crossing probabilities for the Wiener process and sample sums. The Annals of Mathematical Statistics, pages 1410–1429, 1970. [226]Google Scholar
Robbins, H. and Siegmund, D.. A class of stopping rules for testing parametric hypotheses. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, pages 37–41. University of California Press, 1972. [230]Google Scholar
Robbins, H., Siegmund, D., and Chow, Y.. Great expectations: the theory of optimal stopping. Houghton Mifflin, 7:631–640, 1971. [401]Google Scholar
Robertson, S.. The probability ranking principle in IR. Journal of Documentation, 33(4):294–304, 1977. [352]Google Scholar
Rockafellar, R. T.. Convex analysis. Princeton University Press, 2015. [275, 329]Google Scholar
Rockafellar, R. T. and Uryasev, S.. Optimization of conditional value-at-risk. Journal of Risk, 2:21–42, 2000. [55]Google Scholar
Rogers, C. A.. Packing and covering. Cambridge University Press, 1964. [226]Google Scholar
Ross, S. M.. Introduction to Stochastic Dynamic Programming. Academic Press, New York, 1983. [474]Google Scholar
Rusmevichientong, P. and Tsitsiklis, J. N.. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010. [79, 213, 257]Google Scholar
Russo, D.. Simple Bayesian algorithms for best arm identification. In Proceedings of the 29th Annual Conference on Learning Theory, pages 1417–1418, New York, NY, USA, 2016. JMLR.org. [364]Google Scholar
Russo, D. and Van Roy, B.. Eluder dimension and the sample complexity of optimistic exploration. In Advances in Neural Information Processing Systems, pages 2256–2264. Curran Associates, Inc., 2013. [214]Google Scholar
Russo, D. and Van Roy, B.. Learning to optimize via information-directed sampling. In Advances in Neural Information Processing Systems, pages 1583–1591. Curran Associates, Inc., 2014a. [213, 416, 417]Google Scholar
Russo, D. and Van Roy, B.. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014b. [417]Google Scholar
Russo, D. and Van Roy, B.. An information-theoretic analysis of Thompson sampling. Journal of Machine Learning Research, 17(1):2442–2471, 2016. ISSN 1532-4435. [328, 415, 417]Google Scholar
Russo, D. J., Van Roy, B., Kazerouni, A., Osband, I., and Wen, Z.. A tutorial on Thompson sampling. Foundations and Trends in Machine Learning, 11(1):1–96, 2018. [417]Google Scholar
Rustichini, A.. Minimizing regret: The general case. Games and Economic Behavior, 29(1):224–243, 1999. [447, 448]Google Scholar
Salomon, A., Audibert, J., and Alaoui, I.. Lower bounds and selectivity of weak-consistent policies in stochastic multi-armed bandit problem. Journal of Machine Learning Research, 14(Jan):187–207, 2013. [181]Google Scholar
Samuelson, P.. A note on measurement of utility. The Review of Economic Studies, 4(2):155–161, 1937. [400]Google Scholar
Sani, A., Lazaric, A., and Munos, R.. Risk-aversion in multi-armed bandits. In Advances in Neural Information Processing Systems, pages 3275–3283. Curran Associates, Inc., 2012. [56]Google Scholar
Seldin, Y. and Lugosi, G.. An improved parametrization and analysis of the EXP3++ algorithm for stochastic and adversarial bandits. In Proceedings of the 2017 Conference on Learning Theory, pages 1743–1759, Amsterdam, Netherlands, 2017. JMLR.org. [136]Google Scholar
Seldin, Y. and Slivkins, A.. One practical algorithm for both stochastic and adversarial bandits. In Proceedings of the 31st International Conference on Machine Learning, pages 1287–1295, Beijing, China, 2014. JMLR.org. [136]Google Scholar
Shalev-Shwartz, S.. Online learning: Theory, algorithms, and applications. PhD thesis, The Hebrew University of Jerusalem, 2007. [301]Google Scholar
Shalev-Shwartz, S.. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012. [300, 301]Google Scholar
Shalev-Shwartz, S. and Ben-David, S.. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014. [201, 202, 204]Google Scholar
Shalev-Shwartz, S. and Singer, Y.. A primal-dual perspective of online learning algorithms. Machine Learning, 69(2-3):115–142, 2007. [301]Google Scholar
Shamir, O.. On the complexity of bandit and derivative-free stochastic convex optimization. In Proceedings of the 26th Conference on Learning Theory, pages 3–24. JMLR.org, 2013. [315, 364]Google Scholar
Shamir, O.. On the complexity of bandit linear optimization. In Proceedings of the 28th Conference on Learning Theory, pages 1523–1551, Paris, France, 2015. JMLR.org. [257, 311]Google Scholar
Sharot, T.. The optimism bias. Current Biology, 21(23):R941–R945, 2011a. [91, 92]Google Scholar
Sharot, T.. The optimism bias: A tour of the irrationally positive brain. Pantheon/Random House, 2011b. [92]Google Scholar
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., and Lanctot, M.. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016. [11]Google Scholar
Silvey, S. D. and Sibson, B.. Discussion of Dr. Wynn’s and of Dr. Laycock’s papers. Journal of Royal Statistical Society (B), 34:174–175, 1972. [235]Google Scholar
Sion, M.. On general minimax theorems. Pacific Journal of Mathematics, 8(1):171–176, 1958. [300]Google Scholar
Slivkins, A.. Contextual bandits with similarity information. Journal of Machine Learning Research, 15(1):2533–2568, 2014. [314]Google Scholar
Slivkins, A.. Introduction to multi-armed bandits. Foundations and Trends in Machine Learning, 12 (1-2):1–286, 2019. ISSN 1935-8237. [10, 314]Google Scholar
Slivkins, A. and Upfal, E.. Adapting to a changing environment: the Brownian restless bandits. In Proceedings of the 21st Conference on Learning Theory, pages 343–354, 2008. [338]Google Scholar
Soare, M., Lazaric, A., and Munos, R.. Best-arm identification in linear bandits. In Advances in Neural Information Processing Systems, pages 828–836. Curran Associates, Inc., 2014. [264, 364]Google Scholar
Sonin, I. M.. A generalized Gittins index for a Markov chain and its recursive calculation. Statistics and Probability Letters, 78(12):1526–1533, 2008. [402]Google Scholar
Srebro, N., Sridharan, K., and Tewari, A.. On the universality of online mirror descent. In Advances in Neural Information Processing Systems, pages 2645–2653, 2011. [299]Google Scholar
Sridharan, K. and Tewari, A.. Convex games in Banach spaces. In Proceedings of the 23rd Conference on Learning Theory, pages 1–13. Omnipress, 2010. [301]Google Scholar
Srinivas, N., Krause, A., Kakade, S., and Seeger, M.. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning, pages 1015–1022, Madison, WI, USA, 2010. Omnipress. [214]Google Scholar
Stoltz, G.. Incomplete information and internal regret in prediction of individual sequences. PhD thesis, Université Paris Sud-Paris XI, 2005. [137]Google Scholar
Strasser, H.. Mathematical theory of statistics: statistical experiments and asymptotic decision theory, volume 7. Walter de Gruyter, 2011. [382]Google Scholar
Strauch, R. E.. Negative dynamic programming. The Annals of Mathematical Statistics, 37(4):871–890, 08 1966. [476]Google Scholar
Streeter, M. J. and Smith, S. F.. A simple distribution-free approach to the max k-armed bandit problem. In International Conference on Principles and Practice of Constraint Programming, pages 560–574. Springer, 2006a. [365]Google Scholar
Streeter, M. J. and Smith, S. F.. An asymptotically optimal algorithm for the max k-armed bandit problem. In Proceedings of the National Conference on Artificial Intelligence, pages 135–142, 2006b. [365]Google Scholar
Strehl, A. and Littman, M.. A theoretical analysis of model-based interval estimation. In Proceedings of the 22nd International Conference on Machine Learning, pages 856–863, New York, NY, USA, 2005. ACM. [475]Google Scholar
Strehl, A. and Littman, M.. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008. [475, 481]Google Scholar
Strehl, A., Li, L., Wiewiora, E., Langford, J., and Littman, M.. PAC model-free reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 881–888, New York, NY, USA, 2006. ACM. [475]Google Scholar
Strens, M. J. A.. A Bayesian framework for reinforcement learning. In Proceedings of the 17th International Conference on Machine Learning, pages 943–950, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc. [474]Google Scholar
Sui, Y., Gotovos, A., Burdick, J., and Krause, A.. Safe exploration for optimization with Gaussian processes. In Proceedings of the 32nd International Conference on Machine Learning, pages 997–1005, Lille, France, 07–09 Jul 2015. JMLR.org. [315]Google Scholar
Sun, Q., Zhou, W., and Fan, J.. Adaptive Huber regression: Optimality and phase transition. arXiv:1706.06991, 2017. [96]Google Scholar
Sutton, R. and Barto, A.. Reinforcement Learning: An Introduction. MIT Press, 1998. [79, 400]Google Scholar
Sutton, R. and Barto, A.. Reinforcement Learning: An Introduction. MIT Press, second edition, 2018. [474]Google Scholar
Swart, J.M.. Large deviation theory, January 2017. URL http://staff.utia.cas.cz/swart/lecture_notes/LDP8.pdf. [68]Google Scholar
Syrgkanis, V., Krishnamurthy, A., and Schapire, R.. Efficient algorithms for adversarial contextual learning. In Proceedings of the 33rd International Conference on Machine Learning, pages 2159–2168, New York, NY, USA, 2016. JMLR.org. [202]Google Scholar
Szepesvári, Cs.. Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2010. [474]Google Scholar
Szita, I. and Lőrincz, A.. Optimistic initialization and greediness lead to polynomial time learning in factored MDPs. In Proceedings of the 26th International Conference on Machine Learning, pages 1001–1008, New York, USA, 2009. ACM. [475]Google Scholar
Szita, I. and Szepesvári, Cs.. Model-based reinforcement learning with nearly tight exploration complexity bounds. In Proceedings of the 27th International Conference on Machine Learning, pages 1031–1038, USA, 2010. Omnipress. [475]Google Scholar
Takimoto, E. and Warmuth, M. K.. Path kernels and multiplicative updates. Journal of Machine Learning Research, 4:773–818, 2003. [329]Google Scholar
Talagrand, M.. The missing factor in Hoeffding’s inequalities. Annales de l’IHP Probabilités et Statistiques, 31(4):689–702, 1995. [65]Google Scholar
Taraldsen, G.. Optimal learning from the Doob-Dynkin lemma. arXiv:1801.00974, 2018. [30]Google Scholar
Teevan, J., Dumais, S. T., and Horvitz, E.. Characterizing the value of personalizing search. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 757–758, New York, NY, USA, 2007. ACM. [352]Google Scholar
Tewari, A. and Bartlett, P. L.. Optimistic linear programming gives logarithmic regret for irreducible MDPs. In Advances in Neural Information Processing Systems, pages 1505–1512. Curran Associates, Inc., 2008. [474]Google Scholar
Tewari, A. and Murphy, S. A.. From ads to interventions: Contextual bandits in mobile health. In Mobile Health, pages 495–517. Springer, 2017. [201]Google Scholar
Theocharous, G., Wen, Z., Abbasi-Yadkori, Y., and Vlassis, N.. Posterior sampling for large scale reinforcement learning. arXiv:1711.07979, 2017. [475]Google Scholar
Thompson, W.. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933. [10, 55, 79, 404, 414, 416]Google Scholar
Thompson, W. R.. On the theory of apportionment. American Journal of Mathematics, 57(2):450–456, 1935. [473]Google Scholar
Todd, M. J.. Minimum-volume ellipsoids: Theory and algorithms. SIAM, 2016. [235]Google Scholar
Tolkien, J. R. R.. The Hobbit. Ballantine Books, 1937. [404]Google Scholar
Tran-Thanh, L., Chapman, A., Munoz de Cote, E., Rogers, A., and Jennings, N. R.. Epsilon-first policies for budget-limited multi-armed bandits. In Proceedings of the 24th AAAI Conference on Artificial Intelligence, AAAI, pages 1211–1216, 2010. [315]Google Scholar
Tran-Thanh, L., Chapman, A., Rogers, A., and Jennings, N. R.. Knapsack based optimal policies for budget-limited multi-armed bandits. In Proceedings of the 26th AAAI Conference on Artificial Intelligence, AAAI’12, pages 1134–1140. AAAI Press, 2012. [315]Google Scholar
Tropp, J. A.. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 8(1-2):1–230, 2015. [66]Google Scholar
Tsitsiklis, J. N.. A short proof of the Gittins index theorem. The Annals of Applied Probability, pages 194–199, 1994. [401]Google Scholar
Tsybakov, A. B.. Introduction to nonparametric estimation. Springer Science & Business Media, 2008. [167]Google Scholar
Ionescu Tulcea, C.. Mesures dans les espaces produits. Atti Accademia Nazionale Lincei Rend., 7:208–211, 1949–50. [43]Google Scholar
Uchibe, E. and Doya, K.. Competitive-cooperative-concurrent reinforcement learning with importance sampling. In Proceedings of the International Conference on Simulation of Adaptive Behavior: From Animals and Animats, pages 287–296, 2004. [149]Google Scholar
Valko, M.. Bandits on graphs and structures, 2016. [217, 316]Google Scholar
Valko, M., Carpentier, A., and Munos, R.. Stochastic simultaneous optimistic optimization. In Proceedings of the 30th International Conference on Machine Learning, pages 19–27, Atlanta, GA, USA, 2013a. JMLR.org. [364]Google Scholar
Valko, M., Korda, N., Munos, R., Flaounas, I., and Cristianini, N.. Finite-time analysis of kernelised contextual bandits. In Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence, pages 654–663, Arlington, VA, USA, 2013b. AUAI Press. [214]Google Scholar
Valko, M., Munos, R., Kveton, B., and Kocák, T.. Spectral bandits for smooth graph functions. In Proceedings of the 31st International Conference on Machine Learning, pages 46–54, Beijing, China, 2014. JMLR.org. [214, 217, 238]Google Scholar
van de Geer, S.. Empirical Processes in M-estimation, volume 6. Cambridge University Press, 2000. [66, 226, 300]Google Scholar
van der Hoeven, D., van Erven, T., and Kotłowski, W.. The many faces of exponential weights in online learning. In Proceedings of the 31st Conference on Learning Theory, pages 2067–2092, 2018. [284]Google Scholar
van der Vaart, A. W. and Wellner, J. A.. Weak Convergence and Empirical Processes. Springer, New York, 1996. [300]Google Scholar
Vanchinathan, H. P., Bartók, G., and Krause, A.. Efficient partial monitoring with prior information. In Advances in Neural Information Processing Systems, pages 1691–1699. Curran Associates, Inc., 2014. [448]Google Scholar
Vapnik, V.. Statistical learning theory, volume 3. Wiley, New York, 1998. [204]Google Scholar
Varaiya, P., Walrand, J., and Buyukkoc, C.. Extensions of the multiarmed bandit problem: The discounted case. IEEE Transactions on Automatic Control, 30(5):426–439, 1985. [402]Google Scholar
Vernade, C., Cappé, O., and Perchet, V.. Stochastic bandit models for delayed conversions. In Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence. AUAI Press, 2017. [316]Google Scholar
Vernade, C., Carpentier, A., Zappella, G., Ermis, B., and Brueckner, M.. Contextual bandits under delayed feedback. arXiv:1807.02089, 2018. [316]Google Scholar
Villar, S., Bowden, J., and Wason, J.. Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges. Statistical Science: a review journal of the Institute of Mathematical Statistics, 30(2):199–215, 2015. [11]Google Scholar
Vogel, W.. An asymptotic minimax theorem for the two armed bandit problem. The Annals of Mathematical Statistics, 31(2):444–451, 1960. [174]Google Scholar
von Neumann, J.. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100(1):295–320, 1928. [301]Google Scholar
Vovk, V. G.. Aggregating strategies. Proceedings of Computational Learning Theory, 1990. [125, 137]Google Scholar
Wang, S. and Chen, W.. Thompson sampling for combinatorial semi-bandits. In Proceedings of the 35th International Conference on Machine Learning, pages 5114–5122, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. JMLR.org. [328, 417]Google Scholar
Wang, Y., Audibert, J-Y., and Munos, R.. Algorithms for infinitely many-armed bandits. In Advances in Neural Information Processing Systems, pages 1729–1736, 2009. [314]Google Scholar
Warmuth, M. K. and Jagota, A.. Continuous and discrete-time nonlinear gradient descent: Relative loss bounds and convergence. In Electronic Proceedings of the 5th International Symposium on Artificial Intelligence and Mathematics, 1997. [301]Google Scholar
Wawrzynski, P. L. and Pacut, A.. Truncated importance sampling for reinforcement learning with experience replay. In Proceedings of the International Multiconference on Computer Science and Information Technology, pages 305–315, 2007. [149]Google Scholar
Weber, R.. On the Gittins index for multiarmed bandits. The Annals of Applied Probability, 2(4):1024–1033, 1992. [401]Google Scholar
Weber, R. and Weiss, G.. On an index policy for restless bandits. Journal of Applied Probability, 27(3):637–648, 1990. [402]Google Scholar
Wei, C-Y. and Luo, H.. More adaptive algorithms for adversarial bandits. In Proceedings of the 31st Conference On Learning Theory, pages 1263–1291. JMLR.org, 06–09 Jul 2018. [299, 301, 304]Google Scholar
Weinberger, M. J. and Ordentlich, E.. On delayed prediction of individual sequences. In Proceedings of the 2002 IEEE International Symposium on Information Theory, page 148. IEEE, 2002. [316]Google Scholar
Wen, Z., Kveton, B., and Ashkan, A.. Efficient learning in large-scale combinatorial semi-bandits. In Proceedings of the 32nd International Conference on Machine Learning, volume 37, pages 1113–1122, Lille, France, 2015. JMLR.org. [328]Google Scholar
Whittle, P.. Multi-armed bandits and the Gittins index. Journal of the Royal Statistical Society (B), pages 143–149, 1980. [401]Google Scholar
Whittle, P.. Restless bandits: Activity allocation in a changing world. Journal of Applied Probability, 25(A):287–298, 1988. [337, 402]Google Scholar
Williams, D.. Probability with martingales. Cambridge University Press, 1991. [32]Google Scholar
Wu, H. and Liu, X.. Double Thompson sampling for dueling bandits. In Advances in Neural Information Processing Systems, pages 649–657. Curran Associates, Inc., 2016. [315]Google Scholar
Wu, Y., György, A., and Szepesvári, Cs.. Online learning with Gaussian payoffs and side observations. In Advances in Neural Information Processing Systems, pages 1360–1368. Curran Associates Inc., 2015. [448]Google Scholar
Wu, Y., Shariff, R., Lattimore, T., and Szepesvári, Cs.. Conservative bandits. In Proceedings of the 33rd International Conference on Machine Learning, pages 1254–1262, New York, NY, USA, 20–22 Jun 2016. JMLR.org. [315]Google Scholar
Wynn, H. P.. The sequential generation of D-optimum experimental designs. The Annals of Mathematical Statistics, pages 1655–1664, 1970. [235]Google Scholar
Xia, Y., Li, H., Qin, T., Yu, N., and Liu, T.-Y.. Thompson sampling for budgeted multi-armed bandits. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI, pages 3960–3966. AAAI Press, 2015. [315]Google Scholar
Yao, Y.. Some results on the Gittins index for a normal reward process. In Time Series and Related Topics, pages 284–294. Institute of Mathematical Statistics, 2006. [402]Google Scholar
Yu, B.. Assouad, Fano, and Le Cam. In D. Pollard, E. Torgersen, and G. L. Yang, editors, Festschrift for Lucien Le Cam: Research Papers in Probability and Statistics, pages 423–435. Springer, 1997. [174, 175]Google Scholar
Yue, Y. and Joachims, T.. Interactively optimizing information retrieval systems as a dueling bandits problem. In Proceedings of the 26th International Conference on Machine Learning, pages 1201–1208. ACM, 2009. [315]Google Scholar
Yue, Y. and Joachims, T.. Beat the mean bandit. In Proceedings of the 28th International Conference on Machine Learning, pages 241–248, New York, NY, USA, June 2011. ACM. [315]Google Scholar
Yue, Y., Broder, J., Kleinberg, R., and Joachims, T.. The k-armed dueling bandits problem. In Proceedings of the 22nd Conference on Learning Theory, 2009. [315]Google Scholar
Zimmert, J. and Lattimore, T.. Connections between mirror descent, Thompson sampling and the information ratio. In Advances in Neural Information Processing Systems, pages 11973–11982. Curran Associates, Inc., 2019. [416]Google Scholar
Zimmert, J. and Seldin, Y.. An optimal algorithm for stochastic and adversarial bandits. In AISTATS, pages 467–475, 2019. [136, 305, 315]Google Scholar
Zimmert, J., Luo, H., and Wei, C-Y.. Beating stochastic and adversarial semi-bandits optimally and simultaneously. In Proceedings of the 36th International Conference on Machine Learning, pages 7683–7692, Long Beach, California, USA, 09–15 Jun 2019. JMLR.org. [305]Google Scholar
Zinkevich, M.. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, pages 928–935. AAAI Press, 2003. [301]Google Scholar
Zoghi, M., Whiteson, S., Munos, R., and Rijke, M.. Relative upper confidence bound for the k-armed dueling bandit problem. In Proceedings of the 31st International Conference on Machine Learning, pages 10–18, Beijing, China, 2014. JMLR.org. [315]Google Scholar
Zoghi, M., Karnin, Z., Whiteson, S., and Rijke, M.. Copeland dueling bandits. In Advances in Neural Information Processing Systems, pages 307–315. Curran Associates, Inc., 2015. [315]Google Scholar
Zoghi, M., Tunys, T., Ghavamzadeh, M., Kveton, B., Szepesvári, Cs., and Wen, Z.. Online learning to rank in stochastic click models. In Proceedings of the 34th International Conference on Machine Learning, pages 4199–4208. JMLR.org, 2017. [351]Google Scholar
Zong, S., Ni, H., Sung, K., Ke, R. N., Wen, Z., and Kveton, B.. Cascading bandits for large-scale recommendation problems. In Proceedings of the 32nd Conference on Uncertainty in Artificial Intelligence, 2016. [350, 351]Google Scholar
