This paper, together with the two papers "Optimized look-ahead tree search policies" and "Automatic discovery of ranking formulas for playing with multi-armed bandits", is part of a body of work on the automatic learning of good strategies for exploration-exploitation problems in reinforcement learning. This entry presents that body of work.
[en] We propose a learning approach to pre-compute K-armed bandit playing policies by exploiting prior information describing the class of problems targeted by the player. Our algorithm first samples a set of K-armed bandit problems from the given prior, and then chooses, from a space of candidate policies, the one that gives the best average performance over these problems. The candidate policies use an index for ranking the arms and pick at each play the arm with the highest index; the index of each arm is computed as a linear combination of features describing the history of plays (e.g., number of draws, average reward, variance of rewards and higher-order moments), and an estimation of distribution algorithm is used to determine the optimal parameters of this combination in the form of feature weights. We carry out simulations in the case where the prior assumes a fixed number of Bernoulli arms, a fixed horizon, and uniformly distributed parameters of the Bernoulli arms. These simulations show that the learned strategies perform very well with respect to several other strategies previously proposed in the literature (UCB1, UCB2, UCB-V, KL-UCB and $\epsilon_n$-GREEDY); they also highlight the robustness of the learned strategies with respect to wrong prior information.
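As a rough illustration of the approach described in the abstract, the Python sketch below plays Bernoulli bandit problems with an index policy whose per-arm index is a linear combination of three history features (an exploration term based on the number of draws, the empirical mean, and the empirical standard deviation of rewards), and tunes the feature weights with a simple cross-entropy-style estimation of distribution algorithm on problems drawn from a uniform prior. The particular feature set, the Gaussian search distribution, the parameter values, the use of cumulative regret as the performance measure, and all function names are illustrative assumptions, not the authors' implementation.

# Minimal sketch (not the authors' code): learning a linear index policy
# for Bernoulli K-armed bandits with a cross-entropy-style EDA.
import numpy as np

def play_policy(weights, means, horizon, rng):
    """Play one Bernoulli bandit problem with a linear index policy.

    Per-arm features: 1/sqrt(draws), empirical mean, empirical std.
    Returns the cumulative regret over the horizon.
    """
    k = len(means)
    draws = np.zeros(k)
    sums = np.zeros(k)
    sq_sums = np.zeros(k)
    regret = 0.0
    for t in range(horizon):
        if t < k:                          # play each arm once to initialize
            arm = t
        else:
            mean = sums / draws
            var = np.maximum(sq_sums / draws - mean ** 2, 0.0)
            feats = np.stack([1.0 / np.sqrt(draws), mean, np.sqrt(var)])
            arm = int(np.argmax(weights @ feats))   # highest-index arm
        reward = float(rng.random() < means[arm])   # Bernoulli draw
        draws[arm] += 1
        sums[arm] += reward
        sq_sums[arm] += reward ** 2
        regret += max(means) - means[arm]
    return regret

def learn_weights(k=10, horizon=100, n_problems=100, n_iters=10,
                  pop_size=30, elite_frac=0.2, seed=0):
    """Cross-entropy search for feature weights minimizing mean regret."""
    rng = np.random.default_rng(seed)
    # Prior from the abstract: Bernoulli arms with uniformly drawn parameters.
    problems = rng.random((n_problems, k))
    mu, sigma = np.zeros(3), np.ones(3)
    for _ in range(n_iters):
        candidates = rng.normal(mu, sigma, size=(pop_size, 3))
        scores = [np.mean([play_policy(w, p, horizon, rng) for p in problems])
                  for w in candidates]
        elite = candidates[np.argsort(scores)[:int(elite_frac * pop_size)]]
        # Refit the search distribution on the best-performing weight vectors.
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-3
    return mu

if __name__ == "__main__":
    w = learn_weights()
    print("learned feature weights:", w)

The learned weight vector can then be plugged back into play_policy on freshly sampled problems to estimate its performance, which is the spirit of the paper's comparison against fixed index policies such as UCB1.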
Disciplines :
Computer science
Author, co-author :
Maes, Francis ; Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Systèmes et modélisation
Wehenkel, Louis ; Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Systèmes et modélisation
Ernst, Damien ; Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Smart grids
Language :
English
Title :
Learning to play K-armed bandit problems
Publication date :
February 2012
Event name :
4th International Conference on Agents and Artificial Intelligence (ICAART 2012)
Event place :
Vilamoura, Algarve, Portugal
Event date :
6-8 February 2012
Audience :
International
Main work title :
Proceedings of the 4th International Conference on Agents and Artificial Intelligence (ICAART 2012)
Agrawal, R. (1995). Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27:1054-1078.
Audibert, J., Munos, R., and Szepesvári, C. (2007). Tuning bandit algorithms in stochastic environments. In Algorithmic Learning Theory (ALT), pages 150-165.
Audibert, J., Munos, R., and Szepesvári, C. (2008). Exploration-exploitation trade-off using variance estimates in multi-armed bandits. Theoretical Computer Science.
Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multi-armed bandit problem. Machine Learning, 47:235-256.
Garivier, A. and Cappé, O. (2011). The KL-UCB algorithm for bounded stochastic bandits and beyond. CoRR, abs/1102.2490.
Gonzalez, C., Lozano, J., and Larrañaga, P. (2002). Estimation of Distribution Algorithms. A New Tool for Evolutionary Computation, pages 99-124. Kluwer Academic Publishers.
Ishii, S., Yoshida, W., and Yoshimoto, J. (2002). Control of exploitation-exploration meta-parameter in reinforcement learning. Neural Networks, 15:665-687.
Lai, T. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4-22.
Mersereau, A., Rusmevichientong, P., and Tsitsiklis, J. (2009). A structured multiarmed bandit problem and the greedy policy. IEEE Trans. Automatic Control, 54:2787-2802.
Pelikan, M. and Mühlenbein, H. (1998). Marginal distributions in evolutionary algorithms. In Proceedings of the International Conference on Genetic Algorithms Mendel '98, pages 90-95, Brno, Czech Republic.
Robbins, H. (1952). Some aspects of the sequential design of experiments. Bulletin of The American Mathematical Society, 58:527-536.
Rubinstein, R. and Kroese, D. (2004). The cross-entropy method: a unified approach to combinatorial optimization, Monte-Carlo simulation, and machine learning. Springer, New York.
Sutton, R. and Barto, A. (1998). Reinforcement Learning: An Introduction. MIT Press.