This paper presents the Bayesian Optimistic Planning (BOP) algorithm, a novel model-based Bayesian reinforcement learning approach. BOP extends the planning approach of the Optimistic Planning for Markov Decision Processes (OP-MDP) algorithm [Busoniu2011,Busoniu2012] to contexts where the transition model of the MDP is initially unknown and progressively learned through interactions with the environment. The knowledge about the unknown MDP is represented as a probability distribution over all possible transition models using Dirichlet distributions, and BOP plans in the belief-augmented state space constructed by concatenating the original state vector with the current posterior distribution over transition models. We show that BOP becomes Bayes-optimal as the budget parameter grows to infinity. Preliminary empirical validation shows promising performance.
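To make the belief representation concrete, the sketch below shows one way to maintain a belief-augmented state of the kind described in the abstract: the current MDP state paired with Dirichlet pseudo-counts over the successor-state distribution of each state-action pair. This is a minimal illustrative sketch, not the paper's implementation; the class name, method names, and symmetric prior are assumptions.

```python
# Minimal illustrative sketch (assumptions, not the paper's code): a belief-
# augmented state pairing the current MDP state with Dirichlet pseudo-counts
# over the successor-state distribution of every (state, action) pair.
from collections import defaultdict


class BeliefAugmentedState:
    def __init__(self, state, n_states, prior=1.0):
        self.state = state
        # One Dirichlet per (state, action), stored as pseudo-counts;
        # a symmetric prior of `prior` is assumed for every successor state.
        self.counts = defaultdict(lambda: [prior] * n_states)

    def observe(self, state, action, next_state):
        """Bayesian posterior update after observing one transition."""
        self.counts[(state, action)][next_state] += 1.0
        self.state = next_state

    def posterior_mean(self, state, action):
        """Posterior mean of the transition distribution P(. | state, action)."""
        c = self.counts[(state, action)]
        total = sum(c)
        return [ci / total for ci in c]


# Example: two-state MDP, one observed transition under action 0.
belief = BeliefAugmentedState(state=0, n_states=2)
belief.observe(state=0, action=0, next_state=1)
print(belief.posterior_mean(0, 0))  # [1/3, 2/3] with a symmetric prior of 1.0
```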
Disciplines :
Computer science
Author, co-author :
Fonteneau, Raphaël; Université de Liège - ULiège > Department of Electrical Engineering and Computer Science (Montefiore Institute) > Systems and Modeling
Busoniu, Lucian; Technical University of Cluj-Napoca
Munos, Rémi; Inria Lille - Nord Europe
Language :
English
Title :
Optimistic Planning for Belief-Augmented Markov Decision Processes
Publication date :
2013
Event name :
IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL 2013)
Event place :
Singapore, Singapore
Event date :
April 16-19, 2013
Audience :
International
Main work title :
Proceedings 2013 Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL-13), Singapore, 15–19 April 2013
M. Araya, V. Thomas, and O. Buffet. Near-optimal BRL using optimistic local transitions. In International Conference on Machine Learning (ICML), 2012.
J. Asmuth, L. Li, M.L. Littman, A. Nouri, and D. Wingate. A Bayesian sampling approach to exploration in reinforcement learning. In Uncertainty in Artificial Intelligence (UAI), pages 19-26, 2009.
J. Asmuth and M.L. Littman. Approaching Bayes-optimality using Monte-Carlo tree search. In International Conference on Automated Planning and Scheduling (ICAPS), Freiburg, Germany, 2011.
P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235-256, 2002.
R. Bellman. Dynamic Programming. Princeton University Press, 1957.
R.I. Brafman and M. Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3:213-231, 2003.
S. Bubeck and R. Munos. Open loop optimistic planning. In Conference on Learning Theory (COLT), pages 477-489, 2010.
S. Bubeck, R. Munos, G. Stoltz, and C. Szepesvári. Online optimization in X-armed bandits. In Neural Information Processing Systems (NIPS), pages 201-208, 2009.
L. Busoniu and R. Munos. Optimistic planning for Markov decision processes. In International Conference on Artificial Intelligence and Statistics (AISTATS), JMLR W&CP 22, pages 182-189, 2012.
L. Busoniu, R. Munos, B. De Schutter, and R. Babuska. Optimistic planning for sparsely stochastic systems. In IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pages 48-55, 2011.
P. Castro and D. Precup. Smarter sampling in model-based Bayesian reinforcement learning. Machine Learning and Knowledge Discovery in Databases, pages 200-214, 2010.
M. Castronovo, F. Maes, R. Fonteneau, and D. Ernst. Learning exploration/exploitation strategies for single trajectory reinforcement learning. In European Workshop on Reinforcement Learning (EWRL), 2012.
R. Coulom. Efficient selectivity and backup operators in Monte-Carlo tree search. Computers and Games, pages 72-83, 2007.
R. Dearden, N. Friedman, and S. Russell. Bayesian Q-learning. In National Conference on Artificial Intelligence, pages 761-768, 1998.
C. Dimitrakakis. Tree exploration for Bayesian RL exploration. In International Conference on Computational Intelligence for Modelling, Control & Automation, pages 1029-1034, 2008.
C. Dimitrakakis and M. G. Lagoudakis. Rollout sampling approximate policy iteration. Machine Learning, 72:157-171, 2008.
M.O.G. Duff. Optimal Learning: Computational procedures for Bayes-adaptive Markov decision processes. PhD thesis, University of Massachusetts Amherst, 2002.
A.A. Feldbaum. Dual control theory. Automation and Remote Control, 21(9):874-1039, 1960.
S. Gelly, Y. Wang, R. Munos, and O. Teytaud. Modification of UCT with patterns in Monte-Carlo Go. Technical report, INRIA RR-6062, 2006.
A. Guez, D. Silver, and P. Dayan. Efficient Bayes-adaptive reinforcement learning using sample-based search. In Neural Information Processing Systems (NIPS), 2012.
J.F. Hren and R. Munos. Optimistic planning of deterministic systems. Recent Advances in Reinforcement Learning, pages 151-164, 2008.
J.E. Ingersoll. Theory of Financial Decision Making. Rowman and Littlefield Publishers, Inc., 1987.
T. Jaksch, R. Ortner, and P. Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563-1600, 2010.
L. Kocsis and C. Szepesvári. Bandit based Monte-Carlo planning. In Machine Learning: ECML 2006, pages 282-293, 2006.
J.Z. Kolter and A.Y. Ng. Near-Bayesian exploration in polynomial time. In International Conference on Machine Learning (ICML), pages 513-520, 2009.
R. Munos. Optimistic optimization of a deterministic function without the knowledge of its smoothness. In Neural Information Processing Systems (NIPS), 2011.
R. Munos. The optimistic principle applied to games, optimization and planning: towards foundations of Monte-Carlo tree search. Technical report, 2012.
S.A. Murphy. Optimal dynamic treatment regimes. Journal of the Royal Statistical Society, Series B, 65(2):331-366, 2003.
R. Ortner and P. Auer. Logarithmic online regret bounds for undiscounted reinforcement learning. In Neural Information Processing Systems (NIPS), 2007.
J. Peters, S. Vijayakumar, and S. Schaal. Reinforcement learning for humanoid robotics. In IEEE-RAS International Conference on Humanoid Robots, pages 1-20, 2003.
P. Poupart, N. Vlassis, J. Hoey, and K. Regan. An analytic solution to discrete Bayesian reinforcement learning. In International Conference on Machine Learning (ICML), pages 697-704, 2006.
M. Riedmiller. Neural fitted Q iteration - first experiences with a data efficient neural reinforcement learning method. In European Conference on Machine Learning (ECML), pages 317-328, 2005.
D. Silver and J. Veness. Monte-Carlo planning in large POMDPs. In Neural Information Processing Systems (NIPS), 2010.
J. Sorg, S. Singh, and R.L. Lewis. Variance-based rewards for approximate Bayesian reinforcement learning. In Uncertainty in Artificial Intelligence (UAI), 2010.
M. Strens. A Bayesian framework for reinforcement learning. In International Conference on Machine Learning (ICML), pages 943-950, 2000.
R.S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9-44, 1988.
R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
T.J. Walsh, S. Goschin, and M.L. Littman. Integrating sample-based planning and model-based reinforcement learning. In AAAI Conference on Artificial Intelligence (AAAI), 2010.
A. Weinstein and M.L. Littman. Bandit-based planning and learning in continuous-action Markov decision processes. In International Conference on Automated Planning and Scheduling (ICAPS), 2012.