In the context of a deterministic, Lipschitz-continuous environment with a continuous state space, a finite action space, and a finite optimization horizon, we propose an algorithm of polynomial complexity which exploits weak prior knowledge about its environment to compute, from a given sample of trajectories and for a given initial state, a sequence of actions. The proposed Viterbi-like algorithm maximizes a recently proposed lower bound on the return that depends on the initial state, and to this end uses prior knowledge about the environment provided in the form of upper bounds on its Lipschitz constants. It thereby avoids, in a way that depends on the initial state and on the prior knowledge, those regions of the state space where the sample is too sparse to allow safe generalization. Our experiments show that it can lead to more cautious policies than algorithms combining dynamic programming with function approximators. We also give a condition on the sample sparsity ensuring that, for a given initial state, the proposed algorithm produces an optimal open-loop sequence of actions.
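A minimal sketch of how such a Viterbi-like maximization can be organized is given below. It assumes the sample is provided as a list of one-step transitions (x, u, r, y), that distances are Euclidean, and that the distance penalty applied when n steps of the horizon remain is weighted by L_rho * (1 + L_f + ... + L_f^(n-1)), where L_f and L_rho stand for the supplied upper bounds on the Lipschitz constants of the dynamics and of the reward function. All identifiers (penalty_weight, cautious_action_sequence, sample, x0, T, L_f, L_rho) are illustrative and not taken from the paper; this is a sketch under the stated assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code) of a Viterbi-like maximization of a
# Lipschitz-penalized lower bound on the T-stage return, computed from a sample of
# one-step transitions. Assumptions: Euclidean distances, a penalty weight of the
# form L_rho * (1 + L_f + ... + L_f^(n-1)), and a sample given as a list of tuples
# (x, u, r, y) with x, y as numpy arrays. All identifiers here are hypothetical.

import numpy as np


def penalty_weight(L_f, L_rho, n):
    # Weight multiplying the distance penalty when n steps of the horizon remain.
    return L_rho * sum(L_f ** i for i in range(n))


def cautious_action_sequence(sample, x0, T, L_f, L_rho):
    """Return an open-loop action sequence maximizing the penalized lower bound.

    Backward dynamic programming over the sample transitions:
    B[t, l] = best contribution of steps t..T-1 when step t uses transition l,
    excluding the penalty for reaching transition l (it depends on the predecessor).
    """
    n = len(sample)
    B = np.zeros((T, n))
    best_next = np.zeros((T, n), dtype=int)

    for t in range(T - 1, -1, -1):
        if t == T - 1:
            for l, (_, _, r_l, _) in enumerate(sample):
                B[t, l] = r_l
            continue
        w = penalty_weight(L_f, L_rho, T - t - 1)  # weight for the jump into step t+1
        for l, (_, _, r_l, y_l) in enumerate(sample):
            # Penalize the distance between the end state of transition l and the
            # start state of each candidate transition for step t+1.
            scores = [B[t + 1, m] - w * np.linalg.norm(y_l - sample[m][0])
                      for m in range(n)]
            best_next[t, l] = int(np.argmax(scores))
            B[t, l] = r_l + scores[best_next[t, l]]

    # First step: penalize the distance between the initial state x0 and the
    # start state of each candidate first transition.
    w0 = penalty_weight(L_f, L_rho, T)
    first_scores = [B[0, l] - w0 * np.linalg.norm(x0 - sample[l][0]) for l in range(n)]
    l = int(np.argmax(first_scores))

    actions = []
    for t in range(T):
        actions.append(sample[l][1])  # action of the chosen transition at step t
        if t < T - 1:
            l = best_next[t, l]
    return actions
```

The backward recursion runs in O(T * |sample|^2) time, in line with the polynomial complexity claimed above, and max(first_scores) is the value of the maximized lower bound for the returned open-loop sequence.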
Disciplines:
Computer science
Author, co-author:
Fonteneau, Raphaël ; Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Systèmes et modélisation
Murphy, Susan
Wehenkel, Louis ; Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Systèmes et modélisation
Ernst, Damien ; Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Systèmes et modélisation
Document language:
English
Title:
A cautious approach to generalization in reinforcement learning
Publication date:
January 2010
Event name:
2nd International Conference on Agents and Artificial Intelligence
Event organizer:
Institute for Systems and Technologies of Information, Control and Communication
Event location:
Valencia, Spain
Event dates:
from 22-01-2010 to 24-01-2010
Event scope:
International
Main work title:
Proceedings of the 2nd International Conference on Agents and Artificial Intelligence
ISBN/EAN:
092312 1713
Pagination:
10
Peer review/selection committee:
Peer reviewed
Funders:
F.R.S.-FNRS - Fonds de la Recherche Scientifique; FRIA - Fonds pour la Formation à la Recherche dans l'Industrie et dans l'Agriculture