In the context of a deterministic, Lipschitz-continuous environment with a continuous state space, a finite action space, and a finite optimization horizon, we propose an algorithm of polynomial complexity which exploits weak prior knowledge about its environment to compute, from a given sample of trajectories and for a given initial state, a sequence of actions. The proposed Viterbi-like algorithm maximizes a recently proposed lower bound on the return that depends on the initial state, and to this end uses prior knowledge about the environment provided in the form of upper bounds on its Lipschitz constants. It thereby avoids, in a way that depends on the initial state and on the prior knowledge, those regions of the state space where the sample is too sparse to allow safe generalization. Our experiments show that it can lead to more cautious policies than algorithms combining dynamic programming with function approximators. We also give a condition on the sample sparsity ensuring that, for a given initial state, the proposed algorithm produces an optimal open-loop sequence of actions.
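A minimal sketch of how such a Viterbi-like maximization can be organized is given below. It assumes the sample is provided as a list of one-step transitions (x, u, r, y), that distances are Euclidean, and that the distance penalty applied when n steps of the horizon remain is weighted by L_rho * (1 + L_f + ... + L_f^(n-1)), where L_f and L_rho stand for the supplied upper bounds on the Lipschitz constants of the dynamics and of the reward function. All identifiers (penalty_weight, cautious_action_sequence, sample, x0, T, L_f, L_rho) are illustrative and not taken from the paper; this is a sketch under the stated assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code) of a Viterbi-like maximization of a
# Lipschitz-penalized lower bound on the T-stage return, computed from a sample of
# one-step transitions. Assumptions: Euclidean distances, a penalty weight of the
# form L_rho * (1 + L_f + ... + L_f^(n-1)), and a sample given as a list of tuples
# (x, u, r, y) with x, y as numpy arrays. All identifiers here are hypothetical.

import numpy as np


def penalty_weight(L_f, L_rho, n):
    # Weight multiplying the distance penalty when n steps of the horizon remain.
    return L_rho * sum(L_f ** i for i in range(n))


def cautious_action_sequence(sample, x0, T, L_f, L_rho):
    """Return an open-loop action sequence maximizing the penalized lower bound.

    Backward dynamic programming over the sample transitions:
    B[t, l] = best contribution of steps t..T-1 when step t uses transition l,
    excluding the penalty for reaching transition l (it depends on the predecessor).
    """
    n = len(sample)
    B = np.zeros((T, n))
    best_next = np.zeros((T, n), dtype=int)

    for t in range(T - 1, -1, -1):
        if t == T - 1:
            for l, (_, _, r_l, _) in enumerate(sample):
                B[t, l] = r_l
            continue
        w = penalty_weight(L_f, L_rho, T - t - 1)  # weight for the jump into step t+1
        for l, (_, _, r_l, y_l) in enumerate(sample):
            # Penalize the distance between the end state of transition l and the
            # start state of each candidate transition for step t+1.
            scores = [B[t + 1, m] - w * np.linalg.norm(y_l - sample[m][0])
                      for m in range(n)]
            best_next[t, l] = int(np.argmax(scores))
            B[t, l] = r_l + scores[best_next[t, l]]

    # First step: penalize the distance between the initial state x0 and the
    # start state of each candidate first transition.
    w0 = penalty_weight(L_f, L_rho, T)
    first_scores = [B[0, l] - w0 * np.linalg.norm(x0 - sample[l][0]) for l in range(n)]
    l = int(np.argmax(first_scores))

    actions = []
    for t in range(T):
        actions.append(sample[l][1])  # action of the chosen transition at step t
        if t < T - 1:
            l = best_next[t, l]
    return actions
```

The backward recursion runs in O(T * |sample|^2) time, in line with the polynomial complexity claimed above, and max(first_scores) is the value of the maximized lower bound for the returned open-loop sequence.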
Disciplines:
Computer science
Author, co-author:
Fonteneau, Raphaël ; Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Systèmes et modélisation
Murphy, Susan
Wehenkel, Louis ; Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Systèmes et modélisation
Ernst, Damien ; Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Systèmes et modélisation
Document language:
English
Title:
A cautious approach to generalization in reinforcement learning
Publication date:
January 2010
Event name:
2nd International Conference on Agents and Artificial Intelligence
Event organizer:
Institute for Systems and Technologies of Information, Control and Communication
Event location:
Valencia, Spain
Event dates:
from 22-01-2010 to 24-01-2010
Event scope:
International
Main work title:
Proceedings of the 2nd International Conference on Agents and Artificial Intelligence
ISBN/EAN:
092312 1713
Pagination:
10
Peer review/selection committee:
Peer reviewed
Funders:
F.R.S.-FNRS - Fonds de la Recherche Scientifique; FRIA - Fonds pour la Formation à la Recherche dans l'Industrie et dans l'Agriculture