Inferring bounds on the performance of a control policy from a sample of trajectories

Fonteneau, Raphaël; Murphy, Susan; Wehenkel, Louis; Ernst, Damien

doi:10.1109/ADPRL.2009.4927534

Paper published in a book (Scientific congresses and symposiums)

2009 • In Proceedings of the IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL-09)

Peer reviewed

Permalink
https://hdl.handle.net/2268/13667

Files

Full Text

bounds-trajectories-adprl.pdf

Publisher postprint (255.42 kB)

Annexes

RL-TU-Delft-2009.pdf

Publisher postprint (299.94 kB)

Details

Keywords :

reinforcement learning; model-free; lower bound on a policy; performance guarantee

Abstract :

We propose an approach for inferring bounds on the finite-horizon return of a control policy from an off-policy sample of trajectories that collects state transitions, rewards, and control actions. In this paper, the dynamics, control policy, and reward function are assumed to be deterministic and Lipschitz continuous. Under these assumptions, we derive an algorithm that computes these bounds in time polynomial in the sample size and the length of the optimization horizon, and we characterize the tightness of the bounds in terms of the sample density.
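
To make the idea in the abstract concrete, here is a minimal Python sketch of one way such a lower bound could be computed. It is an illustration under stated assumptions, not necessarily the exact algorithm of the paper. The assumptions are: each sample element is a one-step transition (x, u, r, y) with y = f(x, u) and r = rho(x, u); the dynamics f, reward rho, and policy h have known Lipschitz constants L_f, L_rho, and L_h with respect to the Euclidean norm; and the bound is obtained by chaining Lipschitz inequalities along a sequence of sample transitions, then maximizing over sequences with a Viterbi-style dynamic program. All function and variable names are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
# One-step transition (x, u, r, y): state, action, reward, next state.
Transition = Tuple[Vector, Vector, float, Vector]


def lipschitz_q_constants(L_f: float, L_rho: float, L_h: float,
                          horizon: int) -> List[float]:
    """Return [L_Q_1, ..., L_Q_T] (assumed recursion, see lead-in):
    L_Q_1 = L_rho and L_Q_N = L_rho + L_f * (1 + L_h) * L_Q_{N-1}."""
    consts = [L_rho]
    for _ in range(horizon - 1):
        consts.append(L_rho + L_f * (1.0 + L_h) * consts[-1])
    return consts


def lower_bound(sample: List[Transition],
                policy: Callable[[int, Vector], Vector],
                x0: Vector, horizon: int,
                L_f: float, L_rho: float, L_h: float) -> float:
    """Maximize, over sequences of sample transitions, the sum of rewards
    minus Lipschitz penalties for the gaps between the sample and the
    trajectory the policy would follow. Cost: O(horizon * len(sample)**2)."""
    L_Q = lipschitz_q_constants(L_f, L_rho, L_h, horizon)

    # Step 0: compare each transition with (x0, policy(0, x0)).
    u0 = policy(0, x0)
    best = [r - L_Q[horizon - 1] * (math.dist(x, x0) + math.dist(u, u0))
            for (x, u, r, _y) in sample]

    for t in range(1, horizon):
        L = L_Q[horizon - t - 1]  # constant for the remaining T - t steps
        # Action the policy would take at the end state of each transition.
        wanted = [policy(t, y) for (_x, _u, _r, y) in sample]
        new_best = []
        for (x, u, r, _y) in sample:
            # Best predecessor i, penalized by how far (x, u) is from
            # (y_i, policy(t, y_i)).
            cand = max(
                best[i] - L * (math.dist(x, sample[i][3])
                               + math.dist(u, wanted[i]))
                for i in range(len(sample))
            )
            new_best.append(r + cand)
        best = new_best

    return max(best)


# Toy usage (hypothetical system): f(x, u) = x + u, rho(x, u) = x,
# constant policy u = 1, so L_f = 1, L_rho = 1, L_h = 0.
def policy(t: int, x: Vector) -> Vector:
    return (1.0,)


sample = [
    ((0.0,), (1.0,), 0.0, (1.0,)),
    ((1.0,), (1.0,), 1.0, (2.0,)),
    ((2.0,), (1.0,), 2.0, (3.0,)),
]
bound = lower_bound(sample, policy, (0.0,), 3, L_f=1.0, L_rho=1.0, L_h=0.0)
```

In this toy case the sample happens to contain the exact trajectory of the policy from x0 = 0, so every penalty along the best sequence is zero and the bound coincides with the true 3-step return, 0 + 1 + 2 = 3. With a sparser sample the penalties grow and the bound loosens, which is consistent with the abstract's statement that tightness depends on the sample density.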

Disciplines :

Computer science

Author, co-author :

Fonteneau, Raphaël ; Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Systèmes et modélisation

Murphy, Susan

Wehenkel, Louis ; Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Systèmes et modélisation

Ernst, Damien ; Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Systèmes et modélisation

Language :

English

Publication date :

2009

Event name :

IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL-09)

Event place :

Nashville, United States

Event date :

March 30 - April 2, 2009

Audience :

International

Main work title :

Proceedings of the IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL-09)

ISBN/EAN :

978-1-4244-2761-1

Pages :

117-123

Peer review/Selection committee :

Peer reviewed

Funders :

F.R.S.-FNRS - Fonds de la Recherche Scientifique

Available on ORBi :

since 03 June 2009
