Keywords :
reinforcement learning; direct policy search; cross-entropy optimization
Abstract :
[en] This paper introduces a novel algorithm for approximate policy search in continuous-state, discrete-action Markov decision processes (MDPs). Previous policy search approaches have typically used ad-hoc parameterizations developed for specific MDPs. In contrast, the novel algorithm employs a flexible policy parameterization, suitable for solving general discrete-action MDPs. The algorithm looks for the best closed-loop policy that can be represented using a given number of basis functions, where a discrete action is assigned to each basis function. The locations and shapes of the basis functions are optimized, together with the action assignments. This allows a large class of policies to be represented. The optimization is carried out with the cross-entropy method and evaluates the policies by their empirical return from a representative set of initial states. We report simulation experiments in which the algorithm reliably obtains good policies with only a small number of basis functions, albeit at sizable computational costs.
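To make the approach described above concrete, the sketch below outlines one plausible form of the cross-entropy policy-search loop: Gaussian radial basis functions whose centers and widths are sampled from Gaussian cross-entropy distributions, a discrete action per basis function sampled from a categorical distribution, and each candidate policy scored by its empirical discounted return from a representative set of initial states. This is a minimal illustrative sketch, not the authors' implementation; the simulator interface (simulate returning state, reward, done), the max-activation action rule, and all constants and names are assumptions.

```python
# Illustrative sketch of cross-entropy policy search with optimized basis functions.
# All names, constants, and the simulator interface are assumptions, not the paper's code.
import numpy as np

N_BASIS = 8          # number of basis functions (the paper uses only a small number)
STATE_DIM = 2        # continuous state dimension (assumed)
N_ACTIONS = 3        # size of the discrete action set (assumed)
N_SAMPLES = 100      # candidate policies per cross-entropy iteration
ELITE_FRAC = 0.1     # fraction of best candidates used to refit the distribution
N_ITER = 50
GAMMA = 0.95
HORIZON = 200

rng = np.random.default_rng(0)

def policy_action(state, centers, widths, actions):
    """Return the action assigned to the most strongly activated Gaussian basis."""
    act = np.exp(-np.sum(((state - centers) / widths) ** 2, axis=1))
    return actions[np.argmax(act)]

def empirical_return(centers, widths, actions, simulate, initial_states):
    """Average discounted return over a representative set of initial states."""
    total = 0.0
    for s0 in initial_states:
        s, ret, discount = s0, 0.0, 1.0
        for _ in range(HORIZON):
            a = policy_action(s, centers, widths, actions)
            s, r, done = simulate(s, a)   # assumed simulator interface
            ret += discount * r
            discount *= GAMMA
            if done:
                break
        total += ret
    return total / len(initial_states)

def ce_policy_search(simulate, initial_states, state_low, state_high):
    # Cross-entropy distributions: Gaussians over basis centers and widths,
    # independent categorical distributions over the action assignments.
    mu_c = rng.uniform(state_low, state_high, (N_BASIS, STATE_DIM))
    sigma_c = np.full((N_BASIS, STATE_DIM), (state_high - state_low) / 2.0)
    mu_w = np.full((N_BASIS, STATE_DIM), (state_high - state_low) / 4.0)
    sigma_w = mu_w / 2.0
    p_act = np.full((N_BASIS, N_ACTIONS), 1.0 / N_ACTIONS)

    n_elite = max(1, int(ELITE_FRAC * N_SAMPLES))
    best = (None, -np.inf)
    for _ in range(N_ITER):
        samples, scores = [], []
        for _ in range(N_SAMPLES):
            centers = rng.normal(mu_c, sigma_c)
            widths = np.abs(rng.normal(mu_w, sigma_w)) + 1e-6
            actions = np.array([rng.choice(N_ACTIONS, p=p_act[i])
                                for i in range(N_BASIS)])
            score = empirical_return(centers, widths, actions,
                                     simulate, initial_states)
            samples.append((centers, widths, actions))
            scores.append(score)
            if score > best[1]:
                best = ((centers, widths, actions), score)
        # Refit the sampling distributions on the elite candidates.
        elite = [samples[i] for i in np.argsort(scores)[-n_elite:]]
        mu_c = np.mean([e[0] for e in elite], axis=0)
        sigma_c = np.std([e[0] for e in elite], axis=0) + 1e-6
        mu_w = np.mean([e[1] for e in elite], axis=0)
        sigma_w = np.std([e[1] for e in elite], axis=0) + 1e-6
        for i in range(N_BASIS):
            counts = np.bincount([e[2][i] for e in elite], minlength=N_ACTIONS)
            p_act[i] = counts / counts.sum()
    return best
```

In this sketch a candidate policy selects, at each state, the action assigned to its most activated basis function, and the elite fraction of sampled candidates is used to refit the sampling distributions, following standard cross-entropy optimization; the paper's exact policy representation and update rules may differ in detail.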
Disciplines :
Computer science
Author, co-author :
Busoniu, Lucian
Ernst, Damien ; Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Systèmes et modélisation
De Schutter, Bart
Babuska, Robert
Language :
English
Title :
Policy search with cross-entropy optimization of basis functions
Publication date :
2009
Event name :
IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL-09)
Event place :
Nashville, United States
Event date :
March 30 - April 2, 2009
Audience :
International
Main work title :
Proceedings of the IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL-09)