[en] This paper introduces an algorithm for direct search of control policies in continuous-state, discrete-action Markov decision processes. The algorithm looks for the best closed-loop policy that can be represented using a given number of basis functions (BFs), where a discrete action is assigned to each BF. The type of the BFs and their number are specified in advance and determine the complexity of the representation. Considerable flexibility is achieved by optimizing the locations and shapes of the BFs, together with the action assignments. The optimization is carried out with the cross-entropy method and evaluates the policies by their empirical return from a representative set of initial states. The return for each representative state is estimated using Monte Carlo simulations. The resulting algorithm for cross-entropy policy search with adaptive BFs is extensively evaluated in problems with two to six state variables, for which it reliably obtains good policies with only a small number of BFs. In these experiments, cross-entropy policy search requires vastly fewer BFs than value-function techniques with equidistant BFs, and outperforms policy search with a competing optimization algorithm called DIRECT.
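For illustration, the sketch below shows the kind of loop the abstract describes: Gaussian radial BFs whose centers, widths, and discrete-action assignments are sampled from a parameterized distribution, scored by the discounted return from a set of representative initial states, and the distribution is refit to the elite samples at each cross-entropy iteration. The environment callbacks (env_reset, env_step), the radial-BF policy form, and all numeric settings are assumptions made for illustration rather than the paper's exact implementation, and a single deterministic rollout per initial state stands in for the full Monte Carlo estimate.

```python
import numpy as np

# Minimal sketch of cross-entropy policy search with adaptive radial BFs.
# All parameter values and the environment interface are illustrative assumptions.
N_BF = 5            # number of basis functions
STATE_DIM = 2       # number of state variables
N_ACTIONS = 3       # number of discrete actions
N_SAMPLES = 100     # policies sampled per cross-entropy iteration
N_ELITE = 10        # elite samples used to refit the sampling distribution
HORIZON = 200       # rollout length
GAMMA = 0.98        # discount factor

def policy_action(state, centers, widths, actions):
    """Return the action assigned to the BF with the largest activation at 'state'."""
    act = np.exp(-np.sum(((state - centers) / widths) ** 2, axis=1))
    return actions[np.argmax(act)]

def evaluate(centers, widths, actions, initial_states, env_reset, env_step):
    """Average discounted return over the representative initial states (one rollout each)."""
    total = 0.0
    for x0 in initial_states:
        state = env_reset(x0)          # hypothetical simulator reset
        ret, discount = 0.0, 1.0
        for _ in range(HORIZON):
            u = policy_action(state, centers, widths, actions)
            state, reward = env_step(state, u)   # hypothetical simulator step
            ret += discount * reward
            discount *= GAMMA
        total += ret
    return total / len(initial_states)

def cross_entropy_search(initial_states, env_reset, env_step, n_iter=30):
    # Gaussian distribution over BF centers/widths, categorical over action assignments.
    mu_c, sigma_c = np.zeros((N_BF, STATE_DIM)), np.ones((N_BF, STATE_DIM))
    mu_w, sigma_w = np.ones((N_BF, STATE_DIM)), 0.5 * np.ones((N_BF, STATE_DIM))
    p_act = np.full((N_BF, N_ACTIONS), 1.0 / N_ACTIONS)

    for _ in range(n_iter):
        samples, scores = [], []
        for _ in range(N_SAMPLES):
            centers = mu_c + sigma_c * np.random.randn(N_BF, STATE_DIM)
            widths = np.abs(mu_w + sigma_w * np.random.randn(N_BF, STATE_DIM)) + 1e-6
            actions = np.array([np.random.choice(N_ACTIONS, p=p_act[i]) for i in range(N_BF)])
            scores.append(evaluate(centers, widths, actions,
                                   initial_states, env_reset, env_step))
            samples.append((centers, widths, actions))

        # Refit the sampling distribution to the highest-return (elite) policies.
        elite = np.argsort(scores)[-N_ELITE:]
        ec = np.array([samples[i][0] for i in elite])
        ew = np.array([samples[i][1] for i in elite])
        ea = np.array([samples[i][2] for i in elite])
        mu_c, sigma_c = ec.mean(axis=0), ec.std(axis=0) + 1e-6
        mu_w, sigma_w = ew.mean(axis=0), ew.std(axis=0) + 1e-6
        for i in range(N_BF):
            counts = np.bincount(ea[:, i], minlength=N_ACTIONS)
            p_act[i] = counts / counts.sum()

    return mu_c, mu_w, p_act
```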
Disciplines :
Computer science
Author, co-author :
Busoniu, Lucian
Ernst, Damien ; Université de Liège - ULiège > Dept. of Electrical Engineering, Electronics and Computer Science (Institut Montefiore) > Systems and Modeling
Babuska, Robert
De Schutter, Bart
Language :
English
Title :
Cross-entropy optimization of control policies with adaptive basis functions
Publication date :
February 2011
Journal title :
IEEE Transactions on Systems, Man and Cybernetics. Part B, Cybernetics
ISSN :
1083-4419
eISSN :
1941-0492
Publisher :
Institute of Electrical and Electronics Engineers, New York, NY, United States
Bibliography
D. P. Bertsekas, Dynamic Programming and Optimal Control. 3rd ed. Belmont, MA: Athena Scientific, 2007.
R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Belmont, MA: Athena Scientific, 1996.
J. N. Tsitsiklis and B. Van Roy, "Feature-based methods for large scale dynamic programming," Mach. Learn., vol. 22, no. 1-3, pp. 59-94, Jan.-Mar. 1996.
R. Munos and A. Moore, "Variable-resolution discretization in optimal control," Mach. Learn., vol. 49, no. 2/3, pp. 291-323, Nov. 2002.
M. G. Lagoudakis and R. Parr, "Least-squares policy iteration," J. Mach. Learn. Res., vol. 4, pp. 1107-1149, 2003.
L. Buşoniu, D. Ernst, B. De Schutter, and R. Babuška, "Continuous-state reinforcement learning with fuzzy approximation," in Adaptive Agents and Multi-Agent Systems III, K. Tuyls, A. Nowé, Z. Guessoum, and D. Kudenko, Eds. New York: Springer-Verlag, 2008, ser. Lecture Notes in Computer Science, pp. 27-43.
D. Ernst, P. Geurts, and L. Wehenkel, "Tree-based batch mode reinforcement learning," J. Mach. Learn. Res., vol. 6, pp. 503-556, 2005.
S. Mahadevan and M. Maggioni, "Proto-value functions: A Laplacian framework for learning representation and control in Markov decision processes," J. Mach. Learn. Res., vol. 8, pp. 2169-2231, 2007.
P. Marbach and J. N. Tsitsiklis, "Approximate gradient methods in policy-space optimization of Markov reward processes," Discrete Event Dyn. Syst.: Theory Appl., vol. 13, no. 1/2, pp. 111-148, Jan. 2003.
S. Mannor, R. Y. Rubinstein, and Y. Gat, "The cross-entropy method for fast policy search," in Proc. 20th ICML, Washington, DC, Aug. 21-24, 2003, pp. 512-519.
R. Munos, "Policy gradient in continuous time," J. Mach. Learn. Res., vol. 7, pp. 771-791, 2006.
H. S. Chang, M. C. Fu, J. Hu, and S. I. Marcus, Simulation-Based Algorithms for Markov Decision Processes. New York: Springer-Verlag, 2007.
M. Riedmiller, J. Peters, and S. Schaal, "Evaluation of policy gradient methods and variants on the cart-pole benchmark," in Proc. IEEE Symp. ADPRL, Honolulu, HI, Apr. 1-5, 2007, pp. 254-261.
R. Y. Rubinstein and D. P. Kroese, The Cross Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation, and Machine Learning. New York: Springer-Verlag, 2004.
D. R. Jones, "DIRECT global optimization algorithm," in Encyclopedia of Optimization, C. A. Floudas and P. M. Pardalos, Eds. New York: Springer-Verlag, 2009, pp. 725-735.
A. Y. Ng and M. I. Jordan, "PEGASUS: A policy search method for large MDPs and POMDPs," in Proc. 16th Conf. UAI, Palo Alto, CA, Jun. 30-Jul. 3, 2000, pp. 406-415.
R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, "Policy gradient methods for reinforcement learning with function approximation," in Advances in Neural Information Processing Systems 12, S. A. Solla, T. K. Leen, and K.-R. Müller, Eds. Cambridge, MA: MIT Press, 2000, pp. 1057-1063.
V. R. Konda and J. N. Tsitsiklis, "On actor-critic algorithms," SIAM J. Control Optim., vol. 42, no. 4, pp. 1143-1166, 2003.
J. Peters and S. Schaal, "Natural actor-critic," Neurocomputing, vol. 71, no. 7-9, pp. 1180-1190, Mar. 2008.
D. Liu, H. Javaherian, O. Kovalenko, and T. Huang, "Adaptive critic learning techniques for engine torque and air-fuel ratio control," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 4, pp. 988-993, Aug. 2008.
H. H. Chin and A. A. Jafari, "Genetic algorithm methods for solving the best stationary policy of finite Markov decision processes," in Proc. 30th Southeastern Symp. Syst. Theory, Morgantown, WV, Mar. 8-10, 1998, pp. 538-543.
D. Barash, "A genetic search in policy space for solving Markov decision processes," in Proc. AAAI Spring Symp. Search Techn. Probl. Solving Under Uncertainty Incomplete Inf., Palo Alto, CA, Mar. 22-24, 1999.
S.-M. Tse, Y. Liang, K.-S. Leung, K.-H. Lee, and T. S.-K. Mok, "A memetic algorithm for multiple-drug cancer chemotherapy schedule optimization," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 37, no. 1, pp. 84-91, Feb. 2007.
I. Menache, S. Mannor, and N. Shimkin, "Basis function adaptation in temporal difference reinforcement learning," Ann. Oper. Res., vol. 134, no. 1, pp. 215-238, Feb. 2005.
C. G. Atkeson and B. J. Stephens, "Random sampling of states in dynamic programming," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 4, pp. 924-929, Aug. 2008.
M. G. Lagoudakis and R. Parr, "Reinforcement learning as classification: Leveraging modern classifiers," in Proc. 20th ICML, Washington, DC, Aug. 21-24, 2003, pp. 424-431.
D. P. Bertsekas, "Dynamic programming and suboptimal control: A survey from ADP to MPC," Eur. J. Control-Special Issue for the CDC-ECC-05 in Seville, Spain, vol. 11, no. 4/5, pp. 310-334, 2005.
C. Dimitrakakis and M. Lagoudakis, "Rollout sampling approximate policy iteration," Mach. Learn., vol. 72, no. 3, pp. 157-171, Sep. 2008.
L. Buşoniu, D. Ernst, B. De Schutter, and R. Babuška, "Policy search with cross-entropy optimization of basis functions," in Proc. IEEE Int. Symp. ADPRL, Nashville, TN, Mar. 30-Apr. 2, 2009, pp. 153-160.
J. Rust, "Numerical dynamic programming in economics," in Handbook of Computational Economics, H. M. Amman, D. A. Kendrick, and J. Rust, Eds. Amsterdam, The Netherlands: Elsevier, 1996, ch. 14, pp. 619-729.
L. P. Kaelbling, M. L. Littman, and A. W. Moore, "Reinforcement learning: A survey," J. Artif. Intell. Res., vol. 4, pp. 237-285, 1996.
D. Ormoneit and S. Sen, "Kernel-based reinforcement learning," Mach. Learn., vol. 49, no. 2/3, pp. 161-178, Nov. 2002.
A. Costa, O. D. Jones, and D. Kroese, "Convergence properties of the cross-entropy method for discrete optimization," Oper. Res. Lett., vol. 35, no. 5, pp. 573-580, Sep. 2007.
J. Randløv and P. Alstrøm, "Learning to drive a bicycle using reinforcement learning and shaping," in Proc. 15th ICML, Madison, WI, Jul. 24-27, 1998, pp. 463-471.
B. Adams, H. Banks, H.-D. Kwon, and H. Tran, "Dynamic multidrug therapies for HIV: Optimal and STI control approaches," Math. Biosci. Eng., vol. 1, no. 2, pp. 223-241, 2004.
D. Ernst, G.-B. Stan, J. Gonçalves, and L. Wehenkel, "Clinical data based optimal STI strategies for HIV: A reinforcement learning approach," in Proc. 45th IEEE Conf. Decision Control, San Diego, CA, Dec. 13-15, 2006, pp. 667-672.