[en] In this paper, we propose an extension to policy gradient algorithms that allows starting states to be sampled from a probability distribution which may differ from the one specified by the reinforcement learning task. In particular, we suggest that, between policy updates, starting states be sampled from a probability density function that approximates the state visitation frequency of the current policy. Results from various environments demonstrate an improvement in mean cumulative reward and substantially more stable updates compared to vanilla policy gradient algorithms in which starting states are drawn either from the distribution specified by the environment or uniformly over the state space. A sensitivity analysis over a subset of the hyper-parameters of our algorithm also suggests that these should be adapted after each policy update to maximise the improvement of the policies.
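To make the procedure concrete, the sketch below illustrates one way the training loop described above could be organised. It is a minimal sketch, not the paper's implementation: the Gym-style environment with a hypothetical `reset_to(state)` method, the `policy.act`/`policy.update` interface, and the choice of a Gaussian mixture as the density model are all assumptions introduced for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def fit_visitation_density(states, n_components=5):
    """Fit a density model to the states visited by the current policy.

    A Gaussian mixture over continuous state vectors is assumed here;
    the paper only requires some approximation of the state visitation
    frequency.
    """
    gmm = GaussianMixture(n_components=n_components)
    gmm.fit(np.asarray(states))
    return gmm


def train(env, policy, n_updates=100, n_episodes=10):
    """Policy gradient loop where, between updates, starting states are
    drawn from a density approximating the state visitation frequency
    of the current policy. `env.reset_to(state)`, `policy.act` and
    `policy.update` are hypothetical interfaces, not any library's API.
    """
    visitation_model = None
    for _ in range(n_updates):
        trajectories, visited_states = [], []
        for _ in range(n_episodes):
            if visitation_model is None:
                # First batch: starting distribution specified by the task.
                state = env.reset()
            else:
                # Later batches: start from a state sampled from the
                # fitted visitation density instead.
                state = env.reset_to(visitation_model.sample(1)[0][0])
            episode, done = [], False
            while not done:
                action = policy.act(state)
                next_state, reward, done, _ = env.step(action)
                episode.append((state, action, reward))
                visited_states.append(state)
                state = next_state
            trajectories.append(episode)
        policy.update(trajectories)  # vanilla policy gradient step
        # Refit the density to the states visited by the updated policy.
        visitation_model = fit_visitation_density(visited_states)
```

Consistent with the sensitivity analysis mentioned above, the hyper-parameters of the density model (here, the number of mixture components) would plausibly be re-tuned after each policy update rather than fixed once.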
Disciplines :
Computer science
Author, co-author :
Aittahar, Samy ; Université de Liège - ULiège > Department of Electrical Engineering and Computer Science (Montefiore Institute) > Smart grids
Fonteneau, Raphaël ; Université de Liège - ULiège > Department of Electrical Engineering and Computer Science (Montefiore Institute) > Smart grids
Ernst, Damien ; Université de Liège - ULiège > Department of Electrical Engineering and Computer Science (Montefiore Institute) > Smart grids
Language :
English
Title :
Empirical Analysis of Policy Gradient Algorithms where Starting States are Sampled according to Most Frequently Visited States
Publication date :
2020
Event name :
International Federation of Automatic Control (IFAC) 2020