[en] In this paper, we propose an extension to policy gradient algorithms that allows starting states to be sampled from a probability distribution which may differ from the one specified by the reinforcement learning task. In particular, we suggest that, between policy updates, starting states be sampled from a probability density function that approximates the state visitation frequency of the current policy. Results from various environments demonstrate an improvement in mean cumulative reward and substantially more stable updates compared to vanilla policy gradient algorithms in which starting states are drawn either from the distribution specified by the environment or uniformly over the state space. A sensitivity analysis over a subset of the hyper-parameters of our algorithm also suggests that these should be adapted after each policy update to maximise the improvement of the policies.
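To make the procedure concrete, the sketch below illustrates one way the training loop described above could be organised. It is a minimal sketch, not the paper's implementation: the Gym-style environment with a hypothetical `reset_to(state)` method, the `policy.act`/`policy.update` interface, and the choice of a Gaussian mixture as the density model are all assumptions introduced for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def fit_visitation_density(states, n_components=5):
    """Fit a density model to the states visited by the current policy.

    A Gaussian mixture over continuous state vectors is assumed here;
    the paper only requires some approximation of the state visitation
    frequency.
    """
    gmm = GaussianMixture(n_components=n_components)
    gmm.fit(np.asarray(states))
    return gmm


def train(env, policy, n_updates=100, n_episodes=10):
    """Policy gradient loop where, between updates, starting states are
    drawn from a density approximating the state visitation frequency
    of the current policy. `env.reset_to(state)`, `policy.act` and
    `policy.update` are hypothetical interfaces, not any library's API.
    """
    visitation_model = None
    for _ in range(n_updates):
        trajectories, visited_states = [], []
        for _ in range(n_episodes):
            if visitation_model is None:
                # First batch: starting distribution specified by the task.
                state = env.reset()
            else:
                # Later batches: start from a state sampled from the
                # fitted visitation density instead.
                state = env.reset_to(visitation_model.sample(1)[0][0])
            episode, done = [], False
            while not done:
                action = policy.act(state)
                next_state, reward, done, _ = env.step(action)
                episode.append((state, action, reward))
                visited_states.append(state)
                state = next_state
            trajectories.append(episode)
        policy.update(trajectories)  # vanilla policy gradient step
        # Refit the density to the states visited by the updated policy.
        visitation_model = fit_visitation_density(visited_states)
```

Consistent with the sensitivity analysis mentioned above, the hyper-parameters of the density model (here, the number of mixture components) would plausibly be re-tuned after each policy update rather than fixed once.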
Disciplines :
Computer science
Author, co-author :
Aittahar, Samy ; Université de Liège - ULiège > Department of Electrical Engineering and Computer Science (Montefiore Institute) > Smart grids
Fonteneau, Raphaël ; Université de Liège - ULiège > Department of Electrical Engineering and Computer Science (Montefiore Institute) > Smart grids
Ernst, Damien ; Université de Liège - ULiège > Department of Electrical Engineering and Computer Science (Montefiore Institute) > Smart grids
Language :
English
Title :
Empirical Analysis of Policy Gradient Algorithms where Starting States are Sampled according to Most Frequently Visited States
Publication date :
2020
Event name :
International Federation of Automatic Control (IFAC) 2020