Bolland, Adrien; Université de Liège - ULiège > Département d'électricité, électronique et informatique (Institut Montefiore) > Smart grids
Louppe, Gilles; Université de Liège - ULiège > Département d'électricité, électronique et informatique (Institut Montefiore) > Big Data
Ernst, Damien; Université de Liège - ULiège > Département d'électricité, électronique et informatique (Institut Montefiore) > Smart grids; Telecom Paris, Institut Polytechnique de Paris > Laboratoire Traitement et Communication de l'Information (LTCI)
Language: English
Title: Policy Gradient Algorithms Implicitly Optimize by Continuation
Publication date: 21 October 2023
Journal title: Transactions on Machine Learning Research
eISSN: 2835-8856
Publisher: OpenReview, Amherst, United States - Massachusetts