Article (Scientific journals)
Jointly Learning Environments and Control Policies with Projected Stochastic Gradient Ascent
Bolland, Adrien; Boukas, Ioannis; Berger, Mathias et al.
2022. In Journal of Artificial Intelligence Research, 73, p. 117-171
Peer reviewed
 

Files

Full Text :
DEPS.pdf (Publisher postprint, 2.13 MB)

Details



Keywords :
reinforcement learning; joint design and control; deep neural networks; microgrid; drone
Abstract :
[en] We consider the joint design and control of discrete-time stochastic dynamical systems over a finite time horizon. We formulate the problem as a multi-step optimization problem under uncertainty, seeking to identify a system design and a control policy that jointly maximize the expected sum of rewards collected over the time horizon considered. The transition function, the reward function and the policy are all parametrized, assumed known and differentiable with respect to their parameters. We then introduce a deep reinforcement learning algorithm combining policy gradient methods with model-based optimization techniques to solve this problem. In essence, our algorithm iteratively approximates the gradient of the expected return via Monte-Carlo sampling and automatic differentiation and takes projected gradient ascent steps in the space of environment and policy parameters. This algorithm is referred to as Direct Environment and Policy Search (DEPS). We assess the performance of our algorithm in three environments concerned with the design and control of a mass-spring-damper system, a small-scale off-grid power system and a drone, respectively. In addition, our algorithm is benchmarked against a state-of-the-art deep reinforcement learning algorithm used to tackle joint design and control problems. We show that DEPS performs at least as well as, or better than, this algorithm in all three environments, consistently yielding solutions with higher returns in fewer iterations. Finally, solutions produced by our algorithm are also compared with solutions produced by an algorithm that does not jointly optimize environment and policy parameters, highlighting the fact that higher returns can be achieved when joint optimization is performed.
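The abstract's core loop lends itself to a compact illustration. Below is a minimal sketch, in JAX, of projected stochastic gradient ascent over joint environment and policy parameters: Monte-Carlo rollouts through differentiable dynamics, automatic differentiation of the estimated return, and an ascent step followed by a projection. The toy dynamics, reward, linear policy, horizon, batch size, learning rate and box constraint are all illustrative assumptions, not the paper's actual environments or parametrizations (the paper uses deep neural network policies).

import jax
import jax.numpy as jnp

HORIZON = 50   # finite time horizon (illustrative)
BATCH = 32     # Monte-Carlo rollouts per gradient estimate (illustrative)
LR = 1e-2      # step size (illustrative)

def dynamics(state, action, psi, noise):
    # Toy differentiable transition; psi stands in for the design parameters.
    return state + psi[0] * action + psi[1] * noise

def reward(state, action):
    # Toy differentiable reward: penalize state deviation and control effort.
    return -state**2 - 0.1 * action**2

def policy(state, theta):
    # Placeholder linear policy; the paper parametrizes deep neural networks.
    return theta[0] * state + theta[1]

def rollout_return(params, key):
    # Return of a single reparametrized rollout over the horizon.
    psi, theta = params
    noises = jax.random.normal(key, (HORIZON,))
    state, ret = 1.0, 0.0
    for t in range(HORIZON):
        action = policy(state, theta)
        ret = ret + reward(state, action)
        state = dynamics(state, action, psi, noises[t])
    return ret

def estimated_return(params, keys):
    # Monte-Carlo estimate of the expected return over a batch of rollouts.
    return jnp.mean(jax.vmap(rollout_return, in_axes=(None, 0))(params, keys))

grad_fn = jax.jit(jax.grad(estimated_return))

psi = jnp.array([0.5, 0.1])      # environment (design) parameters
theta = jnp.array([-0.5, 0.0])   # policy parameters
key = jax.random.PRNGKey(0)

for it in range(200):
    key, subkey = jax.random.split(key)
    g_psi, g_theta = grad_fn((psi, theta), jax.random.split(subkey, BATCH))
    # Gradient ascent on both parameter sets, then project the design
    # parameters back onto an assumed feasible box [0, 1]^2.
    theta = theta + LR * g_theta
    psi = jnp.clip(psi + LR * g_psi, 0.0, 1.0)

The projection here is a simple clip onto a box; in general, the projection step maps the design parameters back onto whatever feasible set the application imposes.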
Disciplines :
Computer science
Author, co-author :
Bolland, Adrien ;  Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Smart grids
Boukas, Ioannis  ;  Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Smart-Microgrids
Berger, Mathias ;  Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Smart grids
Ernst, Damien  ;  Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Smart grids
Language :
English
Title :
Jointly Learning Environments and Control Policies with Projected Stochastic Gradient Ascent
Publication date :
January 2022
Journal title :
Journal of Artificial Intelligence Research
Volume :
73
Pages :
117-171
Peer reviewed :
Peer reviewed
Funders :
F.R.S.-FNRS - Fonds de la Recherche Scientifique [BE]
Available on ORBi :
since 03 June 2020

Statistics


Number of views :
554 (53 by ULiège)
Number of downloads :
231 (28 by ULiège)
Scopus citations® :
2
Scopus citations® without self-citations :
1
OpenCitations :
0
