Scientific conference in universities or research centers
Understanding the influence of exploration on the dynamics of policy-gradient algorithms
Bolland, Adrien
2024
 

Files


Full Text
Understanding the influence of exploration on the dynamics of policy-gradient algorithms.pdf
Author postprint (1.31 MB)

Details



Keywords :
Reinforcement Learning; Policy Gradient; Exploration
Abstract :
[en] Policy gradients are effective reinforcement learning algorithms for solving complex control problems. To compute near-optimal policies, it is nevertheless essential in practice to ensure that the variance of the policy remains sufficiently large and that the states are visited sufficiently often during the optimization procedure. Doing so is usually referred to as exploration and is often implemented by adding intrinsic exploration bonuses to the rewards in the learning objective. We propose to analyze the influence of the variance of policies on the return, and the influence of these exploration bonuses on the policy-gradient optimization procedure. First, we show an equivalence between optimizing stochastic policies by policy gradient and optimizing deterministic policies by continuation (i.e., by smoothing the policy parameters during the optimization). We then argue that the variance of policies acts as a smoothing hyperparameter to avoid local extrema during the optimization. Second, we study the learning objective when intrinsic exploration bonuses are added to the rewards. We show that adding these bonuses makes it possible to smooth the learning objective and to eliminate local optima while preserving the global maximum. Furthermore, computing gradient estimates with these reward bonuses leads to policy-gradient algorithms with a higher probability of eventually providing an optimal policy. In light of these two effects, we discuss and empirically illustrate typical exploration strategies based on entropy bonuses.
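As a rough sketch of the two mechanisms described above, the objectives below give one common formalization; the notation (smoothing scale $\sigma$, bonus weight $\lambda$, policy entropy $\mathcal{H}$, deterministic-policy return $J_{\mathrm{det}}$) is assumed here for illustration and may differ from the exact objectives analyzed in the work.

% Continuation view: optimizing a stochastic policy by policy gradient behaves like
% optimizing the deterministic-policy return after Gaussian smoothing of the parameters,
% with the policy variance acting as the smoothing hyperparameter.
\tilde{J}_{\sigma}(\theta) \;=\; \mathbb{E}_{\epsilon \sim \mathcal{N}(0,\, \sigma^{2} I)}\!\left[ J_{\mathrm{det}}(\theta + \epsilon) \right]

% Intrinsic-bonus view: an entropy bonus weighted by \lambda is added to the reward at
% every time step, which smooths the learning objective targeted by the gradient estimates.
J_{\lambda}(\theta) \;=\; \mathbb{E}_{\pi_{\theta}}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} \Big( r(s_t, a_t) + \lambda\, \mathcal{H}\big(\pi_{\theta}(\cdot \mid s_t)\big) \Big) \right]

In both readings, the exploration parameter acts as a smoothing knob: larger values of $\sigma$ or $\lambda$ flatten local extrema of the return landscape, while letting them go to zero recovers the original deterministic, bonus-free objective.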
Disciplines :
Computer science
Author, co-author :
Bolland, Adrien ;  Université de Liège - ULiège > Département d'électricité, électronique et informatique (Institut Montefiore) > Smart grids
Language :
English
Title :
Understanding the influence of exploration on the dynamics of policy-gradient algorithms
Publication date :
11 January 2024
Event name :
Presentation at the Mathematical Institute of the University of Mannheim
Event place :
Mannheim, Germany
Event date :
January 11th, 2024
Audience :
International
Funders :
F.R.S.-FNRS - Fund for Scientific Research [BE]
Available on ORBi :
since 11 January 2024

Statistics


Number of views
56 (10 by ULiège)
Number of downloads
22 (4 by ULiège)
