Scientific conference in universities or research centers
Understanding the influence of exploration on the dynamics of policy-gradient algorithms
Bolland, Adrien
2024
 

Files


Full Text
Understanding the influence of exploration on the dynamics of policy-gradient algorithms.pdf
Author postprint (1.31 MB)

Details



Keywords :
Reinforcement Learning; Policy Gradient; Exploration
Abstract :
[en] Policy gradients are effective reinforcement learning algorithms for solving complex control problems. To compute near-optimal policies, it is nevertheless essential in practice to ensure that the variance of the policy remains sufficiently large and that the states are visited sufficiently often during the optimization procedure. Doing so is usually referred to as exploration and is often implemented by adding intrinsic exploration bonuses to the rewards in the learning objective. We propose to analyze the influence of the variance of policies on the return, and the influence of these exploration bonuses on the policy-gradient optimization procedure. First, we show an equivalence between optimizing stochastic policies by policy gradient and optimizing deterministic policies by continuation (i.e., by smoothing the policy parameters during the optimization). We then argue that the variance of policies acts as a smoothing hyperparameter to avoid local extrema during the optimization. Second, we study the learning objective when intrinsic exploration bonuses are added to the rewards. We show that adding these bonuses makes it possible to smooth the learning objective and to eliminate local optima while preserving the global maximum. Furthermore, computing gradient estimates with these reward bonuses leads to policy-gradient algorithms with a higher probability of eventually providing an optimal policy. In light of these two effects, we discuss and empirically illustrate typical exploration strategies based on entropy bonuses.
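As a rough sketch of the two mechanisms described above, the objectives below give one common formalization; the notation (smoothing scale $\sigma$, bonus weight $\lambda$, policy entropy $\mathcal{H}$, deterministic-policy return $J_{\mathrm{det}}$) is assumed here for illustration and may differ from the exact objectives analyzed in the work.

% Continuation view: optimizing a stochastic policy by policy gradient behaves like
% optimizing the deterministic-policy return after Gaussian smoothing of the parameters,
% with the policy variance acting as the smoothing hyperparameter.
\tilde{J}_{\sigma}(\theta) \;=\; \mathbb{E}_{\epsilon \sim \mathcal{N}(0,\, \sigma^{2} I)}\!\left[ J_{\mathrm{det}}(\theta + \epsilon) \right]

% Intrinsic-bonus view: an entropy bonus weighted by \lambda is added to the reward at
% every time step, which smooths the learning objective targeted by the gradient estimates.
J_{\lambda}(\theta) \;=\; \mathbb{E}_{\pi_{\theta}}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} \Big( r(s_t, a_t) + \lambda\, \mathcal{H}\big(\pi_{\theta}(\cdot \mid s_t)\big) \Big) \right]

In both readings, the exploration parameter acts as a smoothing knob: larger values of $\sigma$ or $\lambda$ flatten local extrema of the return landscape, while letting them go to zero recovers the original deterministic, bonus-free objective.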
Disciplines :
Computer science
Author, co-author :
Bolland, Adrien ;  Université de Liège - ULiège > Département d'électricité, électronique et informatique (Institut Montefiore) > Smart grids
Language :
English
Title :
Understanding the influence of exploration on the dynamics of policy-gradient algorithms
Publication date :
11 January 2024
Event name :
Presentation at the Mathematical Institute of the University of Mannheim
Event place :
Mannheim, Germany
Event date :
January 11th, 2024
Audience :
International
Funders :
F.R.S.-FNRS - Fund for Scientific Research [BE]
Available on ORBi :
since 11 January 2024

Statistics


Number of views
56 (10 by ULiège)
Number of downloads
22 (4 by ULiège)
