We introduce a new maximum entropy reinforcement learning framework based on
the distribution of states and actions visited by a policy. More precisely, an
intrinsic reward function is added to the reward function of the Markov
decision process to be controlled. For each state and action, this
intrinsic reward is the relative entropy of the discounted distribution of
states and actions (or features from these states and actions) visited during
the next time steps. We first prove that, under some assumptions, an optimal
exploration policy, which maximizes the expected discounted sum of intrinsic
rewards, also maximizes a lower bound on the state-action value function of the
decision process. We also prove that the visitation distribution
used in the intrinsic reward definition is the fixed point of a contraction
operator. We then describe how to adapt existing algorithms to learn this
fixed point and compute the intrinsic rewards to enhance exploration. A new
practical off-policy maximum entropy reinforcement learning algorithm is
finally introduced. Empirically, the resulting exploration policies achieve good
coverage of the state-action space, and high-performing control policies are
computed efficiently.
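
To make the construction concrete, the following is a hedged sketch of the augmented reward in notation of my own choosing; the discount factor \gamma, the weighting coefficient \lambda, the feature map \phi, and the visitation measure d^\pi_\gamma are assumptions and are not taken verbatim from the paper. One plausible reading of the discounted future visitation measure from a state-action pair (s, a) under a policy \pi is

d^\pi_\gamma(x \mid s, a) = (1 - \gamma) \sum_{t \ge 1} \gamma^{t-1} \, \Pr\big( \phi(s_t, a_t) = x \,\big|\, s_0 = s,\ a_0 = a,\ \pi \big),

and the corresponding augmented reward would be

r_\lambda(s, a) = r(s, a) + \lambda \, \mathcal{H}\big( d^\pi_\gamma(\cdot \mid s, a) \big),

where \mathcal{H} denotes the (relative) entropy of that measure. Under this successor-measure-style reading, d^\pi_\gamma satisfies the Bellman-like recursion

d^\pi_\gamma(x \mid s, a) = (1 - \gamma) \, \Pr\big( \phi(s_1, a_1) = x \mid s, a, \pi \big) + \gamma \, \mathbb{E}_{s_1, a_1}\big[ d^\pi_\gamma(x \mid s_1, a_1) \big],

whose right-hand side defines a \gamma-contraction in the sup norm, consistent with the fixed-point result stated in the abstract; the exact operator and entropy definition used in the paper may differ in detail.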
Disciplines :
Computer science
Author, co-author :
Bolland, Adrien ; Université de Liège - ULiège > Montefiore Institute of Electrical Engineering and Computer Science
Lambrechts, Gaspard ; Université de Liège - ULiège > Département d'électricité, électronique et informatique (Institut Montefiore) > Smart grids
Ernst, Damien ; Université de Liège - ULiège > Département d'électricité, électronique et informatique (Institut Montefiore) > Smart grids
Language :
English
Title :
Off-Policy Maximum Entropy RL with Future State and Action Visitation Measures