We introduce a new maximum entropy reinforcement learning framework based on
the distribution of states and actions visited by a policy. More precisely, an
intrinsic reward function is added to the reward function of the Markov
decision process to be controlled. For each state and action, this
intrinsic reward is the relative entropy of the discounted distribution of
states and actions (or features from these states and actions) visited during
the next time steps. We first prove that, under some assumptions, an optimal
exploration policy, which maximizes the expected discounted sum of intrinsic
rewards, also maximizes a lower bound on the state-action value function of the
decision process. We also prove that the visitation distribution
used in the intrinsic reward definition is the fixed point of a contraction
operator. We then describe how to adapt existing algorithms to learn this
fixed point and compute the intrinsic rewards to enhance exploration. A new
practical off-policy maximum entropy reinforcement learning algorithm is
finally introduced. Empirically, the resulting exploration policies achieve good
coverage of the state-action space, and high-performing control policies are
computed efficiently.
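
To make the construction concrete, the following is a hedged sketch of the augmented reward in notation of my own choosing; the discount factor \gamma, the weighting coefficient \lambda, the feature map \phi, and the visitation measure d^\pi_\gamma are assumptions and are not taken verbatim from the paper. One plausible reading of the discounted future visitation measure from a state-action pair (s, a) under a policy \pi is

d^\pi_\gamma(x \mid s, a) = (1 - \gamma) \sum_{t \ge 1} \gamma^{t-1} \, \Pr\big( \phi(s_t, a_t) = x \,\big|\, s_0 = s,\ a_0 = a,\ \pi \big),

and the corresponding augmented reward would be

r_\lambda(s, a) = r(s, a) + \lambda \, \mathcal{H}\big( d^\pi_\gamma(\cdot \mid s, a) \big),

where \mathcal{H} denotes the (relative) entropy of that measure. Under this successor-measure-style reading, d^\pi_\gamma satisfies the Bellman-like recursion

d^\pi_\gamma(x \mid s, a) = (1 - \gamma) \, \Pr\big( \phi(s_1, a_1) = x \mid s, a, \pi \big) + \gamma \, \mathbb{E}_{s_1, a_1}\big[ d^\pi_\gamma(x \mid s_1, a_1) \big],

whose right-hand side defines a \gamma-contraction in the sup norm, consistent with the fixed-point result stated in the abstract; the exact operator and entropy definition used in the paper may differ in detail.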
Disciplines :
Computer science
Author, co-author :
Bolland, Adrien ; Université de Liège - ULiège > Montefiore Institute of Electrical Engineering and Computer Science
Lambrechts, Gaspard ; Université de Liège - ULiège > Département d'électricité, électronique et informatique (Institut Montefiore) > Smart grids
Ernst, Damien ; Université de Liège - ULiège > Département d'électricité, électronique et informatique (Institut Montefiore) > Smart grids
Language :
English
Title :
Off-Policy Maximum Entropy RL with Future State and Action Visitation Measures