Doctoral thesis (Dissertations and theses)
Reinforcement Learning in Partially Observable Markov Decision Processes: Learning to Remember the Past by Learning to Predict the Future
Lambrechts, Gaspard
2025
 

Files

Full Text: rl-in-pomdp.pdf, Author postprint (5.87 MB)
Annexes: rl-for-pomdp.pdf, Slides (2.44 MB), Creative Commons License - Attribution

Details



Keywords :
Reinforcement Learning; RL; Partially Observable Markov Decision Process; POMDP; Recurrent Neural Network; RNN; Model-Based Reinforcement Learning; MBRL; World Model; Asymmetric Learning; Representation Learning; State Space Model; SSM; Belief; Optimal Control; Partial Observability; Convergence Analysis; Finite-Time Bound; Agent State; Informed POMDP
Abstract :
[en] Intelligence is usually understood as the ability to make decisions, based on perception, in order to achieve objectives. In other words, intelligence is about perceiving and abstracting past information about the world in order to act on its future evolution. This thesis focuses on reinforcement learning in partially observable Markov decision processes for learning intelligent behaviors through interaction. In particular, this manuscript explores and emphasizes the interplay between perception, representations, memory, predictions and decisions. After introducing the theoretical foundations, the core contributions of the thesis are presented across three thematic parts. The first part, “Learning and Remembering,” investigates how learning intelligent behaviors improves memory and vice versa. To begin with, it studies how learning to act optimally results in representations of the perception history that encode the posterior distribution over the states, known as the belief. Next, it studies how long-term memory improves the ability to learn intelligent behaviors, by designing an initialization procedure for recurrent neural networks that endows them with long-term memorization abilities. The second part, “Leveraging Additional Information,” explores how additional information about the world can be used to learn intelligent behaviors faster than when learning from perception only. It starts by empirically showing that world models predicting this additional information provide better history representations and faster learning. Then, it provides a theoretical justification for the improved convergence speed of a particular algorithm that leverages this information, namely the asymmetric actor-critic algorithm. The third part, “Entangling Predictions and Decisions,” proposes several architectural innovations for obtaining world models that efficiently generate trajectories.
First, it develops new sequence models that parallelize autoregressive generation, while being implicitly recurrent to allow resuming generation. Afterwards, it elaborates on their use as new world models that are able to generate trajectories in parallel through specific latent policies. Finally, this thesis concludes by summarizing how learning adequate representations of the perception history is paramount to learning to make decisions under partial observability. In the perspective of developing general intelligence, this thesis also motivates the shift from specialized abstractions to generalizable abstractions extending across diverse environments.
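The belief mentioned in the abstract, i.e. the posterior distribution over states given the perception history, can be maintained recursively with a Bayes filter. The following sketch illustrates this on a hypothetical two-state POMDP; the transition and observation matrices are made-up numbers for illustration, not taken from the thesis:

```python
import numpy as np

# Hypothetical two-state POMDP under a fixed action (illustrative numbers).
# T[s, s'] = P(s' | s): state transition probabilities.
# O[s, o]  = P(o | s): observation probabilities in the new state.
T = np.array([[0.9, 0.1],
              [0.2, 0.8]])
O = np.array([[0.8, 0.2],
              [0.3, 0.7]])

def belief_update(b, obs):
    """One Bayes filter step: predict through the transition model,
    weight by the observation likelihood, and renormalize."""
    predicted = b @ T                # P(s' | history)
    unnorm = predicted * O[:, obs]   # P(s', o | history)
    return unnorm / unnorm.sum()     # P(s' | history, o)

b = np.array([0.5, 0.5])  # uniform prior belief
b = belief_update(b, obs=0)
```

A recurrent network trained to act optimally can be shown to encode this quantity implicitly in its hidden state, which is what the first part of the thesis investigates.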
Disciplines :
Computer science
Author, co-author :
Lambrechts, Gaspard ;  Université de Liège - ULiège > Montefiore Institute of Electrical Engineering and Computer Science
Language :
English
Title :
Reinforcement Learning in Partially Observable Markov Decision Processes: Learning to Remember the Past by Learning to Predict the Future
Defense date :
07 April 2025
Institution :
ULiège - University of Liège [School of Engineering], Liège, Belgium
Degree :
Doctorate in Engineering Sciences and Technology (Electrical, Electronic and Computer Engineering)
Promotor :
Ernst, Damien  ;  Université de Liège - ULiège > Montefiore Institute of Electrical Engineering and Computer Science
Drion, Guillaume ;  Université de Liège - ULiège > Montefiore Institute of Electrical Engineering and Computer Science
President :
Geurts, Pierre  ;  Université de Liège - ULiège > Montefiore Institute of Electrical Engineering and Computer Science
Jury member :
Mahajan, Aditya;  McGill University > Department of Electrical and Computer Engineering
François-Lavet, Vincent;  VU - Vrije Universiteit Amsterdam > Faculty of Science
Spaan, Matthijs;  Technische Universiteit Delft > Faculty of Engineering, Mathematics and Computer Science
Louppe, Gilles  ;  Université de Liège - ULiège > Montefiore Institute of Electrical Engineering and Computer Science
Fonteneau, Raphaël  ;  Université de Liège - ULiège > Montefiore Institute of Electrical Engineering and Computer Science
Tags :
CÉCI : Consortium des Équipements de Calcul Intensif
Tier-1 supercomputer
Available on ORBi :
since 25 February 2025

