Doctoral thesis (Dissertations and theses)
Reinforcement Learning in Partially Observable Markov Decision Processes: Learning to Remember the Past by Learning to Predict the Future
Lambrechts, Gaspard
2025
 

Files

Full Text: rl-in-pomdp.pdf, Author postprint (5.87 MB)
Annexes: rl-for-pomdp.pdf, Slides (2.44 MB), Creative Commons License - Attribution

Details



Keywords :
Reinforcement Learning; RL; Partially Observable Markov Decision Process; POMDP; Recurrent Neural Network; RNN; Model-Based Reinforcement Learning; MBRL; World Model; Asymmetric Learning; Representation Learning; State Space Model; SSM; Belief; Optimal Control; Partial Observability; Convergence Analysis; Finite-Time Bound; Agent State; Informed POMDP
Abstract :
[en] Intelligence is usually understood as the ability to make decisions, based on perception, in order to achieve objectives. In other words, intelligence is about perceiving and abstracting past information about the world in order to act on its future evolution. This thesis focuses on reinforcement learning in partially observable Markov decision processes for learning intelligent behaviors through interaction. In particular, this manuscript explores and emphasizes the interplay between perception, representations, memory, predictions and decisions. After introducing the theoretical foundations, the core contributions of the thesis are presented across three thematic parts. The first part, “Learning and Remembering,” investigates how learning intelligent behaviors improves memory and vice versa. To begin with, it studies how learning to act optimally results in representations of the perception history that encode the posterior distribution over the states, known as the belief. Next, it studies how long-term memory improves the ability to learn intelligent behaviors, by designing an initialization procedure for recurrent neural networks that endows them with long-term memorization abilities. The second part, “Leveraging Additional Information,” explores how additional information about the world can be used to learn intelligent behaviors faster than when learning from perception only. It starts by empirically showing that world models predicting this additional information provide better history representations and faster learning. Then, it provides a theoretical justification for the improved convergence speed of a particular algorithm that leverages this information, namely the asymmetric actor-critic algorithm. The third part, “Entangling Predictions and Decisions,” proposes several architectural innovations for obtaining world models that efficiently generate trajectories.
First, it develops new sequence models that parallelize autoregressive generation, while being implicitly recurrent to allow resuming generation. Afterwards, it elaborates on their use as new world models that are able to generate trajectories in parallel through specific latent policies. Finally, this thesis concludes by summarizing how learning adequate representations of the perception history is paramount to learning to make decisions under partial observability. In the perspective of developing general intelligence, this thesis also motivates the shift from specialized abstractions to generalizable abstractions extending across diverse environments.
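The belief mentioned in the abstract, i.e. the posterior distribution over states given the perception history, can be maintained recursively with a Bayes filter. The following sketch illustrates this on a hypothetical two-state POMDP; the transition and observation matrices are made-up numbers for illustration, not taken from the thesis:

```python
import numpy as np

# Hypothetical two-state POMDP under a fixed action (illustrative numbers).
# T[s, s'] = P(s' | s): state transition probabilities.
# O[s, o]  = P(o | s): observation probabilities in the new state.
T = np.array([[0.9, 0.1],
              [0.2, 0.8]])
O = np.array([[0.8, 0.2],
              [0.3, 0.7]])

def belief_update(b, obs):
    """One Bayes filter step: predict through the transition model,
    weight by the observation likelihood, and renormalize."""
    predicted = b @ T                # P(s' | history)
    unnorm = predicted * O[:, obs]   # P(s', o | history)
    return unnorm / unnorm.sum()     # P(s' | history, o)

b = np.array([0.5, 0.5])  # uniform prior belief
b = belief_update(b, obs=0)
```

A recurrent network trained to act optimally can be shown to encode this quantity implicitly in its hidden state, which is what the first part of the thesis investigates.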
Disciplines :
Computer science
Author, co-author :
Lambrechts, Gaspard ;  Université de Liège - ULiège > Montefiore Institute of Electrical Engineering and Computer Science
Language :
English
Title :
Reinforcement Learning in Partially Observable Markov Decision Processes: Learning to Remember the Past by Learning to Predict the Future
Defense date :
07 April 2025
Institution :
ULiège - University of Liège [School of Engineering], Liège, Belgium
Degree :
Doctorate in Engineering Sciences and Technology (Electrical, Electronic and Computer Engineering)
Promotor :
Ernst, Damien  ;  Université de Liège - ULiège > Montefiore Institute of Electrical Engineering and Computer Science
Drion, Guillaume ;  Université de Liège - ULiège > Montefiore Institute of Electrical Engineering and Computer Science
President :
Geurts, Pierre  ;  Université de Liège - ULiège > Montefiore Institute of Electrical Engineering and Computer Science
Jury member :
Mahajan, Aditya;  McGill University > Department of Electrical and Computer Engineering
François-Lavet, Vincent;  VU - Vrije Universiteit Amsterdam > Faculty of Science
Spaan, Matthijs;  Technische Universiteit Delft > Faculty of Engineering, Mathematics and Computer Science
Louppe, Gilles  ;  Université de Liège - ULiège > Montefiore Institute of Electrical Engineering and Computer Science
Fonteneau, Raphaël  ;  Université de Liège - ULiège > Montefiore Institute of Electrical Engineering and Computer Science
Tags :
CÉCI : Consortium des Équipements de Calcul Intensif
Tier-1 supercomputer
Available on ORBi :
since 25 February 2025

