Reinforcement learning; active learning; optimal control
Abstract :
[fr] This article presents two sampling strategies in the context of batch mode reinforcement learning. The first strategy rests on the idea that experiments liable to lead to a modification of the current decision policy are particularly informative. Given a priori a policy inference algorithm and a predictive model of the system, an experiment is carried out if, according to the predictive model, it leads to the learning of a different decision policy. The second strategy exploits recently published results for computing bounds on the return of decision policies, so as to select experiments that sharpen these bounds and thereby discriminate non-optimal policies. Both strategies are illustrated on elementary problems, and the results obtained are promising.

[en] We propose two strategies for experiment selection in the context of batch mode reinforcement learning. The first strategy is based on the idea that the most interesting experiments to carry out at some stage are those most liable to falsify the current hypothesis about the optimal control policy. We cast this idea in a context where a policy learning algorithm and a model identification method are given a priori. The second strategy exploits recently published methods for computing bounds on the return of control policies from a set of trajectories, in order to sample the state-action space so as to discriminate between optimal and non-optimal policies. Both strategies are experimentally validated, showing promising results.
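The first strategy can be sketched as a simple selection loop: a candidate experiment is retained if, once its outcome is simulated with the predictive model and added to the batch, the policy-learning algorithm infers a different policy. This is only a minimal illustration under toy assumptions; `learn_policy`, `model`, and the two-state setting below are all hypothetical stand-ins for the a-priori-given learner and identified model, not the authors' actual algorithms.

```python
def learn_policy(batch, states, actions):
    """Toy learner: for each state, pick the action with the best
    average observed reward in the batch (defaults to the first action)."""
    policy = {}
    for s in states:
        best_a, best_r = actions[0], float("-inf")
        for a in actions:
            rewards = [r for (s_, a_, r) in batch if (s_, a_) == (s, a)]
            if rewards and sum(rewards) / len(rewards) > best_r:
                best_a, best_r = a, sum(rewards) / len(rewards)
        policy[s] = best_a
    return policy

def informative_experiments(batch, candidates, model, states, actions):
    """Keep the candidate experiments whose (simulated) outcome would
    falsify the currently induced policy."""
    current = learn_policy(batch, states, actions)
    selected = []
    for (s, a) in candidates:
        r_pred = model(s, a)  # outcome predicted by the identified model
        new = learn_policy(batch + [(s, a, r_pred)], states, actions)
        if new != current:    # policy changed -> experiment is informative
            selected.append((s, a))
    return selected

# Toy predictive model over two states and two actions.
model = lambda s, a: 1.0 if (s, a) == (0, 1) else 0.0
states, actions = [0, 1], [0, 1]
batch = [(0, 0, 0.0), (1, 0, 0.0)]   # only action 0 observed so far
candidates = [(0, 1), (1, 1)]
print(informative_experiments(batch, candidates, model, states, actions))
# -> [(0, 1)]
```

Only the experiment (0, 1) is selected: the model predicts a reward there that flips the learned action in state 0, whereas (1, 1) leaves the induced policy unchanged.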
Disciplines :
Computer science
Author, co-author :
Fonteneau, Raphaël ; Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Systèmes et modélisation
Murphy, Susan A.
Wehenkel, Louis ; Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Systèmes et modélisation
Ernst, Damien ; Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Smart grids
Language :
French
Title :
Stratégies d'échantillonnage pour l'apprentissage par renforcement batch
Auer P. (2003). Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, vol. 3, p. 397-422.
Aurenhammer F. (1991). Voronoi diagrams: a survey of a fundamental geometric data structure. ACM Computing Surveys (CSUR), vol. 23, no 3, p. 345-405.
Bubeck S., Munos R., Stoltz G., Szepesvári C. (2009). Online optimization in X-armed bandits. In Advances in Neural Information Processing Systems 21 (NIPS 2009), p. 201-208. MIT Press.
Busoniu L., Babuska R., De Schutter B., Ernst D. (2010). Reinforcement Learning and Dynamic Programming using Function Approximators. Taylor & Francis CRC Press.
Castronovo M., Maes F., Fonteneau R., Ernst D. (2012, June). Learning exploration/exploitation strategies for single trajectory reinforcement learning. In 10th European Workshop on Reinforcement Learning (EWRL 2012). Edinburgh, Scotland.
Cohen J., McClure S., Yu A. (2007). Should I stay or should I go? How the human brain manages the trade-off between exploitation and exploration. Philosophical Transactions of the Royal Society B, vol. 362, no 1481, p. 933-942.
Epshteyn A., Vogel A., DeJong G. (2008). Active reinforcement learning. In Proceedings of the 25th International Conference on Machine Learning (ICML 2008), vol. 307.
Ernst D. (2005). Selecting concise sets of samples for a reinforcement learning agent. In Proceedings of the Third International Conference on Computational Intelligence, Robotics and Autonomous Systems (CIRAS 2005). Singapore.
Ernst D., Geurts P., Wehenkel L. (2005). Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, vol. 6, p. 503-556.
Fonteneau R. (2011). Contributions to Batch Mode Reinforcement Learning. PhD Thesis, University of Liège.
Fonteneau R., Murphy S., Wehenkel L., Ernst D. (2010). Generating informative trajectories by using bounds on the return of control policies. In Proceedings of the Workshop on Active Learning and Experimental Design 2010 (in conjunction with AISTATS 2010).
Fonteneau R., Murphy S., Wehenkel L., Ernst D. (2011a). Active exploration by searching for experiments falsifying an already induced policy. In Proceedings of the 2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (IEEE ADPRL 2011), p. 40-47.
Fonteneau R., Murphy S. A., Wehenkel L., Ernst D. (2011b). Towards min max generalization in reinforcement learning. In Agents and Artificial Intelligence: International Conference, ICAART 2010, Valencia, Spain, January 2010, Revised Selected Papers. Communications in Computer and Information Science (CCIS), vol. 129, p. 61-77. Springer, Heidelberg.
Ingersoll J. (1987). Theory of Financial Decision Making. Rowman and Littlefield Publishers, Inc.
Kaelbling L. (1993). Learning in Embedded Systems. MIT Press.
Maes F., Wehenkel L., Ernst D. (2011, September). Automatic discovery of ranking formulas for playing with multi-armed bandits. In 9th European Workshop on Reinforcement Learning (EWRL 2011). Athens, Greece.
Munos R., Moore A. (2002). Variable resolution discretization in optimal control. Machine Learning, vol. 49, p. 291-323.
Murphy S. (2003). Optimal dynamic treatment regimes. Journal of the Royal Statistical Society, Series B, vol. 65, no 2, p. 331-366.
Murphy S. (2005). An experimental design for the development of adaptive treatment strategies. Statistics in Medicine, vol. 24, p. 1455-1481.
Ormoneit D., Sen S. (2002). Kernel-based reinforcement learning. Machine Learning, vol. 49, no 2-3, p. 161-178.
Rachelson E., Schnitzler F., Wehenkel L., Ernst D. (2011). Optimal sample selection for batch-mode reinforcement learning. In 3rd International Conference on Agents and Artificial Intelligence (ICAART 2011).
Riedmiller M. (2005). Neural fitted Q iteration: first experiences with a data efficient neural reinforcement learning method. In Proceedings of the 16th European Conference on Machine Learning (ECML 2005), p. 317-328. Porto, Portugal.
Sutton R., Barto A. (1998). Reinforcement Learning: An Introduction. MIT Press.
Thrun S. (1992). The role of exploration in learning control. In D. White, D. Sofge (Eds.), Handbook for intelligent control: Neural, fuzzy and adaptive approaches. Van Nostrand Reinhold.
Viterbi A. (1967). Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, vol. 13, no 2, p. 260-269.