Keywords :
Reinforcement Learning; Formula Discovery; Interpretability
Abstract :
[en] In this paper, we address the problem of computing interpretable solutions to reinforcement learning (RL) problems. To this end, we propose a search algorithm over a space of simple closed-form formulas that are used to rank actions. We formalize the search for a high-performance policy as a multi-armed bandit problem in which each arm corresponds to a candidate policy, canonically represented by its shortest formula-based representation. Experiments conducted on standard benchmarks show that this approach discovers solutions that are both efficient and interpretable.
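The bandit formulation sketched in the abstract can be illustrated with a small toy example: each arm is a candidate formula that ranks actions, an arm "pull" evaluates the induced greedy policy by a rollout, and UCB1 allocates the evaluation budget. The formulas, the one-dimensional toy environment, and all function names below are illustrative assumptions, not the benchmarks or formula grammar used in the paper.

```python
import math
import random

# Hypothetical candidate formulas: each scores an action given a state.
# The greedy policy induced by a formula picks the highest-scoring action.
FORMULAS = [
    lambda s, a: a,        # always prefer the larger action
    lambda s, a: -a,       # always prefer the smaller action
    lambda s, a: -s * a,   # push the state back toward the origin
]

def rollout_return(formula, horizon=20, seed=0):
    """Return of the formula-greedy policy on a toy 1-D chain.

    Reward is -|s| per step, so good policies keep the state near 0.
    """
    rng = random.Random(seed)
    s, total = rng.uniform(-1.0, 1.0), 0.0
    for _ in range(horizon):
        a = max((-1, 1), key=lambda act: formula(s, act))
        s = max(-1.0, min(1.0, s + 0.1 * a + rng.gauss(0.0, 0.05)))
        total += -abs(s)
    return total

def ucb1_search(budget=300):
    """UCB1 over the arms, i.e. the candidate formula-based policies."""
    n = [0] * len(FORMULAS)       # pull counts
    mean = [0.0] * len(FORMULAS)  # empirical mean returns
    for t in range(1, budget + 1):
        if t <= len(FORMULAS):
            i = t - 1             # play each arm once to initialize
        else:
            i = max(range(len(FORMULAS)),
                    key=lambda k: mean[k] + math.sqrt(2 * math.log(t) / n[k]))
        r = rollout_return(FORMULAS[i], seed=t)
        n[i] += 1
        mean[i] += (r - mean[i]) / n[i]  # incremental mean update
    return max(range(len(FORMULAS)), key=lambda k: mean[k])

best = ucb1_search()
```

In this toy setting the third formula, `-s * a`, always drives the state toward the origin, so the bandit quickly concentrates its pulls on that arm; the returned index identifies the best interpretable policy found under the given budget.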
Disciplines :
Computer science
Author, co-author :
Maes, Francis ; Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Systèmes et modélisation
Fonteneau, Raphaël ; Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Systèmes et modélisation
Wehenkel, Louis ; Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Systèmes et modélisation
Ernst, Damien ; Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Smart grids
Language :
English
Title :
Policy search in a space of simple closed-form formulas: towards interpretability of reinforcement learning
Publication date :
October 2012
Event name :
The Fifteenth International Conference on Discovery Science (DS 2012)
Event date :
29-31 October, 2012
Audience :
International
Main work title :
Discovery Science: 15th International Conference, DS 2012, Lyon, France, October 29-31, 2012, Proceedings