Keywords :
reinforcement learning; min max generalization; nonconvex optimization; computational complexity
Abstract :
We study the min max optimization problem introduced in Fonteneau et al. [Towards min max generalization in reinforcement learning, ICAART 2010, Springer, Heidelberg, 2011, pp. 61–77] for computing policies for batch mode reinforcement learning in a deterministic setting with a fixed, finite time horizon. First, we show that the min part of this problem is NP-hard. We then provide two relaxation schemes. The first works by dropping some constraints so as to obtain a problem that is solvable in polynomial time. The second, based on a Lagrangian relaxation in which all constraints are dualized, can also be solved in polynomial time. We prove theoretically, and illustrate empirically, that both relaxation schemes provide better results than those given in [Fonteneau et al., 2011, as cited above].
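The second scheme described in the abstract dualizes all constraints of the min problem. As a generic illustration of why dualization yields usable bounds (weak duality guarantees that every dual value lower-bounds the primal minimum), here is a minimal Python sketch on a toy one-dimensional quadratic program; the problem, names, and numbers are illustrative only and are not taken from the paper:

```python
# Illustrative sketch (not the paper's scheme): Lagrangian relaxation of a
# toy constrained quadratic program, showing that dualizing the constraint
# yields an easily computable lower bound on the primal minimum.
#
# Primal:  minimize x^2  subject to  x >= 1   (optimum: x* = 1, p* = 1)
# Dualize: L(x, lam) = x^2 + lam * (1 - x),  with multiplier lam >= 0
# Dual fn: q(lam) = min_x L(x, lam), attained at the stationary point x = lam/2

def dual_value(lam: float) -> float:
    """Dual function q(lam): unconstrained minimum of the Lagrangian."""
    x = lam / 2.0                      # stationary point of L(., lam) in x
    return x * x + lam * (1.0 - x)

# Weak duality: every q(lam) with lam >= 0 lower-bounds the primal optimum.
primal_opt = 1.0
lower_bounds = [dual_value(lam) for lam in (0.0, 1.0, 2.0, 3.0)]
assert all(lb <= primal_opt + 1e-12 for lb in lower_bounds)

# The best bound, max over lam >= 0 of q(lam), is attained here at lam = 2,
# where q(2) = 1: for this convex toy problem the relaxation is tight.
best = max(dual_value(lam / 100.0) for lam in range(0, 501))
print(best)  # prints 1.0
```

For the nonconvex quadratic problems treated in the paper, the dual bound is generally not tight, but it remains a valid lower bound computable in polynomial time, which is the property both relaxation schemes exploit.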
Disciplines :
Computer science
Author, co-author :
Fonteneau, Raphaël ; Université de Liège - ULiège > Department of Electrical Engineering and Computer Science (Montefiore Institute) > Systems and Modeling
Ernst, Damien ; Université de Liège - ULiège > Department of Electrical Engineering and Computer Science (Montefiore Institute) > Smart Grids
Boigelot, Bernard ; Université de Liège - ULiège > Department of Electrical Engineering and Computer Science (Montefiore Institute) > Computer Science
Louveaux, Quentin ; Université de Liège - ULiège > Department of Electrical Engineering and Computer Science (Montefiore Institute) > Systems and Modeling: Discrete Optimization
Language :
English
Title :
Min max generalization for deterministic batch mode reinforcement learning: relaxation schemes
References :
T. Başar and P. Bernhard, H∞-Optimal Control and Related Minimax Design Problems: A Dynamic Game Approach, Vol. 5, Birkhäuser, Boston, 1995.
A. Bemporad and M. Morari, Robust model predictive control: A survey, in Robustness in Identification and Control, Lecture Notes in Control and Inform. Sci. 245, 1999, pp. 207-226.
J.R. Birge and F. Louveaux, Introduction to Stochastic Programming, Springer-Verlag, New York, 1997.
S. Boyd, L. El Ghaoui, E. Feron, and V. Balakrishnan, Linear Matrix Inequalities in System and Control Theory, SIAM Stud. Appl. Math. 15, SIAM, Philadelphia, 1994.
S.J. Bradtke and A.G. Barto, Linear least-squares algorithms for temporal difference learning, Mach. Learn., 22 (1996), pp. 33-57.
L. Busoniu, R. Babuska, B. De Schutter, and D. Ernst, Reinforcement Learning and Dynamic Programming using Function Approximators, CRC Press, Boca Raton, FL, 2010.
E.F. Camacho and C. Bordons, Model Predictive Control, Springer, London, 2004.
A. d'Aspremont and S. Boyd, Relaxations and Randomized Methods for Nonconvex QCQPs, EE392o Class Notes, Stanford University, Stanford, CA, 2003.
B. Defourny, D. Ernst, and L. Wehenkel, Risk-aware decision making and dynamic programming, NIPS-08 Workshop on Model Uncertainty and Risk in Reinforcement Learning, Whistler, Canada, 2008.
E. Delage and S. Mannor, Percentile optimization for Markov decision processes with parameter uncertainty, Oper. Res., 58 (2010), pp. 203-213.
D. Ernst, P. Geurts, and L. Wehenkel, Tree-based batch mode reinforcement learning, J. Mach. Learn. Res., 6 (2005), pp. 503-556.
D. Ernst, M. Glavic, F. Capitanescu, and L. Wehenkel, Reinforcement learning versus model predictive control: A comparison on a power system problem, IEEE Trans. Syst., Man, Cybernet. Part B, 39 (2009), pp. 517-529.
R. Fonteneau, D. Ernst, B. Boigelot, and Q. Louveaux, Min Max Generalization for Two-Stage Deterministic Batch Mode Reinforcement Learning: Relaxation Schemes, Technical report, University of Liège, Liège, Belgium, 2012.
R. Fonteneau, S.A. Murphy, L. Wehenkel, and D. Ernst, Inferring bounds on the performance of a control policy from a sample of trajectories, in Proceedings of the 2009 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL 09), Nashville, TN, IEEE, Piscataway, NJ, 2009, pp. 117-123.
R. Fonteneau, S.A. Murphy, L. Wehenkel, and D. Ernst, A cautious approach to generalization in reinforcement learning, in Agents and Artificial Intelligence: International Conference, ICAART 2010, Valencia, Spain, 2010, Commun. Comp. Inform. Sci. 129, Springer, Heidelberg, 2011, pp. 61-77.
R. Fonteneau, S.A. Murphy, L. Wehenkel, and D. Ernst, Computing Bounds for Kernel-Based Policy Evaluation in Reinforcement Learning, Technical Report, University of Liège, Liège, Belgium, 2010.
R. Fonteneau, S.A. Murphy, L. Wehenkel, and D. Ernst, Towards min max generalization in reinforcement learning, in Agents and Artificial Intelligence: International Conference, ICAART 2010, Valencia, Spain, 2010, Commun. Comp. Inform. Sci. 129, Springer, Heidelberg, 2011, pp. 61-77.
R. Fonteneau, Contributions to Batch Mode Reinforcement Learning, Ph.D. thesis, University of Liège, Liège, Belgium, 2011.
R.M. Freund and J.B. Orlin, On the complexity of four polyhedral set containment problems, Math. Program., 33 (1985), pp. 139-145.
L.P. Hansen and T.J. Sargent, Robust control and model uncertainty, Amer. Econom. Rev., 91 (2001), pp. 60-66.
D. Henrion, S. Tarbouriech, and D. Arzelier, LMI approximations for the radius of the intersection of ellipsoids: Survey, J. Optim. Theory Appl., 108 (2001), pp. 1-28.
J.B. Hiriart-Urruty and C. Lemaréchal, Convex Analysis and Minimization Algorithms: Fundamentals, Grundlehren Math. Wiss. 305, Springer-Verlag, Berlin, 1996.
J.E. Ingersoll, Theory of Financial Decision Making, Rowman and Littlefield, Totowa, NJ, 1987.
S. Koenig, Minimax real-time heuristic search, Artif. Intell., 129 (2001), pp. 165-197.
M.G. Lagoudakis and R. Parr, Least-squares policy iteration, J. Mach. Learn. Res., 4 (2003), pp. 1107-1149.
M.L. Littman, Markov games as a framework for multi-agent reinforcement learning, in Proceedings of the 11th International Conference on Machine Learning (ICML 1994), New Brunswick, NJ, 1994, Morgan Kaufmann, San Francisco, pp. 157-163.
M.L. Littman, A tutorial on partially observable Markov decision processes, J. Math. Psychol., 53 (2009), pp. 119-125.
S. Mannor, D. Simester, P. Sun, and J.N. Tsitsiklis, Bias and variance in value function estimation, in Proceedings of the 21st International Conference on Machine Learning (ICML 2004), Banff, Alberta, Canada, 2004, AAAI Press, Menlo Park, CA, 2004, 72.
S.A. Murphy, Optimal dynamic treatment regimes, J. Roy. Statist. Soc. Ser. B, 65 (2003), pp. 331-366.
S.A. Murphy, An experimental design for the development of adaptive treatment strategies, Statist. Med., 24 (2005), pp. 1455-1481.
A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, Robust stochastic approximation approach to stochastic programming, SIAM J. Optim., 19 (2009), pp. 1574-1609.
Y. Nesterov, H. Wolkowicz, and Y. Ye, Semidefinite programming relaxations of nonconvex quadratic optimization, in Handbook of Semidefinite Programming, Kluwer, Boston, 2000, pp. 361-419.
D. Ormoneit and S. Sen, Kernel-based reinforcement learning, Mach. Learn., 49 (2002), pp. 161-178.
C. Paduraru, D. Precup, and J. Pineau, A framework for computing bounds for the return of a policy, in Ninth European Workshop on Reinforcement Learning (EWRL9), Dagstuhl, Germany, 2011.
C.H. Papadimitriou, On the complexity of integer programming, J. ACM, 28 (1981), pp. 765-768.
P.M. Pardalos and S.A. Vavasis, Quadratic programming with one negative eigenvalue is NP-hard, J. Global Optim., 1 (1991), pp. 15-22.
M. Qian and S.A. Murphy, Performance Guarantees for Individualized Treatment Rules, Technical report 498, Department of Statistics, University of Michigan, Ann Arbor, MI, 2009.
M. Riedmiller, Neural fitted Q iteration - first experiences with a data efficient neural reinforcement learning method, in Proceedings of the 16th European Conference on Machine Learning (ECML 2005), Porto, Portugal, Springer, Berlin, 2005, pp. 317-328.
M. Rovatsou and M. Lagoudakis, Minimax search and reinforcement learning for adversarial tetris, in Proceedings of the Sixth Hellenic Conference on Artificial Intelligence (SETN'10), Athens, Greece, 2010.
P. Scokaert and D. Mayne, Min-max feedback model predictive control for constrained linear systems, IEEE Trans. Automat. Control, 43 (1998), pp. 1136-1142.
A. Shapiro, A dynamic programming approach to adjustable robust optimization, Oper. Res. Lett., 39 (2011), pp. 83-87.
A. Shapiro, Minimax and Risk Averse Multistage Stochastic Programming, Technical report, School of Industrial & Systems Engineering, Georgia Institute of Technology, Atlanta, GA, 2011.
J.F. Sturm, Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones, Optim. Methods Softw., 11 (1999), pp. 625-653.
R.S. Sutton and A.G. Barto, Reinforcement Learning, MIT Press, Cambridge, MA, 1998.
L. Vandenberghe and S. Boyd, Semidefinite programming, SIAM Rev., 38 (1996), pp. 49-95.