Keywords :
reinforcement learning; min max generalization; nonconvex optimization; computational complexity
Abstract :
We study the min max optimization problem introduced in Fonteneau et al. [Towards min max generalization in reinforcement learning, ICAART 2010, Springer, Heidelberg, 2011, pp. 61–77] for computing policies for batch mode reinforcement learning in a deterministic setting with a fixed, finite time horizon. First, we show that the min part of this problem is NP-hard. We then propose two relaxation schemes. The first drops some constraints so that the resulting problem is solvable in polynomial time. The second is a Lagrangian relaxation in which all constraints are dualized; the resulting relaxed problem can also be solved in polynomial time. We prove theoretically and illustrate empirically that both relaxation schemes provide better results than those given in [Fonteneau et al., 2011, as cited above].
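The two relaxation schemes mentioned in the abstract can be illustrated schematically. The sketch below uses generic notation, not the paper's: the symbols f, g_i, lambda_i, p*, and d* are illustrative assumptions, and the display only records why both schemes yield lower bounds on an NP-hard minimum.

% Illustrative sketch only; f, g_i, lambda_i are generic symbols, not the paper's notation.
% Scheme 1: dropping constraints enlarges the feasible set, so the minimum can only decrease.
% Scheme 2: weak duality bounds the primal minimum from below by the Lagrangian dual.
\begin{align*}
p^\star &= \min_{x} f(x) \quad \text{s.t.} \quad g_i(x) \le 0, \; i = 1, \dots, m, \\
\min_{x} \bigl\{ f(x) \;:\; g_i(x) \le 0, \; i \in I \bigr\} &\le p^\star \quad \text{for any } I \subseteq \{1, \dots, m\}, \\
d^\star = \max_{\lambda \ge 0} \, \min_{x} \Bigl\{ f(x) + \sum_{i=1}^{m} \lambda_i \, g_i(x) \Bigr\} &\le p^\star .
\end{align*}

When f and the g_i are quadratic, the dual value d* is itself computable in polynomial time via semidefinite programming (see the Vandenberghe and Boyd, 1996, and Nesterov, Wolkowicz, and Ye, 2000, entries in the bibliography below), which is consistent with the polynomial-time solvability claimed in the abstract.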
Disciplines :
Computer science
Author, co-author :
Fonteneau, Raphaël ; Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Systèmes et modélisation
Ernst, Damien ; Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Smart grids
Boigelot, Bernard ; Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Informatique
Louveaux, Quentin ; Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Système et modélisation : Optimisation discrète
Language :
English
Title :
Min max generalization for deterministic batch mode reinforcement learning: relaxation schemes
Bibliography
T. Başar and P. Bernhard, H∞-Optimal Control and Related Minimax Design Problems: A Dynamic Game Approach, Vol. 5, Birkhäuser, Boston, 1995.
A. Bemporad and M. Morari, Robust model predictive control: A survey, in Robustness in Identification and Control, Lecture Notes in Control and Inform. Sci. 245, 1999, pp. 207-226.
J.R. Birge and F. Louveaux, Introduction to Stochastic Programming, Springer-Verlag, New York, 1997.
S. Boyd, L. El Ghaoui, E. Feron, and V. Balakrishnan, Linear Matrix Inequalities in System and Control Theory, SIAM Stud. Appl. Math. 15, SIAM, Philadelphia, 1994.
S.J. Bradtke and A.G. Barto, Linear least-squares algorithms for temporal difference learning, Mach. Learn., 22 (1996), pp. 33-57.
L. Busoniu, R. Babuska, B. De Schutter, and D. Ernst, Reinforcement Learning and Dynamic Programming using Function Approximators, CRC Press, Boca Raton, FL, 2010.
E.F. Camacho and C. Bordons, Model Predictive Control, Springer, London, 2004.
A. d'Aspremont and S. Boyd, Relaxations and Randomized Methods for Nonconvex QCQPs, EE392o Class Notes, Stanford University, Stanford, CA, 2003.
B. Defourny, D. Ernst, and L. Wehenkel, Risk-aware decision making and dynamic programming, NIPS-08 Workshop on Model Uncertainty and Risk in Reinforcement Learning, Whistler, Canada, 2008.
E. Delage and S. Mannor, Percentile optimization for Markov decision processes with parameter uncertainty, Oper. Res., 58 (2010), pp. 203-213.
D. Ernst, P. Geurts, and L. Wehenkel, Tree-based batch mode reinforcement learning, J. Mach. Learn. Res., 6 (2005), pp. 503-556.
D. Ernst, M. Glavic, F. Capitanescu, and L. Wehenkel, Reinforcement learning versus model predictive control: A comparison on a power system problem, IEEE Trans. Syst., Man, Cybernet. Part B, 39 (2009), pp. 517-529.
R. Fonteneau, D. Ernst, B. Boigelot, and Q. Louveaux, Min Max Generalization for Two-Stage Deterministic Batch Mode Reinforcement Learning: Relaxation Schemes, Technical report, University of Liège, Liège, Belgium, 2012.
R. Fonteneau, S.A. Murphy, L. Wehenkel, and D. Ernst, Inferring bounds on the performance of a control policy from a sample of trajectories, in Proceedings of the 2009 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL 09), Nashville, TN, IEEE, Piscataway, NJ, 2009, pp. 117-123.
R. Fonteneau, S.A. Murphy, L. Wehenkel, and D. Ernst, A cautious approach to generalization in reinforcement learning, in Agents and Artificial Intelligence: International Conference, ICAART 2010, Valencia, Spain, 2010, Commun. Comp. Inform. Sci. 129, Springer, Heidelberg, 2011, pp. 61-77.
R. Fonteneau, S.A. Murphy, L. Wehenkel, and D. Ernst, Computing Bounds for Kernel-Based Policy Evaluation in Reinforcement Learning, Technical Report, University of Liège, Liège, Belgium, 2010.
R. Fonteneau, S.A. Murphy, L. Wehenkel, and D. Ernst, Towards min max generalization in reinforcement learning, in Agents and Artificial Intelligence: International Conference, ICAART 2010, Valencia, Spain, 2010, Commun. Comp. Inform. Sci. 129, Springer, Heidelberg, 2011, pp. 61-77.
R. Fonteneau, Contributions to Batch Mode Reinforcement Learning, Ph.D. thesis, University of Liège, Liège, Belgium, 2011.
R.M. Freund and J.B. Orlin, On the complexity of four polyhedral set containment problems, Math. Program., 33 (1985), pp. 139-145.
L.P. Hansen and T.J. Sargent, Robust control and model uncertainty, Amer. Econom. Rev., 91 (2001), pp. 60-66.
D. Henrion, S. Tarbouriech, and D. Arzelier, LMI approximations for the radius of the intersection of ellipsoids: Survey, J. Optim. Theory Appl., 108 (2001), pp. 1-28.
J.B. Hiriart-Urruty and C. Lemaréchal, Convex Analysis and Minimization Algorithms: Fundamentals, Grundlehren Math. Wiss. 305, Springer-Verlag, Berlin, 1996.
J.E. Ingersoll, Theory of Financial Decision Making, Rowman and Littlefield, Totowa, NJ, 1987.
S. Koenig, Minimax real-time heuristic search, Artif. Intell., 129 (2001), pp. 165-197.
M.G. Lagoudakis and R. Parr, Least-squares policy iteration, J. Mach. Learn. Res., 4 (2003), pp. 1107-1149.
M.L. Littman, Markov games as a framework for multi-agent reinforcement learning, in Proceedings of the 11th International Conference on Machine Learning (ICML 1994), New Brunswick, NJ, 1994, Morgan Kaufmann, San Francisco, pp. 157-163.
M.L. Littman, A tutorial on partially observable Markov decision processes, J. Math. Psychol., 53 (2009), pp. 119-125.
S. Mannor, D. Simester, P. Sun, and J.N. Tsitsiklis, Bias and variance in value function estimation, in Proceedings of the 21st International Conference on Machine Learning (ICML 2004), Banff, Alberta, Canada, AAAI Press, Menlo Park, CA, 2004, p. 72.
S.A. Murphy, Optimal dynamic treatment regimes, J. Roy. Statist. Soc. Ser. B, 65 (2003), pp. 331-366.
S.A. Murphy, An experimental design for the development of adaptive treatment strategies, Statist. Med., 24 (2005), pp. 1455-1481.
A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, Robust stochastic approximation approach to stochastic programming, SIAM J. Optim., 19 (2009), pp. 1574-1609.
Y. Nesterov, H. Wolkowicz, and Y. Ye, Semidefinite programming relaxations of nonconvex quadratic optimization, in Handbook of Semidefinite Programming, Kluwer, Boston, 2000, pp. 361-419.
D. Ormoneit and S. Sen, Kernel-based reinforcement learning, Mach. Learn., 49 (2002), pp. 161-178.
C. Paduraru, D. Precup, and J. Pineau, A framework for computing bounds for the return of a policy, in Ninth European Workshop on Reinforcement Learning (EWRL9), Dagstuhl, Germany, 2011.
C.H. Papadimitriou, On the complexity of integer programming, J. ACM, 28 (1981), pp. 765-768.
P.M. Pardalos and S.A. Vavasis, Quadratic programming with one negative eigenvalue is NP-hard, J. Global Optim., 1 (1991), pp. 15-22.
M. Qian and S.A. Murphy, Performance Guarantees for Individualized Treatment Rules, Technical report 498, Department of Statistics, University of Michigan, Ann Arbor, MI, 2009.
M. Riedmiller, Neural fitted Q iteration - first experiences with a data efficient neural reinforcement learning method, in Proceedings of the 16th European Conference on Machine Learning (ECML 2005), Porto, Portugal, Springer, Berlin, 2005, pp. 317-328.
M. Rovatsou and M. Lagoudakis, Minimax search and reinforcement learning for adversarial Tetris, in Proceedings of the Sixth Hellenic Conference on Artificial Intelligence (SETN'10), Athens, Greece, 2010.
P. Scokaert and D. Mayne, Min-max feedback model predictive control for constrained linear systems, IEEE Trans. Automat. Control, 43 (1998), pp. 1136-1142.
A. Shapiro, A dynamic programming approach to adjustable robust optimization, Oper. Res. Lett., 39 (2011), pp. 83-87.
A. Shapiro, Minimax and Risk Averse Multistage Stochastic Programming, Technical report, School of Industrial & Systems Engineering, Georgia Institute of Technology, Atlanta, GA, 2011.
J.F. Sturm, Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones, Optim. Methods Softw., 11 (1999), pp. 625-653.
R.S. Sutton and A.G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, 1998.
L. Vandenberghe and S. Boyd, Semidefinite programming, SIAM Rev., 38 (1996), pp. 49-95.