Keywords :
reinforcement learning; min max generalization; nonconvex optimization; computational complexity
Abstract :
We study the min max optimization problem introduced in Fonteneau et al. [Towards min max generalization in reinforcement learning, ICAART 2010, Springer, Heidelberg, 2011, pp. 61–77] for computing policies for batch mode reinforcement learning in a deterministic setting with a fixed, finite time horizon. First, we show that the min part of this problem is NP-hard. We then provide two relaxation schemes. The first works by dropping some constraints so as to obtain a problem that is solvable in polynomial time. The second, based on a Lagrangian relaxation in which all constraints are dualized, can also be solved in polynomial time. We prove theoretically, and illustrate empirically, that both relaxation schemes provide better results than those given in [Fonteneau et al., 2011, as cited above].
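The second scheme described in the abstract dualizes all constraints of the min problem. As a generic illustration of why dualization yields usable bounds (weak duality guarantees that every dual value lower-bounds the primal minimum), here is a minimal Python sketch on a toy one-dimensional quadratic program; the problem, names, and numbers are illustrative only and are not taken from the paper:

```python
# Illustrative sketch (not the paper's scheme): Lagrangian relaxation of a
# toy constrained quadratic program, showing that dualizing the constraint
# yields an easily computable lower bound on the primal minimum.
#
# Primal:  minimize x^2  subject to  x >= 1   (optimum: x* = 1, p* = 1)
# Dualize: L(x, lam) = x^2 + lam * (1 - x),  with multiplier lam >= 0
# Dual fn: q(lam) = min_x L(x, lam), attained at the stationary point x = lam/2

def dual_value(lam: float) -> float:
    """Dual function q(lam): unconstrained minimum of the Lagrangian."""
    x = lam / 2.0                      # stationary point of L(., lam) in x
    return x * x + lam * (1.0 - x)

# Weak duality: every q(lam) with lam >= 0 lower-bounds the primal optimum.
primal_opt = 1.0
lower_bounds = [dual_value(lam) for lam in (0.0, 1.0, 2.0, 3.0)]
assert all(lb <= primal_opt + 1e-12 for lb in lower_bounds)

# The best bound, max over lam >= 0 of q(lam), is attained here at lam = 2,
# where q(2) = 1: for this convex toy problem the relaxation is tight.
best = max(dual_value(lam / 100.0) for lam in range(0, 501))
print(best)  # prints 1.0
```

For the nonconvex quadratic problems treated in the paper, the dual bound is generally not tight, but it remains a valid lower bound computable in polynomial time, which is the property both relaxation schemes exploit.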
Disciplines :
Computer science
Author, co-author :
Fonteneau, Raphaël ; Université de Liège - ULiège > Department of Electrical Engineering and Computer Science (Montefiore Institute) > Systems and Modeling
Ernst, Damien ; Université de Liège - ULiège > Department of Electrical Engineering and Computer Science (Montefiore Institute) > Smart Grids
Boigelot, Bernard ; Université de Liège - ULiège > Department of Electrical Engineering and Computer Science (Montefiore Institute) > Computer Science
Louveaux, Quentin ; Université de Liège - ULiège > Department of Electrical Engineering and Computer Science (Montefiore Institute) > Systems and Modeling: Discrete Optimization
Language :
English
Title :
Min max generalization for deterministic batch mode reinforcement learning: relaxation schemes
References :
T. Başar and P. Bernhard, H∞-Optimal Control and Related Minimax Design Problems: A Dynamic Game Approach, Vol. 5, Birkhäuser, Boston, 1995.
A. Bemporad and M. Morari, Robust model predictive control: A survey, in Robustness in Identification and Control, Lecture Notes in Control and Inform. Sci. 245, 1999, pp. 207-226.
J.R. Birge and F. Louveaux, Introduction to Stochastic Programming, Springer-Verlag, New York, 1997.
S. Boyd, L. El Ghaoui, E. Feron, and V. Balakrishnan, Linear Matrix Inequalities in System and Control Theory, SIAM Stud. Appl. Math. 15, SIAM, Philadelphia, 1994.
S.J. Bradtke and A.G. Barto, Linear least-squares algorithms for temporal difference learning, Mach. Learn., 22 (1996), pp. 33-57.
L. Busoniu, R. Babuska, B. De Schutter, and D. Ernst, Reinforcement Learning and Dynamic Programming using Function Approximators, CRC Press, Boca Raton, FL, 2010.
E.F. Camacho and C. Bordons, Model Predictive Control, Springer, London, 2004.
A. d'Aspremont and S. Boyd, Relaxations and Randomized Methods for Nonconvex QCQPs, EE392o Class Notes, Stanford University, Stanford, CA, 2003.
B. Defourny, D. Ernst, and L. Wehenkel, Risk-aware decision making and dynamic programming, NIPS-08 Workshop on Model Uncertainty and Risk in Reinforcement Learning, Whistler, Canada, 2008.
E. Delage and S. Mannor, Percentile optimization for Markov decision processes with parameter uncertainty, Oper. Res., 58 (2010), pp. 203-213.
D. Ernst, P. Geurts, and L. Wehenkel, Tree-based batch mode reinforcement learning, J. Mach. Learn. Res., 6 (2005), pp. 503-556.
D. Ernst, M. Glavic, F. Capitanescu, and L. Wehenkel, Reinforcement learning versus model predictive control: A comparison on a power system problem, IEEE Trans. Syst., Man, Cybernet. Part B, 39 (2009), pp. 517-529.
R. Fonteneau, D. Ernst, B. Boigelot, and Q. Louveaux, Min Max Generalization for Two-Stage Deterministic Batch Mode Reinforcement Learning: Relaxation Schemes, Technical report, University of Liège, Liège, Belgium, 2012.
R. Fonteneau, S.A. Murphy, L. Wehenkel, and D. Ernst, Inferring bounds on the performance of a control policy from a sample of trajectories, in Proceedings of the 2009 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL 09), Nashville, TN, IEEE, Piscataway, NJ, 2009, pp. 117-123.
R. Fonteneau, S.A. Murphy, L. Wehenkel, and D. Ernst, A cautious approach to generalization in reinforcement learning, in Agents and Artificial Intelligence: International Conference, ICAART 2010, Valencia, Spain, 2010, Commun. Comp. Inform. Sci. 129, Springer, Heidelberg, 2011, pp. 61-77.
R. Fonteneau, S.A. Murphy, L. Wehenkel, and D. Ernst, Computing Bounds for Kernel-Based Policy Evaluation in Reinforcement Learning, Technical Report, University of Liège, Liège, Belgium, 2010.
R. Fonteneau, S.A. Murphy, L. Wehenkel, and D. Ernst, Towards min max generalization in reinforcement learning, in Agents and Artificial Intelligence: International Conference, ICAART 2010, Valencia, Spain, 2010, Commun. Comp. Inform. Sci. 129, Springer, Heidelberg, 2011, pp. 61-77.
R. Fonteneau, Contributions to Batch Mode Reinforcement Learning, Ph.D. thesis, University of Liège, Liège, Belgium, 2011.
R.M. Freund and J.B. Orlin, On the complexity of four polyhedral set containment problems, Math. Program., 33 (1985), pp. 139-145.
L.P. Hansen and T.J. Sargent, Robust control and model uncertainty, Amer. Econom. Rev., 91 (2001), pp. 60-66.
D. Henrion, S. Tarbouriech, and D. Arzelier, LMI approximations for the radius of the intersection of ellipsoids: Survey, J. Optim. Theory Appl., 108 (2001), pp. 1-28.
J.B. Hiriart-Urruty and C. Lemaréchal, Convex Analysis and Minimization Algorithms: Fundamentals, Grundlehren Math. Wiss. 305, Springer-Verlag, Berlin, 1996.
J.E. Ingersoll, Theory of Financial Decision Making, Rowman and Littlefield, Totowa, NJ, 1987.
S. Koenig, Minimax real-time heuristic search, Artif. Intell., 129 (2001), pp. 165-197.
M.G. Lagoudakis and R. Parr, Least-squares policy iteration, J. Mach. Learn. Res., 4 (2003), pp. 1107-1149.
M.L. Littman, Markov games as a framework for multi-agent reinforcement learning, in Proceedings of the 11th International Conference on Machine Learning (ICML 1994), New Brunswick, NJ, 1994, Morgan Kaufmann, San Francisco, pp. 157-163.
M.L. Littman, A tutorial on partially observable Markov decision processes, J. Math. Psychol., 53 (2009), pp. 119-125.
S. Mannor, D. Simester, P. Sun, and J.N. Tsitsiklis, Bias and variance in value function estimation, in Proceedings of the 21st International Conference on Machine Learning (ICML 2004), Banff, Alberta, Canada, 2004, AAAI Press, Menlo Park, CA, 2004, 72.
S.A. Murphy, Optimal dynamic treatment regimes, J. Roy. Statist. Soc. Ser. B, 65 (2003), pp. 331-366.
S.A. Murphy, An experimental design for the development of adaptive treatment strategies, Statist. Med., 24 (2005), pp. 1455-1481.
A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, Robust stochastic approximation approach to stochastic programming, SIAM J. Optim., 19 (2009), pp. 1574-1609.
Y. Nesterov, H. Wolkowicz, and Y. Ye, Semidefinite programming relaxations of nonconvex quadratic optimization, in Handbook of Semidefinite Programming, Kluwer, Boston, 2000, pp. 361-419.
D. Ormoneit and S. Sen, Kernel-based reinforcement learning, Mach. Learn., 49 (2002), pp. 161-178.
C. Paduraru, D. Precup, and J. Pineau, A framework for computing bounds for the return of a policy, in Ninth European Workshop on Reinforcement Learning (EWRL9), Dagstuhl, Germany, 2011.
C.H. Papadimitriou, On the complexity of integer programming, J. ACM, 28 (1981), pp. 765-768.
P.M. Pardalos and S.A. Vavasis, Quadratic programming with one negative eigenvalue is NP-hard, J. Global Optim., 1 (1991), pp. 15-22.
M. Qian and S.A. Murphy, Performance Guarantees for Individualized Treatment Rules, Technical report 498, Department of Statistics, University of Michigan, Ann Arbor, MI, 2009.
M. Riedmiller, Neural fitted Q iteration - first experiences with a data efficient neural reinforcement learning method, in Proceedings of the 16th European Conference on Machine Learning (ECML 2005), Porto, Portugal, Springer, Berlin, 2005, pp. 317-328.
M. Rovatsou and M. Lagoudakis, Minimax search and reinforcement learning for adversarial tetris, in Proceedings of the Sixth Hellenic Conference on Artificial Intelligence (SETN'10), Athens, Greece, 2010.
P. Scokaert and D. Mayne, Min-max feedback model predictive control for constrained linear systems, IEEE Trans. Automat. Control, 43 (1998), pp. 1136-1142.
A. Shapiro, A dynamic programming approach to adjustable robust optimization, Oper. Res. Lett., 39 (2011), pp. 83-87.
A. Shapiro, Minimax and Risk Averse Multistage Stochastic Programming, Technical report, School of Industrial & Systems Engineering, Georgia Institute of Technology, Atlanta, GA, 2011.
J.F. Sturm, Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones, Optim. Methods Softw., 11 (1999), pp. 625-653.
R.S. Sutton and A.G. Barto, Reinforcement Learning, MIT Press, Cambridge, MA, 1998.
L. Vandenberghe and S. Boyd, Semidefinite programming, SIAM Rev., 38 (1996), pp. 49-95.