Åström, K. J., Klein, R. E., and Lennartsson, A. (2005). Bicycle dynamics and control. IEEE Control Systems Magazine, 25(4):26–47.
Abonyi, J., Babuška, R., and Szeifert, F. (2001). Fuzzy modeling with multivariate membership functions: Gray-box identification and control design. IEEE Transactions on Systems, Man, and Cybernetics—Part B: Cybernetics, 31(5):755–767.
Adams, B., Banks, H., Kwon, H.-D., and Tran, H. (2004). Dynamic multidrug therapies for HIV: Optimal and STI control approaches. Mathematical Biosciences and Engineering, 1(2):223–241.
Antos, A., Munos, R., and Szepesvári, Cs. (2008a). Fitted Q-iteration in continuous action-space MDPs. In Platt, J. C., Koller, D., Singer, Y., and Roweis, S. T., editors, Advances in Neural Information Processing Systems 20, pages 9–16. MIT Press.
Antos, A., Szepesvári, Cs., and Munos, R. (2008b). Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89–129.
Ascher, U. and Petzold, L. (1998). Computer Methods for Ordinary Differential Equations and Differential-Algebraic Equations. Society for Industrial and Applied Mathematics (SIAM).
Audibert, J.-Y., Munos, R., and Szepesvári, Cs. (2007). Tuning bandit algorithms in stochastic environments. In Proceedings 18th International Conference on Algorithmic Learning Theory (ALT-07), pages 150–165, Sendai, Japan.
Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2–3):235–256.
Auer, P., Jaksch, T., and Ortner, R. (2009). Near-optimal regret bounds for reinforcement learning. In Koller, D., Schuurmans, D., Bengio, Y., and Bottou, L., editors, Advances in Neural Information Processing Systems 21, pages 89–96. MIT Press.
Baird, L. (1995). Residual algorithms: Reinforcement learning with function approximation. In Proceedings 12th International Conference on Machine Learning (ICML-95), pages 30–37, Tahoe City, US.
Balakrishnan, S., Ding, J., and Lewis, F. (2008). Issues on stability of ADP feedback controllers for dynamical systems. IEEE Transactions on Systems, Man, and Cybernetics—Part B: Cybernetics, 38(4):913–917.
Barash, D. (1999). A genetic search in policy space for solving Markov decision processes. In AAAI Spring Symposium on Search Techniques for Problem Solving under Uncertainty and Incomplete Information, Palo Alto, US.
Barto, A. and Mahadevan, S. (2003). Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems: Theory and Applications, 13(4):341–379.
Barto, A. G., Sutton, R. S., and Anderson, C. W. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13(5):833–846.
Berenji, H. R. and Khedkar, P. (1992). Learning and tuning fuzzy logic controllers through reinforcements. IEEE Transactions on Neural Networks, 3(5):724–740.
Berenji, H. R. and Vengerov, D. (2003). A convergent actor-critic-based FRL algorithm with application to power management of wireless transmitters. IEEE Transactions on Fuzzy Systems, 11(4):478–485.
Bertsekas, D. P. (2005a). Dynamic Programming and Optimal Control, volume 1. Athena Scientific, 3rd edition.
Bertsekas, D. P. (2005b). Dynamic programming and suboptimal control: A survey from ADP to MPC. European Journal of Control, 11(4–5):310–334. Special issue for the CDC-ECC-05 in Seville, Spain.
Bertsekas, D. P. (2007). Dynamic Programming and Optimal Control, volume 2. Athena Scientific, 3rd edition.
Bertsekas, D. P., Borkar, V., and Nedić, A. (2004). Improved temporal difference methods with linear function approximation. In Si, J., Barto, A., Powell, W., and Wunsch, D., editors, Handbook of Learning and Approximate Dynamic Programming. IEEE Press.
Bertsekas, D. P. and Castañon, D. A. (1989). Adaptive aggregation methods for infinite horizon dynamic programming. IEEE Transactions on Automatic Control, 34(6):589–598.
Bertsekas, D. P. and Ioffe, S. (1996). Temporal differences-based policy iteration and applications in neuro-dynamic programming. Technical Report LIDS-P-2349, Massachusetts Institute of Technology, Cambridge, US. Available at http://web.mit.edu/dimitrib/www/Tempdif.pdf.
Bertsekas, D. P. and Shreve, S. E. (1978). Stochastic Optimal Control: The Discrete Time Case. Academic Press.
Bertsekas, D. P. and Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming. Athena Scientific.
Bertsekas, D. P. and Yu, H. (2009). Basis function adaptation methods for cost approximation in MDP. In Proceedings 2009 IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL-09), pages 74–81, Nashville, US.
Bethke, B., How, J., and Ozdaglar, A. (2008). Approximate dynamic programming using support vector regression. In Proceedings 47th IEEE Conference on Decision and Control (CDC-08), pages 3811–3816, Cancun, Mexico.
Bhatnagar, S., Sutton, R., Ghavamzadeh, M., and Lee, M. (2009). Natural actor-critic algorithms. Automatica, 45(11):2471–2482.
Birge, J. R. and Louveaux, F. (1997). Introduction to Stochastic Programming. Springer.
Borkar, V. (2005). An actor-critic algorithm for constrained Markov decision processes. Systems & Control Letters, 54(3):207–213.
Boubezoul, A., Paris, S., and Ouladsine, M. (2008). Application of the cross entropy method to the GLVQ algorithm. Pattern Recognition, 41(10):3173–3178.
Bradtke, S. J. and Barto, A. G. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1–3):33–57.
Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32.
Breiman, L., Friedman, J., Stone, C. J., and Olshen, R. (1984). Classification and Regression Trees. Wadsworth International.
Brown, M. and Harris, C. (1994). Neurofuzzy Adaptive Modeling and Control. Prentice Hall.
Bubeck, S., Munos, R., Stoltz, G., and Szepesvári, Cs. (2009). Online optimization in X-armed bandits. In Koller, D., Schuurmans, D., Bengio, Y., and Bottou, L., editors, Advances in Neural Information Processing Systems 21, pages 201–208. MIT Press.
Buşoniu, L., Babuška, R., and De Schutter, B. (2008a). A comprehensive survey of multi-agent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics—Part C: Applications and Reviews, 38(2):156–172.
Buşoniu, L., Ernst, D., De Schutter, B., and Babuška, R. (2007). Fuzzy approximation for convergent model-based reinforcement learning. In Proceedings 2007 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE-07), pages 968–973, London, UK.
Buşoniu, L., Ernst, D., De Schutter, B., and Babuška, R. (2008b). Consistency of fuzzy model-based reinforcement learning. In Proceedings 2008 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE-08), pages 518–524, Hong Kong.
Buşoniu, L., Ernst, D., De Schutter, B., and Babuška, R. (2008c). Continuous-state reinforcement learning with fuzzy approximation. In Tuyls, K., Nowé, A., Guessoum, Z., and Kudenko, D., editors, Adaptive Agents and Multi-Agent Systems III, volume 4865 of Lecture Notes in Computer Science, pages 27–43. Springer.
Buşoniu, L., Ernst, D., De Schutter, B., and Babuška, R. (2008d). Fuzzy partition optimization for approximate fuzzy Q-iteration. In Proceedings 17th IFAC World Congress (IFAC-08), pages 5629–5634, Seoul, Korea.
Buşoniu, L., Ernst, D., De Schutter, B., and Babuška, R. (2009). Policy search with cross-entropy optimization of basis functions. In Proceedings 2009 IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL-09), pages 153–160, Nashville, US.
Camacho, E. F. and Bordons, C. (2004). Model Predictive Control. Springer-Verlag.
Cao, X.-R. (2007). Stochastic Learning and Optimization: A Sensitivity-Based Approach. Springer.
Chang, H. S., Fu, M. C., Hu, J., and Marcus, S. I. (2007). Simulation-Based Algorithms for Markov Decision Processes. Springer.
Chepuri, K. and Homem-de-Mello, T. (2005). Solving the vehicle routing problem with stochastic demands using the cross-entropy method. Annals of Operations Research, 134(1):153–181.
Chin, H. H. and Jafari, A. A. (1998). Genetic algorithm methods for solving the best stationary policy of finite Markov decision processes. In Proceedings 30th Southeastern Symposium on System Theory, pages 538–543, Morgantown, US.
Chow, C.-S. and Tsitsiklis, J. N. (1991). An optimal one-way multigrid algorithm for discrete-time stochastic control. IEEE Transactions on Automatic Control, 36(8):898–914.
Costa, A., Jones, O. D., and Kroese, D. (2007). Convergence properties of the cross-entropy method for discrete optimization. Operations Research Letters, 35(5):573–580.
Cristianini, N. and Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press.
Davies, S. (1997). Multidimensional triangulation and interpolation for reinforcement learning. In Mozer, M. C., Jordan, M. I., and Petsche, T., editors, Advances in Neural Information Processing Systems 9, pages 1005–1011. MIT Press.
Defourny, B., Ernst, D., and Wehenkel, L. (2008). Lazy planning under uncertainties by optimizing decisions on an ensemble of incomplete disturbance trees. In Girgin, S., Loth, M., Munos, R., Preux, P., and Ryabko, D., editors, Recent Advances in Reinforcement Learning, volume 5323 of Lecture Notes in Computer Science, pages 1–14. Springer.
Defourny, B., Ernst, D., and Wehenkel, L. (2009). Planning under uncertainty, ensembles of disturbance trees and kernelized discrete action spaces. In Proceedings 2009 IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL-09), pages 145–152, Nashville, US.
Deisenroth, M. P., Rasmussen, C. E., and Peters, J. (2009). Gaussian process dynamic programming. Neurocomputing, 72(7–9):1508–1524.
Dietterich, T. G. (2000). Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13:227–303.
Dimitrakakis, C. and Lagoudakis, M. (2008). Rollout sampling approximate policy iteration. Machine Learning, 72(3):157–171.
Dorigo, M. and Colombetti, M. (1994). Robot shaping: Developing autonomous agents through learning. Artificial Intelligence, 71(2):321–370.
Doya, K. (2000). Reinforcement learning in continuous time and space. Neural Computation, 12(1):219–245.
Duda, R. O., Hart, P. E., and Stork, D. G. (2000). Pattern Classification. Wiley, 2nd edition.
Dupačová, J., Consigli, G., and Wallace, S. W. (2000). Scenarios for multistage stochastic programs. Annals of Operations Research, 100(1–4):25–53.
Edelman, A. and Murakami, H. (1995). Polynomial roots from companion matrix eigenvalues. Mathematics of Computation, 64(210):763–776.
Engel, Y., Mannor, S., and Meir, R. (2003). Bayes meets Bellman: The Gaussian process approach to temporal difference learning. In Proceedings 20th International Conference on Machine Learning (ICML-03), pages 154–161, Washington, US.
Engel, Y., Mannor, S., and Meir, R. (2005). Reinforcement learning with Gaussian processes. In Proceedings 22nd International Conference on Machine Learning (ICML-05), pages 201–208, Bonn, Germany.
Ernst, D. (2005). Selecting concise sets of samples for a reinforcement learning agent. In Proceedings 3rd International Conference on Computational Intelligence, Robotics and Autonomous Systems (CIRAS-05), Singapore.
Ernst, D., Geurts, P., and Wehenkel, L. (2005). Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503–556.
Ernst, D., Glavic, M., Capitanescu, F., and Wehenkel, L. (2009). Reinforcement learning versus model predictive control: A comparison on a power system problem. IEEE Transactions on Systems, Man, and Cybernetics—Part B: Cybernetics, 39(2):517–529.
Ernst, D., Glavic, M., Geurts, P., and Wehenkel, L. (2006a). Approximate value iteration in the reinforcement learning context. Application to electrical power system control. International Journal of Emerging Electric Power Systems, 3(1). 37 pages.
Ernst, D., Glavic, M., Stan, G.-B., Mannor, S., and Wehenkel, L. (2007). The cross-entropy method for power system combinatorial optimization problems. In Proceedings of Power Tech 2007, pages 1290–1295, Lausanne, Switzerland.
Ernst, D., Stan, G.-B., Gonçalves, J., and Wehenkel, L. (2006b). Clinical data based optimal STI strategies for HIV: A reinforcement learning approach. In Proceedings 45th IEEE Conference on Decision and Control (CDC-06), pages 667–672, San Diego, US.
Fantuzzi, C. and Rovatti, R. (1996). On the approximation capabilities of the homogeneous Takagi-Sugeno model. In Proceedings 5th IEEE International Conference on Fuzzy Systems (FUZZ-IEEE-96), pages 1067–1072, New Orleans, US.
Farahmand, A. M., Ghavamzadeh, M., Szepesvári, Cs., and Mannor, S. (2009a). Regularized fitted Q-iteration for planning in continuous-space Markovian decision problems. In Proceedings 2009 American Control Conference (ACC-09), pages 725–730, St. Louis, US.
Farahmand, A. M., Ghavamzadeh, M., Szepesvári, Cs., and Mannor, S. (2009b). Regularized policy iteration. In Koller, D., Schuurmans, D., Bengio, Y., and Bottou, L., editors, Advances in Neural Information Processing Systems 21, pages 441–448. MIT Press.
Feldbaum, A. (1961). Dual control theory, Parts I and II. Automation and Remote Control, 21(9):874–880.
Franklin, G. F., Powell, J. D., and Workman, M. L. (1998). Digital Control of Dynamic Systems. Prentice Hall, 3rd edition.
Geramifard, A., Bowling, M., Zinkevich, M., and Sutton, R. S. (2007). iLSTD: Eligibility traces & convergence analysis. In Schölkopf, B., Platt, J., and Hofmann, T., editors, Advances in Neural Information Processing Systems 19, pages 440–448. MIT Press.
Geramifard, A., Bowling, M. H., and Sutton, R. S. (2006). Incremental least-squares temporal difference learning. In Proceedings 21st National Conference on Artificial Intelligence and 18th Innovative Applications of Artificial Intelligence Conference (AAAI-06), pages 356–361, Boston, US.
Geurts, P., Ernst, D., and Wehenkel, L. (2006). Extremely randomized trees. Machine Learning, 63(1):3–42.
Ghavamzadeh, M. and Mahadevan, S. (2007). Hierarchical average reward reinforcement learning. Journal of Machine Learning Research, 8:2629–2669.
Glorennec, P. Y. (2000). Reinforcement learning: An overview. In Proceedings European Symposium on Intelligent Techniques (ESIT-00), pages 17–35, Aachen, Germany.
Glover, F. and Laguna, M. (1997). Tabu Search. Kluwer.
Goldberg, D. E. (1989). Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley.
Gomez, F. J., Schmidhuber, J., and Miikkulainen, R. (2006). Efficient non-linear control through neuroevolution. In Proceedings 17th European Conference on Machine Learning (ECML-06), volume 4212 of Lecture Notes in Computer Science, pages 654–662, Berlin, Germany.
Gonzalez, R. L. and Rofman, E. (1985). On deterministic control problems: An approximation procedure for the optimal cost I. The stationary problem. SIAM Journal on Control and Optimization, 23(2):242–266.
Gordon, G. (1995). Stable function approximation in dynamic programming. In Proceedings 12th International Conference on Machine Learning (ICML-95), pages 261–268, Tahoe City, US.
Gordon, G. J. (2001). Reinforcement learning with function approximation converges to a region. In Leen, T. K., Dietterich, T. G., and Tresp, V., editors, Advances in Neural Information Processing Systems 13, pages 1040–1046. MIT Press.
Grüne, L. (2004). Error estimation and adaptive discretization for the discrete stochastic Hamilton-Jacobi-Bellman equation. Numerische Mathematik, 99(1):85–112.
Hassoun, M. (1995). Fundamentals of Artificial Neural Networks. MIT Press.
Hengst, B. (2002). Discovering hierarchy in reinforcement learning with HEXQ. In Proceedings 19th International Conference on Machine Learning (ICML-02), pages 243–250, Sydney, Australia.
Horiuchi, T., Fujino, A., Katai, O., and Sawaragi, T. (1996). Fuzzy interpolation-based Q-learning with continuous states and actions. In Proceedings 5th IEEE International Conference on Fuzzy Systems (FUZZ-IEEE-96), pages 594–600, New Orleans, US.
Hren, J.-F. and Munos, R. (2008). Optimistic planning of deterministic systems. In Girgin, S., Loth, M., Munos, R., Preux, P., and Ryabko, D., editors, Recent Advances in Reinforcement Learning, volume 5323 of Lecture Notes in Computer Science, pages 151–164. Springer.
Istratescu, V. I. (2002). Fixed Point Theory: An Introduction. Springer.
Jaakkola, T., Jordan, M. I., and Singh, S. P. (1994). On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6(6):1185–1201.
Jodogne, S., Briquet, C., and Piater, J. H. (2006). Approximate policy iteration for closed-loop learning of visual tasks. In Proceedings 17th European Conference on Machine Learning (ECML-06), volume 4212 of Lecture Notes in Computer Science, pages 210–221, Berlin, Germany.
Jones, D. R. (2009). DIRECT global optimization algorithm. In Floudas, C. A. and Pardalos, P. M., editors, Encyclopedia of Optimization, pages 725–735. Springer.
Jouffe, L. (1998). Fuzzy inference system learning by reinforcement methods. IEEE Transactions on Systems, Man, and Cybernetics—Part C: Applications and Reviews, 28(3):338–355.
Jung, T. and Polani, D. (2007a). Kernelizing LSPE(λ). In Proceedings 2007 IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning (ADPRL-07), pages 338–345, Honolulu, US.
Jung, T. and Polani, D. (2007b). Learning RoboCup-Keepaway with kernels. In Gaussian Processes in Practice, volume 1 of JMLR Workshop and Conference Proceedings, pages 33–57.
Jung, T. and Stone, P. (2009). Feature selection for value function approximation using Bayesian model selection. In Machine Learning and Knowledge Discovery in Databases, European Conference (ECML-PKDD-09), volume 5781 of Lecture Notes in Computer Science, pages 660–675, Bled, Slovenia.
Jung, T. and Uthmann, T. (2004). Experiments in value function approximation with sparse support vector regression. In Proceedings 15th European Conference on Machine Learning (ECML-04), volume 3201 of Lecture Notes in Artificial Intelligence, pages 180–191, Pisa, Italy.
Kaelbling, L. P. (1993). Learning in Embedded Systems. MIT Press.
Kaelbling, L. P., Littman, M. L., and Cassandra, A. R. (1998). Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1–2):99–134.
Kaelbling, L. P., Littman, M. L., and Moore, A. W. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285.
Kakade, S. (2001). A natural policy gradient. In Dietterich, T. G., Becker, S., and Ghahramani, Z., editors, Advances in Neural Information Processing Systems 14, pages 1531–1538. MIT Press.
Kalyanakrishnan, S. and Stone, P. (2007). Batch reinforcement learning in a complex domain. In Proceedings 6th International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS-07), pages 650–657, Honolulu, US.
Keller, P. W., Mannor, S., and Precup, D. (2006). Automatic basis function construction for approximate dynamic programming and reinforcement learning. In Proceedings 23rd International Conference on Machine Learning (ICML-06), pages 449–456, Pittsburgh, US.
Khalil, H. K. (2002). Nonlinear Systems. Prentice Hall, 3rd edition.
Kirk, D. E. (2004). Optimal Control Theory: An Introduction. Dover Publications.
Klir, G. J. and Yuan, B. (1995). Fuzzy Sets and Fuzzy Logic: Theory and Applications. Prentice Hall.
Knuth, D. E. (1976). Big Omicron and big Omega and big Theta. SIGACT News, 8(2):18–24.
Kolter, J. Z. and Ng, A. (2009). Regularization and feature selection in least-squares temporal difference learning. In Proceedings 26th International Conference on Machine Learning (ICML-09), pages 521–528, Montreal, Canada.
Konda, V. (2002). Actor-Critic Algorithms. PhD thesis, Massachusetts Institute of Technology, Cambridge, US.
Konda, V. R. and Tsitsiklis, J. N. (2000). Actor-critic algorithms. In Solla, S. A., Leen, T. K., and Müller, K.-R., editors, Advances in Neural Information Processing Systems 12, pages 1008–1014. MIT Press.
Konda, V. R. and Tsitsiklis, J. N. (2003). On actor-critic algorithms. SIAM Journal on Control and Optimization, 42(4):1143–1166.
Kruse, R., Gebhardt, J. E., and Klawonn, F. (1994). Foundations of Fuzzy Systems. Wiley.
Lagoudakis, M., Parr, R., and Littman, M. (2002). Least-squares methods in reinforcement learning for control. In Methods and Applications of Artificial Intelligence, volume 2308 of Lecture Notes in Artificial Intelligence, pages 249–260. Springer.
Lagoudakis, M. G. and Parr, R. (2003a). Least-squares policy iteration. Journal of Machine Learning Research, 4:1107–1149.
Lagoudakis, M. G. and Parr, R. (2003b). Reinforcement learning as classification: Leveraging modern classifiers. In Proceedings 20th International Conference on Machine Learning (ICML-03), pages 424–431, Washington, US.
Levine, W. S., editor (1996). The Control Handbook. CRC Press.
Lewis, R. M. and Torczon, V. (2000). Pattern search algorithms for linearly constrained minimization. SIAM Journal on Optimization, 10(3):917–941.
Li, L., Littman, M. L., and Mansley, C. R. (2009). Online exploration in least-squares policy iteration. In Proceedings 8th International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS-09), volume 2, pages 733–739, Budapest, Hungary.
Lin, C.-K. (2003). A reinforcement learning adaptive fuzzy controller for robots. Fuzzy Sets and Systems, 137(3):339–352.
Lin, L.-J. (1992). Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3–4):293–321. Special issue on reinforcement learning.
Liu, D., Javaherian, H., Kovalenko, O., and Huang, T. (2008). Adaptive critic learning techniques for engine torque and air-fuel ratio control. IEEE Transactions on Systems, Man, and Cybernetics—Part B: Cybernetics, 38(4):988–993.
Lovejoy, W. S. (1991). Computationally feasible bounds for partially observed Markov decision processes. Operations Research, 39(1):162–175.
Maciejowski, J. M. (2002). Predictive Control with Constraints. Prentice Hall.
Madani, O. (2002). On policy iteration as a Newton’s method and polynomial policy iteration algorithms. In Proceedings 18th National Conference on Artificial Intelligence and 14th Conference on Innovative Applications of Artificial Intelligence AAAI/IAAI-02, pages 273–278, Edmonton, Canada.
Mahadevan, S. (2005). Samuel meets Amarel: Automating value function approximation using global state space analysis. In Proceedings 20th National Conference on Artificial Intelligence and the 17th Innovative Applications of Artificial Intelligence Conference (AAAI-05), pages 1000–1005, Pittsburgh, US.
Mahadevan, S. and Maggioni, M. (2007). Proto-value functions: A Laplacian framework for learning representation and control in Markov decision processes. Journal of Machine Learning Research, 8:2169–2231.
Mamdani, E. (1977). Application of fuzzy logic to approximate reasoning using linguistic synthesis. IEEE Transactions on Computers, 26(12):1182–1191.
Mannor, S., Rubinstein, R. Y., and Gat, Y. (2003). The cross-entropy method for fast policy search. In Proceedings 20th International Conference on Machine Learning (ICML-03), pages 512–519, Washington, US.
Marbach, P. and Tsitsiklis, J. N. (2003). Approximate gradient methods in policy-space optimization of Markov reward processes. Discrete Event Dynamic Systems: Theory and Applications, 13(1–2):111–148.
Matarić, M. J. (1997). Reinforcement learning in the multi-robot domain. Autonomous Robots, 4(1):73–83.
Matheny, M. E., Resnic, F. S., Arora, N., and Ohno-Machado, L. (2007). Effects of SVM parameter optimization on discrimination and calibration for post-procedural PCI mortality. Journal of Biomedical Informatics, 40(6):688–697.
Melo, F. S., Meyn, S. P., and Ribeiro, M. I. (2008). An analysis of reinforcement learning with function approximation. In Proceedings 25th International Conference on Machine Learning (ICML-08), pages 664–671, Helsinki, Finland.
Menache, I., Mannor, S., and Shimkin, N. (2005). Basis function adaptation in temporal difference reinforcement learning. Annals of Operations Research, 134(1):215–238.
Millán, J. d. R., Posenato, D., and Dedieu, E. (2002). Continuous-action Q-learning. Machine Learning, 49(2–3):247–265.
Moore, A. W. and Atkeson, C. G. (1995). The parti-game algorithm for variable resolution reinforcement learning in multidimensional state-spaces. Machine Learning, 21(3):199–233.
Morris, C. (1982). Natural exponential families with quadratic variance functions. Annals of Statistics, 10(1):65–80.
Munos, R. (1997). Finite-element methods with local triangulation refinement for continuous reinforcement learning problems. In Proceedings 9th European Conference on Machine Learning (ECML-97), volume 1224 of Lecture Notes in Artificial Intelligence, pages 170–182, Prague, Czech Republic.
Munos, R. (2006). Policy gradient in continuous time. Journal of Machine Learning Research, 7:771–791.
Munos, R. and Moore, A. (2002). Variable-resolution discretization in optimal control. Machine Learning, 49(2–3):291–323.
Munos, R. and Szepesvári, Cs. (2008). Finite time bounds for fitted value iteration. Journal of Machine Learning Research, 9:815–857.
Murphy, S. (2005). A generalization error for Q-learning. Journal of Machine Learning Research, 6:1073–1097.
Nakamura, Y., Mori, T., Sato, M., and Ishii, S. (2007). Reinforcement learning for a biped robot based on a CPG-actor-critic method. Neural Networks, 20(6):723–735.
Nedić, A. and Bertsekas, D. P. (2003). Least-squares policy evaluation algorithms with linear function approximation. Discrete Event Dynamic Systems: Theory and Applications, 13(1–2):79–110.
Ng, A. Y., Harada, D., and Russell, S. (1999). Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings 16th International Conference on Machine Learning (ICML-99), pages 278–287, Bled, Slovenia.
Ng, A. Y. and Jordan, M. I. (2000). PEGASUS: A policy search method for large MDPs and POMDPs. In Proceedings 16th Conference on Uncertainty in Artificial Intelligence (UAI-00), pages 406–415, Palo Alto, US.
Nocedal, J. and Wright, S. J. (2006). Numerical Optimization. Springer-Verlag, 2nd edition.
Ormoneit, D. and Sen, S. (2002). Kernel-based reinforcement learning. Machine Learning, 49(2–3):161–178.
Panait, L. and Luke, S. (2005). Cooperative multi-agent learning: The state of the art. Autonomous Agents and Multi-Agent Systems, 11(3):387–434.
Parr, R., Li, L., Taylor, G., Painter-Wakefield, C., and Littman, M. (2008). An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. In Proceedings 25th International Conference on Machine Learning (ICML-08), pages 752–759, Helsinki, Finland.
Pazis, J. and Lagoudakis, M. (2009). Binary action search for learning continuous-action control policies. In Proceedings 26th International Conference on Machine Learning (ICML-09), pages 793–800, Montreal, Canada.
Pérez-Uribe, A. (2001). Using a time-delay actor-critic neural architecture with dopamine-like reinforcement signal for learning in autonomous robots. In Wermter, S., Austin, J., and Willshaw, D. J., editors, Emergent Neural Computational Architectures Based on Neuroscience, volume 2036 of Lecture Notes in Computer Science, pages 522–533. Springer.
Perkins, T. and Barto, A. (2002). Lyapunov design for safe reinforcement learning. Journal of Machine Learning Research, 3:803–832.
Peters, J. and Schaal, S. (2008). Natural actor-critic. Neurocomputing, 71(7–9):1180–1190.
Pineau, J., Gordon, G. J., and Thrun, S. (2006). Anytime point-based approximations for large POMDPs. Journal of Artificial Intelligence Research, 27:335–380.
Porta, J. M., Vlassis, N., Spaan, M. T., and Poupart, P. (2006). Point-based value iteration for continuous POMDPs. Journal of Machine Learning Research, 7:2329–2367.
Powell, W. B. (2007). Approximate Dynamic Programming: Solving the Curses of Dimensionality. Wiley.
Press, W. H., Flannery, B. P., Teukolsky, S. A., and Vetterling, W. T. (1986). Numerical Recipes: The Art of Scientific Computing. Cambridge University Press.
Prokhorov, D. and Wunsch, D. C., II (1997). Adaptive critic designs. IEEE Transactions on Neural Networks, 8(5):997–1007.
Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley.
Randløv, J. and Alstrøm, P. (1998). Learning to drive a bicycle using reinforcement learning and shaping. In Proceedings 15th International Conference on Machine Learning (ICML-98), pages 463–471, Madison, US.
Rasmussen, C. E. and Kuss, M. (2004). Gaussian processes in reinforcement learning. In Thrun, S., Saul, L. K., and Schölkopf, B., editors, Advances in Neural Information Processing Systems 16. MIT Press.
Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press.
Ratitch, B. and Precup, D. (2004). Sparse distributed memories for on-line value-based reinforcement learning. In Proceedings 15th European Conference on Machine Learning (ECML-04), volume 3201 of Lecture Notes in Computer Science, pages 347–358, Pisa, Italy.
Reynolds, S. I. (2000). Adaptive resolution model-free reinforcement learning: Decision boundary partitioning. In Proceedings 17th International Conference on Machine Learning (ICML-00), pages 783–790, Stanford University, US.
Riedmiller, M. (2005). Neural fitted Q-iteration – first experiences with a data efficient neural reinforcement learning method. In Proceedings 16th European Conference on Machine Learning (ECML-05), volume 3720 of Lecture Notes in Computer Science, pages 317–328, Porto, Portugal.
Riedmiller, M., Peters, J., and Schaal, S. (2007). Evaluation of policy gradient methods and variants on the cart-pole benchmark. In Proceedings 2007 IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning (ADPRL-07), pages 254–261, Honolulu, US.
Rubinstein, R. Y. and Kroese, D. P. (2004). The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation, and Machine Learning. Springer.
Rummery, G. A. and Niranjan, M. (1994). On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR166, Engineering Department, Cambridge University, UK. Available at http://mi.eng.cam.ac.uk/reports/svr-ftp/rummery_tr166.ps.Z.
Russell, S. and Norvig, P. (2003). Artificial Intelligence: A Modern Approach. Prentice Hall, 2nd edition.
Russell, S. J. and Zimdars, A. (2003). Q-decomposition for reinforcement learning agents. In Proceedings 20th International Conference on Machine Learning (ICML-03), pages 656–663, Washington, US.
Santamaria, J. C., Sutton, R. S., and Ram, A. (1998). Experiments with reinforcement learning in problems with continuous state and action spaces. Adaptive Behavior, 6(2):163–218.
Santos, M. S. and Vigo-Aguiar, J. (1998). Analysis of a numerical dynamic programming algorithm applied to economic models. Econometrica, 66(2):409–426.
Schervish, M. J. (1995). Theory of Statistics. Springer.
Schmidhuber, J. (2000). Sequential decision making based on direct search. In Sun, R. and Giles, C. L., editors, Sequence Learning, volume 1828 of Lecture Notes in Computer Science, pages 213–240. Springer.
Schölkopf, B., Burges, C., and Smola, A., editors (1999). Advances in Kernel Methods: Support Vector Learning. MIT Press.
Shawe-Taylor, J. and Cristianini, N. (2004). Kernel Methods for Pattern Analysis. Cambridge University Press.
Sherstov, A. and Stone, P. (2005). Function approximation via tile coding: Automating parameter choice. In Proceedings 6th International Symposium on Abstraction, Reformulation and Approximation (SARA-05), volume 3607 of Lecture Notes in Computer Science, pages 194–205, Airth Castle, UK.
Shoham, Y., Powers, R., and Grenager, T. (2007). If multi-agent learning is the answer, what is the question? Artificial Intelligence, 171(7):365–377.
Singh, S., Jaakkola, T., Littman, M. L., and Szepesvári, Cs. (2000). Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning, 38(3):287–308.
Singh, S. and Sutton, R. (1996). Reinforcement learning with replacing eligibility traces. Machine Learning, 22(1–3):123–158.
Singh, S. P., Jaakkola, T., and Jordan, M. I. (1995). Reinforcement learning with soft state aggregation. In Tesauro, G., Touretzky, D. S., and Leen, T. K., editors, Advances in Neural Information Processing Systems 7, pages 361–368. MIT Press.
Singh, S. P., James, M. R., and Rudary, M. R. (2004). Predictive state representations: A new theory for modeling dynamical systems. In Proceedings 20th Conference on Uncertainty in Artificial Intelligence (UAI-04), pages 512–518, Banff, Canada.
Smola, A. J. and Schölkopf, B. (2004). A tutorial on support vector regression. Statistics and Computing, 14(3):199–222.
Sutton, R., Maei, H., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, Cs., and Wiewiora, E. (2009a). Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings 26th International Conference on Machine Learning (ICML-09), pages 993–1000, Montreal, Canada.
Sutton, R. S. (1988). Learning to predict by the method of temporal differences. Machine Learning, 3(1):9–44.
Sutton, R. S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings 7th International Conference on Machine Learning (ICML-90), pages 216–224, Austin, US.
Sutton, R. S. (1996). Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Touretzky, D. S., Mozer, M. C., and Hasselmo, M. E., editors, Advances in Neural Information Processing Systems 8, pages 1038–1044. MIT Press.
Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.
Sutton, R. S., Barto, A. G., and Williams, R. J. (1992). Reinforcement learning is adaptive optimal control. IEEE Control Systems Magazine, 12(2):19–22.
Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Solla, S. A., Leen, T. K., and Müller, K.-R., editors, Advances in Neural Information Processing Systems 12, pages 1057–1063. MIT Press.
Sutton, R. S., Szepesvári, Cs., and Maei, H. R. (2009b). A convergent O(n) temporal-difference algorithm for off-policy learning with linear function approximation. In Koller, D., Schuurmans, D., Bengio, Y., and Bottou, L., editors, Advances in Neural Information Processing Systems 21, pages 1609–1616. MIT Press.
Szepesvári, Cs. and Munos, R. (2005). Finite time bounds for sampling based fitted value iteration. In Proceedings 22nd International Conference on Machine Learning (ICML-05), pages 880–887, Bonn, Germany.
Szepesvári, Cs. and Smart, W. D. (2004). Interpolation-based Q-learning. In Proceedings 21st International Conference on Machine Learning (ICML-04), pages 791–798, Banff, Canada.
Takagi, T. and Sugeno, M. (1985). Fuzzy identification of systems and its applications to modeling and control. IEEE Transactions on Systems, Man, and Cybernetics, 15(1):116–132.
Taylor, G. and Parr, R. (2009). Kernelized value function approximation for reinforcement learning. In Proceedings 26th International Conference on Machine Learning (ICML-09), pages 1017–1024, Montreal, Canada.
Thrun, S. (1992). The role of exploration in learning control. In White, D. and Sofge, D., editors, Handbook for Intelligent Control: Neural, Fuzzy and Adaptive Approaches. Van Nostrand Reinhold.
Torczon, V. (1997). On the convergence of pattern search algorithms. SIAM Journal on Optimization, 7(1):1–25.
Touzet, C. F. (1997). Neural reinforcement learning for behaviour synthesis. Robotics and Autonomous Systems, 22(3–4):251–281.
Tsitsiklis, J. N. (1994). Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3):185–202.
Tsitsiklis, J. N. (2002). On the convergence of optimistic policy iteration. Journal of Machine Learning Research, 3:59–72.
Tsitsiklis, J. N. and Van Roy, B. (1996). Feature-based methods for large scale dynamic programming. Machine Learning, 22(1–3):59–94.
Tsitsiklis, J. N. and Van Roy, B. (1997). An analysis of temporal difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674–690.
Tuyls, K., Maes, S., and Manderick, B. (2002). Q-learning in simulated robotic soccer – large state spaces and incomplete information. In Proceedings 2002 International Conference on Machine Learning and Applications (ICMLA-02), pages 226–232, Las Vegas, US.
Uther, W. T. B. and Veloso, M. M. (1998). Tree based discretization for continuous state space reinforcement learning. In Proceedings 15th National Conference on Artificial Intelligence and 10th Innovative Applications of Artificial Intelligence Conference (AAAI-98/IAAI-98), pages 769–774, Madison, US.
Vrabie, D., Pastravanu, O., Abu-Khalaf, M., and Lewis, F. (2009). Adaptive optimal control for continuous-time linear systems based on policy iteration. Automatica, 45(2):477–484.
Waldock, A. and Carse, B. (2008). Fuzzy Q-learning with an adaptive representation. In Proceedings 2008 IEEE World Congress on Computational Intelligence (WCCI-08), pages 720–725, Hong Kong.
Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. PhD thesis, King’s College, Cambridge, UK.
Watkins, C. J. C. H. and Dayan, P. (1992). Q-learning. Machine Learning, 8(3–4):279–292.
Whiteson, S. and Stone, P. (2006). Evolutionary function approximation for reinforcement learning. Journal of Machine Learning Research, 7:877–917.
Wiering, M. (2004). Convergence and divergence in standard and averaging reinforcement learning. In Proceedings 15th European Conference on Machine Learning (ECML-04), volume 3201 of Lecture Notes in Artificial Intelligence, pages 477–488, Pisa, Italy.
Williams, R. J. and Baird, L. C. (1994). Tight performance bounds on greedy policies based on imperfect value functions. In Proceedings 8th Yale Workshop on Adaptive and Learning Systems, pages 108–113, New Haven, US.
Wodarz, D. and Nowak, M. A. (1999). Specific therapy regimes could lead to long-term immunological control of HIV. Proceedings of the National Academy of Sciences of the United States of America, 96(25):14464–14469.
Xu, X., Hu, D., and Lu, X. (2007). Kernel-based least-squares policy iteration for reinforcement learning. IEEE Transactions on Neural Networks, 18(4):973–992.
Xu, X., Xie, T., Hu, D., and Lu, X. (2005). Kernel least-squares temporal difference learning. International Journal of Information Technology, 11(9):54–63.
Yen, J. and Langari, R. (1999). Fuzzy Logic: Intelligence, Control, and Information. Prentice Hall.
Yu, H. and Bertsekas, D. P. (2006). Convergence results for some temporal difference methods based on least-squares. Technical Report LIDS 2697, Massachusetts Institute of Technology, Cambridge, US. Available at http://www.mit.edu/people/dimitrib/lspe_lids_final.pdf.
Yu, H. and Bertsekas, D. P. (2009). Convergence results for some temporal difference methods based on least squares. IEEE Transactions on Automatic Control, 54(7):1515–1531.