(Neural Computation. 2000;12:219-245.)
© 2000 The MIT Press


Letter

Reinforcement Learning in Continuous Time and Space

Kenji Doya

ATR Human Information Processing Research Laboratories, Soraku, Kyoto 619-0288, Japan

This article presents a reinforcement learning framework for continuous-time dynamical systems without a priori discretization of time, state, and action. Based on the Hamilton-Jacobi-Bellman (HJB) equation for infinite-horizon, discounted reward problems, we derive algorithms for estimating value functions and improving policies with the use of function approximators. The process of value function estimation is formulated as the minimization of a continuous-time form of the temporal difference (TD) error. Update methods based on backward Euler approximation and exponential eligibility traces are derived, and their correspondences with the conventional residual gradient, TD(0), and TD(λ) algorithms are shown. For policy improvement, two methods are formulated: a continuous actor-critic method and a value-gradient-based greedy policy. As a special case of the latter, a nonlinear feedback control law using the value gradient and the model of the input gain is derived. Advantage updating, a model-free algorithm derived previously, is also formulated in the HJB-based framework.
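As a rough illustration of the value-estimation step, the sketch below computes a continuous-time TD error with a backward Euler estimate of the time derivative of the value function. The abstract does not spell out the formula; the sketch assumes the form delta(t) = r(t) - V(t)/tau + dV/dt, where tau is a discount time constant. The function names, the step size `dt`, and the one-state learning loop are hypothetical and chosen only for illustration.

```python
def td_error(r, v_now, v_prev, dt, tau):
    """Continuous-time TD error, assuming delta = r - V/tau + dV/dt.

    dV/dt is estimated by backward Euler: (V(t) - V(t - dt)) / dt.
    """
    v_dot = (v_now - v_prev) / dt
    return r - v_now / tau + v_dot


def td_error_discrete_form(r, v_now, v_prev, dt, tau):
    """Same quantity rearranged to resemble a discrete TD(0) error.

    With gamma = 1 - dt/tau, delta = (dt*r + gamma*V(t) - V(t - dt)) / dt.
    """
    gamma = 1.0 - dt / tau
    return (dt * r + gamma * v_now - v_prev) / dt


def learn_constant_reward_value(r, tau, lr=0.1, steps=500, dt=0.02):
    """Toy check: a single state with constant reward r.

    The value is a single weight w, so V(t) = V(t - dt) = w and the
    update w += lr * delta should drive w toward tau * r.
    """
    w = 0.0
    for _ in range(steps):
        delta = td_error(r, w, w, dt, tau)
        w += lr * delta  # gradient of V with respect to w is 1 here
    return w
```

In this toy setting the fixed point is w = tau * r, which is the discounted integral of a constant reward under the assumed exponential discounting. With a real function approximator the weight update would multiply the TD error by the gradient of V at the earlier state.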

The performance of the proposed algorithms is first tested in a nonlinear control task of swinging a pendulum up with limited torque. The simulations show that (1) the continuous actor-critic method accomplishes the task in several times fewer trials than the conventional discrete actor-critic method; (2) among the continuous policy update methods, the value-gradient-based policy with a known or learned dynamic model performs several times better than the actor-critic method; and (3) a value function update using exponential eligibility traces is more efficient and stable than one based on Euler approximation. The algorithms are then tested in a higher-dimensional task: cart-pole swing-up. This task is accomplished in several hundred trials using the value-gradient-based policy with a learned dynamic model.
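To make the idea of a value-gradient-based policy under a torque limit more concrete, here is a minimal sketch for a pendulum with state (theta, omega). It is not the control law from the article: the tanh squashing function, the pendulum parameters `m` and `l`, the torque limit `u_max`, and the scaling constant `cost_scale` are all assumptions made for illustration. The sketch relies only on the general idea stated in the abstract, that the action is driven by the value gradient projected through the input gain of the dynamics.

```python
import math


def input_gain(m=1.0, l=1.0):
    """Input gain df/du for state (theta, omega).

    Assumes torque enters only the angular acceleration, scaled by
    1 / (m * l^2); both parameters are hypothetical.
    """
    return (0.0, 1.0 / (m * l * l))


def greedy_torque(value_grad, u_max=5.0, cost_scale=1.0, m=1.0, l=1.0):
    """Bounded torque that pushes in the direction of increasing value.

    value_grad is (dV/dtheta, dV/domega). The tanh squashing keeps the
    output inside [-u_max, u_max]; it is an assumed choice of sigmoid.
    """
    gain = input_gain(m, l)
    drive = sum(g * v for g, v in zip(gain, value_grad)) / cost_scale
    return u_max * math.tanh(drive)
```

A large value gradient saturates the torque at the limit, giving bang-bang-like behavior, while a small gradient gives a smooth, nearly linear response. The value gradient itself would come from a learned value function, and the input gain from a known or learned dynamic model.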




This article has been cited by other articles:

G. Endo, J. Morimoto, T. Matsubara, J. Nakanishi, and G. Cheng
Learning CPG-based Biped Locomotion with a Policy Gradient Method: Application to a Humanoid Robot
The International Journal of Robotics Research, February 1, 2008; 27(2): 213-228.

A. D. Redish and A. Johnson
A Computational Model of Craving and Obsession
Ann. N.Y. Acad. Sci., May 1, 2007; 1104(1): 324-339.

J. Morimoto and K. Doya
Reinforcement Learning State Estimator
Neural Comput., March 1, 2007; 19(3): 730-756.

M. Khamassi, L. Lacheze, B. Girard, A. Berthoz, and A. Guillot
Actor-Critic Models of Reinforcement Learning in the Basal Ganglia: From Natural to Artificial Rats
Adaptive Behavior, June 1, 2005; 13(2): 131-148.

F. Wörgötter and B. Porr
Temporal Sequence Learning, Prediction, and Control: A Review of Different Models and Their Relation to Biological Mechanisms
Neural Comput., February 1, 2005; 17(2): 245-319.

J. Morimoto and K. Doya
Robust Reinforcement Learning
Neural Comput., February 1, 2005; 17(2): 335-359.

R. P. N. Rao and T. J. Sejnowski
Spike-Timing-Dependent Hebbian Plasticity as Temporal Difference Learning
Neural Comput., October 1, 2001; 13(10): 2221-2237.



