Dyna, an integrated architecture for learning, planning, and reacting

SIGART Bulletin, no. 4 (1991): 160-163


Abstract

Dyna is an AI architecture that integrates learning, planning, and reactive execution. Learning methods are used in Dyna both for compiling planning results and for updating a model of the effects of the agent's actions on the world. Planning is incremental and can use the probabilistic and ofttimes incorrect world models generated by learning…

Introduction
  • Introduction to Dyna

    The Dyna architecture attempts to integrate:

    - trial-and-error learning of an optimal reactive policy, a mapping from situations to actions;
    - learning of domain knowledge in the form of an action model, a black box that takes a situation and an action as input and outputs a prediction of the immediately resulting situation;
    - planning: finding the optimal reactive policy given the domain knowledge; and
    - reactive execution: no planning intervenes between perceiving a situation and responding to it.

    In addition, the Dyna architecture is specifically designed for the case in which the agent does not have complete and accurate knowledge of the effects of its actions on the world and in which those effects may be nondeterministic.
  • The agent's objective is to choose actions so as to maximize the total reward it receives over the long term. This problem formulation has been used in studies of reinforcement learning for many years and is also being used in studies of planning and reactive systems (e.g., Russell, 1989).
  • The planning step of the Dyna algorithm (Step 5) repeats K times: 5.1 choose a hypothetical world state and action; 5.2 predict the resultant reward and new state using the action model; 5.3 apply reinforcement learning to this hypothetical experience.
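
    A minimal tabular sketch of this loop, written in Python for illustration. The environment interface (env.reset, env.actions, env.step, env.done), the epsilon-greedy exploration rule, and all parameter values are assumptions made here; the paper's Figure 2 is the authoritative statement of the algorithm.

      import random
      from collections import defaultdict

      # Illustrative tabular Dyna-Q-style loop (a sketch, not the paper's code).
      def dyna(env, episodes=100, k=10, alpha=0.1, gamma=0.95, epsilon=0.1):
          Q = defaultdict(float)   # evaluation function Q(situation, action)
          model = {}               # action model: (situation, action) -> (reward, next situation)

          def best_value(s):
              acts = env.actions(s)
              return max(Q[(s, a)] for a in acts) if acts else 0.0

          def choose_action(s):
              # Reactive execution with occasional exploration; no planning happens here.
              if random.random() < epsilon:
                  return random.choice(env.actions(s))
              return max(env.actions(s), key=lambda a: Q[(s, a)])

          for _ in range(episodes):
              s = env.reset()
              while not env.done(s):
                  a = choose_action(s)
                  r, s2 = env.step(s, a)                   # real experience
                  # Trial-and-error learning: Q-learning update on real experience.
                  Q[(s, a)] += alpha * (r + gamma * best_value(s2) - Q[(s, a)])
                  # Learning of domain knowledge: update the action model.
                  model[(s, a)] = (r, s2)
                  # Planning: repeat K times on hypothetical experience.
                  for _ in range(k):
                      hs, ha = random.choice(list(model))  # 5.1 choose hypothetical state and action
                      hr, hs2 = model[(hs, ha)]            # 5.2 predict reward and new state
                      Q[(hs, ha)] += alpha * (hr + gamma * best_value(hs2) - Q[(hs, ha)])  # 5.3 RL update
                  s = s2
          return Q

    Choosing hypothetical (situation, action) pairs uniformly at random from previously experienced pairs is only the simplest rule for Step 5.1; selecting a search-control algorithm is listed later in this summary as one of the three major components of any Dyna instantiation.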
Highlights
  • Dyna assumes the agent's task can be formulated as a reward maximization problem (Figure 1), in which the agent's objective is to choose actions so as to maximize the total reward it receives over the long term.
  • Werbos (1987) has previously argued for the general idea of building AI systems that approximate dynamic programming, and Whitehead (1989) and others have presented results for reinforcement learning systems augmented with an action model used for planning.
  • Riolo (1991) and Grefenstette et al. (1990) have explored in different ways the use of action models together with reinforcement learning methods based on classifier systems.
  • Results from dynamic programming (Bertsekas & Tsitsiklis, 1989) can be adapted to show that incremental dynamic programming (IDP) planning based on the tabular version of Q-learning converges to the optimal behavior given the action model.
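
    As background for the convergence claim above (standard dynamic-programming notation; this is textbook material, not a quotation from the paper), the quantity the tabular IDP/Q-learning process converges to is the optimal evaluation function Q*, the fixed point of the Bellman optimality equation under the learned action model:

      Q^{*}(s,a) = \mathbb{E}\bigl[\, r + \gamma \max_{a'} Q^{*}(s',a') \bigm| s, a \bigr],
      \qquad
      \pi^{*}(s) = \arg\max_{a} Q^{*}(s,a),

    where the expectation is over the reward r and next state s' predicted by the (possibly stochastic) action model, and gamma in [0, 1) is the discount factor assumed here.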
Results
  • Given enough experience, such an agent can learn the optimal reactive mapping from situations to actions.
  • Instantiating the Dyna architecture involves selecting three major components: the structure of the action model and its learning algorithms; an algorithm for selecting hypothetical states and actions (Step 5.1, search control); and a reinforcement learning method, including a learning-from-examples algorithm and a way of generating variety in behavior.
  • The update algorithm for Q-learning can be expressed in a general form as a way of moving from a unit of experience to a training example for the evaluation function (the standard tabular form is spelled out after this list).
  • Dyna is fully reactive in the sense that no planning intervenes between observing a state and taking an action dependent on it.
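
    Spelled out in standard notation (an illustration of the "unit of experience to training example" view above, not a quotation from the paper), a unit of experience (s, a, r, s'), whether real or hypothetical, yields the training example with input (s, a) and target r + gamma * max_{a'} Q(s', a'), and the tabular update is

      Q(s,a) \leftarrow Q(s,a) + \alpha \bigl[\, r + \gamma \max_{a'} Q(s',a') - Q(s,a) \,\bigr],

    where alpha is a step-size parameter. The same update is applied to real experience and, in Step 5.3, to hypothetical experience generated by the action model.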
Conclusion
  • The state space is obviously far too large for table-based approaches, and Dyna must rely on methods for learning and generalizing from examples (see the sketch after this list).
  • In the Dyna algorithm given in Figure 2, IDP planning takes place after action selection, but conceptually these processes proceed in parallel. The critical issue is that the planning and reacting processes are not strongly coupled: the agent never delays responding to a situation in order to plan a response to it.
  • The objective of the planning and learning processes is to learn a single policy function that maps states to actions, with no explicit 'goal' input.
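
    A minimal sketch of what "learning and generalizing from examples" can look like in this setting, assuming a hand-designed feature vector phi(s, a) and a linear approximator in place of the Q table (the class, its names, and the feature representation are illustrative assumptions; the paper does not prescribe a particular approximator):

      import numpy as np

      class LinearQ:
          """Illustrative linear approximation of the evaluation function Q(s, a).

          Replaces the tabular Q when the state space is too large to enumerate.
          phi is an assumed, domain-specific feature vector for a (situation, action) pair.
          """

          def __init__(self, num_features, alpha=0.01):
              self.w = np.zeros(num_features)   # learned weights
              self.alpha = alpha                # step-size parameter

          def value(self, phi):
              # Q(s, a) is approximated by the dot product of w and phi(s, a).
              return float(np.dot(self.w, phi))

          def update(self, phi, target):
              # One gradient step toward the training example (phi, target),
              # where target = r + gamma * max_a' Q(s', a') as in the update rule above.
              error = target - self.value(phi)
              self.w += self.alpha * error * phi

    The Dyna loop itself is unchanged; only the table lookups and table writes are replaced by value() and update() on feature vectors, and the action model must likewise be represented by a learner that can generalize.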
Reference
  • Barto, A. G., Sutton, R. S., & Watkins, C. J. C. H. (1990) Learning and sequential decision making. In Learning and Computational Neuroscience, M. Gabriel & J. W. Moore (Eds.), 539-602, MIT Press.
  • Bertsekas, D. P. (1987) Dynamic Programming: Deterministic and Stochastic Models. Prentice-Hall.
  • Bertsekas, D. P. & Tsitsiklis, J. N. (1989) Parallel and Distributed Computation: Numerical Methods. Prentice-Hall.
  • Craik, K. J. W. (1943) The Nature of Explanation. Cambridge University Press, Cambridge, UK.
  • Dennett, D. C. (1978) Why the law of effect will not go away. In Brainstorms, by D. C. Dennett, 71-89, Bradford Books.
  • Grefenstette, J. J., Ramsey, C. L., & Schultz, A. C. (1990) Learning sequential decision rules using simulation models and competition. Machine Learning 5: 355-382.
  • Holland, J. H. (1986) Escaping brittleness: The possibilities of general-purpose learning algorithms applied to parallel rule-based systems. In R. Michalski, J. Carbonell & T. Mitchell (Eds.), Machine Learning II, Morgan Kaufmann.
  • Kaelbling, L. P. (1990) Learning in Embedded Systems. PhD thesis, Stanford University.
  • Korf, R. E. (1990) Real-time heuristic search. Artificial Intelligence 42: 189-211.
  • Lin, Long-Ji (1991) Self-improving reactive agents: Case studies of reinforcement learning frameworks. In Proceedings of the International Conference on the Simulation of Adaptive Behavior, 297-305, MIT Press.
  • Mahadevan, S. & Connell, J. (1990) Automatic programming of behavior-based robots using reinforcement learning. IBM technical report.
  • Riolo, R. (1991) Lookahead planning and latent learning in a classifier system. In Proceedings of the International Conference on the Simulation of Adaptive Behavior, MIT Press.
  • Russell, S. J. (1989) Execution architectures and compilation. Proceedings of IJCAI-89, 15-20.
  • Sutton, R. S. (1984) Temporal Credit Assignment in Reinforcement Learning. PhD thesis, COINS Dept., Univ. of Massachusetts, Amherst, MA 01003.
  • Sutton, R. S. (1988) Learning to predict by the methods of temporal differences. Machine Learning 3: 9-44.
  • Sutton, R. S. (1990) Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. Proceedings of the Seventh International Conference on Machine Learning, 216-224.
  • Sutton, R. S. & Barto, A. G. (1981) An adaptive network that constructs and uses an internal model of its environment. Cognition and Brain Theory Quarterly 4: 217-246.
  • Watkins, C. J. C. H. (1989) Learning from Delayed Rewards. PhD thesis, Cambridge University Psychology Department.
  • Werbos, P. J. (1987) Building and understanding adaptive systems: A statistical/numerical approach to factory automation and brain research. IEEE Transactions on Systems, Man, and Cybernetics SMC-17(1): 7-20.
  • Whitehead, S. D. & Ballard, D. H. (1991) Learning to perceive and act by trial and error. Machine Learning 7: 45-83.
  • Whitehead, S. D. (1989) Scaling reinforcement learning systems. Technical Report 304, Dept. of Computer Science, University of Rochester, Rochester, NY 14627.