Supplementary Materials: Robust Asymmetric Learning in POMDPs

Semantic Scholar (2021)

Abstract
Symbol | Name | Space | Description
t | Time | ℤ | Discrete time step. Indexes other values.
s_t | State (full state, compact state, omniscient state) | S = ℝ | State space of the MDP. Sufficient to fully define the state of the environment.
o_t | Observation (partial observation) | O = ℝ^(A×B×...) | Observed value in the POMDP, emitted conditional on the state, and conditionally dependent only on the state. The state is generally not identifiable from the observation.
a_t | Action | A = ℝ | Interaction made with the environment at time t.
r_t | Reward | ℝ | Value received at time t indicating performance. Maximising the sum of rewards is the objective.
b_t | Belief state | B | Agent's belief over the current state given the history of observations and actions.
q_π | Trajectory distribution | Q : Π → (A × B × O × S × ℝ) | Process of sampling trajectories using the policy π. If the process is fully observed, O = ∅.
τ_{0:t} | Trajectory (rollout) | (A × B × O × S × ℝ) | Sequence of tuples containing state, next state, observation, action and reward.
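To make the notation concrete, below is a minimal sketch of sampling a trajectory τ_{0:t} from q_π in a toy one-dimensional POMDP. It is not the paper's environment or code: the `Step` container, the `rollout` function, the random-walk dynamics, the noise scales, and the linear feedback policy are all illustrative assumptions. It only demonstrates the structure of the table above: a hidden state s_t, a noisy observation o_t emitted conditionally on s_t, an observation-conditioned action a_t, and a reward r_t collected into a trajectory.

```python
import random
from dataclasses import dataclass


@dataclass
class Step:
    state: float        # s_t: full (omniscient) state, hidden from the policy
    observation: float  # o_t: noisy emission, conditionally dependent only on s_t
    action: float       # a_t: interaction with the environment at time t
    reward: float       # r_t: scalar performance signal at time t


def rollout(policy, horizon=10, obs_noise=0.5, seed=0):
    """Sample one trajectory tau_{0:t} from q_pi in a toy 1-D POMDP.

    Hypothetical dynamics for illustration only: the state is a scalar
    position, the observation adds Gaussian noise (so the state is not
    identifiable from a single observation), and the reward penalises
    distance from the origin.
    """
    rng = random.Random(seed)
    s = rng.gauss(0.0, 1.0)  # initial state s_0
    traj = []
    for _ in range(horizon):
        o = s + rng.gauss(0.0, obs_noise)  # o_t ~ p(o | s_t)
        a = policy(o)                      # a_t = pi(o_t): acts on the observation only
        r = -abs(s)                        # r_t: maximising the sum drives s toward 0
        traj.append(Step(s, o, a, r))
        s = s + a + rng.gauss(0.0, 0.1)    # s_{t+1} ~ p(s' | s_t, a_t)
    return traj


# Usage: a naive policy that treats the noisy observation as the true state.
tau = rollout(policy=lambda o: -0.5 * o, horizon=5)
for t, step in enumerate(tau):
    print(t, step)
```

Note the asymmetry this setup exposes: an omniscient (expert) policy could act on `step.state` directly, while the learner only ever sees `step.observation`, which is the gap asymmetric learning in POMDPs addresses.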