Supplementary Materials: Robust Asymmetric Learning in POMDPs

Semantic Scholar (2021)

Abstract
Symbol | Name | Space | Description
t | Time | ℤ | Discrete time step. Indexes other values.
s_t | State (full state, compact state, omniscient state) | S = ℝ | State space of the MDP. Sufficient to fully define the state of the environment.
o_t | Observation (partial observation) | O = ℝ^(A×B×...) | Observed value in the POMDP, emitted conditional on the state, and conditionally dependent only on the state. The state is generally not identifiable from the observation.
a_t | Action | A = ℝ | Interaction made with the environment at time t.
r_t | Reward | ℝ | Value received at time t indicating performance. Maximising the sum of rewards is the objective.
b_t | Belief state | B | Agent's belief over the current state given the history of observations and actions.
q_π | Trajectory distribution | Q : Π → (A × B × O × S × ℝ) | Process of sampling trajectories using the policy π. If the process is fully observed, O = ∅.
τ_{0:t} | Trajectory (rollout) | (A × B × O × S × ℝ) | Sequence of tuples containing state, next state, observation, action and reward.
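To make the notation concrete, below is a minimal sketch of sampling a trajectory τ_{0:t} from q_π in a toy one-dimensional POMDP. It is not the paper's environment or code: the `Step` container, the `rollout` function, the random-walk dynamics, the noise scales, and the linear feedback policy are all illustrative assumptions. It only demonstrates the structure of the table above: a hidden state s_t, a noisy observation o_t emitted conditionally on s_t, an observation-conditioned action a_t, and a reward r_t collected into a trajectory.

```python
import random
from dataclasses import dataclass


@dataclass
class Step:
    state: float        # s_t: full (omniscient) state, hidden from the policy
    observation: float  # o_t: noisy emission, conditionally dependent only on s_t
    action: float       # a_t: interaction with the environment at time t
    reward: float       # r_t: scalar performance signal at time t


def rollout(policy, horizon=10, obs_noise=0.5, seed=0):
    """Sample one trajectory tau_{0:t} from q_pi in a toy 1-D POMDP.

    Hypothetical dynamics for illustration only: the state is a scalar
    position, the observation adds Gaussian noise (so the state is not
    identifiable from a single observation), and the reward penalises
    distance from the origin.
    """
    rng = random.Random(seed)
    s = rng.gauss(0.0, 1.0)  # initial state s_0
    traj = []
    for _ in range(horizon):
        o = s + rng.gauss(0.0, obs_noise)  # o_t ~ p(o | s_t)
        a = policy(o)                      # a_t = pi(o_t): acts on the observation only
        r = -abs(s)                        # r_t: maximising the sum drives s toward 0
        traj.append(Step(s, o, a, r))
        s = s + a + rng.gauss(0.0, 0.1)    # s_{t+1} ~ p(s' | s_t, a_t)
    return traj


# Usage: a naive policy that treats the noisy observation as the true state.
tau = rollout(policy=lambda o: -0.5 * o, horizon=5)
for t, step in enumerate(tau):
    print(t, step)
```

Note the asymmetry this setup exposes: an omniscient (expert) policy could act on `step.state` directly, while the learner only ever sees `step.observation`, which is the gap asymmetric learning in POMDPs addresses.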