Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog
Abstract:
Most deep reinforcement learning (RL) systems are not able to learn effectively from off-policy data, especially if they cannot explore online in the environment. These are critical shortcomings for applying RL to real-world problems where collecting data is expensive, and models must be tested offline before being deployed to interact ...