Batched Mean-Variance Bandits.

ICPR (2022)

Abstract
Given the ubiquity of uncertainty in sequential decision making, recent research on bandit decision making has focused on exploration-related risk in addition to cumulative reward. Whereas most existing risk-averse sequential decision-making algorithms are fully sequential and assume that the player can switch actions at every time step, batched policies, under which players must specify in advance which actions to take in each batch, are more relevant in real-world applications such as stock investment and experimental trials, because the rewards and risks involved are often not observable immediately after actions are taken. Despite such relevance, the effect of batched policies on risk-averse bandits has not been well studied. Using the common mean-variance risk measure as the risk criterion, we prove that O(log n) batches suffice for the batched bandit algorithm to attain the optimal instance-dependent regret upper bound of O(log n), where n is the length of the time horizon. We also empirically demonstrate the effectiveness of our algorithms on different bandit instances, thereby providing insight into balancing risk and return for batched policies in multi-armed bandits. Finally, we point out several directions for future work that follow from our analysis.
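The abstract does not spell out the algorithm, so the following is only a minimal Python sketch of the batched idea under stated assumptions: a geometrically growing batched successive-elimination scheme scored by the common empirical mean-variance criterion MV_i = sigma_i^2 - rho * mu_i (lower is better). The function name batched_mv_bandit, the confidence radius, and the toy instance are illustrative, not the paper's method.

import numpy as np

def batched_mv_bandit(means, variances, rho=1.0, horizon=10000, seed=0):
    """Minimal sketch of a batched mean-variance bandit (hypothetical).

    Batched successive elimination: pull every surviving arm equally
    within a batch, then eliminate arms whose empirical mean-variance
    MV_i = var_i - rho * mean_i (lower is better) is clearly worse than
    the best arm's. Batch sizes grow geometrically, so only O(log n)
    batches are used over a horizon of n pulls.
    """
    rng = np.random.default_rng(seed)
    k = len(means)
    active = list(range(k))
    samples = [[] for _ in range(k)]
    pulls_used, batch_size, regret_mv = 0, 2, []

    mv_true = np.array(variances) - rho * np.array(means)
    mv_star = mv_true.min()

    while pulls_used < horizon and len(active) > 1:
        # Pull each surviving arm `batch_size` times; rewards are only
        # processed at the end of the batch (the batched constraint).
        for i in active:
            draws = rng.normal(means[i], np.sqrt(variances[i]), batch_size)
            samples[i].extend(draws)
            pulls_used += batch_size
            regret_mv.append((mv_true[i] - mv_star) * batch_size)
        # Empirical mean-variance of each surviving arm.
        mv_hat = {i: np.var(samples[i]) - rho * np.mean(samples[i])
                  for i in active}
        n_i = len(samples[active[0]])  # equal pulls for all active arms
        radius = np.sqrt(2 * np.log(horizon) / n_i)  # crude confidence width
        best = min(mv_hat.values())
        active = [i for i in active if mv_hat[i] <= best + 2 * radius]
        batch_size *= 2  # geometric growth => O(log n) batches
    return active, sum(regret_mv)

# Toy instance: arm 0 has the best (lowest) mean-variance.
survivors, regret = batched_mv_bandit(
    means=[1.0, 0.9, 0.5], variances=[0.2, 0.5, 0.3], rho=1.0)
print("surviving arms:", survivors, "| cumulative MV regret:", round(regret, 2))

Because the batch size doubles each round, the number of batches over a horizon of n pulls is O(log n), matching the regime the abstract describes; the elimination rule and confidence width here are assumptions, not the paper's analysis.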
Keywords
balancing risk, bandit decision making, batched bandit algorithm, batched mean-variance bandits, batched policies, common mean-variance risk measure, cumulative rewards, different bandit instances, existing risk-averse sequential decision making algorithms, experimental trials, exploration-related risk, multi-armed bandits, real-time applications, risk criteria, risk-averse bandits, stock investments