Science Journals — AAAS

semanticscholar(2016)

Abstract
Abstraction for large imperfect-information games

There are far too many decision points in no-limit Texas hold'em to reason about individually. To reduce the complexity of the game, we eliminate some actions from consideration and also bucket similar decision points together in a process called abstraction (24, 25). After abstraction, the bucketed decision points are treated as identical. We use two kinds of abstraction in Pluribus: action abstraction and information abstraction.

Action abstraction reduces the number of different actions the AI needs to consider. No-limit Texas hold'em normally allows any whole-dollar bet between $100 and $10,000. However, in practice there is little difference between betting $200 and betting $201. To reduce the complexity of forming a strategy, Pluribus only considers a few different bet sizes at any given decision point. The exact number of bets it considers varies between 1 and 14 depending on the situation. Although Pluribus can limit itself to only betting one of a few different sizes between $100 and $10,000, when actually playing no-limit poker, the opponents are not constrained to those few options. What happens if an opponent bets $150 while Pluribus has only been trained to consider bets of $100 or $200? Generally, Pluribus will rely on its search algorithm (described in a later section) to compute a response in real time to such "off-tree" actions.

Fig. 2. A game tree traversal via Monte Carlo CFR. In this figure, player P1 is traversing the game tree. (Left) A game is simulated until an outcome is reached. (Middle) For each P1 decision point encountered in the simulation in the left panel, P1 explores each other action that P1 could have taken and plays out a simulation to the end of the game. P1 then updates its strategy to pick actions with higher payoff with higher probability. (Right) P1 explores each other action that P1 could have taken at every new decision point encountered in the middle panel, and P1 updates its strategy at those hypothetical decision points. This process repeats until no new P1 decision points are encountered, which in this case is after three steps but in general may be more. Our implementation of MCCFR (described in the supplementary materials) is equivalent but traverses the game tree in a depth-first manner. (The percentages in the figure are for illustration purposes only and may not correspond to actual percentages that the algorithm would compute.)

The other form of abstraction that we use in Pluribus is information abstraction, in which decision points that are similar in terms of what information has been revealed (in poker, the player's cards and revealed board cards) are bucketed together and treated identically (26–28). For example, a 10-high straight and a 9-high straight are distinct hands but are nevertheless strategically similar. Pluribus may bucket these hands together and treat them identically, thereby reducing the number of distinct situations for which it needs to determine a strategy. Information abstraction drastically reduces the complexity of the game, but it may wash away subtle differences that are important for superhuman performance. Therefore, during actual play against humans, Pluribus uses information abstraction only to reason about situations on future betting rounds, never the betting round that it is actually in. Information abstraction is also applied during offline self-play.
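The two kinds of abstraction described above can be sketched in a few lines. Everything in the snippet below is illustrative: the bet menu, the bucket count, and the function names are invented for this sketch and are not the abstractions Pluribus actually uses.

```python
# Illustrative sketch of action and information abstraction (hypothetical
# bet menu and hand-strength buckets; not Pluribus's actual abstractions).

from bisect import bisect_left

# Action abstraction: only a handful of bet sizes are considered when
# computing a strategy, even though any whole-dollar bet is legal.
ABSTRACT_BET_SIZES = [100, 200, 400, 800, 1600, 3200, 10000]  # example menu

def nearest_abstract_bet(bet: int) -> int:
    """Map an arbitrary legal bet to the closest size in the abstract menu.

    (During live play Pluribus instead answers off-tree bets with real-time
    search; this rounding only shows why $200 and $201 can be treated alike
    when building the blueprint strategy.)
    """
    i = bisect_left(ABSTRACT_BET_SIZES, bet)
    candidates = ABSTRACT_BET_SIZES[max(0, i - 1):i + 1]
    return min(candidates, key=lambda s: abs(s - bet))

# Information abstraction: strategically similar hands share one bucket,
# so a single strategy entry covers many distinct card combinations.
def hand_bucket(hand_strength: float, num_buckets: int = 10) -> int:
    """Bucket a hand by a normalized strength score in [0, 1]."""
    return min(int(hand_strength * num_buckets), num_buckets - 1)

if __name__ == "__main__":
    print(nearest_abstract_bet(201))  # -> 200: treated the same as a $200 bet
    print(nearest_abstract_bet(150))  # -> 100 (ties go to the smaller size); off-tree in real play
    # A 9-high and a 10-high straight would get similar strength scores
    # and therefore often land in the same bucket.
    print(hand_bucket(0.83), hand_bucket(0.86))
```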
Self-play through improved Monte Carlo counterfactual regret minimization

The blueprint strategy in Pluribus was computed using a variant of counterfactual regret minimization (CFR) (29). CFR is an iterative self-play algorithm in which the AI starts by playing completely at random but gradually improves by learning to beat earlier versions of itself. Every competitive Texas hold'em AI for at least the past 6 years has computed its strategy using some variant of CFR (4–6, 23, 28, 30–34). We use a form of Monte Carlo CFR (MCCFR) that samples actions in the game tree rather than traversing the entire game tree on each iteration (33, 35–37).

On each iteration of the algorithm, MCCFR designates one player as the traverser, whose current strategy is updated on the iteration. At the start of the iteration, MCCFR simulates a hand of poker based on the current strategy of all players (which is initially completely random). Once the simulated hand is completed, the AI reviews each decision that was made by the traverser and investigates how much better or worse it would have done by choosing the other available actions instead. Next, the AI reviews each hypothetical decision that would have been made following those other available actions and investigates how much better it would have done by choosing the other available actions, and so on. This traversal of the game tree is illustrated in Fig. 2. Exploring other hypothetical outcomes is possible because the AI knows each player's strategy for the iteration and can therefore simulate what would have happened had some other action been chosen instead. This counterfactual reasoning is one of the features that distinguishes CFR from other self-play algorithms that have been deployed in domains such as Go (9), Dota 2 (20), and StarCraft 2 (21).

The difference between what the traverser would have received for choosing an action versus what the traverser actually achieved (in expectation) on the iteration is added to the counterfactual regret for the action. Counterfactual regret represents how much the traverser regrets having not chosen that action in previous iterations. At the end of the iteration, the traverser's strategy is updated so that actions with higher counterfactual regret are chosen with higher probability.

For two-player zero-sum games, CFR guarantees that the average strategy played over all iterations converges to a Nash equilibrium, but convergence to a Nash equilibrium is not guaranteed outside of two-player zero-sum games. Nevertheless, CFR guarantees in all finite games that all counterfactual regrets grow sublinearly in the number of iterations. This, in turn, guarantees in the limit that the average performance of CFR on each iteration that was played matches the average performance of the best single fixed strategy in hindsight. CFR is also proven to eliminate iteratively strictly dominated actions in all finite games (23).

Because the difference between counterfactual value and expected value is added to counterfactual regret rather than replacing it, the first iteration, in which the agent played completely randomly (which is typically a very bad strategy), still influences the counterfactual regrets, and therefore the strategy that is played, for iterations far into the future. In the original form of CFR, the influence of this first iteration decays at a rate of $1/T$, where $T$ is the number of iterations played.
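The regret update described above is the regret-matching core of CFR. The sketch below is a toy, tabular version at a single decision point; the class and attribute names are invented for illustration, and real MCCFR maintains one such table per bucketed decision point while traversing sampled game trees.

```python
# Minimal sketch of regret matching at one decision point (illustrative only).

class DecisionPoint:
    def __init__(self, actions):
        self.actions = actions
        self.regret_sum = {a: 0.0 for a in actions}    # cumulative counterfactual regret
        self.strategy_sum = {a: 0.0 for a in actions}  # accumulates the average strategy

    def current_strategy(self):
        """Regret matching: play in proportion to positive cumulative regret."""
        positive = {a: max(r, 0.0) for a, r in self.regret_sum.items()}
        total = sum(positive.values())
        if total <= 0:
            # No positive regret yet: fall back to uniform random play.
            return {a: 1.0 / len(self.actions) for a in self.actions}
        return {a: p / total for a, p in positive.items()}

    def update(self, action_values):
        """action_values[a]: counterfactual value of choosing action a this iteration."""
        strategy = self.current_strategy()
        expected = sum(strategy[a] * action_values[a] for a in self.actions)
        for a in self.actions:
            # Regret is added to (not replacing) the running total, which is why
            # the early random iterations keep influencing later strategies.
            self.regret_sum[a] += action_values[a] - expected
            self.strategy_sum[a] += strategy[a]

    def average_strategy(self):
        total = sum(self.strategy_sum.values())
        return {a: s / total for a, s in self.strategy_sum.items()}
```

In two-player zero-sum games it is this accumulated average strategy, not the latest iterate, that converges toward a Nash equilibrium.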
To more quickly decay the influence of these early "bad" iterations, Pluribus uses a recent form of CFR called Linear CFR (38) in early iterations. (We stop the discounting after that because the time cost of doing the multiplications with the discount factor is not worth the benefit later on.) Linear CFR assigns a weight of $T$ to the regret contributions of iteration $T$. Therefore, the influence of the first iteration decays at a rate of $1/\sum_{t=1}^{T} t = 2/(T(T+1))$. This leads to the strategy improving more quickly in practice while still maintaining a near-identical worst-case bound on total regret. To speed up the blueprint strategy computation even further, actions with extremely negative regret are not explored in 95% of iterations.

The blueprint strategy for Pluribus was computed in 8 days on a 64-core server for a total of 12,400 CPU core hours. It required less than 512 GB of memory. At current cloud computing spot instance rates, this would cost about $144 to produce. This is in sharp contrast to all the other recent superhuman AI milestones for games, which used large numbers of servers and/or farms of graphics processing units (GPUs). More memory …

Fig. 3. Perfect-information game search in Rock-Paper-Scissors. (Top) A sequential representation of Rock-Paper-Scissors in which player 1 acts first but does not reveal her action to player 2, who acts second. The dashed lines between the player 2 nodes signify that player 2 does not know which of those nodes he is in. The terminal values are shown only for player 1. (Bottom) A depiction of the depth-limited subgame if player 1 conducts search (with a depth of one) using the same approach as is used in perfect-information games. The approach assumes that after each action, player 2 will play according to the Nash equilibrium strategy of choosing Rock, Paper, and Scissors with 1/3 probability each. This results in a value of zero for player 1 regardless of her strategy.
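The decay rate quoted for Linear CFR above can be checked numerically. The snippet below (illustrative only, not the paper's implementation) compares the relative weight of iteration 1 under uniform weighting with its weight under Linear CFR's scheme of weighting iteration t's contribution by t.

```python
# Relative influence of iteration 1 on the accumulated regret after T
# iterations: vanilla CFR uses unit weights, Linear CFR weights iteration t by t.

def first_iteration_influence(T: int, linear: bool) -> float:
    weights = [t if linear else 1 for t in range(1, T + 1)]
    return weights[0] / sum(weights)

for T in (10, 100, 1000):
    vanilla = first_iteration_influence(T, linear=False)  # equals 1/T
    lin = first_iteration_influence(T, linear=True)       # equals 2/(T*(T+1))
    print(f"T={T:5d}  vanilla: {vanilla:.6f}   linear: {lin:.8f}")
```

In practice the same weighting can be obtained by multiplying the accumulated regrets by t/(t + 1) at the end of iteration t, which appears to be the discount-factor multiplication that the text above says is no longer worth performing after the early iterations.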
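The point of Fig. 3 can also be verified directly: if player 2 is assumed to keep playing the fixed (1/3, 1/3, 1/3) mixture no matter what, every strategy for player 1 evaluates to zero, so this style of depth-limited search gives no guidance about which action to prefer. A small illustrative check (not from the paper):

```python
# Expected value for player 1 in Rock-Paper-Scissors when player 2 is assumed
# to play the fixed Nash mixture (1/3, 1/3, 1/3), as in Fig. 3 (bottom).

ACTIONS = ["Rock", "Paper", "Scissors"]

# PAYOFF[a1][a2]: terminal value for player 1
PAYOFF = {
    "Rock":     {"Rock": 0,  "Paper": -1, "Scissors": 1},
    "Paper":    {"Rock": 1,  "Paper": 0,  "Scissors": -1},
    "Scissors": {"Rock": -1, "Paper": 1,  "Scissors": 0},
}

P2_FIXED = {a: 1 / 3 for a in ACTIONS}

def value_for_p1(p1_strategy):
    return sum(p1_strategy[a1] * P2_FIXED[a2] * PAYOFF[a1][a2]
               for a1 in ACTIONS for a2 in ACTIONS)

# Any strategy for player 1 evaluates to zero under this assumption.
for strat in ({"Rock": 1.0, "Paper": 0.0, "Scissors": 0.0},
              {"Rock": 0.5, "Paper": 0.5, "Scissors": 0.0},
              {a: 1 / 3 for a in ACTIONS}):
    print(value_for_p1(strat))  # -> 0.0 each time
```

Yet if player 1 really did shift to always playing Rock, player 2 could exploit it by playing Paper, which is exactly the information this form of search throws away.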