Science, 30 August 2019 (Vol. 365, Issue 6456)

INSIGHTS | PERSPECTIVES

COMPUTER SCIENCE

AI surpasses humans at six-player poker

Self-learning Pluribus beats five humans in Texas hold’em showdown

By Alan Blair and Abdallah Saffidine

School of Computer Science and Engineering, University of New South Wales, Sydney 2052, Australia. Email: blair@cse.unsw.edu.au; [email protected]

Superhuman performance by artificial intelligence (AI) has been demonstrated in two-player, deterministic, zero-sum, perfect-information games (1) such as chess, checkers (2), Hex, and Go (3). Research using AI has broadened to include games with challenging attributes such as randomness, multiple players, or imperfect information. Randomness is a feature of dice games, and card games include the additional complexity that each player sees some cards that are hidden from others. These aspects more closely resemble real-world situations, and this research may thus lead to algorithms with wider applicability. On page 885 of this issue, Brown and Sandholm (4) show that a new computer player called Pluribus exceeds human performance for six-player Texas hold’em poker.

Texas hold’em is a variant of poker where each player has two hidden cards and five communal cards from which to make the best poker hand. Scientists have previously demonstrated superhuman performance for two-player versions of the game: limit hold’em, where bets are capped (5), and no-limit hold’em, where they are not capped (6). Exceeding human performance in the six-player version had remained a challenge.

Games with three or more players represent a challenge for game theory. For two-player, zero-sum games, there exists a pair of strategies such that neither player can improve their expected outcome by unilaterally switching to a different strategy. This so-called Nash equilibrium is considered a solution to the game. For multiplayer games, the expected reward may vary from one Nash equilibrium to another. Fast algorithms such as counterfactual regret minimization (CFR), which are guaranteed to converge to a Nash equilibrium in the two-player, zero-sum setting, may fail to do so in multiplayer games. Nevertheless, CFR still shows promising empirical performance in some multiplayer domains.

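In standard game-theoretic notation (a textbook definition, not specific to Pluribus), a strategy profile $\sigma = (\sigma_1, \ldots, \sigma_n)$ with expected utilities $u_i$ is a Nash equilibrium when
\[
u_i(\sigma_i, \sigma_{-i}) \;\ge\; u_i(\sigma_i', \sigma_{-i}) \qquad \text{for every player } i \text{ and every alternative strategy } \sigma_i',
\]
where $\sigma_{-i}$ denotes the strategies of the other players. In a two-player zero-sum game ($u_1 = -u_2$), every equilibrium gives each player the same value, which is why an equilibrium can be treated as the solution; with three or more players, different equilibria can give different expected rewards.
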
Pluribus first learns a generic or “blueprint” strategy by self-play. Then, during actual play, it computes a real-time strategy to refine the blueprint strategy, depending on the current state of the game. The program learns the blueprint strategy through a CFR variant called Monte Carlo CFR (MCCFR) (7), with a number of enhancements. Pluribus repeatedly simulates hands of poker in which all players use the same strategy; after each hand, it recursively examines each decision and estimates the expected outcome from that decision, compared to other possible actions that could have been chosen in the same situation.

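To illustrate the kind of regret bookkeeping that CFR-style methods build on, here is a minimal Python sketch of regret matching at a single decision point. The action set, payoffs, and training loop are illustrative assumptions for this note, not the authors’ code; real MCCFR applies this update across sampled traversals of the full game tree.

import random
from collections import defaultdict

ACTIONS = ["fold", "call", "raise"]      # illustrative action set

regret_sum = defaultdict(float)          # cumulative regret per action
strategy_sum = defaultdict(float)        # cumulative strategy, for the average

def current_strategy():
    # Regret matching: play actions in proportion to their positive regret.
    positive = {a: max(regret_sum[a], 0.0) for a in ACTIONS}
    total = sum(positive.values())
    if total > 0:
        return {a: positive[a] / total for a in ACTIONS}
    return {a: 1.0 / len(ACTIONS) for a in ACTIONS}

def update(action_values):
    # Compare each action's estimated value with the value of the strategy
    # actually used, and accumulate the difference as regret.
    strategy = current_strategy()
    expected = sum(strategy[a] * action_values[a] for a in ACTIONS)
    for a in ACTIONS:
        regret_sum[a] += action_values[a] - expected
        strategy_sum[a] += strategy[a]

# Toy training loop with made-up, noisy action values.
for _ in range(10000):
    update({a: random.gauss({"fold": -1.0, "call": 0.2, "raise": 0.1}[a], 1.0)
            for a in ACTIONS})

total = sum(strategy_sum.values())
print({a: round(strategy_sum[a] / total, 3) for a in ACTIONS})   # average strategy

It is the average strategy over all iterations, not the latest one, that carries the convergence guarantee in the two-player zero-sum setting.
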
To improve the efficiency of MCCFR in Pluribus, the authors introduced linear weighted discounting in the early stages of training and strategic pruning of negative-regret actions in the later stages. The most intricate part of the system is the real-time strategy component. To deal with imperfect information, Pluribus performs a nested search, maintaining a probability distribution over root nodes of the search tree and over the cards held by each player, conditioned on the assumption that all players are using the same (known) strategy. To evaluate leaf nodes robustly, Pluribus considers four different variants of the blueprint strategy.

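The two training enhancements can be pictured with a short sketch. The constants, schedule, and function names below are illustrative assumptions; the published system defines its own discounting schedule and pruning threshold.

# Illustrative constants; not the values used by Pluribus.
WARMUP_ITERS = 400          # early phase during which discounting is applied
PRUNE_THRESHOLD = -300.0    # regrets below this mark an action as hopeless

def discount_regrets(regret_sum, iteration):
    # Linear discounting: scale down regrets accumulated so far, so that the
    # noisy estimates from early iterations do not dominate later training.
    if iteration < WARMUP_ITERS:
        factor = iteration / (iteration + 1)
        for action in regret_sum:
            regret_sum[action] *= factor

def actions_to_explore(regret_sum, actions):
    # Pruning in the later stages: skip actions whose accumulated regret is
    # so negative that they are unlikely to re-enter a good strategy.
    return [a for a in actions if regret_sum[a] > PRUNE_THRESHOLD]

The pruning threshold is a trade-off: a more aggressive cutoff saves computation but risks discarding an action before its regret has been estimated reliably.
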
For large and complex games, abstraction of states and actions can be used to restrain the growth of the search tree (8). This is necessary for the full six-player no-limit Texas hold’em game, which is too complex to search directly. Instead, Pluribus simulates a simpler version of the game by bucketing similar decision points together and eliminating some actions (see the figure). For example, the blueprint strategy considers



Figure: Abstraction in game tree search
In its abstraction mechanism, Pluribus shrinks the number of decision points on whether to call, raise, or fold by bucketing similar situations together. This reduces the game tree search complexity in poker from an intractable to a solvable problem, using Monte Carlo counterfactual regret minimization (MCCFR).
Real game: Pluribus needs an action (call, raise, or fold) for each scenario; solving the real game directly is intractable.
Abstraction → Abstract game: similar scenarios, like the 9-high and 10-high straights, are bucketed together; solving the abstract game is feasible.
Abstract strategy: Pluribus uses MCCFR to map each bucket to a distribution over actions.
Translation → Real strategy: each scenario is mapped to a distribution over actions based on the abstract strategy for its bucket.
GRAPHIC: A. KITTERMAN/SCIENCE
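
To make the bucketing step concrete, here is a minimal Python sketch. The abstraction key (a coarse hand-strength estimate plus a pot-size bracket), the probabilities, and all names are illustrative assumptions; Pluribus’s actual abstraction of states and bet sizes is far richer.

from collections import defaultdict

def bucket_key(hand_strength, pot_size):
    # Illustrative abstraction: collapse a continuous hand-strength estimate
    # and the pot size into a coarse pair of indices. Scenarios that share a
    # key are treated as the same decision point during training.
    return (int(hand_strength * 10), min(pot_size // 100, 5))

# Map each bucket to the action distribution learned for it (toy numbers).
abstract_strategy = defaultdict(lambda: {"fold": 1 / 3, "call": 1 / 3, "raise": 1 / 3})
abstract_strategy[(7, 2)] = {"fold": 0.0, "call": 0.7, "raise": 0.3}

def real_strategy(hand_strength, pot_size):
    # A concrete scenario is played using the strategy learned for its bucket.
    return abstract_strategy[bucket_key(hand_strength, pot_size)]

# Two similar hands (say, a 9-high and a 10-high straight) land in the same
# bucket, so they inherit the same action distribution.
print(real_strategy(0.78, 200))
print(real_strategy(0.74, 230))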