Science, 30 August 2019 (Vol. 365, Issue 6456)

INSIGHTS | PERSPECTIVES

COMPUTER SCIENCE

AI surpasses humans at six-player poker

Self-learning Pluribus beats five humans in Texas hold’em showdown

By Alan Blair and Abdallah Saffidine

School of Computer Science and Engineering, University of New South Wales, Sydney 2052, Australia. Email: blair@cse.unsw.edu.au; [email protected]

Superhuman performance by artificial intelligence (AI) has been demonstrated in two-player, deterministic, zero-sum, perfect-information games (1) such as chess, checkers (2), Hex, and Go (3). Research using AI has broadened to include games with challenging attributes such as randomness, multiple players, or imperfect information. Randomness is a feature of dice games, and card games include the additional complexity that each player sees some cards that are hidden from others. These aspects more closely resemble real-world situations, and this research may thus lead to algorithms with wider applicability. On page 885 of this issue, Brown and Sandholm (4) show that a new computer player called Pluribus exceeds human performance for six-player Texas hold’em poker.

Texas hold’em is a variant of poker where each player has two hidden cards and five communal cards from which to make the best poker hand. Scientists have previously demonstrated superhuman performance for two-player versions of the game: limit hold’em, where bets are capped (5), and no-limit hold’em, where they are not capped (6). Exceeding human performance in the six-player version had remained a challenge.

Games with three or more players represent a challenge for game theory. For two-player, zero-sum games, there exists a pair of strategies such that neither player can improve their expected outcome by unilaterally switching to a different strategy. This so-called Nash equilibrium is considered a solution to the game. For multiplayer games, the expected reward may vary from one Nash equilibrium to another. Fast algorithms such as counterfactual regret minimization (CFR), which are guaranteed to converge to a Nash equilibrium in the two-player, zero-sum setting, may fail to do so in multiplayer games. Nevertheless, CFR still shows promising empirical performance in some multiplayer domains.

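In standard game-theoretic notation (a textbook definition, not specific to Pluribus), a strategy profile $\sigma = (\sigma_1, \ldots, \sigma_n)$ with expected utilities $u_i$ is a Nash equilibrium when
\[
u_i(\sigma_i, \sigma_{-i}) \;\ge\; u_i(\sigma_i', \sigma_{-i}) \qquad \text{for every player } i \text{ and every alternative strategy } \sigma_i',
\]
where $\sigma_{-i}$ denotes the strategies of the other players. In a two-player zero-sum game ($u_1 = -u_2$), every equilibrium gives each player the same value, which is why an equilibrium can be treated as the solution; with three or more players, different equilibria can give different expected rewards.
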
Pluribus first learns a generic or “blueprint” strategy by self-play. Then, during actual play, it computes a real-time strategy to refine the blueprint strategy, depending on the current state of the game. The program learns the blueprint strategy through a CFR variant called Monte Carlo CFR (MCCFR) (7), with a number of enhancements. Pluribus repeatedly simulates hands of poker in which all players use the same strategy; after each hand, it recursively examines each decision and estimates the expected outcome from that decision, compared to other possible actions that could have been chosen in the same situation.

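To illustrate the kind of regret bookkeeping that CFR-style methods build on, here is a minimal Python sketch of regret matching at a single decision point. The action set, payoffs, and training loop are illustrative assumptions for this note, not the authors’ code; real MCCFR applies this update across sampled traversals of the full game tree.

import random
from collections import defaultdict

ACTIONS = ["fold", "call", "raise"]      # illustrative action set

regret_sum = defaultdict(float)          # cumulative regret per action
strategy_sum = defaultdict(float)        # cumulative strategy, for the average

def current_strategy():
    # Regret matching: play actions in proportion to their positive regret.
    positive = {a: max(regret_sum[a], 0.0) for a in ACTIONS}
    total = sum(positive.values())
    if total > 0:
        return {a: positive[a] / total for a in ACTIONS}
    return {a: 1.0 / len(ACTIONS) for a in ACTIONS}

def update(action_values):
    # Compare each action's estimated value with the value of the strategy
    # actually used, and accumulate the difference as regret.
    strategy = current_strategy()
    expected = sum(strategy[a] * action_values[a] for a in ACTIONS)
    for a in ACTIONS:
        regret_sum[a] += action_values[a] - expected
        strategy_sum[a] += strategy[a]

# Toy training loop with made-up, noisy action values.
for _ in range(10000):
    update({a: random.gauss({"fold": -1.0, "call": 0.2, "raise": 0.1}[a], 1.0)
            for a in ACTIONS})

total = sum(strategy_sum.values())
print({a: round(strategy_sum[a] / total, 3) for a in ACTIONS})   # average strategy

It is the average strategy over all iterations, not the latest one, that carries the convergence guarantee in the two-player zero-sum setting.
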
To improve the efficiency of MCCFR in Pluribus, the authors introduced linear weighted discounting in the early stages of training and strategic pruning of negative-regret actions in the later stages. The most intricate part of the system is the real-time strategy component. To deal with imperfect information, Pluribus performs a nested search, maintaining a probability distribution over root nodes of the search tree and over the cards held by each player, conditioned on the assumption that all players are using the same (known) strategy. To evaluate leaf nodes robustly, Pluribus considers four different variants of the blueprint strategy.

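The two training enhancements can be pictured with a short sketch. The constants, schedule, and function names below are illustrative assumptions; the published system defines its own discounting schedule and pruning threshold.

# Illustrative constants; not the values used by Pluribus.
WARMUP_ITERS = 400          # early phase during which discounting is applied
PRUNE_THRESHOLD = -300.0    # regrets below this mark an action as hopeless

def discount_regrets(regret_sum, iteration):
    # Linear discounting: scale down regrets accumulated so far, so that the
    # noisy estimates from early iterations do not dominate later training.
    if iteration < WARMUP_ITERS:
        factor = iteration / (iteration + 1)
        for action in regret_sum:
            regret_sum[action] *= factor

def actions_to_explore(regret_sum, actions):
    # Pruning in the later stages: skip actions whose accumulated regret is
    # so negative that they are unlikely to re-enter a good strategy.
    return [a for a in actions if regret_sum[a] > PRUNE_THRESHOLD]

The pruning threshold is a trade-off: a more aggressive cutoff saves computation but risks discarding an action before its regret has been estimated reliably.
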
For large and complex games, abstraction of states and actions can be used to restrain the growth of the search tree (8). This is necessary for the full six-player no-limit Texas hold’em game, which is too complex to search directly. Instead, Pluribus simulates a simpler version of the game by bucketing similar decision points together and eliminating some actions (see the figure). For example, the blueprint strategy considers



Figure: Abstraction in game tree search
In its abstraction mechanism, Pluribus shrinks the number of decision points on whether to call, raise, or fold by bucketing similar situations together. This reduces the game tree search complexity in poker from an intractable to a solvable problem, using Monte Carlo counterfactual regret minimization (MCCFR).
Real game: Pluribus needs an action (call, raise, or fold) for each scenario; solving the real game directly is intractable.
Abstraction → Abstract game: similar scenarios, like the 9-high and 10-high straights, are bucketed together; solving the abstract game is feasible.
Abstract strategy: Pluribus uses MCCFR to map each bucket to a distribution over actions.
Translation → Real strategy: each scenario is mapped to a distribution over actions based on the abstract strategy for its bucket.
GRAPHIC: A. KITTERMAN/SCIENCE
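
To make the bucketing step concrete, here is a minimal Python sketch. The abstraction key (a coarse hand-strength estimate plus a pot-size bracket), the probabilities, and all names are illustrative assumptions; Pluribus’s actual abstraction of states and bet sizes is far richer.

from collections import defaultdict

def bucket_key(hand_strength, pot_size):
    # Illustrative abstraction: collapse a continuous hand-strength estimate
    # and the pot size into a coarse pair of indices. Scenarios that share a
    # key are treated as the same decision point during training.
    return (int(hand_strength * 10), min(pot_size // 100, 5))

# Map each bucket to the action distribution learned for it (toy numbers).
abstract_strategy = defaultdict(lambda: {"fold": 1 / 3, "call": 1 / 3, "raise": 1 / 3})
abstract_strategy[(7, 2)] = {"fold": 0.0, "call": 0.7, "raise": 0.3}

def real_strategy(hand_strength, pot_size):
    # A concrete scenario is played using the strategy learned for its bucket.
    return abstract_strategy[bucket_key(hand_strength, pot_size)]

# Two similar hands (say, a 9-high and a 10-high straight) land in the same
# bucket, so they inherit the same action distribution.
print(real_strategy(0.78, 200))
print(real_strategy(0.74, 230))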