deployment or when probed carefully. “They fall apart in embarrassing ways pretty easily,” Bowman says.
By strategically adding stickers to a stop sign, for example, researchers in 2018 fooled standard image recognition systems into seeing a speed limit sign instead. And a 2018 project called Gender Shades found the accuracy of gender identification for commercial face-recognition systems dropped from 90% to 65% for dark-skinned women’s faces. “I really don’t know if we’re prepared to deploy these systems,” says Deborah Raji, a computer scientist at Mozilla who collaborated on a follow-up to the original Gender Shades paper.
Natural language processing (NLP) models can be fickle, too. In 2020, Marco Túlio Ribeiro, a computer scientist at Microsoft, and his colleagues reported many hidden bugs in top models, including those from Microsoft, Google, and Amazon. Many give wildly different outputs after small tweaks to their inputs, such as replacing a word with a synonym, or asking “what’s” versus “what is.” When commercial models were tasked with evaluating a statement that included a negation at the end (“I thought the plane [ride] would be awful, but it wasn’t”), they almost always got the sense of the sentence wrong, Ribeiro says. “A lot of people did not imagine that these state-of-the-art models could be so bad.”
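Checks like these are easy to automate. The Python sketch below is illustrative rather than taken from Ribeiro’s study: it runs an off-the-shelf sentiment classifier (the default Hugging Face transformers pipeline, used here as a stand-in for the commercial systems the researchers probed) on pairs of inputs that should receive the same label, including a contraction swap, a synonym swap, and a sentence with a trailing negation, and flags any pair where the prediction flips.

```python
# Minimal sketch of a perturbation test: feed a model near-identical inputs
# and flag cases where its prediction flips. The sentiment pipeline below is
# a stand-in, not one of the commercial systems described in the article.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # loads a default English sentiment model

# Pairs that should receive the same label: a contraction swap, a synonym
# swap, and two phrasings of a sentence whose trailing negation makes it positive.
probes = [
    ("What's the best airline?", "What is the best airline?"),
    ("The flight was great.", "The flight was excellent."),
    ("I thought the ride would be awful, but it wasn't.",
     "I thought the ride would be awful, but it was fine."),
]

for a, b in probes:
    label_a = classifier(a)[0]["label"]
    label_b = classifier(b)[0]["label"]
    status = "consistent" if label_a == label_b else "FLIPPED"
    print(f"{status}: {a!r} -> {label_a} | {b!r} -> {label_b}")
```

On a robust model every pair prints “consistent”; the failures Ribeiro describes show up as flips on exactly these kinds of minimal edits.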
The solution, most researchers argue, is not to abandon benchmarks, but to make them better. Some want to make the tests tougher, whereas others want to use them to illuminate biases. Still others want to broaden benchmarks so they present questions that have no single correct answer, or measure performance on more than one metric. The AI field is starting to value the unglamorous work of developing the training and test data that make up benchmarks, says Bowman, who has now constructed more than a dozen of them. “Data work is changing quite a bit,” he says. “It’s gaining legitimacy.”
THE MOST OBVIOUS path to improving benchmarks is to keep making them harder. Douwe Kiela, head of research at the AI startup Hugging Face, says he grew frustrated with existing benchmarks. “Bench-