Science - USA (2022-05-06)

NEWS

SCIENCE science.org

ployment or when probed carefully. “They
fall apart in embarrassing ways pretty
easily,” Bowman says.
By strategically adding stickers to a
stop sign, for example, researchers in 2018
fooled standard image recognition systems
into seeing a speed limit sign instead. And
a 2018 project called Gender Shades found
the accuracy of gender identification for
commercial face-recognition systems
dropped from 90% to 65% for dark-skinned
women’s faces. “I really don’t know if we’re
prepared to deploy these systems,” says
Deborah Raji, a computer scientist at
Mozilla who collaborated on a follow-up to
the original Gender Shades paper.
Natural language processing (NLP) models
can be fickle, too. In 2020, Marco Túlio
Ribeiro, a computer scientist at Microsoft,
and his colleagues reported many hidden
bugs in top models, including those from
Microsoft, Google, and Amazon. Many give
wildly different outputs after small tweaks
to their inputs, such as replacing a word
with a synonym, or asking “what’s” versus
“what is.” When commercial models were
tasked with evaluating a statement that
included a negation at the end (“I thought the
plane [ride] would be awful, but it wasn’t”),
they almost always got the sense of the
sentence wrong, Ribeiro says. “A lot of people
did not imagine that these state-of-the-art
models could be so bad.”
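The kinds of probes described here can be sketched in a few lines of Python. This is only an illustrative sketch, not the researchers' actual test suite: `toy_sentiment` is a hypothetical keyword-counting stand-in for a real model, and the helper names, example sentences, and accepted labels are all assumptions made for the example. It checks two things: whether a prediction stays the same after a meaning-preserving tweak, and whether a sentence that ends in a negation is read with the right sense.

```python
from typing import Callable, List, Set, Tuple

Model = Callable[[str], str]


def toy_sentiment(text: str) -> str:
    """Hypothetical stand-in model: it only counts sentiment keywords."""
    positive = {"good", "great", "wonderful"}
    negative = {"awful", "bad", "terrible"}
    words = [w.strip(".,!?").lower() for w in text.split()]
    score = sum(w in positive for w in words) - sum(w in negative for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def invariance_failures(model: Model, pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return the pairs whose predictions differ although the meaning is the same."""
    return [(a, b) for a, b in pairs if model(a) != model(b)]


def expectation_failures(
    model: Model, cases: List[Tuple[str, Set[str]]]
) -> List[Tuple[str, str]]:
    """Return (text, prediction) for every case whose prediction is not acceptable."""
    return [(text, model(text)) for text, ok in cases if model(text) not in ok]


# Small tweaks that should not change the output (assumed examples).
pairs = [
    ("The flight was good.", "The flight was fine."),      # synonym swap
    ("What's good about it?", "What is good about it?"),   # contraction expanded
]

# A negation at the end reverses the sense of the sentence.
negation_cases = [
    ("I thought the plane ride would be awful, but it wasn't.", {"positive", "neutral"}),
]

if __name__ == "__main__":
    print("Invariance failures:", invariance_failures(toy_sentiment, pairs))
    print("Negation failures:", expectation_failures(toy_sentiment, negation_cases))
```

The keyword counter fails the synonym swap and misreads the negated sentence as negative, which is the sort of brittleness such tests are meant to expose. A real audit would swap in calls to an actual model and use far more examples per test.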
The solution, most researchers argue, is
not to abandon benchmarks, but to make
them better. Some want to make the tests
tougher, whereas others want to use them
to illuminate biases. Still others want to
broaden benchmarks so they present
questions that have no single correct answer,
or measure performance on more than one
metric. The AI field is starting to value
the unglamorous work of developing the
training and test data that make up
benchmarks, says Bowman, who has now
constructed more than a dozen of them. “Data
work is changing quite a bit,” he says. “It’s
gaining legitimacy.”

THE MOST OBVIOUS path to improving
benchmarks is to keep making them harder.
Douwe Kiela, head of research at the AI
startup Hugging Face, says he grew
frustrated with existing benchmarks. “Bench-


THE TEST


6 MAY 2022 • VOL 376 ISSUE 6593 571

AI software clears high hurdles on IQ tests but still makes dumb mistakes. Can better benchmarks help?


By Matthew Hutson
