of her benchmarks, TuringAdvice, does just that—asking models to answer requests for advice posted on Reddit. So far, however, results are not spectacular—the AI responses only beat human responses about 15% of the time. "It's kind of an overly ambitious leaderboard," she says. "Nobody actually wants to work on it, because it's depressing."
Bowman has a different approach to closing off shortcuts. For his latest benchmark, posted online in December 2021 and called QuALITY (Question Answering with Long Input Texts, Yes!), he hired crowdworkers to generate questions about text passages from short stories and nonfiction articles. He hired another group to answer the questions after reading the passages at their own pace, and a third group to answer them hurriedly under a strict time limit. The benchmark consists of questions that the careful readers could answer but the rushed ones couldn't; it leaves few shortcuts for an AI.
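In code, that selection criterion amounts to keeping only the questions that unhurried annotators mostly answered correctly and time-pressured annotators mostly did not. A minimal sketch follows; the field names and threshold are illustrative, not QuALITY's actual data format or filtering rule.

```python
# Illustrative QuALITY-style filter: keep a question only if careful
# readers tended to get it right and rushed readers tended to get it wrong.
# The record schema and 0.5 threshold are hypothetical.

def keep_question(record, threshold=0.5):
    careful = record["careful_answers"]
    rushed = record["rushed_answers"]
    careful_acc = sum(a == record["gold"] for a in careful) / len(careful)
    rushed_acc = sum(a == record["gold"] for a in rushed) / len(rushed)
    return careful_acc > threshold and rushed_acc <= threshold

questions = [
    {"gold": "B", "careful_answers": ["B", "B", "B"], "rushed_answers": ["A", "C", "B"]},
    {"gold": "D", "careful_answers": ["D", "D", "A"], "rushed_answers": ["D", "D", "D"]},
]

# Keeps only the first question: careful readers aced it, rushed readers did not.
hard_subset = [q for q in questions if keep_question(q)]
```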
BETTER BENCHMARKS are only one part of the solution, researchers say. Developers also need to avoid obsessing over scores. Joaquin Vanschoren, a computer scientist at Eindhoven University of Technology, decries the emphasis on being "state of the art" (SOTA)—sitting on top of a leaderboard—and says "SOTA chasing" is stifling innovation. He wants the reviewers who act as gatekeepers at AI conferences to de-emphasize scores, and envisions a "not-state-of-the-art track, or something like that, where you focus on novelty."
The pursuit of high scores can lead to the AI equivalent of doping. Researchers often tweak and juice the models with special software settings or hardware that can vary from run to run on the benchmark, resulting in model performances that aren't reproducible in the real world. Worse, researchers tend to cherry-pick among similar benchmarks until they find one where their model comes out on top, Vanschoren says. "Every paper has a new method that outperforms all the other ones, which is theoretically impossible," he says. To combat the cherry-picking, Vanschoren's team recently co-created OpenML Benchmarking Suites, which bundles benchmarks and compiles detailed performance results across them. It might be easy to tailor a model for a particular benchmark, but far harder to tune for dozens of benchmarks at once.
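For readers curious what evaluating across a bundle of benchmarks looks like in practice, here is a minimal sketch using the openml-python package, written as I understand its documented helpers; the suite name, classifier, and five-task cutoff are illustrative choices, not a prescription from Vanschoren's team.

```python
# Sketch: run one model across an OpenML benchmarking suite and report
# results for the whole bundle, rather than a single hand-picked benchmark.
# Assumes the openml-python package; choices below are illustrative.
import openml
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

suite = openml.study.get_suite("OpenML-CC18")   # a curated bundle of classification tasks
model = RandomForestClassifier(n_estimators=100, random_state=0)

results = {}
for task_id in suite.tasks[:5]:                 # first few tasks, to keep the demo short
    task = openml.tasks.get_task(task_id)
    run = openml.runs.run_model_on_task(model, task, avoid_duplicate_runs=False)
    fold_scores = run.get_metric_fn(accuracy_score)   # per-fold accuracies
    results[task_id] = sum(fold_scores) / len(fold_scores)

# A model that truly generalizes should look good across the whole table,
# not just on one favorable entry.
for task_id, acc in sorted(results.items()):
    print(f"task {task_id}: mean accuracy {acc:.3f}")
```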
Another problem with scores is that one number, such as accuracy, doesn't tell you everything. Kiela recently released Dynaboard—a sort of companion to Dynabench. It reports a model's "Dynascore," its performance on a benchmark across a variety of factors: accuracy, speed, memory usage, fairness, and robustness to input tweaks. Users can weight the factors that matter most for them. Kiela says an engineer at Facebook might value accuracy more than a smartwatch designer, who might instead prize energy efficiency.
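The idea of folding several measurements into one user-weighted number can be shown with a toy calculation. This is a simplified weighted average for illustration only, with invented numbers; it is not Dynaboard's actual Dynascore formula, which aggregates metrics in a more involved, utility-based way.

```python
# Toy multi-metric score: map each raw metric onto [0, 1] with 1 = best,
# then combine with user-chosen weights. All values are invented.

metrics = {
    "accuracy":   {"value": 0.91, "best": 1.0,  "worst": 0.0},
    "latency_ms": {"value": 45.0, "best": 10.0, "worst": 500.0},  # lower is better
    "memory_gb":  {"value": 1.2,  "best": 0.5,  "worst": 8.0},    # lower is better
    "fairness":   {"value": 0.88, "best": 1.0,  "worst": 0.0},
    "robustness": {"value": 0.78, "best": 1.0,  "worst": 0.0},
}

# A lab might weight accuracy heavily; a smartwatch designer might
# shift weight toward latency and memory instead.
weights = {"accuracy": 0.4, "latency_ms": 0.2, "memory_gb": 0.1,
           "fairness": 0.1, "robustness": 0.2}

def normalized(m):
    """Rescale so 1.0 is the best possible value and 0.0 the worst;
    the best/worst ordering encodes whether higher or lower is better."""
    return (m["value"] - m["worst"]) / (m["best"] - m["worst"])

composite = sum(weights[name] * normalized(m) for name, m in metrics.items())
print(f"composite score: {composite:.3f}")
```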
A more radical rethinking of scores acknowledges that often there's no "ground truth" against which to say a model is right or wrong. People disagree on what's funny or whether a building is tall. Some benchmark designers just toss out ambiguous or controversial examples from their test data, calling it noise. But last year, Massimo Poesio, a computational linguist at Queen Mary University of London, and his colleagues created a benchmark that evaluates a model's ability to learn from disagreement among the human data labelers.
They trained models on pairs of text snippets that people ranked for their relative humorousness. Then they showed new pairs to the models and asked them to judge the probability that the first was funnier, rather than simply providing a binary yes or no answer. Each model was scored on how closely its estimate matched the distribution of annotations made by humans. "You want to reward the systems that are able to tell you, you know, 'I'm really not that sure about these cases. Maybe you should have a look,'" Poesio says.
AN OVERARCHING problem for benchmarks is the lack of incentives for developing them. For a paper published last year, Google researchers interviewed 53 AI practitioners in industry and academia. Many noted a lack of rewards for improving data sets—the heart of a machine-learning benchmark. The field sees it as less glamorous than designing models. "The movement for focusing on data versus models is very new," says Lora Aroyo, a Google researcher and one of the paper's authors. "I think the machine-learning community is catching up on this. But it's still a bit of a niche."
Whereas other fields value papers in top journals, in AI perhaps the biggest metric of success is a conference presentation. Last year, the prestigious Neural Information Processing Systems (NeurIPS) conference launched a new data sets and benchmarks track for reviewing and publishing papers on these topics, immediately creating new motivation to work on them. "It was a surprising success," says Vanschoren, the track's co-chair. Organizers expected a couple dozen submissions and received more than 500, "which shows that this was something that people have been wanting for a long time," Vanschoren says.
Some of the NeurIPS papers offered new data sets or benchmarks, whereas others revealed problems with existing ones. One found that among 10 popular vision, language, and audio benchmarks, at least 3% of labels in the test data are incorrect, and that these errors throw off model rankings.
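A quick simulation shows why a few percent of wrong test labels can scramble a leaderboard. The error rates and test-set size below are invented for illustration; this is not the NeurIPS study's methodology.

```python
# Toy simulation: two models of nearly equal true quality are ranked on a
# test set where 3% of labels are wrong. All numbers are invented.
import random

random.seed(0)
N = 10_000          # test examples
label_noise = 0.03  # fraction of test labels that are incorrect

true_labels = [random.randint(0, 1) for _ in range(N)]
observed = [y if random.random() > label_noise else 1 - y for y in true_labels]

def simulate_model(error_rate):
    """Predictions that disagree with the TRUE label at the given rate."""
    return [y if random.random() > error_rate else 1 - y for y in true_labels]

model_a = simulate_model(0.100)  # truly slightly better
model_b = simulate_model(0.105)  # truly slightly worse

def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

print("against true labels:     A", accuracy(model_a, true_labels),
      " B", accuracy(model_b, true_labels))
print("against observed labels: A", accuracy(model_a, observed),
      " B", accuracy(model_b, observed))
# Against noisy labels the measured gap shrinks and picks up extra variance,
# so the ranking of two closely matched models can flip from run to run.
```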
Although many researchers want to incentivize better benchmarks, some don't want the field to embrace them too much. They point to one version of an aphorism known as Goodhart's law: When you teach to the test, tests lose their validity. "People substitute them for understanding," Ribeiro says. "A benchmark should be a tool in the toolbox of the practitioner where they're trying to figure out, 'OK, what's my model doing?'"
Matthew Hutson is a journalist in New York City.

[Graphic] Quick learners: The speed at which artificial intelligence models master benchmarks and surpass human baselines is accelerating. But they often fall short in the real world. The chart plots relative model performance against the human-performance baseline from 2000 to 2020 for six benchmarks: MNIST (handwriting recognition), Switchboard (speech recognition), ImageNet (image recognition), SQuAD 1.1 and SQuAD 2.0 (reading comprehension), and GLUE (language understanding). Credits: graphic, K. Franklin/Science; data, D. Kiela et al., Dynabench: Rethinking Benchmarking in NLP, DOI: 10.48550/arXiv.2104.14337.