of her benchmarks, TuringAdvice, does just that—asking models to answer requests for advice posted on Reddit. So far, however, results are not spectacular—the AI responses only beat human responses about 15% of the time. "It's kind of an overly ambitious leaderboard," she says. "Nobody actually wants to work on it, because it's depressing."
Bowman has a different approach to closing off shortcuts. For his latest benchmark, posted online in December 2021 and called QuALITY (Question Answering with Long Input Texts, Yes!), he hired crowdworkers to generate questions about text passages from short stories and nonfiction articles. He hired another group to answer the questions after reading the passages at their own pace, and a third group to answer them hurriedly under a strict time limit. The benchmark consists of questions that the careful readers could answer but the rushed ones couldn't; it leaves few shortcuts for an AI.
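In code, that selection criterion amounts to keeping only the questions that unhurried annotators mostly answered correctly and time-pressured annotators mostly did not. A minimal sketch follows; the field names and threshold are illustrative, not QuALITY's actual data format or filtering rule.

```python
# Illustrative QuALITY-style filter: keep a question only if careful
# readers tended to get it right and rushed readers tended to get it wrong.
# The record schema and 0.5 threshold are hypothetical.

def keep_question(record, threshold=0.5):
    careful = record["careful_answers"]
    rushed = record["rushed_answers"]
    careful_acc = sum(a == record["gold"] for a in careful) / len(careful)
    rushed_acc = sum(a == record["gold"] for a in rushed) / len(rushed)
    return careful_acc > threshold and rushed_acc <= threshold

questions = [
    {"gold": "B", "careful_answers": ["B", "B", "B"], "rushed_answers": ["A", "C", "B"]},
    {"gold": "D", "careful_answers": ["D", "D", "A"], "rushed_answers": ["D", "D", "D"]},
]

# Keeps only the first question: careful readers aced it, rushed readers did not.
hard_subset = [q for q in questions if keep_question(q)]
```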
BETTER BENCHMARKS are only one part of the solution, researchers say. Developers also need to avoid obsessing over scores. Joaquin Vanschoren, a computer scientist at Eindhoven University of Technology, decries the emphasis on being "state of the art" (SOTA)—sitting on top of a leaderboard—and says "SOTA chasing" is stifling innovation. He wants the reviewers who act as gatekeepers at AI conferences to de-emphasize scores, and envisions a "not-state-of-the-art track, or something like that, where you focus on novelty."
The pursuit of high scores can lead to the AI equivalent of doping. Researchers often tweak and juice the models with special software settings or hardware that can vary from run to run on the benchmark, resulting in model performances that aren't reproducible in the real world. Worse, researchers tend to cherry-pick among similar benchmarks until they find one where their model comes out on top, Vanschoren says. "Every paper has a new method that outperforms all the other ones, which is theoretically impossible," he says. To combat the cherry-picking, Vanschoren's team recently co-created OpenML Benchmarking Suites, which bundles benchmarks and compiles detailed performance results across them. It might be easy to tailor a model for a particular benchmark, but far harder to tune for dozens of benchmarks at once.
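For readers curious what evaluating across a bundle of benchmarks looks like in practice, here is a minimal sketch using the openml-python package, written as I understand its documented helpers; the suite name, classifier, and five-task cutoff are illustrative choices, not a prescription from Vanschoren's team.

```python
# Sketch: run one model across an OpenML benchmarking suite and report
# results for the whole bundle, rather than a single hand-picked benchmark.
# Assumes the openml-python package; choices below are illustrative.
import openml
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

suite = openml.study.get_suite("OpenML-CC18")   # a curated bundle of classification tasks
model = RandomForestClassifier(n_estimators=100, random_state=0)

results = {}
for task_id in suite.tasks[:5]:                 # first few tasks, to keep the demo short
    task = openml.tasks.get_task(task_id)
    run = openml.runs.run_model_on_task(model, task, avoid_duplicate_runs=False)
    fold_scores = run.get_metric_fn(accuracy_score)   # per-fold accuracies
    results[task_id] = sum(fold_scores) / len(fold_scores)

# A model that truly generalizes should look good across the whole table,
# not just on one favorable entry.
for task_id, acc in sorted(results.items()):
    print(f"task {task_id}: mean accuracy {acc:.3f}")
```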
Another problem with scores is that one number, such as accuracy, doesn't tell you everything. Kiela recently released Dynaboard—a sort of companion to Dynabench. It reports a model's "Dynascore," its performance on a benchmark across a variety of factors: accuracy, speed, memory usage, fairness, and robustness to input tweaks. Users can weight the factors that matter most for them. Kiela says an engineer at Facebook might value accuracy more than a smartwatch designer, who might instead prize energy efficiency.
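The idea of folding several measurements into one user-weighted number can be shown with a toy calculation. This is a simplified weighted average for illustration only, with invented numbers; it is not Dynaboard's actual Dynascore formula, which aggregates metrics in a more involved, utility-based way.

```python
# Toy multi-metric score: map each raw metric onto [0, 1] with 1 = best,
# then combine with user-chosen weights. All values are invented.

metrics = {
    "accuracy":   {"value": 0.91, "best": 1.0,  "worst": 0.0},
    "latency_ms": {"value": 45.0, "best": 10.0, "worst": 500.0},  # lower is better
    "memory_gb":  {"value": 1.2,  "best": 0.5,  "worst": 8.0},    # lower is better
    "fairness":   {"value": 0.88, "best": 1.0,  "worst": 0.0},
    "robustness": {"value": 0.78, "best": 1.0,  "worst": 0.0},
}

# A lab might weight accuracy heavily; a smartwatch designer might
# shift weight toward latency and memory instead.
weights = {"accuracy": 0.4, "latency_ms": 0.2, "memory_gb": 0.1,
           "fairness": 0.1, "robustness": 0.2}

def normalized(m):
    """Rescale so 1.0 is the best possible value and 0.0 the worst;
    the best/worst ordering encodes whether higher or lower is better."""
    return (m["value"] - m["worst"]) / (m["best"] - m["worst"])

composite = sum(weights[name] * normalized(m) for name, m in metrics.items())
print(f"composite score: {composite:.3f}")
```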
A more radical rethinking of scores acknowledges that often there's no "ground truth" against which to say a model is right or wrong. People disagree on what's funny or whether a building is tall. Some benchmark designers just toss out ambiguous or controversial examples from their test data, calling it noise. But last year, Massimo Poesio, a computational linguist at Queen Mary University of London, and his colleagues created a benchmark that evaluates a model's ability to learn from disagreement among the human data labelers.
They trained models on pairs of text snippets that people ranked for their relative humorousness. Then they showed new pairs to the models and asked them to judge the probability that the first was funnier, rather than simply providing a binary yes or no answer. Each model was scored on how closely its estimate matched the distribution of annotations made by humans. "You want to reward the systems that are able to tell you, you know, 'I'm really not that sure about these cases. Maybe you should have a look,'" Poesio says.
AN OVERARCHING problem for benchmarks is the lack of incentives for developing them. For a paper published last year, Google researchers interviewed 53 AI practitioners in industry and academia. Many noted a lack of rewards for improving data sets—the heart of a machine-learning benchmark. The field sees it as less glamorous than designing models. "The movement for focusing on data versus models is very new," says Lora Aroyo, a Google researcher and one of the paper's authors. "I think the machine-learning community is catching up on this. But it's still a bit of a niche."
Whereas other fields value papers in top journals, in AI perhaps the biggest metric of success is a conference presentation. Last year, the prestigious Neural Information Processing Systems (NeurIPS) conference launched a new data sets and benchmarks track for reviewing and publishing papers on these topics, immediately creating new motivation to work on them. "It was a surprising success," says Vanschoren, the track's co-chair. Organizers expected a couple dozen submissions and received more than 500, "which shows that this was something that people have been wanting for a long time," Vanschoren says.
Some of the NeurIPS papers offered new data sets or benchmarks, whereas others revealed problems with existing ones. One found that among 10 popular vision, language, and audio benchmarks, at least 3% of labels in the test data are incorrect, and that these errors throw off model rankings.
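A quick simulation shows why a few percent of wrong test labels can scramble a leaderboard. The error rates and test-set size below are invented for illustration; this is not the NeurIPS study's methodology.

```python
# Toy simulation: two models of nearly equal true quality are ranked on a
# test set where 3% of labels are wrong. All numbers are invented.
import random

random.seed(0)
N = 10_000          # test examples
label_noise = 0.03  # fraction of test labels that are incorrect

true_labels = [random.randint(0, 1) for _ in range(N)]
observed = [y if random.random() > label_noise else 1 - y for y in true_labels]

def simulate_model(error_rate):
    """Predictions that disagree with the TRUE label at the given rate."""
    return [y if random.random() > error_rate else 1 - y for y in true_labels]

model_a = simulate_model(0.100)  # truly slightly better
model_b = simulate_model(0.105)  # truly slightly worse

def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

print("against true labels:     A", accuracy(model_a, true_labels),
      " B", accuracy(model_b, true_labels))
print("against observed labels: A", accuracy(model_a, observed),
      " B", accuracy(model_b, observed))
# Against noisy labels the measured gap shrinks and picks up extra variance,
# so the ranking of two closely matched models can flip from run to run.
```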
Although many researchers want to incentivize better benchmarks, some don't want the field to embrace them too much. They point to one version of an aphorism known as Goodhart's law: When you teach to the test, tests lose their validity. "People substitute them for understanding," Ribeiro says. "A benchmark should be a tool in the toolbox of the practitioner where they're trying to figure out, 'OK, what's my model doing?'"
Matthew Hutson is a journalist in New York City.

[Graphic] Quick learners: The speed at which artificial intelligence models master benchmarks and surpass human baselines is accelerating. But they often fall short in the real world. The chart plots relative model performance against the human-performance baseline from 2000 to 2020 for six benchmarks: MNIST (handwriting recognition), Switchboard (speech recognition), ImageNet (image recognition), SQuAD 1.1 and SQuAD 2.0 (reading comprehension), and GLUE (language understanding). Credits: graphic, K. Franklin/Science; data, D. Kiela et al., Dynabench: Rethinking Benchmarking in NLP, DOI: 10.48550/arXiv.2104.14337.