deployment or when probed carefully. “They fall apart in embarrassing ways pretty easily,” Bowman says.
By strategically adding stickers to a stop sign, for example, researchers in 2018 fooled standard image recognition systems into seeing a speed limit sign instead. And a 2018 project called Gender Shades found the accuracy of gender identification for commercial face-recognition systems dropped from 90% to 65% for dark-skinned women’s faces. “I really don’t know if we’re prepared to deploy these systems,” says Deborah Raji, a computer scientist at Mozilla who collaborated on a follow-up to the original Gender Shades paper.
Natural language processing (NLP) models can be fickle, too. In 2020, Marco Túlio Ribeiro, a computer scientist at Microsoft, and his colleagues reported many hidden bugs in top models, including those from Microsoft, Google, and Amazon. Many give wildly different outputs after small tweaks to their inputs, such as replacing a word with a synonym, or asking “what’s” versus “what is.” When commercial models were tasked with evaluating a statement that included a negation at the end (“I thought the plane [ride] would be awful, but it wasn’t”), they almost always got the sense of the sentence wrong, Ribeiro says. “A lot of people did not imagine that these state-of-the-art models could be so bad.”
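Checks like these are easy to automate. The Python sketch below is illustrative rather than taken from Ribeiro’s study: it runs an off-the-shelf sentiment classifier (the default Hugging Face transformers pipeline, used here as a stand-in for the commercial systems the researchers probed) on pairs of inputs that should receive the same label, including a contraction swap, a synonym swap, and a sentence with a trailing negation, and flags any pair where the prediction flips.

```python
# Minimal sketch of a perturbation test: feed a model near-identical inputs
# and flag cases where its prediction flips. The sentiment pipeline below is
# a stand-in, not one of the commercial systems described in the article.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # loads a default English sentiment model

# Pairs that should receive the same label: a contraction swap, a synonym
# swap, and two phrasings of a sentence whose trailing negation makes it positive.
probes = [
    ("What's the best airline?", "What is the best airline?"),
    ("The flight was great.", "The flight was excellent."),
    ("I thought the ride would be awful, but it wasn't.",
     "I thought the ride would be awful, but it was fine."),
]

for a, b in probes:
    label_a = classifier(a)[0]["label"]
    label_b = classifier(b)[0]["label"]
    status = "consistent" if label_a == label_b else "FLIPPED"
    print(f"{status}: {a!r} -> {label_a} | {b!r} -> {label_b}")
```

On a robust model every pair prints “consistent”; the failures Ribeiro describes show up as flips on exactly these kinds of minimal edits.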
The solution, most researchers argue, is not to abandon benchmarks, but to make them better. Some want to make the tests tougher, whereas others want to use them to illuminate biases. Still others want to broaden benchmarks so they present questions that have no single correct answer, or measure performance on more than one metric. The AI field is starting to value the unglamorous work of developing the training and test data that make up benchmarks, says Bowman, who has now constructed more than a dozen of them. “Data work is changing quite a bit,” he says. “It’s gaining legitimacy.”
THE MOST OBVIOUS path to improving benchmarks is to keep making them harder. Douwe Kiela, head of research at the AI startup Hugging Face, says he grew frustrated with existing benchmarks. “Bench-