Science - USA (2022-05-06)

NEWS


ILLUSTRATION: JASON SOLO/THE JACKY WINTER GROUP

Trained on billions of words from books, news articles, and Wikipedia, artificial intelligence (AI) language models can produce uncannily human prose. They can generate tweets, summarize emails, and translate dozens of languages. They can even write tolerable poetry. And like overachieving students, they quickly master the tests, called benchmarks, that computer scientists devise for them.
That was Sam Bowman’s sobering experience when he and his colleagues created a tough new benchmark for language models called GLUE (General Language Understanding Evaluation). GLUE gives AI models the chance to train on data sets containing thousands of sentences and confronts them with nine tasks, such as deciding whether a test sentence is grammatical, assessing its sentiment, or judging whether one sentence logically entails another. After completing the tasks, each model is given an average score.
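GLUE’s headline number is that per-model average across its tasks. A minimal sketch of the scoring scheme, with hypothetical task names and scores (none are real results from the article):

```python
# Sketch of GLUE-style scoring: each task yields a score on a 0-100
# scale, and the leaderboard reports one unweighted average per model.
# Task names and values below are illustrative, not real benchmark data.
task_scores = {
    "grammaticality": 62.0,   # is the test sentence grammatical?
    "sentiment": 71.5,        # is the sentence positive or negative?
    "entailment": 55.0,       # does sentence A logically entail sentence B?
}

def benchmark_average(scores: dict[str, float]) -> float:
    """Unweighted mean across tasks, summarizing a model in one number."""
    return sum(scores.values()) / len(scores)

print(round(benchmark_average(task_scores), 1))  # 62.8
```

A single averaged number is what makes leaderboard races like the one described below possible, but it also hides which individual tasks a model is weak on.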
At first, Bowman, a computer scientist at New York University, thought he had stumped the models. The best ones scored less than 70 out of 100 points (a D+). But in less than 1 year, new and better models were scoring close to 90, outperforming humans. “We were really surprised with the surge,” Bowman says. So in 2019 the researchers made the benchmark even harder, calling it SuperGLUE. Some of the tasks required the AI models to answer reading comprehension questions after digesting not just sentences, but paragraphs drawn from Wikipedia or news sites. Again, humans had an initial 20-point lead. “It wasn’t that shocking what happened next,” Bowman says. By early 2021, computers were again beating people.
The competition for top scores on benchmarks has driven real progress in AI. Many credit the ImageNet challenge, a computer-vision competition that began in 2010, with spurring a revolution in deep learning, the leading AI approach, in which “neural networks” inspired by the brain learn on their own from large sets of examples. But the top benchmark performers are not always superhuman in the real world. Time and again, models ace their tests, then fail in de-
FEATURES


TAUGHT TO


570 6 MAY 2022 • VOL 376 ISSUE 6593

