Scientific American - USA (2020-05)



VENTURES
THE BUSINESS OF INNOVATION


Wade Roush is the host and producer of Soonish, a podcast
about technology, culture, curiosity and the future. He
is a co-founder of the podcast collective Hub & Spoke and
a freelance reporter for print, online and radio outlets,
such as MIT Technology Review, Xconomy, WBUR and WHYY.

Illustration by Jay Bendt

Back in 2010 Matt Thompson, then with National Public Radio, forecast in an op-ed that “at some point in the near future, automatic speech transcription will become fast, free, and decent.” He called that moment the “Speakularity,” in a sly reference to inventor Ray Kurzweil’s vision of the “singularity,” in which our minds will be uploaded into computers. And Thompson predicted that access to reliable automatic speech-recognition (ASR) software would transform the work of journalists—to say nothing of lawyers, marketers, people with hearing disabilities, and everyone else who deals in both spoken and written language.
Desperate for any technology that would free me from the exhausting process of typing real-time notes during interviews, I was enraptured by Thompson’s prediction. But while his brilliant career in radio has continued (he is now editor in chief of the Center for Investigative Reporting’s news output, including its show Reveal), the Speakularity seems as far away as ever.
There has been important progress, to be sure. Several startups, such as Otter, Sonix, Temi and Trint, offer online services that allow customers to upload digital audio files and, minutes later, receive computer-generated transcripts. In my life as an audio producer, I use these services every day. Their speed keeps increasing, and their cost keeps going down, which is welcome.
But accuracy is another matter. In 2016 a team at Microsoft Research announced that it had trained its machine-learning algorithms to transcribe speech from a standard corpus of recordings with record-high 94 percent accuracy. Professional human transcriptionists performed no better than the program in Microsoft’s tests, which led media outlets to celebrate the arrival of “parity” between humans and software in speech recognition.
The thing is, that last 6 percent makes all the difference. I can tell you from bitter experience that cleaning up a transcript that is 94 percent accurate can take almost as long as transcribing the audio manually. And four years after that breakthrough, services such as Temi still claim no better than 95 percent—and then only for recordings of clear, unaccented speech.
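Accuracy figures like these are usually stated as the complement of the word error rate (WER): the word-level edit distance between a reference transcript and the machine’s output, divided by the length of the reference. A minimal sketch of that standard calculation (the `wer` function and the sample sentences here are illustrative, not drawn from any transcription service):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming table: d[i][j] is the edit distance between
    # the first i reference words and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

ref = "please transcribe your podcast for the archive"
hyp = "please transcribe your punk ass for the archive"
print(f"WER: {wer(ref, hyp):.0%}")  # one substitution plus one insertion in 7 words
```

By this measure, a “94 percent accurate” transcript has a 6 percent WER, so a 20-minute interview at a typical 150 words per minute leaves roughly 180 errors to hunt down by hand.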
Why is accuracy so important? Well, to take one example, more and more audio producers (including myself) are complying with Internet accessibility guidelines by publishing transcripts of their podcasts—and no one wants to share a transcript in which one in every 20 words contains an error. And think how much time people could save if voice assistants such as Alexa, Bixby, Cortana, Google Assistant and Siri understood every question or command the first time.
ASR systems may never reach 100 percent accuracy. After all, humans do not always speak fluently, even in their native languages. And speech is so full of homophones that comprehension always depends on context. (I have seen transcription services render “iOS” as “ayahuasca” and “your podcast” as “your punk ass.”)
But all I am asking for is a 1 or 2 percent improvement in accuracy. In machine learning, one of the main ways to reduce an algorithm’s error rate is to feed it higher-quality training data. It is going to be crucial, therefore, for transcription services to figure out privacy-friendly ways of gathering more such data. Every time I clean up a Trint or Sonix transcript, for example, I am generating new, validated data that could be matched to the original audio and used to improve the models. I would be happy to let the companies use it if it meant there would be fewer errors over time.
Getting such data is surely one path to the Speakularity. Given
the growing number of conversations we have with our machines
and the increasing amount of audio created every day, we should
not be thinking of decent automatic transcription as a luxury or
an aspiration anymore. It is an absolute necessity.

JOIN THE CONVERSATION ONLINE
Visit Scientific American on Facebook and Twitter
or send a letter to the editor: [email protected]

Seeking Software That Hears Better

In the speech-recognition business, 95 percent accuracy might as well be zero

By Wade Roush