Scientific American - USA (2020-05)



VENTURES
THE BUSINESS OF INNOVATION


Wade Roush is the host and producer of Soonish, a podcast
about technology, culture, curiosity and the future. He
is a co-founder of the podcast collective Hub & Spoke and
a freelance reporter for print, online and radio outlets,
such as MIT Technology Review, Xconomy, WBUR and WHYY.

Illustration by Jay Bendt

Back in 2010 Matt Thompson, then with National Public Radio, forecast in an op-ed that “at some point in the near future, automatic speech transcription will become fast, free, and decent.” He called that moment the “Speakularity,” in a sly reference to inventor Ray Kurzweil’s vision of the “singularity,” in which our minds will be uploaded into computers. And Thompson predicted that access to reliable automatic speech-recognition (ASR) software would transform the work of journalists—to say nothing of lawyers, marketers, people with hearing disabilities, and everyone else who deals in both spoken and written language.
Desperate for any technology that would free me from the exhausting process of typing real-time notes during interviews, I was enraptured by Thompson’s prediction. But while his brilliant career in radio has continued (he is now editor in chief of the Center for Investigative Reporting’s news output, including its show Reveal), the Speakularity seems as far away as ever.
There has been important progress, to be sure. Several startups, such as Otter, Sonix, Temi and Trint, offer online services that allow customers to upload digital audio files and, minutes later, receive computer-generated transcripts. In my life as an audio producer, I use these services every day. Their speed keeps increasing, and their cost keeps going down, which is welcome.
But accuracy is another matter. In 2016 a team at Microsoft Research announced that it had trained its machine-learning algorithms to transcribe speech from a standard corpus of recordings with record-high 94 percent accuracy. Professional human transcriptionists performed no better than the program in Microsoft’s tests, which led media outlets to celebrate the arrival of “parity” between humans and software in speech recognition.
The thing is, that last 6 percent makes all the difference. I can tell you from bitter experience that cleaning up a transcript that is 94 percent accurate can take almost as long as transcribing the audio manually. And four years after that breakthrough, services such as Temi still claim no better than 95 percent—and then only for recordings of clear, unaccented speech.
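Accuracy figures like these are usually stated as the complement of the word error rate (WER): the word-level edit distance between a reference transcript and the machine’s output, divided by the length of the reference. A minimal sketch of that standard calculation (the `wer` function and the sample sentences here are illustrative, not drawn from any transcription service):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming table: d[i][j] is the edit distance between
    # the first i reference words and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

ref = "please transcribe your podcast for the archive"
hyp = "please transcribe your punk ass for the archive"
print(f"WER: {wer(ref, hyp):.0%}")  # one substitution plus one insertion in 7 words
```

By this measure, a “94 percent accurate” transcript has a 6 percent WER, so a 20-minute interview at a typical 150 words per minute leaves roughly 180 errors to hunt down by hand.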
Why is accuracy so important? Well, to take one example, more and more audio producers (including myself) are complying with Internet accessibility guidelines by publishing transcripts of their podcasts—and no one wants to share a transcript in which one in every 20 words contains an error. And think how much time people could save if voice assistants such as Alexa, Bixby, Cortana, Google Assistant and Siri understood every question or command the first time.
ASR systems may never reach 100 percent accuracy. After all, humans do not always speak fluently, even in their native languages. And speech is so full of homophones that comprehension always depends on context. (I have seen transcription services render “iOS” as “ayahuasca” and “your podcast” as “your punk ass.”)
But all I am asking for is a 1 or 2 percent improvement in accuracy. In machine learning, one of the main ways to reduce an algorithm’s error rate is to feed it higher-quality training data. It is going to be crucial, therefore, for transcription services to figure out privacy-friendly ways of gathering more such data. Every time I clean up a Trint or Sonix transcript, for example, I am generating new, validated data that could be matched to the original audio and used to improve the models. I would be happy to let the companies use it if it meant there would be fewer errors over time.
Getting such data is surely one path to the Speakularity. Given
the growing number of conversations we have with our machines
and the increasing amount of audio created every day, we should
not be thinking of decent automatic transcription as a luxury or
an aspiration anymore. It is an absolute necessity.

JOIN THE CONVERSATION ONLINE
Visit Scientific American on Facebook and Twitter
or send a letter to the editor: [email protected]

Seeking Software That Hears Better

In the speech-recognition business, 95 percent accuracy might as well be zero

By Wade Roush