Artificial intelligence has come a long way in recent years, particularly where speech recognition and natural language processing (NLP) are concerned. There was a time when the idea of automated captioning would have seemed utterly absurd; today, however, accurate video captioning has become essential.

In the wake of the pandemic, there was a pressing need to make public health information and announcements more widely available and accessible. Video captioning thus became essential for deaf and hearing-impaired individuals, and Automatic Speech Recognition (ASR) technology received a corresponding boost in interest and development.

It's important to note, of course, that this technology is still no substitute for a human stenographer. Although AI is capable of rapid transcription at scale and in multiple languages, it still struggles with accuracy in some scenarios. Common faults in ASR technology include:
  • Incorrect punctuation or grammar.
  • Recognition errors, particularly if multiple or overlapping voices are involved. 
  • Difficulty parsing starters and fillers such as "uh," "um," "mhm," and "y'know" (these are often stripped out in a clean-up pass after recognition; see the sketch following this list).
  • Trouble understanding low-quality audio.
  • Sensitivity to background noise.
  • Confusion caused by interrupted speech.
  • Confusion between homonyms and other words that sound alike.
  • Difficulty with non-standard pronunciation.
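Filler words, for example, are often handled not inside the recognizer itself but in a separate clean-up pass over the raw transcript. The snippet below is a minimal sketch of that idea; the filler list and example sentence are invented for illustration rather than taken from any vendor's actual normalization rules:

```python
import re

# Illustrative filler list; real systems use larger, language-specific inventories.
FILLERS = {"uh", "um", "mhm", "erm"}

def strip_fillers(transcript: str) -> str:
    """Remove standalone filler tokens from a raw ASR transcript."""
    pattern = r"\b(?:" + "|".join(re.escape(f) for f in FILLERS) + r")\b\s*"
    cleaned = re.sub(pattern, "", transcript, flags=re.IGNORECASE)
    # Collapse any double spaces left behind and trim the ends.
    return re.sub(r"\s{2,}", " ", cleaned).strip()

print(strip_fillers("so um the meeting starts at noon uh tomorrow mhm"))
# -> "so the meeting starts at noon tomorrow"
```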
YouTube's automatic closed captioning is a perfect example of these faults in practice. Open any video and turn on captions: the system will transcribe the majority of the speech accurately, but you're almost guaranteed to encounter an obvious error or two.

When the technology works, it works quite well. Unfortunately, it still trips itself up frequently, confusing certain words or failing to understand certain accents and speech patterns. That may be tolerable when you're simply watching a commentary video online, but in situations where accuracy is crucial, such as phone conversations, the technology simply isn't up to par just yet.
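When engineers ask whether ASR output is "up to par," the usual yardstick is word error rate (WER): the number of word substitutions, insertions, and deletions needed to turn the machine's transcript into a human reference transcript, divided by the length of the reference. Here's a minimal, unoptimized sketch using the standard edit-distance calculation (the example sentences are invented):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = edits needed to turn the first j hypothesis words
    # into the first i reference words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # reference words missing from the hypothesis (deletions)
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # extra hypothesis words (insertions)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,               # deletion
                           dp[i][j - 1] + 1,               # insertion
                           dp[i - 1][j - 1] + substitution)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# One wrong word out of six: a WER of roughly 0.17, or 17%.
print(word_error_rate("please join the call at noon",
                      "please join the fall at noon"))
```

A WER of a few percent can read almost flawlessly, while a WER in the double digits is where the obvious, meaning-changing mistakes described above start to pile up.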
 

The Challenge of ASR Configuration

What many people don't realize about ASR is that it isn't simply a matter of setting a robot loose and having it start transcribing. There's a painstaking configuration process in which engineers feed comprehensive speech data into an algorithm, allowing it to gradually train itself to recognize auditory patterns.

Typically, an ASR system consists of three components (a toy sketch of how they fit together follows the list):
  • An acoustic model trained to recognize and predict phonemes, the smallest units of speech.
  • An interpretation step that parses the acoustic model's output, comparing it against a pre-existing vocabulary, or lexicon, to map phoneme sequences onto candidate words.
  • A language model that weighs the outputs of the two components above, choosing the word sequences most likely to be intended and rendering them as human-readable text.
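To make that division of labor concrete, here is a deliberately toy sketch of how the three pieces might fit together. Every value in it (the phoneme strings, the lexicon, the bigram scores) is an invented illustration; real systems learn these mappings statistically from large speech and text corpora rather than from hand-written tables:

```python
# Toy ASR pipeline: acoustic model -> lexicon -> language model.

def acoustic_model(audio_frames):
    """Stand-in for a trained model: audio features -> likely phoneme sequence."""
    return ["HH", "AH", "L", "OW", "W", "ER", "L", "D"]

# Lexicon: phoneme sequences -> candidate words (including sound-alikes).
LEXICON = {
    ("HH", "AH", "L", "OW"): ["hello", "hallow"],
    ("W", "ER", "L", "D"): ["world", "whirled"],
}

# Language model: scores for word pairs, so "hello world" beats "hallow whirled".
BIGRAM_SCORES = {("hello", "world"): 0.9, ("hallow", "whirled"): 0.1}

def transcribe(audio_frames):
    phonemes = acoustic_model(audio_frames)
    # Word boundaries are hard-coded here; real decoders search over them.
    first = LEXICON[tuple(phonemes[:4])]
    second = LEXICON[tuple(phonemes[4:])]
    # Pick the word pair the language model considers most probable.
    best = max(((a, b) for a in first for b in second),
               key=lambda pair: BIGRAM_SCORES.get(pair, 0.0))
    return " ".join(best)

print(transcribe(audio_frames=None))  # -> "hello world"
```

Even in this toy, the weak spots are visible: if the lexicon is missing a word or the language model has never seen a phrase, the system has no way to reason its way to the right answer.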
An ASR solution is only as good as the lexical and auditory data it has been fed. While this means that ASR technology will inevitably become more sophisticated and accurate over time, it also means it is an innovation that is very easy to get wrong. It's also important to note that ASR technology does not understand context, nor can it apply intuitive judgment to a situation.

The Limitations of Artificial Intelligence

This is not a limitation of ASR technology itself but of artificial intelligence as a whole. While machines are inarguably better than humans at processing large volumes of data or performing calculations, they fall far short in other areas. There are certain things a human can do that remain completely beyond the capabilities of even the most advanced supercomputer, at least for the time being.

In practice, this means that where a human captioner or interpreter will often notice when a speaker corrects themselves mid-statement or stumbles over their words, and will adjust the transcript accordingly, ASR will simply transcribe the speaker's words verbatim, stumbles and all.

For the moment, the best way to address this is to add human agents as an extra 'editing layer' between ASR and the speakers and readers it serves. Many caption providers already do this for live events. Automated captioning is also seeing progressively wider use in the virtual meeting space, where it provides transcripts for marketing and auditing purposes.
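In practice, that editing layer is often wired up around the confidence scores most ASR engines attach to each transcribed segment: anything the recognizer is unsure about gets routed to a human before the caption is published. The threshold and data shapes below are illustrative assumptions rather than any particular provider's workflow:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    text: str
    confidence: float  # 0.0 to 1.0, as reported by the ASR engine

# Illustrative cutoff; real workflows tune this per domain and audience.
REVIEW_THRESHOLD = 0.85

def route_segments(segments):
    """Split ASR output into auto-publishable text and text needing human review."""
    auto, needs_review = [], []
    for seg in segments:
        (auto if seg.confidence >= REVIEW_THRESHOLD else needs_review).append(seg)
    return auto, needs_review

auto, flagged = route_segments([
    Segment("Welcome to the quarterly all-hands meeting.", 0.97),
    Segment("Our new product is called [unclear].", 0.42),
])
print(f"{len(auto)} segment(s) auto-published, {len(flagged)} sent to a human editor")
```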

The Revolution Isn't Quite Here Yet

Someday, ASR technology may advance to the point where human interpreters are no longer required. Deaf and hearing-impaired people will be able to view and listen to media, make phone calls, and more with ease, without waiting for a communication assistant. Of course, by that time, one of the many treatments purported to reverse hearing loss may have become widespread.

It's difficult to say what the future holds for hearing assistance technology. But if current events are any indication, it's promising indeed.