Artificial intelligence has come a long way over the past few years, particularly in speech recognition and natural language processing (NLP). Not long ago, the idea of automated captioning would have seemed utterly absurd. In recent years, however, accurate video captioning has become essential.
In the wake of the pandemic, there was a pressing need to make public health information and announcements more widely available and accessible. Video captioning thus became essential for deaf and hard-of-hearing individuals, and automated speech recognition (ASR) technology received a corresponding boost in interest and development.
It's important to note, of course, that this technology is still no substitute for a human stenographer. Although AI is capable of rapid transcription at scale and in multiple languages, it still struggles with accuracy in some scenarios. Common faults in ASR technology include:
- Incorrect punctuation or grammar.
- Recognition errors, particularly when multiple or overlapping voices are involved.
- Difficulty parsing false starts and fillers such as "uh," "um," "mhm," and "y'know."
- Trouble understanding low-quality audio.
- Interference from background noise.
- Confusion caused by interrupted speech.
- Misidentified homonyms.
- Difficulty with non-standard pronunciation.
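The accuracy problems above are commonly quantified with word error rate (WER): the number of word substitutions, insertions, and deletions needed to turn the ASR output into the reference transcript, divided by the reference length. A minimal sketch in Python (the `wer` helper and the example captions are illustrative, not from any particular ASR system):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Word-level Levenshtein distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / len(ref)

# One homonym error ("two" transcribed as "too") in a five-word caption:
print(wer("please send two more copies", "please send too more copies"))  # 0.2
```

A single confused homonym in a five-word caption already yields a 20% error rate, which is why errors like those listed above add up quickly in practice.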
YouTube's automatic closed captioning is a perfect example of this in practice. Open any video and enable auto-generated captions: the majority of the speech will be transcribed accurately, but you're almost guaranteed to encounter an obvious error or two.
When the technology works, it works quite well. Unfortunately, it still frequently trips itself up, confusing similar-sounding words or misinterpreting certain accents and speech patterns. That may be tolerable when you're simply watching a commentary video online, but in situations where accuracy is crucial, such as phone conversations, the technology simply isn't up to par just yet.