Whisper — The art of listening.

Shaunak Inamdar
5 min read · Dec 18, 2022

“The human voice is the most beautiful instrument of all, but the most difficult to play.” — Richard Strauss.

Speech is a part of thought

Speech is the most primal form of human expression. It represents our mental concepts and conveys who we are at every level of being, conscious and subconscious. We also perceive speech as much more than words or tone; it is a complex blend of features that our brain pieces together. As humans, we understand each other: to a surprising extent we can communicate even with broken sentences, mispronounced words, and partial clues. Until a few years ago, however, it was difficult to communicate with machines in the same way.

While understanding what someone is saying may seem trivial, it took a long stretch of evolution to hardwire this survival instinct into our brains. Now we can talk to our phones and have in-depth conversations with machines, thanks to the magic of Automatic Speech Recognition (ASR) software.

Speech Recognition is a way of converting sound signals to readable, understandable data in the form of transcripts. We can then use these transcripts for mining data, executing instructions or simply storing information.

A schematic representation of the speech production process.

Speech Recognition has long been one of the challenging problems in computer science: crucial but complicated. When we speak, our lungs and vocal cords set the air inside our body vibrating. That vibration starts a chain of oscillating air which escapes through our mouth and nose and bumps against the microphone of our computers. These vibrations have a specific pattern, and when dealing with a limited vocabulary of words, detecting that pattern is fairly easy. Understanding complete sentences, however, is much harder, because it requires forming connections between the words.

In practice, human speech is very, very sloppy. We all mispronounce things all the time, mumble, leave out parts of words, and join multiple words together (“y’all,” “bouta,” “gonna”). Everyone speaks with a different accent and at a unique pitch and speed. Our minds correct for all of this and then listen for words.

A computer, on the other hand, is easily confused by all of that. Moreover, words are much more than phonemes strung together. A “phoneme” is the smallest unit of sound in a language. If we listen to the phonemes of “T,” “E,” and “A” individually, it is very difficult to interpret them as “TEA.” You can’t, therefore, just listen to individual snippets of sound and make out a word.

Speech Recognition engines overcome all of these challenges to pull words out of human speech. A lot of processing goes into removing noise, splitting the audio into chunks, tokenizing the sounds, and then figuring out what you most likely said into the microphone.
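To make this concrete, here is a rough, hypothetical sketch of the kind of front-end processing an ASR system might perform, using the librosa library: load a recording, cut it into fixed-length chunks, and turn each chunk into a log-Mel spectrogram a model can consume. The file name audio.wav is a placeholder, and this is an illustration rather than any particular engine's pipeline.

```python
import numpy as np
import librosa

SAMPLE_RATE = 16_000   # 16 kHz is a common rate for ASR front-ends
CHUNK_SECONDS = 30     # fixed-length chunks for processing

# Load and resample the recording to a single 16 kHz channel.
audio, sr = librosa.load("audio.wav", sr=SAMPLE_RATE, mono=True)

# Split the waveform into 30-second chunks (the last one may be shorter).
chunk_len = CHUNK_SECONDS * SAMPLE_RATE
chunks = [audio[i:i + chunk_len] for i in range(0, len(audio), chunk_len)]

# Convert each chunk into a log-Mel spectrogram: a compact time-frequency
# representation capturing the "pattern" of the vibrations.
for i, chunk in enumerate(chunks):
    mel = librosa.feature.melspectrogram(y=chunk, sr=sr, n_mels=80)
    log_mel = librosa.power_to_db(mel, ref=np.max)
    print(f"chunk {i}: log-Mel shape {log_mel.shape}")  # (80 mel bands, time frames)
```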

Speech Recognition is a complex system of multiple steps.

OpenAI has released an open-source, state-of-the-art speech recognition model called Whisper. Whisper is an end-to-end, weakly supervised, Transformer-based model. While the machine learning community has lately been obsessed with image generation models, Whisper stands out: this speech-to-text model not only transcribes speech in 96 languages besides English but also translates from those languages into English. It generalizes better than most alternatives and comes close to human performance.
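Trying it out yourself takes only a few lines with the open-source whisper Python package (installable via pip install -U openai-whisper, with ffmpeg available on the system). A minimal sketch following the package's documented usage; audio.mp3 is a placeholder for your own recording:

```python
import whisper

# Download and load one of the pretrained checkpoints ("tiny" ... "large").
model = whisper.load_model("base")

# Transcribe a recording in its original language.
result = model.transcribe("audio.mp3")
print(result["text"])

# The same model can translate non-English speech into English
# simply by switching the task.
translated = model.transcribe("audio.mp3", task="translate")
print(translated["text"])
```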

The model is based on the Transformer architecture: encoder and decoder blocks stacked on top of each other, with information flowing between the two via the attention mechanism. The audio is split into 30-second chunks, and each chunk is converted into a log-Mel spectrogram. The encoder turns that spectrogram into a sequence of representations, with positional encodings telling the model where in the chunk each frame falls. The decoder then attends to those representations and predicts tokens one at a time: the words spoken in the audio, along with special timestamp tokens marking when they were said.
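The package's lower-level API mirrors these steps fairly directly, again following its documented usage; audio.mp3 is a placeholder:

```python
import whisper

model = whisper.load_model("base")

# Load the waveform and pad/trim it to exactly 30 seconds,
# the fixed chunk length the model expects.
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)

# Compute the log-Mel spectrogram that the encoder consumes.
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Detect the spoken language from the encoded audio.
_, probs = model.detect_language(mel)
print("detected language:", max(probs, key=probs.get))

# Decode: the decoder predicts tokens (the words) one at a time.
options = whisper.DecodingOptions(fp16=False)
result = whisper.decode(model, mel, options)
print(result.text)
```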

This process repeats token by token: each predicted word is fed back into the decoder, which then estimates the likelihood of the next one, the same autoregressive trick used throughout natural language processing.
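Conceptually, the decoding loop looks like the toy sketch below. The tiny vocabulary and probability table are made up purely for illustration; they stand in for Whisper's decoder and its attention over the encoded audio.

```python
import numpy as np

# A toy next-token model: given the tokens so far, return a probability
# distribution over the vocabulary. (Purely illustrative numbers.)
VOCAB = ["<start>", "the", "cat", "sat", "<end>"]
PROBS = {
    "<start>": [0.0, 0.9, 0.05, 0.05, 0.0],
    "the":     [0.0, 0.0, 0.8, 0.15, 0.05],
    "cat":     [0.0, 0.05, 0.0, 0.85, 0.1],
    "sat":     [0.0, 0.1, 0.1, 0.0, 0.8],
    "<end>":   [0.0, 0.0, 0.0, 0.0, 1.0],
}

def next_token_probs(tokens):
    # Only the last token matters in this toy model.
    return np.array(PROBS[tokens[-1]])

# Greedy autoregressive decoding: repeatedly pick the most likely next
# token and feed it back in, until the end token is produced.
tokens = ["<start>"]
while tokens[-1] != "<end>" and len(tokens) < 10:
    tokens.append(VOCAB[int(np.argmax(next_token_probs(tokens)))])

print(" ".join(tokens[1:-1]))  # -> "the cat sat"
```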

How tokenization of speech works.

The sheer scale of this model explains its impressive performance. It has been trained on 680,000 hours of multilingual data collected from the internet, and it comes in multiple sizes, from 39 million to about 1.5 billion parameters, with 32 Transformer layers in the largest version. This scale is what lets it generalize. There are also special tokens that tell the model whether to translate or to transcribe the input. A single model performing multiple tasks is unusual; typically, models are highly specialized.
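Those task and language choices are exposed through the decoding options, and under the hood they become special tokens at the start of the decoder's prompt (roughly <|startoftranscript|><|fr|> followed by <|transcribe|> or <|translate|>). A sketch, where french_interview.mp3 is a placeholder for a French-language recording:

```python
import whisper

model = whisper.load_model("base")

# Placeholder file: a recording of French speech.
audio = whisper.pad_or_trim(whisper.load_audio("french_interview.mp3"))
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Same checkpoint, two tasks: keep the French, or translate it to English.
french = whisper.decode(
    model, mel, whisper.DecodingOptions(language="fr", task="transcribe", fp16=False)
)
english = whisper.decode(
    model, mel, whisper.DecodingOptions(language="fr", task="translate", fp16=False)
)

print(french.text)   # French transcript
print(english.text)  # English translation
```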

Whisper illustrates an emerging rule of thumb in deep learning: if you want to train a model to do a task well, train it on a task that is harder and broader than the one you actually need.

I believe the robustness of Whisper is almost human-level because it is trained on data from TED Talks, podcasts, interviews, and similar media that reflect how people actually speak. Much of this data was previously transcribed by smaller ML models and is therefore not gold-standard training data. That, however, works in its favor: we have already seen how imperfect real-world human speech is, and the noisy, diverse data contributes to the robustness of the model across many tasks. What is especially impressive is that OpenAI has open-sourced the entire codebase, the model is quick to run for inference, and many people have already built demo spaces (for example on Hugging Face) where you can try it out. For Whisper to come within a few percent of human transcription accuracy is a formidable feat.

Large models like GPT-3 powered heavily impactful applications such as GitHub Copilot, and large-scale training more broadly gave us DALL-E and Stable Diffusion, so it is interesting to ponder what might come next. With Whisper being so capable yet so lightweight, it is possible that OpenAI is using it to create the data needed to train GPT-4. Whisper could quickly turn enormous amounts of audio from across the internet, and from devices such as smartphones, Teslas, smart speakers, and virtual assistants, into text. That data could be used for surveillance, targeted ads, or training a larger GPT-n model. The longer GPT-4 remains unreleased, the higher curiosity about its development rises.
A model capable of understanding speech would also be capable of understanding thought, for in most cases the mouth speaks what the heart is full of.

Thank you very much for reading :) Please subscribe for more such articles and follow me on Instagram, Medium, Twitter and LinkedIn.


Shaunak Inamdar

Shaunak Inamdar is a CS undergrad with a passion for writing about technologies and making them accessible to a broader audience. www.shaunak.tech