End to End Automatic Speech Recognition: Introduction

In this article, we looked at the basic elements of an end-to-end Automatic Speech Recognition pipeline, the major challenges encountered with these pipelines, and some of the potential solutions.

2 years ago   •   11 min read

By Anuj Sable



Automatic Speech Recognition, or ASR as it is more commonly known in the deep learning community, is the ability to consume a speech audio signal and output an accurate textual representation of that speech. Like many other fields of research, ASR saw its development stagnate until deep learning approaches enabled new techniques that improved the performance and efficiency of ASR models.

In the first two parts of this series, which can be found here and here, we explored the tools that enable us to look more deeply into acoustic modeling for tougher and more challenging downstream tasks. Before you dive into this article, I recommend reading those two articles to get a grasp of the basics of audio signal processing.

In this article we will try to dissect the commonly seen ASR pipelines that allow for end-to-end recognition of speech, starting as a sound recording and pushing out text that transcribes the sound.

Online vs Offline ASR

Before we go any further, let us first understand the difference between online and offline ASR. Online ASR happens while the speaker is speaking: the transcription is produced simultaneously, with a minor lag. For online ASR tasks, low latency and a modest computational footprint are paramount alongside the accuracy of the transcriptions, and these three objectives are in conflict with each other.

Offline ASR on the other hand happens when the entire speech recording is available to us beforehand and our models can focus on transcribing the speech into its text as accurately as possible. The results can then be served as required by your application. Offline ASR can allow us to afford bigger models that can take a bit more resources and time but do not compromise on accuracy.

Both kinds of models are extremely relevant as they serve different applications downstream which include but are not limited to conversational agents, expert agents, RPA bots, speech analysis and summarization, etc.

While the two modes of ASR are different, the underlying technology intersects for the most part; hence we will mention this bifurcation only when necessary.

The ASR Pipeline

Automating speech recognition, as you can intuitively imagine, requires us to break the problem into simpler sub-problems such as modeling speech and modeling language.

Broadly, the pipelines consist of two major components:

  1. Acoustic encoders
  2. Textual decoders

The acoustic encoders output the encoding for a speech input so that it can be consumed by the decoders to output meaningful textual representations of the encoding.

Trying to understand this more deeply, we are faced with the following problems:

  1. Acoustic Modeling - Turning a time-domain audio signal into frequency-domain information and utilizing that to create models that are able to understand differences between different kinds of speech - different languages, different speakers, different utterances, different kinds of noise and therefore the occurrences of silence, etc.
  2. Language Modeling - Taking the features provided by acoustic models and transcribing them into meaningful and accurate text, as is experienced by humans when they understand what is being said. Besides understanding the audio input, a text model also needs to understand the grammar and syntax of the language in the utterances made, the semantic relationships between different words, etc.
  3. Decoding Algorithms - Different strategies can be used to turn the language embeddings into the text output. These depend on the kind of language models we have used, whether they are character-based or lexicon-based, whether there is a deep-learning-based language model associated with the decoder pipeline, etc.
  4. Speech to Text Alignment - Speech, when broken into time segments, doesn't directly align with the textual output. The input and output sequences have different lengths, so being able to align multiple speech frames to one character/word becomes important for our applications.
source: https://www.mdpi.com/2073-8994/11/8/1018

One by one, we will look into all of these parts of the pipeline.

Acoustic Modeling

What we get as sound from our recording equipment is usually a waveform that varies in amplitude over time. This information seems random at first and is very difficult to make sense of. Autoregressive models can sometimes help reveal patterns in such waveforms, but they aren't enough for speech recognition. To see deeper into the waveform, we transform it into the frequency domain using Fourier Transforms. The Short-Time Fourier Transform (STFT) splits the signal into short frames and gives us, for each frame, the magnitude of the sound in each frequency bin.
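As a rough illustration of the Short-Time Fourier Transform, here is a minimal sketch using only the Python standard library. The function name `stft_magnitudes`, the frame length, the hop size, and the Hann window are illustrative assumptions, and the naive DFT loop stands in for the FFT routine a real pipeline would use.

```python
import cmath
import math


def stft_magnitudes(signal, frame_len=32, hop=16):
    """Return one magnitude spectrum (bins 0..frame_len // 2) per frame."""
    # Periodic Hann window to reduce spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    spectra = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        bins = []
        for k in range(frame_len // 2 + 1):
            # Naive O(N^2) DFT of one frame; real code would call an FFT.
            coeff = sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n, x in enumerate(frame))
            bins.append(abs(coeff))
        spectra.append(bins)
    return spectra


# A pure tone that completes 4 cycles per 32-sample frame.
tone = [math.sin(2 * math.pi * 4 * n / 32) for n in range(64)]
spectra = stft_magnitudes(tone)
# Index of the strongest frequency bin in each frame.
peak_bins = [frame.index(max(frame)) for frame in spectra]
```

Each inner list is one column of a spectrogram. For the toy tone, the energy in every frame concentrates in the bin that matches the tone's frequency.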


Human beings do not perceive pitch on a linear scale; they perceive it on a roughly logarithmic scale. The linear frequency scale is therefore converted to the mel scale using the following formula, where $f$ is the frequency in Hz and $m$ is the corresponding value in mels:

$$ m = 2595 \cdot \log_{10}\left(1 + \frac{f}{700}\right) $$
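The formula translates directly into code. The sketch below is only an illustration; the function names `hz_to_mel` and `mel_to_hz` are chosen here and do not come from any particular library. The inverse is obtained by solving the same formula for $f$.

```python
import math


def hz_to_mel(f_hz):
    """Convert a frequency in Hz to mels: m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel, obtained by solving the formula for f."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


# Equal steps in Hz shrink on the mel scale as frequency grows.
low_step = hz_to_mel(2000.0) - hz_to_mel(1000.0)
high_step = hz_to_mel(8000.0) - hz_to_mel(7000.0)
```

The same 1000 Hz gap covers far fewer mels at high frequencies than at low ones, which mirrors how we hear pitch differences.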

Mel-Frequency Cepstral Coefficients (MFCCs) are calculated by applying a pre-emphasis filter to an audio signal, taking the STFT of that signal, applying mel-scale-based filter banks, taking the logarithm of the filter bank energies, taking a DCT (discrete cosine transform), and normalizing the output. The pre-emphasis filter is a weighted first-order time difference of the signal that boosts its high-frequency content and makes it more stationary. The filter banks are a set of overlapping triangular filters spaced evenly on the mel scale.
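Two of these steps are simple enough to sketch in a few lines. The code below is a minimal illustration, not a full MFCC implementation: the coefficient `alpha=0.97` is a commonly used value assumed here, and the bin indices passed to `triangular_filter` are hypothetical (in practice they would be derived from evenly spaced points on the mel scale).

```python
def pre_emphasis(signal, alpha=0.97):
    """Weighted first-order difference: y[t] = x[t] - alpha * x[t - 1]."""
    if not signal:
        return []
    return [signal[0]] + [signal[t] - alpha * signal[t - 1]
                          for t in range(1, len(signal))]


def triangular_filter(left, center, right, num_bins):
    """Weights of one triangular filter over STFT bins 0..num_bins - 1."""
    weights = [0.0] * num_bins
    for k in range(left, min(right + 1, num_bins)):
        if k <= center:
            # Rising edge from 0 at `left` to 1 at `center`.
            weights[k] = (k - left) / (center - left)
        else:
            # Falling edge from 1 at `center` to 0 at `right`.
            weights[k] = (right - k) / (right - center)
    return weights


# A constant signal is almost cancelled by the difference filter.
flat = pre_emphasis([1.0, 1.0, 1.0])
# One filter that peaks at bin 4 and spans bins 2..6.
filt = triangular_filter(2, 4, 6, 8)
```

Multiplying a frame's magnitude spectrum by each filter's weights and summing gives one filter bank energy per filter.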

Phonemes and Graphemes

A phoneme is the smallest unit of sound that can distinguish one word from another. It is different from a letter of the alphabet in that it describes sound rather than spelling. Changes in acoustic content lead to changes in phonemes, which can be analyzed to turn speech into recognizable language. Not all acoustic changes cause phonemic changes though; for example, singing a song doesn't change its lyrical content.

One major way of segregating phonemes into buckets is by understanding voicing. A phoneme is voiced if the vocal cords vibrate while the sound is produced, and unvoiced otherwise. For example, the sound "V" in the word "vat" is made using the vocal cords, whereas the sound "F" in the word "fat" is made without any vocal cord vibration.

Vowels are always voiced and differ in which articulators (lips, tongue, etc.) are used and how. Consonants have various general classes which have voiced and unvoiced members. You can learn more about phonemes at the link above.

The Grapheme has been described as the "smallest contrastive linguistic unit which may bring about a change of meaning." There are about 40 distinctive phonemes in English, but 70 letters or letter combinations symbolize phonemes.

For example, the word 'ghost' contains five letters and four graphemes ('gh,' 'o,' 's,' and 't'), representing four phonemes.


Hidden Markov Models are sequential modelling tools that make the strong assumption that the state of a system is dependent only on its previous state and not on the states that come before that.  

$$ P(q_{i} = a \mid q_{1} \ldots q_{i-1}) = P(q_{i} = a \mid q_{i-1}) $$

Despite this strong and often wrong assumption, HMM-based models saw a lot of success in the speech recognition area where the sound content was speaker-specific.
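Under the Markov assumption, the probability of a whole state sequence factorizes into an initial probability and a product of one-step transition probabilities. The sketch below illustrates this with a hypothetical two-state chain; the state names and all the numbers are made up for the example.

```python
def sequence_probability(states, initial, transition):
    """P(q_1..q_n) = P(q_1) * product of P(q_i | q_{i-1})."""
    prob = initial[states[0]]
    for prev, cur in zip(states, states[1:]):
        # Each step depends only on the state immediately before it.
        prob *= transition[prev][cur]
    return prob


# Hypothetical toy chain with two states.
initial = {"sil": 0.8, "speech": 0.2}
transition = {
    "sil": {"sil": 0.6, "speech": 0.4},
    "speech": {"sil": 0.1, "speech": 0.9},
}
p = sequence_probability(["sil", "speech", "speech"], initial, transition)
```

Here `p` is the product 0.8 × 0.4 × 0.9, with no term that looks further back than one step.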

source: https://jonathan-hui.medium.com/speech-recognition-gmm-hmm-8bb5eff8b196

The image above represents phones in a textual format that is similar to how we spell in the English language. In this context, the words phoneme and grapheme are used interchangeably.

A phone is predicted from the MFCC features, conditioned on the previous state. The emission probabilities of the HMM, that is, the probabilities of observing a given set of features while in a given phone state, are modeled as Gaussian Mixture Models.

The Gaussian Mixture Model is a probabilistic model that assumes all the data points are generated from a mixture of Gaussian distributions with unknown parameters. In our case, the features of each phone are represented as a mixture of Gaussians, while the transitions between phones follow the Markov assumption. A Gaussian Mixture Model can have more than two components (modes) as well. Expectation-maximization algorithms are used to learn the Gaussian Mixture Model parameters.
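To make the idea of a mixture concrete, here is a minimal one-dimensional sketch of a Gaussian Mixture Model density. It assumes scalar features for simplicity (real acoustic features are multi-dimensional vectors), the parameters are invented for the example, and it evaluates the density only; fitting the parameters with expectation maximization is not shown.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def gmm_pdf(x, weights, means, variances):
    """Weighted sum of Gaussian densities; the weights should sum to 1."""
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))


# Hypothetical emission model for one state: two equally weighted modes.
weights, means, variances = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
density_left = gmm_pdf(-1.0, weights, means, variances)
density_right = gmm_pdf(1.0, weights, means, variances)
```

In a GMM-HMM system, a density like this plays the role of the emission probability of a feature value for one state.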

Inference in this model can be solved in polynomial time using dynamic programming, and the approach is explained in detail here.
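One well-known dynamic programming algorithm for HMMs is Viterbi decoding, which finds the most likely state sequence for a series of observations. The sketch below assumes that this is the kind of computation meant above, and it uses discrete observations and made-up probabilities instead of GMM emissions to stay short.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely state path and its joint probability."""
    # best[t][s] = (probability of the best path ending in s at time t,
    #               state that precedes s on that path)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None)
             for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            # Best predecessor of s, reusing the previous column only.
            prob, prev = max((best[-1][p][0] * trans_p[p][s], p)
                             for p in states)
            column[s] = (prob * emit_p[s][obs], prev)
        best.append(column)
    # Backtrack from the best final state.
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path, best[-1][last][0]


# Hypothetical toy model: two hidden states, three observation symbols.
states = ["A", "B"]
start_p = {"A": 0.6, "B": 0.4}
trans_p = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
emit_p = {"A": {"x": 0.5, "y": 0.4, "z": 0.1},
          "B": {"x": 0.1, "y": 0.3, "z": 0.6}}
path, prob = viterbi(["x", "y", "z"], states, start_p, trans_p, emit_p)
```

The cost grows with the number of time steps times the square of the number of states, instead of exponentially with the sequence length.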

Note that lexical modeling is still not accounted for here and speech is transformed into its phonetic transcripts instead of textual representations.

Deep Acoustic Models

The temporal information, instead of being modeled by HMMs, can be modeled using deep neural networks like RNNs and their variants. CNNs, which are well suited to spectral information, can also be used instead of RNNs to model the temporal states using MFCCs as inputs.

The current state-of-the-art models apply much more novel approaches than vanilla CNNs or RNNs or LSTMs. These include attention-based models, transformer models, variational autoencoders, etc. We will learn about these later while reviewing some of the influential papers in the domain. Deep Acoustic models have been able to help make acoustic modeling become more generalized across speakers, languages, accents, domains, etc.

Language Modeling

Language is extremely complex and, even devoid of the speech component, difficult to write rules for. The same words are used differently in different contexts across different cultures. Some words spelt the same way have different meanings. Some words sound the same but have different spellings. The grammar changes with each language, and syntax rules generalize poorly from one language to another.

Language modeling takes up the difficult task of predicting what the next word should be, given its context. This can be done using statistical methods or deep neural networks, a progression reflected in how language modeling techniques have evolved over time. Concretely, language models assign probabilities to words or letters given the previous words or letters, respectively.

PoS and Dependency Parsing

Parts-of-Speech tagging, or PoS tagging, is exactly what it sounds like: the ability to label each token or word with a particular part of speech. Some universal parts of speech include nouns, adjectives, pronouns, adverbs, etc. You might recognize these from your high-school language lectures.

Dependency parsing gives us the ability to build syntactic relations between different parts of speech by creating a tree structure. There are many parsers that take a PoS-tagged corpus and convert it into a dependency tree, a functionality offered by common NLP tools like NLTK, spaCy, etc.

You can learn more about context-dependent grammar and dependency parsing here.


Lexicons refer to the component of an NLP system that contains information (semantic, grammatical) about individual words or word strings. For most applications, a Lexicon will refer to the list of words and punctuation and the characters or strings of characters associated with each word or punctuation.

Lexicons describe words which fall in a natural taxonomy based on parts of speech and the relations that lie therein as described by dependency parsing.

The concepts above, though not language-independent, allow us to computationally understand the content of a particular sentence and much of language modeling work earlier was dominated by language understanding using these concepts.  

Statistical Language Models

Statistical language modeling is mostly about estimating $ Pr(S) $, where $ S $ is a corpus of sentences whereas computational linguistics is mostly about estimating $ Pr(H|S) $ where $ H $ is the hidden state of language (like syntax and dependency trees as discussed above).

As this article rightly points out, there are two important problems when trying to model language - sparsity and context.

Sparsity refers to the broad vocabulary of every language and our ability to encode it effectively. The most intuitive approach is to create N-dimensional one-hot vectors, where N is the size of the vocabulary. But this creates increasingly sparse matrices as the size of the vocabulary increases. We encounter what is famously known as the Curse of Dimensionality when we take such an approach.
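A tiny sketch makes the sparsity visible. The five-word vocabulary and the helper name `one_hot_encode` are made up for the example; with a realistic vocabulary of tens of thousands of words, almost every entry would be zero.

```python
def one_hot_encode(tokens, vocabulary):
    """Map each token to a vector with a single 1 at its vocabulary index."""
    index = {word: i for i, word in enumerate(vocabulary)}
    vectors = []
    for token in tokens:
        vector = [0] * len(vocabulary)
        vector[index[token]] = 1
        vectors.append(vector)
    return vectors


# Hypothetical toy vocabulary.
vocabulary = ["cat", "dog", "ran", "sat", "the"]
encoded = one_hot_encode("the cat sat".split(), vocabulary)
# Fraction of entries that are zero.
sparsity = 1 - sum(map(sum, encoded)) / (len(encoded) * len(vocabulary))
```

Each row holds exactly one non-zero entry, so the fraction of zeros approaches 1 as the vocabulary grows.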

For the purpose of language modeling, the context of a word is the set of words that surround it. The words before and after a word provide a lot of information about how that word is used and hence influence the probability of its occurrence. The goal of statistical language modeling is to predict a word given its context.
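One of the simplest statistical language models is a bigram model, which uses only the previous word as context. The sketch below estimates bigram probabilities by counting, under the assumption of a tiny made-up corpus and no smoothing; the `<s>` and `</s>` boundary tokens are a common convention adopted here for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            counts[prev][cur] += 1
    return counts


def bigram_probability(counts, prev, cur):
    """Maximum-likelihood estimate of P(cur | prev), without smoothing."""
    total = sum(counts[prev].values())
    return counts[prev][cur] / total if total else 0.0


# Hypothetical toy corpus.
corpus = ["the cat sat", "the cat ran", "the dog sat"]
model = train_bigram_model(corpus)
p_cat_given_the = bigram_probability(model, "the", "cat")
```

Any word pair that never occurs in the corpus gets probability zero, which is the sparsity problem showing up again and the reason practical n-gram models add smoothing.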

Deep Language Models

Deep Language Models solve for sparsity and context understanding by representing each word as an N-dimensional embedding.

Embeddings are learnable representations in an N-dimensional space that are generated by various deep learning architectures. Considering the temporal nature of language, RNNs were first used for this. RNNs fell victim to many problems, including exploding and vanishing gradients. A modified version, the Long Short-Term Memory (LSTM) network, came next, but still lacked the ability to capture long-range dependencies in language across a paragraph or multiple paragraphs.

Lately, attention-based models and transformers have seen an increasing amount of success in the NLP and modeling domains.  

Decoding Algorithms

Acoustic models can give us emission probabilities along time for a vocabulary we define. The models can be based on characters or strings of characters. There are various decoding strategies to utilize these probabilities to make predictions of what a particular speech pattern represents in terms of letters/graphemes/words.

Greedy Decoding

Greedy decoding is simply taking the token that has the highest conditional probability.

$$ y_{t} = \arg\max_{y \in V} P(y \mid y_{1} \ldots y_{t-1}) $$

In character-based models, this can often be a naive approach, since the locally best character does not guarantee the best overall sequence. The greedy decoder assumes that the token with the highest probability at each step is the correct one, but the alternatives available for the tokens preceding time $t$ are often just as relevant to making the right prediction.
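The sketch below shows greedy decoding on a hypothetical toy model. The function `next_token_probs` and its probability table are invented for the example and stand in for whatever model supplies $P(y \mid y_{1} \ldots y_{t-1})$.

```python
def next_token_probs(prefix):
    """Hypothetical toy model: P(next token | tokens decoded so far)."""
    table = {
        (): {"A": 0.6, "B": 0.4},
        ("A",): {"x": 0.55, "y": 0.45},
        ("B",): {"x": 0.9, "y": 0.1},
    }
    return table[tuple(prefix)]


def greedy_decode(num_steps):
    """Pick the single most probable token at every step."""
    sequence, prob = [], 1.0
    for _ in range(num_steps):
        probs = next_token_probs(sequence)
        token = max(probs, key=probs.get)
        sequence.append(token)
        prob *= probs[token]
    return sequence, prob


greedy_sequence, greedy_prob = greedy_decode(2)
```

Greedy decoding commits to "A" at the first step and ends with a sequence probability of 0.33, even though the sequence "B", "x" has a higher probability of 0.36 under the same toy model.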

Beam Decoding

Beam decoding extends the Greedy Decoding algorithm to account for multiple possible sequences that can come together to generate high probability outcomes instead of taking only the highest probability tokens at each time frame.

Beam decoders are best explained visually as is done here.

Beam search makes two important improvements to the greedy decoder algorithm.

  1. Instead of taking just the top token, beam search takes top $N$ tokens into consideration.
  2. Instead of considering each time step in isolation, beam search considers the joint probabilities of the preceding words and picks out the $N$ best sequences.

A simple tutorial on how to implement a beam search algorithm can be found here.
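As a rough sketch of the two improvements listed above, the code below runs beam search on a hypothetical toy model. The function `next_token_probs` and its probabilities are invented for the example, and scores are accumulated as log probabilities, which is a common choice to avoid numerical underflow on long sequences.

```python
import math


def next_token_probs(prefix):
    """Hypothetical toy model: P(next token | tokens decoded so far)."""
    table = {
        (): {"A": 0.6, "B": 0.4},
        ("A",): {"x": 0.55, "y": 0.45},
        ("B",): {"x": 0.9, "y": 0.1},
    }
    return table[tuple(prefix)]


def beam_search(num_steps, beam_width):
    """Keep the beam_width best partial sequences at every step."""
    beams = [([], 0.0)]  # (sequence, log probability of the sequence)
    for _ in range(num_steps):
        candidates = []
        for sequence, log_prob in beams:
            for token, p in next_token_probs(sequence).items():
                # Joint score of the extended sequence.
                candidates.append((sequence + [token], log_prob + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams


wide = beam_search(num_steps=2, beam_width=2)
narrow = beam_search(num_steps=2, beam_width=1)
```

With a beam width of 2, the search keeps "B" alive after the first step and finds the sequence "B", "x" with probability 0.36. With a beam width of 1 it reduces to greedy decoding and returns "A", "x" with probability 0.33.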

Speech to Text Alignment

The time frames in speech don't correspond to a one-to-one mapping when converted to text. This requires us to map multiple frames of speech to a single token of text. This misalignment makes most of the normal approaches to loss-based deep learning optimization unsuitable for speech recognition applications.  

CTC Loss

CTC Loss, or Connectionist Temporal Classification Loss, does not require a pre-computed alignment between the audio frames and the transcript. It introduces a new token, often referred to as the blank token, to mark the boundaries between tokens in an alignment. Consecutive time steps that carry the same character are merged into one, after which the blank tokens are removed. The blank token ensures that words with repeated letters don't collapse into wrong spellings: the two l's in "hello" survive only if a blank separates them.
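The collapsing rule can be sketched in a few lines. The helper name `ctc_collapse` and the use of "-" as the blank symbol are choices made for this illustration.

```python
from itertools import groupby


def ctc_collapse(alignment, blank="-"):
    """Merge repeated symbols, then drop the blank symbol."""
    merged = [symbol for symbol, _ in groupby(alignment)]
    return "".join(symbol for symbol in merged if symbol != blank)


with_blank = ctc_collapse("hh-e-ll-lo--")
without_blank = ctc_collapse("hhelllo")
```

The first alignment keeps both l's because a blank sits between them, while the second one merges them into a single character.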

source: https://distill.pub/2017/ctc/

This article provides a very good explanation of how CTC works, how the algorithm uses a clever dynamic programming approach to make it computationally less expensive and how it solves the problem of misalignment in a differentiable manner so that our neural networks can learn from the loss function using gradient descent.
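A minimal sketch of that dynamic programming idea is given below. It assumes that `probs[t][k]` holds the per-frame probability of symbol `k` at frame `t` and that index 0 is the blank; the function names are illustrative. It works on plain probabilities for readability, whereas practical implementations work in the log domain and also compute gradients.

```python
import math


def ctc_probability(probs, target, blank=0):
    """Sum the probabilities of all alignments that collapse to target."""
    # Interleave blanks: [a, b] becomes [blank, a, blank, b, blank].
    ext = [blank]
    for label in target:
        ext += [label, blank]
    num_frames, size = len(probs), len(ext)
    # alpha[s] = total probability of prefixes ending at ext[s] so far.
    alpha = [0.0] * size
    alpha[0] = probs[0][ext[0]]
    if size > 1:
        alpha[1] = probs[0][ext[1]]
    for t in range(1, num_frames):
        new = [0.0] * size
        for s in range(size):
            total = alpha[s]  # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]  # advance by one position
            # Skip a blank only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new
    # Valid alignments end on the last label or on the final blank.
    return alpha[-1] + (alpha[-2] if size > 1 else 0.0)


def ctc_loss(probs, target, blank=0):
    """Negative log-likelihood of the target under CTC."""
    return -math.log(ctc_probability(probs, target, blank))


# Two frames, two symbols (0 = blank, 1 = "a"), uniform probabilities.
uniform = [[0.5, 0.5], [0.5, 0.5]]
p_single = ctc_probability(uniform, [1])
```

For this toy input, the target "a" has three valid alignments ("a-", "-a", and "aa"), each with probability 0.25, so the recursion returns 0.75 without enumerating them one by one.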

RNN-T Loss

The RNN-T Loss, or RNN-Transducer loss, solves the alignment problem by using a predictor network and a joiner network along with the encoder network we have from our acoustic models. The predictor network is an autoregressive network that can be implemented using a recurrent network such as a GRU. The joiner is a simple feedforward network that combines the encoder vector $f_{t}$ and predictor vector $g_{u}$ and outputs a softmax $h(t,u)$ over all the labels, as well as a “null” output $\phi$.

source: https://lorenlugosch.github.io/posts/2020/11/transducer/

The RNN-T loss sums the probabilities of all possible alignments between the audio and the transcript and uses the negative logarithm of that sum as the loss function; the computation is carried out in the log domain for numerical stability.
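The sum over alignments can be computed with a forward recursion over a grid of time steps and emitted labels. The sketch below is a simplified illustration that assumes the joiner outputs are already available as `joint[t][u][k]`, the probability of symbol `k` at frame `t` after `u` labels have been emitted, with index 0 as the null output. It works on plain probabilities for readability instead of the log domain, and the function names are made up for the example.

```python
import math


def rnnt_probability(joint, target, null=0):
    """Sum the probabilities of all alignments that produce target."""
    num_frames, num_labels = len(joint), len(target)
    # alpha[t][u] = probability of emitting the first u labels
    # while having consumed t frames.
    alpha = [[0.0] * (num_labels + 1) for _ in range(num_frames)]
    alpha[0][0] = 1.0
    for t in range(num_frames):
        for u in range(num_labels + 1):
            if t > 0:
                # Null output at (t - 1, u): move to the next frame.
                alpha[t][u] += alpha[t - 1][u] * joint[t - 1][u][null]
            if u > 0:
                # Emit label u at frame t: stay on the same frame.
                alpha[t][u] += alpha[t][u - 1] * joint[t][u - 1][target[u - 1]]
    # A final null output closes every alignment.
    return alpha[num_frames - 1][num_labels] * joint[num_frames - 1][num_labels][null]


def rnnt_loss(joint, target, null=0):
    """Negative log of the summed alignment probability."""
    return -math.log(rnnt_probability(joint, target, null))


def uniform_joint(num_frames, num_labels, num_symbols):
    """Hypothetical joiner output with equal probability for every symbol."""
    return [[[1.0 / num_symbols] * num_symbols
             for _ in range(num_labels + 1)]
            for _ in range(num_frames)]


# Two frames, one target label, two symbols (0 = null, 1 = "a").
p = rnnt_probability(uniform_joint(2, 1, 2), [1])
```

For this toy input there are two alignments (emit the label during the first frame or during the second), each made of three outputs with probability 0.5, which gives a total of 0.25.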

Some notable things about the RNN Transducer network are

  1. The predictor network only has access to $y$ and hence can be trained on text-only data.
  2. This model can be used for streaming or online ASR.

A great explanation of the RNN Transducer loss can be found here. A code tutorial for the same is also available.


In this article, we looked at the basic elements of an end-to-end ASR pipeline, the major challenges encountered with these pipelines, and some of the potential solutions. We looked at statistical acoustic modeling, statistical language modeling, and decoding/search algorithms. We then closed with a section on speech-to-text alignment, where we looked at CTC Loss and RNN-T Loss as solutions to the misalignment between speech input and text output.

In this article, we barely scratched the surface of the deep learning algorithms used for acoustic modeling and language modeling. Some of the state-of-the-art algorithms are able to combine the two functions of acoustic modeling and language modeling into one network. In the next article, we will look at many such interesting approaches that are at the forefront of deep-learning-based ASR research.

Stay tuned!
