Introduction to Encoder-Decoder Sequence-to-Sequence Models (Seq2Seq)

This series gives an advanced guide to different recurrent neural networks (RNNs). You will gain an understanding of the networks themselves, their architectures, applications, and how to bring them to life using Keras.

5 months ago   •   10 min read

By Samhita Alla

In this tutorial we’ll cover encoder-decoder sequence-to-sequence (seq2seq) RNNs: how they work, the network architecture, their applications, and how to implement encoder-decoder sequence-to-sequence models using Keras (up until data preparation; for training and testing models, stay tuned for Part 2).

Specifically, we'll cover:

  • Introduction
  • Architecture
  • Applications
  • Text Summarization Using an Encoder-Decoder Sequence-to-Sequence Model
You can also follow along with the full code in this tutorial and run it on a free GPU from a Gradient Community Notebook.

Let’s get started!

Bring this project to life

Introduction

An RNN typically has fixed-size input and output vectors, i.e., the lengths of both the input and output vectors are predefined. Nonetheless, this isn't desirable in use cases such as speech recognition, machine translation, etc., where the input and output sequences do not need to be fixed and of the same length. Consider a case where the English phrase "How have you been?" is translated into French. In French, you’d say "Comment avez-vous été?". Here, neither the input nor output sequences are fixed in size. In this context, if you want to build a language translator using an RNN, you do not want to define the sequence lengths beforehand.

Sequence-to-sequence (seq2seq) models can help solve the above-mentioned problem. When given an input, the encoder-decoder seq2seq model first generates an encoded representation of the model, which is then passed to the decoder to generate the desired output. In this case, the input and output vectors need not be fixed in size.

Architecture

The idea behind the design of this model is to enable it to process input where we do not constrain the length. One RNN will be used as an encoder, and another as a decoder. The output vector generated by the encoder and the input vector given to the decoder will possess a fixed size. However, they need not be equal. The output generated by the encoder can either be given as a whole chunk or can be connected to the hidden units of the decoder unit at every time step.

Encoder-Decoder Sequence-to-Sequence Model

The RNNs in the encoder and decoder can be simple RNNs, LSTMs, or GRUs.

In a simple RNN, every hidden state is computed using the equation:

$$H_t (encoder) = \phi(W_{HH} * H_{t-1} + W_{HX} * X_{t})$$

Where $\phi$ is the activation function, $H_t (encoder)$ represents the hidden states in an encoder, $W_{HH}$ is the weight matrix connecting the hidden states, and $W_{HX}$ is the weight matrix connecting the input and the hidden states.

The hidden states in the decoder can be computed as follows:

$$H_t (decoder) = \phi(W_{HH} * H_{t-1})$$

Note: The initial hidden state of the decoder is the final hidden state procured from the encoder.

The output generated by the decoder is given as follows:

$$Y_t = H_t (decoder) * W_{HY}$$

Where $W_{HY}$ is the weights matrix connecting the hidden states with the decoder output.

Applications of Seq2Seq Models

Seq2seq models are useful for the following applications:

  • Machine translation
  • Speech recognition
  • Video captioning
  • Text summarization

Now that you’ve got an idea about what a Sequence-to-Sequence RNN is, in the next section you’ll build a text summarizer using the Keras API.

Text Summarization Using a Seq2Seq Model

Text Summarization refers to the technique of shortening long pieces of text while capturing its essence. This is useful in capturing the bottom line of a large piece of text, thus reducing the required reading time. In this context, rather than relying on manual summarization, we can leverage a deep learning model built using an Encoder-Decoder Sequence-to-Sequence Model to construct a text summarizer.

In this model, an encoder accepts the actual text and summary, trains the model to create an encoded representation, and sends it to a decoder which decodes the encoded representation into a reliable summary. As the training progresses, the trained model can be used to perform inference on new texts, generating reliable summaries from them.

Here we’re going to use the News Summary Dataset. It consists of two CSV files: one contains information about the author, headlines, source URL, short article, and complete article, and another which only contains headlines and text. In the current application, you will extract the headlines and text from the two CSV files to train the model.

Note that you can follow along with the full code in this tutorial and run it on a free GPU from a Gradient Community Notebook.

Step 1: Importing the Dataset

First import the news summary dataset to your workspace using the pandas's read_csv() method.

import pandas as pd

summary = pd.read_csv('/kaggle/input/news-summary/news_summary.csv',
                      encoding='iso-8859-1')
raw = pd.read_csv('/kaggle/input/news-summary/news_summary_more.csv',
                  encoding='iso-8859-1')

Combine data from the two CSV files into a single DataFrame.

pre1 = raw.iloc[:, 0:2].copy()
pre2 = summary.iloc[:, 0:6].copy()

# To increase the intake of possible text values to build a reliable model
pre2['text'] = pre2['author'].str.cat(pre2['date'
        ].str.cat(pre2['read_more'].str.cat(pre2['text'
        ].str.cat(pre2['ctext'], sep=' '), sep=' '), sep=' '), sep=' ')

pre = pd.DataFrame()
pre['text'] = pd.concat([pre1['text'], pre2['text']], ignore_index=True)
pre['summary'] = pd.concat([pre1['headlines'], pre2['headlines']],
                           ignore_index=True)
Note: To increase the intake of data points to train the model, a new 'text' column is constructed utilizing one CSV file.

Let's get a better understanding of the data by printing the first two rows to the console.

pre.head(2)
Text and Summary

Step 2: Cleaning the Data

The data you procured can have non-alphabetic characters which you can remove before training the model. To do this, you can use the re (Regular Expressions) library.

import re

# Remove non-alphabetic characters (Data Cleaning)
def text_strip(column):

    for row in column:
        row = re.sub("(\\t)", " ", str(row)).lower()
        row = re.sub("(\\r)", " ", str(row)).lower()
        row = re.sub("(\\n)", " ", str(row)).lower()

        # Remove _ if it occurs more than one time consecutively
        row = re.sub("(__+)", " ", str(row)).lower()

        # Remove - if it occurs more than one time consecutively
        row = re.sub("(--+)", " ", str(row)).lower()

        # Remove ~ if it occurs more than one time consecutively
        row = re.sub("(~~+)", " ", str(row)).lower()

        # Remove + if it occurs more than one time consecutively
        row = re.sub("(\+\++)", " ", str(row)).lower()

        # Remove . if it occurs more than one time consecutively
        row = re.sub("(\.\.+)", " ", str(row)).lower()

        # Remove the characters - <>()|&©ø"',;?~*!
        row = re.sub(r"[<>()|&©ø\[\]\'\",;?~*!]", " ", str(row)).lower()

        # Remove mailto:
        row = re.sub("(mailto:)", " ", str(row)).lower()

        # Remove \x9* in text
        row = re.sub(r"(\\x9\d)", " ", str(row)).lower()

        # Replace INC nums to INC_NUM
        row = re.sub("([iI][nN][cC]\d+)", "INC_NUM", str(row)).lower()

        # Replace CM# and CHG# to CM_NUM
        row = re.sub("([cC][mM]\d+)|([cC][hH][gG]\d+)", "CM_NUM", str(row)).lower()

        # Remove punctuations at the end of a word
        row = re.sub("(\.\s+)", " ", str(row)).lower()
        row = re.sub("(\-\s+)", " ", str(row)).lower()
        row = re.sub("(\:\s+)", " ", str(row)).lower()

        # Replace any url to only the domain name
        try:
            url = re.search(r"((https*:\/*)([^\/\s]+))(.[^\s]+)", str(row))
            repl_url = url.group(3)
            row = re.sub(r"((https*:\/*)([^\/\s]+))(.[^\s]+)", repl_url, str(row))
        except:
            pass

        # Remove multiple spaces
        row = re.sub("(\s+)", " ", str(row)).lower()

        # Remove the single character hanging between any two spaces
        row = re.sub("(\s+.\s+)", " ", str(row)).lower()

        yield row
Note: You can use other data cleaning methods to prepare your data as well.

Call the text_strip() function on both the text and summary.

processed_text = text_strip(pre['text'])
processed_summary = text_strip(pre['summary'])

Load the data as batches using the pipe() method provided by spaCy. This ensures that all pieces of text and summaries possess the string data type.

import spacy
from time import time

nlp = spacy.load('en', disable=['ner', 'parser']) 

# Process text as batches and yield Doc objects in order
text = [str(doc) for doc in nlp.pipe(processed_text, batch_size=5000)]

summary = ['_START_ '+ str(doc) + ' _END_' for doc in nlp.pipe(processed_summary, batch_size=5000)]

The _START_ and _END_ tokens denote the start and end of the summary, respectively. This will be used later to detect and remove empty summaries.

Now let's print some of the data to understand how it’s been loaded.

text[0]
# Output
'saurav kant an alumnus of upgrad and iiit-b s pg program in ...'

Print the summary as well.

summary[0]
# Output
'_START_ upgrad learner switches to career in ml  al with 90% salary hike _END_'

Step 3: Determining the Maximum Permissible Sequence Lengths

Next, store the text and summary lists in pandas objects.

pre['cleaned_text'] = pd.Series(text)
pre['cleaned_summary'] = pd.Series(summary)

Plot a graph to determine the frequency ranges tied to the lengths of text and summary, i.e., determine the range of length of words where the maximum number of texts and summaries fall into.

import matplotlib.pyplot as plt

text_count = []
summary_count = []

for sent in pre['cleaned_text']:
    text_count.append(len(sent.split()))
    
for sent in pre['cleaned_summary']:
    summary_count.append(len(sent.split()))

graph_df = pd.DataFrame() 

graph_df['text'] = text_count
graph_df['summary'] = summary_count

graph_df.hist(bins = 5)
plt.show()

From the graphs, you can determine the range where the maximum number of words fall into. For summary, you can assign the range to be 0-15.

To find the range of text which we aren't able to clearly decipher from the graph, consider a random range and find the percentage of words falling into that range.

# Check how much % of text have 0-100 words
cnt = 0
for i in pre['cleaned_text']:
    if len(i.split()) <= 100:
        cnt = cnt + 1
print(cnt / len(pre['cleaned_text']))
# Output
0.9578389933440218

As you can observe, 95% of the text pieces fall into the 0-100 category.

Now initialize the maximum permissible lengths of both text and summary.

# Model to summarize the text between 0-15 words for Summary and 0-100 words for Text
max_text_len = 100
max_summary_len = 15

Step 4: Selecting Plausible Texts and Summaries

Select texts and summaries which are below the maximum lengths as defined in Step 3.

# Select the Summaries and Text which fall below max length 

import numpy as np

cleaned_text = np.array(pre['cleaned_text'])
cleaned_summary= np.array(pre['cleaned_summary'])

short_text = []
short_summary = []

for i in range(len(cleaned_text)):
    if len(cleaned_summary[i].split()) <= max_summary_len and len(cleaned_text[i].split()) <= max_text_len:
        short_text.append(cleaned_text[i])
        short_summary.append(cleaned_summary[i])
        
post_pre = pd.DataFrame({'text': short_text,'summary': short_summary})

post_pre.head(2)

Now add start of the sequence (sostok) and end of the sequence (eostok) to denote start and end of the summaries, respectively. This shall be useful to trigger the start of summarization during the inferencing phase.

# Add sostok and eostok

post_pre['summary'] = post_pre['summary'].apply(lambda x: 'sostok ' + x \
        + ' eostok')

post_pre.head(2)

Step 5: Tokenizing the Text

First split the data into train and test data chunks.

from sklearn.model_selection import train_test_split

x_tr, x_val, y_tr, y_val = train_test_split(
    np.array(post_pre["text"]),
    np.array(post_pre["summary"]),
    test_size=0.1,
    random_state=0,
    shuffle=True,
)

Prepare and tokenize the text data.

# Tokenize the text to get the vocab count 
from tensorflow.keras.preprocessing.text import Tokenizer 
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Prepare a tokenizer on training data
x_tokenizer = Tokenizer() 
x_tokenizer.fit_on_texts(list(x_tr))

Find the percentage of occurrence of rare words (say, occurring less than 5 times) in the text.

thresh = 5

cnt = 0
tot_cnt = 0

for key, value in x_tokenizer.word_counts.items():
    tot_cnt = tot_cnt + 1
    if value < thresh:
        cnt = cnt + 1
    
print("% of rare words in vocabulary: ", (cnt / tot_cnt) * 100)
# Output
% of rare words in vocabulary:  62.625791318822664

Tokenize the text again by considering the total number of words minus the rare occurrences. Convert text to numbers and pad them all to the same length.

# Prepare a tokenizer, again -- by not considering the rare words
x_tokenizer = Tokenizer(num_words = tot_cnt - cnt) 
x_tokenizer.fit_on_texts(list(x_tr))

# Convert text sequences to integer sequences 
x_tr_seq = x_tokenizer.texts_to_sequences(x_tr) 
x_val_seq = x_tokenizer.texts_to_sequences(x_val)

# Pad zero upto maximum length
x_tr = pad_sequences(x_tr_seq,  maxlen=max_text_len, padding='post')
x_val = pad_sequences(x_val_seq, maxlen=max_text_len, padding='post')

# Size of vocabulary (+1 for padding token)
x_voc = x_tokenizer.num_words + 1

print("Size of vocabulary in X = {}".format(x_voc))
# Output
Size of vocabulary in X = 29638

Do the same for the summaries as well.

# Prepare a tokenizer on testing data
y_tokenizer = Tokenizer()   
y_tokenizer.fit_on_texts(list(y_tr))

thresh = 5

cnt = 0
tot_cnt = 0

for key, value in y_tokenizer.word_counts.items():
    tot_cnt = tot_cnt + 1
    if value < thresh:
        cnt = cnt + 1
    
print("% of rare words in vocabulary:",(cnt / tot_cnt) * 100)

# Prepare a tokenizer, again -- by not considering the rare words
y_tokenizer = Tokenizer(num_words=tot_cnt-cnt) 
y_tokenizer.fit_on_texts(list(y_tr))

# Convert text sequences to integer sequences 
y_tr_seq = y_tokenizer.texts_to_sequences(y_tr) 
y_val_seq = y_tokenizer.texts_to_sequences(y_val) 

# Pad zero upto maximum length
y_tr = pad_sequences(y_tr_seq, maxlen=max_summary_len, padding='post')
y_val = pad_sequences(y_val_seq, maxlen=max_summary_len, padding='post')

# Size of vocabulary (+1 for padding token)
y_voc = y_tokenizer.num_words + 1

print("Size of vocabulary in Y = {}".format(y_voc))
# Output
% of rare words in vocabulary: 62.55667945587723
Size of vocabulary in Y = 12883

Step 6: Removing Empty Texts and Summaries

Remove all empty summaries (which only have START and END tokens) and their associated texts from the data.

# Remove empty Summaries, .i.e, which only have 'START' and 'END' tokens
ind = []

for i in range(len(y_tr)):
    cnt = 0
    for j in y_tr[i]:
        if j != 0:
            cnt = cnt + 1
    if cnt == 2:
        ind.append(i)

y_tr = np.delete(y_tr, ind, axis=0)
x_tr = np.delete(x_tr, ind, axis=0)

Repeat the same for the validation data as well.

# Remove empty Summaries, .i.e, which only have 'START' and 'END' tokens
ind = []
for i in range(len(y_val)):
    cnt = 0
    for j in y_val[i]:
        if j != 0:
            cnt = cnt + 1
    if cnt == 2:
        ind.append(i)

y_val = np.delete(y_val, ind, axis=0)
x_val = np.delete(x_val, ind, axis=0)

We'll continue this in Part 2, where we build the model, train it, and do inference.

Conclusion

So far in this tutorial, you've seen what an Encoder-Decoder Sequence-to-Sequence model is along with its architecture. You've also prepared the news summary dataset to train your own seq2seq model in order to summarize given pieces of text.

In the next part of this two-part series, you'll build, train, and test the model.

Add speed and simplicity to your Machine Learning workflow today

Get startedContact Sales

Spread the word