What is Seq2seq Learning?
Sequence to sequence learning involves building a model that converts a sequence of data in one domain into a corresponding sequence in another domain. Seq2seq, as it is called for short, is especially useful in Natural Language Processing for language translation. If you’ve used popular language translators like Google Translate, you’ll have noticed that as you type a sentence in one language (say English), it is converted to another language (say French) in real time, following the sequence of input words. If you change any word in the English sentence, there is a corresponding change in the French translation. A seq2seq model does this.
I am learning how to build models -> [seq2seq] -> J’apprends à construire des modèles
Seq2seq models can be used for other applications such as conversational models, image captioning, text summarization, and more. In this tutorial, we will focus on building a seq2seq model that does language translation using Keras in Python.
A seq2seq model has two important components: the encoder and the decoder. That’s why the seq2seq model is also called the encoder-decoder model. The encoder maps the input sequence to an internal representation, which the decoder uses to generate the output sequence.
There are many ways to build the neural network architecture of the encoder and decoder, depending on how you want to apply it. For image captioning, a convolutional neural network (CNN) with a flattened final layer is typically used. For language translation models, a recurrent neural network (RNN) is used. Since there are several variants of RNNs, you need to decide which kind to apply. The two common choices are the LSTM (long short-term memory) and the GRU (gated recurrent unit).
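In Keras, swapping between the two recurrent variants is mostly a one-line change. The snippet below is a minimal illustrative sketch (the layer size of 256 simply matches the latent dimension used later in this tutorial):

#a minimal sketch: the same recurrent "slot" can be filled with either variant
from tensorflow.keras.layers import LSTM, GRU

#an LSTM layer that also exposes its final hidden and cell states
lstm_layer = LSTM(256, return_state=True)
#a GRU layer; note that it has a single state instead of the LSTM's two
gru_layer = GRU(256, return_state=True)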
In this project, we shall use LSTM to build a seq2seq model for machine translation.
Let’s start by understanding how the sequence to sequence models work.
How Seq2seq Works
1. An RNN layer which serves as the encoder: The encoder processes the input sequence one step at a time and updates its own internal states. This continues until the end of the sequence. Note that the encoder’s outputs at each timestep are discarded; only its internal states (the hidden state and the cell state, the latter acting as the layer’s memory) are kept and carried from one timestep to the next. The final states of the encoder are called the context, or conditioning, of the decoder, and they are used as the initial state of the decoder.
2. An RNN layer which serves as the decoder: The decoder is trained to return the target characters of the data, but shifted one timestep into the future. Put another way, the decoder is trained to predict the next character given the previous characters, or to return the target at time t + 1 given the targets up to time t, conditioned on the input sequence.
With the context received from the encoder as its initial state, each step of the decoder produces an output that serves as input for the next step. The output of the decoder is a character at each timestep.
After training the model, the next step is to run inference. During inference, the model predicts the sequence of output characters for a completely new input sequence.
To do this, we start by encoding the input sequence into state vectors. We then feed the state vectors and a single-character target sequence (the start character) into the decoder to generate a prediction for the next character. Argmax is used to sample the predicted character, which is appended to the target sequence. The process repeats until the end-of-sequence character is produced or a defined character limit is reached.
That’s how the seq2seq model works. Now that we have an understanding of how encoding, decoding, and inference operates, let’s take a coding example.
A Seq2seq Model Example: Building a Machine Translator
On the official Keras blog, the author of the Keras library, Francois Chollet, wrote an article that details how to implement an LSTM-based sequence to sequence model to make predictions. In this post, we’ll discuss how to build such models and use them specifically for machine translation. The LSTM model will predict Spanish text given an English input sequence. The data used for this project comes from manythings.org/anki. In the dataset, the input sequences are English sentences and the output sequences are their Spanish translations. You can download the dataset here.
Here’s what the dataset looks like.
Go.    Ve.    CC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM) & #4986655 (cueyayotl)
Go.    Vete.    CC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM) & #4986656 (cueyayotl)
Go.    Vaya.    CC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM) & #4986657 (cueyayotl)
Go.    Váyase.    CC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM) & #6586271 (arh)
Hi.    Hola.    CC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #431975 (Leono)
Run!    ¡Corre!    CC-BY 2.0 (France) Attribution: tatoeba.org #906328 (papabear) & #1685404 (Elenitigormiti)
Run!    ¡Corran!    CC-BY 2.0 (France) Attribution: tatoeba.org #906328 (papabear) & #5213896 (cueyayotl)
Run!    ¡Corra!    CC-BY 2.0 (France) Attribution: tatoeba.org #906328 (papabear) & #8005613 (Seael)
Run!    ¡Corred!    CC-BY 2.0 (France) Attribution: tatoeba.org #906328 (papabear) & #8005615 (Seael)
Run.    Corred.    CC-BY 2.0 (France) Attribution: tatoeba.org #4008918 (JSakuragi) & #6681472 (arh)
He laughed.    Él se reía.    CC-BY 2.0 (France) Attribution: tatoeba.org #299650 (Sprachprofi) & #745277 (Shishir)
He made it.    Lo hizo él.    CC-BY 2.0 (France) Attribution: tatoeba.org #300301 (CK) & #6682410 (arh)
He made it.    Lo logró.    CC-BY 2.0 (France) Attribution: tatoeba.org #300301 (CK) & #6682411 (arh)
He made it.    Lo hizo.    CC-BY 2.0 (France) Attribution: tatoeba.org #300301 (CK) & #6682413 (arh)
The Modelling Process in a nutshell
- Machine learning algorithms work with numbers, not strings, so we need to convert the sentences into NumPy arrays. The encoder-decoder LSTM model requires three arrays in numerical format: encoder_input_data, decoder_input_data, and decoder_target_data.
Let’s understand what each of these NumPy arrays is.
- The input data of the encoder (encoder_input_data): This is a 3-dimensional array whose shape is given by the number of sentence pairs (len(input_texts)), the maximum length of the English sentences (max_encoder_seq_length), and the number of unique English characters (num_encoder_tokens). The data is a one-hot encoding representation of the English sentences.
- The input data of the decoder (decoder_input_data): This is a 3-dimensional array whose shape is given by the number of sentence pairs, the maximum length of the Spanish sentences (max_decoder_seq_length), and the number of unique Spanish characters (num_decoder_tokens). Again, the data is a one-hot encoding of the Spanish sentences.
- The target data of the decoder (decoder_target_data): This data is the same as the input data of the decoder, only shifted ahead by one timestep (t + 1). Put differently, the target data of the decoder at time t (decoder_target_data[:, t, :]) is the same as the input data of the decoder at time t + 1 (decoder_input_data[:, t + 1, :]). A toy illustration of this offset follows after this list.
- Once we have all three arrays, we train a simple LSTM-based sequence to sequence model that predicts the target data of the decoder given the input data of the encoder and the input data of the decoder.
- Finally, we carry out inference by trying the model on new sentences to see whether it can decode them with high accuracy.
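To make the one-timestep offset concrete, here is a tiny hand-made illustration; the four-token vocabulary and the two-character “sentence” are invented for this sketch and are not taken from the dataset.

#toy illustration of the one-timestep offset between decoder input and target
import numpy as np

#a made-up vocabulary of 4 tokens: '\t' (start), 'h', 'i', '\n' (end)
token_index = {'\t': 0, 'h': 1, 'i': 2, '\n': 3}
#a made-up target "sentence" wrapped in the start and end markers
sentence = '\thi\n'

decoder_input = np.zeros((1, len(sentence), len(token_index)), dtype='float32')
decoder_target = np.zeros((1, len(sentence), len(token_index)), dtype='float32')
for t, char in enumerate(sentence):
    decoder_input[0, t, token_index[char]] = 1.
    if t > 0:
        #the target at step t - 1 is the character that appears at step t of the input
        decoder_target[0, t - 1, token_index[char]] = 1.

#decoder_target[:, t, :] matches decoder_input[:, t + 1, :]
print(np.allclose(decoder_target[0, :-1], decoder_input[0, 1:]))   #True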
Although the training and inference processes make use of similar RNN layers, they are two different models and must be built separately. We will begin by building the training model.
The Training Model
We’ll start by importing the necessary libraries and defining some parameters needed for training the model.
#import the necessary libraries
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense
import numpy as np

#define the batch size for training
batch_size = 70
#define the number of epochs for training
epochs = 40
#define the dimensionality of the encoding space
latent_dim = 256
#define the number of samples to train on
num_samples = 10000
Next, we read the data to extract the input texts (sentences in English), target texts (sentences in Spanish), input characters (unique characters in the English text), and target characters (unique characters in the Spanish text).
#define the file location
data_path = r"C:\Users\wale obembe\Downloads\Compressed\spa.txt"
#define an empty list to store the English sentences
input_texts = []
#define an empty list to store the Spanish sentences
target_texts = []
#define a set to store the unique characters in the English text
#a set is used instead of a list to avoid repetition of characters
input_characters = set()
#define a set to store the unique characters in the Spanish text
target_characters = set()
#read the data file and parse it line by line
with open(data_path, 'r', encoding='utf-8') as f:
    lines = f.read().split('\n')
for line in lines[: min(num_samples, len(lines) - 1)]:
    #split each line into the input text, the target text and the attribution (which we discard)
    input_text, target_text, _ = line.split('\t')
    #we use tab as the start-of-sequence character for the target, and \n as the end-of-sequence character
    target_text = '\t' + target_text + '\n'
    #append the English sentence to the list
    input_texts.append(input_text)
    #append the Spanish sentence to the list
    target_texts.append(target_text)
    #collect the unique characters in the English text
    for char in input_text:
        #check if the character has not been seen before
        if char not in input_characters:
            #add the new character to the set
            input_characters.add(char)
    #collect the unique characters in the Spanish text
    for char in target_text:
        #check if the character has not been seen before
        if char not in target_characters:
            #add the new character to the set
            target_characters.add(char)
Let’s print out these variables
print(input_characters)
Output:
[' ', '!', '$', "'", ',', '-', '.', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'Y', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
print(target_characters)
Output:
['\t', ' ', '!', '"', "'", ',', '-', '.', '0', '1', '2', '3', '4', '5', '6', '7', '8', ':', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'Y', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '¡', '«', '»', '¿', 'Á', 'É', 'Ó', 'Ú', 'á', 'é', 'í', 'ñ', 'ó', 'ú', 'ü']
print(input_texts[: 50])
Output:
['Go.', 'Go.', 'Go.', 'Go.', 'Hi.', 'Run!', 'Run!', 'Run!', 'Run!', 'Run.', 'Who?', 'Wow!', 'Fire!', 'Fire!', 'Fire!', 'Help!', 'Help!', 'Help!', 'Jump!', 'Jump.', 'Stop!', 'Stop!', 'Stop!', 'Wait!', 'Wait.', 'Go on.', 'Go on.', 'Hello!', 'Hurry!', 'Hurry!', 'Hurry!', 'I hid.', 'I hid.', 'I hid.', 'I hid.', 'I ran.', 'I ran.', 'I try.', 'I won!', 'Oh no!', 'Relax.', 'Shoot!', 'Shoot!', 'Shoot!', 'Shoot!', 'Shoot!', 'Shoot!', 'Smile.', 'Attack!', 'Attack!']
print(target_texts[:50])
Output:
['\tVe.\t', '\tVete.\t', '\tVaya.\t', '\tVáyase.\t', '\tHola.\t', '\t¡Corre!\t', '\t¡Corran!\t', '\t¡Corra!\t', '\t¡Corred!\t', '\tCorred.\t', '\t¿Quién?\t', '\t¡Órale!\t', '\t¡Fuego!\t', '\t¡Incendio!\t', '\t¡Disparad!\t', '\t¡Ayuda!\t', '\t¡Socorro! ¡Auxilio!\t', '\t¡Auxilio!\t', '\t¡Salta!\t', '\tSalte.\t', '\t¡Parad!\t', '\t¡Para!\t', '\t¡Pare!\t', '\t¡Espera!\t', '\tEsperen.\t', '\tContinúa.\t', '\tContinúe.\t', '\tHola.\t', '\t¡Date prisa!\t', '\t¡Daos prisa!\t', '\tDese prisa.\t', '\tMe oculté.\t', '\tMe escondí.\t', '\tMe ocultaba.\t', '\tMe escondía.\t', '\tCorrí.\t', '\tCorría.\t', '\tLo intento.\t', '\t¡He ganado!\t', '\t¡Oh, no!\t', '\tTomátelo con soda.\t', '\t¡Fuego!\t', '\t¡Disparad!\t', '\t¡Disparen!\t', '\t¡Dispara!\t', '\t¡Dispará!\t', '\t¡Dispare!\t', '\tSonríe.\t', '\t¡Al ataque!\t', '\t¡Atacad!\t']
Let’s explicitly define the number of unique English characters, the number of unique Spanish characters, the length of the longest English sentence and the length of the longest Spanish sentence. We will be needing these variables later in our model.
#sort the set of English characters into a list
input_characters = sorted(list(input_characters))
#sort the set of Spanish characters into a list
target_characters = sorted(list(target_characters))
#define the number of unique English characters
num_encoder_tokens = len(input_characters)
#define the number of unique Spanish characters
num_decoder_tokens = len(target_characters)
#define the maximum length of the English sentences
max_encoder_seq_length = max([len(txt) for txt in input_texts])
#define the maximum length of the Spanish sentences
max_decoder_seq_length = max([len(txt) for txt in target_texts])
Print the result…
print("Number of samples:", len(input_texts)) print("Number of unique input tokens:", num_encoder_tokens) print("Number of unique output tokens:", num_decoder_tokens) print("Max sequence length for inputs:", max_encoder_seq_length) print("Max sequence length for outputs", max_decoder_seq_length)
Output:
Number of samples: 10000
Number of unique input tokens: 69
Number of unique output tokens: 83
Max sequence length for inputs: 16
Max sequence length for outputs: 42
We can also define two dictionaries that map each English and Spanish character to an integer index.
#index each character in English
input_token_index = dict(
    [(char, i) for i, char in enumerate(input_characters)])
#index each character in Spanish
target_token_index = dict(
    [(char, i) for i, char in enumerate(target_characters)])
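As a quick, purely illustrative check, a character can now be mapped to its integer index and back. The reverse_lookup helper below is built only for this check; proper reverse-lookup dictionaries are defined later for the inference model.

#illustrative lookup: map a character to its index and back
print(input_token_index['G'])   #prints the integer index assigned to the character 'G'
reverse_lookup = {i: char for char, i in input_token_index.items()}
print(reverse_lookup[input_token_index['G']])   #prints 'G'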
Going forward, we define the three arrays that our training model will require, i.e. encoder_input_data, decoder_input_data, and decoder_target_data. Recall that each is a 3-dimensional one-hot encoding. To carry out the one-hot encoding process, let’s begin by populating the arrays with zeros.
#define the input data of the encoder as a 3-dimensional array populated with zeros
#its shape is the number of input texts by the max encoder sequence length by the number of encoder characters
#the np.zeros() function takes an argument for the dtype; here we use float32
encoder_input_data = np.zeros(
    (len(input_texts), max_encoder_seq_length, num_encoder_tokens),
    dtype='float32')
#define the input data of the decoder as a 3-dimensional array populated with zeros
decoder_input_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')
#define the target data of the decoder as a 3-dimensional array populated with zeros
decoder_target_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')
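An optional shape check confirms that the three arrays have the dimensions described above (the numbers in the comments follow from the counts printed earlier):

#optional sanity check of the three array shapes
print(encoder_input_data.shape)    #(number of pairs, max_encoder_seq_length, num_encoder_tokens) -> (10000, 16, 69)
print(decoder_input_data.shape)    #(number of pairs, max_decoder_seq_length, num_decoder_tokens) -> (10000, 42, 83)
print(decoder_target_data.shape)   #same shape as decoder_input_data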
Next, we need to convert the input and target texts into numerical vectors using one-hot encoding. The code below does that.
#parse the input and target texts
for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
    for t, char in enumerate(input_text):
        encoder_input_data[i, t, input_token_index[char]] = 1.
    encoder_input_data[i, t + 1:, input_token_index[' ']] = 1.
    for t, char in enumerate(target_text):
        #decoder_input_data contains the full target sequence, including the start character
        decoder_input_data[i, t, target_token_index[char]] = 1.
        if t > 0:
            #decoder_target_data is ahead of decoder_input_data by one timestep and does not include the start character
            decoder_target_data[i, t - 1, target_token_index[char]] = 1.
    decoder_input_data[i, t + 1:, target_token_index[' ']] = 1.
    decoder_target_data[i, t:, target_token_index[' ']] = 1.
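As an optional, purely illustrative sanity check, we can recover the first English sentence from its one-hot encoding by reversing the lookup; the reverse_input_lookup dictionary below is a small helper built just for this check:

#illustrative check: recover the first English sentence from its one-hot encoding
reverse_input_lookup = {i: char for char, i in input_token_index.items()}
first_sentence = ''.join(
    reverse_input_lookup[np.argmax(step)]
    for step in encoder_input_data[0, :len(input_texts[0]), :]
)
print(first_sentence)   #expected to print the first input text, e.g. "Go."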
Next, we build and train the model. When defining the RNN layers, a few parameters must be set carefully. Let’s discuss what these parameters mean.
- The return_state parameter: When set to True, the RNN layer returns its output together with its final hidden state and cell state. We use it on the encoder so that its states can be recovered and passed to the decoder.
- The return_sequences parameter: By default, an RNN layer returns only its output at the final timestep. When this parameter is set to True, it returns the full sequence of outputs, one per timestep. This is typically needed for the decoder (a short standalone sketch after this list illustrates both flags).
- The initial_state parameter: This passes the encoder states to the decoder for its initial state.
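Before wiring up the model, here is a small standalone sketch (with made-up input shapes, independent of the translation data) of what an LSTM layer returns when these flags are set:

#standalone sketch: a made-up batch of 1 sample, 10 timesteps and 8 features
import tensorflow as tf
from tensorflow.keras.layers import LSTM

dummy_input = tf.random.normal((1, 10, 8))

#with return_state=True the layer also returns the final hidden state and cell state
output, state_h, state_c = LSTM(4, return_state=True)(dummy_input)
print(output.shape, state_h.shape, state_c.shape)   #(1, 4) (1, 4) (1, 4)

#with return_sequences=True as well, the first item becomes the output at every timestep
outputs, state_h, state_c = LSTM(4, return_sequences=True, return_state=True)(dummy_input)
print(outputs.shape)   #(1, 10, 4)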
Let’s go ahead and define the encoder.
We first define the encoder input, which is the sequence of English characters as one-hot encodings whose feature dimension equals the number of encoder tokens. As explained earlier, the return_state parameter should be set to True for the encoder.
#define the input of the encoder, whose feature dimension is the number of encoder tokens
encoder_inputs = Input(shape=(None, num_encoder_tokens))
#instantiate the LSTM layer
encoder = LSTM(latent_dim, return_state=True)
#define the outputs and states of the encoder
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
#disregard encoder_outputs and keep only the states
encoder_states = [state_h, state_c]
It’s now time to define the decoder.
As with the encoder, the input is a sequence of Spanish characters as one-hot encodings, whose feature dimension is the number of decoder tokens. The LSTM is defined to return the full output sequence along with its states by setting return_sequences and return_state to True.
The final hidden and cell states of the encoder are used to initialize the decoder. Additionally, a Dense layer with a softmax activation is used to predict the output character at each timestep. Finally, the Model can be defined to take the encoder input data and decoder input data and return the decoder target data.
#define the input of the decoder, whose feature dimension is the number of decoder tokens
decoder_inputs = Input(shape=(None, num_decoder_tokens))
#define the LSTM layer for the decoder, setting return_sequences and return_state to True
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
#keep only the decoder outputs for the training model; the states are needed only in the inference model
decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
                                     initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)
#define the training model, which takes encoder_input_data and decoder_input_data and returns decoder_target_data
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
#compile and train the model
model.compile(optimizer='rmsprop', loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          batch_size=batch_size,
          epochs=epochs,
          validation_split=0.2)
After about an hour, the model was successfully trained with a loss of 0.192 and an accuracy of 94.2%. Note that the accuracy of the training model can be increased by increasing the number of epochs.
Output:
Epoch 1/40
8000/8000 [==============================] - 111s 14ms/sample - loss: 1.4567 - acc: 0.6568 - val_loss: 1.3915 - val_acc: 0.6210
Epoch 2/40
8000/8000 [==============================] - 100s 12ms/sample - loss: 1.0924 - acc: 0.7033 - val_loss: 1.1186 - val_acc: 0.6872
Epoch 3/40
8000/8000 [==============================] - 94s 12ms/sample - loss: 0.8982 - acc: 0.7445 - val_loss: 0.9992 - val_acc: 0.7029
Epoch 4/40
8000/8000 [==============================] - 88s 11ms/sample - loss: 0.8080 - acc: 0.7601 - val_loss: 0.9034 - val_acc: 0.7278
Epoch 5/40
8000/8000 [==============================] - 95s 12ms/sample - loss: 0.7425 - acc: 0.7771 - val_loss: 0.8708 - val_acc: 0.7350
.
.
.
Epoch 35/40
8000/8000 [==============================] - 104s 13ms/sample - loss: 0.2295 - acc: 0.9309 - val_loss: 0.7207 - val_acc: 0.8126
Epoch 36/40
8000/8000 [==============================] - 87s 11ms/sample - loss: 0.2212 - acc: 0.9335 - val_loss: 0.7240 - val_acc: 0.8131
Epoch 37/40
8000/8000 [==============================] - 98s 12ms/sample - loss: 0.2129 - acc: 0.9362 - val_loss: 0.7271 - val_acc: 0.8124
Epoch 38/40
8000/8000 [==============================] - 89s 11ms/sample - loss: 0.2057 - acc: 0.9381 - val_loss: 0.7324 - val_acc: 0.8138
Epoch 39/40
8000/8000 [==============================] - 88s 11ms/sample - loss: 0.1985 - acc: 0.9405 - val_loss: 0.7427 - val_acc: 0.8130
Epoch 40/40
8000/8000 [==============================] - 87s 11ms/sample - loss: 0.1920 - acc: 0.9425 - val_loss: 0.7403 - val_acc: 0.8162
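At this point you may also want to persist the trained model so you don’t have to retrain it each time. A minimal sketch follows; the file name is an arbitrary choice for this example, not something defined in the tutorial.

#optional: save the trained model to disk (the file name is an arbitrary example)
model.save('seq2seq_eng_spa.h5')
#it can later be reloaded with:
#from tensorflow.keras.models import load_model
#model = load_model('seq2seq_eng_spa.h5')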
The Inference Model
After training the model, the next step is to use it to make predictions. The model used for prediction is called the inference model. It is almost the same as the training model, save for some slight differences: the training model is not built to return one character at a time recursively, which is exactly what the inference model must do.
Even though the inference model is a different model from the training model, it reuses the layers of the training model. To define the inference encoder, we take the input layer of the trained encoder and output its states (the hidden and cell state).
To define the inference decoder, we create new input placeholders for its initial hidden and cell states. This is important because this decoder is a separate model and must take its initial states from the encoder. These state inputs are then passed as the initial_state of the decoder LSTM layer.
The model is used such that the hidden and cell states of the encoder serve as the initial state of the decoder on the first call. On subsequent calls, the initial state of the decoder is the hidden and cell state returned by the previous call. Hence, the decoder model must output the hidden and cell state alongside the predicted character at each call.
#define the encoder of the inference model, reusing the trained encoder layers
encoder_model = Model(encoder_inputs, encoder_states)

#define placeholder inputs for the decoder's initial hidden and cell states
decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
#define the decoder input states as a list of the hidden and cell state
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
decoder_outputs, state_h, state_c = decoder_lstm(decoder_inputs,
                                                 initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]
#define the decoder output
decoder_outputs = decoder_dense(decoder_outputs)
#define the decoder model
decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs] + decoder_states
)

#reverse-lookup token indices to decode sequences back to something readable
reverse_input_char_index = dict(
    (i, char) for char, i in input_token_index.items()
)
reverse_target_char_index = dict(
    (i, char) for char, i in target_token_index.items()
)
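As an optional sanity check, the inference encoder should now return two state arrays, each of shape (1, latent_dim), for a single input sample:

#illustrative check: the inference encoder returns the hidden and cell state for one sample
sample_states = encoder_model.predict(encoder_input_data[:1])
print([s.shape for s in sample_states])   #expected: [(1, 256), (1, 256)]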
Now, we can tie it all together and define a function that decodes some text in English to output Spanish text.
def decode_sequence(input_seq):
    #encode the input as state vectors
    states_value = encoder_model.predict(input_seq)
    #generate an empty target sequence of length 1
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    #populate the first character of the target sequence with the start character
    target_seq[0, 0, target_token_index['\t']] = 1.
    #sampling loop for a batch of sequences
    #to simplify, we use a batch size of 1
    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict(
            [target_seq] + states_value
        )
        #sample a token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = reverse_target_char_index[sampled_token_index]
        decoded_sentence += sampled_char
        #exit condition: either hit max length or find the stop character
        if (len(decoded_sentence) > max_decoder_seq_length or
                sampled_char == '\n'):
            stop_condition = True
        #update the target sequence (of length 1)
        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.
        #update states
        states_value = [h, c]
    return decoded_sentence
Let’s call the function and check the result
for seq_index in range(100):
    #take one sequence (part of the training set) for trying out decoding
    input_seq = encoder_input_data[seq_index: seq_index + 1]
    #call the function
    decoded_sentence = decode_sequence(input_seq)
    print()
    print(f"Input sentence: {input_texts[seq_index]}")
    print(f"Decoded sentence: {decoded_sentence}")
Output:
Input sentence: Go.
Decoded sentence: Vaya.

Input sentence: Run!
Decoded sentence: ¡Corre!

Input sentence: Who?
Decoded sentence: ¿Quién es?

Input sentence: Fire!
Decoded sentence: ¡Disparad!

Input sentence: I care.
Decoded sentence: Me preocupo.

Input sentence: I fell.
Decoded sentence: Me acuera.
Quite a decent result!
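The loop above decodes sentences taken from the training set. To try the model on a sentence of your own, you first have to one-hot encode it the same way the training data was encoded. Here is a minimal sketch with a hypothetical translate helper, assuming the sentence is no longer than max_encoder_seq_length and only contains characters seen during training:

#sketch: one-hot encode a new English sentence and translate it
#assumption: every character is in input_token_index and the sentence fits in max_encoder_seq_length
def translate(sentence):
    encoded = np.zeros((1, max_encoder_seq_length, num_encoder_tokens), dtype='float32')
    for t, char in enumerate(sentence):
        encoded[0, t, input_token_index[char]] = 1.
    #pad the remaining timesteps with the space character, as was done for the training data
    encoded[0, len(sentence):, input_token_index[' ']] = 1.
    return decode_sequence(encoded)

print(translate('Run!'))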
There you have it, an English-Spanish translation model built with seq2seq. If you have any questions, let us know in the comment section.