
English to Spanish Machine Translation

Introduction:

In this project, we'll build a sequence-to-sequence Transformer model, which we'll train on an English-to-Spanish machine translation task.

In this project, we will learn how to:

  • Vectorize text using the Keras TextVectorization layer.
  • Implement a TransformerEncoder layer, a TransformerDecoder layer, and a PositionalEmbedding layer.
  • Prepare data for training a sequence-to-sequence model.
  • Use the trained model to generate translations of never-seen-before input sentences (sequence-to-sequence inference).

Dataset Collection:

We'll be working with an English-to-Spanish translation dataset provided by Anki from this source: "http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip"
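
Below is a minimal sketch of downloading and parsing the data with keras.utils.get_file, assuming the archive extracts alongside the download to a spa-eng/spa.txt file of tab-separated sentence pairs:

import pathlib
import keras

# Download and extract the archive into the Keras cache directory.
zip_path = keras.utils.get_file(
    fname="spa-eng.zip",
    origin="http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip",
    extract=True,
)
text_file = pathlib.Path(zip_path).parent / "spa-eng" / "spa.txt"

# Each line is "english<TAB>spanish"; wrap the Spanish side in [start]/[end].
with open(text_file, encoding="utf-8") as f:
    lines = f.read().split("\n")[:-1]
text_pairs = []
for line in lines:
    eng, spa = line.split("\t")
    text_pairs.append((eng, "[start] " + spa + " [end]"))

The train_pairs and val_pairs used later are simply random splits of this text_pairs list.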

Dependencies

  • numpy
  • keras
  • tensorflow

Data Processing

  • Every line has a sentence in English and the corresponding sentence in Spanish. The source sequence is the English sentence, and the target sequence is the Spanish sentence, to which we prepend the token "[start]" and append the token "[end]".
  • To vectorize the text data, we will use two instances of the TextVectorization layer (one for English and one for Spanish). Instead of the original strings, the layers produce integer sequences, where each integer is the index of a word in a vocabulary (see the sketch after this list).
  • At each training step, the model will try to predict target words N+1 (and beyond) using the source sentence and the target words 0 to N.
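A minimal sketch of the pieces the make_dataset code below relies on (the two vectorization layers and a format_dataset helper); the exact names and hyperparameters are assumptions consistent with the shapes reported later (sequences of 20 tokens, a 15,000-word vocabulary, batches of 64):

import re
import string

import tensorflow as tf
import tensorflow.data as tf_data
from keras import layers

vocab_size = 15000
sequence_length = 20
batch_size = 64

# Keep "[" and "]" so the [start]/[end] markers survive standardization.
strip_chars = string.punctuation + "¿"
strip_chars = strip_chars.replace("[", "").replace("]", "")

def custom_standardization(input_string):
    lowercase = tf.strings.lower(input_string)
    return tf.strings.regex_replace(lowercase, "[%s]" % re.escape(strip_chars), "")

# One TextVectorization layer per language: strings -> integer token ids.
eng_vectorization = layers.TextVectorization(
    max_tokens=vocab_size, output_mode="int", output_sequence_length=sequence_length
)
spa_vectorization = layers.TextVectorization(
    max_tokens=vocab_size,
    output_mode="int",
    output_sequence_length=sequence_length + 1,
    standardize=custom_standardization,
)
# Both layers must be adapted on the training texts before use, e.g.:
# eng_vectorization.adapt(train_eng_texts); spa_vectorization.adapt(train_spa_texts)

def format_dataset(eng, spa):
    eng = eng_vectorization(eng)
    spa = spa_vectorization(spa)
    # The decoder input is the target up to word N; the loss target is the
    # same sequence shifted left by one position (words 1 to N+1).
    return (
        {"encoder_inputs": eng, "decoder_inputs": spa[:, :-1]},
        spa[:, 1:],
    )

The make_dataset helper below then turns the raw string pairs into batched, cached (inputs, targets) datasets: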
def make_dataset(pairs):
    eng_texts, spa_texts = zip(*pairs)
    eng_texts = list(eng_texts)
    spa_texts = list(spa_texts)
    dataset = tf_data.Dataset.from_tensor_slices((eng_texts, spa_texts))
    dataset = dataset.batch(batch_size)
    dataset = dataset.map(format_dataset)
    # Cache the vectorized batches, shuffle, and prefetch for throughput.
    return dataset.cache().shuffle(2048).prefetch(16)


train_ds = make_dataset(train_pairs)
val_ds = make_dataset(val_pairs)
  • We have batches of 64 pairs, and all sequences are 20 steps long.
inputs["encoder_inputs"].shape: (64, 20)
inputs["decoder_inputs"].shape: (64, 20)
targets.shape: (64, 20)
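
One way to print these shapes, assuming the train_ds built by make_dataset above:

for inputs, targets in train_ds.take(1):
    print(f'inputs["encoder_inputs"].shape: {inputs["encoder_inputs"].shape}')
    print(f'inputs["decoder_inputs"].shape: {inputs["decoder_inputs"].shape}')
    print(f"targets.shape: {targets.shape}")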

Model Architecture:

The model architecture is an encoder-decoder Transformer network with embedding layers. To make the model aware of word order, we also use a PositionalEmbedding layer.
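
A minimal sketch of what such a PositionalEmbedding layer can look like, assuming an embedding dimension of 256 as in the model summary below (details may differ from the project's actual implementation):

import tensorflow as tf
from keras import layers

class PositionalEmbedding(layers.Layer):
    """Sum of a learned token embedding and a learned position embedding."""

    def __init__(self, sequence_length, vocab_size, embed_dim, **kwargs):
        super().__init__(**kwargs)
        self.token_embeddings = layers.Embedding(vocab_size, embed_dim)
        self.position_embeddings = layers.Embedding(sequence_length, embed_dim)

    def call(self, inputs):
        # positions = [0, 1, ..., length - 1]: one embedding per position,
        # broadcast-added to the token embeddings of shape (batch, length, embed_dim).
        length = tf.shape(inputs)[-1]
        positions = tf.range(start=0, limit=length, delta=1)
        return self.token_embeddings(inputs) + self.position_embeddings(positions)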

The TransformerEncoder will receive the source sequence and create a new representation of it. This representation is then passed to the TransformerDecoder together with the target sequence so far (target words 0 to N). The TransformerDecoder will then try to predict words N+1 and beyond in the target sequence. Since the TransformerDecoder sees the entire target sequence at once, we have to make sure that when it predicts token N+1 it only uses information from target tokens 0 to N. Otherwise, it could use information from the future, which would produce a model that is unusable at inference time.
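
A minimal, self-contained sketch of this causal masking using the Keras MultiHeadAttention layer (the dimensions here are illustrative assumptions, not necessarily the project's hyperparameters):

import numpy as np
from keras import layers

embed_dim, num_heads = 256, 8
masked_self_attention = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)

# A fake batch of embedded target sequences: 2 sentences, 20 positions each.
dummy_targets = np.random.random((2, 20, embed_dim)).astype("float32")

# With use_causal_mask=True, position i can only attend to positions 0..i,
# so the prediction for token N+1 never sees tokens after N.
out = masked_self_attention(
    query=dummy_targets, value=dummy_targets, key=dummy_targets, use_causal_mask=True
)
print(out.shape)  # (2, 20, 256)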

Training the Model:

Accuracy is used as a quick way to track training progress on the validation data. Keep in mind that machine translation models are usually evaluated with BLEU scores and similar metrics, not with accuracy alone.

In this case, we only train for one epoch; however, you need to train for at least 30 epochs to get the model to converge.

epochs = 1  # This should be at least 30 for convergence

transformer.summary()
transformer.compile(
    "rmsprop", loss="sparse_categorical_crossentropy", metrics=["accuracy"]
)
transformer.fit(train_ds, epochs=epochs, validation_data=val_ds)
Model: "transformer"
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓
┃ Layer (type)        ┃ Output Shape      ┃ Param # ┃ Connected to         ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━┩
β”‚ encoder_inputs      β”‚ (None, None)      β”‚       0 β”‚ -                    β”‚
β”‚ (InputLayer)        β”‚                   β”‚         β”‚                      β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ positional_embeddi… β”‚ (None, None, 256) β”‚ 3,845,… β”‚ encoder_inputs[0][0] β”‚
β”‚ (PositionalEmbeddi… β”‚                   β”‚         β”‚                      β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ decoder_inputs      β”‚ (None, None)      β”‚       0 β”‚ -                    β”‚
β”‚ (InputLayer)        β”‚                   β”‚         β”‚                      β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ transformer_encoder β”‚ (None, None, 256) β”‚ 3,155,… β”‚ positional_embeddin… β”‚
β”‚ (TransformerEncode… β”‚                   β”‚         β”‚                      β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ functional_5        β”‚ (None, None,      β”‚ 12,959… β”‚ decoder_inputs[0][0… β”‚
β”‚ (Functional)        β”‚ 15000)            β”‚         β”‚ transformer_encoder… β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
 Total params: 19,960,216 (76.14 MB)
 Trainable params: 19,960,216 (76.14 MB)
 Non-trainable params: 0 (0.00 B)

Result Analysis:

We simply feed the model the vectorized English sentence and the target start token "[start]", then repeatedly generate the next token until we reach the token "[end]".
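
A minimal greedy-decoding sketch of this loop, reusing the vectorization layers sketched earlier and the trained transformer model; the helper names here are illustrative, not necessarily the project's own:

import numpy as np

spa_vocab = spa_vectorization.get_vocabulary()
spa_index_lookup = dict(zip(range(len(spa_vocab)), spa_vocab))
max_decoded_sentence_length = 20

def decode_sequence(input_sentence):
    tokenized_input_sentence = eng_vectorization([input_sentence])
    decoded_sentence = "[start]"
    for i in range(max_decoded_sentence_length):
        # Re-vectorize the partial translation and drop the last position so
        # the decoder input stays aligned with the predictions.
        tokenized_target_sentence = spa_vectorization([decoded_sentence])[:, :-1]
        predictions = transformer(
            [tokenized_input_sentence, tokenized_target_sentence]
        )
        # Greedily pick the most likely next token and append it.
        sampled_token_index = np.argmax(predictions[0, i, :])
        sampled_token = spa_index_lookup[sampled_token_index]
        decoded_sentence += " " + sampled_token
        if sampled_token == "[end]":
            break
    return decoded_sentence

print(decode_sequence("She handed him the money."))

Some sample translations produced this way: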

She handed him the money. [start] ella le pasΓ³ el dinero [end]

Tom has never heard Mary sing. [start] tom nunca ha oΓ­do cantar a mary [end]

Perhaps she will come tomorrow. [start] tal vez ella vendrΓ‘ maΓ±ana [end]

I love to write. [start] me encanta escribir [end]

His French is improving little by little. [start] su francΓ©s va a [UNK] sΓ³lo un poco [end]

My hotel told me to call you. [start] mi hotel me dijo que te [UNK] [end]

Contributor

Janaatul Ferdaws Amrin (amrincse26@gmail.com)

License

This project is licensed under the MIT License - see the LICENSE file for details.

