# WiDS 2023: Language Translation Model
This Jupyter notebook contains the code for a language translation model built on a pre-trained Transformer-based neural network from Hugging Face 🤗. It covers preprocessing the data, fine-tuning the model on a specific dataset, and evaluating the result, and it ends with an interactive Gradio interface for translation.

First, we check that a GPU is attached to the runtime, since a GPU processes the large batches involved in fine-tuning far more efficiently than a CPU.

In [1]:
#checking whether GPU is working or not
!nvidia-smi

Sat Jan 13 08:47:46 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   74C    P8              13W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
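
The same check can be made from Python, which is convenient when later cells need to branch on the device. The sketch below is not part of the original notebook: `describe_device` is a hypothetical helper name, and the code assumes PyTorch may or may not be installed.

```python
def describe_device():
    """Return a short description of the device PyTorch would use."""
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return "torch not installed"
    if torch.cuda.is_available():
        # Index 0 is the first visible GPU
        return f"cuda: {torch.cuda.get_device_name(0)}"
    return "cpu"

print(describe_device())
```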
                                                                    

In [2]:
#installing all the necessary libraries
! pip install -q transformers accelerate sentencepiece gradio datasets evaluate sacrebleu

Import all the necessary libraries and modules.

In [3]:
import evaluate
import numpy as np
from datasets import load_dataset
from transformers import pipeline
from transformers import AutoTokenizer
from transformers import AutoModelForSeq2SeqLM
from transformers import DataCollatorForSeq2Seq
from transformers import Seq2SeqTrainingArguments
from transformers import Seq2SeqTrainer
from huggingface_hub import notebook_login

We download the `kde4` dataset from the Hugging Face Hub, a parallel corpus built from the localization files of KDE applications. The `lang1` and `lang2` arguments select the language pair to load, here English and French.

In [4]:
raw_datasets = load_dataset("kde4", lang1="en", lang2="fr")
raw_datasets

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 210173
    })
})

The dataset ships with only a training split, but evaluating the model requires held-out data. We call the `train_test_split` method of the `Dataset` object to set aside 20% of the examples as a test set, using a fixed seed so the split is reproducible.

In [5]:
split_datasets= raw_datasets["train"].train_test_split(test_size=0.2, seed=20)
split_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 168138
    })
    test: Dataset({
        features: ['id', 'translation'],
        num_rows: 42035
    })
})
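
The row counts above can be checked by hand. The short sketch below assumes that the test count is rounded up and the remainder goes to the training split, which is consistent with the numbers shown.

```python
import math

n_rows = 210173      # rows in the original train split (from the output above)
test_size = 0.2      # fraction passed to train_test_split

# Assumption: the test count is rounded up and the rest is kept for training.
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

print(n_train, n_test)
```

This gives 168138 training rows and 42035 test rows, matching the `DatasetDict` above.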

In [6]:
split_datasets["train"][45]["translation"]

{'en': 'If you choose the wrong settings here your articles could be unreadable or not sendable at all, so please be careful with these settings.',
 'fr': 'Si vous choisissez ici les mauvais paramètres, vos articles peuvent devenir illisibles ou vous ne pourrez pas du tout les envoyer. Veuillez donc être prudent avec ces paramètres.'}

We use the pre-trained `Helsinki-NLP/opus-mt-en-fr` model from the Hugging Face Hub, which has already been trained to translate English to French. Before fine-tuning, we run it unchanged on the sample sentence above to get a baseline.

In [7]:
model="Helsinki-NLP/opus-mt-en-fr"
translator=pipeline("translation", model=model)
translator("If you choose the wrong settings here your articles could be unreadable or not sendable at all, so please be careful with these settings.")



[{'translation_text': "Si vous choisissez les mauvais paramètres ici, vos articles pourraient être illisibles ou ne pas être envoyés du tout, alors s'il vous plaît soyez prudent avec ces paramètres."}]

The pre-trained model already gives a reasonably accurate translation of this sample. Fine-tuning on KDE4 is expected to improve translation quality further on this dataset.

Next, we use `AutoTokenizer` to load the tokenizer that belongs to the pre-trained checkpoint, so the dataset is tokenized with the same scheme the model was trained with.

In [8]:
tokenizer=AutoTokenizer.from_pretrained(model, return_tensors="pt")

In [9]:
def pre_processtext(text):
  inputs=[sample['en'] for sample in text['translation']]
  output=[sample['fr'] for sample in text['translation']]
  # Pass the French sentences as text_target so they are tokenized as targets;
  # otherwise they would be tokenized like English input and the labels would be wrong.
  tokenized_text=tokenizer(inputs, text_target=output, max_length=128, truncation=True)
  return tokenized_text

In [10]:
tokenized_datasets=split_datasets.map(
    pre_processtext,
    batched=True,
    remove_columns=split_datasets["train"].column_names #(to remove extra columns)
)
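
With `batched=True`, the mapped function receives a dictionary of columns, each holding a list of values, rather than a single example. The toy sketch below illustrates that shape and the effect of truncation. It uses a hypothetical whitespace tokenizer (`toy_tokenize`) and made-up sentences in place of the real subword tokenizer and data, so the tokens are words rather than ids.

```python
# Toy stand-in for the batch that a batched map call passes to the function
batch = {
    "id": ["0", "1"],
    "translation": [
        {"en": "Open the file", "fr": "Ouvrir le fichier"},
        {"en": "Close", "fr": "Fermer"},
    ],
}

def toy_tokenize(sentence, max_length):
    # Hypothetical tokenizer: split on whitespace, then truncate to max_length
    return sentence.split()[:max_length]

def toy_preprocess(batch, max_length=2):
    inputs = [pair["en"] for pair in batch["translation"]]
    targets = [pair["fr"] for pair in batch["translation"]]
    return {
        "input_ids": [toy_tokenize(s, max_length) for s in inputs],
        "labels": [toy_tokenize(s, max_length) for s in targets],
    }

result = toy_preprocess(batch)
print(result["input_ids"])
print(result["labels"])
```

The returned dictionary has one list entry per example, which is why the original `id` and `translation` columns can be dropped afterwards.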

After preprocessing, we load the pre-trained weights with `AutoModelForSeq2SeqLM`, the class for sequence-to-sequence (encoder-decoder) models.

In [11]:
model_1= AutoModelForSeq2SeqLM.from_pretrained(model)

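The data collator pads each batch dynamically to the length of its longest example. Labels are padded with -100, a value the loss function ignores. Because the collator is given the model, it also builds `decoder_input_ids` by shifting the labels one position to the right and prepending the decoder start token.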

In [12]:
data_collator=DataCollatorForSeq2Seq(tokenizer,model=model_1)

In [13]:
batch = data_collator([tokenized_datasets["train"][i] for i in range(1,3)])
print(batch.keys())
print(batch['labels'])
batch['decoder_input_ids']

dict_keys(['input_ids', 'attention_mask', 'labels', 'decoder_input_ids'])
tensor([[25966,    19,   540,     8,   669, 33355,    24, 11106,    37,   583,
           583,  9507, 10571,     3,    49, 19015,     3,    49,  1937,    74,
          2635,   973,   529, 13518,    74,   102,     0],
        [14743,   301,   548,     0,  -100,  -100,  -100,  -100,  -100,  -100,
          -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
          -100,  -100,  -100,  -100,  -100,  -100,  -100]])


tensor([[59513, 25966,    19,   540,     8,   669, 33355,    24, 11106,    37,
           583,   583,  9507, 10571,     3,    49, 19015,     3,    49,  1937,
            74,  2635,   973,   529, 13518,    74,   102],
        [59513, 14743,   301,   548,     0, 59513, 59513, 59513, 59513, 59513,
         59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513, 59513,
         59513, 59513, 59513, 59513, 59513, 59513, 59513]])
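
The relationship between `labels` and `decoder_input_ids` in this output can be reproduced with a few lines of NumPy. This is a minimal sketch of the idea, not the collator's actual implementation. The ids are read off the output above, the first row is shortened for illustration, and the helper names `pad_labels` and `shift_right` are hypothetical.

```python
import numpy as np

PAD_ID = 59513     # id that fills the padded positions of decoder_input_ids above
START_ID = 59513   # decoder start token; it equals PAD_ID in the output above
IGNORE = -100      # label value skipped by the loss

def pad_labels(label_lists):
    # Pad every label sequence with IGNORE up to the longest one in the batch
    max_len = max(len(seq) for seq in label_lists)
    return np.array([seq + [IGNORE] * (max_len - len(seq)) for seq in label_lists])

def shift_right(labels):
    # Prepend the start token, drop the last position, and turn IGNORE into PAD_ID
    shifted = np.full_like(labels, START_ID)
    shifted[:, 1:] = labels[:, :-1]
    shifted[shifted == IGNORE] = PAD_ID
    return shifted

labels = pad_labels([[25966, 19, 540, 8, 669, 0], [14743, 301, 548, 0]])
decoder_input_ids = shift_right(labels)
print(labels)
print(decoder_input_ids)
```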

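To evaluate the model we use the BLEU score as computed by SacreBLEU, which measures the n-gram overlap between each predicted translation and its reference. BLEU does not check grammar or meaning directly. Its n-gram counts are clipped, so repeating a word more often than it occurs in the reference is penalized, and a brevity penalty lowers the score of translations that are shorter than the reference.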

In [14]:
metric_evaluate= evaluate.load("sacrebleu")
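
To make the clipping and the brevity penalty concrete, here is a simplified sketch in plain Python. It is not the SacreBLEU implementation: it works on whitespace tokens, handles one n-gram order at a time, and the function names are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(pred_tokens, ref_tokens, n=1):
    # Each predicted n-gram counts at most as often as it appears in the reference
    pred_counts = ngrams(pred_tokens, n)
    ref_counts = ngrams(ref_tokens, n)
    overlap = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    total = sum(pred_counts.values())
    return overlap / total if total else 0.0

def brevity_penalty(pred_len, ref_len):
    # No penalty when the prediction is at least as long as the reference
    if pred_len == 0:
        return 0.0
    if pred_len >= ref_len:
        return 1.0
    return math.exp(1 - ref_len / pred_len)

repeated = clipped_precision("the the the the".split(), "the cat sat".split())
exact = clipped_precision("the cat sat".split(), "the cat sat".split())
print(repeated, exact, brevity_penalty(2, 4))
```

A prediction that repeats "the" four times gets credit for only one match, so its unigram precision is 0.25 rather than 1.0.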

In [15]:
def compute_metrics(eval_preds):
  preds, labels= eval_preds
  if isinstance(preds, tuple): # the model may return more than the predictions
    preds=preds[0]
  decoded_preds= tokenizer.batch_decode(preds, skip_special_tokens=True)

  # -100 marks padded label positions and cannot be decoded, so replace it with the pad token id
  labels=np.where(labels != -100, labels,tokenizer.pad_token_id)
  decoded_labels=tokenizer.batch_decode(labels, skip_special_tokens=True)

  decoded_preds=[pred.strip() for pred in decoded_preds]
  decoded_labels=[[label.strip()] for label in decoded_labels] # references must be a list of lists of sentences

  result=metric_evaluate.compute(predictions=decoded_preds, references=decoded_labels)
  return {"bleu": result["score"]}
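
The `np.where` step is easy to verify in isolation. The snippet below applies it to a small label array; the pad id 59513 is taken from the collator output shown earlier and stands in for `tokenizer.pad_token_id`.

```python
import numpy as np

PAD_ID = 59513  # stand-in for tokenizer.pad_token_id, read off the earlier output

labels = np.array([[14743, 301, 548, 0, -100, -100]])

# Keep real token ids, replace the ignored positions with the pad id
cleaned = np.where(labels != -100, labels, PAD_ID)
print(cleaned)
```

After this replacement every position holds a valid token id, and `skip_special_tokens=True` removes the padding again during decoding.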

To save the fine-tuned model, we push it to a repository on the Hugging Face Hub. This requires logging in to Hugging Face first.

In [16]:
notebook_login()


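To fine-tune the pre-trained model on our dataset, we use `Seq2SeqTrainingArguments` and `Seq2SeqTrainer`. With `evaluation_strategy="no"`, no evaluation runs during training, so the model is evaluated once after training finishes. Setting `predict_with_generate=True` makes evaluation generate full translations, which is what the BLEU metric needs.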

In [17]:
arg= Seq2SeqTrainingArguments(
    "eng-to-fra-model",
    evaluation_strategy="no",
    save_strategy="epoch",
    learning_rate=3e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=3,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=True # save the model to a Hugging Face Hub repository
)

In [18]:
trainer = Seq2SeqTrainer(
    model=model_1,
    args=arg,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [19]:
#training starts from here
trainer.train()

Step,Training Loss
500,1.3652
1000,1.2262
1500,1.156
2000,1.1086
2500,1.0849
3000,1.0331
3500,1.0207
4000,1.0096
4500,0.9949
5000,0.9728


TrainOutput(global_step=15765, training_loss=0.9012604572793396, metrics={'train_runtime': 2950.8187, 'train_samples_per_second': 170.94, 'train_steps_per_second': 5.343, 'total_flos': 1.008207288336384e+16, 'train_loss': 0.9012604572793396, 'epoch': 3.0})
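
The figures in `TrainOutput` follow from the dataset size and the training arguments. The sketch below recomputes them, assuming a single GPU and no gradient accumulation (neither is stated explicitly in the notebook, but both are consistent with the numbers).

```python
import math

train_rows = 168138      # size of the training split
batch_size = 32          # per_device_train_batch_size
epochs = 3               # num_train_epochs
runtime_s = 2950.8187    # train_runtime reported above

# Assumption: one device and no gradient accumulation
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs

samples_per_second = train_rows * epochs / runtime_s
steps_per_second = total_steps / runtime_s

print(steps_per_epoch, total_steps)
print(round(samples_per_second, 2), round(steps_per_second, 3))
```

This gives 5255 steps per epoch and 15765 steps in total, which matches `global_step`.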

In [20]:
trainer.push_to_hub(tags="translation", commit_message="Training complete") #To save the latest model onto the repository

CommitInfo(commit_url='https://huggingface.co/rajbhirud/eng-to-fra-model/commit/7dc6032cdedafc309f004b8d65493fbfe40fd5b7', commit_message='Training complete', commit_description='', oid='7dc6032cdedafc309f004b8d65493fbfe40fd5b7', pr_url=None, pr_revision=None, pr_num=None)

In [21]:
# we can check the score of our model through the following code
trainer.evaluate(max_length=128)

{'eval_loss': 0.8449926376342773,
 'eval_bleu': 53.45040267621567,
 'eval_runtime': 3010.6388,
 'eval_samples_per_second': 13.962,
 'eval_steps_per_second': 0.218,
 'epoch': 3.0}
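
The evaluation throughput can be checked the same way. This is a sketch under the same single-device assumption, using the test split size and `per_device_eval_batch_size` from earlier cells.

```python
import math

eval_rows = 42035           # size of the test split
eval_batch_size = 64        # per_device_eval_batch_size
eval_runtime_s = 3010.6388  # eval_runtime reported above

eval_steps = math.ceil(eval_rows / eval_batch_size)
eval_samples_per_second = eval_rows / eval_runtime_s
eval_steps_per_second = eval_steps / eval_runtime_s

print(eval_steps)
print(round(eval_samples_per_second, 3), round(eval_steps_per_second, 3))
```

Evaluation handles roughly 14 samples per second against roughly 171 during training, which is plausible given that `predict_with_generate=True` produces each translation token by token.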

To try the model interactively, we can use Gradio. The repository includes a ready-to-use script, `gradio_eng_to_fra.py`. Running it launches a Gradio interface where users can translate English text to French without writing any code.
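
The contents of `gradio_eng_to_fra.py` are not shown in this notebook, so the following is only a sketch of what such a script might look like. The function names `make_translate_fn` and `build_demo` are hypothetical, and the interface labels are illustrative.

```python
def make_translate_fn(translator):
    """Wrap a translation pipeline so it maps one input string to one output string."""
    def translate(text):
        if not text.strip():
            return ""  # skip the model call for empty input
        return translator(text)[0]["translation_text"]
    return translate

def build_demo(translate_fn):
    import gradio as gr  # imported lazily so the wrapper above works without Gradio
    return gr.Interface(
        fn=translate_fn,
        inputs=gr.Textbox(label="English"),
        outputs=gr.Textbox(label="French"),
        title="English to French translation",
    )
```

To launch it, one would create the pipeline with `pipeline("translation", model="rajbhirud/eng-to-fra-model")`, the repository the model was pushed to above, and call `build_demo(make_translate_fn(translator)).launch()`.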