metadata

language:
  - en
  - es
  - multilingual
tags:
  - translation
datasets:
  - ted_talks_iwslt
metrics:
  - rouge

Subtitle Translation Model

This model translates between Spanish and English text. It was obtained by fine-tuning the Helsinki-NLP/opus-mt-en-mul model on Spanish and English TED Talks transcriptions from the ted_talks_iwslt dataset.

Intended Use

The model was trained with the goal of building a subtitle translation tool.

Data

The dataset was split into train, validation, and test sets with the following structure:

DatasetDict({
    train: Dataset({
        features: ['Original_Sentence', 'Translate_SP', '__index_level_0__'],
        num_rows: 2454
    })
    validation: Dataset({
        features: ['Original_Sentence', 'Translate_SP', '__index_level_0__'],
        num_rows: 307
    })
    test: Dataset({
        features: ['Original_Sentence', 'Translate_SP', '__index_level_0__'],
        num_rows: 307
    })
})
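The row counts above (2454 / 307 / 307 out of 3068) are consistent with an 80/10/10 split. The exact splitting procedure is not stated here, so the following is only a minimal sketch under that assumption; the function names and the seed are hypothetical, and it works on row indices with the standard library rather than on the actual dataset.

```python
import random


def split_sizes(n_rows, train_frac=0.8):
    """Return (train, validation, test) sizes for an assumed 80/10/10 split."""
    n_train = int(n_rows * train_frac)
    # Assumption: the remaining rows are divided evenly between validation and test.
    n_val = (n_rows - n_train) // 2
    n_test = n_rows - n_train - n_val
    return n_train, n_val, n_test


def split_indices(n_rows, seed=42):
    """Shuffle row indices and cut them into three disjoint index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # the seed value is arbitrary
    n_train, n_val, _ = split_sizes(n_rows)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


train_idx, val_idx, test_idx = split_indices(2454 + 307 + 307)
print(len(train_idx), len(val_idx), len(test_idx))  # 2454 307 307
```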

Note: The test-set evaluation numbers were obtained using 50 samples from the test set.

Relevant Training Arguments

    evaluation_strategy="epoch"
    learning_rate=2e-5
    per_device_train_batch_size=4
    per_device_eval_batch_size=4
    weight_decay=0.01
    save_total_limit=3
    num_train_epochs=1
    predict_with_generate=True
    fp16=False
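To give a sense of how short this training run is, the sketch below collects the arguments listed above in a plain dictionary and estimates the number of optimizer steps. It assumes a single device and no gradient accumulation, neither of which is stated in this card; `steps_per_epoch` is a hypothetical helper, not part of any library.

```python
import math

# Values copied from the list above. They are written as a plain dict here;
# in a real run they would presumably be passed as keyword arguments to a
# training-arguments object.
training_kwargs = {
    "evaluation_strategy": "epoch",
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 4,
    "per_device_eval_batch_size": 4,
    "weight_decay": 0.01,
    "save_total_limit": 3,
    "num_train_epochs": 1,
    "predict_with_generate": True,
    "fp16": False,
}

TRAIN_ROWS = 2454  # size of the train split reported in the Data section


def steps_per_epoch(n_rows, per_device_batch_size, n_devices=1):
    """Number of batches per epoch; the last batch may be smaller than the rest."""
    return math.ceil(n_rows / (per_device_batch_size * n_devices))


total_steps = (
    steps_per_epoch(TRAIN_ROWS, training_kwargs["per_device_train_batch_size"])
    * training_kwargs["num_train_epochs"]
)
print(total_steps)  # 614
```

With one epoch and evaluation once per epoch, the evaluation metrics are computed a single time, at the end of training.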

Evaluation Results

The following results show the ROUGE metrics obtained during training (used to evaluate the hyperparameters) and those obtained when evaluating the final model on the test set.

Eval metrics

{'rouge1': 64.95, 'rouge2': 42.24, 'rougeL': 61.97, 'rougeLsum': 62.93}

Test set evaluation (50 transcriptions)

{'rouge1': 65.54, 'rouge2': 41.45, 'rougeL': 62.72, 'rougeLsum': 62.83}
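For readers unfamiliar with the metric, the sketch below shows the idea behind `rouge1`: an F1 score over unigrams shared by the reference and the predicted translation, reported here on a 0 to 100 scale. It is a simplified illustration with whitespace tokenization, not the implementation used to produce the numbers above, and `rouge1_f1` is a hypothetical name.

```python
from collections import Counter


def rouge1_f1(reference, prediction):
    """Simplified ROUGE-1 F1: unigram overlap between reference and prediction."""
    # Assumption: lowercase and split on whitespace; real implementations
    # use a more careful tokenizer.
    ref_tokens = reference.lower().split()
    pred_tokens = prediction.lower().split()
    if not ref_tokens or not pred_tokens:
        return 0.0
    # Clipped overlap: each token counts at most as often as it appears in both texts.
    overlap = sum((Counter(ref_tokens) & Counter(pred_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("Hola a todos", "Hola a todas")
print(round(100 * score, 2))  # 66.67
```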

Using the model

The model can be used with the following lines of code:

from transformers import pipeline
# Load the fine-tuned model from the Hugging Face Hub.
pipe = pipeline(model="razwand/opus-mt-en-mul-finetuned_en_sp_translator")
pipe("Hi everyone!")
# Output: [{'translation_text': 'Hola a todos!'}]
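Since the intended use is subtitle translation, the following sketch shows one way the translation call could be applied to a subtitle file. It assumes SubRip (SRT) formatted input, where each cue consists of a number line, a timing line, and one or more text lines; `translate_srt` is a hypothetical helper that takes any string-to-string `translate` function, so it does not depend on the model itself.

```python
from typing import Callable


def translate_srt(srt_text: str, translate: Callable[[str], str]) -> str:
    """Translate only the text lines of an SRT document, keeping numbers and timings."""
    out_blocks = []
    # Assumption: cues are separated by a single blank line.
    for block in srt_text.strip().split("\n\n"):
        lines = block.split("\n")
        # In each cue, line 1 is the cue number and line 2 the timing; the rest is text.
        header, text = lines[:2], lines[2:]
        out_blocks.append("\n".join(header + [translate(t) for t in text]))
    return "\n\n".join(out_blocks) + "\n"


# Demonstration with a stand-in translator (upper-casing) instead of the model.
srt = (
    "1\n00:00:01,000 --> 00:00:02,000\nHi everyone!\n\n"
    "2\n00:00:03,000 --> 00:00:04,000\nThank you.\n"
)
print(translate_srt(srt, str.upper))
```

With the pipeline shown above, the translator could be passed as `lambda s: pipe(s)[0]["translation_text"]`.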