Portuguese-English Translation Model for the Scientific Domain

Description

This is a CTranslate2 Portuguese-English translation model for the scientific domain. It uses the PT-EN OPUS-MT Transformer-Align model as its base and has been fine-tuned on a large parallel corpus of scientific texts, with a special focus on the four pilot domains of the SciLake project:

  • Neuroscience
  • Cancer
  • Transportation
  • Energy

Dataset

The fine-tuning dataset consists of 5,705,469 EN-PT sentence pairs extracted from parallel theses and abstracts acquired from multiple academic repositories.

Evaluation

We evaluated the base and the fine-tuned models on five test sets:

  • Four domain-specific test sets, one per pilot domain (Neuroscience, Cancer, Transportation, Energy), each containing 1,000 parallel sentences.
  • A general scientific test set containing 3,000 parallel sentences from a wide range of scientific texts in other domains.
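
| Model       | Average of 4 domains: SacreBLEU | Average of 4 domains: chrF2++ | Average of 4 domains: COMET | General Scientific: SacreBLEU | General Scientific: chrF2++ | General Scientific: COMET |
|-------------|---------------------------------|-------------------------------|-----------------------------|-------------------------------|-----------------------------|---------------------------|
| Base        | 46.0                            | 68.3                          | 66.7                        | 44.9                          | 67.7                        | 66.3                      |
| Fine-Tuned  | 48.4                            | 69.9                          | 67.3                        | 47.3                          | 69.1                        | 67.8                      |
| Improvement | +2.4                            | +1.6                          | +0.9                        | +2.4                          | +1.4                        | +1.5                      |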

Usage

Install the dependencies:

pip install ctranslate2 sentencepiece huggingface_hub

Then download the model and run a translation in Python:

import ctranslate2
import sentencepiece as spm
from huggingface_hub import snapshot_download

repo_id = "ilsp/opus-mt-pt-en_ct2_ft-SciLake"

# REPLACE WITH ACTUAL LOCAL DIRECTORY WHERE THE MODEL WILL BE DOWNLOADED
local_dir = ""

model_path = snapshot_download(repo_id=repo_id, local_dir=local_dir)

translator = ctranslate2.Translator(model_path, compute_type="auto")

sp_enc = spm.SentencePieceProcessor()
sp_enc.load(f"{model_path}/source.spm")

sp_dec = spm.SentencePieceProcessor()
sp_dec.load(f"{model_path}/target.spm")

def translate_text(input_text, sp_enc=sp_enc, sp_dec=sp_dec, translator=translator, beam_size=6):
    input_tokens = sp_enc.encode(input_text, out_type=str)
    results = translator.translate_batch([input_tokens],
                                         beam_size=beam_size,
                                         length_penalty=0,
                                         max_decoding_length=512,
                                         replace_unknowns=True)
    output_tokens = results[0].hypotheses[0]
    output_text = sp_dec.decode(output_tokens)
    return output_text
    
input_text = "Na osteoartríte (OA) a degeneração progressiva das estruturas articulares activa continuamente nociceptores levando ao desenvolvimento de dor crónica e a déficits emocionais e cognitivos."
print(translate_text(input_text))

# OUTPUT
# In osteoarthritis (OA), progressive degeneration of articular structures continuously activates nociceptors leading to the development of chronic pain and emotional and cognitive deficits.
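
translate_text handles one sentence per call. When translating many sentences, it is usually more efficient to pass them to translate_batch together. The sketch below assumes the sp_enc, sp_dec and translator objects created above and reuses the same decoding options. The function name translate_sentences and the default max_batch_size of 32 are hypothetical choices, not part of the model card.

```python
from typing import List, Sequence


def translate_sentences(
    sentences: Sequence[str],
    sp_enc,
    sp_dec,
    translator,
    beam_size: int = 6,
    max_batch_size: int = 32,  # hypothetical default; tune for your hardware
) -> List[str]:
    """Translate several sentences with a single translate_batch call."""
    if not sentences:
        return []
    # Tokenize each sentence into SentencePiece pieces, as translate_text does.
    batch_tokens = [sp_enc.encode(s, out_type=str) for s in sentences]
    results = translator.translate_batch(
        batch_tokens,
        beam_size=beam_size,
        length_penalty=0,
        max_decoding_length=512,
        replace_unknowns=True,
        max_batch_size=max_batch_size,  # lets CTranslate2 split large inputs into sub-batches
    )
    # Results come back in input order; keep the best hypothesis of each.
    return [sp_dec.decode(r.hypotheses[0]) for r in results]
```

A call would look like translate_sentences([sentence_1, sentence_2], sp_enc, sp_dec, translator), which returns one English string per input sentence.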

Acknowledgements

This work was created within the SciLake project. We are grateful to the SciLake project for providing the resources and support that made this work possible. This project has received funding from the European Union’s Horizon Europe framework programme under grant agreement No. 101058573.
