	---
	license: apache-2.0
	language:
	- fr
	- en
	pipeline_tag: translation
	---

	# French-English Translation Model for the Scientific Domain

	## Description

	This is a CTranslate2 French-English translation model for the scientific domain, based on the FR-EN OPUS-MT Transformer-Big model [(link)](https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/fra-eng).
	It has been fine-tuned on a large parallel corpus of scientific texts, with a special focus on the four pilot domains of the [SciLake](https://scilake.eu/) project:
	- Neuroscience
	- Cancer
	- Transportation
	- Energy

	## Dataset

	The fine-tuning dataset consists of 1,850,031 FR-EN parallel sentences extracted from parallel theses and abstracts acquired from multiple academic repositories.

	## Evaluation

	We evaluated the base and the fine-tuned models on five test sets:
	- Four test sets, one per pilot domain (Neuroscience, Cancer, Transportation, Energy), each containing 1,000 parallel sentences.
	- A general scientific test set, which contains 3,000 parallel sentences from a wide range of scientific texts in other domains.

	| Model | SacreBLEU (avg. of 4 domains) | chrF2++ (avg. of 4 domains) | COMET (avg. of 4 domains) | SacreBLEU (general scientific) | chrF2++ (general scientific) | COMET (general scientific) |
	|-------------|---------------|---------------|---------------|---------------|---------------|---------------|
	| Base | 37.6 | 63.6 | 57.5 | 38.4 | 63.3 | 57.2 |
	| Fine-Tuned | 40.2 | 65.7 | 59.9 | 40.7 | 65.3 | 59.5 |
	| Improvement | +2.6 | +2.1 | +2.4 | +2.3 | +2.0 | +2.3 |
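
The improvement figures are the fine-tuned score minus the base score for each metric. As a quick sanity check, the minimal sketch below recomputes them from the scores reported in the table. The variable and function names are illustrative and not part of any evaluation script used for this model.

```python
# Scores copied from the evaluation table.
# Order: SacreBLEU, chrF2++, COMET for the average of the 4 pilot domains,
# then the same three metrics for the general scientific test set.
METRICS = [
    "4 domains / SacreBLEU", "4 domains / chrF2++", "4 domains / COMET",
    "General / SacreBLEU", "General / chrF2++", "General / COMET",
]
base = [37.6, 63.6, 57.5, 38.4, 63.3, 57.2]
fine_tuned = [40.2, 65.7, 59.9, 40.7, 65.3, 59.5]


def improvements(base_scores, tuned_scores):
    """Return per-metric gains (fine-tuned minus base), rounded to one decimal."""
    return [round(t - b, 1) for b, t in zip(base_scores, tuned_scores)]


deltas = improvements(base, fine_tuned)
for name, delta in zip(METRICS, deltas):
    print(f"{name}: {delta:+.1f}")
```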


	## Usage

	```
	pip install ctranslate2 sentencepiece huggingface_hub
	```

	```python
	import ctranslate2
	import sentencepiece as spm
	from huggingface_hub import snapshot_download

	repo_id = "ilsp/opus-mt-big-fr-en_ct2_ft-SciLake"

	# REPLACE WITH ACTUAL LOCAL DIRECTORY WHERE THE MODEL WILL BE DOWNLOADED
	local_dir = ""

	model_path = snapshot_download(repo_id=repo_id, local_dir=local_dir)

	translator = ctranslate2.Translator(model_path, compute_type="auto")

	sp_enc = spm.SentencePieceProcessor()
	sp_enc.load(f"{model_path}/source.spm")

	sp_dec = spm.SentencePieceProcessor()
	sp_dec.load(f"{model_path}/target.spm")

	def translate_text(input_text, sp_enc=sp_enc, sp_dec=sp_dec, translator=translator, beam_size=6):
	    # Split the French input into SentencePiece subword tokens
	    input_tokens = sp_enc.encode(input_text, out_type=str)
	    results = translator.translate_batch(
	        [input_tokens],
	        beam_size=beam_size,
	        length_penalty=0,
	        max_decoding_length=512,
	        replace_unknowns=True,
	    )
	    # Keep the best hypothesis and detokenize it into English text
	    output_tokens = results[0].hypotheses[0]
	    output_text = sp_dec.decode(output_tokens)
	    return output_text

	input_text = "Les données radars transportées au sein du système ATM (Air Traffic Management) permet aux contrôleurs de l’aviation civile d’effectuer le contrôle aérien et ainsi la sécurité des avions et de leurs passagers."
	print(translate_text(input_text))

	# OUTPUT
	# Radar data carried within the ATM (Air Traffic Management) system allows civil aviation controllers to perform air traffic control and thus the safety of aircraft and their passengers.
	```
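
The snippet above translates one sentence per call. When several sentences need translating, it may be more efficient to pass them to `translate_batch` together. The following is a minimal sketch under that assumption: the helper name `translate_texts` and the default `max_batch_size=32` are illustrative choices, not part of the original model card, and the decoding options simply mirror those used above. The translator and the two SentencePiece processors are passed in explicitly, so the function works with the `translator`, `sp_enc` and `sp_dec` objects created earlier.

```python
def translate_texts(texts, translator, sp_enc, sp_dec, beam_size=6, max_batch_size=32):
    """Translate a list of French sentences and return the English outputs in order.

    Hypothetical helper: the name and defaults are assumptions for illustration.
    """
    if not texts:
        return []
    # Tokenize every sentence into SentencePiece subword tokens
    batch_tokens = [sp_enc.encode(text, out_type=str) for text in texts]
    # max_batch_size asks CTranslate2 to split a large input into smaller batches
    results = translator.translate_batch(
        batch_tokens,
        max_batch_size=max_batch_size,
        beam_size=beam_size,
        length_penalty=0,
        max_decoding_length=512,
        replace_unknowns=True,
    )
    # Keep the best hypothesis for each sentence and detokenize it
    return [sp_dec.decode(result.hypotheses[0]) for result in results]
```

With the objects from the usage example, a call would look like `translate_texts(["Première phrase.", "Deuxième phrase."], translator, sp_enc, sp_dec)`.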

	## Acknowledgements

	This work was created within the [SciLake](https://scilake.eu/) project. We are grateful to the SciLake project for providing the resources and support that made this work possible. This project has received funding from the European Union’s Horizon Europe framework programme under grant agreement No. 101058573.