language:
- en
- cy
pipeline_tag: translation
tags:
- translation
- marian
metrics:
- bleu
- cer
- wer
- wil
- wip
- chrf
license: apache-2.0
model-index:
- name: mt-dspec-health-en-cy
  results:
  - task:
      name: Translation
      type: translation
    metrics:
    - name: SacreBLEU
      type: bleu
      value: 54.16
    - name: CER
      type: cer
      value: 0.31
    - name: WER
      type: wer
      value: 0.47
    - name: WIL
      type: wil
      value: 0.67
    - name: WIP
      type: wip
      value: 0.33
    - name: SacreBLEU CHRF
      type: chrf
      value: 69.03
mt-dspec-health-en-cy
A model for translating between English and Welsh, specialised for the health and care domain.
This model was trained using a custom DVC pipeline employing Marian NMT. The datasets were generated from the following sources:
The data was split into training, validation and test sets. The test set contains health-specific segments from TMX files selected at random from the Cofion Techiaith Cymru website, which have been pre-classified as pertaining to the specific domain. After the test set was extracted, the remaining data was split into 10 training and validation sets and fed into 10 Marian training sessions.
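The card does not say how the 10 training and validation sets were drawn. As a minimal sketch, assuming a k-fold style partition in which each fold serves as the validation set once, the split could look like the following. The function name `split_folds`, the seed, and the placeholder segment pairs are hypothetical and are not taken from the actual DVC pipeline.

```python
import random


def split_folds(segments, n_folds=10, seed=0):
    """Return n_folds (train, validation) pairs.

    Assumption: a k-fold partition, so each segment appears in exactly
    one validation set. The real pipeline may split differently.
    """
    shuffled = list(segments)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled segments into n_folds roughly equal folds.
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    splits = []
    for i, validation in enumerate(folds):
        # Training data is everything outside the current validation fold.
        train = [seg for j, fold in enumerate(folds) if j != i for seg in fold]
        splits.append((train, validation))
    return splits


# Hypothetical (English, Welsh) pairs standing in for the remaining data.
pairs = [(f"en segment {i}", f"cy segment {i}") for i in range(100)]
splits = split_folds(pairs)
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Each `(train, validation)` pair would then feed one training session. The test set is assumed to have been removed before this step, as described above.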
A website demonstrating use of this model is available at http://cyfieithu.techiaith.cymru.
Evaluation
Evaluation was done using the Python libraries SacreBLEU and torchmetrics.
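To make the reported error rates easier to interpret, here is a minimal pure-Python sketch of what WER and CER measure: the edit distance between hypothesis and reference, divided by the reference length, at word and character level respectively. It is an illustration only, not the torchmetrics implementation. The helper names and the example sentences are assumptions, and the libraries may aggregate over a corpus differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate for one sentence pair (lower is better)."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate for one sentence pair (lower is better)."""
    return edit_distance(reference, hypothesis) / len(reference)


# Illustrative sentences only, not drawn from the test set.
reference = "mae'r meddyg yn brysur"
hypothesis = "mae'r meddyg yn brysur iawn"
print(wer(reference, hypothesis))  # one inserted word over four reference words
print(cer(reference, hypothesis))  # five inserted characters over 22
```

For BLEU and chrF higher scores are better, whereas for CER, WER and WIL lower is better.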
Usage
Ensure you have the prerequisite Python libraries installed:
pip install transformers sentencepiece
import transformers

# Model ID on the Hugging Face Hub.
model_id = "techiaith/mt-dspec-health-en-cy"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
model = transformers.AutoModelForSeq2SeqLM.from_pretrained(model_id)
translate = transformers.pipeline("translation", model=model, tokenizer=tokenizer)
translated = translate("The doctor had many patients to attend to this morning.")
# The pipeline returns a list with one dict per input sentence.
print(translated[0]["translation_text"])