	---
	language:
	- en
	- cy
	pipeline_tag: translation
	tags:
	- translation
	- marian
	metrics:
	- bleu
	- cer
	- wer
	- wil
	- wip
	- chrf
	widget:
	- text: "The doctor will be late to attend to patients this morning."
	  example_title: "Example 1"
	license: apache-2.0
	model-index:
	- name: "mt-dspec-health-en-cy"
	  results:
	  - task:
	      name: Translation
	      type: translation
	    dataset:
	      type: "text"
	      name: "various"
	    metrics:
	    - name: SacreBLEU
	      type: bleu
	      value: 54.16
	    - name: CER
	      type: cer
	      value: 0.31
	    - name: WER
	      type: wer
	      value: 0.47
	    - name: WIL
	      type: wil
	      value: 0.67
	    - name: WIP
	      type: wip
	      value: 0.33
	    - name: SacreBLEU CHRF
	      type: chrf
	      value: 69.03
	---

	# mt-dspec-health-en-cy
	A machine translation model for translating from English to Welsh, specialised for the health and care domain.

	This model was trained using a custom DVC pipeline employing [Marian NMT](https://marian-nmt.github.io/).
	The training datasets were generated from the following sources:
	- [UK Government Legislation data](https://www.legislation.gov.uk)
	- [OPUS-cy-en](https://opus.nlpl.eu/)
	- [Cofnod Y Cynulliad](https://record.assembly.wales/)
	- [Cofion Techiaith Cymru](https://cofion.techiaith.cymru)

	The data was split into training, validation and test sets. The test set contains health-specific segments from TMX files
	selected at random from the [Cofion Techiaith Cymru](https://cofion.techiaith.cymru) website, which have been pre-classified as pertaining to that domain.
	After the test set was extracted, the remaining data was aggregated and split into 10 training and validation sets, which were fed into 10 Marian training sessions.
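
The splitting procedure described above can be sketched roughly as follows. This is only an illustration, not the project's DVC pipeline: the source does not say how the 10 sets were drawn, so the sketch assumes 10 independent random train/validation partitions of the remaining data. The function name `split_corpus`, the `(source, target, domain)` tuple layout, and the `test_size`, `val_fraction` and `seed` parameters are all hypothetical.

```python
import random


def split_corpus(segments, test_size, n_splits=10, val_fraction=0.1, seed=0):
    """Hold out a random health-only test set, then build n_splits train/validation pairs.

    `segments` is a list of (source, target, domain) tuples; this layout is an
    assumption made for the sketch, not the format used by the original pipeline.
    """
    rng = random.Random(seed)

    # Only segments pre-classified as health-specific are eligible for the test set.
    health_idx = [i for i, seg in enumerate(segments) if seg[2] == "health"]
    test_idx = set(rng.sample(health_idx, test_size))
    test = [segments[i] for i in sorted(test_idx)]

    # Everything that was not held out is aggregated for training and validation.
    remaining = [seg for i, seg in enumerate(segments) if i not in test_idx]

    splits = []
    for _ in range(n_splits):
        shuffled = remaining[:]
        rng.shuffle(shuffled)
        n_val = max(1, int(len(shuffled) * val_fraction))
        # Each pair is (training set, validation set) for one training session.
        splits.append((shuffled[n_val:], shuffled[:n_val]))
    return test, splits


# Toy corpus: every fifth segment is labelled as health-specific.
corpus = [
    (f"en {i}", f"cy {i}", "health" if i % 5 == 0 else "general")
    for i in range(100)
]
test, splits = split_corpus(corpus, test_size=5)
print(len(test), len(splits), len(splits[0][0]), len(splits[0][1]))
```

Each of the resulting pairs would then drive one training session, with the held-out test set never appearing in any of them.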

	A website demonstrating use of this model is available at http://cyfieithu.techiaith.cymru.

	## Evaluation

	Evaluation was done using the Python libraries [SacreBLEU](https://github.com/mjpost/sacrebleu) and [torchmetrics](https://torchmetrics.readthedocs.io/en/stable/).
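
To make the error-rate metrics in the model card easier to interpret, here is a minimal standard-library sketch of how WER and CER are conventionally defined: the edit distance between hypothesis and reference, divided by the reference length, counted in words or characters respectively. It is not the torchmetrics implementation and was not used to produce the reported scores. The helper names and the example sentences are illustrative assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(references, hypotheses):
    """Word error rate: total word-level edits divided by total reference words."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(references, hypotheses))
    total = sum(len(r.split()) for r in references)
    return errors / total


def cer(references, hypotheses):
    """Character error rate: total character-level edits divided by total reference characters."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total = sum(len(r) for r in references)
    return errors / total


# Illustrative strings only: the hypothesis drops one word from the reference.
refs = ["bydd y meddyg yn hwyr y bore yma"]
hyps = ["bydd y meddyg yn hwyr bore yma"]
print(wer(refs, hyps), cer(refs, hyps))
```

Lower values are better for both metrics. The reported WIL and WIP values are complements of each other (WIL = 1 - WIP), which is consistent with the 0.67 and 0.33 listed in the metadata.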

	## Usage

	Ensure you have the prerequisite Python libraries installed:

	```bash
	pip install transformers sentencepiece
	```

	```python
	import transformers

	# Load the tokenizer and model from the Hugging Face Hub.
	model_id = "techiaith/mt-dspec-health-en-cy"
	tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
	model = transformers.AutoModelForSeq2SeqLM.from_pretrained(model_id)
	translate = transformers.pipeline("translation", model=model, tokenizer=tokenizer)
	# The pipeline returns a list with one dict per input sentence.
	translated = translate("The doctor will be late to attend to patients this morning.")
	print(translated[0]["translation_text"])
	```