language:
  - en
  - cy
license: apache-2.0
pipeline_tag: translation
tags:
  - translation
  - marian
metrics:
  - bleu
  - cer
  - chrf
  - wer
  - wil
  - wip
widget:
  - text: >-
      The Curriculum and Assessment (Wales) Act 2021 (the Act) established the
      Curriculum for Wales and replaced the general curriculum used up until
      that point.
    example_title: Example 1
model-index:
  - name: mt-dspec-legislation-en-cy
    results:
      - task:
          name: Translation
          type: translation
        dataset:
          name: various
          type: text
        metrics:
          - type: bleu
            value: 65.51
          - type: cer
            value: 0.28
          - type: chrf
            value: 74.69
          - type: wer
            value: 0.39
          - type: wil
            value: 0.54
          - type: wip
            value: 0.46

mt-dspec-legislation-en-cy

A machine translation model for translating from English to Welsh, specialised for the legislation domain.

This model was trained using a custom DVC pipeline employing Marian NMT. The datasets were generated from the following sources:

The data was split into train, validation and test sets. The test set contains legislation-specific segments selected randomly from TMX files originating from the Cofion Techiaith Cymru website, which have been pre-classified as pertaining to the specific domain, and from data files scraped from the UK Government Legislation website.

After the test set was extracted, the remaining data was aggregated and split into 10 training and validation sets, which were fed into 10 Marian training sessions.
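The splitting procedure described above could be sketched roughly as follows. This is only one possible reading: the card does not say how the 10 training and validation sets were drawn, so the per-split reshuffle, the `valid_fraction` parameter, the seed handling and the function name `split_corpus` are all assumptions made for illustration, not the pipeline's actual code.

```python
import random


def split_corpus(segments, test_size, n_splits=10, valid_fraction=0.1, seed=0):
    """Hold out a random test set, then build n_splits (train, valid) pairs.

    Hypothetical helper: the real DVC pipeline may partition the data differently.
    """
    rng = random.Random(seed)
    shuffled = list(segments)
    rng.shuffle(shuffled)

    # The test set is extracted first and never reused for training.
    test, remaining = shuffled[:test_size], shuffled[test_size:]

    splits = []
    for i in range(n_splits):
        # Assumption: each training session gets its own random partition
        # of the same remaining data.
        split_rng = random.Random(seed + 1 + i)
        pool = list(remaining)
        split_rng.shuffle(pool)
        n_valid = max(1, int(len(pool) * valid_fraction))
        splits.append((pool[n_valid:], pool[:n_valid]))
    return test, splits


# Placeholder (source, target) segment pairs standing in for the real corpus.
pairs = [(f"en segment {i}", f"cy segment {i}") for i in range(1000)]
test, splits = split_corpus(pairs, test_size=100)
print(len(test), len(splits), len(splits[0][0]), len(splits[0][1]))
```

Each of the 10 `(train, valid)` pairs would then be handed to its own Marian training run.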

Evaluation

Evaluation scores were produced using the Python libraries SacreBLEU and torchmetrics.

Usage

Ensure you have the prerequisite Python libraries installed:

pip install transformers sentencepiece
import transformers
model_id = "techiaith/mt-dspec-legislation-en-cy"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
model = transformers.AutoModelForSeq2SeqLM.from_pretrained(model_id)
translate = transformers.pipeline("translation", model=model, tokenizer=tokenizer)
translated = translate(
  "The Curriculum and Assessment (Wales) Act 2021 (the Act) "
  "established the Curriculum for Wales and replaced the general "
  "curriculum used up until that point."
)
# The pipeline returns a list with one dict per input text.
print(translated[0]["translation_text"])