---
language:
- en
- cy
license: apache-2.0
pipeline_tag: translation
tags:
- translation
- marian
metrics:
- bleu
- cer
- chrf
- wer
- wil
- wip
widget:
- text: >-
    The Curriculum and Assessment (Wales) Act 2021 (the Act) established the
    Curriculum for Wales and replaced the general curriculum used up until that
    point.
  example_title: Example 1
model-index:
- name: mt-dspec-legislation-en-cy
  results:
  - task:
      name: Translation
      type: translation
    dataset:
      name: "various"
      type: "text"
    metrics:
    - type: bleu
      value: 65.51
    - type: cer
      value: 0.28
    - type: chrf
      value: 74.69
    - type: wer
      value: 0.39
    - type: wil
      value: 0.54
    - type: wip
      value: 0.46
---
# mt-dspec-legislation-en-cy
A machine translation model for translating from English to Welsh, specialised for the legislation domain.

This model was trained using a custom DVC pipeline employing [Marian NMT](https://marian-nmt.github.io/).
The datasets were prepared from the following sources:
 - [UK Government Legislation data](https://www.legislation.gov.uk)
 - [OPUS-cy-en](https://opus.nlpl.eu/)
 - [Cofnod Y Cynulliad](https://record.assembly.wales/)
 - [Cofion Techiaith Cymru](https://cofion.techiaith.cymru)

The data was split into train, validation and test sets. The test set consists of legislation-specific segments selected randomly from TMX files
originating from the [Cofion Techiaith Cymru](https://cofion.techiaith.cymru) website, which have been pre-classified as pertaining to the specific domain,
and from data files scraped from the UK Government Legislation website.

Once the test set had been extracted, the remaining data was aggregated and split into 10 training and validation sets, which were fed into 10 Marian training sessions.
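
The exact splitting procedure is not described here. As a rough sketch only, the following assumes a k-fold style scheme in which the remaining segments are shuffled once and each of the 10 folds serves as the validation set for one training session. The function name `make_splits`, the seed and the placeholder segment pairs are hypothetical and not taken from the actual pipeline.

```python
import random


def make_splits(segments, n_splits=10, seed=0):
    """Return n_splits (train, validation) pairs from a list of segments.

    Assumption: k-fold style splitting; the real pipeline may differ.
    """
    shuffled = list(segments)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled segments into n_splits folds of near-equal size.
    folds = [shuffled[i::n_splits] for i in range(n_splits)]
    splits = []
    for k in range(n_splits):
        validation = folds[k]
        # Every fold except the k-th goes into the training set.
        train = [seg for i, fold in enumerate(folds) if i != k for seg in fold]
        splits.append((train, validation))
    return splits


# Placeholder (English, Welsh) pairs standing in for real aligned segments.
segments = [(f"English segment {i}", f"Welsh segment {i}") for i in range(100)]
splits = make_splits(segments)
```

Each `(train, validation)` pair would then feed one training session, so no segment appears in both sets within the same session.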

## Evaluation

Evaluation scores were produced using the Python libraries [SacreBLEU](https://github.com/mjpost/sacrebleu) and [torchmetrics](https://torchmetrics.readthedocs.io/en/stable/).
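
To illustrate what two of the reported metrics measure, here is a minimal plain-Python sketch of word error rate (WER) and character error rate (CER), both defined as edit distance divided by reference length. It is not the torchmetrics implementation used to produce the scores above. The helper names `edit_distance`, `wer` and `cer` and the example strings are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from the reference
                curr[j - 1] + 1,     # insertion into the hypothesis
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character-level edit distance divided by the reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


reference = "bore da i bawb"
hypothesis = "bore da bawb"
print(f"WER: {wer(reference, hypothesis):.2f}")
print(f"CER: {cer(reference, hypothesis):.2f}")
```

Lower values are better for both metrics, and a perfect match scores 0.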

## Usage

Ensure you have the prerequisite Python libraries installed:

```bash
# The constraint imposed on the transformers version below
# is due to the following issue:
#    https://github.com/huggingface/transformers/issues/26271
pip install sentencepiece "transformers>4.26.1,<=4.30.2"
```

```python
import transformers

# Hugging Face Hub repository ID for this model
model_id = "techiaith/mt-dspec-legislation-en-cy"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
model = transformers.AutoModelForSeq2SeqLM.from_pretrained(model_id)
translate = transformers.pipeline("translation", model=model, tokenizer=tokenizer)
translated = translate(
  "The Curriculum and Assessment (Wales) Act 2021 (the Act) "
  "established the Curriculum for Wales and replaced the general "
  "curriculum used up until that point."
)
# The pipeline returns a list with one dict per input text
print(translated[0]["translation_text"])
```