---
library_name: transformers
license: cc-by-nc-4.0
language:
- de
- frr
base_model: facebook/nllb-200-distilled-600M
widget:
- text: "Momme booget önj Naibel."
  example_title: "Example with names"
- text: "Et wus mån en däiken stroote ful foon däike manschne."
  example_title: "Longer example"
---

# Model Card for nllb-deu-moo-v2
This is an [NLLB-200-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) model fine-tuned for translation between 
German and the Mooring dialect of Northern Frisian, following [this great blog post](https://cointegrated.medium.com/a37fc706b865).

## Limitations

This model should be considered no more than a demo.
The dataset used for fine-tuning is relatively small and has been constructed from multiple texts by the same author.
On top of that, the texts are relatively old and are set in the 19th century and earlier.
As a result, the Frisian vocabulary the model has learned is highly limited, especially when it comes to more modern words and phrases.

Separately, while the model can translate from German to Frisian and from Frisian to any language supported by the base model, 
it cannot translate from any language other than German into Frisian; the output comes back in German instead. The reason for this is not yet known.
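For example, using the `translate` helper defined in the How to Get Started section below (the English sentence is only an illustrative input):

```python
# Attempting English -> Mooring: in practice the output tends to come back in German.
translate("The weather is nice today.", tokenizer=tokenizer, model=model,
          src_lang='eng_Latn', tgt_lang='frr_Latn')
```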

## Model Details

### Model Description

- **Language(s) (NLP):** Northern Frisian, German
- **License:** Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0)
- **Finetuned from model:** NLLB-200-600M


## How to Get Started with the Model
Load the model and tokenizer, then use a small helper function to translate:
```python
# Requires a recent transformers release: pip install "transformers>=4.38"
from transformers import AutoModelForSeq2SeqLM, NllbTokenizer

tokenizer = NllbTokenizer.from_pretrained("CmdCody/nllb-deu-moo-v2")
model = AutoModelForSeq2SeqLM.from_pretrained("CmdCody/nllb-deu-moo-v2")
model.cuda()  # optional: move the model to a GPU if one is available

def translate(text, tokenizer, model, src_lang='frr_Latn', tgt_lang='deu_Latn', a=32, b=3, max_input_length=1024, num_beams=4, **kwargs):
    # a and b set the output length budget: max_new_tokens = a + b * input length
    tokenizer.src_lang = src_lang
    tokenizer.tgt_lang = tgt_lang
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=max_input_length)
    result = model.generate(
        **inputs.to(model.device),
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),  # force generation in the target language
        max_new_tokens=int(a + b * inputs.input_ids.shape[1]),
        num_beams=num_beams,
        **kwargs
    )
    return tokenizer.batch_decode(result, skip_special_tokens=True)

translate("Momme booget önj Naibel.", tokenizer=tokenizer, model=model)
```
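To translate from German into Mooring instead, swap the language codes (the German sentence below is just an example input):

```python
# German -> Mooring Frisian
translate("Momme wohnt in Niebüll.", tokenizer=tokenizer, model=model,
          src_lang='deu_Latn', tgt_lang='frr_Latn')
```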

## Training Details

### Training Data

The training data consists of 
["Rüm Hart"](https://www.nordfriiskfutuur.eu/fileadmin/Content/Nordfriisk_Futuur/E-Books/N._A._Johannsen__Ruem_hart.pdf) 
published by the Nordfriisk Instituut. It was split and cleaned up, partially manually, resulting in 5178 example sentences.
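The processed sentence pairs are not published alongside the model. As a rough sketch, assuming they were stored as a two-column tab-separated file (the file name and layout here are hypothetical), they could be loaded like this:

```python
import pandas as pd

# Hypothetical layout: one cleaned German/Mooring sentence pair per line.
pairs = pd.read_csv("ruem_hart_pairs.tsv", sep="\t", names=["deu", "frr"])
print(len(pairs))  # 5178 pairs after splitting and cleaning
```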

### Training Procedure 

The training loop was implemented as described in [this article](https://cointegrated.medium.com/a37fc706b865).
The model was trained for 5 epochs of 1000 steps each with a batch size of 16 on a Google Colab GPU.
Each epoch took roughly 30 minutes to train.
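The training notebook itself is not included here. Below is a minimal sketch of the loop described in the linked article, reusing the hypothetical `pairs` DataFrame from the Training Data section; the optimizer settings, max length, and batch sampling follow that article and are assumptions, not a verbatim copy of the actual training code.

```python
import random
from transformers.optimization import Adafactor

# Optimizer settings follow the linked article, not the actual notebook.
optimizer = Adafactor(
    model.parameters(), lr=1e-4, relative_step=False, scale_parameter=False,
    clip_threshold=1.0, weight_decay=1e-3,
)
LANGS = [("deu", "deu_Latn"), ("frr", "frr_Latn")]

def get_batch_pairs(pairs, batch_size=16):
    # Sample a random batch and a random direction (deu -> frr or frr -> deu).
    (c1, l1), (c2, l2) = random.sample(LANGS, 2)
    rows = pairs.sample(batch_size)
    return rows[c1].tolist(), rows[c2].tolist(), l1, l2

model.train()
for step in range(5 * 1000):  # 5 epochs of 1000 steps each
    src_texts, tgt_texts, src_lang, tgt_lang = get_batch_pairs(pairs)
    tokenizer.src_lang = src_lang
    x = tokenizer(src_texts, return_tensors="pt", padding=True, truncation=True, max_length=128).to(model.device)
    tokenizer.src_lang = tgt_lang
    y = tokenizer(tgt_texts, return_tensors="pt", padding=True, truncation=True, max_length=128).to(model.device)
    y.input_ids[y.input_ids == tokenizer.pad_token_id] = -100  # mask padding in the loss
    loss = model(**x, labels=y.input_ids).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```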

The BLEU score was calculated on a set of 177 sentences taken from other sources.
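The score can be reproduced with `sacrebleu`; a minimal sketch, assuming held-out lists `dev_frr` and `dev_deu` (hypothetical variable names) of Frisian sentences and their German references:

```python
import sacrebleu

# dev_frr: held-out Mooring sentences, dev_deu: their German reference translations
hypotheses = [translate(s, tokenizer=tokenizer, model=model)[0] for s in dev_frr]
print(sacrebleu.corpus_bleu(hypotheses, [dev_deu]).score)
```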

#### Metrics

| Epochs  | Steps  | BLEU Score frr -> de  | BLEU Score de -> frr |
|---------|--------|-----------------------|----------------------|
| 1       | 1000   | 35.86  | 35.68  |
| 2       | 2000   | 40.76  | 42.25  |
| 3       | 3000   | 42.18  | 46.48  |
| 4       | 4000   | 41.01  | 45.15  |
| 5       | 5000   | 44.74  | 47.48  |