TwentyNine committed on
Commit 1507674
1 Parent(s): 0ad3f4b

Update README.md

Files changed (1)
  1. README.md +69 -2
README.md CHANGED
@@ -3,9 +3,76 @@ language:
  - ja
  - ain
  pipeline_tag: translation
+ license: cc-by-nc-4.0
  ---

- ## Disclaimer
+ # Disclaimer
  This model is only a preliminary experimental result and is not suitable for any sort of serious use. Its capability is at best extremely limited and unreliable.

- That said, look forward to good things to come. This is my debut in the field of Ainu NLP.
+ That said, look forward to good things to come. This is my debut in the field of Ainu NLP.
+
+ # Acknowledgements
+ I am indebted to [Michal Ptaszynski](https://huggingface.co/ptaszynski) for his guidance and encouragement, [Karol Nowakowski](https://huggingface.co/karolnowakowski) for compiling an expansive parallel corpus, and [David Dale](https://huggingface.co/cointegrated) for his [Medium article](https://cointegrated.medium.com/how-to-fine-tune-a-nllb-200-model-for-translating-a-new-language-a37fc706b865), which helped me take this first step quickly and smoothly.
+
+ # How to use this model
+ The following is adapted from [slone/nllb-rus-tyv-v1](https://huggingface.co/slone/nllb-rus-tyv-v1).
+
+ ```python
+ # The version of transformers is important! fix_tokenizer below relies on
+ # NllbTokenizer internals (lang_code_to_id, fairseq_tokens_to_ids) that
+ # changed in later releases.
+ !pip install sentencepiece transformers==4.33
+ import torch
+ from transformers import NllbTokenizer, AutoModelForSeq2SeqLM
+
+ def fix_tokenizer(tokenizer, new_lang='ain_Latn'):
+     """Add a new language token to the tokenizer vocabulary
+     (this should be done each time the tokenizer is initialized)."""
+     old_len = len(tokenizer) - int(new_lang in tokenizer.added_tokens_encoder)
+     tokenizer.lang_code_to_id[new_lang] = old_len - 1
+     tokenizer.id_to_lang_code[old_len - 1] = new_lang
+     # always move "mask" to the last position
+     tokenizer.fairseq_tokens_to_ids["<mask>"] = len(tokenizer.sp_model) + len(tokenizer.lang_code_to_id) + tokenizer.fairseq_offset
+
+     tokenizer.fairseq_tokens_to_ids.update(tokenizer.lang_code_to_id)
+     tokenizer.fairseq_ids_to_tokens = {v: k for k, v in tokenizer.fairseq_tokens_to_ids.items()}
+     if new_lang not in tokenizer._additional_special_tokens:
+         tokenizer._additional_special_tokens.append(new_lang)
+     # clear the added token encoders; otherwise a new token may end up there by mistake
+     tokenizer.added_tokens_encoder = {}
+     tokenizer.added_tokens_decoder = {}
+
+ MODEL_URL = "TwentyNine/nllb-jpn-ain-v1"
+ model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_URL)
+ tokenizer = NllbTokenizer.from_pretrained(MODEL_URL)
+ fix_tokenizer(tokenizer)
+
+ def translate(
+     text,
+     model,
+     tokenizer,
+     src_lang='jpn_Jpan',
+     tgt_lang='ain_Latn',
+     max_length='auto',
+     num_beams=4,
+     n_out=None,
+     **kwargs
+ ):
+     tokenizer.src_lang = src_lang
+     encoded = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
+     if max_length == 'auto':
+         # scale the output budget with the input length
+         max_length = int(32 + 2.0 * encoded.input_ids.shape[1])
+     model.eval()
+     generated_tokens = model.generate(
+         **encoded.to(model.device),
+         forced_bos_token_id=tokenizer.lang_code_to_id[tgt_lang],
+         max_length=max_length,
+         num_beams=num_beams,
+         num_return_sequences=n_out or 1,
+         **kwargs
+     )
+     out = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
+     if isinstance(text, str) and n_out is None:
+         return out[0]
+     return out
+
+ translate("肉が食べたいな。", model=model, tokenizer=tokenizer)  # "I want to eat meat."
+ # 'kam c=e rusuy na.'
+ ```
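
Because `num_return_sequences` is wired to `n_out`, the same `translate` helper can also return several beam-search candidates for a single input. A minimal sketch, assuming the setup above; note that `n_out` must not exceed `num_beams`:

```python
# Request several beam-search candidates for one sentence.
# n_out=4 matches the default num_beams=4.
candidates = translate("肉が食べたいな。", model=model, tokenizer=tokenizer, n_out=4)
for rank, hypothesis in enumerate(candidates, start=1):
    print(rank, hypothesis)
```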