---
license: mit
datasets:
- SaranaAbidueva/buryat-russian_parallel_corpus
language:
- ru
metrics:
- bleu
---
This is NLLB-200 fine-tuned on Buryat-Russian sentence pairs. It translates from Buryat to Russian and vice versa.

BLEU: bxr-ru 20, ru-bxr 13.

Thanks to the https://huggingface.co/slone/nllb-rus-tyv-v1 tutorial.
```python
# pip install sentencepiece transformers==4.33
# The pinned transformers version matters: this helper touches tokenizer
# internals (lang_code_to_id, fairseq_tokens_to_ids) that changed in later releases.
from transformers import NllbTokenizer, AutoModelForSeq2SeqLM

def fix_tokenizer(tokenizer, new_lang='bxr_Cyrl'):
    """Add a new language token to the tokenizer vocabulary
    (this should be done each time after its initialization)."""
    old_len = len(tokenizer) - int(new_lang in tokenizer.added_tokens_encoder)
    tokenizer.lang_code_to_id[new_lang] = old_len - 1
    tokenizer.id_to_lang_code[old_len - 1] = new_lang
    # always move "mask" to the last position
    tokenizer.fairseq_tokens_to_ids["<mask>"] = len(tokenizer.sp_model) + len(tokenizer.lang_code_to_id) + tokenizer.fairseq_offset

    tokenizer.fairseq_tokens_to_ids.update(tokenizer.lang_code_to_id)
    tokenizer.fairseq_ids_to_tokens = {v: k for k, v in tokenizer.fairseq_tokens_to_ids.items()}
    if new_lang not in tokenizer._additional_special_tokens:
        tokenizer._additional_special_tokens.append(new_lang)
    # clear the added token encoder; otherwise a new token may end up there by mistake
    tokenizer.added_tokens_encoder = {}
    tokenizer.added_tokens_decoder = {}

MODEL_URL = "SaranaAbidueva/nllb-200-bxr-ru"
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_URL)
tokenizer = NllbTokenizer.from_pretrained(MODEL_URL, force_download=True)
fix_tokenizer(tokenizer)

def translate(text, src_lang='rus_Cyrl', tgt_lang='bxr_Cyrl', a=32, b=3, max_input_length=1024, num_beams=4, **kwargs):
    """Translate a string (or list of strings) from src_lang to tgt_lang."""
    tokenizer.src_lang = src_lang
    tokenizer.tgt_lang = tgt_lang
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=max_input_length)
    result = model.generate(
        **inputs.to(model.device),
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        # a + b * input_length caps the output length relative to the input
        max_new_tokens=int(a + b * inputs.input_ids.shape[1]),
        num_beams=num_beams,
        **kwargs
    )
    return tokenizer.batch_decode(result, skip_special_tokens=True)

translate("красная птица", src_lang='rus_Cyrl', tgt_lang='bxr_Cyrl')
```
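
A couple of hedged usage sketches, not from the original card, assuming the block above has already been run: the first call runs the reverse (Buryat to Russian) direction on an illustrative phrase, and the second shows one way the BLEU scores above could be reproduced with sacrebleu over a held-out split. The sentence lists are hypothetical placeholders; substitute real pairs, e.g. from SaranaAbidueva/buryat-russian_parallel_corpus.

```python
# pip install sacrebleu
import sacrebleu

# Reverse direction: Buryat -> Russian (illustrative input phrase).
print(translate("улаан шубуун", src_lang='bxr_Cyrl', tgt_lang='rus_Cyrl'))

# Hypothetical held-out pairs; fill in real data before evaluating.
bxr_sentences = ["..."]  # Buryat source sentences (placeholder)
ru_references = ["..."]  # Russian reference translations (placeholder)

# Translate each source sentence, then score corpus-level BLEU.
hypotheses = [translate(s, src_lang='bxr_Cyrl', tgt_lang='rus_Cyrl')[0] for s in bxr_sentences]
print(sacrebleu.corpus_bleu(hypotheses, [ru_references]).score)
```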