cointegrated committed
Commit d34f966
1 Parent(s): 680da70

Create README.md

Files changed (1):
  1. README.md +43 -0
README.md ADDED
@@ -0,0 +1,43 @@
---
language:
- ru
- myv
tags:
- erzya
- mordovian
- fill-mask
- pretraining
- embeddings
- masked-lm
- feature-extraction
- sentence-similarity
license: cc-by-sa-4.0
datasets:
- slone/myv_ru_2022
---

This is an Erzya (`myv`, Cyrillic script) sentence encoder from the paper "The first neural machine translation system for the Erzya language".

It is based on [sentence-transformers/LaBSE](https://huggingface.co/sentence-transformers/LaBSE) ([license here](https://tfhub.dev/google/LaBSE/2)), but with an updated vocabulary and checkpoint:
- Removed all tokens except the most popular ones for English or Russian;
- Added extra tokens for the Erzya language;
- Fine-tuned on the [slone/myv_ru_2022](https://huggingface.co/datasets/slone/myv_ru_2022) corpus using a mixture of tasks (the distillation objective is sketched below):
  - Cross-lingual distillation of sentence embeddings from the original LaBSE model, using the parallel `ru-myv` corpus;
  - Masked language modelling on `myv` monolingual data;
  - Sentence pair classification to distinguish correct `ru-myv` translations from random pairs.

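The cross-lingual distillation objective can be illustrated roughly as follows. This is a minimal sketch, not the actual training code from the paper: a frozen copy of the original LaBSE acts as a teacher on the Russian side of a parallel batch, and the student encoder is trained so that its Erzya embeddings match the teacher's Russian embeddings. Names such as `ru_batch` and `myv_batch` are placeholders.

```python
# Minimal, illustrative sketch of cross-lingual embedding distillation.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/LaBSE")
teacher = AutoModel.from_pretrained("sentence-transformers/LaBSE").eval()  # frozen teacher
student = AutoModel.from_pretrained("sentence-transformers/LaBSE")         # trainable student (vocabulary later adapted for myv)

def embed(model, texts):
    """Encode texts into L2-normalized sentence embeddings via the pooler output."""
    batch = tokenizer(texts, padding=True, truncation=True, max_length=64, return_tensors="pt")
    return F.normalize(model(**batch).pooler_output)

# A parallel ru-myv pair, reusing the Russian and Erzya example sentences from this card.
ru_batch = ["Привет Мир"]
myv_batch = ["Шумбратадо Мастор"]

with torch.no_grad():
    target = embed(teacher, ru_batch)      # fixed target embeddings from the Russian side

prediction = embed(student, myv_batch)     # student embeddings for the Erzya translations
loss = F.mse_loss(prediction, target)      # pull myv embeddings into the teacher's embedding space
loss.backward()                            # a real loop would also mix in the MLM and pair-classification losses
```

In the actual training mixture, this loss is combined with the masked language modelling and sentence pair classification objectives listed above. The released encoder itself can be used as follows: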
```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("cointegrated/LaBSE-en-ru")
model = AutoModel.from_pretrained("cointegrated/LaBSE-en-ru")

# Example sentences in English, Russian and Erzya.
sentences = ["Hello World", "Привет Мир", "Шумбратадо Мастор"]
encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=64, return_tensors='pt')
with torch.no_grad():
    model_output = model(**encoded_input)
# Take the pooler output as the sentence embedding and L2-normalize it,
# so that dot products between embeddings equal cosine similarities.
embeddings = model_output.pooler_output
embeddings = torch.nn.functional.normalize(embeddings)
print(embeddings.shape)  # torch.Size([3, 768])
```

The model can be used as a sentence encoder or fine-tuned for any downstream NLU task.
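
As a quick sanity check for the sentence-encoder use case (reusing the `embeddings` tensor from the snippet above): since the vectors are L2-normalized, pairwise cosine similarities reduce to a matrix product.

```python
# Pairwise cosine similarities between the three example sentences.
# The embeddings are already L2-normalized, so dot products equal cosine similarities.
similarity = embeddings @ embeddings.T
print(similarity)  # 3x3 matrix; parallel sentences should get the highest off-diagonal scores
```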