cointegrated committed on
Commit
35d84ef
1 Parent(s): 46bd8a2

Create README.md

Files changed (1)
  1. README.md +33 -0
README.md ADDED
@@ -0,0 +1,33 @@
+ ---
+ license: apache-2.0
+ datasets:
+ - AigizK/bashkir-russian-parallel-corpora
+ language:
+ - ba
+ - ru
+ pipeline_tag: text-classification
+ ---
+
+ This is a text pair classifier trained to predict whether a Bashkir sentence and a Russian sentence have the same meaning.
+
+ It can be used for filtering parallel corpora or evaluating machine translation quality.
+
+ To score sentence pairs, pass two equal-length lists of Bashkir and Russian sentences:
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch
+
+ clf_name = 'slone/bert-base-multilingual-cased-bak-rus-similarity'
+ clf = AutoModelForSequenceClassification.from_pretrained(clf_name)
+ clf_tokenizer = AutoTokenizer.from_pretrained(clf_name)
+
+ def classify(texts_ba, texts_ru):
+     # Returns, for each (Bashkir, Russian) pair, the probability of class 1 (same meaning).
+     with torch.inference_mode():
+         batch = clf_tokenizer(texts_ba, texts_ru, padding=True, truncation=True, max_length=512, return_tensors='pt').to(clf.device)
+         return torch.softmax(clf(**batch).logits.view(-1, 2), -1)[:, 1].cpu().numpy()
+
+ print(classify(['Сәләм, ғаләм!', 'Хәйерле көн, тыныслыҡ.'], ['Привет, мир!', 'Мама мыла раму.']))
+ # [0.96345973 0.02213471]
+ ```
+
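
Note (not part of this commit): the README names corpus filtering as a use case but does not show it. The following is a minimal sketch of how that could look, assuming a scoring callable with the same signature as `classify` above. The helper name `filter_parallel`, the `threshold` of 0.5, and the `batch_size` of 32 are illustrative assumptions rather than values from the model card; a suitable threshold would need to be tuned on held-out data.

```python
from typing import Callable, List, Sequence, Tuple


def filter_parallel(
    texts_ba: Sequence[str],
    texts_ru: Sequence[str],
    score_fn: Callable[[List[str], List[str]], Sequence[float]],
    threshold: float = 0.5,  # assumed cut-off, not taken from the model card
    batch_size: int = 32,    # assumed batch size, adjust to available memory
) -> List[Tuple[str, str, float]]:
    """Return the (ba, ru, score) triples whose score is at least `threshold`."""
    if len(texts_ba) != len(texts_ru):
        raise ValueError("texts_ba and texts_ru must have the same length")
    kept = []
    # Score in batches so that a large corpus is never tokenized all at once.
    for start in range(0, len(texts_ba), batch_size):
        ba = list(texts_ba[start:start + batch_size])
        ru = list(texts_ru[start:start + batch_size])
        scores = score_fn(ba, ru)
        for b, r, s in zip(ba, ru, scores):
            if float(s) >= threshold:
                kept.append((b, r, float(s)))
    return kept
```

With the README snippet loaded, a call such as `filter_parallel(ba_sentences, ru_sentences, classify)` would keep only the pairs the classifier scores at or above the threshold. Passing the scorer as an argument keeps the helper independent of the model, so it can be checked with a stub function.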