---
license: apache-2.0
language:
- en
- az
base_model:
- sentence-transformers/LaBSE
pipeline_tag: sentence-similarity
---




# Small LaBSE for English-Azerbaijani

This is a small, optimized version of [LaBSE](https://huggingface.co/sentence-transformers/LaBSE) for English and Azerbaijani sentence embeddings.





## Benchmark

| STSBenchmark | biosses-sts | sickr-sts | sts12-sts | sts13-sts | sts15-sts | sts16-sts | Average Pearson | Model                                |
|--------------|-------------|-----------|-----------|-----------|-----------|-----------|-----------------|--------------------------------------|
| 0.7363       | 0.8148      | 0.7067    | 0.7050    | 0.6535    | 0.7514    | 0.7070    | 0.7250          | sentence-transformers/LaBSE           |
| 0.7400       | 0.8216      | 0.6946    | 0.7098    | 0.6781    | 0.7637    | 0.7222    | 0.7329          | LocalDoc/LaBSE-small-AZ               |
| 0.5830       | 0.2486      | 0.5921    | 0.5593    | 0.5559    | 0.5404    | 0.5289    | 0.5155          | antoinelouis/colbert-xm                |
| 0.7572       | 0.8139      | 0.7328    | 0.7646    | 0.6318    | 0.7542    | 0.7092    | 0.7377          | intfloat/multilingual-e5-large-instruct |
| 0.7485       | 0.7714      | 0.7271    | 0.7170    | 0.6496    | 0.7570    | 0.7255    | 0.7280          | intfloat/multilingual-e5-large        |
| 0.6960       | 0.8185      | 0.6950    | 0.6752    | 0.5899    | 0.7186    | 0.6790    | 0.6960          | intfloat/multilingual-e5-base         |
| 0.7376       | 0.7917      | 0.7190    | 0.7441    | 0.6286    | 0.7461    | 0.7026    | 0.7242          | intfloat/multilingual-e5-small        |
| 0.7927       | 0.6672      | 0.7758    | 0.8122    | 0.7312    | 0.7831    | 0.7416    | 0.7577          | BAAI/bge-m3                           |

Source: [STS-Benchmark](https://github.com/LocalDoc-Azerbaijan/STS-Benchmark)
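The Average Pearson column appears to be the unweighted mean of the seven per-dataset scores. That reading is an assumption, not something stated here, but it can be checked against the table. A minimal sketch for the `LocalDoc/LaBSE-small-AZ` row:

```python
from statistics import mean

# Per-dataset scores copied from the LocalDoc/LaBSE-small-AZ row of the table.
scores = {
    "STSBenchmark": 0.7400,
    "biosses-sts": 0.8216,
    "sickr-sts": 0.6946,
    "sts12-sts": 0.7098,
    "sts13-sts": 0.6781,
    "sts15-sts": 0.7637,
    "sts16-sts": 0.7222,
}

# Assumption: the reported average is a plain (unweighted) mean,
# rounded to four decimal places like the other columns.
average = round(mean(scores.values()), 4)
print(average)
```

The result matches the 0.7329 reported in the table, which is consistent with each dataset being weighted equally regardless of its size.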





## How to Use

```python
from transformers import AutoTokenizer, AutoModel
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("LocalDoc/LaBSE-small-AZ")
model = AutoModel.from_pretrained("LocalDoc/LaBSE-small-AZ")

# Prepare texts
texts = [
    "Hello world",
    "Salam dünya"
]

# Tokenize and generate embeddings
encoded = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    embeddings = model(**encoded).pooler_output

# Compute the cosine similarity between the two sentence embeddings
similarity = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(similarity.item())
```
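For more than two sentences, it may be convenient to compute all pairwise similarities at once. The sketch below is one way to do that; `pairwise_cosine` is a hypothetical helper name, not part of the model or of `transformers`, and the `demo` tensor is made-up stand-in data so the snippet runs without downloading the model.

```python
import torch
import torch.nn.functional as F


def pairwise_cosine(embeddings: torch.Tensor) -> torch.Tensor:
    """Return an (n, n) matrix of cosine similarities between rows."""
    # L2-normalize each row so that a dot product equals cosine similarity.
    normalized = F.normalize(embeddings, p=2, dim=1)
    return normalized @ normalized.T


# Stand-in embeddings (hypothetical values). In practice, pass the
# `embeddings` tensor produced by the model in the example above.
demo = torch.tensor([[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]])
scores = pairwise_cosine(demo)
print(scores)
```

Entry `scores[i, j]` is the similarity between sentence `i` and sentence `j`; the diagonal is each sentence compared with itself.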