---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- mteb
- sentence-transformers
- feature-extraction
- sentence-similarity
license: mit
language:
- multilingual
- en
- sr
base_model: intfloat/multilingual-e5-small
---

# djovak/embedic-small

Say hello to **Embedić**, a group of new text embedding models finetuned for the Serbian language!

These models are particularly useful for Information Retrieval and RAG. Check out the images showcasing benchmark performance below: Embedić beats the previous SOTA with 5x fewer parameters!

Although specialized for Serbian (both Cyrillic and Latin scripts), Embedić is cross-lingual and understands English too. You can embed English documents, Serbian documents, or a mix of the two :)

This is a [sentence-transformers](https://www.SBERT.net) model: it maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for tasks like clustering or semantic search.


## Usage (Sentence-Transformers)

Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:

```
pip install -U sentence-transformers
```

Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer

# Serbian examples: "who is Nikola Tesla?", "Nikola Tesla is a famous inventor",
# "Nikola Jokić is a famous basketball player"
sentences = ["ko je Nikola Tesla?", "Nikola Tesla je poznati pronalazač", "Nikola Jokić je poznati košarkaš"]

model = SentenceTransformer('djovak/embedic-small')
embeddings = model.encode(sentences)
print(embeddings)
```
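
For semantic search, you can score a query against a set of documents with cosine similarity; since Embedić is cross-lingual, the query and documents don't have to be in the same language. A minimal sketch (the documents and query here are illustrative):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('djovak/embedic-small')

# Serbian documents, English query -- Embedić handles both languages
documents = [
    "Nikola Tesla je poznati pronalazač",   # "Nikola Tesla is a famous inventor"
    "Nikola Jokić je poznati košarkaš",     # "Nikola Jokić is a famous basketball player"
]
query = "Who invented the rotating magnetic field?"

doc_embeddings = model.encode(documents)
query_embedding = model.encode(query)

# Cosine similarity between the query and every document
scores = util.cos_sim(query_embedding, doc_embeddings)
best = scores.argmax().item()
print(documents[best], scores[0][best].item())
```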

### Important usage notes
- "ošišana latinica" (writing c instead of ć, etc.) significantly decreases search quality; see the sketch below
- Using uppercase letters for named entities can significantly improve search quality
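
To see the first point in practice, you can compare how a query with proper diacritics and its "ošišana" variant score against the same document (a quick check, not a rigorous benchmark; the exact numbers will vary):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('djovak/embedic-small')

document = "Nikola Jokić je poznati košarkaš"
proper = "najbolji košarkaš u NBA ligi"    # with diacritics
stripped = "najbolji kosarkas u NBA ligi"  # "ošišana latinica"

emb = model.encode([document, proper, stripped])
print("with diacritics:   ", util.cos_sim(emb[0], emb[1]).item())
print("without diacritics:", util.cos_sim(emb[0], emb[2]).item())
```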

## Training

- Embedić models are fine-tuned from the multilingual-e5 models and come in 3 sizes (small, base, large).

- Training was done on a single RTX 4070 Ti SUPER GPU.

- Training had 3 steps: distillation, training on (query, text) pairs, and finally fine-tuning with triplets (sketched below).
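
The exact training code isn't published, but the final triplet step can be sketched with the standard sentence-transformers fit API. The triplet below is a made-up example, and the hyperparameters are placeholders, not the ones used for Embedić:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Start from the base multilingual-e5 checkpoint
model = SentenceTransformer('intfloat/multilingual-e5-small')

# Each triplet: (anchor query, relevant passage, irrelevant passage)
train_examples = [
    InputExample(texts=[
        "ko je Nikola Tesla?",                 # anchor
        "Nikola Tesla je poznati pronalazač",  # positive
        "Nikola Jokić je poznati košarkaš",    # negative
    ]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Pulls anchors toward positives and away from negatives by a margin
train_loss = losses.TripletLoss(model=model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
)
```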

## Evaluation


### Model description

| Model Name | Dimension | Sequence Length | Parameters |
|:----:|:---:|:---:|:---:|
| [intfloat/multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) | 384 | 512 | 117M |
| [djovak/embedic-small](https://huggingface.co/djovak/embedic-small) | 384 | 512 | 117M |
| [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) | 768 | 512 | 278M |
| [djovak/embedic-base](https://huggingface.co/djovak/embedic-base) | 768 | 512 | 278M |
| [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 1024 | 512 | 560M |
| [djovak/embedic-large](https://huggingface.co/djovak/embedic-large) | 1024 | 512 | 560M |



The benchmark figures below also include two BM25 baselines:

`BM25-ENG` - Elasticsearch with the English analyzer

`BM25-SRB` - Elasticsearch with the Serbian analyzer

### Evaluation results

Evaluation covers 3 tasks: Information Retrieval, Sentence Similarity, and Bitext Mining. I personally translated the STS17 cross-lingual evaluation dataset and spent $6,000 on the Google Translate API translating 4 IR evaluation datasets into Serbian.

The evaluation datasets will be published as part of the [MTEB benchmark](https://huggingface.co/spaces/mteb/leaderboard) in the near future.

![information retrieval results](image-2.png)

![sentence similarity results](image-1.png)

## Contact

If you have any questions or suggestions related to this project, you can open an issue or a pull request. You can also email me at novakzivanic@gmail.com.

## Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```
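
Note the final `Normalize()` module: output embeddings are L2-normalized, so the dot product of two embeddings equals their cosine similarity. A quick check:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('djovak/embedic-small')
emb = model.encode(["ko je Nikola Tesla?"])[0]

print(np.linalg.norm(emb))  # ~1.0, because of the Normalize() module
```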

## License

Embedić models are licensed under the [MIT License](https://github.com/FlagOpen/FlagEmbedding/blob/master/LICENSE). The released models can be used for commercial purposes free of charge.