Update README.md
---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
language: pl
license: apache-2.0
widget:
- source_sentence: "query: Jak dożyć 100 lat?"
  sentences:
  - "passage: Trzeba zdrowo się odżywiać i uprawiać sport."
  - "passage: Trzeba pić alkohol, imprezować i jeździć szybkimi autami."
  - "passage: Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."

---
<h1 align="center">MMLW-e5-large</h1>

MMLW (muszę mieć lepszą wiadomość, Polish for "I must have a better message") is a family of neural text encoders for Polish.
This is a distilled model that generates embeddings for tasks such as semantic similarity, clustering, and information retrieval. The model can also serve as a base for further fine-tuning.
It transforms texts into 1024-dimensional vectors.
The model was initialized with a multilingual E5 checkpoint and then trained with the [multilingual knowledge distillation method](https://aclanthology.org/2020.emnlp-main.365/) on a diverse corpus of 60 million Polish-English text pairs. We used [English FlagEmbeddings (BGE)](https://huggingface.co/BAAI/bge-base-en) as teacher models for distillation.
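
The objective described in the linked distillation paper can be sketched as below. This is a minimal illustration and not the training code used for MMLW: the function name `distillation_loss` and its argument names are hypothetical, and the arrays stand in for embeddings that would come from the English teacher and from the student being trained. The student is pushed to reproduce the teacher's embedding of the English sentence for both the English sentence and its Polish counterpart.

```python
import numpy as np


def distillation_loss(teacher_en, student_en, student_pl):
    """Mean squared distance between teacher and student embeddings.

    teacher_en: teacher embeddings of the English sentences, shape (batch, dim)
    student_en: student embeddings of the same English sentences
    student_pl: student embeddings of the paired Polish sentences
    All three arrays are assumed to share the same shape.
    """
    teacher_en = np.asarray(teacher_en, dtype=float)
    student_en = np.asarray(student_en, dtype=float)
    student_pl = np.asarray(student_pl, dtype=float)

    # Squared L2 distance per pair, averaged over the batch.
    loss_en = np.mean(np.sum((student_en - teacher_en) ** 2, axis=1))
    loss_pl = np.mean(np.sum((student_pl - teacher_en) ** 2, axis=1))
    return float(loss_en + loss_pl)
```

A loss of zero means the student maps both sides of every pair onto the teacher's vector, which is what lets Polish text share the teacher's English embedding space.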

## Usage (Sentence-Transformers)

⚠️ Our embedding models require the use of specific prefixes and suffixes when encoding texts. For this model, queries should be prefixed with **"query: "** and passages with **"passage: "** ⚠️

You can use the model like this with [sentence-transformers](https://www.SBERT.net):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Prefixes required by this model for queries and passages.
query_prefix = "query: "
answer_prefix = "passage: "
# English: "How to live to be 100?"
queries = [query_prefix + "Jak dożyć 100 lat?"]
answers = [
    # English: "You need to eat healthily and do sports."
    answer_prefix + "Trzeba zdrowo się odżywiać i uprawiać sport.",
    # English: "You need to drink alcohol, party, and drive fast cars."
    answer_prefix + "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
    # English: "During the campaign, politicians promised to get rid of the Sunday trading ban."
    answer_prefix + "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
]
model = SentenceTransformer("sdadas/mmlw-e5-large")
queries_emb = model.encode(queries, convert_to_tensor=True, show_progress_bar=False)
answers_emb = model.encode(answers, convert_to_tensor=True, show_progress_bar=False)

# Index of the passage with the highest cosine similarity to the query.
best_answer = cos_sim(queries_emb, answers_emb).argmax().item()
print(answers[best_answer])
# passage: Trzeba zdrowo się odżywiać i uprawiać sport.
```
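
To see the similarity of every passage rather than only the best one, the prefixing and ranking steps can be wrapped in a small helper. The sketch below is an illustration under assumptions: `rank_passages` and the `encode` parameter are hypothetical names, and `encode` is taken to be any callable that maps a list of strings to a 2-D array of embeddings.

```python
import numpy as np

QUERY_PREFIX = "query: "
PASSAGE_PREFIX = "passage: "


def rank_passages(encode, query, passages):
    """Return (passage, cosine similarity) pairs, best match first.

    The raw query and passages are passed in without prefixes;
    the prefixes this model expects are added here.
    """
    q = np.asarray(encode([QUERY_PREFIX + query]), dtype=float)
    p = np.asarray(encode([PASSAGE_PREFIX + text for text in passages]), dtype=float)

    # Normalize rows so the dot product equals cosine similarity.
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    p = p / np.linalg.norm(p, axis=1, keepdims=True)

    scores = (q @ p.T)[0]
    order = np.argsort(-scores)
    return [(passages[i], float(scores[i])) for i in order]
```

With the model loaded as in the example above, it could be called as `rank_passages(lambda texts: model.encode(texts, show_progress_bar=False), "Jak dożyć 100 lat?", passages)`, where `passages` is a list of unprefixed strings.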

## Evaluation Results

The model achieves **NDCG@10** of **56.09** on the Polish Information Retrieval Benchmark. See [PIRB Leaderboard](https://huggingface.co/spaces/sdadas/pirb) for detailed results.
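
For readers unfamiliar with the metric, NDCG@10 compares the discounted gain of the top 10 retrieved documents with that of an ideal ordering. The sketch below uses the common linear-gain definition. It is an assumption that this matches the exact variant used by PIRB, and the reported 56.09 is presumably the score scaled to 0-100. The function names are illustrative.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10, all_relevances=None):
    """NDCG@k for one query.

    ranked_relevances: relevance labels of the retrieved documents, in ranked order.
    all_relevances: labels of every judged document for the query; if omitted,
    the ideal ranking is built from the retrieved documents only.
    """
    pool = ranked_relevances if all_relevances is None else all_relevances
    ideal = dcg_at_k(sorted(pool, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0
```

A benchmark score is then the mean of this value over all queries. For instance, `ndcg_at_k([0, 1])` is below 1 because the only relevant document was ranked second instead of first.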

## Acknowledgements
This model was trained with the A100 GPU cluster support delivered by the Gdansk University of Technology within the TASK center initiative.