---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
language: pl
license: apache-2.0
widget:
- source_sentence: "query: Jak dożyć 100 lat?"
  sentences:
  - "passage: Trzeba zdrowo się odżywiać i uprawiać sport."
  - "passage: Trzeba pić alkohol, imprezować i jeździć szybkimi autami."
  - "passage: Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
---

<h1 align="center">MMLW-e5-large</h1>

MMLW (muszę mieć lepszą wiadomość, "I must have a better message") is a family of neural text encoders for Polish.
This is a distilled model that can be used to generate embeddings for many tasks, such as semantic similarity, clustering, and information retrieval. The model can also serve as a base for further fine-tuning.
It transforms texts into 1024-dimensional vectors.
The model was initialized with a multilingual E5 checkpoint, and then trained with the [multilingual knowledge distillation method](https://aclanthology.org/2020.emnlp-main.365/) on a diverse corpus of 60 million Polish-English text pairs. We utilised [English FlagEmbeddings (BGE)](https://huggingface.co/BAAI/bge-base-en) as teacher models for distillation.
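
For intuition, the distillation objective from the linked paper can be sketched as follows. This is a minimal illustration, not the authors' training code; the tensors are placeholders for encoder outputs, and matching teacher/student dimensions are an assumption (a projection layer would be needed otherwise).

```python
import torch
import torch.nn.functional as F

def distillation_loss(teacher_en, student_en, student_pl):
    # The student is pulled towards the teacher's embedding of the English
    # sentence from both sides of a parallel pair: its own embedding of the
    # English sentence and its embedding of the Polish translation.
    return F.mse_loss(student_en, teacher_en) + F.mse_loss(student_pl, teacher_en)

# Toy batch: random vectors standing in for the encoder outputs of 8
# parallel Polish-English sentence pairs.
teacher_en = torch.randn(8, 1024)                      # frozen teacher (BGE)
student_en = torch.randn(8, 1024, requires_grad=True)  # student, English side
student_pl = torch.randn(8, 1024, requires_grad=True)  # student, Polish side

loss = distillation_loss(teacher_en, student_en, student_pl)
loss.backward()
```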

## Usage (Sentence-Transformers)

⚠️ Our embedding models require the use of specific prefixes and suffixes when encoding texts. For this model, queries should be prefixed with **"query: "** and passages with **"passage: "**. ⚠️

You can use the model with [sentence-transformers](https://www.SBERT.net) like this:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

query_prefix = "query: "
answer_prefix = "passage: "
queries = [query_prefix + "Jak dożyć 100 lat?"]  # "How to live to be 100?"
answers = [
    answer_prefix + "Trzeba zdrowo się odżywiać i uprawiać sport.",  # "You need to eat healthily and do sports."
    answer_prefix + "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",  # "You need to drink alcohol, party, and drive fast cars."
    answer_prefix + "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."  # "During the campaign, politicians promised to deal with the Sunday trading ban."
]
model = SentenceTransformer("sdadas/mmlw-e5-large")
queries_emb = model.encode(queries, convert_to_tensor=True, show_progress_bar=False)
answers_emb = model.encode(answers, convert_to_tensor=True, show_progress_bar=False)

# Pick the answer whose embedding has the highest cosine similarity to the query
best_answer = cos_sim(queries_emb, answers_emb).argmax().item()
print(answers[best_answer])
# Trzeba zdrowo się odżywiać i uprawiać sport.
```
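
If you prefer not to depend on sentence-transformers, the sketch below shows one way to obtain the same kind of embeddings with the plain transformers API. Mean pooling over token embeddings (as in the underlying E5 architecture) and L2 normalization are assumptions of this sketch, not documented behaviour of the checkpoint; the same prefixes apply:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sdadas/mmlw-e5-large")
model = AutoModel.from_pretrained("sdadas/mmlw-e5-large")
model.eval()

texts = [
    "query: Jak dożyć 100 lat?",
    "passage: Trzeba zdrowo się odżywiać i uprawiać sport.",
]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    token_embeddings = model(**batch).last_hidden_state

# Mean pooling: average the token embeddings, ignoring padding positions
mask = batch["attention_mask"].unsqueeze(-1).float()
embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
embeddings = F.normalize(embeddings, p=2, dim=1)  # unit length, so dot product = cosine

print((embeddings[0] @ embeddings[1]).item())  # query-passage similarity
```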

## Evaluation Results

The model achieves an **NDCG@10** of **56.09** on the Polish Information Retrieval Benchmark. See the [PIRB Leaderboard](https://huggingface.co/spaces/sdadas/pirb) for detailed results.
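
For reference, NDCG@10 rewards rankings that place relevant documents near the top of the first ten results. The snippet below is a minimal, illustrative computation of the metric for a single query, not the PIRB evaluation code:

```python
import math

def ndcg_at_k(relevances, k=10):
    # `relevances` holds the graded relevance of retrieved documents in
    # ranked order; the score is the DCG of this ranking divided by the
    # DCG of the ideal (sorted) ranking.
    def dcg(rels):
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Toy ranking with relevant documents at positions 1 and 3
print(ndcg_at_k([1, 0, 1, 0, 0, 0, 0, 0, 0, 0]))  # ≈ 0.92
```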

## Acknowledgements

This model was trained with the support of an A100 GPU cluster provided by the Gdansk University of Technology within the TASK center initiative.