sdadas committed on
Commit
1b9bc74
·
verified ·
1 Parent(s): 15b8cd7

Upload README.md

Files changed (1)
  1. README.md +92 -3
README.md CHANGED
@@ -1,3 +1,92 @@
- ---
- license: gemma
- ---
+ ---
+ pipeline_tag: sentence-similarity
+ tags:
+ - sentence-transformers
+ - feature-extraction
+ - sentence-similarity
+ - transformers
+ language: pl
+ license: mit
+ widget:
+ - source_sentence: "zapytanie: Jak dożyć 100 lat?"
+   sentences:
+   - "Trzeba zdrowo się odżywiać i uprawiać sport."
+   - "Trzeba pić alkohol, imprezować i jeździć szybkimi autami."
+   - "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
+
+ ---
+
+ <h1 align="center">Stella-PL-retrieval</h1>
+
+ This is a text encoder based on [stella_en_1.5B_v5](https://huggingface.co/dunzhang/stella_en_1.5B_v5) and further fine-tuned for Polish information retrieval tasks.
+ - In the first step, we adapted the model for Polish using the [multilingual knowledge distillation method](https://aclanthology.org/2020.emnlp-main.365/) on a diverse corpus of 20 million Polish-English text pairs.
+ - In the second step, we fine-tuned the model with a contrastive loss on a dataset of 1.4 million queries. Positive and negative passages for each query were selected with the help of the [BAAI/bge-reranker-v2.5-gemma2-lightweight](https://huggingface.co/BAAI/bge-reranker-v2.5-gemma2-lightweight) reranker. The model was trained for three epochs with a batch size of 1024 queries.
+
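The README does not spell out the exact contrastive objective, so the following is only a minimal sketch of one common formulation (InfoNCE with cosine similarity), written in plain Python. The `temperature` value, the function names, and the toy 3-dimensional vectors are all assumptions made for illustration and are not taken from the training setup.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce_loss(query, positive, negatives, temperature=0.05):
    """InfoNCE loss for one query: negative log-softmax score of the positive.

    `temperature` is a hypothetical value; the README does not state one.
    """
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    # Numerically stable log-sum-exp over the positive and all negatives.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[0]

# Toy 3-dimensional "embeddings" (the real encoder outputs 1024 dimensions).
query = [1.0, 0.0, 0.0]
negatives = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

loss_aligned = info_nce_loss(query, [0.9, 0.1, 0.0], negatives)
loss_misaligned = info_nce_loss(query, [0.0, 0.1, 0.9], negatives)
print(loss_aligned, loss_misaligned)
```

The loss is small when the positive passage is closer to the query than the negatives and grows when it is not, which is the signal that reranker-selected positives and negatives would provide during fine-tuning.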
+ The encoder transforms texts into 1024-dimensional vectors. The model is optimized specifically for Polish information retrieval tasks. If you need a more versatile encoder, suitable for a wider range of tasks such as semantic similarity or clustering, you should probably use the distilled version from the first step: [sdadas/stella-pl](https://huggingface.co/sdadas/stella-pl).
+
+ ## Usage (Sentence-Transformers)
+
+ The model utilizes the same prompts as the original [stella_en_1.5B_v5](https://huggingface.co/dunzhang/stella_en_1.5B_v5).
+
+ For retrieval, queries should be prefixed with **"Instruct: Given a web search query, retrieve relevant passages that answer the query.\nQuery: "**.
+
+ For symmetric tasks such as semantic similarity, both texts should be prefixed with **"Instruct: Retrieve semantically similar text.\nQuery: "**.
+
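One way to keep the two prompts from being mixed up is a small wrapper like the sketch below. The prefix strings are copied from the text above; the `add_prefix` helper and its `task` names are hypothetical and are not part of the model or of sentence-transformers.

```python
# Prompt prefixes quoted from the README text.
RETRIEVAL_PREFIX = (
    "Instruct: Given a web search query, retrieve relevant passages "
    "that answer the query.\nQuery: "
)
SIMILARITY_PREFIX = "Instruct: Retrieve semantically similar text.\nQuery: "

def add_prefix(texts, task):
    """Prepend the task prompt to each text (hypothetical helper)."""
    prefixes = {"retrieval": RETRIEVAL_PREFIX, "similarity": SIMILARITY_PREFIX}
    if task not in prefixes:
        raise ValueError(f"unknown task: {task!r}")
    return [prefixes[task] + text for text in texts]

queries = add_prefix(["Jak dożyć 100 lat?"], task="retrieval")
print(queries[0])
```

For retrieval, only the queries get a prefix; the usage example in this README encodes the answer passages without one.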
+ Please note that the model uses a custom implementation, so you should pass the `trust_remote_code=True` argument when loading it.
+ It is also recommended to use Flash Attention 2, which can be enabled with the `attn_implementation` argument.
+ You can use the model like this with [sentence-transformers](https://www.SBERT.net):
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+ from sentence_transformers.util import cos_sim
+
+ # Flash Attention 2 requires a CUDA GPU and the flash-attn package.
+ model = SentenceTransformer(
+     "sdadas/stella-pl-retrieval",
+     trust_remote_code=True,
+     device="cuda",
+     model_kwargs={"attn_implementation": "flash_attention_2", "trust_remote_code": True}
+ )
+ model.bfloat16()
+
+ # Retrieval example
+ query_prefix = "Instruct: Given a web search query, retrieve relevant passages that answer the query.\nQuery: "
+ # Polish query: "How can one live to be 100?"
+ queries = [query_prefix + "Jak dożyć 100 lat?"]
+ answers = [
+     # "You need to eat healthily and do sports."
+     "Trzeba zdrowo się odżywiać i uprawiać sport.",
+     # "You need to drink alcohol, party, and drive fast cars."
+     "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
+     # "During the campaign, politicians promised to deal with the Sunday trading ban."
+     "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
+ ]
+ queries_emb = model.encode(queries, convert_to_tensor=True, show_progress_bar=False)
+ answers_emb = model.encode(answers, convert_to_tensor=True, show_progress_bar=False)
+ # Pick the answer with the highest cosine similarity to the single query.
+ best_answer = cos_sim(queries_emb, answers_emb).argmax().item()
+ print(answers[best_answer])
+
+ # Semantic similarity example: every sentence gets the same prefix.
+ sim_prefix = "Instruct: Retrieve semantically similar text.\nQuery: "
+ sentences = [
+     sim_prefix + "Trzeba zdrowo się odżywiać i uprawiać sport.",
+     # "It is worth leading a healthy lifestyle that includes physical activity and diet."
+     sim_prefix + "Warto jest prowadzić zdrowy tryb życia, uwzględniający aktywność fizyczną i dietę.",
+     sim_prefix + "One should eat healthy and engage in sports.",
+     # "You confirm purchases with a PIN that you set securely during activation."
+     sim_prefix + "Zakupy potwierdzasz PINem, który bezpiecznie ustalisz podczas aktywacji."
+ ]
+ emb = model.encode(sentences, convert_to_tensor=True, show_progress_bar=False)
+ # Pairwise cosine similarity matrix between all sentences.
+ print(cos_sim(emb, emb))
+
+ ```
+
+ ## Evaluation Results
+
+ The model achieves an **NDCG@10** of **62.32** on the Polish Information Retrieval Benchmark (PIRB). See the [PIRB Leaderboard](https://huggingface.co/spaces/sdadas/pirb) for detailed results.
+
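For readers unfamiliar with the metric, here is a simplified sketch of how NDCG@10 can be computed for a single query. It assumes linear gains and derives the ideal ordering from the retrieved list itself, whereas a full evaluation would use all judged documents for the query; the function names are illustrative and this is not the benchmark's evaluation code. The reported 62.32 appears to be on a 0 to 100 scale, while this sketch returns values between 0 and 1.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k results, in ranked order."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG normalized by the DCG of the ideal (descending-relevance) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance labels of the retrieved passages, listed in the order they were ranked.
perfect = ndcg_at_k([1, 1, 0, 0])  # relevant passages ranked first
swapped = ndcg_at_k([0, 1, 1, 0])  # an irrelevant passage ranked first
print(perfect, swapped)
```

Rankings that place relevant passages lower are penalized by the logarithmic discount, so `swapped` scores below `perfect`.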
+ ## Citation
+
+ ```bibtex
+ @article{dadas2024pirb,
+   title={{PIRB}: A Comprehensive Benchmark of Polish Dense and Hybrid Text Retrieval Methods},
+   author={Sławomir Dadas and Michał Perełkiewicz and Rafał Poświata},
+   year={2024},
+   eprint={2402.13350},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL}
+ }
+ ```