---
language: de
datasets:
- germandpr-beir
pipeline_tag: sentence-similarity
tags:
- information retrieval
- ir
- documents retrieval
- passage retrieval
- beir
- benchmark
- qrel
- sts
- semantic search
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
task_categories:
- sentence-similarity
- feature-extraction
- text-retrieval
- other
task_ids:
- document-retrieval
---

# Model card for PM-AI/bi-encoder_msmarco_bert-base_german

## Model summary
This model can be used for **semantic search** and **document retrieval** to find relevant passages based on a query.
It was trained on a machine-translated **MSMARCO dataset** for _German_ with **hard negatives** and **Margin MSE loss**.
Combining these elements yields a SOTA transformer for asymmetric search.
Details are presented below.

The model can be used easily with the [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) library.

tl;dr ... [go to evaluation results first](#evaluation)

## Training Data
The model was trained on samples from the **[MSMARCO Passage Ranking](https://microsoft.github.io/msmarco/#ranking)** dataset.
It contains about 500,000 questions and 8.8 million passages.
The training objective is to identify the relevant passages or answers for an input question.
The texts cover diverse domains; questions appear both as full sentences and as keyword-based variants.
Consequently, models trained on MSMARCO can be used in a variety of domains.

The dataset was originally published in English, but has been translated into other languages by researchers with the help of machine translation.
Specifically, **"[mMARCO: A Multilingual Version of the MS MARCO Passage Ranking Dataset](https://arxiv.org/abs/2108.13897)"** is used, which contains 13 Google-based translations; German is one of them.

An existing script from the [BEIR framework](https://openreview.net/forum?id=wCu6T5xFjeJ) was used for the training - more details will follow later.
This script requires a certain structure for parsing the training data, which [unicamp-dl/mmarco](https://huggingface.co/datasets/unicamp-dl/mmarco) does not provide.
UKP Lab (TU Darmstadt) created an appropriately [processed mmarco](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/) version, but it cannot be used because it contains outdated texts from an older version of unicamp-dl/mmarco (it uses a MarianNMT-based translation instead of Google's).
Since the textual quality of the older version is poorer, a workaround is necessary in order to use the training data translated by Google.

BEIR requires the following structure for the training data when using the `GenericDataLoader`:
- `corpus.jsonl`: contains one JSON string per line with `_id`, `title` and `text`.
  - Example: `{"_id": "1234", "title": "", "text": "some text"}`
- `queries.jsonl`: contains one JSON string per line with `_id` and `text`.
  - Example: `{"_id": "5678", "text": "a question?"}`
- `qrels/dev.tsv`: represents the relation between a question (`query-id`) and a correct answer (`corpus-id`). The `score` column is mandatory, but always 1.
  - Example: `5678 1234 1`
- `qrels/train.tsv`: structure is identical to `dev.tsv`.
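For illustration, this structure can be produced with the standard library alone (a toy example reusing the IDs from the list above, not real MSMARCO data):

```python
import json
import os

out_dir = "beir_toy"
os.makedirs(os.path.join(out_dir, "qrels"), exist_ok=True)

# corpus.jsonl: one JSON object per line with _id, title and text
with open(os.path.join(out_dir, "corpus.jsonl"), "w", encoding="utf-8") as f:
    f.write(json.dumps({"_id": "1234", "title": "", "text": "some text"}, ensure_ascii=False) + "\n")

# queries.jsonl: one JSON object per line with _id and text
with open(os.path.join(out_dir, "queries.jsonl"), "w", encoding="utf-8") as f:
    f.write(json.dumps({"_id": "5678", "text": "a question?"}, ensure_ascii=False) + "\n")

# qrels/train.tsv: header row plus tab-separated query-id, corpus-id, score rows
with open(os.path.join(out_dir, "qrels", "train.tsv"), "w", encoding="utf-8") as f:
    f.write("query-id\tcorpus-id\tscore\n")
    f.write("5678\t1234\t1\n")

# Read back the qrel row: query 5678 is answered by document 1234
with open(os.path.join(out_dir, "qrels", "train.tsv"), encoding="utf-8") as f:
    qrel_row = f.readlines()[1]
print(qrel_row.split("\t"))
```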

Note: Instead of using `GenericDataLoader`, it is also possible to use `HFDataLoader`.
In this case, a [Huggingface dataset](https://huggingface.co/docs/datasets/index) is loaded directly, i.e. no individual files have to be created manually.
Nevertheless, this approach also requires a specific structure: two dataset repositories are needed (one for `queries` and `corpus`, another for `qrels`), and specific subset names must be defined.
Overall, the effort is larger, because new datasets have to be created (and uploaded to Huggingface Datasets).
The variant presented here uses existing datasets that are only minimally adapted and thus offers maximum compatibility.

The custom-made script [mmarco_beir.py](https://huggingface.co/PM-AI/bi-encoder_msmarco_bert-base_german/blob/main/mmarco_beir.py) contains all necessary adaptations for BEIR compatibility.
It can be applied to all 14 languages of the mmarco dataset so that corresponding models can be trained comfortably.
 
```python
# mmarco_beir.py

import json
import os
import urllib.request

import datasets

# see https://huggingface.co/datasets/unicamp-dl/mmarco for supported languages
LANGUAGE = "german"
# target directory containing BEIR (https://github.com/beir-cellar/beir) compatible files
OUT_DIR = f"mmarco-google/{LANGUAGE}/"

os.makedirs(OUT_DIR, exist_ok=True)

# download Google-based collection/corpus translation of MSMARCO and write corpus.jsonl for BEIR compatibility
mmarco_ds = datasets.load_dataset("unicamp-dl/mmarco", f"collection-{LANGUAGE}")
with open(os.path.join(OUT_DIR, "corpus.jsonl"), "w", encoding="utf-8") as out_file:
    for entry in mmarco_ds["collection"]:
        entry = {"_id": str(entry["id"]), "title": "", "text": entry["text"]}
        out_file.write(f'{json.dumps(entry, ensure_ascii=False)}\n')

# download Google-based queries translation of MSMARCO and write queries.jsonl for BEIR compatibility
mmarco_ds = datasets.load_dataset("unicamp-dl/mmarco", f"queries-{LANGUAGE}")
mmarco_ds = datasets.concatenate_datasets([mmarco_ds["train"], mmarco_ds["dev.full"]])
with open(os.path.join(OUT_DIR, "queries.jsonl"), "w", encoding="utf-8") as out_file:
    for entry in mmarco_ds:
        entry = {"_id": str(entry["id"]), "text": entry["text"]}
        out_file.write(f'{json.dumps(entry, ensure_ascii=False)}\n')

QRELS_DIR = os.path.abspath(os.path.join(OUT_DIR, "../qrels/"))
os.makedirs(QRELS_DIR, exist_ok=True)

# download qrels from URL instead of HF dataset
# note: qrels are language-independent
for link in ["https://huggingface.co/datasets/BeIR/msmarco-qrels/resolve/main/dev.tsv",
             "https://huggingface.co/datasets/BeIR/msmarco-qrels/resolve/main/train.tsv"]:
    urllib.request.urlretrieve(link, os.path.join(QRELS_DIR, os.path.basename(link)))
```

## Training
The training is run using the **[BEIR Benchmark Framework](https://github.com/beir-cellar/beir)**.
It is mainly used to create benchmarks for information retrieval, but it also provides training scripts that produce SOTA models.

The approach of training on the MSMARCO dataset with the Margin MSE loss is particularly promising.
For this purpose, BEIR provides [train_msmarco_v3_margin_MSE.py](https://github.com/beir-cellar/beir/blob/main/examples/retrieval/training/train_msmarco_v3_margin_MSE.py).
Its unique feature is the use of so-called "hard negatives", which were created by a special approach:

> We use the MSMARCO Hard Negatives File (Provided by Nils Reimers): https://sbert.net/datasets/msmarco-hard-negatives.jsonl.gz
> Negative passages are hard negative examples, that were mined using different dense embedding, cross-encoder methods and lexical search methods.
> It contains up to 50 negatives for each of the four retrieval systems: [bm25, msmarco-distilbert-base-tas-b, msmarco-MiniLM-L-6-v3, msmarco-distilbert-base-v3]
> Each positive and negative passage comes with a score from a Cross-Encoder (msmarco-MiniLM-L-6-v3). This allows denoising, i.e. removing false negative passages that are actually relevant for the query.
>
> [Source](https://github.com/beir-cellar/beir/blob/main/examples/retrieval/training/train_msmarco_v3_margin_MSE.py)

> MarginMSELoss is based on the paper of Hofstätter et al. As for MultipleNegativesRankingLoss, we have triplets: (query, passage1, passage2). In contrast to MultipleNegativesRankingLoss, passage1 and passage2 do not have to be strictly positive/negative, both can be relevant or not relevant for a given query.
> We then compute the Cross-Encoder score for (query, passage1) and (query, passage2). We provide scores for 160 million such pairs in our msmarco-hard-negatives dataset. We then compute the distance: CE_distance = CEScore(query, passage1) - CEScore(query, passage2)
> For our bi-encoder training, we encode query, passage1, and passage2 into vector spaces and then measure the dot-product between (query, passage1) and (query, passage2). Again, we measure the distance: BE_distance = DotScore(query, passage1) - DotScore(query, passage2)
> We then want to ensure that the distance predicted by the bi-encoder is close to the distance predicted by the cross-encoder, i.e., we optimize the mean-squared error (MSE) between CE_distance and BE_distance.
> An advantage of MarginMSELoss compared to MultipleNegativesRankingLoss is that we don't require a positive and negative passage. As mentioned before, MS MARCO is redundant, and many passages contain the same or similar content. With MarginMSELoss, we can train on two relevant passages without issues: In that case, the CE_distance will be smaller and we expect that our bi-encoder also puts both passages closer in the vector space.
> A disadvantage of MarginMSELoss is the slower training time: We need way more epochs to get good results. In MultipleNegativesRankingLoss, with a batch size of 64, we compare one query against 128 passages. With MarginMSELoss, we compare a query only against two passages.
>
> [Source](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/ms_marco/README.md)
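The two distances in the quote can be made concrete with a toy calculation (all scores are made up, pure Python; not the actual training code):

```python
# Made-up cross-encoder scores for (query, passage1) and (query, passage2)
ce_score_1, ce_score_2 = 8.2, 3.1
ce_distance = ce_score_1 - ce_score_2  # the "teacher" distance

# Made-up bi-encoder dot-product scores for the same two pairs
be_score_1, be_score_2 = 95.0, 91.5
be_distance = be_score_1 - be_score_2  # the "student" distance

# Margin MSE: squared error between the two distances
loss = (ce_distance - be_distance) ** 2
print(loss)
```

Training pushes the bi-encoder's score *differences* toward the cross-encoder's, rather than forcing absolute positive/negative labels.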

Since the MSMARCO dataset has been translated into different languages and the "hard negatives" file contains only the IDs of queries and texts, the approach just presented can also be applied to languages other than English.
The previous section already explained how to create the necessary training data for German; the same can be done comfortably for all of the 14 translations.

Starting the training process requires one final change to the training script beforehand.
The following code shows how the dataset path is resolved and passed correctly to the `GenericDataLoader`:
 
 
```python
import os

from beir.datasets.data_loader import GenericDataLoader

data_path = "./mmarco-google/german"
qrels_path = os.path.abspath(os.path.join(data_path, "../qrels"))
corpus, queries, _ = GenericDataLoader(data_folder=data_path, qrels_folder=qrels_path).load(split="train")
```
 
 
 
 
 
 
 

### Parameterization of training
- **Script:** [train_msmarco_v3_margin_MSE.py](https://github.com/beir-cellar/beir/blob/main/examples/retrieval/training/train_msmarco_v3_margin_MSE.py)
- **Dataset:** mmarco (compatibility established using [mmarco_beir.py](https://huggingface.co/PM-AI/bi-encoder_msmarco_bert-base_german/blob/main/mmarco_beir.py)), train split
- **GPU:** NVIDIA A40 (Driver Version: 515.48.07; CUDA Version: 11.7)
- **Batch Size:** 75
- **Max. Sequence Length:** 350
- **Base Model:** [deepset/gbert-base](https://huggingface.co/deepset/gbert-base)
- **Loss Function:** Margin MSE
- **Epochs:** 10
- **Evaluation Steps:** 10000
- **Warmup Steps:** 1000
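For reference, these values map onto the `fit()` configuration recorded for this training run (AdamW with linear warmup; shown as a plain dict, not as runnable training code; the DataLoader had 6706 batches per epoch):

```python
# fit() configuration recorded for this model; plain dict for reference only
fit_config = {
    "epochs": 10,
    "evaluation_steps": 10_000,
    "warmup_steps": 1000,
    "scheduler": "WarmupLinear",
    "max_grad_norm": 1,
    "weight_decay": 0.01,
    "optimizer_params": {"lr": 2e-05, "eps": 1e-06, "correct_bias": False},
}

# rough total number of optimizer steps: batches per epoch x epochs
total_steps = 6706 * fit_config["epochs"]
print(total_steps)
```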

## Evaluation <a name="evaluation"></a>
The evaluation is based on **[germanDPR](https://arxiv.org/abs/2104.12741)**.
The dataset, developed by [deepset.ai](https://www.deepset.ai), consists of question-answer pairs, supplemented by three "hard negatives" per question.
This makes it an ideal basis for benchmarking.
The dataset is publicly available as **[deepset/germanDPR](https://huggingface.co/datasets/deepset/germandpr)**, but it does not support BEIR by default.
Consequently, this dataset was also reworked manually: duplicate text elements were removed and minimal text adjustments were made.
The details of this process can be found in **[PM-AI/germandpr-beir](https://huggingface.co/datasets/PM-AI/germandpr-beir)**.

The BEIR-compatible germanDPR dataset consists of **9275 questions** and **23993 text passages** for the **train split**.
We use the train split rather than the test split in order to have enough text passages for information retrieval.
The following table shows the evaluation results for different approaches and models:

**model**|**NDCG@1**|**NDCG@10**|**NDCG@100**|**comment**
:-----:|:-----:|:-----:|:-----:|:-----:
bi-encoder_msmarco_bert-base_german (new) | 0.5300 🏆 | 0.7196 🏆 | 0.7360 🏆 | "OUR model"
[deepset/gbert-base-germandpr-X](https://huggingface.co/deepset/gbert-base-germandpr-ctx_encoder) | 0.4828 | 0.6970 | 0.7147 | "has two encoder models (one for queries and one for corpus), is SOTA approach"
[distiluse-base-multilingual-cased-v1](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1) | 0.4561 | 0.6347 | 0.6613 | "trained on 15 languages"
[paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 0.4511 | 0.6328 | 0.6592 | "trained on huge corpus, support for 50+ languages"
[distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) | 0.4350 | 0.6103 | 0.6411 | "trained on 50+ languages"
[paraphrase-multilingual-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2) | 0.4168 | 0.5931 | 0.6237 | "trained on large corpus, support for 50+ languages"
[svalabs/bi-electra-ms-marco-german-uncased](https://huggingface.co/svalabs/bi-electra-ms-marco-german-uncased) | 0.3818 | 0.5663 | 0.5986 | "most similar to OUR model"
[BM25](https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html#bm25) | 0.3196 | 0.5377 | 0.5740 | "lexical approach"
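The NDCG@k metric reported in the table can be sketched with the standard library (binary relevance, as in the qrels above; an illustration, not BEIR's implementation):

```python
import math

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """NDCG@k with binary relevance: gain 1 if a retrieved id is in the qrels."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    # ideal DCG: all relevant documents ranked first
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(min(len(relevant_ids), k)))
    return dcg / idcg if idcg > 0 else 0.0

# Toy ranking: the single relevant passage is retrieved at rank 2
print(ndcg_at_k(["d7", "d3", "d9"], {"d3"}, k=10))
```

A retrieval run scores 1.0 only if every relevant passage is ranked at the top, which is why even the best model in the table stays well below 1.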

**It is crucial to understand that the comparisons also include models based on other transformer approaches.**
In particular, DPR is theoretically a more up-to-date approach, yet it is still beaten.
A direct comparison based on the same approach can be made with the svalabs model; here the model presented outperforms its predecessor by up to 14 percentage points.

Note:
- The texts used for evaluation are sometimes very long. All models except the BM25 approach truncate incoming texts at some point, which can decrease performance.
- The evaluation of deepset's gbert-base-germandpr model might give a wrong impression: it was originally trained on (almost exactly) the same data we use for evaluation.

## Acknowledgment

This work is a collaboration between [Technical University of Applied Sciences Wildau (TH Wildau)](https://en.th-wildau.de/) and [sense.ai.tion GmbH](https://senseaition.com/).
You can contact us via:
* [Philipp Müller (M.Eng.)](https://www.linkedin.com/in/herrphilipps); Author
* [Prof. Dr. Janett Mohnke](mailto:icampus@th-wildau.de); TH Wildau
* [Dr. Matthias Boldt, Jörg Oehmichen](mailto:info@senseaition.com); sense.AI.tion GmbH

This work was funded by the European Regional Development Fund (EFRE) and the State of Brandenburg. Project/Vorhaben: "ProFIT: Natürlichsprachliche Dialogassistenten in der Pflege".

<div style="display:flex">
  <div style="padding-left:20px;">
    <a href="https://efre.brandenburg.de/efre/de/"><img src="https://huggingface.co/datasets/PM-AI/germandpr-beir/resolve/main/res/EFRE-Logo_rechts_oweb_en_rgb.jpeg" alt="Logo of European Regional Development Fund (EFRE)" width="200"/></a>
  </div>
  <div style="padding-left:20px;">
    <a href="https://www.senseaition.com"><img src="https://senseaition.com/wp-content/uploads/thegem-logos/logo_c847aaa8f42141c4055d4a8665eb208d_3x.png" alt="Logo of senseaition GmbH" width="200"/></a>
  </div>
  <div style="padding-left:20px;">
    <a href="https://www.th-wildau.de"><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/f6/TH_Wildau_Logo.png/640px-TH_Wildau_Logo.png" alt="Logo of TH Wildau" width="180"/></a>
  </div>
</div>