Shitao committed
Commit
61f0050
1 Parent(s): 9ebc0ea

Update README.md

Files changed (1)
  1. README.md +25 -11
README.md CHANGED
@@ -4,9 +4,11 @@ tags:
license: mit
---

+
For more details please refer to our github repo: https://github.com/FlagOpen/FlagEmbedding

- # BGE-M3
+ # BGE-M3 ([paper](https://arxiv.org/pdf/2402.03216.pdf), [code](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/BGE_M3))
+
In this project, we introduce BGE-M3, which is distinguished for its versatility in Multi-Functionality, Multi-Linguality, and Multi-Granularity.
- Multi-Functionality: It can simultaneously perform the three common retrieval functionalities of embedding models: dense retrieval, multi-vector retrieval, and sparse retrieval.
- Multi-Linguality: It can support more than 100 working languages.
@@ -23,12 +25,14 @@ Utilizing the re-ranking model (e.g., [bge-reranker](https://github.com/FlagOpen


## News:
+ - 2/6/2024: We release [MLDR](https://huggingface.co/datasets/Shitao/MLDR) (a long document retrieval dataset covering 13 languages) and the [evaluation pipeline](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MLDR).
- 2/1/2024: **Thanks for the excellent tool from Vespa.** You can easily use multiple modes of BGE-M3 following this [notebook](https://github.com/vespa-engine/pyvespa/blob/master/docs/sphinx/source/examples/mother-of-all-embedding-models-cloud.ipynb).


## Specs

- Model
+
| Model Name | Dimension | Sequence Length | Introduction |
|:----:|:---:|:---:|:---:|
| [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) | 1024 | 8192 | multilingual; unified fine-tuning (dense, sparse, and colbert) from bge-m3-unsupervised |
@@ -45,7 +49,6 @@ Utilizing the re-ranking model (e.g., [bge-reranker](https://github.com/FlagOpen
| [MLDR](https://huggingface.co/datasets/Shitao/MLDR) | Document Retrieval Dataset, covering 13 languages |


-
## FAQ

**1. Introduction for different retrieval methods**
@@ -54,7 +57,6 @@ Utilizing the re-ranking model (e.g., [bge-reranker](https://github.com/FlagOpen
- Sparse retrieval (lexical matching): a vector of size equal to the vocabulary, with the majority of positions set to zero, calculating a weight only for tokens present in the text, e.g., BM25, [unicoil](https://arxiv.org/pdf/2106.14807.pdf), and [splade](https://arxiv.org/abs/2107.05720).
- Multi-vector retrieval: use multiple vectors to represent a text, e.g., [ColBERT](https://arxiv.org/abs/2004.12832).

-
**2. Comparison with BGE-v1.5 and other monolingual models**

BGE-M3 is a multilingual model, and its ability in monolingual embedding retrieval may not surpass models specifically designed for single languages.
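As a side note on FAQ 1 in the hunk above, the sketch below shows how all three kinds of representations can be obtained from BGE-M3 with the FlagEmbedding package. It is illustrative only and not part of this commit; the keyword arguments and output keys follow the FlagEmbedding documentation as best I recall and may differ between versions.

```python
# Illustrative sketch, not part of this commit: dense, sparse (lexical), and
# multi-vector (ColBERT) outputs from BGE-M3 via the FlagEmbedding package.
# Argument names and output keys are assumed from the FlagEmbedding docs and
# may vary across versions.
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)  # fp16 for faster encoding

sentences = ["What is BGE M3?", "BM25 is a bag-of-words retrieval function."]
output = model.encode(
    sentences,
    return_dense=True,         # one 1024-d vector per sentence
    return_sparse=True,        # per-token lexical weights (sparse retrieval)
    return_colbert_vecs=True,  # one vector per token (multi-vector retrieval)
)

print(output["dense_vecs"].shape)       # dense embeddings
print(output["lexical_weights"][0])     # token-id -> weight map for sentence 0
print(output["colbert_vecs"][0].shape)  # token-level vectors for sentence 0
```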
@@ -74,6 +76,11 @@ For sparse retrieval methods, most open-source libraries currently do not suppor
Contributions from the community are welcome.


+ In our experiments, we use [Pyserini](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MLDR#hybrid-retrieval-dense--sparse) and Faiss to do hybrid retrieval.
+ **Now you can try the hybrid mode of BGE-M3 in [Vespa](https://github.com/vespa-engine/pyvespa/blob/master/docs/sphinx/source/examples/mother-of-all-embedding-models-cloud.ipynb). Thanks @jobergum.**
+
+
**4. How to fine-tune bge-M3 model?**

You can follow the instructions in this [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune).
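To make the hybrid-retrieval note (FAQ 3) in the hunk above concrete, here is a minimal sketch of weighted dense + sparse + multi-vector scoring with `compute_score`, which the next hunk's context line also references. It is not part of this commit; the `weights_for_different_modes` argument and the structure of the returned scores are assumptions based on the full README and the FlagEmbedding docs, so verify them against your installed version.

```python
# Illustrative sketch, not part of this commit: a simple hybrid (dense + sparse
# + ColBERT) relevance score with BGE-M3. The weights_for_different_modes
# argument and the structure of the returned scores are assumptions taken from
# the FlagEmbedding docs and may differ across versions.
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

sentence_pairs = [
    ["What is BGE M3?", "BGE M3 supports dense, lexical, and multi-vector retrieval."],
    ["What is BGE M3?", "BM25 is a bag-of-words retrieval function."],
]

# Weighted sum over modes: w[0]*dense + w[1]*sparse + w[2]*colbert.
scores = model.compute_score(
    sentence_pairs,
    weights_for_different_modes=[0.4, 0.2, 0.4],
)
print(scores)  # per-mode scores plus the weighted combinations
```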
@@ -215,10 +222,10 @@ print(model.compute_score(sentence_pairs,
- Long Document Retrieval
  - MLDR:
![avatar](./imgs/long.jpg)
- Please note that MLDR is a document retrieval dataset we constructed via LLM,
+ Please note that [MLDR](https://huggingface.co/datasets/Shitao/MLDR) is a document retrieval dataset we constructed via LLM,
covering 13 languages, including test set, validation set, and training set.
We utilized the training set from MLDR to enhance the model's long document retrieval capabilities.
- Therefore, comparing baseline with `Dense w.o.long`(fine-tuning without long document dataset) is more equitable.
+ Therefore, comparing baselines with `Dense w.o.long` (fine-tuning without the long document dataset) is more equitable.
Additionally, this long document retrieval dataset will be open-sourced to address the current lack of open-source multilingual long text retrieval datasets.
We believe that this data will be helpful for the open-source community in training document retrieval models.
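Since MLDR is central to the long-document evaluation discussed in the hunk above, here is a minimal sketch of loading it from the Hugging Face Hub with the `datasets` library. It is not part of this commit, and the config name `"en"` and the split name are assumptions about how the dataset is laid out; check the dataset card at https://huggingface.co/datasets/Shitao/MLDR before relying on them.

```python
# Illustrative sketch, not part of this commit: loading the MLDR long-document
# retrieval dataset. The config name "en" and the split name "test" are
# assumptions; see https://huggingface.co/datasets/Shitao/MLDR for the actual
# layout of the 13 language subsets.
from datasets import load_dataset

mldr_en = load_dataset("Shitao/MLDR", "en", split="test")

print(mldr_en)            # number of examples and column names
print(mldr_en[0].keys())  # inspect one query/document record
```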
 
@@ -234,22 +241,29 @@ The small-batch strategy is simple but effective, which also can used to fine-tu
- MCLS: A simple method to improve the performance on long text without fine-tuning.
If you do not have enough resources to fine-tune the model with long text, this method is useful.

- Refer to our [report](https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/BGE_M3/BGE_M3.pdf) for more details.
+ Refer to our [report](https://arxiv.org/pdf/2402.03216.pdf) for more details.

**The fine-tuning codes and datasets will be open-sourced in the near future.**


-
## Acknowledgement

- Thanks to the authors of open-sourced datasets, including Miracl, MKQA, NarritiveQA, etc.
- Thanks to the open-sourced libraries like [Tevatron](https://github.com/texttron/tevatron), [pyserini](https://github.com/castorini/pyserini).
+ Thanks to the authors of the open-sourced datasets, including MIRACL, MKQA, NarrativeQA, etc.
+ Thanks to the open-sourced libraries like [Tevatron](https://github.com/texttron/tevatron) and [Pyserini](https://github.com/castorini/pyserini).
+


## Citation

- If you find this repository useful, please consider giving a star :star: and a citation
+ If you find this repository useful, please consider giving a star :star: and a citation.

```
-
+ @misc{bge-m3,
+   title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
+   author={Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu},
+   year={2024},
+   eprint={2402.03216},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL}
+ }
```
 