BAAI

Shitao committed "Update README.md" in commit ecc1ac1 (1 parent: 76c437d)

Files changed (1): README.md (+53 -26)
README.md CHANGED
@@ -6,7 +6,20 @@ pipeline_tag: sentence-similarity
---

<h1 align="center">FlagEmbedding</h1>
-
+ <p align="center">
+ <a href="https://www.python.org/">
+ <img alt="Build" src="https://img.shields.io/badge/Contribution-Welcome-blue">
+ </a>
+ <a href="https://github.com/FlagOpen/FlagEmbedding/blob/master/LICENSE">
+ <img alt="License" src="https://img.shields.io/badge/LICENSE-MIT-green">
+ </a>
+ <a href="https://huggingface.co/C-MTEB">
+ <img alt="Build" src="https://img.shields.io/badge/C_MTEB-🤗-yellow">
+ </a>
+ <a href="https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding">
+ <img alt="Build" src="https://img.shields.io/badge/FlagEmbedding-1.0.1-red">
+ </a>
+ </p>

<h4 align="center">
    <p>
@@ -19,17 +32,16 @@ pipeline_tag: sentence-similarity
    <p>
</h4>

- More details please refer to our github: [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding).

- [English](README.md) | [中文](README_zh.md)
+ [English](README.md) | [中文](https://github.com/FlagOpen/FlagEmbedding/blob/master/README_zh.md)

FlagEmbedding can map any text to a low-dimensional dense vector which can be used for tasks like retrieval, classification, clustering, or semantic search.
It can also be used in vector databases for LLMs.

************* 🌟**Updates**🌟 *************
- 08/05/2023: Release base-scale and small-scale models, **best performance among the models of the same size 🤗**
- - 08/02/2023: Release `bge-large-*`(short for BAAI General Embedding) Models, **rank 1st on MTEB and C-MTEB benchmark!**
- - 08/01/2023: We release the Chinese Massive Text Embedding Benchmark (**C-MTEB**), consisting of 31 test dataset.
+ - 08/02/2023: Release `bge-large-*` (short for BAAI General Embedding) models, **rank 1st on the MTEB and C-MTEB benchmarks!** :tada: :tada:
+ - 08/01/2023: We release the [Chinese Massive Text Embedding Benchmark](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB) (**C-MTEB**), consisting of 31 test datasets.


## Model List
@@ -38,12 +50,12 @@ It can also be used in vector databases for LLMs.

| Model | Language | Description | query instruction for retrieval |
|:-------------------------------|:--------:| :--------:| :--------:|
- | [BAAI/bge-large-en](https://huggingface.co/BAAI/bge-large-en) | English | **rank 1st** in [MTEB](https://huggingface.co/spaces/mteb/leaderboard) leaderboard | `Represent this sentence for searching relevant passages: ` |
- | [BAAI/bge-base-en](https://huggingface.co/BAAI/bge-base-en) | English | **rank 2nd** in [MTEB](https://huggingface.co/spaces/mteb/leaderboard) leaderboard | `Represent this sentence for searching relevant passages: ` |
+ | [BAAI/bge-large-en](https://huggingface.co/BAAI/bge-large-en) | English | :trophy: rank **1st** in [MTEB](https://huggingface.co/spaces/mteb/leaderboard) leaderboard | `Represent this sentence for searching relevant passages: ` |
+ | [BAAI/bge-base-en](https://huggingface.co/BAAI/bge-base-en) | English | rank **2nd** in [MTEB](https://huggingface.co/spaces/mteb/leaderboard) leaderboard | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-small-en](https://huggingface.co/BAAI/bge-small-en) | English | a small-scale model but with competitive performance | `Represent this sentence for searching relevant passages: ` |
- | [BAAI/bge-large-zh](https://huggingface.co/BAAI/bge-large-zh) | Chinese | **rank 1st** in [C-MTEB](https://github.com/FlagOpen/FlagEmbedding/tree/master/benchmark) benchmark | `为这个句子生成表示以用于检索相关文章：` |
- | [BAAI/bge-large-zh-noinstruct](https://huggingface.co/BAAI/bge-large-zh-noinstruct) | Chinese | This model is trained without instruction, and **rank 2nd** in [C-MTEB](https://github.com/FlagOpen/FlagEmbedding/tree/master/benchmark) benchmark | |
- | [BAAI/bge-base-zh](https://huggingface.co/BAAI/bge-base-zh) | Chinese | a base-scale model but with competitive performance | `为这个句子生成表示以用于检索相关文章：` |
+ | [BAAI/bge-large-zh](https://huggingface.co/BAAI/bge-large-zh) | Chinese | :trophy: rank **1st** in [C-MTEB](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB) benchmark | `为这个句子生成表示以用于检索相关文章：` |
+ | [BAAI/bge-large-zh-noinstruct](https://huggingface.co/BAAI/bge-large-zh-noinstruct) | Chinese | This model is trained without instruction and ranks **2nd** in the [C-MTEB](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB) benchmark | |
+ | [BAAI/bge-base-zh](https://huggingface.co/BAAI/bge-base-zh) | Chinese | a base-scale model with ability similar to `bge-large-zh` | `为这个句子生成表示以用于检索相关文章：` |
| [BAAI/bge-small-zh](https://huggingface.co/BAAI/bge-small-zh) | Chinese | a small-scale model but with competitive performance | `为这个句子生成表示以用于检索相关文章：` |

@@ -52,10 +64,12 @@ It can also be used in vector databases for LLMs.

* **Using FlagEmbedding**
```
- pip install flag_embedding
+ pip install FlagEmbedding
```
+ See [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/baai_general_embedding/README.md) for more installation methods.
+
```python
- from flag_embedding import FlagModel
+ from FlagEmbedding import FlagModel
sentences = ["样例数据-1", "样例数据-2"]
model = FlagModel('BAAI/bge-large-zh', query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章：")
embeddings = model.encode(sentences)
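# --- Illustrative continuation (not part of the hunk above) ---
# The hunk ends inside the Python example. For retrieval, the repository's
# examples use encode_queries() for queries, which prepends
# query_instruction_for_retrieval, and encode() for passages; treat these
# helpers and the sample data below as assumptions if your installed
# version differs.
queries = ["样例查询"]
passages = ["样例文章-1", "样例文章-2"]
q_embeddings = model.encode_queries(queries)   # instruction + query
p_embeddings = model.encode(passages)          # passages encoded as-is
scores = q_embeddings @ p_embeddings.T         # inner-product relevance scores
print(scores)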
@@ -134,7 +148,7 @@ print("Sentence embeddings:", sentence_embeddings)

## Evaluation
`baai-general-embedding` models achieve **state-of-the-art performance on both the MTEB and C-MTEB leaderboards!**
- More details and evaluation scripts see [benchemark](benchmark/README.md).
+ For more details and evaluation tools, see our [scripts](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB/README.md).

- **MTEB**:

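The MTEB results referenced above can be reproduced with the public evaluation harness. A minimal sketch, assuming the open-source `mteb` package and `sentence-transformers`; the two task names are an illustrative subset rather than the full benchmark:

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Evaluate a released checkpoint on a small, illustrative subset of MTEB tasks.
model = SentenceTransformer("BAAI/bge-small-en")
evaluation = MTEB(tasks=["Banking77Classification", "STSBenchmark"])
evaluation.run(model, output_folder="results/bge-small-en")
```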
@@ -162,8 +176,8 @@ More details and evaluation scripts see [benchemark](benchmark/README.md).


- **C-MTEB**:
We create a benchmark C-MTEB for Chinese text embedding which consists of 31 datasets from 6 tasks.
- Please refer to [benchemark](benchmark/README.md) for a detailed introduction.
+ Please refer to [C_MTEB](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB/README.md) for a detailed introduction.

| Model | Embedding dimension | Avg | Retrieval | STS | PairClassification | Classification | Reranking | Clustering |
|:-------------------------------|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|
@@ -180,18 +194,17 @@ Please refer to [benchemark](benchmark/README.md) for a detailed introduction.



-
## Train
This section will introduce the way we used to train the general embedding.
- The training scripts are in [flag_embedding](https://github.com/FlagOpen/FlagEmbedding/tree/master/flag_embedding/baai_general_embedding/),
- and we provide some examples to do [pre-train](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/pretrain/) and [fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune).
+ The training scripts are in [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/baai_general_embedding/README.md),
+ and we provide some examples of how to [pre-train](https://github.com/FlagOpen/FlagEmbedding/blob/master/examples/pretrain/README.md) and [fine-tune](https://github.com/FlagOpen/FlagEmbedding/blob/master/examples/finetune/README.md).


**1. RetroMAE Pre-train**
We pre-train the model following the method [retromae](https://github.com/staoxiao/RetroMAE),
which shows promising improvement in retrieval tasks ([paper](https://aclanthology.org/2022.emnlp-main.35.pdf)).
The pre-training was conducted on 24 A100(40G) GPUs with a batch size of 720.
In RetroMAE, the mask ratios of the encoder and decoder are 0.3 and 0.5, respectively.
We used the AdamW optimizer and the learning rate is 2e-5.

**Pre-training data**:
@@ -208,33 +221,47 @@ We used the AdamW optimizer and the learning rate is 2e-5.
We fine-tune the model using a contrastive objective.
The format of input data is a triple `(query, positive, negative)`.
Besides the negative in the triple, we also adopt an in-batch negatives strategy.
We employ the cross-device negatives sharing method to share negatives among different GPUs,
which can dramatically **increase the number of negatives**.

We trained our model on 48 A100(40G) GPUs with a large batch size of 32,768 (so there are **65,535** negatives for each query in a batch).
We used the AdamW optimizer and the learning rate is 1e-5.
The temperature for contrastive loss is 0.01.

For the version with `*-instruction`, we add the instruction to the query for the retrieval task during training.
For English, the instruction is `Represent this sentence for searching relevant passages: `;
For Chinese, the instruction is `为这个句子生成表示以用于检索相关文章：`.
In the evaluation, the instruction should be added for the sentence-to-passage retrieval task, but not for other tasks.


- The finetune script is accessible in this repository: [flag_embedding](https://github.com/FlagOpen/FlagEmbedding/tree/master/flag_embedding/baai_general_embedding/README.md).
+ The finetune script is accessible in this repository: [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/baai_general_embedding/README.md).
You can easily finetune your model with it.

**Training data**:

- For English, we collect 230M text pairs from [wikipedia](https://huggingface.co/datasets/wikipedia), [cc-net](https://github.com/facebookresearch/cc_net), and so on.

- For Chinese, we collect 120M text pairs from [wudao](https://github.com/BAAI-WuDao/Data), [simclue](https://github.com/CLUEbenchmark/SimCLUE) and so on.

**The data collection is to be released in the future.**

+
+ ## Schedule
+ - [x] Chinese Massive Text Embedding Benchmark
+ - [x] release baai-general-embedding models
+ - [x] release codes for training
+ - [ ] Training Datasets
+ - [ ] Multilingual model
+ - [ ] ...
+
We will continually update the embedding models and training codes,
hoping to promote the development of the embedding model community.


+ ## Contact
+ If you have any questions or suggestions about this project, feel free to open an issue or a pull request.
+ You can also email Shitao Xiao (stxiao@baai.ac.cn) and Zheng Liu (liuzheng@baai.ac.cn).
+
+
## License
- FlagEmbedding is licensed under [MIT License](). The released models can be used for commercial purposes free of charge.
+ FlagEmbedding is licensed under the [MIT License](https://github.com/FlagOpen/FlagEmbedding/blob/master/LICENSE). The released models can be used for commercial purposes free of charge.
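A minimal sketch of the fine-tuning objective described above: an InfoNCE-style contrastive loss over `(query, positive)` pairs with in-batch negatives and a temperature of 0.01. This is illustrative PyTorch, not the repository's training code, and the cross-device negative sharing across GPUs is omitted:

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(q_emb: torch.Tensor, p_emb: torch.Tensor,
                              temperature: float = 0.01) -> torch.Tensor:
    # q_emb, p_emb: [batch, dim]; row i of p_emb is the positive for query i,
    # and every other row in the batch serves as a negative.
    q_emb = F.normalize(q_emb, dim=-1)
    p_emb = F.normalize(p_emb, dim=-1)
    logits = q_emb @ p_emb.T / temperature                      # [batch, batch] similarity matrix
    labels = torch.arange(q_emb.size(0), device=q_emb.device)   # positives lie on the diagonal
    return F.cross_entropy(logits, labels)
```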
 