ldwang committed
Commit 3f9190f
1 Parent(s): 66577c6

update readme

Files changed (1): README.md (+64, -33)
README.md CHANGED
@@ -10,6 +10,7 @@ language:
  - zh
  ---
 
+
  <h1 align="center">FlagEmbedding</h1>
 
 
@@ -19,20 +20,22 @@ language:
19
  <a href=#usage>Usage</a> |
20
  <a href="#evaluation">Evaluation</a> |
21
  <a href="#train">Train</a> |
 
22
  <a href="#license">License</a>
23
  <p>
24
  </h4>
25
 
26
- For more details please refer to our GitHub repo: [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding).
27
 
28
  [English](README.md) | [中文](https://github.com/FlagOpen/FlagEmbedding/blob/master/README_zh.md)
29
 
30
  FlagEmbedding can map any text to a low-dimensional dense vector which can be used for tasks like retrieval, classification, clustering, or semantic search.
31
- And it also can be used in vector databases for LLMs.
32
 
33
  ************* 🌟**Updates**🌟 *************
 
34
  - 08/05/2023: Release base-scale and small-scale models, **best performance among the models of the same size 🤗**
35
- - 08/02/2023: Release `bge-large-*`(short for BAAI General Embedding) Models, **rank 1st on MTEB and C-MTEB benchmark!** :tada: :tada:
36
  - 08/01/2023: We release the [Chinese Massive Text Embedding Benchmark](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB) (**C-MTEB**), consisting of 31 test dataset.
37
 
38
 
@@ -40,37 +43,42 @@ And it also can be used in vector databases for LLMs.
 
  `bge` is short for `BAAI general embedding`.
 
- | Model | Language | Description | query instruction for retrieval |
+ | Model | Language | Description | query instruction for retrieval\* |
  |:-------------------------------|:--------:| :--------:| :--------:|
- | [BAAI/bge-large-en](https://huggingface.co/BAAI/bge-large-en) | English | :trophy: rank **1st** in [MTEB](https://huggingface.co/spaces/mteb/leaderboard) leaderboard | `Represent this sentence for searching relevant passages: ` |
+ | [BAAI/bge-large-en](https://huggingface.co/BAAI/bge-large-en) | English | rank **1st** in [MTEB](https://huggingface.co/spaces/mteb/leaderboard) leaderboard | `Represent this sentence for searching relevant passages: ` |
  | [BAAI/bge-base-en](https://huggingface.co/BAAI/bge-base-en) | English | rank **2nd** in [MTEB](https://huggingface.co/spaces/mteb/leaderboard) leaderboard | `Represent this sentence for searching relevant passages: ` |
  | [BAAI/bge-small-en](https://huggingface.co/BAAI/bge-small-en) | English | a small-scale model with competitive performance | `Represent this sentence for searching relevant passages: ` |
- | [BAAI/bge-large-zh](https://huggingface.co/BAAI/bge-large-zh) | Chinese | :trophy: rank **1st** in [C-MTEB](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB) benchmark | `为这个句子生成表示以用于检索相关文章:` |
+ | [BAAI/bge-large-zh](https://huggingface.co/BAAI/bge-large-zh) | Chinese | rank **1st** in [C-MTEB](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB) benchmark | `为这个句子生成表示以用于检索相关文章:` |
  | [BAAI/bge-large-zh-noinstruct](https://huggingface.co/BAAI/bge-large-zh-noinstruct) | Chinese | trained without instruction and ranks **2nd** in the [C-MTEB](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB) benchmark | |
  | [BAAI/bge-base-zh](https://huggingface.co/BAAI/bge-base-zh) | Chinese | a base-scale model with ability similar to `bge-large-zh` | `为这个句子生成表示以用于检索相关文章:` |
  | [BAAI/bge-small-zh](https://huggingface.co/BAAI/bge-small-zh) | Chinese | a small-scale model with competitive performance | `为这个句子生成表示以用于检索相关文章:` |
 
-
+ \*: If you need to search for **long** relevant passages with a **short** query (the s2p retrieval task), add the instruction to the query; in other cases no instruction is needed, just use the original query directly. In all cases, **no instruction** needs to be added to the passages.
 
  ## Usage
 
- * **Using FlagEmbedding**
+ Here are some examples of using `bge` models with
+ [FlagEmbedding](#using-flagembedding), [Sentence-Transformers](#using-sentence-transformers), [Langchain](#using-langchain), or [Huggingface Transformers](#using-huggingface-transformers).
+
+ #### Using FlagEmbedding
  ```
  pip install -U FlagEmbedding
  ```
- See [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/baai_general_embedding/README.md) for more methods to install FlagEmbedding.
+ If this does not work for you, see [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/baai_general_embedding/README.md) for other ways to install FlagEmbedding.
 
  ```python
  from FlagEmbedding import FlagModel
  sentences = ["样例数据-1", "样例数据-2"]
  model = FlagModel('BAAI/bge-large-zh', query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:")
- embeddings = model.encode(sentences)
- print(embeddings)
+ embeddings_1 = model.encode(sentences)
+ embeddings_2 = model.encode(sentences)
+ similarity = embeddings_1 @ embeddings_2.T
+ print(similarity)
 
- # for retrieval task, please use encode_queries() which will automatically add the instruction to each query
- # corpus in retrieval task can still use encode() or encode_corpus()
+ # for the s2p (short query to long passage) retrieval task, use encode_queries(), which automatically adds the instruction to each query
+ # the corpus in a retrieval task can still use encode() or encode_corpus(), since passages need no instruction
  queries = ['query_1', 'query_2']
- passages = ["样例段落-1", "样例段落-2"]
+ passages = ["样例文档-1", "样例文档-2"]
  q_embeddings = model.encode_queries(queries)
  p_embeddings = model.encode(passages)
  scores = q_embeddings @ p_embeddings.T
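For reference, the score matrix above can be turned into a per-query ranking with a few lines of numpy. This is an illustrative sketch rather than part of the README diff; it assumes `encode_queries()`/`encode()` return numpy arrays, as in the example.

```python
import numpy as np
from FlagEmbedding import FlagModel

# Same setup as the example above (bge-large-zh with the Chinese retrieval instruction).
model = FlagModel('BAAI/bge-large-zh', query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:")
queries = ['query_1', 'query_2']
passages = ["样例文档-1", "样例文档-2"]

q_embeddings = model.encode_queries(queries)  # the instruction is prepended to each query automatically
p_embeddings = model.encode(passages)         # passages are encoded without any instruction
scores = q_embeddings @ p_embeddings.T        # shape: (num_queries, num_passages)

# Rank passages for each query, highest score first.
for query, row in zip(queries, scores):
    order = np.argsort(row)[::-1]
    print(query, [(passages[i], float(row[i])) for i in order])
```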
@@ -80,7 +88,7 @@ The value of argument `query_instruction_for_retrieval` see [Model List](https:/
  FlagModel will use all available GPUs when encoding; please set `os.environ["CUDA_VISIBLE_DEVICES"]` to choose the GPU(s).
 
 
- * **Using Sentence-Transformers**
+ #### Using Sentence-Transformers
 
  Using this model is also easy when you have [sentence-transformers](https://www.SBERT.net) installed:
 
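As a concrete illustration of the `CUDA_VISIBLE_DEVICES` note above (a minimal sketch, not taken from the diff), the variable has to be set before the model, and hence CUDA, is initialized:

```python
import os

# Restrict encoding to GPU 0; set this before FlagModel (and CUDA) is initialized.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from FlagEmbedding import FlagModel

model = FlagModel('BAAI/bge-large-zh', query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:")
embeddings = model.encode(["样例数据-1", "样例数据-2"])
print(embeddings.shape)  # (2, embedding_dim)
```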
@@ -91,15 +99,18 @@ pip install -U sentence-transformers
  from sentence_transformers import SentenceTransformer
  sentences = ["样例数据-1", "样例数据-2"]
  model = SentenceTransformer('BAAI/bge-large-zh')
- embeddings = model.encode(sentences, normalize_embeddings=True)
- print(embeddings)
+ embeddings_1 = model.encode(sentences, normalize_embeddings=True)
+ embeddings_2 = model.encode(sentences, normalize_embeddings=True)
+ similarity = embeddings_1 @ embeddings_2.T
+ print(similarity)
  ```
- For retrieval task,
- each query should start with an instruction (instructions see [Model List](https://github.com/FlagOpen/FlagEmbedding/tree/master#model-list)).
+ For the s2p (short query to long passage) retrieval task,
+ each short query should start with an instruction (see [Model List](https://github.com/FlagOpen/FlagEmbedding/tree/master#model-list) for the instructions).
+ The instruction is not needed for passages.
  ```python
  from sentence_transformers import SentenceTransformer
- queries = ["手机开不了机怎么办?"]
- passages = ["样例段落-1", "样例段落-2"]
+ queries = ['query_1', 'query_2']
+ passages = ["样例文档-1", "样例文档-2"]
  instruction = "为这个句子生成表示以用于检索相关文章:"
 
  model = SentenceTransformer('BAAI/bge-large-zh')
@@ -108,7 +119,23 @@ p_embeddings = model.encode(passages, normalize_embeddings=True)
  scores = q_embeddings @ p_embeddings.T
  ```
 
- * **Using HuggingFace Transformers**
+ #### Using Langchain
+
+ You can use `bge` in Langchain like this:
+ ```python
+ from langchain.embeddings import HuggingFaceBgeEmbeddings
+ model_name = "BAAI/bge-small-en"
+ model_kwargs = {'device': 'cuda'}
+ encode_kwargs = {'normalize_embeddings': True} # set True to compute cosine similarity
+ model_norm = HuggingFaceBgeEmbeddings(
+     model_name=model_name,
+     model_kwargs=model_kwargs,
+     encode_kwargs=encode_kwargs
+ )
+ ```
+
+
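The wrapper added above can then be used through LangChain's generic embeddings interface; `embed_documents` and `embed_query` are the standard `Embeddings` methods. A hedged usage sketch (not part of the diff), assuming `langchain` and `sentence-transformers` are installed:

```python
from langchain.embeddings import HuggingFaceBgeEmbeddings

# Same construction as in the snippet above, on CPU for portability.
model_norm = HuggingFaceBgeEmbeddings(
    model_name="BAAI/bge-small-en",
    model_kwargs={'device': 'cpu'},                # use 'cuda' if a GPU is available
    encode_kwargs={'normalize_embeddings': True},  # normalized vectors, so dot product equals cosine similarity
)

doc_vectors = model_norm.embed_documents(["passage 1", "passage 2"])  # list of embedding vectors
query_vector = model_norm.embed_query("what is a passage?")           # single embedding vector
print(len(doc_vectors), len(query_vector))
```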
+ #### Using HuggingFace Transformers
 
  With the transformers package, you can use the model like this: first pass your input through the transformer model, then take the last hidden state of the first token (i.e., [CLS]) as the sentence embedding.
 
@@ -124,7 +151,7 @@ model = AutoModel.from_pretrained('BAAI/bge-large-zh')
 
  # Tokenize sentences
  encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
- # for retrieval task, add an instruction to query
+ # for the s2p (short query to long passage) retrieval task, add an instruction to the query (do not add any instruction to passages)
  # encoded_input = tokenizer([instruction + q for q in queries], padding=True, truncation=True, return_tensors='pt')
 
  # Compute token embeddings
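The rest of the Transformers example falls outside this hunk. A minimal sketch of the remaining steps described above (forward pass, CLS pooling, L2 normalization), assuming only the standard `transformers` and `torch` APIs, looks like this:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-zh')
model = AutoModel.from_pretrained('BAAI/bge-large-zh')

sentences = ["样例数据-1", "样例数据-2"]
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings without gradient tracking.
with torch.no_grad():
    model_output = model(**encoded_input)
    # CLS pooling: the last hidden state of the first token is the sentence embedding.
    sentence_embeddings = model_output[0][:, 0]

# Normalize so that the dot product between embeddings equals cosine similarity.
sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
print(sentence_embeddings.shape)
```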
@@ -168,7 +195,7 @@ More details and evaluation tools see our [scripts](https://github.com/FlagOpen/
 
 
  - **C-MTEB**:
- We create a benchmark C-MTEB for Chinese text embedding which consists of 31 datasets from 6 tasks.
+ We create a benchmark C-MTEB for chinese text embedding which consists of 31 datasets from 6 tasks.
  Please refer to [C_MTEB](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB/README.md) for a detailed introduction.
 
  | Model | Embedding dimension | Avg | Retrieval | STS | PairClassification | Classification | Reranking | Clustering |
@@ -196,7 +223,7 @@ and we provide some examples to do [pre-train](https://github.com/FlagOpen/FlagE
  We pre-train the model following the method [retromae](https://github.com/staoxiao/RetroMAE),
  which shows promising improvement in the retrieval task ([paper](https://aclanthology.org/2022.emnlp-main.35.pdf)).
  The pre-training was conducted on 24 A100(40G) GPUs with a batch size of 720.
- In retromae, the mask ratio of encoder and decoder are 0.3, and 0.5 respectively.
+ In retromae, the mask ratio of encoder and decoder are 0.3, 0.5 respectively.
  We used the AdamW optimizer with a learning rate of 2e-5.
 
  **Pre-training data**:
@@ -205,8 +232,7 @@ We used the AdamW optimizer and the learning rate is 2e-5.
  - [wikipedia](https://huggingface.co/datasets/wikipedia)
  - [msmarco](https://huggingface.co/datasets/Tevatron/msmarco-passage-corpus)
  - Chinese:
- - Subset of [wudao](https://github.com/BAAI-WuDao/Data)
- - [baidu-baike](https://baike.baidu.com/)
+ - [wudao](https://github.com/BAAI-WuDao/Data)
 
 
  **2. Finetune**
@@ -220,11 +246,11 @@ We trained our model on 48 A100(40G) GPUs with a large batch size of 32,768 (so
  We used the AdamW optimizer with a learning rate of 1e-5.
  The temperature for the contrastive loss is 0.01.
 
- For the version with `*-instrcution`, we add instruction to the query for retrieval task in the training.
- For english, the instruction is `Represent this sentence for searching relevant passages: `;
- For chinese, the instruction is `为这个句子生成表示以用于检索相关文章:`.
- In the evaluation, the instruction should be added for sentence to passages retrieval task, not be added for other tasks.
-
+ Besides, we add the instruction to the query for the s2p (short query to long passage) retrieval task during training (nothing is added to the passages).
+ For English, the instruction is `Represent this sentence for searching relevant passages: `;
+ For Chinese, the instruction is `为这个句子生成表示以用于检索相关文章:`.
+ In evaluation, the instruction should be added to queries for the retrieval task and not added for other tasks.
+ Note that the instruction is not needed for passages.
 
  The finetune script is accessible in this repository: [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/baai_general_embedding/README.md).
  You can easily finetune your model with it.
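To make the fine-tuning objective above concrete, the temperature-scaled contrastive loss with in-batch negatives can be sketched as below. This is an illustrative reconstruction from the description, not the repository's actual training code; `contrastive_loss` and its arguments are hypothetical names.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q_emb: torch.Tensor, p_emb: torch.Tensor, temperature: float = 0.01) -> torch.Tensor:
    """q_emb, p_emb: (batch, dim). Row i of p_emb is the positive passage for query i;
    every other passage in the batch serves as an in-batch negative."""
    q = F.normalize(q_emb, dim=-1)
    p = F.normalize(p_emb, dim=-1)
    logits = q @ p.T / temperature                     # cosine similarities scaled by the temperature
    labels = torch.arange(q.size(0), device=q.device)  # the matching passage index for each query
    return F.cross_entropy(logits, labels)

# Example with random embeddings standing in for encoder outputs.
loss = contrastive_loss(torch.randn(8, 1024), torch.randn(8, 1024))
print(loss.item())
```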
@@ -240,5 +266,10 @@ You can easily finetune your model with it.
  We will continually update the embedding models and training codes,
  hoping to promote the development of the embedding model community.
 
+
+
  ## License
- FlagEmbedding is licensed under [MIT License](https://github.com/FlagOpen/FlagEmbedding/blob/master/LICENSE). The released models can be used for commercial purposes free of charge.
+ FlagEmbedding is licensed under [MIT License](https://github.com/FlagOpen/FlagEmbedding/blob/master/LICENSE). The released models can be used for commercial purposes free of charge.
+
+
+