BAAI
ldwang committed
Commit e763119
1 Parent(s): b29c29f

update readme

Files changed (1)
  1. README.md +64 -33
README.md CHANGED
@@ -8,6 +8,7 @@ tags:
- Transformers
---

<h1 align="center">FlagEmbedding</h1>

@@ -17,20 +18,22 @@ tags:
<a href=#usage>Usage</a> |
<a href="#evaluation">Evaluation</a> |
<a href="#train">Train</a> |
<a href="#license">License</a>
<p>
</h4>

- For more details please refer to our GitHub repo: [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding).

[English](README.md) | [中文](https://github.com/FlagOpen/FlagEmbedding/blob/master/README_zh.md)

FlagEmbedding can map any text to a low-dimensional dense vector which can be used for tasks like retrieval, classification, clustering, or semantic search.
- And it also can be used in vector databases for LLMs.

************* 🌟**Updates**🌟 *************
- 08/05/2023: Release base-scale and small-scale models, **best performance among the models of the same size 🤗**
- - 08/02/2023: Release `bge-large-*`(short for BAAI General Embedding) Models, **rank 1st on MTEB and C-MTEB benchmark!** :tada: :tada:
- 08/01/2023: We release the [Chinese Massive Text Embedding Benchmark](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB) (**C-MTEB**), consisting of 31 test datasets.

@@ -38,37 +41,42 @@ And it also can be used in vector databases for LLMs.

`bge` is short for `BAAI general embedding`.

- | Model | Language | Description | query instruction for retrieval |
|:-------------------------------|:--------:| :--------:| :--------:|
- | [BAAI/bge-large-en](https://huggingface.co/BAAI/bge-large-en) | English | :trophy: rank **1st** in [MTEB](https://huggingface.co/spaces/mteb/leaderboard) leaderboard | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-base-en](https://huggingface.co/BAAI/bge-base-en) | English | rank **2nd** in [MTEB](https://huggingface.co/spaces/mteb/leaderboard) leaderboard | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-small-en](https://huggingface.co/BAAI/bge-small-en) | English | a small-scale model but with competitive performance | `Represent this sentence for searching relevant passages: ` |
- | [BAAI/bge-large-zh](https://huggingface.co/BAAI/bge-large-zh) | Chinese | :trophy: rank **1st** in [C-MTEB](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB) benchmark | `为这个句子生成表示以用于检索相关文章:` |
| [BAAI/bge-large-zh-noinstruct](https://huggingface.co/BAAI/bge-large-zh-noinstruct) | Chinese | This model is trained without instruction, and rank **2nd** in [C-MTEB](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB) benchmark | |
| [BAAI/bge-base-zh](https://huggingface.co/BAAI/bge-base-zh) | Chinese | a base-scale model but has similar ability with `bge-large-zh` | `为这个句子生成表示以用于检索相关文章:` |
| [BAAI/bge-small-zh](https://huggingface.co/BAAI/bge-small-zh) | Chinese | a small-scale model but with competitive performance | `为这个句子生成表示以用于检索相关文章:` |

-

## Usage

- * **Using FlagEmbedding**
```
pip install -U FlagEmbedding
```
- See [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/baai_general_embedding/README.md) for more methods to install FlagEmbedding.

```python
from FlagEmbedding import FlagModel
sentences = ["样例数据-1", "样例数据-2"]
model = FlagModel('BAAI/bge-large-zh', query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:")
- embeddings = model.encode(sentences)
- print(embeddings)

- # for retrieval task, please use encode_queries() which will automatically add the instruction to each query
- # corpus in retrieval task can still use encode() or encode_corpus()
queries = ['query_1', 'query_2']
- passages = ["样例段落-1", "样例段落-2"]
q_embeddings = model.encode_queries(queries)
p_embeddings = model.encode(passages)
scores = q_embeddings @ p_embeddings.T
@@ -78,7 +86,7 @@ The value of argument `query_instruction_for_retrieval` see [Model List](https:/

FlagModel will use all available GPUs when encoding; please set `os.environ["CUDA_VISIBLE_DEVICES"]` to select the GPUs to use.

- * **Using Sentence-Transformers**

Using this model is also easy when you have [sentence-transformers](https://www.SBERT.net) installed:

@@ -89,15 +97,18 @@ pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer
sentences = ["样例数据-1", "样例数据-2"]
model = SentenceTransformer('BAAI/bge-large-zh')
- embeddings = model.encode(sentences, normalize_embeddings=True)
- print(embeddings)
```
- For retrieval task,
- each query should start with an instruction (instructions see [Model List](https://github.com/FlagOpen/FlagEmbedding/tree/master#model-list)).
```python
from sentence_transformers import SentenceTransformer
- queries = ["手机开不了机怎么办?"]
- passages = ["样例段落-1", "样例段落-2"]
instruction = "为这个句子生成表示以用于检索相关文章:"

model = SentenceTransformer('BAAI/bge-large-zh')
@@ -106,7 +117,23 @@ p_embeddings = model.encode(passages, normalize_embeddings=True)
scores = q_embeddings @ p_embeddings.T
```

- * **Using HuggingFace Transformers**

With the transformers package, you can use the model like this: first pass your input through the transformer model, then take the last hidden state of the first token (i.e., [CLS]) as the sentence embedding.

@@ -122,7 +149,7 @@ model = AutoModel.from_pretrained('BAAI/bge-large-zh')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
- # for retrieval task, add an instruction to query
# encoded_input = tokenizer([instruction + q for q in queries], padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
@@ -166,7 +193,7 @@ More details and evaluation tools see our [scripts](https://github.com/FlagOpen/


- **C-MTEB**:
- We create a benchmark C-MTEB for Chinese text embedding which consists of 31 datasets from 6 tasks.
Please refer to [C_MTEB](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB/README.md) for a detailed introduction.

| Model | Embedding dimension | Avg | Retrieval | STS | PairClassification | Classification | Reranking | Clustering |
@@ -194,7 +221,7 @@ and we provide some examples to do [pre-train](https://github.com/FlagOpen/FlagE
We pre-train the model following the method [retromae](https://github.com/staoxiao/RetroMAE),
which shows promising improvement in retrieval tasks ([paper](https://aclanthology.org/2022.emnlp-main.35.pdf)).
The pre-training was conducted on 24 A100(40G) GPUs with a batch size of 720.
- In retromae, the mask ratio of encoder and decoder are 0.3, and 0.5 respectively.
We used the AdamW optimizer with a learning rate of 2e-5.

**Pre-training data**:
@@ -203,8 +230,7 @@ We used the AdamW optimizer and the learning rate is 2e-5.
- [wikipedia](https://huggingface.co/datasets/wikipedia)
- [msmarco](https://huggingface.co/datasets/Tevatron/msmarco-passage-corpus)
- Chinese:
- - Subset of [wudao](https://github.com/BAAI-WuDao/Data)
- - [baidu-baike](https://baike.baidu.com/)


**2. Finetune**
@@ -218,11 +244,11 @@ We trained our model on 48 A100(40G) GPUs with a large batch size of 32,768 (so
We used the AdamW optimizer with a learning rate of 1e-5.
The temperature for contrastive loss is 0.01.

- For the version with `*-instrcution`, we add instruction to the query for retrieval task in the training.
- For english, the instruction is `Represent this sentence for searching relevant passages: `;
- For chinese, the instruction is `为这个句子生成表示以用于检索相关文章:`.
- In the evaluation, the instruction should be added for sentence to passages retrieval task, not be added for other tasks.
-

The finetune script is accessible in this repository: [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/baai_general_embedding/README.md).
You can easily finetune your model with it.
@@ -238,5 +264,10 @@ You can easily finetune your model with it.

We will continually update the embedding models and training code,
hoping to promote the development of the embedding model community.

## License
- FlagEmbedding is licensed under [MIT License](https://github.com/FlagOpen/FlagEmbedding/blob/master/LICENSE). The released models can be used for commercial purposes free of charge.
 
 
 
 
- Transformers
---

+
<h1 align="center">FlagEmbedding</h1>

 
<a href=#usage>Usage</a> |
<a href="#evaluation">Evaluation</a> |
<a href="#train">Train</a> |
+ <a href="#contact">Contact</a> |
<a href="#license">License</a>
<p>
</h4>

+ For more details, please refer to our GitHub repo: [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding).

[English](README.md) | [中文](https://github.com/FlagOpen/FlagEmbedding/blob/master/README_zh.md)

FlagEmbedding can map any text to a low-dimensional dense vector which can be used for tasks like retrieval, classification, clustering, or semantic search.
+ It can also be used in vector databases for LLMs.

************* 🌟**Updates**🌟 *************
+ - 08/09/2023: BGE models are integrated into **LangChain**; you can use them like [**this**](#using-langchain). The C-MTEB **leaderboard** is [available](https://huggingface.co/spaces/mteb/leaderboard).
- 08/05/2023: Release base-scale and small-scale models, **best performance among the models of the same size 🤗**
+ - 08/02/2023: Release `bge-large-*` (short for BAAI General Embedding) models, **rank 1st on the MTEB and C-MTEB benchmarks!**
- 08/01/2023: We release the [Chinese Massive Text Embedding Benchmark](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB) (**C-MTEB**), consisting of 31 test datasets.

 

`bge` is short for `BAAI general embedding`.

+ | Model | Language | Description | query instruction for retrieval\* |
|:-------------------------------|:--------:| :--------:| :--------:|
+ | [BAAI/bge-large-en](https://huggingface.co/BAAI/bge-large-en) | English | rank **1st** in [MTEB](https://huggingface.co/spaces/mteb/leaderboard) leaderboard | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-base-en](https://huggingface.co/BAAI/bge-base-en) | English | rank **2nd** in [MTEB](https://huggingface.co/spaces/mteb/leaderboard) leaderboard | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-small-en](https://huggingface.co/BAAI/bge-small-en) | English | a small-scale model but with competitive performance | `Represent this sentence for searching relevant passages: ` |
+ | [BAAI/bge-large-zh](https://huggingface.co/BAAI/bge-large-zh) | Chinese | rank **1st** in [C-MTEB](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB) benchmark | `为这个句子生成表示以用于检索相关文章:` |
| [BAAI/bge-large-zh-noinstruct](https://huggingface.co/BAAI/bge-large-zh-noinstruct) | Chinese | This model is trained without instruction, and rank **2nd** in [C-MTEB](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB) benchmark | |
| [BAAI/bge-base-zh](https://huggingface.co/BAAI/bge-base-zh) | Chinese | a base-scale model but has similar ability with `bge-large-zh` | `为这个句子生成表示以用于检索相关文章:` |
| [BAAI/bge-small-zh](https://huggingface.co/BAAI/bge-small-zh) | Chinese | a small-scale model but with competitive performance | `为这个句子生成表示以用于检索相关文章:` |

+ \*: If you need to retrieve **long** relevant passages for a **short** query (the s2p retrieval task), add the instruction to the query; in other cases, no instruction is needed, just use the original query directly. In all cases, **no instruction** needs to be added to passages.
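
As a minimal illustration of the rule in the footnote above (the query and passage strings below are made up for this sketch, not taken from the model card), only the short query receives the instruction prefix:

```python
# hypothetical strings; only the short query is prefixed with the retrieval instruction
instruction = "Represent this sentence for searching relevant passages: "
query = "how do I reset my phone?"                         # short query -> prepend instruction
passage = "To reset the phone, hold the power button ..."  # long passage -> leave as-is

text_for_query_embedding = instruction + query
text_for_passage_embedding = passage
```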

## Usage

+ Here are some examples of using `bge` models with
+ [FlagEmbedding](#using-flagembedding), [Sentence-Transformers](#using-sentence-transformers), [LangChain](#using-langchain), or [Hugging Face Transformers](#using-huggingface-transformers).
+
+ #### Using FlagEmbedding
```
pip install -U FlagEmbedding
```
+ If it doesn't work for you, see [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/baai_general_embedding/README.md) for more ways to install FlagEmbedding.

```python
from FlagEmbedding import FlagModel
sentences = ["样例数据-1", "样例数据-2"]
model = FlagModel('BAAI/bge-large-zh', query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:")
+ embeddings_1 = model.encode(sentences)
+ embeddings_2 = model.encode(sentences)
+ similarity = embeddings_1 @ embeddings_2.T
+ print(similarity)

+ # for the s2p (short query to long passage) retrieval task, use encode_queries(), which automatically adds the instruction to each query
+ # the corpus can still be encoded with encode() or encode_corpus(), since passages don't need the instruction
queries = ['query_1', 'query_2']
+ passages = ["样例文档-1", "样例文档-2"]
q_embeddings = model.encode_queries(queries)
p_embeddings = model.encode(passages)
scores = q_embeddings @ p_embeddings.T
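# Illustrative note (not in the original snippet): assuming the returned embeddings are
# L2-normalized, as the dot-product "similarity"/"scores" usage in these examples suggests,
# each entry scores[i][j] can be read as the cosine similarity between query i and passage j.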
 
FlagModel will use all available GPUs when encoding; please set `os.environ["CUDA_VISIBLE_DEVICES"]` to select the GPUs to use.

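A minimal sketch of that setting (assuming a CUDA machine; the variable has to be set before the model is created so that only the chosen device is visible):

```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # expose only GPU 0 to the encoder

from FlagEmbedding import FlagModel
model = FlagModel('BAAI/bge-large-zh',
                  query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:")
```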
 
+ #### Using Sentence-Transformers

Using this model is also easy when you have [sentence-transformers](https://www.SBERT.net) installed:

 
from sentence_transformers import SentenceTransformer
sentences = ["样例数据-1", "样例数据-2"]
model = SentenceTransformer('BAAI/bge-large-zh')
+ embeddings_1 = model.encode(sentences, normalize_embeddings=True)
+ embeddings_2 = model.encode(sentences, normalize_embeddings=True)
+ similarity = embeddings_1 @ embeddings_2.T
+ print(similarity)
```
+ For the s2p (short query to long passage) retrieval task,
+ each short query should start with an instruction (see the [Model List](https://github.com/FlagOpen/FlagEmbedding/tree/master#model-list) for instructions).
+ The instruction is not needed for passages.
```python
from sentence_transformers import SentenceTransformer
+ queries = ['query_1', 'query_2']
+ passages = ["样例文档-1", "样例文档-2"]
instruction = "为这个句子生成表示以用于检索相关文章:"

model = SentenceTransformer('BAAI/bge-large-zh')
 
scores = q_embeddings @ p_embeddings.T
```

+ #### Using LangChain
+
+ You can use `bge` in LangChain like this:
+ ```python
+ from langchain.embeddings import HuggingFaceBgeEmbeddings
+ model_name = "BAAI/bge-small-en"
+ model_kwargs = {'device': 'cuda'}
+ encode_kwargs = {'normalize_embeddings': True}  # set True to compute cosine similarity
+ model_norm = HuggingFaceBgeEmbeddings(
+     model_name=model_name,
+     model_kwargs=model_kwargs,
+     encode_kwargs=encode_kwargs
+ )
+ ```
+
+
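Once constructed, the wrapper exposes LangChain's standard embeddings interface; a brief usage sketch (the sample texts are made up, and `embed_query`/`embed_documents` are the generic LangChain `Embeddings` methods rather than anything bge-specific):

```python
query_vector = model_norm.embed_query("how do I reset my phone?")
doc_vectors = model_norm.embed_documents(["To reset the phone, hold the power button ..."])
print(len(query_vector), len(doc_vectors[0]))  # both equal the model's embedding dimension
```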
+ #### Using HuggingFace Transformers

With the transformers package, you can use the model like this: first pass your input through the transformer model, then take the last hidden state of the first token (i.e., [CLS]) as the sentence embedding.

 

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
+ # for the s2p (short query to long passage) retrieval task, add an instruction to the query (no instruction is added to passages)
# encoded_input = tokenizer([instruction + q for q in queries], padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
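
The hunk above shows only the middle of the Transformers example; a self-contained sketch of the full flow it describes (CLS pooling followed by normalization, written from the description above rather than copied verbatim from the README) could look like:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-zh')
model = AutoModel.from_pretrained('BAAI/bge-large-zh')

sentences = ["样例数据-1", "样例数据-2"]
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings and take the [CLS] (first token) hidden state as the sentence embedding
with torch.no_grad():
    model_output = model(**encoded_input)
    sentence_embeddings = model_output.last_hidden_state[:, 0]

# Normalize so that dot products between embeddings are cosine similarities
sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
print(sentence_embeddings.shape)
```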
 


- **C-MTEB**:
+ We create the benchmark C-MTEB for Chinese text embedding, which consists of 31 datasets from 6 tasks.
Please refer to [C_MTEB](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB/README.md) for a detailed introduction.

| Model | Embedding dimension | Avg | Retrieval | STS | PairClassification | Classification | Reranking | Clustering |
 
We pre-train the model following the method [retromae](https://github.com/staoxiao/RetroMAE),
which shows promising improvement in retrieval tasks ([paper](https://aclanthology.org/2022.emnlp-main.35.pdf)).
The pre-training was conducted on 24 A100(40G) GPUs with a batch size of 720.
+ In RetroMAE, the mask ratios of the encoder and decoder are 0.3 and 0.5, respectively.
We used the AdamW optimizer with a learning rate of 2e-5.

**Pre-training data**:
 
- [wikipedia](https://huggingface.co/datasets/wikipedia)
- [msmarco](https://huggingface.co/datasets/Tevatron/msmarco-passage-corpus)
- Chinese:
+ - [wudao](https://github.com/BAAI-WuDao/Data)


**2. Finetune**
 
We used the AdamW optimizer with a learning rate of 1e-5.
The temperature for contrastive loss is 0.01.
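
For intuition, this temperature enters a standard in-batch-negatives contrastive (InfoNCE-style) objective roughly as in the sketch below; this is an illustration of the idea, not the repository's actual training code:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q_emb: torch.Tensor, p_emb: torch.Tensor, temperature: float = 0.01) -> torch.Tensor:
    # q_emb, p_emb: [batch, dim], L2-normalized; row i of p_emb is the positive passage for query i
    scores = q_emb @ p_emb.T / temperature                      # every other passage in the batch is a negative
    labels = torch.arange(q_emb.size(0), device=q_emb.device)   # index of the positive for each query
    return F.cross_entropy(scores, labels)
```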
 
+ Besides, we add the instruction to the query for the s2p (short query to long passage) retrieval task during training (nothing is added to passages).
+ For English, the instruction is `Represent this sentence for searching relevant passages: `;
+ for Chinese, the instruction is `为这个句子生成表示以用于检索相关文章:`.
+ In the evaluation, the instruction should be added to queries for the retrieval task, but not for other tasks.
+ Note that the instruction is not needed for passages.

The finetune script is accessible in this repository: [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/baai_general_embedding/README.md).
You can easily finetune your model with it.
 
We will continually update the embedding models and training code,
hoping to promote the development of the embedding model community.

+
+
## License
+ FlagEmbedding is licensed under the [MIT License](https://github.com/FlagOpen/FlagEmbedding/blob/master/LICENSE). The released models can be used for commercial purposes free of charge.
+
+
+