maidalun1020 committed on
Commit 30aaf52 · 1 Parent(s): 2579c47

Update README.md

Files changed (1): README.md (+206 -34)
README.md CHANGED
@@ -20,10 +20,17 @@ license: apache-2.0
20
  </a>
21
  </p>
22
 
 
23
  <p align="left">
24
  <a href="https://github.com/netease-youdao/BCEmbedding">GitHub</a>
25
  </p>
26
 
27
  <details open="open">
28
  <summary>Click to Open Contents</summary>
29
 
@@ -33,7 +40,8 @@ license: apache-2.0
33
  - <a href="#-model-list" target="_Self">🍎 Model List</a>
34
  - <a href="#-manual" target="_Self">📖 Manual</a>
35
  - <a href="#installation" target="_Self">Installation</a>
36
- - <a href="#quick-start" target="_Self">Quick Start</a>
 
37
  - <a href="#%EF%B8%8F-evaluation" target="_Self">⚙️ Evaluation</a>
38
  - <a href="#evaluate-semantic-representation-by-mteb" target="_Self">Evaluate Semantic Representation by MTEB</a>
39
  - <a href="#evaluate-rag-by-llamaindex" target="_Self">Evaluate RAG by LlamaIndex</a>
@@ -127,17 +135,20 @@ Existing embedding models often encounter performance challenges in bilingual an
127
  ### Installation
128
 
129
  First, create a conda environment and activate it.
 
130
  ```bash
131
  conda create --name bce python=3.10 -y
132
  conda activate bce
133
  ```
134
 
135
- Then install `BCEmbedding`:
 
136
  ```bash
137
- pip install git+https://github.com/netease-youdao/BCEmbedding.git
138
  ```
139
 
140
  Or install from source:
 
141
  ```bash
142
  git clone git@github.com:netease-youdao/BCEmbedding.git
143
  cd BCEmbedding
@@ -146,7 +157,9 @@ pip install -v -e .
146
 
147
  ### Quick Start
148
 
149
- Use `EmbeddingModel` from `BCEmbedding`; the `cls` [pooler](https://github.com/netease-youdao/BCEmbedding/blob/master/BCEmbedding/models/embedding.py#L24) is the default.
 
 
150
 
151
  ```python
152
  from BCEmbedding import EmbeddingModel
@@ -161,7 +174,7 @@ model = EmbeddingModel(model_name_or_path="maidalun1020/bce-embedding-base_v1")
161
  embeddings = model.encode(sentences)
162
  ```
163
 
164
- Use `RerankerModel` from `BCEmbedding` to compute relevance scores and rerank:
165
 
166
  ```python
167
  from BCEmbedding import RerankerModel
@@ -183,6 +196,164 @@ scores = model.compute_score(sentence_pairs)
183
  rerank_results = model.rerank(query, passages)
184
  ```
185
 
186
  ## ⚙️ Evaluation
187
 
188
  ### Evaluate Semantic Representation by MTEB
@@ -193,9 +364,9 @@ We provide evaluation tools for `embedding` and `reranker` models, based on [MT
193
 
194
  #### 1. Embedding Models
195
 
196
- Just run the following command to evaluate `your_embedding_model` (e.g. `maidalun1020/bce-embedding-base_v1`) in **monolingual, bilingual and crosslingual settings** (e.g. `["en", "zh", "en-zh", "zh-en"]`).
197
 
198
- Run the following command to evaluate `your_embedding_model` (e.g., `maidalun1020/bce-embedding-base_v1`). The evaluation runs in **monolingual, bilingual and crosslingual** settings (e.g., `["en", "zh", "en-zh", "zh-en"]`):
199
 
200
  ```bash
201
  python BCEmbedding/tools/eval_mteb/eval_embedding_mteb.py --model_name_or_path maidalun1020/bce-embedding-base_v1 --pooler cls
@@ -206,8 +377,11 @@ The total evaluation tasks contain ***114 datasets*** of **"Retrieval", "STS", "
206
The evaluation covers ***114 datasets*** across six task types: **"Retrieval", "STS", "PairClassification", "Classification", "Reranking" and "Clustering"**.
207
 
208
  ***NOTE:***
209
- - All models are evaluated with their **recommended pooling method (`pooler`)**. "jina-embeddings-v2-base-en", "m3e-base" and "m3e-large" use the `mean` pooler, while the others use `cls`.
 
 
210
  - "jina-embeddings-v2-base-en" model should be loaded with `trust_remote_code`.
 
211
  ```bash
212
  python BCEmbedding/tools/eval_mteb/eval_embedding_mteb.py --model_name_or_path {moka-ai/m3e-base | moka-ai/m3e-large} --pooler mean
213
 
@@ -215,14 +389,14 @@ python BCEmbedding/tools/eval_mteb/eval_embedding_mteb.py --model_name_or_path j
215
  ```
216
 
217
***Note:***
218
- All models are evaluated with their own recommended `pooler`: "jina-embeddings-v2-base-en", "m3e-base" and "m3e-large" use `mean`, while the other models use `cls`.
219
  - "jina-embeddings-v2-base-en"模型在载入时需要`trust_remote_code`。
220
 
221
  #### 2. Reranker Models
222
 
223
- Run the following command to evaluate `your_reranker_model` (e.g. "maidalun1020/bce-reranker-base_v1") in **monolingual, bilingual and crosslingual settings** (e.g. `["en", "zh", "en-zh", "zh-en"]`).
224
 
225
- Run the following command to evaluate `your_reranker_model` (e.g., `maidalun1020/bce-reranker-base_v1`). The evaluation runs in **monolingual, bilingual and crosslingual** settings (e.g., `["en", "zh", "en-zh", "zh-en"]`):
226
 
227
  ```bash
228
  python BCEmbedding/tools/eval_mteb/eval_reranker_mteb.py --model_name_or_path maidalun1020/bce-reranker-base_v1
@@ -323,25 +497,30 @@ The summary of multiple domains evaluations can be seen in <a href=#1-multiple-d
323
 
324
  #### 1. Embedding Models
325
 
326
- | Model | Retrieval | STS | PairClassification | Classification | Reranking | Clustering | Avg |
327
- |:-------------------------------|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|
328
- | bge-base-en-v1.5 | 37.14 | 55.06 | 75.45 | 59.73 | 43.05 | 37.74 | 47.20 |
329
- | bge-base-zh-v1.5 | 47.60 | 63.72 | 77.40 | 63.38 | 54.85 | 32.56 | 53.60 |
330
- | bge-large-en-v1.5 | 37.15 | 54.09 | 75.00 | 59.24 | 42.68 | 37.32 | 46.82 |
331
- | bge-large-zh-v1.5 | 47.54 | 64.73 | **79.14** | 64.19 | 55.88 | 33.26 | 54.21 |
332
- | jina-embeddings-v2-base-en | 31.58 | 54.28 | 74.84 | 58.42 | 41.16 | 34.67 | 44.29 |
333
- | m3e-base | 46.29 | 63.93 | 71.84 | 64.08 | 52.38 | 37.84 | 53.54 |
334
- | m3e-large | 34.85 | 59.74 | 67.69 | 60.07 | 48.99 | 31.62 | 46.78 |
335
- | ***bce-embedding-base_v1*** | **57.60** | **65.73** | 74.96 | **69.00** | **57.29** | **38.95** | **59.43** |
 
336
 
337
  ***NOTE:***
338
- - Our ***bce-embedding-base_v1*** outperforms other open-source embedding models of various sizes.
339
- ***114 datasets*** of **"Retrieval", "STS", "PairClassification", "Classification", "Reranking" and "Clustering"** in the `["en", "zh", "en-zh", "zh-en"]` setting.
340
  - The [crosslingual evaluation datasets](https://github.com/netease-youdao/BCEmbedding/blob/master/BCEmbedding/evaluation/c_mteb/Retrieval.py) we released belong to `Retrieval` task.
341
  - More evaluation details please check [Embedding Models Evaluation Summary](https://github.com/netease-youdao/BCEmbedding/blob/master/Docs/EvaluationSummary/embedding_eval_summary.md).
342
 
343
***Key points:***
344
- ***bce-embedding-base_v1*** performs best among open-source embedding models of all sizes.
345
- The evaluation covers ***114 datasets*** across six task types: **"Retrieval", "STS", "PairClassification", "Classification", "Reranking" and "Clustering"**.
346
- The [crosslingual evaluation datasets](https://github.com/netease-youdao/BCEmbedding/blob/master/BCEmbedding/evaluation/c_mteb/Retrieval.py) we release belong to the `Retrieval` task.
347
- For more details, see the [Embedding Models Evaluation Summary](https://github.com/netease-youdao/BCEmbedding/blob/master/Docs/EvaluationSummary/embedding_eval_summary.md).
@@ -368,16 +547,8 @@ The summary of multiple domains evaluations can be seen in <a href=#1-multiple-d
368
 
369
  #### 1. Multiple Domains Scenarios
370
 
371
- | Embedding Models | WithoutReranker <br> [*hit_rate/mrr*] | CohereRerank <br> [*hit_rate/mrr*] | bge-reranker-large <br> [*hit_rate/mrr*] | ***bce-reranker-base_v1*** <br> [*hit_rate/mrr*] |
372
- |:-------------------------------|:--------:|:--------:|:--------:|:--------:|
373
- | OpenAI-ada-2 | 81.04/57.35 | 88.35/67.83 | 88.89/69.64 | **90.71/75.46** |
374
- | bge-large-en-v1.5 | 52.67/34.69 | 64.59/52.11 | 64.71/52.05 | **65.36/55.50** |
375
- | bge-large-zh-v1.5 | 69.81/47.38 | 79.37/62.13 | 80.11/63.95 | **81.19/68.50** |
376
- | llm-embedder | 50.85/33.26 | 63.62/51.45 | 63.54/51.32 | **64.47/54.98** |
377
- | CohereV3-en | 53.10/35.39 | 65.75/52.80 | 66.29/53.31 | **66.91/56.93** |
378
- | CohereV3-multilingual | 79.80/57.22 | 86.34/66.62 | 86.76/68.56 | **88.35/73.73** |
379
- | JinaAI-v2-Base-en | 50.27/32.31 | 63.97/51.10 | 64.28/51.83 | **64.82/54.98** |
380
- | ***bce-embedding-base_v1*** | **85.91/62.36** | **91.25/69.38** | **91.80/71.13** | ***93.46/77.02*** |
381
 
382
  ***NOTE:***
383
  - In `WithoutReranker` setting, our `bce-embedding-base_v1` outperforms all the other embedding models.
@@ -401,7 +572,8 @@ Scan the QR code below to join the WeChat group.
401
 
402
Scan the QR code below to join our official WeChat group.
403
 
404
- <img src="https://github.com/netease-youdao/BCEmbedding/blob/master/Docs/assets/Wechat.jpg" width="20%" height="auto">
 
405
 
406
  ## ✏️ Citation
407
 
 
20
  </a>
21
  </p>
22
 
23
+ For the latest information on bce-embedding-base_v1, and for more details on the MTEB and RAG evaluations, please visit:
24
  <p align="left">
25
  <a href="https://github.com/netease-youdao/BCEmbedding">GitHub</a>
26
  </p>
27
 
28
+ Key features:
29
+ 1. Bilingual (Chinese-English) and Chinese-English crosslingual capability;
30
+ 2. Optimized for RAG, adapted to more real-world business scenarios;
31
+ 3. Easy to integrate into langchain and llamaindex.
32
+
33
+ -----------------------------------------
34
  <details open="open">
35
  <summary>Click to Open Contents</summary>
36
 
 
40
  - <a href="#-model-list" target="_Self">🍎 Model List</a>
41
  - <a href="#-manual" target="_Self">📖 Manual</a>
42
  - <a href="#installation" target="_Self">Installation</a>
43
+ - <a href="#quick-start" target="_Self">Quick Start (`transformers`, `sentence-transformers`)</a>
44
+ - <a href="#integrations-for-rag-frameworks" target="_Self">Integrations for RAG Frameworks (`langchain`, `llama_index`)</a>
45
  - <a href="#%EF%B8%8F-evaluation" target="_Self">⚙️ Evaluation</a>
46
  - <a href="#evaluate-semantic-representation-by-mteb" target="_Self">Evaluate Semantic Representation by MTEB</a>
47
  - <a href="#evaluate-rag-by-llamaindex" target="_Self">Evaluate RAG by LlamaIndex</a>
 
135
  ### Installation
136
 
137
  First, create a conda environment and activate it.
138
+
139
  ```bash
140
  conda create --name bce python=3.10 -y
141
  conda activate bce
142
  ```
143
 
144
+ Then install `BCEmbedding` (minimal installation):
145
+
146
  ```bash
147
+ pip install BCEmbedding==0.1.1
148
  ```
149
 
150
  Or install from source:
151
+
152
  ```bash
153
  git clone git@github.com:netease-youdao/BCEmbedding.git
154
  cd BCEmbedding
 
157
 
158
  ### Quick Start
159
 
160
+ #### 1. Based on `BCEmbedding`
161
+
162
+ Use `EmbeddingModel`; the `cls` [pooler](./BCEmbedding/models/embedding.py#L24) is the default.
163
 
164
  ```python
165
  from BCEmbedding import EmbeddingModel
 
174
  embeddings = model.encode(sentences)
175
  ```
176
 
177
+ Use `RerankerModel` to compute relevance scores and rerank:
178
 
179
  ```python
180
  from BCEmbedding import RerankerModel
 
196
  rerank_results = model.rerank(query, passages)
197
  ```
198
 
199
+ NOTE:
200
+
201
+ - The [`RerankerModel.rerank`](./BCEmbedding/models/reranker.py#L137) method provides the advanced preprocessing we use in production to construct `sentence_pairs` when the passages are very long; see the sketch below.
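+
+ A rough usage sketch of that preprocessing (the query and passages below are illustrative): over-length passages can be passed to `rerank` directly.
+
+ ```python
+ from BCEmbedding import RerankerModel
+
+ model = RerankerModel(model_name_or_path="maidalun1020/bce-reranker-base_v1")
+
+ query = 'apples'
+ # each passage far exceeds the model's 512-token window; `rerank`'s
+ # internal preprocessing handles the over-length inputs
+ passages = ['I like apples. ' * 300, 'I like oranges. ' * 300]
+
+ rerank_results = model.rerank(query, passages)
+ ```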
202
+
203
+ #### 2. Based on `transformers`
204
+
205
+ For `EmbeddingModel`:
206
+
207
+ ```python
208
+ from transformers import AutoModel, AutoTokenizer
209
+
210
+ # list of sentences
211
+ sentences = ['sentence_0', 'sentence_1', ...]
212
+
213
+ # init model and tokenizer
214
+ tokenizer = AutoTokenizer.from_pretrained('maidalun1020/bce-embedding-base_v1')
215
+ model = AutoModel.from_pretrained('maidalun1020/bce-embedding-base_v1')
216
+
217
+ device = 'cuda'  # set to 'cpu' if no GPU is available
218
+ model.to(device)
219
+
220
+ # get inputs
221
+ inputs = tokenizer(sentences, padding=True, truncation=True, max_length=512, return_tensors="pt")
222
+ inputs_on_device = {k: v.to(device) for k, v in inputs.items()}
223
+
224
+ # get embeddings
225
+ outputs = model(**inputs_on_device, return_dict=True)
226
+ embeddings = outputs.last_hidden_state[:, 0] # cls pooler
227
+ embeddings = embeddings / embeddings.norm(dim=1, keepdim=True) # normalize
228
+ ```
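+
+ Since the embeddings above are L2-normalized, a dot product gives their pairwise cosine similarity; a quick follow-up check, reusing `embeddings` from the block above:
+
+ ```python
+ # pairwise cosine similarity of the normalized embeddings
+ similarity_matrix = embeddings @ embeddings.T
+ ```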
229
+
230
+ For `RerankerModel`:
231
+
232
+ ```python
233
+ import torch
234
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
235
+
236
+ # init model and tokenizer
237
+ tokenizer = AutoTokenizer.from_pretrained('maidalun1020/bce-reranker-base_v1')
238
+ model = AutoModelForSequenceClassification.from_pretrained('maidalun1020/bce-reranker-base_v1')
239
+
240
+ device = 'cuda'  # set to 'cpu' if no GPU is available
241
+ model.to(device)
242
+
243
+ # query-passage pairs to score (illustrative)
+ sentence_pairs = [['query_0', 'passage_0'], ['query_0', 'passage_1']]
+
+ # get inputs
244
+ inputs = tokenizer(sentence_pairs, padding=True, truncation=True, max_length=512, return_tensors="pt")
245
+ inputs_on_device = {k: v.to(device) for k, v in inputs.items()}
246
+
247
+ # calculate scores
248
+ scores = model(**inputs_on_device, return_dict=True).logits.view(-1,).float()
249
+ scores = torch.sigmoid(scores)  # map logits to (0, 1) relevance scores
250
+ ```
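+
+ As a short follow-up, reusing `sentence_pairs` and `scores` from above, the pairs can be ordered by score:
+
+ ```python
+ # sort pairs by descending relevance score
+ ranked = sorted(zip(sentence_pairs, scores.tolist()), key=lambda x: -x[1])
+ ```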
251
+
252
+ #### 3. Based on `sentence_transformers`
253
+
254
+ For `EmbeddingModel`:
255
+
256
+ ```python
257
+ from sentence_transformers import SentenceTransformer
258
+
259
+ # list of sentences
260
+ sentences = ['sentence_0', 'sentence_1', ...]
261
+
262
+ # init embedding model
263
+ ## NOTE: the hub model was recently updated for sentence-transformers. Clear any cached copy in "`SENTENCE_TRANSFORMERS_HOME`/maidalun1020_bce-embedding-base_v1" or "~/.cache/torch/sentence_transformers/maidalun1020_bce-embedding-base_v1" first so the new version is downloaded.
264
+ model = SentenceTransformer("maidalun1020/bce-embedding-base_v1")
265
+
266
+ # extract embeddings
267
+ embeddings = model.encode(sentences, normalize_embeddings=True)
268
+ ```
269
+
270
+ For `RerankerModel`:
271
+
272
+ ```python
273
+ from sentence_transformers import CrossEncoder
274
+
275
+ # init reranker model
276
+ model = CrossEncoder('maidalun1020/bce-reranker-base_v1', max_length=512)
277
+
278
+ # calculate scores of sentence pairs
279
+ scores = model.predict(sentence_pairs)
280
+ ```
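+
+ A common retrieval pattern is to embed with the bi-encoder first and then rescore the top candidates with the cross-encoder. A minimal sketch combining the two models above (the query, passages and top-k value are illustrative):
+
+ ```python
+ import numpy as np
+ from sentence_transformers import CrossEncoder, SentenceTransformer
+
+ query = 'apples'
+ passages = ['I like apples', 'I like oranges', 'Apples and oranges are fruits']
+
+ embedder = SentenceTransformer("maidalun1020/bce-embedding-base_v1")
+ reranker = CrossEncoder('maidalun1020/bce-reranker-base_v1', max_length=512)
+
+ # stage 1: take the top-2 passages by cosine similarity of normalized embeddings
+ q_emb = embedder.encode([query], normalize_embeddings=True)
+ p_emb = embedder.encode(passages, normalize_embeddings=True)
+ top_ids = np.argsort(-(q_emb @ p_emb.T)[0])[:2]
+
+ # stage 2: rescore the retrieved candidates with the cross-encoder
+ scores = reranker.predict([[query, passages[i]] for i in top_ids])
+ ```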
281
+
282
+ ### Integrations for RAG Frameworks
283
+
284
+ #### 1. Used in `langchain`
285
+
286
+ ```python
287
+ from langchain.embeddings import HuggingFaceEmbeddings
288
+ from langchain_community.vectorstores import FAISS
289
+ from langchain_community.vectorstores.utils import DistanceStrategy
290
+
291
+ query = 'apples'
292
+ passages = [
293
+ 'I like apples',
294
+ 'I like oranges',
295
+ 'Apples and oranges are fruits'
296
+ ]
297
+
298
+ # init embedding model
299
+ model_name = 'maidalun1020/bce-embedding-base_v1'
300
+ model_kwargs = {'device': 'cuda'}
301
+ encode_kwargs = {'batch_size': 64, 'normalize_embeddings': True, 'show_progress_bar': False}
302
+
303
+ embed_model = HuggingFaceEmbeddings(
304
+ model_name=model_name,
305
+ model_kwargs=model_kwargs,
306
+ encode_kwargs=encode_kwargs
307
+ )
308
+
309
+ # example #1. extract embeddings
310
+ query_embedding = embed_model.embed_query(query)
311
+ passages_embeddings = embed_model.embed_documents(passages)
312
+
313
+ # example #2. langchain retriever example
314
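+ # embeddings are normalized above (normalize_embeddings=True), so max inner product equals cosine similarity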
+ faiss_vectorstore = FAISS.from_texts(passages, embed_model, distance_strategy=DistanceStrategy.MAX_INNER_PRODUCT)
315
+
316
+ retriever = faiss_vectorstore.as_retriever(search_type="similarity", search_kwargs={"score_threshold": 0.5, "k": 3})
317
+
318
+ related_passages = retriever.get_relevant_documents(query)
319
+ ```
320
+
321
+ #### 2. Used in `llama_index`
322
+
323
+ ```python
324
+ from llama_index.embeddings import HuggingFaceEmbedding
325
+ from llama_index import VectorStoreIndex, ServiceContext, SimpleDirectoryReader
326
+ from llama_index.node_parser import SimpleNodeParser
327
+ from llama_index.llms import OpenAI
+ import os
328
+
329
+ query = 'apples'
330
+ passages = [
331
+ 'I like apples',
332
+ 'I like oranges',
333
+ 'Apples and oranges are fruits'
334
+ ]
335
+
336
+ # init embedding model
337
+ model_args = {'model_name': 'maidalun1020/bce-embedding-base_v1', 'max_length': 512, 'embed_batch_size': 64, 'device': 'cuda'}
338
+ embed_model = HuggingFaceEmbedding(**model_args)
339
+
340
+ # example #1. extract embeddings
341
+ query_embedding = embed_model.get_query_embedding(query)
342
+ passages_embeddings = embed_model.get_text_embedding_batch(passages)
343
+
344
+ # example #2. rag example
345
+ llm = OpenAI(model='gpt-3.5-turbo-0613', api_key=os.environ.get('OPENAI_API_KEY'), api_base=os.environ.get('OPENAI_BASE_URL'))
346
+ service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)
347
+
348
+ documents = SimpleDirectoryReader(input_files=["BCEmbedding/tools/eval_rag/eval_pdfs/Comp_en_llama2.pdf"]).load_data()
349
+ node_parser = SimpleNodeParser.from_defaults(chunk_size=512)
350
+ nodes = node_parser.get_nodes_from_documents(documents[0:36])
351
+ index = VectorStoreIndex(nodes, service_context=service_context)
352
+ query_engine = index.as_query_engine()
353
+ response = query_engine.query("What is llama?")
354
+ ```
355
+
356
+
357
  ## ⚙️ Evaluation
358
 
359
  ### Evaluate Semantic Representation by MTEB
 
364
 
365
  #### 1. Embedding Models
366
 
367
+ Just run the following command to evaluate `your_embedding_model` (e.g. `maidalun1020/bce-embedding-base_v1`) in **bilingual and crosslingual settings** (e.g. `["en", "zh", "en-zh", "zh-en"]`).
368
 
369
+ Run the following command to evaluate `your_embedding_model` (e.g., `maidalun1020/bce-embedding-base_v1`). The evaluation runs in **bilingual and crosslingual** settings (e.g., `["en", "zh", "en-zh", "zh-en"]`):
370
 
371
  ```bash
372
  python BCEmbedding/tools/eval_mteb/eval_embedding_mteb.py --model_name_or_path maidalun1020/bce-embedding-base_v1 --pooler cls
 
377
The evaluation covers ***114 datasets*** across six task types: **"Retrieval", "STS", "PairClassification", "Classification", "Reranking" and "Clustering"**.
378
 
379
  ***NOTE:***
380
+ - **All models are evaluated with their recommended pooling method (`pooler`)**.
381
+ - `mean` pooler: "jina-embeddings-v2-base-en", "m3e-base", "m3e-large", "e5-large-v2", "multilingual-e5-base", "multilingual-e5-large" and "gte-large".
382
+ - `cls` pooler: Other models.
383
  - "jina-embeddings-v2-base-en" model should be loaded with `trust_remote_code`.
384
+
385
  ```bash
386
  python BCEmbedding/tools/eval_mteb/eval_embedding_mteb.py --model_name_or_path {moka-ai/m3e-base | moka-ai/m3e-large} --pooler mean
387
 
 
389
  ```
390
 
391
***Note:***
392
+ - All models are evaluated with their own recommended `pooler`: "jina-embeddings-v2-base-en", "m3e-base", "m3e-large", "e5-large-v2", "multilingual-e5-base", "multilingual-e5-large" and "gte-large" use `mean`, while the other models use `cls`.
393
  - "jina-embeddings-v2-base-en"模型在载入时需要`trust_remote_code`。
394
 
395
  #### 2. Reranker Models
396
 
397
+ Run the following command to evaluate `your_reranker_model` (e.g. "maidalun1020/bce-reranker-base_v1") in **bilingual and crosslingual settings** (e.g. `["en", "zh", "en-zh", "zh-en"]`).
398
 
399
+ Run the following command to evaluate `your_reranker_model` (e.g., `maidalun1020/bce-reranker-base_v1`). The evaluation runs in **bilingual and crosslingual** settings (e.g., `["en", "zh", "en-zh", "zh-en"]`):
400
 
401
  ```bash
402
  python BCEmbedding/tools/eval_mteb/eval_reranker_mteb.py --model_name_or_path maidalun1020/bce-reranker-base_v1
 
497
 
498
  #### 1. Embedding Models
499
 
500
+ | Model | Dimensions | Pooler | Instructions | Retrieval (47) | STS (19) | PairClassification (5) | Classification (21) | Reranking (12) | Clustering (15) | ***AVG*** (119) |
501
+ |:--------|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|
502
+ | bge-base-en-v1.5 | 768 | `cls` | Need | 37.14 | 55.06 | 75.45 | 59.73 | 43.00 | 37.74 | 47.19 |
503
+ | bge-base-zh-v1.5 | 768 | `cls` | Need | 47.63 | 63.72 | 77.40 | 63.38 | 54.95 | 32.56 | 53.62 |
504
+ | bge-large-en-v1.5 | 1024 | `cls` | Need | 37.18 | 54.09 | 75.00 | 59.24 | 42.47 | 37.32 | 46.80 |
505
+ | bge-large-zh-v1.5 | 1024 | `cls` | Need | 47.58 | 64.73 | 79.14 | 64.19 | 55.98 | 33.26 | 54.23 |
506
+ | e5-large-v2 | 1024 | `mean` | Need | 35.98 | 55.23 | 75.28 | 59.53 | 42.12 | 36.51 | 46.52 |
507
+ | gte-large | 1024 | `mean` | Free | 36.68 | 55.22 | 74.29 | 57.73 | 42.44 | 38.51 | 46.67 |
508
+ | gte-large-zh | 1024 | `cls` | Free | 41.15 | 64.62 | 77.58 | 62.04 | 55.62 | 33.03 | 51.51 |
509
+ | jina-embeddings-v2-base-en | 768 | `mean` | Free | 31.58 | 54.28 | 74.84 | 58.42 | 41.16 | 34.67 | 44.29 |
510
+ | m3e-base | 768 | `mean` | Free | 46.29 | 63.93 | 71.84 | 64.08 | 52.38 | 37.84 | 53.54 |
511
+ | m3e-large | 1024 | `mean` | Free | 34.85 | 59.74 | 67.69 | 60.07 | 48.99 | 31.62 | 46.78 |
512
+ | multilingual-e5-base | 768 | `mean` | Need | 54.73 | 65.49 | 76.97 | 69.72 | 55.01 | 38.44 | 58.34 |
513
+ | multilingual-e5-large | 1024 | `mean` | Need | 56.76 | 66.79 | 78.80 | 71.61 | 56.49 | 43.09 | 60.50 |
514
+ | ***bce-embedding-base_v1*** | 768 | `cls` | Free | 57.60 | 65.73 | 74.96 | 69.00 | 57.29 | 38.95 | 59.43 |
515
 
516
  ***NOTE:***
517
+ - Our ***bce-embedding-base_v1*** outperforms other open-source embedding models of comparable size.
518
- ***114 datasets*** of **"Retrieval", "STS", "PairClassification", "Classification", "Reranking" and "Clustering"** in the `["en", "zh", "en-zh", "zh-en"]` setting.
519
  - The [crosslingual evaluation datasets](https://github.com/netease-youdao/BCEmbedding/blob/master/BCEmbedding/evaluation/c_mteb/Retrieval.py) we released belong to `Retrieval` task.
520
  - More evaluation details please check [Embedding Models Evaluation Summary](https://github.com/netease-youdao/BCEmbedding/blob/master/Docs/EvaluationSummary/embedding_eval_summary.md).
521
 
522
***Key points:***
523
+ - Compared with other open-source embedding models of the same size, ***bce-embedding-base_v1*** performs best, coming in only slightly behind the best large models.
524
- The evaluation covers ***114 datasets*** across six task types: **"Retrieval", "STS", "PairClassification", "Classification", "Reranking" and "Clustering"**.
525
- The [crosslingual evaluation datasets](https://github.com/netease-youdao/BCEmbedding/blob/master/BCEmbedding/evaluation/c_mteb/Retrieval.py) we release belong to the `Retrieval` task.
526
- For more details, see the [Embedding Models Evaluation Summary](https://github.com/netease-youdao/BCEmbedding/blob/master/Docs/EvaluationSummary/embedding_eval_summary.md).
 
547
 
548
  #### 1. Multiple Domains Scenarios
549
 
550
+
551
+ ![image/jpeg](https://cdn-uploads.huggingface.co/production/uploads/64745e955aba8edfb2ed561a/NyV_6ZrsaqUluUnxHKR_m.jpeg)
 
552
 
553
  ***NOTE:***
554
  - In `WithoutReranker` setting, our `bce-embedding-base_v1` outperforms all the other embedding models.
 
572
 
573
Scan the QR code below to join our official WeChat group.
574
 
575
+
576
+ ![image/jpeg](https://cdn-uploads.huggingface.co/production/uploads/64745e955aba8edfb2ed561a/mMlIkYn2qPXlivq4wtvyy.jpeg)
577
 
578
  ## ✏️ Citation
579