ldwang committed on
Commit
05c14ca
1 Parent(s): 0cceaac
Files changed (1)
  1. README.md +72 -16
README.md CHANGED
@@ -2604,11 +2604,25 @@ pipeline_tag: sentence-similarity
2604
 
2605
 
2606
  <h1 align="center">FlagEmbedding</h1>
2607
-
2608
 
2609
  <h4 align="center">
2610
  <p>
2611
  <a href=#model-list>Model List</a> |
 
2612
  <a href=#usage>Usage</a> |
2613
  <a href="#evaluation">Evaluation</a> |
2614
  <a href="#train">Train</a> |
@@ -2617,7 +2631,6 @@ pipeline_tag: sentence-similarity
2617
  <p>
2618
  </h4>
2619
 
2620
- For more details, please refer to our GitHub repo: [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding).
2621
 
2622
  [English](README.md) | [中文](https://github.com/FlagOpen/FlagEmbedding/blob/master/README_zh.md)
2623
 
@@ -2625,9 +2638,9 @@ FlagEmbedding can map any text to a low-dimensional dense vector which can be us
2625
  It can also be used in vector databases for LLMs.
2626
 
2627
  ************* 🌟**Updates**🌟 *************
2628
- - 08/09/2023: BGE models are integrated into **Langchain**; you can use them like [**this**](#using-langchain). The C-MTEB **leaderboard** is [available](https://huggingface.co/spaces/mteb/leaderboard).
2629
  - 08/05/2023: Release base-scale and small-scale models, **best performance among the models of the same size 🤗**
2630
- - 08/02/2023: Release `bge-large-*` (short for BAAI General Embedding) models, which **rank 1st on the MTEB and C-MTEB benchmarks!**
2631
  - 08/01/2023: We release the [Chinese Massive Text Embedding Benchmark](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB) (**C-MTEB**), consisting of 31 test datasets.
2632
 
2633
 
@@ -2637,16 +2650,42 @@ And it also can be used in vector database for LLMs.
2637
 
2638
  | Model | Language | Description | query instruction for retrieval\* |
2639
  |:-------------------------------|:--------:| :--------:| :--------:|
2640
- | [BAAI/bge-large-en](https://huggingface.co/BAAI/bge-large-en) | English | rank **1st** in [MTEB](https://huggingface.co/spaces/mteb/leaderboard) leaderboard | `Represent this sentence for searching relevant passages: ` |
2641
  | [BAAI/bge-base-en](https://huggingface.co/BAAI/bge-base-en) | English | ranks **2nd** on the [MTEB](https://huggingface.co/spaces/mteb/leaderboard) leaderboard | `Represent this sentence for searching relevant passages: ` |
2642
  | [BAAI/bge-small-en](https://huggingface.co/BAAI/bge-small-en) | English | a small-scale model but with competitive performance | `Represent this sentence for searching relevant passages: ` |
2643
- | [BAAI/bge-large-zh](https://huggingface.co/BAAI/bge-large-zh) | Chinese | rank **1st** in [C-MTEB](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB) benchmark | `为这个句子生成表示以用于检索相关文章:` |
2644
  | [BAAI/bge-large-zh-noinstruct](https://huggingface.co/BAAI/bge-large-zh-noinstruct) | Chinese | This model is trained without instruction and ranks **2nd** on the [C-MTEB](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB) benchmark | |
2645
  | [BAAI/bge-base-zh](https://huggingface.co/BAAI/bge-base-zh) | Chinese | a base-scale model with ability similar to `bge-large-zh` | `为这个句子生成表示以用于检索相关文章:` |
2646
  | [BAAI/bge-small-zh](https://huggingface.co/BAAI/bge-small-zh) | Chinese | a small-scale model but with competitive performance | `为这个句子生成表示以用于检索相关文章:` |
2647
 
2648
  \*: If you need to search for **long** relevant passages with a **short** query (the s2p retrieval task), you need to add the instruction to the query; in other cases, no instruction is needed: just use the original query directly. In all cases, **no instruction** needs to be added to passages.
2649
2650
  ## Usage
2651
 
2652
  Here are some examples for using `bge` models with
@@ -2660,10 +2699,11 @@ If it doesn't work for you, you can see [FlagEmbedding](https://github.com/FlagO
2660
 
2661
  ```python
2662
  from FlagEmbedding import FlagModel
2663
- sentences = ["样例数据-1", "样例数据-2"]
 
2664
  model = FlagModel('BAAI/bge-large-zh', query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:")
2665
- embeddings_1 = model.encode(sentences)
2666
- embeddings_2 = model.encode(sentences)
2667
  similarity = embeddings_1 @ embeddings_2.T
2668
  print(similarity)
2669
 
@@ -2678,6 +2718,7 @@ scores = q_embeddings @ p_embeddings.T
2678
  For the value of the argument `query_instruction_for_retrieval`, see the [Model List](https://github.com/FlagOpen/FlagEmbedding/tree/master#model-list).
2679
 
2680
  FlagModel will use all available GPUs when encoding; set `os.environ["CUDA_VISIBLE_DEVICES"]` to choose specific GPUs.
 
2681
 
2682
 
2683
  #### Using Sentence-Transformers
@@ -2689,10 +2730,11 @@ pip install -U sentence-transformers
2689
  ```
2690
  ```python
2691
  from sentence_transformers import SentenceTransformer
2692
- sentences = ["样例数据-1", "样例数据-2"]
 
2693
  model = SentenceTransformer('BAAI/bge-large-zh')
2694
- embeddings_1 = model.encode(sentences, normalize_embeddings=True)
2695
- embeddings_2 = model.encode(sentences, normalize_embeddings=True)
2696
  similarity = embeddings_1 @ embeddings_2.T
2697
  print(similarity)
2698
  ```
@@ -2719,10 +2761,11 @@ from langchain.embeddings import HuggingFaceBgeEmbeddings
2719
  model_name = "BAAI/bge-small-en"
2720
  model_kwargs = {'device': 'cuda'}
2721
  encode_kwargs = {'normalize_embeddings': True} # set True to compute cosine similarity
2722
- model_norm = HuggingFaceBgeEmbeddings(
2723
  model_name=model_name,
2724
  model_kwargs=model_kwargs,
2725
- encode_kwargs=encode_kwargs
 
2726
  )
2727
  ```
2728
 
@@ -2834,7 +2877,7 @@ Besides the negative in the triple, we also adopt in-batch negatives strategy.
2834
  We employ the cross-device negatives sharing method to share negatives among different GPUs,
2835
  which can dramatically **increase the number of negatives**.
2836
 
2837
- We trained our model on 48 A100(40G) GPUs with a large batch size of 32,768 (so there are **65,535** negatives for each query in a batch).
2838
  We used the AdamW optimizer with a learning rate of 1e-5.
2839
  The temperature for contrastive loss is 0.01.
2840
 
@@ -2851,14 +2894,27 @@ You can easily finetune your model with it.
2851
 
2852
  - For English, we collect 230M text pairs from [wikipedia](https://huggingface.co/datasets/wikipedia), [cc-net](https://github.com/facebookresearch/cc_net), and so on.
2853
 
2854
- - For Chinese, we collect 120M text pairs from [wudao](https://github.com/BAAI-WuDao/Data), [simclue](https://github.com/CLUEbenchmark/SimCLUE), and so on.
2855
 
2856
  **The data collection is to be released in the future.**
2857
 
2858
  We will continually update the embedding models and training codes,
2859
  hoping to promote the development of the embedding model community.
2860
 
2861
 
2862
 
2863
  ## License
2864
  FlagEmbedding is licensed under [MIT License](https://github.com/FlagOpen/FlagEmbedding/blob/master/LICENSE). The released models can be used for commercial purposes free of charge.
 
2604
 
2605
 
2606
  <h1 align="center">FlagEmbedding</h1>
2607
+ <p align="center">
2608
+ <a href="https://github.com/FlagOpen/FlagEmbedding">
2609
+ <img alt="Build" src="https://img.shields.io/badge/Contribution-Welcome-blue">
2610
+ </a>
2611
+ <a href="https://github.com/FlagOpen/FlagEmbedding/blob/master/LICENSE">
2612
+ <img alt="License" src="https://img.shields.io/badge/LICENSE-MIT-green">
2613
+ </a>
2614
+ <a href="https://huggingface.co/C-MTEB">
2615
+ <img alt="Build" src="https://img.shields.io/badge/C_MTEB-🤗-yellow">
2616
+ </a>
2617
+ <a href="https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/baai_general_embedding">
2618
+ <img alt="Build" src="https://img.shields.io/badge/FlagEmbedding-1.0-red">
2619
+ </a>
2620
+ </p>
2621
 
2622
  <h4 align="center">
2623
  <p>
2624
  <a href=#model-list>Model List</a> |
2625
+ <a href=#frequently-asked-questions>FAQ</a> |
2626
  <a href=#usage>Usage</a> |
2627
  <a href="#evaluation">Evaluation</a> |
2628
  <a href="#train">Train</a> |
 
2631
  <p>
2632
  </h4>
2633
 
 
2634
 
2635
  [English](README.md) | [中文](https://github.com/FlagOpen/FlagEmbedding/blob/master/README_zh.md)
2636
 
 
2638
  It can also be used in vector databases for LLMs.
2639
 
2640
  ************* 🌟**Updates**🌟 *************
2641
+ - 08/09/2023: BGE models are integrated into **Langchain**; you can use them like [this](#using-langchain). The C-MTEB **leaderboard** is [available](https://huggingface.co/spaces/mteb/leaderboard).
2642
  - 08/05/2023: Release base-scale and small-scale models, **best performance among the models of the same size 🤗**
2643
+ - 08/02/2023: Release `bge-large-*` (short for BAAI General Embedding) models, which **rank 1st on the MTEB and C-MTEB benchmarks!** :tada: :tada:
2644
  - 08/01/2023: We release the [Chinese Massive Text Embedding Benchmark](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB) (**C-MTEB**), consisting of 31 test datasets.
2645
 
2646
 
 
2650
 
2651
  | Model | Language | Description | query instruction for retrieval\* |
2652
  |:-------------------------------|:--------:| :--------:| :--------:|
2653
+ | [BAAI/bge-large-en](https://huggingface.co/BAAI/bge-large-en) | English | :trophy: ranks **1st** on the [MTEB](https://huggingface.co/spaces/mteb/leaderboard) leaderboard | `Represent this sentence for searching relevant passages: ` |
2654
  | [BAAI/bge-base-en](https://huggingface.co/BAAI/bge-base-en) | English | ranks **2nd** on the [MTEB](https://huggingface.co/spaces/mteb/leaderboard) leaderboard | `Represent this sentence for searching relevant passages: ` |
2655
  | [BAAI/bge-small-en](https://huggingface.co/BAAI/bge-small-en) | English | a small-scale model but with competitive performance | `Represent this sentence for searching relevant passages: ` |
2656
+ | [BAAI/bge-large-zh](https://huggingface.co/BAAI/bge-large-zh) | Chinese | :trophy: ranks **1st** on the [C-MTEB](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB) benchmark | `为这个句子生成表示以用于检索相关文章:` |
2657
  | [BAAI/bge-large-zh-noinstruct](https://huggingface.co/BAAI/bge-large-zh-noinstruct) | Chinese | This model is trained without instruction and ranks **2nd** on the [C-MTEB](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB) benchmark | |
2658
  | [BAAI/bge-base-zh](https://huggingface.co/BAAI/bge-base-zh) | Chinese | a base-scale model with ability similar to `bge-large-zh` | `为这个句子生成表示以用于检索相关文章:` |
2659
  | [BAAI/bge-small-zh](https://huggingface.co/BAAI/bge-small-zh) | Chinese | a small-scale model but with competitive performance | `为这个句子生成表示以用于检索相关文章:` |
2660
 
2661
  \*: If you need to search for **long** relevant passages with a **short** query (the s2p retrieval task), you need to add the instruction to the query; in other cases, no instruction is needed: just use the original query directly. In all cases, **no instruction** needs to be added to passages.
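For instance, here is a minimal sketch of this rule with `sentence-transformers` (the queries and passages are placeholders; the instruction string comes from the Model List above): the instruction is prepended to the queries only, never to the passages.

```python
from sentence_transformers import SentenceTransformer

# placeholder data; replace with your own queries and passages
queries = ["样例查询-1", "样例查询-2"]
passages = ["样例文档-1", "样例文档-2"]
instruction = "为这个句子生成表示以用于检索相关文章:"

model = SentenceTransformer('BAAI/bge-large-zh')
# prepend the instruction to each query; passages are encoded as-is
q_embeddings = model.encode([instruction + q for q in queries], normalize_embeddings=True)
p_embeddings = model.encode(passages, normalize_embeddings=True)
scores = q_embeddings @ p_embeddings.T  # higher score = more relevant passage
```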
2662
 
2663
+ ## Frequently asked questions
2664
+
2665
+ 1. The similarity score between two dissimilar sentences is higher than 0.5
2666
+
2667
+ The similarity distribution of the current BGE models is roughly in the interval \[0.6, 1\].
2668
+ So a similarity score greater than 0.5 does not indicate that the two sentences are similar.
2669
+
2670
+ For downstream tasks, such as passage retrieval or semantic similarity,
2671
+ **what matters is the relative order of the scores, not the absolute value.**
2672
+ If you need to filter similar sentences based on a similarity threshold,
2673
+ please select an appropriate similarity threshold based on the similarity distribution on your data (such as 0.8, 0.85, or even 0.9).
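As a concrete sketch of such filtering, reusing the `FlagModel` usage shown in the Usage section below (the threshold value is only an example and should be tuned on your own data):

```python
import numpy as np
from FlagEmbedding import FlagModel

sentences_1 = ["样例数据-1", "样例数据-2"]
sentences_2 = ["样例数据-3", "样例数据-4"]
model = FlagModel('BAAI/bge-large-zh')
embeddings_1 = model.encode(sentences_1)
embeddings_2 = model.encode(sentences_2)
similarity = embeddings_1 @ embeddings_2.T  # same similarity matrix as in the Usage section

threshold = 0.85  # example value; pick it from the score distribution on your own data
for i, j in zip(*np.where(similarity > threshold)):
    print(sentences_1[i], "<->", sentences_2[j], float(similarity[i, j]))
```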
2674
+
2675
+
2676
+ 2. When does the query instruction need to be used?
2677
+
2678
+ For a retrieval task that uses short queries to find long related documents,
2679
+ it is recommended to add instructions for these short queries.
2680
+ For other tasks, it is recommended not to add instructions.
2681
+ For example, for the Quora task, which uses a short question to search for other related short questions,
2682
+ adding the instruction is not recommended.
2683
+ The best way to decide whether to add instructions to queries is to choose whichever setting achieves better performance on your task.
2684
+ In all cases, no instruction needs to be added to the documents/passages; you only need to decide whether to add the instruction to the queries.
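A short sketch of the two settings with `FlagModel`, assuming the `encode_queries` helper shown in the full Usage example (`encode_queries` prepends the instruction passed at construction time, while `encode` does not); the sentences are placeholders:

```python
from FlagEmbedding import FlagModel

model = FlagModel('BAAI/bge-large-zh',
                  query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:")

queries = ["样例查询"]                    # placeholder short query
passages = ["样例文档-1", "样例文档-2"]    # placeholder long passages
p_embeddings = model.encode(passages)     # passages never get the instruction

# Setting A: instruction added to the query (for short-query -> long-passage retrieval)
scores_with = model.encode_queries(queries) @ p_embeddings.T

# Setting B: no instruction (for symmetric tasks such as question-question matching)
scores_without = model.encode(queries) @ p_embeddings.T

# Keep whichever setting gives the better retrieval metrics on your task.
```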
2685
+
2686
+
2687
+
2688
+
2689
  ## Usage
2690
 
2691
  Here are some examples for using `bge` models with
 
2699
 
2700
  ```python
2701
  from FlagEmbedding import FlagModel
2702
+ sentences_1 = ["样例数据-1", "样例数据-2"]
2703
+ sentences_2 = ["样例数据-3", "样例数据-4"]
2704
  model = FlagModel('BAAI/bge-large-zh', query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:")
2705
+ embeddings_1 = model.encode(sentences_1)
2706
+ embeddings_2 = model.encode(sentences_2)
2707
  similarity = embeddings_1 @ embeddings_2.T
2708
  print(similarity)
2709
 
 
2718
  For the value of the argument `query_instruction_for_retrieval`, see the [Model List](https://github.com/FlagOpen/FlagEmbedding/tree/master#model-list).
2719
 
2720
  FlagModel will use all available GPUs when encoding; set `os.environ["CUDA_VISIBLE_DEVICES"]` to choose specific GPUs.
2721
+ You can also set `os.environ["CUDA_VISIBLE_DEVICES"]=""` to make all GPUs unavailable.
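For example, a minimal sketch (the environment variable must be set before the model is created):

```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"   # encode on GPU 0 and GPU 1 only
# os.environ["CUDA_VISIBLE_DEVICES"] = ""    # uncomment to hide all GPUs and run on CPU

from FlagEmbedding import FlagModel
model = FlagModel('BAAI/bge-large-zh', query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:")
```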
2722
 
2723
 
2724
  #### Using Sentence-Transformers
 
2730
  ```
2731
  ```python
2732
  from sentence_transformers import SentenceTransformer
2733
+ sentences_1 = ["样例数据-1", "样例数据-2"]
2734
+ sentences_2 = ["样例数据-3", "样例数据-4"]
2735
  model = SentenceTransformer('BAAI/bge-large-zh')
2736
+ embeddings_1 = model.encode(sentences_1, normalize_embeddings=True)
2737
+ embeddings_2 = model.encode(sentences_2, normalize_embeddings=True)
2738
  similarity = embeddings_1 @ embeddings_2.T
2739
  print(similarity)
2740
  ```
 
2761
  model_name = "BAAI/bge-small-en"
2762
  model_kwargs = {'device': 'cuda'}
2763
  encode_kwargs = {'normalize_embeddings': True} # set True to compute cosine similarity
2764
+ model = HuggingFaceBgeEmbeddings(
2765
  model_name=model_name,
2766
  model_kwargs=model_kwargs,
2767
+ encode_kwargs=encode_kwargs,
2768
+ query_instruction="Represent this sentence for searching relevant passages: "  # use the instruction that matches the chosen model (see Model List)
2769
  )
2770
  ```
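A usage sketch for the object constructed above, assuming the standard LangChain `Embeddings` interface (`embed_query` applies `query_instruction` to the query, while `embed_documents` encodes passages as-is):

```python
# a sketch assuming the standard LangChain Embeddings interface
query_embedding = model.embed_query("what is a panda?")
doc_embeddings = model.embed_documents([
    "The giant panda is a bear species endemic to China.",
    "Pandas feed almost exclusively on bamboo.",
])
print(len(query_embedding), len(doc_embeddings), len(doc_embeddings[0]))
```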
2771
 
 
2877
  We employ the cross-device negatives sharing method to share negatives among different GPUs,
2878
  which can dramatically **increase the number of negatives**.
2879
 
2880
+ We trained our model on 48 A100 (40G) GPUs with a large batch size of 32,784 (so there are **65,567** negatives for each query in a batch).
2881
  We used the AdamW optimizer with a learning rate of 1e-5.
2882
  The temperature for contrastive loss is 0.01.
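The loss described above is a standard InfoNCE-style contrastive objective with in-batch negatives; the following is a simplified single-GPU sketch for illustration (our own sketch, not the released training code), using the 0.01 temperature mentioned above:

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(q_emb: torch.Tensor, p_emb: torch.Tensor, temperature: float = 0.01) -> torch.Tensor:
    """Simplified in-batch-negatives contrastive loss.

    q_emb, p_emb: (batch, dim) L2-normalized query / positive-passage embeddings;
    every other passage in the batch acts as a negative for a given query.
    """
    scores = q_emb @ p_emb.T / temperature                       # (batch, batch) similarity matrix
    labels = torch.arange(q_emb.size(0), device=q_emb.device)    # positives lie on the diagonal
    return F.cross_entropy(scores, labels)

# With cross-device negative sharing, p_emb is additionally gathered from all GPUs
# before computing `scores`, which greatly increases the number of negatives per query.
```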
2883
 
 
2894
 
2895
  - For English, we collect 230M text pairs from [wikipedia](https://huggingface.co/datasets/wikipedia), [cc-net](https://github.com/facebookresearch/cc_net), and so on.
2896
 
2897
+ - For Chinese, we collect 120M text pairs from [wudao](https://github.com/BAAI-WuDao/Data), [simclue](https://github.com/CLUEbenchmark/SimCLUE), and so on.
2898
 
2899
  **The data collection is to be released in the future.**
2900
 
2901
+
2902
+ ## Schedule
2903
+ - [x] Chinese Massive Text Embedding Benchmark
2904
+ - [x] release baai-general-embedding models
2905
+ - [x] release codes for training
2906
+ - [ ] Multilingual model
2907
+ - [ ] Training Datasets
2908
+ - [ ] ...
2909
+
2910
  We will continually update the embedding models and training codes,
2911
  hoping to promote the development of the embedding model community.
2912
 
2913
 
2914
+ ## Contact
2915
+ If you have any questions or suggestions related to this project, feel free to open an issue or submit a pull request.
2915
+ You can also email Shitao Xiao (stxiao@baai.ac.cn) and Zheng Liu (liuzheng@baai.ac.cn).
2917
+
2918
 
2919
  ## License
2920
  FlagEmbedding is licensed under [MIT License](https://github.com/FlagOpen/FlagEmbedding/blob/master/LICENSE). The released models can be used for commercial purposes free of charge.