ldwang committed on
Commit
4bd7067
1 Parent(s): 5d7d79c
Files changed (1)
  1. README.md +34 -16
README.md CHANGED
@@ -2630,25 +2630,36 @@ FlagEmbedding can map any text to a low-dimensional dense vector which can be us
2630
  And it also can be used in vector databases for LLMs.
2631
 
2632
  ************* 🌟**Updates**🌟 *************
2633
- - 09/15/2023: Release [paper](https://arxiv.org/pdf/2309.07597.pdf) and [dataset](https://data.baai.ac.cn/details/BAAI-MTP).
2634
- - 09/12/2023: New Release:
 
 
2635
  - **New reranker model**: release cross-encoder models `BAAI/bge-reranker-base` and `BAAI/bge-reranker-large`, which are more powerful than the embedding models. We recommend using or fine-tuning them to re-rank the top-k documents returned by embedding models.
2636
  - **Updated embedding model**: release the `bge-*-v1.5` embedding models to alleviate the similarity distribution issue and to improve retrieval ability without instruction.
2637
- - 09/07/2023: Update [fine-tune code](https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/baai_general_embedding/README.md): Add script to mine hard negatives and support adding instruction during fine-tuning.
 
 
 
 
 
 
2638
  - 08/09/2023: BGE models are integrated into **Langchain**; you can use them like [this](#using-langchain). The C-MTEB **leaderboard** is [available](https://huggingface.co/spaces/mteb/leaderboard).
2639
- - 08/05/2023: Release base-scale and small-scale models, **best performance among the models of the same size 🤗**
2640
- - 08/02/2023: Release `bge-large-*`(short for BAAI General Embedding) Models, **rank 1st on MTEB and C-MTEB benchmark!** :tada: :tada:
2641
- - 08/01/2023: We release the [Chinese Massive Text Embedding Benchmark](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB) (**C-MTEB**), consisting of 31 test dataset.
 
 
2642
 
2643
 
2644
  ## Model List
2645
 
2646
  `bge` is short for `BAAI general embedding`.
2647
 
2648
- | Model | Language | | Description | query instruction for retrieval\* |
2649
  |:-------------------------------|:--------:| :--------:| :--------:|:--------:|
2650
- | [BAAI/bge-reranker-large](https://huggingface.co/BAAI/bge-reranker-large) | Chinese and English | [Inference](#usage-for-reranker) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/reranker) | a cross-encoder model which is more accurate but less efficient \** | |
2651
- | [BAAI/bge-reranker-base](https://huggingface.co/BAAI/bge-reranker-base) | Chinese and English | [Inference](#usage-for-reranker) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/reranker) | a cross-encoder model which is more accurate but less efficient \** | |
 
2652
  | [BAAI/bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5) | English | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | version 1.5 with more reasonable similarity distribution | `Represent this sentence for searching relevant passages: ` |
2653
  | [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | English | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | version 1.5 with more reasonable similarity distribution | `Represent this sentence for searching relevant passages: ` |
2654
  | [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) | English | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | version 1.5 with more reasonable similarity distribution | `Represent this sentence for searching relevant passages: ` |
@@ -2663,11 +2674,15 @@ And it also can be used in vector databases for LLMs.
2663
  | [BAAI/bge-small-zh](https://huggingface.co/BAAI/bge-small-zh) | Chinese | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | a small-scale model but with competitive performance | `为这个句子生成表示以用于检索相关文章:` |
2664
 
2665
 
2666
- \*: If you need to search the relevant passages to a query, we suggest to add the instruction to the query; in other cases, no instruction is needed, just use the original query directly. In all cases, **no instruction** needs to be added to passages.
2667
 
2668
- \**: Different from embedding model, reranker uses question and document as input and directly output similarity instead of embedding. To balance the accuracy and time cost, cross-encoder is widely used to re-rank top-k documents retrieved by other simple models.
2669
  For example, use a bge embedding model to retrieve the top 100 relevant documents, and then use a bge reranker to re-rank those 100 documents to get the final top-3 results.
2670
 
 
 
 
 
2671
  ## Frequently asked questions
2672
 
2673
  <details>
@@ -2704,7 +2719,11 @@ please select an appropriate similarity threshold based on the similarity distri
2704
  <summary>3. When does the query instruction need to be used</summary>
2705
 
2706
  <!-- ### When does the query instruction need to be used -->
2707
-
 
 
 
 
2708
  For a retrieval task that uses short queries to find long related documents,
2709
  it is recommended to add instructions for these short queries.
2710
  **The best method to decide whether to add instructions for queries is choosing the setting that achieves better performance on your task.**
@@ -2964,7 +2983,7 @@ which is more accurate than embedding model (i.e., bi-encoder) but more time-con
2964
  Therefore, it can be used to re-rank the top-k documents returned by embedding model.
2965
  We train the cross-encoder on multilingual pair data.
2966
  The data format is the same as embedding model, so you can fine-tune it easily following our [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/reranker).
2967
- More details pelease refer to [./FlagEmbedding/reranker/README.md](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/reranker)
2968
 
2969
 
2970
  ## Contact
@@ -2974,7 +2993,8 @@ You also can email Shitao Xiao(stxiao@baai.ac.cn) and Zheng Liu(liuzheng@baai.ac
2974
 
2975
  ## Citation
2976
 
2977
- If you find our work helpful, please cite us:
 
2978
  ```
2979
  @misc{bge_embedding,
2980
  title={C-Pack: Packaged Resources To Advance General Chinese Embedding},
@@ -2989,5 +3009,3 @@ If you find our work helpful, please cite us:
2989
  ## License
2990
  FlagEmbedding is licensed under the [MIT License](https://github.com/FlagOpen/FlagEmbedding/blob/master/LICENSE). The released models can be used for commercial purposes free of charge.
2991
 
2992
-
2993
-
 
2630
  And it also can be used in vector databases for LLMs.
2631
 
2632
  ************* 🌟**Updates**🌟 *************
2633
+ - 10/12/2023: Release [LLM-Embedder](./FlagEmbedding/llm_embedder/README.md), a unified embedding model to support diverse retrieval augmentation needs for LLMs. [Paper](https://arxiv.org/pdf/2310.07554.pdf) :fire:
2634
+ - 09/15/2023: The [technical report](https://arxiv.org/pdf/2309.07597.pdf) of BGE has been released
2635
+ - 09/15/2023: The [massive training data](https://data.baai.ac.cn/details/BAAI-MTP) of BGE has been released
2636
+ - 09/12/2023: New models:
2637
  - **New reranker model**: release cross-encoder models `BAAI/bge-reranker-base` and `BAAI/bge-reranker-large`, which are more powerful than the embedding models. We recommend using or fine-tuning them to re-rank the top-k documents returned by embedding models.
2638
  - **Updated embedding model**: release the `bge-*-v1.5` embedding models to alleviate the similarity distribution issue and to improve retrieval ability without instruction.
2639
+
2640
+
2641
+ <details>
2642
+ <summary>More</summary>
2643
+ <!-- ### More -->
2644
+
2645
+ - 09/07/2023: Update [fine-tune code](https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/baai_general_embedding/README.md): Add script to mine hard negatives and support adding instruction during fine-tuning.
2646
  - 08/09/2023: BGE models are integrated into **Langchain**; you can use them like [this](#using-langchain). The C-MTEB **leaderboard** is [available](https://huggingface.co/spaces/mteb/leaderboard).
2647
+ - 08/05/2023: Release base-scale and small-scale models, **best performance among the models of the same size 🤗**
2648
+ - 08/02/2023: Release `bge-large-*` (short for BAAI General Embedding) models, **rank 1st on the MTEB and C-MTEB benchmarks!** :tada: :tada:
2649
+ - 08/01/2023: We release the [Chinese Massive Text Embedding Benchmark](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB) (**C-MTEB**), consisting of 31 test datasets.
2650
+
2651
+ </details>
2652
 
2653
 
2654
  ## Model List
2655
 
2656
  `bge` is short for `BAAI general embedding`.
2657
 
2658
+ | Model | Language | Usage | Description | Query instruction for retrieval [1] |
2659
  |:-------------------------------|:--------:| :--------:| :--------:|:--------:|
2660
+ | [BAAI/llm-embedder](https://huggingface.co/BAAI/llm-embedder) | English | [Inference](./FlagEmbedding/llm_embedder/README.md) [Fine-tune](./FlagEmbedding/llm_embedder/README.md) | a unified embedding model to support diverse retrieval augmentation needs for LLMs | See [README](./FlagEmbedding/llm_embedder/README.md) |
2661
+ | [BAAI/bge-reranker-large](https://huggingface.co/BAAI/bge-reranker-large) | Chinese and English | [Inference](#usage-for-reranker) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/reranker) | a cross-encoder model which is more accurate but less efficient [2] | |
2662
+ | [BAAI/bge-reranker-base](https://huggingface.co/BAAI/bge-reranker-base) | Chinese and English | [Inference](#usage-for-reranker) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/reranker) | a cross-encoder model which is more accurate but less efficient [2] | |
2663
  | [BAAI/bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5) | English | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | version 1.5 with more reasonable similarity distribution | `Represent this sentence for searching relevant passages: ` |
2664
  | [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | English | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | version 1.5 with more reasonable similarity distribution | `Represent this sentence for searching relevant passages: ` |
2665
  | [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) | English | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | version 1.5 with more reasonable similarity distribution | `Represent this sentence for searching relevant passages: ` |
 
2674
  | [BAAI/bge-small-zh](https://huggingface.co/BAAI/bge-small-zh) | Chinese | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | a small-scale model but with competitive performance | `为这个句子生成表示以用于检索相关文章:` |
2675
 
2676
 
2677
+ [1\]: If you need to search for passages relevant to a query, we suggest adding the instruction to the query; in other cases, no instruction is needed and you can use the original query directly. In all cases, **no instruction** needs to be added to passages.
2678
 
2679
+ [2\]: Unlike an embedding model, a reranker takes a question and a document as input and directly outputs a similarity score instead of an embedding. To balance accuracy and time cost, cross-encoders are widely used to re-rank the top-k documents retrieved by simpler models.
2680
  For example, use a bge embedding model to retrieve the top 100 relevant documents, and then use a bge reranker to re-rank those 100 documents to get the final top-3 results.
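
The retrieve-then-rerank flow described in note [2] can be sketched as follows. This is a minimal illustration, not the FlagEmbedding API: `retrieve_then_rerank`, `toy_embed_score`, and `toy_rerank_score` are hypothetical names, and the two toy scoring functions are simple token-overlap stand-ins for a bge embedding model and a bge reranker. The point is only the control flow: a cheap score ranks the whole collection, and the expensive score is computed for the surviving candidates alone.

```python
from typing import Callable, List, Sequence, Tuple


def retrieve_then_rerank(
    query: str,
    documents: Sequence[str],
    embed_score: Callable[[str, str], float],
    rerank_score: Callable[[str, str], float],
    retrieve_k: int = 100,
    final_k: int = 3,
) -> List[Tuple[str, float]]:
    """Two-stage ranking: cheap retrieval first, expensive re-ranking second."""
    # Stage 1: rank the whole collection with the cheap score, keep retrieve_k.
    candidates = sorted(
        documents, key=lambda d: embed_score(query, d), reverse=True
    )[:retrieve_k]
    # Stage 2: compute the expensive score only for the retrieved candidates.
    rescored = [(d, rerank_score(query, d)) for d in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:final_k]


# Toy stand-ins (assumptions for illustration, NOT the BGE models).
def toy_embed_score(query: str, doc: str) -> float:
    # Number of shared lowercase tokens, standing in for embedding similarity.
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def toy_rerank_score(query: str, doc: str) -> float:
    # Jaccard overlap of tokens, standing in for a cross-encoder score.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if (q | d) else 0.0


docs = [
    "pandas are bears native to China",
    "the giant panda eats bamboo",
    "a guide to bamboo flooring",
    "stock prices fell today",
]
top = retrieve_then_rerank(
    "what does the giant panda eat", docs,
    toy_embed_score, toy_rerank_score,
    retrieve_k=3, final_k=2,
)
for doc, score in top:
    print(f"{score:.3f}  {doc}")
```

With real models, `retrieve_k=100` and `final_k=3` would match the example in the note; the smaller values here only suit the four-document toy collection.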
2681
 
2682
+ All models have been uploaded to Huggingface Hub, and you can see them at https://huggingface.co/BAAI.
2683
+ If you cannot open the Huggingface Hub, you can also download the models at https://model.baai.ac.cn/models .
2684
+
2685
+
2686
  ## Frequently asked questions
2687
 
2688
  <details>
 
2719
  <summary>3. When does the query instruction need to be used</summary>
2720
 
2721
  <!-- ### When does the query instruction need to be used -->
2722
+
2723
+ For the `bge-*-v1.5` models, we improved retrieval ability when no instruction is used.
2724
+ Omitting the instruction causes only a slight degradation in retrieval performance compared with using it.
2725
+ So, for convenience, you can generate embeddings without the instruction in all cases.
2726
+
2727
  For a retrieval task that uses short queries to find long related documents,
2728
  it is recommended to add instructions for these short queries.
2729
  **The best method to decide whether to add instructions for queries is choosing the setting that achieves better performance on your task.**
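
As a rough sketch of the rule above (instruction on queries only, never on passages), the text preparation step could look like this. The helper names `prepare_queries` and `prepare_passages` are hypothetical and not part of FlagEmbedding; the instruction string is the one listed for the English models in the Model List table. The prepared strings would then be passed to whichever encoder you use.

```python
from typing import List

# Instruction string copied from the Model List table (English models).
QUERY_INSTRUCTION_EN = "Represent this sentence for searching relevant passages: "


def prepare_queries(
    queries: List[str],
    instruction: str = QUERY_INSTRUCTION_EN,
    use_instruction: bool = True,
) -> List[str]:
    """Prefix each query with the instruction when use_instruction is True."""
    if not use_instruction:
        # e.g. for convenience with the v1.5 models, as described above
        return list(queries)
    return [instruction + q for q in queries]


def prepare_passages(passages: List[str]) -> List[str]:
    """Passages are returned unchanged: no instruction is ever added."""
    return list(passages)


queries = ["how do pandas digest bamboo"]
passages = ["Giant pandas have a digestive system adapted to a bamboo diet."]

print(prepare_queries(queries))                          # with instruction
print(prepare_queries(queries, use_instruction=False))   # without instruction
print(prepare_passages(passages))                        # always unchanged
```

Keeping `use_instruction` as a switch makes it easy to evaluate both settings on your own task and keep the one that performs better.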
 
2983
  Therefore, it can be used to re-rank the top-k documents returned by embedding model.
2984
  We train the cross-encoder on multilingual pair data.
2985
  The data format is the same as embedding model, so you can fine-tune it easily following our [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/reranker).
2986
+ For more details, please refer to [./FlagEmbedding/reranker/README.md](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/reranker).
2987
 
2988
 
2989
  ## Contact
 
2993
 
2994
  ## Citation
2995
 
2996
+ If you find this repository useful, please consider giving it a star :star: and a citation.
2997
+
2998
  ```
2999
  @misc{bge_embedding,
3000
  title={C-Pack: Packaged Resources To Advance General Chinese Embedding},
 
3009
  ## License
3010
  FlagEmbedding is licensed under the [MIT License](https://github.com/FlagOpen/FlagEmbedding/blob/master/LICENSE). The released models can be used for commercial purposes free of charge.
3011