ldwang committed on
Commit
05c14ca
1 Parent(s): 0cceaac
Files changed (1)
  1. README.md +72 -16
README.md CHANGED
@@ -2604,11 +2604,25 @@ pipeline_tag: sentence-similarity
2604
 
2605
 
2606
  <h1 align="center">FlagEmbedding</h1>
2607
-
2608
 
2609
  <h4 align="center">
2610
  <p>
2611
  <a href=#model-list>Model List</a> |
 
2612
  <a href=#usage>Usage</a> |
2613
  <a href="#evaluation">Evaluation</a> |
2614
  <a href="#train">Train</a> |
@@ -2617,7 +2631,6 @@ pipeline_tag: sentence-similarity
2617
  <p>
2618
  </h4>
2619
 
2620
- For more details, please refer to our GitHub repo: [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding).
2621
 
2622
  [English](README.md) | [中文](https://github.com/FlagOpen/FlagEmbedding/blob/master/README_zh.md)
2623
 
@@ -2625,9 +2638,9 @@ FlagEmbedding can map any text to a low-dimensional dense vector which can be us
2625
  It can also be used in vector databases for LLMs.
2626
 
2627
  ************* 🌟**Updates**🌟 *************
2628
- - 08/09/2023: BGE models are integrated into **Langchain**; you can use them like [**this**](#using-langchain). The C-MTEB **leaderboard** is [available](https://huggingface.co/spaces/mteb/leaderboard).
2629
  - 08/05/2023: Release base-scale and small-scale models, **best performance among the models of the same size 🤗**
2630
- - 08/02/2023: Release `bge-large-*` (short for BAAI General Embedding) models, which **rank 1st on the MTEB and C-MTEB benchmarks!**
2631
  - 08/01/2023: We release the [Chinese Massive Text Embedding Benchmark](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB) (**C-MTEB**), consisting of 31 test datasets.
2632
 
2633
 
@@ -2637,16 +2650,42 @@ And it also can be used in vector database for LLMs.
2637
 
2638
  | Model | Language | Description | query instruction for retrieval\* |
2639
  |:-------------------------------|:--------:| :--------:| :--------:|
2640
- | [BAAI/bge-large-en](https://huggingface.co/BAAI/bge-large-en) | English | rank **1st** in [MTEB](https://huggingface.co/spaces/mteb/leaderboard) leaderboard | `Represent this sentence for searching relevant passages: ` |
2641
  | [BAAI/bge-base-en](https://huggingface.co/BAAI/bge-base-en) | English | ranks **2nd** on the [MTEB](https://huggingface.co/spaces/mteb/leaderboard) leaderboard | `Represent this sentence for searching relevant passages: ` |
2642
  | [BAAI/bge-small-en](https://huggingface.co/BAAI/bge-small-en) | English | a small-scale model but with competitive performance | `Represent this sentence for searching relevant passages: ` |
2643
- | [BAAI/bge-large-zh](https://huggingface.co/BAAI/bge-large-zh) | Chinese | rank **1st** in [C-MTEB](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB) benchmark | `为这个句子生成表示以用于检索相关文章:` |
2644
  | [BAAI/bge-large-zh-noinstruct](https://huggingface.co/BAAI/bge-large-zh-noinstruct) | Chinese | This model is trained without instruction and ranks **2nd** on the [C-MTEB](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB) benchmark | |
2645
  | [BAAI/bge-base-zh](https://huggingface.co/BAAI/bge-base-zh) | Chinese | a base-scale model with ability similar to `bge-large-zh` | `为这个句子生成表示以用于检索相关文章:` |
2646
  | [BAAI/bge-small-zh](https://huggingface.co/BAAI/bge-small-zh) | Chinese | a small-scale model but with competitive performance | `为这个句子生成表示以用于检索相关文章:` |
2647
 
2648
  \*: If you need to search for **long** relevant passages with a **short** query (the s2p retrieval task), you need to add the instruction to the query; in other cases, no instruction is needed: just use the original query directly. In all cases, **no instruction** needs to be added to passages.
2649
2650
  ## Usage
2651
 
2652
  Here are some examples for using `bge` models with
@@ -2660,10 +2699,11 @@ If it doesn't work for you, you can see [FlagEmbedding](https://github.com/FlagO
2660
 
2661
  ```python
2662
  from FlagEmbedding import FlagModel
2663
- sentences = ["样例数据-1", "样例数据-2"]
 
2664
  model = FlagModel('BAAI/bge-large-zh', query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:")
2665
- embeddings_1 = model.encode(sentences)
2666
- embeddings_2 = model.encode(sentences)
2667
  similarity = embeddings_1 @ embeddings_2.T
2668
  print(similarity)
2669
 
@@ -2678,6 +2718,7 @@ scores = q_embeddings @ p_embeddings.T
2678
  For the value of the argument `query_instruction_for_retrieval`, see the [Model List](https://github.com/FlagOpen/FlagEmbedding/tree/master#model-list).
2679
 
2680
  FlagModel will use all available GPUs when encoding; set `os.environ["CUDA_VISIBLE_DEVICES"]` to choose specific GPUs.
 
2681
 
2682
 
2683
  #### Using Sentence-Transformers
@@ -2689,10 +2730,11 @@ pip install -U sentence-transformers
2689
  ```
2690
  ```python
2691
  from sentence_transformers import SentenceTransformer
2692
- sentences = ["样例数据-1", "样例数据-2"]
 
2693
  model = SentenceTransformer('BAAI/bge-large-zh')
2694
- embeddings_1 = model.encode(sentences, normalize_embeddings=True)
2695
- embeddings_2 = model.encode(sentences, normalize_embeddings=True)
2696
  similarity = embeddings_1 @ embeddings_2.T
2697
  print(similarity)
2698
  ```
@@ -2719,10 +2761,11 @@ from langchain.embeddings import HuggingFaceBgeEmbeddings
2719
  model_name = "BAAI/bge-small-en"
2720
  model_kwargs = {'device': 'cuda'}
2721
  encode_kwargs = {'normalize_embeddings': True} # set True to compute cosine similarity
2722
- model_norm = HuggingFaceBgeEmbeddings(
2723
  model_name=model_name,
2724
  model_kwargs=model_kwargs,
2725
- encode_kwargs=encode_kwargs
 
2726
  )
2727
  ```
2728
 
@@ -2834,7 +2877,7 @@ Besides the negative in the triple, we also adopt in-batch negatives strategy.
2834
  We employ the cross-device negatives sharing method to share negatives among different GPUs,
2835
  which can dramatically **increase the number of negatives**.
2836
 
2837
- We trained our model on 48 A100(40G) GPUs with a large batch size of 32,768 (so there are **65,535** negatives for each query in a batch).
2838
  We used the AdamW optimizer with a learning rate of 1e-5.
2839
  The temperature for contrastive loss is 0.01.
2840
 
@@ -2851,14 +2894,27 @@ You can easily finetune your model with it.
2851
 
2852
  - For English, we collect 230M text pairs from [wikipedia](https://huggingface.co/datasets/wikipedia), [cc-net](https://github.com/facebookresearch/cc_net), and so on.
2853
 
2854
- - For Chinese, we collect 120M text pairs from [wudao](https://github.com/BAAI-WuDao/Data), [simclue](https://github.com/CLUEbenchmark/SimCLUE), and so on.
2855
 
2856
  **The data collection is to be released in the future.**
2857
 
2858
  We will continually update the embedding models and training codes,
2859
  hoping to promote the development of the embedding model community.
2860
 
2861
 
2862
 
2863
  ## License
2864
  FlagEmbedding is licensed under [MIT License](https://github.com/FlagOpen/FlagEmbedding/blob/master/LICENSE). The released models can be used for commercial purposes free of charge.
 
2604
 
2605
 
2606
  <h1 align="center">FlagEmbedding</h1>
2607
+ <p align="center">
2608
+ <a href="https://github.com/FlagOpen/FlagEmbedding">
2609
+ <img alt="Build" src="https://img.shields.io/badge/Contribution-Welcome-blue">
2610
+ </a>
2611
+ <a href="https://github.com/FlagOpen/FlagEmbedding/blob/master/LICENSE">
2612
+ <img alt="License" src="https://img.shields.io/badge/LICENSE-MIT-green">
2613
+ </a>
2614
+ <a href="https://huggingface.co/C-MTEB">
2615
+ <img alt="Build" src="https://img.shields.io/badge/C_MTEB-🤗-yellow">
2616
+ </a>
2617
+ <a href="https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/baai_general_embedding">
2618
+ <img alt="Build" src="https://img.shields.io/badge/FlagEmbedding-1.0-red">
2619
+ </a>
2620
+ </p>
2621
 
2622
  <h4 align="center">
2623
  <p>
2624
  <a href=#model-list>Model List</a> |
2625
+ <a href=#frequently-asked-questions>FAQ</a> |
2626
  <a href=#usage>Usage</a> |
2627
  <a href="#evaluation">Evaluation</a> |
2628
  <a href="#train">Train</a> |
 
2631
  <p>
2632
  </h4>
2633
 
 
2634
 
2635
  [English](README.md) | [中文](https://github.com/FlagOpen/FlagEmbedding/blob/master/README_zh.md)
2636
 
 
2638
  It can also be used in vector databases for LLMs.
2639
 
2640
  ************* 🌟**Updates**🌟 *************
2641
+ - 08/09/2023: BGE models are integrated into **Langchain**; you can use them like [this](#using-langchain). The C-MTEB **leaderboard** is [available](https://huggingface.co/spaces/mteb/leaderboard).
2642
  - 08/05/2023: Release base-scale and small-scale models, **best performance among the models of the same size 🤗**
2643
+ - 08/02/2023: Release `bge-large-*` (short for BAAI General Embedding) models, which **rank 1st on the MTEB and C-MTEB benchmarks!** :tada: :tada:
2644
  - 08/01/2023: We release the [Chinese Massive Text Embedding Benchmark](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB) (**C-MTEB**), consisting of 31 test datasets.
2645
 
2646
 
 
2650
 
2651
  | Model | Language | Description | query instruction for retrieval\* |
2652
  |:-------------------------------|:--------:| :--------:| :--------:|
2653
+ | [BAAI/bge-large-en](https://huggingface.co/BAAI/bge-large-en) | English | :trophy: ranks **1st** on the [MTEB](https://huggingface.co/spaces/mteb/leaderboard) leaderboard | `Represent this sentence for searching relevant passages: ` |
2654
  | [BAAI/bge-base-en](https://huggingface.co/BAAI/bge-base-en) | English | ranks **2nd** on the [MTEB](https://huggingface.co/spaces/mteb/leaderboard) leaderboard | `Represent this sentence for searching relevant passages: ` |
2655
  | [BAAI/bge-small-en](https://huggingface.co/BAAI/bge-small-en) | English | a small-scale model but with competitive performance | `Represent this sentence for searching relevant passages: ` |
2656
+ | [BAAI/bge-large-zh](https://huggingface.co/BAAI/bge-large-zh) | Chinese | :trophy: ranks **1st** on the [C-MTEB](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB) benchmark | `为这个句子生成表示以用于检索相关文章:` |
2657
  | [BAAI/bge-large-zh-noinstruct](https://huggingface.co/BAAI/bge-large-zh-noinstruct) | Chinese | This model is trained without instruction and ranks **2nd** on the [C-MTEB](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB) benchmark | |
2658
  | [BAAI/bge-base-zh](https://huggingface.co/BAAI/bge-base-zh) | Chinese | a base-scale model with ability similar to `bge-large-zh` | `为这个句子生成表示以用于检索相关文章:` |
2659
  | [BAAI/bge-small-zh](https://huggingface.co/BAAI/bge-small-zh) | Chinese | a small-scale model but with competitive performance | `为这个句子生成表示以用于检索相关文章:` |
2660
 
2661
  \*: If you need to search for **long** relevant passages with a **short** query (the s2p retrieval task), you need to add the instruction to the query; in other cases, no instruction is needed: just use the original query directly. In all cases, **no instruction** needs to be added to passages.
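For instance, here is a minimal sketch of this rule with `sentence-transformers` (the queries and passages are placeholders; the instruction string comes from the Model List above): the instruction is prepended to the queries only, never to the passages.

```python
from sentence_transformers import SentenceTransformer

# placeholder data; replace with your own queries and passages
queries = ["样例查询-1", "样例查询-2"]
passages = ["样例文档-1", "样例文档-2"]
instruction = "为这个句子生成表示以用于检索相关文章:"

model = SentenceTransformer('BAAI/bge-large-zh')
# prepend the instruction to each query; passages are encoded as-is
q_embeddings = model.encode([instruction + q for q in queries], normalize_embeddings=True)
p_embeddings = model.encode(passages, normalize_embeddings=True)
scores = q_embeddings @ p_embeddings.T  # higher score = more relevant passage
```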
2662
 
2663
+ ## Frequently asked questions
2664
+
2665
+ 1. The similarity score between two dissimilar sentences is higher than 0.5
2666
+
2667
+ The similarity distribution of the current BGE models is roughly in the interval \[0.6, 1\].
2668
+ So a similarity score greater than 0.5 does not indicate that the two sentences are similar.
2669
+
2670
+ For downstream tasks, such as passage retrieval or semantic similarity,
2671
+ **what matters is the relative order of the scores, not the absolute value.**
2672
+ If you need to filter similar sentences based on a similarity threshold,
2673
+ please select an appropriate similarity threshold based on the similarity distribution on your data (such as 0.8, 0.85, or even 0.9).
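As a concrete sketch of such filtering, reusing the `FlagModel` usage shown in the Usage section below (the threshold value is only an example and should be tuned on your own data):

```python
import numpy as np
from FlagEmbedding import FlagModel

sentences_1 = ["样例数据-1", "样例数据-2"]
sentences_2 = ["样例数据-3", "样例数据-4"]
model = FlagModel('BAAI/bge-large-zh')
embeddings_1 = model.encode(sentences_1)
embeddings_2 = model.encode(sentences_2)
similarity = embeddings_1 @ embeddings_2.T  # same similarity matrix as in the Usage section

threshold = 0.85  # example value; pick it from the score distribution on your own data
for i, j in zip(*np.where(similarity > threshold)):
    print(sentences_1[i], "<->", sentences_2[j], float(similarity[i, j]))
```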
2674
+
2675
+
2676
+ 2. When does the query instruction need to be used?
2677
+
2678
+ For a retrieval task that uses short queries to find long related documents,
2679
+ it is recommended to add instructions for these short queries.
2680
+ For other tasks, it is recommended not to add instructions.
2681
+ For example, for the Quora task, which uses a short question to search for other related short questions,
2682
+ adding the instruction is not recommended.
2683
+ The best way to decide whether to add instructions to queries is to choose whichever setting achieves better performance on your task.
2684
+ In all cases, no instruction needs to be added to the documents/passages; you only need to decide whether to add the instruction to the queries.
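A short sketch of the two settings with `FlagModel`, assuming the `encode_queries` helper shown in the full Usage example (`encode_queries` prepends the instruction passed at construction time, while `encode` does not); the sentences are placeholders:

```python
from FlagEmbedding import FlagModel

model = FlagModel('BAAI/bge-large-zh',
                  query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:")

queries = ["样例查询"]                    # placeholder short query
passages = ["样例文档-1", "样例文档-2"]    # placeholder long passages
p_embeddings = model.encode(passages)     # passages never get the instruction

# Setting A: instruction added to the query (for short-query -> long-passage retrieval)
scores_with = model.encode_queries(queries) @ p_embeddings.T

# Setting B: no instruction (for symmetric tasks such as question-question matching)
scores_without = model.encode(queries) @ p_embeddings.T

# Keep whichever setting gives the better retrieval metrics on your task.
```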
2685
+
2686
+
2687
+
2688
+
2689
  ## Usage
2690
 
2691
  Here are some examples for using `bge` models with
 
2699
 
2700
  ```python
2701
  from FlagEmbedding import FlagModel
2702
+ sentences_1 = ["样例数据-1", "样例数据-2"]
2703
+ sentences_2 = ["样例数据-3", "样例数据-4"]
2704
  model = FlagModel('BAAI/bge-large-zh', query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:")
2705
+ embeddings_1 = model.encode(sentences_1)
2706
+ embeddings_2 = model.encode(sentences_2)
2707
  similarity = embeddings_1 @ embeddings_2.T
2708
  print(similarity)
2709
 
 
2718
  For the value of the argument `query_instruction_for_retrieval`, see the [Model List](https://github.com/FlagOpen/FlagEmbedding/tree/master#model-list).
2719
 
2720
  FlagModel will use all available GPUs when encoding; set `os.environ["CUDA_VISIBLE_DEVICES"]` to choose specific GPUs.
2721
+ You can also set `os.environ["CUDA_VISIBLE_DEVICES"]=""` to make all GPUs unavailable.
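For example, a minimal sketch (the environment variable must be set before the model is created):

```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"   # encode on GPU 0 and GPU 1 only
# os.environ["CUDA_VISIBLE_DEVICES"] = ""    # uncomment to hide all GPUs and run on CPU

from FlagEmbedding import FlagModel
model = FlagModel('BAAI/bge-large-zh', query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:")
```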
2722
 
2723
 
2724
  #### Using Sentence-Transformers
 
2730
  ```
2731
  ```python
2732
  from sentence_transformers import SentenceTransformer
2733
+ sentences_1 = ["样例数据-1", "样例数据-2"]
2734
+ sentences_2 = ["样例数据-3", "样例数据-4"]
2735
  model = SentenceTransformer('BAAI/bge-large-zh')
2736
+ embeddings_1 = model.encode(sentences_1, normalize_embeddings=True)
2737
+ embeddings_2 = model.encode(sentences_2, normalize_embeddings=True)
2738
  similarity = embeddings_1 @ embeddings_2.T
2739
  print(similarity)
2740
  ```
 
2761
  model_name = "BAAI/bge-small-en"
2762
  model_kwargs = {'device': 'cuda'}
2763
  encode_kwargs = {'normalize_embeddings': True} # set True to compute cosine similarity
2764
+ model = HuggingFaceBgeEmbeddings(
2765
  model_name=model_name,
2766
  model_kwargs=model_kwargs,
2767
+ encode_kwargs=encode_kwargs,
2768
+ query_instruction="Represent this sentence for searching relevant passages: "  # use the instruction that matches the chosen model (see Model List)
2769
  )
2770
  ```
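A usage sketch for the object constructed above, assuming the standard LangChain `Embeddings` interface (`embed_query` applies `query_instruction` to the query, while `embed_documents` encodes passages as-is):

```python
# a sketch assuming the standard LangChain Embeddings interface
query_embedding = model.embed_query("what is a panda?")
doc_embeddings = model.embed_documents([
    "The giant panda is a bear species endemic to China.",
    "Pandas feed almost exclusively on bamboo.",
])
print(len(query_embedding), len(doc_embeddings), len(doc_embeddings[0]))
```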
2771
 
 
2877
  We employ the cross-device negatives sharing method to share negatives among different GPUs,
2878
  which can dramatically **increase the number of negatives**.
2879
 
2880
+ We trained our model on 48 A100 (40G) GPUs with a large batch size of 32,784 (so there are **65,567** negatives for each query in a batch).
2881
  We used the AdamW optimizer with a learning rate of 1e-5.
2882
  The temperature for contrastive loss is 0.01.
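The loss described above is a standard InfoNCE-style contrastive objective with in-batch negatives; the following is a simplified single-GPU sketch for illustration (our own sketch, not the released training code), using the 0.01 temperature mentioned above:

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(q_emb: torch.Tensor, p_emb: torch.Tensor, temperature: float = 0.01) -> torch.Tensor:
    """Simplified in-batch-negatives contrastive loss.

    q_emb, p_emb: (batch, dim) L2-normalized query / positive-passage embeddings;
    every other passage in the batch acts as a negative for a given query.
    """
    scores = q_emb @ p_emb.T / temperature                       # (batch, batch) similarity matrix
    labels = torch.arange(q_emb.size(0), device=q_emb.device)    # positives lie on the diagonal
    return F.cross_entropy(scores, labels)

# With cross-device negative sharing, p_emb is additionally gathered from all GPUs
# before computing `scores`, which greatly increases the number of negatives per query.
```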
2883
 
 
2894
 
2895
  - For English, we collect 230M text pairs from [wikipedia](https://huggingface.co/datasets/wikipedia), [cc-net](https://github.com/facebookresearch/cc_net), and so on.
2896
 
2897
+ - For Chinese, we collect 120M text pairs from [wudao](https://github.com/BAAI-WuDao/Data), [simclue](https://github.com/CLUEbenchmark/SimCLUE), and so on.
2898
 
2899
  **The data collection is to be released in the future.**
2900
 
2901
+
2902
+ ## Schedule
2903
+ - [x] Chinese Massive Text Embedding Benchmark
2904
+ - [x] release baai-general-embedding models
2905
+ - [x] release codes for training
2906
+ - [ ] Multilingual model
2907
+ - [ ] Training Datasets
2908
+ - [ ] ...
2909
+
2910
  We will continually update the embedding models and training codes,
2911
  hoping to promote the development of the embedding model community.
2912
 
2913
 
2914
+ ## Contact
2915
+ If you have any questions or suggestions related to this project, feel free to open an issue or submit a pull request.
2915
+ You can also email Shitao Xiao (stxiao@baai.ac.cn) and Zheng Liu (liuzheng@baai.ac.cn).
2917
+
2918
 
2919
  ## License
2920
  FlagEmbedding is licensed under [MIT License](https://github.com/FlagOpen/FlagEmbedding/blob/master/LICENSE). The released models can be used for commercial purposes free of charge.