ldwang committed d8d8763 (1 parent: 81e58d4): update readme

Files changed (1): README.md (+63 -34)

README.md (after this commit):

---

<h1 align="center">FlagEmbedding</h1>

<a href=#usage>Usage</a> |
<a href="#evaluation">Evaluation</a> |
<a href="#train">Train</a> |
<a href="#contact">Contact</a> |
<a href="#license">License</a>
<p>
</h4>

For more details, please refer to our GitHub repo: [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding).

[English](README.md) | [中文](https://github.com/FlagOpen/FlagEmbedding/blob/master/README_zh.md)

FlagEmbedding can map any text to a low-dimensional dense vector, which can be used for tasks like retrieval, classification, clustering, or semantic search. It can also be used in vector databases for LLMs.

************* 🌟**Updates**🌟 *************
- 08/09/2023: BGE models are integrated into **Langchain**; you can use them like [**this**](#using-langchain). The C-MTEB **leaderboard** is [available](https://huggingface.co/spaces/mteb/leaderboard).
- 08/05/2023: Release base-scale and small-scale models, the **best performance among models of the same size 🤗**
- 08/02/2023: Release the `bge-large-*` (short for BAAI General Embedding) models, **ranking 1st on the MTEB and C-MTEB benchmarks!**
- 08/01/2023: We release the [Chinese Massive Text Embedding Benchmark](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB) (**C-MTEB**), consisting of 31 test datasets.

`bge` is short for `BAAI general embedding`.

| Model | Language | Description | query instruction for retrieval\* |
|:-------------------------------|:--------:|:--------:|:--------:|
| [BAAI/bge-large-en](https://huggingface.co/BAAI/bge-large-en) | English | rank **1st** on the [MTEB](https://huggingface.co/spaces/mteb/leaderboard) leaderboard | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-base-en](https://huggingface.co/BAAI/bge-base-en) | English | rank **2nd** on the [MTEB](https://huggingface.co/spaces/mteb/leaderboard) leaderboard | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-small-en](https://huggingface.co/BAAI/bge-small-en) | English | a small-scale model with competitive performance | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-large-zh](https://huggingface.co/BAAI/bge-large-zh) | Chinese | rank **1st** on the [C-MTEB](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB) benchmark | `为这个句子生成表示以用于检索相关文章:` |
| [BAAI/bge-large-zh-noinstruct](https://huggingface.co/BAAI/bge-large-zh-noinstruct) | Chinese | trained without instruction; rank **2nd** on the [C-MTEB](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB) benchmark | |
| [BAAI/bge-base-zh](https://huggingface.co/BAAI/bge-base-zh) | Chinese | a base-scale model with ability similar to `bge-large-zh` | `为这个句子生成表示以用于检索相关文章:` |
| [BAAI/bge-small-zh](https://huggingface.co/BAAI/bge-small-zh) | Chinese | a small-scale model with competitive performance | `为这个句子生成表示以用于检索相关文章:` |

\*: If you need to search for **long** relevant passages given a **short** query (the s2p retrieval task), add the instruction to the query; in all other cases, no instruction is needed, just use the original query directly. In all cases, **no instruction** needs to be added to passages.

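For example, a minimal sketch of how query text is built with and without the instruction (the strings are placeholders; the instruction is the one from the table above):

```python
instruction = "为这个句子生成表示以用于检索相关文章:"   # query instruction from the Model List
s2p_query = instruction + "手机开不了机怎么办?"        # short query to long passage: prepend the instruction
sym_query = "手机开不了机怎么办?"                      # similarity, clustering, etc.: use the original text
passage = "样例文档-1"                                 # passages never get the instruction
```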
 
## Usage

Here are some examples of using `bge` models with
[FlagEmbedding](#using-flagembedding), [Sentence-Transformers](#using-sentence-transformers), [Langchain](#using-langchain), or [Huggingface Transformers](#using-huggingface-transformers).

#### Using FlagEmbedding
```
pip install -U FlagEmbedding
```
If this doesn't work for you, see [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/baai_general_embedding/README.md) for more ways to install FlagEmbedding.

```python
from FlagEmbedding import FlagModel
sentences = ["样例数据-1", "样例数据-2"]
model = FlagModel('BAAI/bge-large-zh', query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:")
embeddings_1 = model.encode(sentences)
embeddings_2 = model.encode(sentences)
similarity = embeddings_1 @ embeddings_2.T
print(similarity)

# for the s2p (short query to long passage) retrieval task, use encode_queries(), which automatically adds the instruction to each query
# the corpus in a retrieval task can still use encode() or encode_corpus(), since passages don't need the instruction
queries = ['query_1', 'query_2']
passages = ["样例文档-1", "样例文档-2"]
q_embeddings = model.encode_queries(queries)
p_embeddings = model.encode(passages)
scores = q_embeddings @ p_embeddings.T
```
The value of the argument `query_instruction_for_retrieval` is given in the [Model List](https://github.com/FlagOpen/FlagEmbedding/tree/master#model-list) above.
FlagModel uses all available GPUs when encoding; set `os.environ["CUDA_VISIBLE_DEVICES"]` to choose which GPUs to use.

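For example, a minimal sketch of restricting encoding to specific GPUs (the device ids here are placeholders):

```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"  # set before constructing the model; placeholder device ids

from FlagEmbedding import FlagModel
model = FlagModel('BAAI/bge-large-zh', query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:")
embeddings = model.encode(["样例数据-1", "样例数据-2"])  # encoding now runs only on the selected GPUs
```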
 
#### Using Sentence-Transformers

Using this model is also easy if you have [sentence-transformers](https://www.SBERT.net) installed:

```
pip install -U sentence-transformers
```

```python
from sentence_transformers import SentenceTransformer
sentences = ["样例数据-1", "样例数据-2"]
model = SentenceTransformer('BAAI/bge-large-zh')
embeddings_1 = model.encode(sentences, normalize_embeddings=True)
embeddings_2 = model.encode(sentences, normalize_embeddings=True)
similarity = embeddings_1 @ embeddings_2.T
print(similarity)
```
For the s2p (short query to long passage) retrieval task,
each short query should start with an instruction (see the [Model List](https://github.com/FlagOpen/FlagEmbedding/tree/master#model-list) for the instructions).
The instruction is not needed for passages.
```python
from sentence_transformers import SentenceTransformer
queries = ['query_1', 'query_2']
passages = ["样例文档-1", "样例文档-2"]
instruction = "为这个句子生成表示以用于检索相关文章:"

model = SentenceTransformer('BAAI/bge-large-zh')
q_embeddings = model.encode([instruction + q for q in queries], normalize_embeddings=True)
p_embeddings = model.encode(passages, normalize_embeddings=True)
scores = q_embeddings @ p_embeddings.T
```

#### Using Langchain

You can use `bge` in langchain like this:
```python
from langchain.embeddings import HuggingFaceBgeEmbeddings

model_name = "BAAI/bge-small-en"
model_kwargs = {'device': 'cuda'}
encode_kwargs = {'normalize_embeddings': True}  # set True to compute cosine similarity
model_norm = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)
```

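A quick usage sketch (the `embed_documents` / `embed_query` methods come from langchain's standard `Embeddings` interface; the texts are placeholders):

```python
# assumes `model_norm` from the block above
doc_vectors = model_norm.embed_documents(["样例文档-1", "样例文档-2"])  # embed passages
query_vector = model_norm.embed_query("样例数据-1")                    # embed a single query
print(len(doc_vectors), len(query_vector))                             # number of documents, embedding dimension
```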

#### Using HuggingFace Transformers

With the transformers package, you can use the model like this: first, pass your input through the transformer model, then take the last hidden state of the first token (i.e., [CLS]) as the sentence embedding.

```python
from transformers import AutoTokenizer, AutoModel
import torch

sentences = ["样例数据-1", "样例数据-2"]
tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-zh')
model = AutoModel.from_pretrained('BAAI/bge-large-zh')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
# for the s2p (short query to long passage) retrieval task, add an instruction to each query (no instruction for passages)
# encoded_input = tokenizer([instruction + q for q in queries], padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings, take the [CLS] (first token) hidden state, and normalize it
with torch.no_grad():
    model_output = model(**encoded_input)
sentence_embeddings = torch.nn.functional.normalize(model_output[0][:, 0], p=2, dim=1)
```

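Because the embeddings are L2-normalized, their inner products are cosine similarities; for example (reusing `sentence_embeddings` from above):

```python
scores = sentence_embeddings @ sentence_embeddings.T  # pairwise cosine similarity between the sentences
print(scores)
```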
 
- **C-MTEB**:
We created the C-MTEB benchmark for Chinese text embedding, which consists of 31 datasets from 6 tasks.
Please refer to [C_MTEB](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB/README.md) for a detailed introduction.

| Model | Embedding dimension | Avg | Retrieval | STS | PairClassification | Classification | Reranking | Clustering |
 
We pre-train the model following the method of [RetroMAE](https://github.com/staoxiao/RetroMAE),
which shows promising improvement on retrieval tasks ([paper](https://aclanthology.org/2022.emnlp-main.35.pdf)).
The pre-training was conducted on 24 A100 (40G) GPUs with a batch size of 720.
In RetroMAE, the mask ratios of the encoder and decoder are 0.3 and 0.5, respectively.
We used the AdamW optimizer with a learning rate of 2e-5.

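As an illustration of the asymmetric masking described above (a simplified sketch, not the authors' training code; the helper and the commented names are hypothetical):

```python
import torch

def random_mask(input_ids: torch.Tensor, mask_ratio: float, mask_token_id: int) -> torch.Tensor:
    """Replace a random subset of tokens with the mask token (hypothetical helper)."""
    ids = input_ids.clone()
    mask = torch.rand(ids.shape) < mask_ratio
    ids[mask] = mask_token_id
    return ids

# encoder input is lightly masked, decoder input heavily masked, per the ratios above
# enc_input = random_mask(input_ids, 0.3, tokenizer.mask_token_id)
# dec_input = random_mask(input_ids, 0.5, tokenizer.mask_token_id)
```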
 
**Pre-training data**:
- [wikipedia](https://huggingface.co/datasets/wikipedia)
- [msmarco](https://huggingface.co/datasets/Tevatron/msmarco-passage-corpus)
- Chinese:
    - [wudao](https://github.com/BAAI-WuDao/Data)


**2. Finetune**
 
We used the AdamW optimizer with a learning rate of 1e-5.
The temperature for the contrastive loss is 0.01.

Besides, we add the instruction to the query for the s2p (short query to long passage) retrieval task during training (nothing is added to passages).
For English, the instruction is `Represent this sentence for searching relevant passages: `;
for Chinese, the instruction is `为这个句子生成表示以用于检索相关文章:`.
In evaluation, the instruction should be added to queries for the retrieval task, but not for other tasks.
Note that the instruction is not needed for passages.

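To make the temperature-scaled contrastive objective concrete, here is a simplified sketch of an in-batch-negatives loss (not the actual training code; shapes and names are illustrative):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q: torch.Tensor, p: torch.Tensor, temperature: float = 0.01) -> torch.Tensor:
    """q, p: L2-normalized query/passage embeddings of shape (batch, dim); positives are paired by index."""
    scores = q @ p.T / temperature                      # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)   # the matching passage sits on the diagonal
    return F.cross_entropy(scores, labels)
```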
 
The finetune script is available in this repository: [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/baai_general_embedding/README.md).
You can easily finetune your model with it.

We will continually update the embedding models and training code,
hoping to promote the development of the embedding model community.


## License
FlagEmbedding is licensed under the [MIT License](https://github.com/FlagOpen/FlagEmbedding/blob/master/LICENSE). The released models can be used for commercial purposes free of charge.