ldwang committed on
Commit
cad5a18
1 Parent(s): b5eb67a
Files changed (1)
  1. README.md +34 -16
README.md CHANGED
@@ -2617,6 +2617,7 @@ language:
2617
  <a href="#evaluation">Evaluation</a> |
2618
  <a href="#train">Train</a> |
2619
  <a href="#contact">Contact</a> |
 
2620
  <a href="#license">License</a>
2621
  <p>
2622
  </h4>
@@ -2630,6 +2631,7 @@ FlagEmbedding can map any text to a low-dimensional dense vector which can be us
2630
  It can also be used in vector databases for LLMs.
2631
 
2632
  ************* 🌟**Updates**🌟 *************
 
2633
  - 09/12/2023: New Release:
2634
  - **New reranker model**: release cross-encoder models `BAAI/bge-reranker-base` and `BAAI/bge-reranker-large`, which are more powerful than the embedding models. We recommend using or fine-tuning them to re-rank the top-k documents returned by embedding models.
2635
  - **Updated embedding model**: release the `bge-*-v1.5` embedding models to alleviate the issue of the similarity distribution and enhance retrieval ability without instruction.
@@ -2664,10 +2666,9 @@ And it also can be used in vector databases for LLMs.
2664
 
2665
  \*: If you need to search for passages relevant to a query, we suggest adding the instruction to the query; in other cases, no instruction is needed and you can use the original query directly. In all cases, **no instruction** needs to be added to passages.
2666
 
2667
- \**: Different embedding model, reranker is a cross-encoder, which cannot be used to generate embedding. To balance the accuracy and time cost, cross-encoder is widely used to re-rank top-k documents retrieved by other simple models.
2668
  For example, use the bge embedding model to retrieve the top 100 relevant documents, and then use the bge reranker to re-rank those 100 documents to get the final top-3 results.
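
The retrieve-then-rerank flow described in this note can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the FlagEmbedding implementation: `retrieve_then_rerank`, `toy_rerank_score`, the 2-d vectors, and the documents are all made up for the example. In a real pipeline the vectors would come from a bge embedding model and the second-stage score from the reranker (for instance `reranker.compute_score`).

```python
from typing import Callable, List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve_then_rerank(
    query: str,
    query_vec: Sequence[float],
    docs: List[str],
    doc_vecs: List[Sequence[float]],
    rerank_score: Callable[[str, str], float],
    first_stage_k: int = 100,
    final_k: int = 3,
) -> List[Tuple[str, float]]:
    # Stage 1: cheap embedding similarity over the whole collection.
    order = sorted(
        range(len(docs)),
        key=lambda i: dot(query_vec, doc_vecs[i]),
        reverse=True,
    )
    candidates = order[:first_stage_k]
    # Stage 2: the more expensive pairwise scorer runs on the candidates only.
    rescored = [(docs[i], rerank_score(query, docs[i])) for i in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:final_k]


def toy_rerank_score(query: str, passage: str) -> float:
    """Hypothetical stand-in for a cross-encoder: counts shared words."""
    return float(len(set(query.split()) & set(passage.split())))


# Made-up documents and 2-d "embeddings", for illustration only.
docs = [
    "pandas eat bamboo",
    "the giant panda is a bear",
    "hello world",
    "bears live in forests",
]
doc_vecs = [[0.9, 0.1], [0.8, 0.2], [0.0, 1.0], [0.5, 0.5]]

top = retrieve_then_rerank(
    "what is a panda bear", [1.0, 0.0], docs, doc_vecs,
    toy_rerank_score, first_stage_k=3, final_k=2,
)
print(top)
```

The design point is that the pairwise scorer is called at most `first_stage_k` times per query, regardless of how large the collection is.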
2669
 
2670
-
2671
  ## Frequently asked questions
2672
 
2673
  <details>
@@ -2730,7 +2731,9 @@ If it doesn't work for you, you can see [FlagEmbedding](https://github.com/FlagO
2730
  from FlagEmbedding import FlagModel
2731
  sentences_1 = ["样例数据-1", "样例数据-2"]
2732
  sentences_2 = ["样例数据-3", "样例数据-4"]
2733
- model = FlagModel('BAAI/bge-large-zh', query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:")
 
 
2734
  embeddings_1 = model.encode(sentences_1)
2735
  embeddings_2 = model.encode(sentences_2)
2736
  similarity = embeddings_1 @ embeddings_2.T
@@ -2761,7 +2764,7 @@ pip install -U sentence-transformers
2761
  from sentence_transformers import SentenceTransformer
2762
  sentences_1 = ["样例数据-1", "样例数据-2"]
2763
  sentences_2 = ["样例数据-3", "样例数据-4"]
2764
- model = SentenceTransformer('BAAI/bge-large-zh')
2765
  embeddings_1 = model.encode(sentences_1, normalize_embeddings=True)
2766
  embeddings_2 = model.encode(sentences_2, normalize_embeddings=True)
2767
  similarity = embeddings_1 @ embeddings_2.T
@@ -2776,7 +2779,7 @@ queries = ['query_1', 'query_2']
2776
  passages = ["样例文档-1", "样例文档-2"]
2777
  instruction = "为这个句子生成表示以用于检索相关文章:"
2778
 
2779
- model = SentenceTransformer('BAAI/bge-large-zh')
2780
  q_embeddings = model.encode([instruction+q for q in queries], normalize_embeddings=True)
2781
  p_embeddings = model.encode(passages, normalize_embeddings=True)
2782
  scores = q_embeddings @ p_embeddings.T
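
The snippets in these hunks pass `normalize_embeddings=True` and then take a matrix product. A small standard-library sketch of why that product is a cosine similarity follows; the helper names and the two vectors are made up for illustration, and the vectors are assumed to be non-zero.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Textbook cosine similarity: a.b / (|a| * |b|)."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]

# After normalization, the plain dot product equals the cosine similarity.
normalized_dot = dot(l2_normalize(a), l2_normalize(b))
print(normalized_dot, cosine(a, b))
```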
@@ -2787,7 +2790,7 @@ scores = q_embeddings @ p_embeddings.T
2787
  You can use `bge` in langchain like this:
2788
  ```python
2789
  from langchain.embeddings import HuggingFaceBgeEmbeddings
2790
- model_name = "BAAI/bge-small-en"
2791
  model_kwargs = {'device': 'cuda'}
2792
  encode_kwargs = {'normalize_embeddings': True} # set True to compute cosine similarity
2793
  model = HuggingFaceBgeEmbeddings(
@@ -2811,8 +2814,8 @@ import torch
2811
  sentences = ["样例数据-1", "样例数据-2"]
2812
 
2813
  # Load model from HuggingFace Hub
2814
- tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-zh')
2815
- model = AutoModel.from_pretrained('BAAI/bge-large-zh')
2816
  model.eval()
2817
 
2818
  # Tokenize sentences
@@ -2832,6 +2835,7 @@ print("Sentence embeddings:", sentence_embeddings)
2832
 
2833
  ### Usage for Reranker
2834
 
 
2835
  You can get a relevance score by inputting query and passage to the reranker.
2836
  The reranker is optimized based on cross-entropy loss, so the relevance score is not bounded to a specific range.
2837
 
@@ -2841,10 +2845,10 @@ The reranker is optimized based cross-entropy loss, so the relevance score is no
2841
  pip install -U FlagEmbedding
2842
  ```
2843
 
2844
- Get relevance score:
2845
  ```python
2846
  from FlagEmbedding import FlagReranker
2847
- reranker = FlagReranker('BAAI/bge-reranker-base', use_fp16=True) #use fp16 can speed up computing
2848
 
2849
  score = reranker.compute_score(['query', 'passage'])
2850
  print(score)
@@ -2858,10 +2862,10 @@ print(scores)
2858
 
2859
  ```python
2860
  import torch
2861
- from transformers import AutoModelForSequenceClassification, AutoTokenizer, BatchEncoding, PreTrainedTokenizerFast
2862
 
2863
- tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-reranker-base')
2864
- model = AutoModelForSequenceClassification.from_pretrained('BAAI/bge-reranker-base')
2865
  model.eval()
2866
 
2867
  pairs = [['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']]
@@ -2927,7 +2931,7 @@ Please refer to [C_MTEB](https://github.com/FlagOpen/FlagEmbedding/blob/master/C
2927
  - **Reranking**:
2928
  See [C_MTEB](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB/) for evaluation script.
2929
 
2930
- | Model | T2Reranking | T2RerankingZh2En\* | T2RerankingEn2Zh\* | MmarcoReranking | CMedQAv1 | CMedQAv2 | Avg |
2931
  |:-------------------------------|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|
2932
  | text2vec-base-multilingual | 64.66 | 62.94 | 62.51 | 14.37 | 48.46 | 48.6 | 50.26 |
2933
  | multilingual-e5-small | 65.62 | 60.94 | 56.41 | 29.91 | 67.26 | 66.54 | 57.78 |
@@ -2940,13 +2944,13 @@ See [C_MTEB](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB/) for
2940
  | [BAAI/bge-reranker-base](https://huggingface.co/BAAI/bge-reranker-base) | 67.28 | 63.95 | 60.45 | 35.46 | 81.26 | 84.1 | 65.42 |
2941
  | [BAAI/bge-reranker-large](https://huggingface.co/BAAI/bge-reranker-large) | 67.6 | 64.03 | 61.44 | 37.16 | 82.15 | 84.18 | 66.09 |
2942
 
2943
- \* : T2RerankingZh2En and T2RerankingEn2Zh are cross-language retrieval task
2944
 
2945
  ## Train
2946
 
2947
  ### BAAI Embedding
2948
 
2949
- We pre-train the models using retromae and train them on large-scale pairs data using contrastive learning.
2950
  **You can fine-tune the embedding model on your data following our [examples](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune).**
2951
  We also provide a [pre-train example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/pretrain).
2952
  Note that the goal of pre-training is to reconstruct the text, so the pre-trained model cannot be used for similarity calculation directly; it needs to be fine-tuned.
@@ -2969,6 +2973,20 @@ If you have any question or suggestion related to this project, feel free to ope
2969
  You can also email Shitao Xiao (stxiao@baai.ac.cn) and Zheng Liu (liuzheng@baai.ac.cn).
2970
 
2971
 
2972
  ## License
2973
  FlagEmbedding is licensed under the [MIT License](https://github.com/FlagOpen/FlagEmbedding/blob/master/LICENSE). The released models can be used for commercial purposes free of charge.
2974
 
 
2617
  <a href="#evaluation">Evaluation</a> |
2618
  <a href="#train">Train</a> |
2619
  <a href="#contact">Contact</a> |
2620
+ <a href="#citation">Citation</a> |
2621
  <a href="#license">License</a>
2622
  <p>
2623
  </h4>
 
2631
  It can also be used in vector databases for LLMs.
2632
 
2633
  ************* 🌟**Updates**🌟 *************
2634
+ - 09/15/2023: Release [paper](https://arxiv.org/pdf/2309.07597.pdf) and [dataset](https://data.baai.ac.cn/details/BAAI-MTP).
2635
  - 09/12/2023: New Release:
2636
  - **New reranker model**: release cross-encoder models `BAAI/bge-reranker-base` and `BAAI/bge-reranker-large`, which are more powerful than the embedding models. We recommend using or fine-tuning them to re-rank the top-k documents returned by embedding models.
2637
  - **Updated embedding model**: release the `bge-*-v1.5` embedding models to alleviate the issue of the similarity distribution and enhance retrieval ability without instruction.
 
2666
 
2667
  \*: If you need to search for passages relevant to a query, we suggest adding the instruction to the query; in other cases, no instruction is needed and you can use the original query directly. In all cases, **no instruction** needs to be added to passages.
2668
 
2669
+ \**: Unlike the embedding model, the reranker takes a question and a document as input and directly outputs a similarity score instead of an embedding. To balance accuracy and time cost, a cross-encoder is widely used to re-rank the top-k documents retrieved by other, simpler models.
2670
  For example, use the bge embedding model to retrieve the top 100 relevant documents, and then use the bge reranker to re-rank those 100 documents to get the final top-3 results.
2671
 
 
2672
  ## Frequently asked questions
2673
 
2674
  <details>
 
2731
  from FlagEmbedding import FlagModel
2732
  sentences_1 = ["样例数据-1", "样例数据-2"]
2733
  sentences_2 = ["样例数据-3", "样例数据-4"]
2734
+ model = FlagModel('BAAI/bge-large-zh-v1.5',
2735
+ query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:",
2736
+ use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation
2737
  embeddings_1 = model.encode(sentences_1)
2738
  embeddings_2 = model.encode(sentences_2)
2739
  similarity = embeddings_1 @ embeddings_2.T
 
2764
  from sentence_transformers import SentenceTransformer
2765
  sentences_1 = ["样例数据-1", "样例数据-2"]
2766
  sentences_2 = ["样例数据-3", "样例数据-4"]
2767
+ model = SentenceTransformer('BAAI/bge-large-zh-v1.5')
2768
  embeddings_1 = model.encode(sentences_1, normalize_embeddings=True)
2769
  embeddings_2 = model.encode(sentences_2, normalize_embeddings=True)
2770
  similarity = embeddings_1 @ embeddings_2.T
 
2779
  passages = ["样例文档-1", "样例文档-2"]
2780
  instruction = "为这个句子生成表示以用于检索相关文章:"
2781
 
2782
+ model = SentenceTransformer('BAAI/bge-large-zh-v1.5')
2783
  q_embeddings = model.encode([instruction+q for q in queries], normalize_embeddings=True)
2784
  p_embeddings = model.encode(passages, normalize_embeddings=True)
2785
  scores = q_embeddings @ p_embeddings.T
 
2790
  You can use `bge` in langchain like this:
2791
  ```python
2792
  from langchain.embeddings import HuggingFaceBgeEmbeddings
2793
+ model_name = "BAAI/bge-large-en-v1.5"
2794
  model_kwargs = {'device': 'cuda'}
2795
  encode_kwargs = {'normalize_embeddings': True} # set True to compute cosine similarity
2796
  model = HuggingFaceBgeEmbeddings(
 
2814
  sentences = ["样例数据-1", "样例数据-2"]
2815
 
2816
  # Load model from HuggingFace Hub
2817
+ tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-zh-v1.5')
2818
+ model = AutoModel.from_pretrained('BAAI/bge-large-zh-v1.5')
2819
  model.eval()
2820
 
2821
  # Tokenize sentences
 
2835
 
2836
  ### Usage for Reranker
2837
 
2838
+ Unlike the embedding model, the reranker takes a question and a document as input and directly outputs a similarity score instead of an embedding.
2839
  You can get a relevance score by inputting query and passage to the reranker.
2840
  The reranker is optimized based on cross-entropy loss, so the relevance score is not bounded to a specific range.
2841
 
 
2845
  pip install -U FlagEmbedding
2846
  ```
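
Because the reranker's relevance score is not bounded to a specific range, a caller who wants values in (0, 1) could apply a sigmoid. This is an assumption for illustration, not something the README prescribes, and the raw scores below are made up. A sigmoid is monotonic, so the ranking is unchanged and higher still means more relevant.

```python
import math
from typing import List


def sigmoid(score: float) -> float:
    """Numerically stable logistic function mapping any real score into (0, 1)."""
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    e = math.exp(score)
    return e / (1.0 + e)


# Hypothetical raw reranker scores (not real model output).
raw_scores: List[float] = [-5.7, 0.0, 5.3]
bounded = [sigmoid(s) for s in raw_scores]
print(bounded)
```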
2847
 
2848
+ Get relevance scores (higher scores indicate more relevance):
2849
  ```python
2850
  from FlagEmbedding import FlagReranker
2851
+ reranker = FlagReranker('BAAI/bge-reranker-large', use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation
2852
 
2853
  score = reranker.compute_score(['query', 'passage'])
2854
  print(score)
 
2862
 
2863
  ```python
2864
  import torch
2865
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer
2866
 
2867
+ tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-reranker-large')
2868
+ model = AutoModelForSequenceClassification.from_pretrained('BAAI/bge-reranker-large')
2869
  model.eval()
2870
 
2871
  pairs = [['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']]
 
2931
  - **Reranking**:
2932
  See [C_MTEB](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB/) for evaluation script.
2933
 
2934
+ | Model | T2Reranking | T2RerankingZh2En\* | T2RerankingEn2Zh\* | MMarcoReranking | CMedQAv1 | CMedQAv2 | Avg |
2935
  |:-------------------------------|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|
2936
  | text2vec-base-multilingual | 64.66 | 62.94 | 62.51 | 14.37 | 48.46 | 48.6 | 50.26 |
2937
  | multilingual-e5-small | 65.62 | 60.94 | 56.41 | 29.91 | 67.26 | 66.54 | 57.78 |
 
2944
  | [BAAI/bge-reranker-base](https://huggingface.co/BAAI/bge-reranker-base) | 67.28 | 63.95 | 60.45 | 35.46 | 81.26 | 84.1 | 65.42 |
2945
  | [BAAI/bge-reranker-large](https://huggingface.co/BAAI/bge-reranker-large) | 67.6 | 64.03 | 61.44 | 37.16 | 82.15 | 84.18 | 66.09 |
2946
 
2947
+ \* : T2RerankingZh2En and T2RerankingEn2Zh are cross-language retrieval tasks
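
The `Avg` column appears to be the unweighted mean of the six task columns. The README excerpt does not state this, so treat it as an inference; the short check below shows that the four rows visible in this diff are consistent with it, using the values copied from the table.

```python
# (six task scores, reported Avg) for the rows visible in this diff.
rows = {
    "text2vec-base-multilingual": ([64.66, 62.94, 62.51, 14.37, 48.46, 48.6], 50.26),
    "multilingual-e5-small": ([65.62, 60.94, 56.41, 29.91, 67.26, 66.54], 57.78),
    "BAAI/bge-reranker-base": ([67.28, 63.95, 60.45, 35.46, 81.26, 84.1], 65.42),
    "BAAI/bge-reranker-large": ([67.6, 64.03, 61.44, 37.16, 82.15, 84.18], 66.09),
}

# Recompute each mean, rounded to two decimals like the table.
recomputed = {
    name: round(sum(scores) / len(scores), 2)
    for name, (scores, _) in rows.items()
}

for name, (_, reported_avg) in rows.items():
    print(f"{name}: recomputed={recomputed[name]:.2f} reported={reported_avg:.2f}")
```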
2948
 
2949
  ## Train
2950
 
2951
  ### BAAI Embedding
2952
 
2953
+ We pre-train the models using [RetroMAE](https://github.com/staoxiao/RetroMAE) and train them on large-scale paired data using contrastive learning.
2954
  **You can fine-tune the embedding model on your data following our [examples](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune).**
2955
  We also provide a [pre-train example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/pretrain).
2956
  Note that the goal of pre-training is to reconstruct the text, so the pre-trained model cannot be used for similarity calculation directly; it needs to be fine-tuned.
 
2973
  You can also email Shitao Xiao (stxiao@baai.ac.cn) and Zheng Liu (liuzheng@baai.ac.cn).
2974
 
2975
 
2976
+ ## Citation
2977
+
2978
+ If you find our work helpful, please cite us:
2979
+ ```
2980
+ @misc{bge_embedding,
2981
+ title={C-Pack: Packaged Resources To Advance General Chinese Embedding},
2982
+ author={Shitao Xiao and Zheng Liu and Peitian Zhang and Niklas Muennighoff},
2983
+ year={2023},
2984
+ eprint={2309.07597},
2985
+ archivePrefix={arXiv},
2986
+ primaryClass={cs.CL}
2987
+ }
2988
+ ```
2989
+
2990
  ## License
2991
  FlagEmbedding is licensed under the [MIT License](https://github.com/FlagOpen/FlagEmbedding/blob/master/LICENSE). The released models can be used for commercial purposes free of charge.
2992