---
license: bigscience-bloom-rail-1.0
language:
- fr
- en
pipeline_tag: sentence-similarity
---

Bloomz-3b-retriever
-------------------

Introducing Bloomz-3b-retriever, a model based on [Bloomz-3b-sft-chat](https://huggingface.co/cmarkea/bloomz-3b-sft-chat). It produces embedding representations of documents and queries for retrieval tasks, linking a query to its relevant documents. The model is cross-lingual, i.e. language-agnostic between English and French, which makes it well suited to Open Domain Question Answering (ODQA): queries and passages are projected into a shared embedding space whose geometry brings matching pairs closer together.

![embedding](https://i.postimg.cc/L6KC7tvw/embedding.png)

Training
--------

The model is a bi-encoder trained on a corpus of context/query pairs, 50% in English and 50% in French, with the query/context language combinations evenly split (1/4 French-French, 1/4 French-English, 1/4 English-French, 1/4 English-English). The training objective is contrastive: it brings the embedding of each query closer to the embedding of its associated context. The loss function is defined in [Deep Metric Learning using Triplet Network](https://arxiv.org/abs/1412.6622).

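To illustrate the triplet principle behind that objective, here is a toy NumPy sketch of a margin-based triplet loss. This is not the actual training code (the referenced paper uses a softmax-of-distances variant), and `margin` is a hypothetical hyperparameter:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Margin-based triplet loss: push the positive (matching context) at
    least `margin` closer to the anchor (query) than the negative."""
    d_pos = np.linalg.norm(anchor - positive)  # query-to-context distance
    d_neg = np.linalg.norm(anchor - negative)  # query-to-wrong-context distance
    return max(d_pos - d_neg + margin, 0.0)

# Toy embeddings: the positive is already much nearer the anchor than the
# negative, so the margin constraint is satisfied and the loss is zero.
q = np.array([0.0, 0.0])
pos = np.array([0.1, 0.0])
neg = np.array([2.0, 0.0])
print(triplet_loss(q, pos, neg))  # 0.0
```

Minimizing this quantity over many (query, context, wrong-context) triplets is what shapes the shared embedding space.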
Benchmark
---------

Based on the SQuAD evaluation dataset (6,000 queries distributed over 1,200 contexts grouped into 35 themes), we compare performance in terms of the mean rank of the correct context for a query (Top-mean), the standard deviation of that rank (Top-std), and the percentage of queries whose correct context lands in the top-1, top-5, and top-10 results. We compare our models against a TF-IDF trained on the SQuAD train split, CamemBERT, and Sentence-BERT, in both a monolingual setting and a cross-lingual one (query in French, context in English).

| Model (FR/FR)                                                                                       | Top-mean | Top-std | Top-1 (%) | Top-5 (%) | Top-10 (%) |
|-----------------------------------------------------------------------------------------------------|----------|:-------:|-----------|-----------|------------|
| TF-IDF                                                                                              | 128      | 269     | 23        | 46        | 56         |
| [CamemBERT](https://huggingface.co/camembert/camembert-base)                                        | 417      | 347     | 1         | 2         | 3          |
| [Sentence-BERT](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 11       | 41      | 43        | 71        | 82         |
| [Bloomz-560m-retriever](https://huggingface.co/cmarkea/bloomz-560m-retriever)                       | 10       | 47      | 51        | 78        | 86         |
| [Bloomz-3b-retriever](https://huggingface.co/cmarkea/bloomz-3b-retriever)                           | 9        | 37      | 50        | 79        | 87         |

| Model (EN/FR)                                                                                       | Top-mean | Top-std | Top-1 (%) | Top-5 (%) | Top-10 (%) |
|-----------------------------------------------------------------------------------------------------|----------|:-------:|-----------|-----------|------------|
| TF-IDF                                                                                              | 607      | 334     | 0         | 0         | 0          |
| [CamemBERT](https://huggingface.co/camembert/camembert-base)                                        | 432      | 345     | 0         | 1         | 1          |
| [Sentence-BERT](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 12       | 47      | 44        | 73        | 83         |
| [Bloomz-560m-retriever](https://huggingface.co/cmarkea/bloomz-560m-retriever)                       | 10       | 44      | 49        | 77        | 86         |
| [Bloomz-3b-retriever](https://huggingface.co/cmarkea/bloomz-3b-retriever)                           | 9        | 38      | 50        | 78        | 87         |
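For reference, the quantities reported in these tables can be computed from a query-to-context distance matrix as below. This is an illustrative sketch on toy data, not the evaluation code, and it assumes Top-mean/Top-std are the mean and standard deviation of the rank of the correct context (rank 1 = nearest):

```python
import numpy as np

def retrieval_metrics(dist, gold, ks=(1, 5, 10)):
    """Retrieval metrics from a (n_queries, n_contexts) distance matrix.

    `gold[i]` is the index of the correct context for query i; its rank is
    its 1-based position in the distance-sorted list of contexts.
    """
    ranks = np.array([
        int(np.where(np.argsort(row) == g)[0][0]) + 1
        for row, g in zip(dist, gold)
    ])
    metrics = {'top-mean': ranks.mean(), 'top-std': ranks.std()}
    for k in ks:
        metrics[f'top-{k} (%)'] = 100.0 * (ranks <= k).mean()
    return metrics

# Toy example: 2 queries, 3 contexts (smaller distance = closer).
dist = np.array([[0.1, 0.5, 0.9],   # query 0: context 0 is nearest
                 [0.8, 0.2, 0.4]])  # query 1: context 1 is nearest
print(retrieval_metrics(dist, gold=[0, 2]))
```

Here query 0's correct context has rank 1 and query 1's has rank 2, so Top-mean is 1.5 and Top-1 is 50%.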


How to Use Bloomz-3b-retriever
------------------------------

The following example uses the Pipeline API of the Transformers library.

```python
import numpy as np
from scipy.spatial.distance import cdist
from transformers import pipeline

# Feature-extraction pipeline; the embedding of a text is taken here from the
# hidden state of its last token (the pipeline setup lines are a
# reconstruction, only the retrieval step below comes verbatim).
retriever = pipeline('feature-extraction', 'cmarkea/bloomz-3b-retriever')
infer = lambda x: [np.array(ii[0][-1], ndmin=2) for ii in retriever(x)]

list_of_contexts = [...]
emb_contexts = np.concatenate(infer(list_of_contexts), axis=0)
list_of_queries = [...]
emb_queries = np.concatenate(infer(list_of_queries), axis=0)

# Pairwise Euclidean distance between query and context embeddings
dist = cdist(emb_queries, emb_contexts, 'euclidean')
top_k = lambda x: [
    [list_of_contexts[qq] for qq in ii]
    for ii in dist.argsort(axis=-1)[:, :x]
]

# top 5 nearest contexts for each query
top_contexts = top_k(5)
```
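The ranking step can be checked independently of the model on a toy distance matrix (the context labels and distances below are made up for illustration):

```python
import numpy as np

list_of_contexts = ['ctx-a', 'ctx-b', 'ctx-c']
# 2 queries x 3 contexts distance matrix (smaller = closer)
dist = np.array([[0.9, 0.1, 0.5],
                 [0.2, 0.7, 0.3]])
top_k = lambda x: [
    [list_of_contexts[qq] for qq in ii]
    for ii in dist.argsort(axis=-1)[:, :x]
]
print(top_k(2))  # [['ctx-b', 'ctx-c'], ['ctx-a', 'ctx-c']]
```

Each inner list holds the contexts for one query, ordered from nearest to farthest.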