Update README.md

8d832ab verified 16 days ago

No virus

6.28 kB

	---
	license: bigscience-bloom-rail-1.0
	language:
	- fr
	- en
	pipeline_tag: feature-extraction
	---

	# Note


	We now strongly recommend using the [Bloomz-560m-retriever-v2](https://huggingface.co/cmarkea/bloomz-560m-retriever-v2) model, which offers significantly superior performance.

	Bloomz-560m-retriever
	---------------------

	We introduce the Bloomz-560m-retriever based on the [Bloomz-560m-sft-chat](https://huggingface.co/cmarkea/bloomz-560m-sft-chat) model. This model enables the creation of an embedding representation of text and queries for a retrieval task, linking queries to documents. The model is designed to be cross-language, meaning it is language-agnostic (English/French). This model is ideal for Open Domain Question Answering (ODQA), projecting queries and text with an algebraic structure to bring them closer together.

	![embedding](https://i.postimg.cc/L6KC7tvw/embedding.png)

	Training
	--------

	It is a bi-encoder trained on a corpus of context/query pairs, with 50% in English and 50% in French. The language distribution for queries and contexts is evenly split (1/4 French-French, 1/4 French-English, 1/4 English-French, 1/4 English-English). The learning objective is to bring the embedding representation of queries and associated contexts closer using a contrastive method. The loss function is defined in [Deep Metric Learning using Triplet Network](https://arxiv.org/abs/1412.6622).

	Benchmark
	---------

	Based on the SQuAD evaluation dataset (comprising 6000 queries distributed over 1200 contexts grouped into 35 themes), we compare the performance in terms of the average top contexter value for a query (Top-mean), the standard deviation of the average top (Top-std), and the percentage of correct queries within the top-1, top-5, and top-10. We compare the model with a TF-IDF trained on the SQuAD train sub-dataset (we want a fixed algebraic structure for the vector database instead of a variable structure every time we add a new document, then the IDF part has frozen), CamemBERT, Sentence-BERT, and finally our model. We observe these performances in both monolingual and cross-language contexts (query in French and context in English).

	Model (FR/FR) \| Top-mean \| Top-std \| Top-1 (%) \| Top-5 (%) \| Top-10 (%) \|
	\|----------------------------------------------------------------------------------------------------:\|:--------:\|:-------:\|:---------:\|:---------:\|:----------:\|
	\| TF-IDF \| 128 \| 269 \| 23 \| 46 \| 56 \|
	\| [CamemBERT](https://huggingface.co/camembert/camembert-base) \| 417 \| 347 \| 1 \| 2 \| 3 \|
	\| [Sentence-BERT](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) \| 11 \| 41 \| 43 \| 71 \| 82 \|
	\| [Bloomz-560m-retriever](https://huggingface.co/cmarkea/bloomz-560m-retriever) \| 10 \| 47 \| 51 \| 78 \| 86 \|
	\| [Bloomz-3b-retriever](https://huggingface.co/cmarkea/bloomz-3b-retriever) \| 9 \| 37 \| 50 \| 79 \| 87 \|

	Model (EN/FR) \| Top-mean \| Top-std \| Top-1 (%) \| Top-5 (%) \| Top-10 (%) \|
	\|----------------------------------------------------------------------------------------------------:\|:--------:\|:-------:\|:---------:\|:---------:\|:----------:\|
	\| TF-IDF \| 607 \| 334 \| 0 \| 0 \| 0 \|
	\| [CamemBERT](https://huggingface.co/camembert/camembert-base) \| 432 \| 345 \| 0 \| 1 \| 1 \|
	\| [Sentence-BERT](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) \| 12 \| 47 \| 44 \| 73 \| 83 \|
	\| [Bloomz-560m-retriever](https://huggingface.co/cmarkea/bloomz-560m-retriever) \| 10 \| 44 \| 49 \| 77 \| 86 \|
	\| [Bloomz-3b-retriever](https://huggingface.co/cmarkea/bloomz-3b-retriever) \| 9 \| 38 \| 50 \| 78 \| 87 \|

	We observed that TF-IDF loses robustness in cross-language scenarios (even showing lower performance than CamemBERT, which is a model specialized in French). This can be explained by the fact that a Bag-Of-Words method cannot support this type of issue because, for a given sentence between two languages, the embedding vectors will be significantly different.

	CamemBERT exhibits poor performance, not because it poorly groups contexts and queries by themes, but because a meta-cluster appears, separating contexts and queries (as illustrated in the image below), making this type of modeling inappropriate in a retriever context.

	![embedding_camembert](https://i.postimg.cc/x1fZMhFK/emb-camembert.png)

	How to Use Bloomz-560m-retriever
	--------------------------------

	The following example utilizes the API Pipeline of the Transformers library.

	```python
	import numpy as np
	from transformers import pipeline
	from scipy.spatial.distance import cdist

	retriever = pipeline('feature-extraction', 'cmarkea/bloomz-560m-retriever')

	# Inportant: take only last token!
	infer = lambda x: [ii[0][-1] for ii in retriever(x)]

	list_of_contexts = [...]
	emb_contexts = np.concatenate(infer(list_of_contexts), axis=0)
	list_of_queries = [...]
	emb_queries = np.concatenate(infer(list_of_queries), axis=0)

	# Important: take l2 distance!
	dist = cdist(emb_queries, emb_contexts, 'euclidean')
	top_k = lambda x: [
	[list_of_contexts[qq] for qq in ii]
	for ii in dist.argsort(axis=-1)[:,:x]
	]

	# top 5 nearest contexts for each queries
	top_contexts = top_k(5)
	```

	Citation
	--------

	```bibtex
	@online{DeBloomzRet,
	AUTHOR = {Cyrile Delestre},
	ORGANIZATION = {Cr{\'e}dit Mutuel Ark{\'e}a},
	URL = {https://huggingface.co/cmarkea/bloomz-560m-retriever},
	YEAR = {2023},
	KEYWORDS = {NLP ; Transformers ; LLM ; Bloomz},
	}
	```