peterizsak committed on
Commit 2e43ad1 · 1 Parent(s): f393266

Update README.md

README.md CHANGED

---
license: mit
language:
- en
---

# BGE-large-en-v1.5-rag-int8-static

A quantized version of the [BAAI/bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5) embedder, compatible with [Optimum-Intel](https://github.com/huggingface/optimum-intel) and [Intel® Neural Compressor](https://github.com/intel/neural-compressor).

The model can be used with the [Optimum-Intel](https://github.com/huggingface/optimum-intel) API and as an embedder/ranker model as part of [fastRAG](https://github.com/IntelLabs/fastRAG).

See the [original model page](https://huggingface.co/BAAI/bge-large-en-v1.5) for full details on the model architecture and training.

## Technical details

The model was quantized using post-training static quantization.

| | |
|---|:---:|
| Calibration set | [qasper](https://huggingface.co/datasets/allenai/qasper) (100 random samples) |
| Quantization tool | [Optimum-Intel](https://github.com/huggingface/optimum-intel) |
| Backend | `IPEX` |
| Original model | [BAAI/bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5) |

Instructions for reproducing the quantized model can be found [here](https://github.com/IntelLabs/fastRAG/tree/main/scripts/optimizations/embedders).
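
As a rough sketch of how such a model can be produced with Optimum-Intel's `INCQuantizer` (the calibration text field `abstract`, the sequence length, and the save directory below are illustrative assumptions, not the exact settings of the linked script):

```python
from functools import partial

from neural_compressor.config import PostTrainingQuantConfig
from optimum.intel import INCQuantizer
from transformers import AutoModel, AutoTokenizer

model_id = "BAAI/bge-large-en-v1.5"
model = AutoModel.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

def preprocess_fn(examples, tokenizer):
    # Assumption: calibrate on the paper abstracts of qasper
    return tokenizer(examples["abstract"], padding="max_length", max_length=512, truncation=True)

quantizer = INCQuantizer.from_pretrained(model)
calibration_dataset = quantizer.get_calibration_dataset(
    "allenai/qasper",
    num_samples=100,
    dataset_split="train",
    preprocess_function=partial(preprocess_fn, tokenizer=tokenizer),
)

# Post-training static quantization with the IPEX backend
quantization_config = PostTrainingQuantConfig(approach="static", backend="ipex")
quantizer.quantize(
    quantization_config=quantization_config,
    calibration_dataset=calibration_dataset,
    save_directory="bge-large-en-v1.5-rag-int8-static",
)
```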

## Evaluation - MTEB

| Task | `INT8` | `FP32` | % diff |
|---|:---:|:---:|:---:|
| Reranking | 0.5997 | 0.6003 | -0.108% |

## Usage

### Using with Optimum-Intel

See the [Optimum-Intel](https://github.com/huggingface/optimum-intel) installation page for instructions on how to install it, or run:

```sh
pip install -U optimum[neural-compressor] intel-extension-for-transformers
```

Loading the model:

```python
from optimum.intel import INCModel

model = INCModel.from_pretrained("Intel/bge-large-en-v1.5-rag-int8-static")
```

Running inference:

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Intel/bge-large-en-v1.5-rag-int8-static")

sentences = ["Example sentence to embed."]
inputs = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    # The sentence embedding is the vector of the [CLS] token
    embeddings = outputs[0][:, 0]
```
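
BGE-style embeddings are typically L2-normalized before computing similarities; a minimal follow-up, assuming the `embeddings` tensor from above:

```python
# L2-normalize, then score every sentence against every other
embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
scores = embeddings @ embeddings.T
print(scores)
```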

### Using with a fastRAG RAG pipeline

Get started by installing [fastRAG](https://github.com/IntelLabs/fastRAG) as instructed [here](https://github.com/IntelLabs/fastRAG).

Below is an example of loading the model into a ranker node that embeds and re-ranks all the documents it receives as input in a pipeline.

```python
from fastrag.rankers import QuantizedBiEncoderRanker

ranker = QuantizedBiEncoderRanker("Intel/bge-large-en-v1.5-rag-int8-static")
```

and plugging it into a pipeline:

```python
from haystack import Pipeline

# `retriever` is assumed to be any Haystack retriever node defined earlier
p = Pipeline()
p.add_node(component=retriever, name="retriever", inputs=["Query"])
p.add_node(component=ranker, name="ranker", inputs=["retriever"])
```
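
An end-to-end run might then look as follows (the query and `top_k` values are arbitrary placeholders):

```python
results = p.run(
    query="What is post-training static quantization?",
    params={"retriever": {"top_k": 100}, "ranker": {"top_k": 10}},
)
for doc in results["documents"]:
    print(doc.score, doc.content)
```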

See a more complete example notebook [here](https://github.com/IntelLabs/fastRAG/blob/main/examples/optimized-embeddings.ipynb).