IlyasMoutawwakil (Ilyas Moutawwakil)

Last week, Intel's new Xeon CPUs, Sapphire Rapids (SPR), landed on Inference Endpoints, and I think they have the potential to reduce the cost of your RAG pipelines 💸

Why? Because they come with Intel® AMX support, a set of instructions that accelerate BF16 and INT8 matrix multiplications on CPU ⚡
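If you want to confirm that an instance actually exposes AMX, one rough way on Linux is to look for the kernel's `amx_tile`, `amx_bf16` and `amx_int8` flags in `/proc/cpuinfo`. The sketch below is only an illustration: the helper name `amx_features` is made up for this example, and it assumes a Linux host with a kernel recent enough to report these flags.

```python
from pathlib import Path

# Flag names the Linux kernel uses for the AMX feature bits.
AMX_FLAGS = ("amx_tile", "amx_bf16", "amx_int8")


def amx_features(cpuinfo_text: str) -> dict[str, bool]:
    """Report which AMX flags appear in the first 'flags' line of cpuinfo text.

    Hypothetical helper for illustration; it only parses text and does not
    probe the CPU directly.
    """
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            # Everything after the first colon is a space-separated flag list.
            _, _, rest = line.partition(":")
            flags = set(rest.split())
            return {name: name in flags for name in AMX_FLAGS}
    # No 'flags' line found (e.g. non-Linux host or empty input).
    return {name: False for name in AMX_FLAGS}


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")
    text = path.read_text() if path.exists() else ""
    print(amx_features(text))
```

On a machine without AMX (or without `/proc/cpuinfo`), every entry simply comes back `False`.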

I went ahead and built a Space to showcase how to efficiently deploy embedding models on SPR for both retrieving and ranking documents, with Haystack-compatible components: https://huggingface.co/spaces/optimum-intel/haystack-e2e

Here's how it works:

- Document Store: A FAISS document store containing the seven-wonders dataset, embedded, indexed and stored on the Space's persistent storage to avoid unnecessary re-computation of embeddings.

- Retriever: It embeds the query at runtime and retrieves the N documents from the dataset that are most semantically similar to the query's embedding.
We use the small variant of the BGE family here because we want a model that's fast to run over the entire dataset and produces small embeddings, which keeps similarity search fast. Specifically, we use an INT8-quantized bge-small-en-v1.5, deployed on an Intel Sapphire Rapids CPU instance.

- Ranker: It re-embeds the retrieved documents at runtime and re-ranks them by semantic similarity to the query's embedding. We use the large variant of the BGE family here because it's optimized for accuracy, which lets us keep only the k most relevant documents for the LLM prompt. Specifically, we use an INT8-quantized bge-large-en-v1.5, deployed on an Intel Sapphire Rapids CPU instance.
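To make the two-stage flow above concrete, here is a minimal pure-Python sketch of retrieve-then-rerank. It is not the Space's code: the bag-of-words embedders (`make_embedder`) are toy stand-ins for the bge-small and bge-large endpoints, the brute-force cosine search stands in for the FAISS index, and all function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k):
    """Indices of the k vectors most similar to query_vec, best first."""
    order = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return order[:k]


def retrieve_then_rank(query, docs, index_vecs, small_embed, large_embed, n, k):
    # Stage 1 (retriever): embed only the query with the small model and
    # search the precomputed document embeddings for the N nearest documents.
    candidates = top_k(small_embed(query), index_vecs, n)
    # Stage 2 (ranker): re-embed just those N candidates with the large
    # model and keep the k best for the LLM prompt.
    query_vec = large_embed(query)
    candidate_vecs = [large_embed(docs[i]) for i in candidates]
    return [docs[candidates[j]] for j in top_k(query_vec, candidate_vecs, k)]


def make_embedder(vocab):
    """Toy stand-in for an embedding model: counts vocabulary words in the text."""
    return lambda text: [float(text.split().count(word)) for word in vocab]


docs = [
    "great pyramid of giza egypt",
    "colossus of rhodes statue",
    "lighthouse of alexandria egypt",
    "hanging gardens of babylon",
]
small_embed = make_embedder(["egypt", "statue", "gardens"])
large_embed = make_embedder(["egypt", "statue", "gardens", "pyramid", "lighthouse", "giza"])

# Computed once and stored, like the persisted document store.
index_vecs = [small_embed(d) for d in docs]

results = retrieve_then_rank(
    "pyramid in egypt", docs, index_vecs, small_embed, large_embed, n=2, k=1
)
print(results)
```

The point of the split is cost: the small model touches every document but only once, at indexing time, while the expensive model only ever sees N documents per query.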

Space: https://huggingface.co/spaces/optimum-intel/haystack-e2e
Retriever IE: optimum-intel/fastrag-retriever
Ranker IE: optimum-intel/fastrag-ranker

Articles 4

Accelerating LLM Inference with TGI on Intel Gaudi

models 24

datasets 4

IlyasMoutawwakil/benchmarks

Preview • Updated Dec 12, 2024 • 4

IlyasMoutawwakil/OpenVINO-Benchmarks

Updated Nov 18, 2024 • 23

IlyasMoutawwakil/optimum-benchmarks-ci

Preview • Updated Apr 10, 2024 • 5

IlyasMoutawwakil/llm-race-dataset

Viewer • Updated Nov 23, 2023 • 4.38M • 173 • 1

models 24

IlyasMoutawwakil/distilgpt2-openvino

IlyasMoutawwakil/flux-onnx-optimum

IlyasMoutawwakil/stable-diffusion-xl-base-1.0-onnx

IlyasMoutawwakil/tiny-stable-diffusion-onnx

IlyasMoutawwakil/stable-diffusion-2-1-onnx

IlyasMoutawwakil/gpt2-openvino

IlyasMoutawwakil/onnx_model_4.42

IlyasMoutawwakil/onnx_model_4.43

IlyasMoutawwakil/fastRAG-bge-int8-static

IlyasMoutawwakil/segformers
