Jaward posted an update Mar 26
MLX RAG with GGUF Models
A minimal, clean-code implementation of RAG with MLX inference for GGUF models.

Code: https://github.com/Jaykef/mlx-rag-gguf

The code builds on vegaluisjose's example, optimized to support RAG-based inference with .gguf models. I am using BAAI/bge-small-en as the embedding model, tinyllama-1.1b-chat-v1.0.Q4_0.gguf as the base model, and a custom vector-database script for indexing the text in a PDF file. Inference speed reaches ~413 tokens/sec for prompt processing and ~36 tokens/sec for generation on my M2 Air.
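For illustration, here is a minimal sketch of what the indexing step might look like: chunk the PDF text, embed each chunk, and save the normalized embeddings as a .npz "vector database". The `embed` callable, `load_pdf_text`, and the chunking parameters are assumptions for this sketch, not the repo's actual API.

```python
import numpy as np

def chunk_text(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into overlapping character windows."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def build_index(text: str, embed, path: str = "vdb.npz") -> None:
    """Embed all chunks and store them as a simple .npz vector database."""
    chunks = chunk_text(text)
    # embed() is assumed to map a list of strings to an (n, d) float array,
    # e.g. a BAAI/bge-small-en forward pass under MLX.
    vectors = np.asarray(embed(chunks), dtype=np.float32)
    # Normalize rows so a dot product at query time equals cosine similarity.
    vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
    np.savez(path, vectors=vectors, chunks=np.array(chunks))
```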

Queries use both the .gguf (base model) and the .npz (retrieval model) simultaneously, resulting in much higher inference speeds.
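A sketch of the query side under the same assumptions: the hypothetical `embed` callable for the retrieval model and a `generate` callable wrapping the MLX GGUF model stand in for the repo's actual scripts.

```python
import numpy as np

def retrieve(query: str, embed, path: str = "vdb.npz", k: int = 3) -> list[str]:
    """Return the top-k chunks from the .npz index by cosine similarity."""
    db = np.load(path)
    q = np.asarray(embed([query]), dtype=np.float32)[0]
    q /= np.linalg.norm(q)
    scores = db["vectors"] @ q          # rows are unit-norm, so this is cosine
    top = np.argsort(scores)[::-1][:k]
    return [str(db["chunks"][i]) for i in top]

def answer(query: str, embed, generate) -> str:
    """Build an augmented prompt and pass it to the GGUF base model."""
    context = "\n\n".join(retrieve(query, embed))
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return generate(prompt)             # e.g. TinyLlama Q4_0 via MLX
```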