Jaward posted an update Mar 26
MLX RAG with GGUF Models
A minimal, clean-code implementation of RAG with MLX inference for GGUF models.

Code: https://github.com/Jaykef/mlx-rag-gguf

The code builds on vegaluisjose's example, optimized to support RAG-based inference with .gguf models. I am using BAAI/bge-small-en as the embedding model, tinyllama-1.1b-chat-v1.0.Q4_0.gguf as the base model, and a custom vector-database script for indexing the text in a PDF file. Inference speed reaches ~413 tokens/sec for prompt processing and ~36 tokens/sec for generation on my M2 Air.
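For illustration, here is a minimal sketch of what the indexing step might look like: chunk the PDF text, embed each chunk, and save the normalized embeddings as a .npz "vector database". The `embed` callable, `load_pdf_text`, and the chunking parameters are assumptions for this sketch, not the repo's actual API.

```python
import numpy as np

def chunk_text(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into overlapping character windows."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def build_index(text: str, embed, path: str = "vdb.npz") -> None:
    """Embed all chunks and store them as a simple .npz vector database."""
    chunks = chunk_text(text)
    # embed() is assumed to map a list of strings to an (n, d) float array,
    # e.g. a BAAI/bge-small-en forward pass under MLX.
    vectors = np.asarray(embed(chunks), dtype=np.float32)
    # Normalize rows so a dot product at query time equals cosine similarity.
    vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
    np.savez(path, vectors=vectors, chunks=np.array(chunks))
```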

Queries use both the .gguf (base model) and the .npz (retrieval model) simultaneously, resulting in much higher inference speeds.
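A sketch of the query side under the same assumptions: the hypothetical `embed` callable for the retrieval model and a `generate` callable wrapping the MLX GGUF model stand in for the repo's actual scripts.

```python
import numpy as np

def retrieve(query: str, embed, path: str = "vdb.npz", k: int = 3) -> list[str]:
    """Return the top-k chunks from the .npz index by cosine similarity."""
    db = np.load(path)
    q = np.asarray(embed([query]), dtype=np.float32)[0]
    q /= np.linalg.norm(q)
    scores = db["vectors"] @ q          # rows are unit-norm, so this is cosine
    top = np.argsort(scores)[::-1][:k]
    return [str(db["chunks"][i]) for i in top]

def answer(query: str, embed, generate) -> str:
    """Build an augmented prompt and pass it to the GGUF base model."""
    context = "\n\n".join(retrieve(query, embed))
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return generate(prompt)             # e.g. TinyLlama Q4_0 via MLX
```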