
Multimodal Retrieval-Augmented Generation (RAG) with Document Retrieval (ColPali) and Vision Language Models (VLMs)

Authored by: Sergio Paniego

🚨 WARNING: This notebook is resource-intensive and requires substantial computational power. If you're running it in Colab, you will need an A100 GPU.

In this notebook, we demonstrate how to build a Multimodal Retrieval-Augmented Generation (RAG) system by combining the ColPali retriever for document retrieval with the Qwen2-VL Vision Language Model (VLM). Together, these models form a powerful RAG system capable of enhancing query responses with both text-based documents and visual data.
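
As a rough sketch of the two components involved, the snippet below loads a ColPali checkpoint with the `colpali-engine` library and a Qwen2-VL checkpoint with `transformers`. The model IDs (`vidore/colpali-v1.2`, `Qwen/Qwen2-VL-7B-Instruct`) and the dtype/device settings are illustrative assumptions and may differ from what the rest of the notebook uses.

```python
# Illustrative loading of the retriever (ColPali) and the VLM (Qwen2-VL).
# Assumes `colpali-engine` and a recent `transformers` are installed and a CUDA GPU is available.
import torch
from colpali_engine.models import ColPali, ColPaliProcessor
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

# Document retrieval model: embeds page images and queries for late-interaction search
retriever_model = ColPali.from_pretrained(
    "vidore/colpali-v1.2", torch_dtype=torch.bfloat16, device_map="cuda"
).eval()
retriever_processor = ColPaliProcessor.from_pretrained("vidore/colpali-v1.2")

# Vision Language Model: answers the user query from the retrieved page image(s)
vlm = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype=torch.bfloat16, device_map="cuda"
)
vlm_processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
```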

Instead of relying on a complex document processing pipeline that extracts data through OCR, we will leverage a Document Retrieval Model to efficiently retrieve the documents most relevant to a given user query.
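
To make the idea concrete, here is a minimal retrieval sketch. It assumes `images` is a list of PIL page images and reuses the retriever objects from the snippet above; the query string is purely illustrative. The ColPali processor embeds both the pages and the query, and its `score_multi_vector` helper computes late-interaction (MaxSim) scores between them.

```python
# Minimal retrieval sketch: embed page images and a query with ColPali,
# score the query against every page, and keep the best-matching page.
query = "What is the total revenue reported in 2023?"  # illustrative query

with torch.no_grad():
    # Embed the document pages (multi-vector embeddings per page)
    image_inputs = retriever_processor.process_images(images).to(retriever_model.device)
    image_embeddings = retriever_model(**image_inputs)

    # Embed the user query
    query_inputs = retriever_processor.process_queries([query]).to(retriever_model.device)
    query_embeddings = retriever_model(**query_inputs)

# Late-interaction (MaxSim) scores, shape: (num_queries, num_pages)
scores = retriever_processor.score_multi_vector(query_embeddings, image_embeddings)

# The top-scoring page image is what we would pass to the VLM to answer the query
best_page = images[scores.argmax(dim=1).item()]
```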

I also recommend checking out and starring the smol-vision repository, which inspired this cookbook (especially this notebook). For an introduction to RAG, you can check out this other cookbook!