merve
posted an update Sep 5
If your documents contain more than just text and you're doing retrieval or RAG with OCR and LLMs, give that up and give ColPali and vision language models a try 🤗

Why? Documents consist of multiple modalities: layout, tables, text, charts, images. Document processing pipelines often chain multiple models, and they're immensely brittle and slow. 🥲

How? ColPali is a ColBERT-like document retrieval model built on PaliGemma. It operates over image patches directly, so indexing takes far less time and retrieval is more accurate. You can use it for retrieval on its own. If you want to do retrieval augmented generation, find the closest document and, instead of processing it, give it directly to a VLM like Qwen2-VL (as image input) along with your text query. 🤝
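To make "ColBERT-like" a bit more concrete, here is a minimal sketch of late-interaction (MaxSim) scoring, the scoring scheme ColBERT-style retrievers are known for: each query token embedding is matched against its best page patch embedding, and those best matches are summed. This is not ColPali's actual code. The function names (`maxsim_score`, `rank_pages`) and the tiny 2-dimensional embeddings are made up for illustration; in practice the embeddings would come from the model.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: List[Vector], page_patches: List[Vector]) -> float:
    """Late-interaction score (hypothetical helper, not a library API).

    For every query token, keep the similarity of its best-matching
    page patch, then sum those maxima over all query tokens.
    """
    return sum(max(dot(q, p) for p in page_patches) for q in query_tokens)


def rank_pages(query_tokens: List[Vector], pages: List[List[Vector]]) -> List[int]:
    """Return page indices sorted from best to worst MaxSim score."""
    scores = [maxsim_score(query_tokens, patches) for patches in pages]
    return sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)


# Toy embeddings, invented for this example (assumption: real ones are
# much higher-dimensional and produced by the retrieval model).
query = [[1.0, 0.0], [0.0, 1.0]]          # two query token embeddings
pages = [
    [[0.9, 0.1], [0.2, 0.1]],             # page 0: two patch embeddings
    [[0.8, 0.0], [0.1, 0.9]],             # page 1: two patch embeddings
]

ranking = rank_pages(query, pages)
best_page = ranking[0]  # the page image you would hand to the VLM
```

With these toy numbers, page 1 ranks first because it has a good patch match for both query tokens, while page 0 only matches the first one. The top-ranked page image is what you would then pass, unprocessed, to the VLM together with the text query.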

This is much faster + you do not lose out on any information + much easier to maintain too! 🥳

Multimodal RAG merve/multimodal-rag-66d97602e781122aae0a5139 💬
Document AI (made it way before, for folks who want structured input/output and can fine-tune a model) merve/awesome-document-ai-65ef1cdc2e97ef9cc85c898e 📖

@merve This and your related posts were so inspiring that I wrote up a related piece, but I'm not sure how to show it on HF. 🤔 It's at the URL below, but I'm not sure where it shows up on the site. It may be useful for others who have a similar problem. What do you think? (I also wasn't sure where to ask this, which is why I'm asking in a comment here.) https://huggingface.co/blog/fsommers/document-similarity-colpali
