ColPali marries the idea of modern vision language models with retrieval
The authors apply contrastive fine-tuning to SigLIP on documents and pool the outputs (they call it BiSigLIP). Then they feed the patch embedding outputs to PaliGemma and create BiPali. BiPali natively passes image patch embeddings to an LLM, which enables ColBERT-like late interaction computations between text tokens and image patches (hence the name ColPali!) 🤩
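To make the late interaction idea concrete, here is a minimal sketch of ColBERT-style MaxSim scoring in plain Python. It is not the authors' implementation: the function names (`late_interaction_score`, `rank_pages`) and the toy 2-d embeddings are made up for illustration, and real models use learned, much higher-dimensional embeddings.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def late_interaction_score(query_tokens, page_patches):
    """MaxSim: for each query token embedding, keep the similarity of its
    best-matching image patch, then sum those maxima over all query tokens."""
    return sum(max(dot(q, p) for p in page_patches) for q in query_tokens)


def rank_pages(query_tokens, pages):
    """Return page indices sorted from highest to lowest late interaction score."""
    scores = [late_interaction_score(query_tokens, patches) for patches in pages]
    return sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)


# Hypothetical toy data: 2 query token embeddings, 2 pages of patch embeddings.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
page_b = [[0.5, 0.5], [0.0, 0.0]]

print(late_interaction_score(query, page_a))  # 2.0
print(late_interaction_score(query, page_b))  # 1.0
print(rank_pages(query, [page_b, page_a]))    # [1, 0]
```

Each query token gets to pick its own best patch, so a page scores well when different parts of it answer different parts of the query. A single pooled vector per page cannot express that.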
The authors created the ViDoRe benchmark by collecting PDF documents and generating queries with Claude-3 Sonnet. ColPali seems to be the most performant model on ViDoRe. Not only that, it is also way faster at indexing than traditional PDF parsers!