import streamlit as st
from streamlit_extras.switch_page_button import switch_page
st.title("ColPali")
st.success("""[Original tweet](https://x.com/mervenoyann/status/1811003265858912670) (Jul 10, 2024)""", icon="ℹ️")
st.markdown(""" """)
st.markdown("""
Forget any document retrievers, use ColPali 💥💥
Document retrieval is usually done with OCR + layout detection, but that pipeline is overkill and doesn't work well! 🤓
ColPali instead uses a vision language model, which is better at document understanding 📑
""")
st.markdown(""" """)
st.image("pages/ColPali/image_1.png", use_column_width=True)
st.markdown(""" """)
st.markdown("""
Check out the [ColPali model](https://huggingface.co/vidore/colpali) (MIT license!)
Check out the [blog](https://huggingface.co/blog/manu/colpali)
The authors also released a new benchmark for document retrieval, the [ViDoRe Leaderboard](https://huggingface.co/spaces/vidore/vidore-leaderboard), submit your model! """)
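st.markdown(""" """)
st.markdown("""
Below is a minimal retrieval sketch. It assumes the `colpali-engine` package and its `ColPali` / `ColPaliProcessor` helpers; class names and the checkpoint revision may differ, so check the model card for the exact API.
""")
st.code("""
import torch
from PIL import Image
from colpali_engine.models import ColPali, ColPaliProcessor

# Load the model and its processor (assumed colpali-engine API)
model = ColPali.from_pretrained("vidore/colpali", torch_dtype=torch.bfloat16).eval()
processor = ColPaliProcessor.from_pretrained("vidore/colpali")

# One image per document page, plus a text query (file names are placeholders)
images = [Image.open("page_1.png"), Image.open("page_2.png")]
queries = ["How does the offline indexing latency compare?"]

with torch.no_grad():
    image_embeddings = model(**processor.process_images(images))    # multi-vector page embeddings
    query_embeddings = model(**processor.process_queries(queries))  # multi-vector query embeddings

# Late-interaction (MaxSim) scores: higher = more relevant page
scores = processor.score_multi_vector(query_embeddings, image_embeddings)
print(scores)
""", language="python")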
st.markdown(""" """)
st.image("pages/ColPali/image_2.jpeg", use_column_width=True)
st.markdown(""" """)
st.markdown("""
Regular document retrieval systems use OCR + layout detection + another model to retrieve information from documents, and then use the output representations in applications like RAG 🥲
Meanwhile, modern image encoders demonstrate out-of-the-box document understanding capabilities!""")
st.markdown(""" """)
st.markdown("""
ColPali marries the idea of modern vision language models with retrieval 🤝
The authors apply contrastive fine-tuning to SigLIP on documents and pool the outputs (they call it BiSigLIP). Then they feed the patch embeddings to PaliGemma to create BiPali 🖇️
""")
st.markdown(""" """)
st.image("pages/ColPali/image_3.png", use_column_width=True)
st.markdown(""" """)
st.markdown("""
BiPali natively feeds image patch embeddings into an LLM, which makes it possible to compute ColBERT-like late interaction between text tokens and image patches (hence the name ColPali!) 🤩
""")
st.markdown(""" """)
st.markdown("""
The authors created the ViDoRe benchmark by collecting PDF documents and generating queries with Claude-3 Sonnet.
See below how ColPali compares to other models on ViDoRe 👇🏻
""")
st.markdown(""" """)
st.image("pages/ColPali/image_4.jpeg", use_column_width=True)
st.markdown(""" """)
st.markdown("""Aside from performance improvements, ColPali is very fast for offline indexing as well!
""")
st.markdown(""" """)
st.image("pages/ColPali/image_5.png", use_column_width=True)
st.markdown(""" """)
st.info("""
Resources:
- [ColPali: Efficient Document Retrieval with Vision Language Models](https://huggingface.co/papers/2407.01449)
by Manuel Faysse, Hugues Sibille, Tony Wu, Bilel Omrani, Gautier Viaud, Céline Hudelot, Pierre Colombo (2024)
- [GitHub](https://github.com/illuin-tech/colpali)
- [Link to Models](https://huggingface.co/models?search=vidore)
- [Link to Leaderboard](https://huggingface.co/spaces/vidore/vidore-leaderboard)""", icon="📚")
st.markdown(""" """)
st.markdown(""" """)
st.markdown(""" """)
col1, col2, col3 = st.columns(3)
with col1:
    if st.button('Previous paper', use_container_width=True):
        switch_page("MiniGemini")
with col2:
    if st.button('Home', use_container_width=True):
        switch_page("Home")
with col3:
    if st.button('Next paper', use_container_width=True):
        switch_page("Home")