---
license: mit
---

# ColFlor: Towards BERT-Size Vision-Language Document Retrieval Models

In June 2024, [ColPali](https://arxiv.org/abs/2407.01449) was introduced as an OCR-free document retrieval model built on top of [PaliGemma](https://arxiv.org/abs/2407.07726), shifting the paradigm of PDF document retrieval by directly processing images instead of relying on error-prone and resource-heavy OCR pipelines. However, with three billion parameters, ColPali can be computationally expensive, especially for large document databases. In contrast, text retrieval models like [ColBERT](https://arxiv.org/abs/2004.12832) are much more efficient with just a few hundred million parameters, but they still require error-prone and expensive OCR pipelines to extract text from documents first. To bridge this gap, we introduce ColFlor, an OCR-free visual document retrieval model with only 130 million parameters.

<p align="center"><img width=800 src="https://github.com/illuin-tech/colpali/blob/main/assets/colflor.png?raw=true"/></p>

More details about the model can be found in the [ColFlor blog post](https://huggingface.co/blog/ahmed-masry/colflor).

## Usage

First, you need to clone the GitHub repo and install the dependencies as follows:

```bash
git clone https://github.com/AhmedMasryKU/colflor
cd colflor
pip install -e .
```
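
If the editable install worked, the imports that the inference script below relies on should resolve. A quick sanity check (just the imports, no model download):

```python
# Sanity check: these are the same modules the inference example below imports.
from colpali_engine.models.paligemma_colbert_architecture import ColPali
from colpali_engine.utils.colpali_processing_utils import process_images, process_queries

print("colpali_engine is importable.")
```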

Then, you can run the following inference code:

```python
import torch
import typer
from torch.utils.data import DataLoader
from tqdm import tqdm
from transformers import AutoProcessor
from PIL import Image

from colpali_engine.models.paligemma_colbert_architecture import ColPali
from colpali_engine.trainer.retrieval_evaluator import CustomEvaluator
from colpali_engine.utils.colpali_processing_utils import process_images, process_queries
from colpali_engine.utils.image_from_page_utils import load_from_dataset


def main() -> None:
    """Example script to run inference with ColPali"""

    # Load model
    model_name = "vidore/colpali"
    model = ColPali.from_pretrained("vidore/colpaligemma-3b-mix-448-base", torch_dtype=torch.bfloat16, device_map="cuda").eval()
    model.load_adapter(model_name)
    processor = AutoProcessor.from_pretrained(model_name)

    # select images -> load_from_pdf(<pdf_path>), load_from_image_urls(["<url_1>"]), load_from_dataset(<path>)
    images = load_from_dataset("vidore/docvqa_test_subsampled")
    queries = ["From which university does James V. Fiorca come ?", "Who is the japanese prime minister?"]

    # run inference - docs
    dataloader = DataLoader(
        images,
        batch_size=4,
        shuffle=False,
        collate_fn=lambda x: process_images(processor, x),
    )
    ds = []
    for batch_doc in tqdm(dataloader):
        with torch.no_grad():
            batch_doc = {k: v.to(model.device) for k, v in batch_doc.items()}
            embeddings_doc = model(**batch_doc)
        ds.extend(list(torch.unbind(embeddings_doc.to("cpu"))))

    # run inference - queries
    dataloader = DataLoader(
        queries,
        batch_size=4,
        shuffle=False,
        collate_fn=lambda x: process_queries(processor, x, Image.new("RGB", (448, 448), (255, 255, 255))),
    )
    qs = []
    for batch_query in dataloader:
        with torch.no_grad():
            batch_query = {k: v.to(model.device) for k, v in batch_query.items()}
            embeddings_query = model(**batch_query)
        qs.extend(list(torch.unbind(embeddings_query.to("cpu"))))

    # run evaluation
    retriever_evaluator = CustomEvaluator(is_multi_vector=True)
    scores = retriever_evaluator.evaluate(qs, ds)
    print(scores.argmax(axis=1))


if __name__ == "__main__":
    typer.run(main)
```
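
For context, `CustomEvaluator(is_multi_vector=True)` scores each query against each page with ColBERT-style late interaction (MaxSim). Below is a minimal sketch of that scoring, illustrative rather than the library's exact implementation, assuming each entry of `qs` and `ds` is a `(num_tokens, dim)` tensor of token embeddings:

```python
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """For each query token, keep its best-matching document token
    (max dot product), then sum those maxima over the query tokens."""
    sim = query_emb.float() @ doc_emb.float().T  # (num_query_tokens, num_doc_tokens)
    return sim.max(dim=1).values.sum()

# scores[i, j] = late-interaction relevance of page j to query i
scores = torch.stack([torch.stack([maxsim_score(q, d) for d in ds]) for q in qs])
print(scores.argmax(dim=1))  # best-scoring page index per query
```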

## Limitations

- **Figures**: While ColFlor exhibits reasonable performance on figures, there is still a relatively large performance gap between it and larger models such as ColPali.
- **Multilinguality**: The current version of the model only supports English and performs poorly on other languages.

## License

We release this model under the MIT license.

## Contact

If you have any questions about this work, feel free to reach out to **Ahmed Masry** at **masry20@yorku.ca** or **ahmed.elmasry24653@gmail.com**.

## Acknowledgement

This work was carried out at the Intelligent Visualization Lab at York University in Canada. It was supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada and the Canada Foundation for Innovation (CFI). Additionally, it received support through a GCP credits award from Google's PaliGemma Academic Program.

We appreciate the well-documented training and evaluation GitHub repositories provided by the ColPali team, which were instrumental in our model development.
This model card is adapted from the [ColPali model card](https://huggingface.co/vidore/colpali).

## Citation

If you plan to use ColFlor in your research, please consider citing us as follows:

```bibtex
@misc{masry2024colflor,
    title={ColFlor: Towards BERT-Size Vision-Language Document Retrieval Models},
    url={https://huggingface.co/blog/ahmed-masry/colflor},
    author={Masry, Ahmed},
    month={October},
    year={2024}
}
```