# ColPali: Efficient Document Retrieval with Vision Language Models

[[Blog]](https://huggingface.co/blog/manu/colpali)
[[Paper]](https://arxiv.org/abs/2407.01449)
[[ColPali Model card]](https://huggingface.co/vidore/colpali)
[[ViDoRe Benchmark]](https://huggingface.co/vidore)
<!---[[Colab example]]()-->
[[HuggingFace Demo]](https://huggingface.co/spaces/manu/ColPali-demo)
## Associated Paper

**ColPali: Efficient Document Retrieval with Vision Language Models**

Manuel Faysse, Hugues Sibille, Tony Wu, Bilel Omrani, Gautier Viaud, Céline Hudelot, Pierre Colombo

This repository contains the code for training custom ColBERT-style retriever models. Notably, we train ColBERT retrievers with LLMs (decoders) as well as vision language models!
## Installation

### From git

```bash
pip install git+https://github.com/illuin-tech/colpali
```

### From source

```bash
git clone https://github.com/illuin-tech/colpali
cd colpali
pip install -r requirements.txt
```
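As an optional sanity check after either install path, you can confirm that the `colpali_engine` package (the name imported by the scripts below) resolves in your environment:

```bash
python -c "import colpali_engine; print('colpali_engine imported successfully')"
```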
## Usage

Example usage of the model is shown in the `scripts` directory.

```bash
# hackable example script to adapt
python scripts/infer/run_inference_with_python.py
```
```python
import torch
import typer
from torch.utils.data import DataLoader
from tqdm import tqdm
from transformers import AutoProcessor
from PIL import Image

from colpali_engine.models.paligemma_colbert_architecture import ColPali
from colpali_engine.trainer.retrieval_evaluator import CustomEvaluator
from colpali_engine.utils.colpali_processing_utils import process_images, process_queries
from colpali_engine.utils.image_from_page_utils import load_from_dataset


def main() -> None:
    """Example script to run inference with ColPali"""

    # Load model
    model_name = "vidore/colpali"
    model = ColPali.from_pretrained("google/paligemma-3b-mix-448", torch_dtype=torch.bfloat16, device_map="cuda").eval()
    model.load_adapter(model_name)
    processor = AutoProcessor.from_pretrained(model_name)

    # select images -> load_from_pdf(<pdf_path>), load_from_image_urls(["<url_1>"]), load_from_dataset(<path>)
    images = load_from_dataset("vidore/docvqa_test_subsampled")
    queries = ["From which university does James V. Fiorca come ?", "Who is the japanese prime minister?"]

    # run inference - docs
    dataloader = DataLoader(
        images,
        batch_size=4,
        shuffle=False,
        collate_fn=lambda x: process_images(processor, x),
    )
    ds = []
    for batch_doc in tqdm(dataloader):
        with torch.no_grad():
            batch_doc = {k: v.to(model.device) for k, v in batch_doc.items()}
            embeddings_doc = model(**batch_doc)
        ds.extend(list(torch.unbind(embeddings_doc.to("cpu"))))

    # run inference - queries
    dataloader = DataLoader(
        queries,
        batch_size=4,
        shuffle=False,
        collate_fn=lambda x: process_queries(processor, x, Image.new("RGB", (448, 448), (255, 255, 255))),
    )
    qs = []
    for batch_query in dataloader:
        with torch.no_grad():
            batch_query = {k: v.to(model.device) for k, v in batch_query.items()}
            embeddings_query = model(**batch_query)
        qs.extend(list(torch.unbind(embeddings_query.to("cpu"))))

    # run evaluation
    retriever_evaluator = CustomEvaluator(is_multi_vector=True)
    scores = retriever_evaluator.evaluate(qs, ds)
    print(scores.argmax(axis=1))


if __name__ == "__main__":
    typer.run(main)
```
Details are also given in the model card of the base ColPali model on HuggingFace: [ColPali Model card](https://huggingface.co/vidore/colpali).
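The `scores` object returned by `CustomEvaluator.evaluate` is a query-by-document score matrix, so ranking documents for each query is a top-k over each row. Below is a minimal sketch of that step, assuming `scores` is a 2D array-like accepted by `torch.as_tensor` and reusing `queries` from the example above:

```python
import torch

# scores: (num_queries, num_docs) late-interaction similarity matrix from the example above
scores_t = torch.as_tensor(scores, dtype=torch.float32)

top_k = min(3, scores_t.shape[1])  # number of candidate pages to keep per query
values, indices = torch.topk(scores_t, k=top_k, dim=1)

for query, doc_ids, doc_scores in zip(queries, indices.tolist(), values.tolist()):
    print(f"Query: {query}")
    for rank, (doc_id, score) in enumerate(zip(doc_ids, doc_scores), start=1):
        print(f"  {rank}. page index {doc_id} (score={score:.2f})")
```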
## Training

```bash
USE_LOCAL_DATASET=0 python scripts/train/train_colbert.py scripts/configs/siglip/train_siglip_model_debug.yaml
```

or

```bash
accelerate launch scripts/train/train_colbert.py scripts/configs/train_colidefics_model.yaml
```
### Configurations

All training arguments are set through a YAML configuration file, whose structure maps onto the following dataclass:
```python
@dataclass
class ColModelTrainingConfig:
    model: PreTrainedModel
    tr_args: TrainingArguments = None
    output_dir: str = None
    max_length: int = 256
    run_eval: bool = True
    run_train: bool = True
    peft_config: Optional[LoraConfig] = None
    add_suffix: bool = False
    processor: Idefics2Processor = None
    tokenizer: PreTrainedTokenizer = None
    loss_func: Optional[Callable] = ColbertLoss()
    dataset_loading_func: Optional[Callable] = None
    eval_dataset_loader: Optional[Dict[str, Callable]] = None
    pretrained_peft_model_name_or_path: Optional[str] = None
```
### Example

An example configuration file is:

```yaml
config:
  (): colpali_engine.utils.train_colpali_engine_models.ColModelTrainingConfig
  output_dir: !path ../../../models/without_tabfquad/train_colpali-3b-mix-448
  processor:
    (): colpali_engine.utils.wrapper.AutoProcessorWrapper
    pretrained_model_name_or_path: "./models/paligemma-3b-mix-448"
    max_length: 50
  model:
    (): colpali_engine.utils.wrapper.AutoColModelWrapper
    pretrained_model_name_or_path: "./models/paligemma-3b-mix-448"
    training_objective: "colbertv1"
    # attn_implementation: "eager"
    torch_dtype: !ext torch.bfloat16
    # device_map: "auto"
    # quantization_config:
    #   (): transformers.BitsAndBytesConfig
    #   load_in_4bit: true
    #   bnb_4bit_quant_type: "nf4"
    #   bnb_4bit_compute_dtype: "bfloat16"
    #   bnb_4bit_use_double_quant: true
  dataset_loading_func: !ext colpali_engine.utils.dataset_transformation.load_train_set
  eval_dataset_loader: !import ../data/test_data.yaml
  max_length: 50
  run_eval: true
  add_suffix: true
  loss_func:
    (): colpali_engine.loss.colbert_loss.ColbertPairwiseCELoss
  tr_args: !import ../tr_args/default_tr_args.yaml
  peft_config:
    (): peft.LoraConfig
    r: 32
    lora_alpha: 32
    lora_dropout: 0.1
    init_lora_weights: "gaussian"
    bias: "none"
    task_type: "FEATURE_EXTRACTION"
    target_modules: '(.*(language_model).*(down_proj|gate_proj|up_proj|k_proj|q_proj|v_proj|o_proj).*$|.*(custom_text_proj).*$)'
    # target_modules: '(.*(language_model).*(down_proj|gate_proj|up_proj|k_proj|q_proj|v_proj|o_proj).*$|.*(custom_text_proj).*$)'
```
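The `tr_args` and `eval_dataset_loader` entries are pulled in from separate YAML files via `!import`. As a purely illustrative sketch of what an imported training-arguments file might contain, assuming it follows the same `():` instantiation convention and standard `transformers.TrainingArguments` fields (the actual `default_tr_args.yaml` in the repo may differ):

```yaml
# Hypothetical sketch only; the file shipped in the repo may use different values and fields.
(): transformers.TrainingArguments
output_dir: null
per_device_train_batch_size: 4
gradient_accumulation_steps: 4
num_train_epochs: 1
learning_rate: 5e-5
warmup_steps: 100
logging_steps: 10
save_steps: 500
bf16: true
```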
#### Local training

```bash
USE_LOCAL_DATASET=0 python scripts/train/train_colbert.py scripts/configs/siglip/train_siglip_model_debug.yaml
```

#### SLURM

```bash
sbatch --nodes=1 --cpus-per-task=16 --mem-per-cpu=32GB --time=20:00:00 --gres=gpu:1 -p gpua100 --job-name=colidefics --output=colidefics.out --error=colidefics.err --wrap="accelerate launch scripts/train/train_colbert.py scripts/configs/train_colidefics_model.yaml"

sbatch --nodes=1 --time=5:00:00 -A cad15443 --gres=gpu:8 --constraint=MI250 --job-name=colpali --wrap="python scripts/train/train_colbert.py scripts/configs/train_colpali_model.yaml"
```
## Citation

```bibtex
@misc{faysse2024colpaliefficientdocumentretrieval,
  title={ColPali: Efficient Document Retrieval with Vision Language Models},
  author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo},
  year={2024},
  eprint={2407.01449},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2407.01449},
}
```