Update README.md

#2
by merve HF Staff - opened
Files changed (1)
  1. README.md +137 -3
README.md CHANGED
@@ -1,3 +1,137 @@
- ---
- license: cc-by-nc-4.0
- ---
+ ---
+ language:
+ - en
+ license: apache-2.0
+ datasets:
+ - BLIP3o/BLIP3o-60k
+ - BLIP3o/BLIP3o-Pretrain-Short-Caption
+ - BLIP3o/BLIP3o-Pretrain-Long-Caption
+ pipeline_tag: any-to-any
+ library_name: diffusers
+ ---
+ # 🌌 BLIP3-o
+
+ BLIP3-o is a unified multimodal model that combines the reasoning and instruction-following strengths of autoregressive models with the generative power of diffusion models. Unlike prior works that diffuse VAE features or raw pixels, BLIP3-o diffuses semantically rich **CLIP image features**, enabling a powerful and efficient architecture for both image understanding and generation.
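+
+ As a rough illustration of that idea (a minimal sketch, not the actual BLIP3-o training code), the CLIP + Flow Matching objective listed later in this card trains a velocity model to move from noise to the target CLIP image features, conditioned on the autoregressive backbone's output. The names `velocity_model`, `clip_feats`, and `cond` are placeholders.
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ # Minimal flow-matching step on CLIP image features (illustrative only).
+ # clip_feats:     target CLIP features for a batch of images, shape (B, N_tokens, D)
+ # cond:           conditioning from the autoregressive backbone for the same batch
+ # velocity_model: any module mapping (noisy_feats, t, cond) -> predicted velocity
+ def flow_matching_loss(velocity_model, clip_feats, cond):
+     noise = torch.randn_like(clip_feats)                          # x_0 ~ N(0, I)
+     t = torch.rand(clip_feats.size(0), device=clip_feats.device)  # one timestep per sample
+     t_ = t.view(-1, 1, 1)
+     x_t = (1 - t_) * noise + t_ * clip_feats                      # linear path from noise to features
+     target_velocity = clip_feats - noise                          # d x_t / d t along that path
+     pred_velocity = velocity_model(x_t, t, cond)
+     return F.mse_loss(pred_velocity, target_velocity)
+ ```
+
+ At inference time, the predicted CLIP features are rendered back to pixels by a diffusion decoder (see the CLIP + Diffusion section below).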
+
+ ## 📖 [arXiv](http://arxiv.org/abs/2505.09568)
+
+ ## Update
+
+ - [2025/05/19] 🔥 We understand this is a large codebase, so we have shared a high-level overview of its [Code Structure](https://github.com/JiuhaiChen/BLIP3o/issues/11#issuecomment-2891930000). Feel free to open an issue if you encounter any problems.
+ - [2025/05/16] 🔥 We’ve published a dataset of 20 million images with detailed captions, [BLIP3o Pretrain Long Caption](https://huggingface.co/datasets/BLIP3o/BLIP3o-Pretrain-Long-Caption), and 4 million images with short captions, [BLIP3o Pretrain Short Caption](https://huggingface.co/datasets/BLIP3o/BLIP3o-Pretrain-Short-Caption). All images and their captions are compressed into tar archives, so **no separate image URL downloads or manual unzipping are required**.
+ - [2025/05/16] 🔥 We’ve reorganized and cleaned up the repository to ensure a clear, well-structured codebase. Please give the training and inference scripts a try, and feel free to open an issue if you run into any problems. We apologize for any confusion caused by our original codebase release.
+
+
+ ## ✨ Highlights
+
+ - **Fully Open-Source:** Training data (pretraining and instruction tuning), training recipe, model weights, and code are all released.
+ - **Unified Architecture:** A single architecture for both image understanding and generation.
+ - **CLIP Feature Diffusion:** Directly diffuses semantic vision features for stronger alignment and performance.
+ - **State-of-the-art Performance:** Achieves state-of-the-art results across a wide range of image understanding and generation benchmarks.
+
+
+ ---
+
+ ## Demo
+ You can try out BLIP3-o in your browser using our interactive [Demo](https://huggingface.co/spaces/BLIP3o/blip-3o).
+
+ Install the packages needed for training:
+ ```Shell
+ conda create -n blip3o python=3.11 -y
+ conda activate blip3o
+ pip install --upgrade pip setuptools
+ pip install -r requirements.txt
+ ```
+
+ ## Inference
+
+ You can clone our GitHub repository:
+ ```bash
+ git clone https://github.com/JiuhaiChen/BLIP3o.git
+ ```
+ Download our checkpoint:
+ ```bash
+ python -c "from huggingface_hub import snapshot_download; print(snapshot_download(repo_id='BLIP3o/BLIP3o-Model', repo_type='model'))"
+ ```
+
+ and run the inference code:
+
+ ```python
+ import torch
+ from diffusers import DiffusionPipeline
+ from transformers import AutoProcessor
+
+ from blip3o.constants import *
+ from blip3o.conversation import conv_templates
+ from blip3o.model.builder import load_pretrained_model
+
+ # Placeholders: point model_path and diffusion_path at the checkpoint directory printed by the
+ # snapshot_download command above (it contains both the backbone and the diffusion decoder),
+ # and set model_name to the name expected by the BLIP3o model builder.
+ model_path = "/path/to/BLIP3o-Model"
+ diffusion_path = "/path/to/BLIP3o-Model"
+ model_name = "blip3o"
+
+ # Load the Qwen2.5-VL processor and the BLIP3-o multimodal backbone.
+ processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
+ tokenizer, multi_model, context_len = load_pretrained_model(model_path, None, model_name)
+
+ # Build the generation pipeline around the diffusion decoder.
+ pipe = DiffusionPipeline.from_pretrained(
+     diffusion_path,
+     custom_pipeline="pipeline_llava_gen",
+     torch_dtype=torch.bfloat16,
+     use_safetensors=True,
+     variant="bf16",
+     multimodal_encoder=multi_model,
+     tokenizer=tokenizer,
+     safety_checker=None,
+ )
+
+ pipe.vae.to("cuda")
+ pipe.unet.to("cuda")
+
+ # Wrap a raw prompt in the Qwen conversation template expected by the model.
+ def add_template(prompt):
+     conv = conv_templates['qwen'].copy()
+     conv.append_message(conv.roles[0], prompt[0])
+     conv.append_message(conv.roles[1], None)
+     prompt = conv.get_prompt()
+     return [prompt]
+
+ prompt = "A photo of a cute cat"
+ gen_img = pipe(add_template([f"Please generate image based on the following caption: {prompt}"]), guidance_scale=3.0)
+ ```
+
+ ## Training
+ We include two scripts: **slurm.sh** for multi-node training on Slurm clusters, and **run.sh** for debugging.
+
+ For both **slurm.sh** and **run.sh**, you need to set the Hugging Face cache directory **HF_HOME**, the training data folder **IMG_FOLDER**, and the output model folder **OUTPUT_FOLDER**.
+
+ For our open-source model training, we combine the pretraining datasets, including both long and short captions, with images from JourneyDB. You can download [JourneyDB](https://huggingface.co/datasets/JourneyDB/JourneyDB). When training the diffusion transformer from scratch, we recommend using a large number of training steps along with a cosine annealing learning rate schedule that decays from 1×10⁻⁴ down to 1×10⁻⁵, as sketched below.
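+
+ That schedule can be reproduced with PyTorch's built-in `CosineAnnealingLR`; the snippet below is a minimal sketch under assumed settings, not the actual training script, and the model, optimizer, and step count are placeholders.
+
+ ```python
+ import torch
+
+ model = torch.nn.Linear(1024, 1024)   # placeholder stand-in for the diffusion transformer
+ total_steps = 100_000                 # placeholder total number of training steps
+
+ optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
+ # Cosine annealing from the initial lr (1e-4) down to eta_min (1e-5) over total_steps.
+ scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps, eta_min=1e-5)
+
+ for step in range(total_steps):
+     # ... forward pass, loss.backward() ...
+     optimizer.step()
+     scheduler.step()   # learning rate follows a cosine curve from 1e-4 to 1e-5
+ ```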
+
+
+ ## CLIP + Diffusion (Encoder + Decoder)
+ We also provide two CLIP + Diffusion (encoder + decoder) pairings:
+
+ [EVA-CLIP + SDXL]: The model checkpoint already includes the diffusion decoder: [diffusion-decoder](https://huggingface.co/BLIP3o/BLIP3o-Model/tree/main/diffusion-decoder). The EVA-CLIP vision tower weights can be downloaded here: [EVA-CLIP](https://huggingface.co/jiuhai/eva_clip_vision_tower), and the EVA-CLIP preprocessing is included in the training code: [EVA-CLIP-preprocess](https://github.com/JiuhaiChen/BLIP3o/tree/main/blip3o/model/multimodal_encoder/eva_clip).
+
+ [SigLIP + SANA]: [coming soon]
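+
+ To make the encoder/decoder split concrete, generation composes three stages. The sketch below uses placeholder stubs (not the repository's API) purely to show how the pieces fit together.
+
+ ```python
+ import torch
+
+ def ar_backbone_encode(prompt: str) -> torch.Tensor:
+     # Placeholder: the autoregressive backbone (e.g. Qwen-2.5-VL) encodes the prompt into conditioning states.
+     return torch.zeros(1, 64, 1024)
+
+ def flow_matching_sample(cond: torch.Tensor) -> torch.Tensor:
+     # Placeholder: the flow-matching head integrates its learned velocity field from noise to CLIP image features.
+     return torch.zeros(1, 64, 1024)
+
+ def diffusion_decoder(clip_feats: torch.Tensor) -> torch.Tensor:
+     # Placeholder: the CLIP + Diffusion decoder (e.g. the EVA-CLIP + SDXL pairing above) renders pixels from CLIP features.
+     return torch.zeros(1, 3, 1024, 1024)
+
+ image = diffusion_decoder(flow_matching_sample(ar_backbone_encode("A photo of a cute cat")))
+ ```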
+
+
+
+ ## Supported Tasks
+
+ - **Text → Text**
+ - **Image → Text** (Image Understanding)
+ - **Text → Image** (Image Generation)
+ - **Image → Image** (Image Editing)
+ - **Multitask Training** (mixed training of image generation and understanding)
+
+
+ ## Supported Image Generation Methods
+
+ - **CLIP + MSE**
+ - **CLIP + Flow Matching**
+ - **VAE + Flow Matching**
+ - **Transfusion, LMFusion**
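+
+ Of these, CLIP + MSE is the simplest objective: the predicted features are regressed directly onto the target CLIP image features, while CLIP + Flow Matching replaces the regression with the velocity objective sketched in the introduction. A minimal illustration (placeholder names, not the repository's implementation):
+
+ ```python
+ import torch.nn.functional as F
+
+ # Illustrative CLIP + MSE objective; feature_head is a placeholder module
+ # mapping autoregressive conditioning into CLIP-feature space.
+ def clip_mse_loss(feature_head, cond, clip_feats):
+     pred_feats = feature_head(cond)
+     return F.mse_loss(pred_feats, clip_feats)
+ ```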
+
+
+
+ ## Supported Autoregressive Backbones
+
+ - **Qwen-2.5-VL**
+ - **LLaMA 3**
+
+ We suggest using Qwen-2.5-VL as the backbone; we are still fixing some tokenizer issues for LLaMA 3.