xiangan and nielsr (HF Staff) committed
Commit 967d049 · verified · Parent: 152c522

Improve model card: add prominent links, comprehensive guides, and project details (#2)


- Improve model card: add prominent links, comprehensive guides, and project details (ff264893f327294b99bd0c59896d42002f48150d)


Co-authored-by: Niels Rogge <nielsr@users.noreply.huggingface.co>

Files changed (1): README.md (+303, -45)
README.md CHANGED
@@ -1,67 +1,68 @@
1
  ---
2
- license: apache-2.0
3
- datasets:
4
- - lmms-lab/LLaVA-One-Vision-1.5-Mid-Training-85M
5
- - lmms-lab/LLaVA-OneVision-1.5-Insturct-Data
6
  base_model:
7
  - DeepGlint-AI/rice-vit-large-patch14-560
8
  - Qwen/Qwen3-4B-Instruct-2507
9
- pipeline_tag: image-text-to-text
 
 
10
  library_name: transformers
 
 
11
  ---
 
12
  # LLaVA-OneVision-1.5: Fully Open-Source State-of-the-Art VLM Model
13
 
14
- # ✨ Key Features
15
 
16
- **LLaVA-OneVision-1.5** introduces a novel family of **fully open-source** Large Multimodal Models (LMMs) that achieves **state-of-the-art performance** with substantially **lower cost** through training on **native resolution** images.
17
 
18
- 1. **Superior Performance**
19
- A family of fully open-source large multimodal models demonstrating **superior performance** across multiple multimodal benchmarks, **outperforming Qwen2.5-VL** in most evaluation tasks.
20
 
21
- 2. **High-Quality Data at Scale**
22
- Meticulously curated **mid-training and SFT data** with rigorous filtering and quality control.
23
- - Concept-balanced, highly diverse, high-quality caption data
24
- - Comprehensive instruction fine-tuning data covering a wide range of tasks
25
 
26
- 3. **Ultra-Efficient Training Framework**
27
- Complete end-to-end training framework designed for maximum efficiency:
28
- - **$16K total budget** for full model training
29
- - **45% HFU efficiency** on A100 GPUs ($0.6 per GPU/Hour)
30
- - Built on **MegatronLM** with support for **MoE**, **FP8**, and **long sequence parallelization**
31
- - Optimized codebase for cost-effective scaling
32
 
33
- 4. **Fully Open Framework** for community access and reproducibility:
34
- - βœ… High-quality mid-training & SFT data
35
- - βœ… Complete training framework & code
36
- - βœ… Training recipes & configurations
37
- - βœ… Base & instruct model checkpoints
38
- - βœ… Comprehensive training logs & metrics
39
 
40
- ## Code
41
- This model is trained using a fully open-source, end-to-end training framework, with all code available at [EvolvingLMMs-Lab/LLaVA-OneVision-1.5](https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-1.5).
43
 
44
  ## Models
45
 
46
- | Model | HF Link | Training Log |
47
- |--------------------------|--------------------------------------------------------------------------------------------------------|-------------|
48
- | LLaVA-OV-1.5-4B-Instruct | [πŸ€— HF / 4B-Instruct](https://huggingface.co/lmms-lab/LLaVA-OneVision-1.5-4B-Instruct) | [πŸ“ˆ Tensorboard](https://huggingface.co/lmms-lab/LLaVA-OneVision-1.5-4B-Instruct/tensorboard) |
49
- | LLaVA-OV-1.5-8B-Instruct | [πŸ€— HF / 8B-Instruct](https://huggingface.co/lmms-lab/LLaVA-OneVision-1.5-8B-Instruct) | [πŸ“ˆ Tensorboard](https://huggingface.co/lmms-lab/LLaVA-OneVision-1.5-8B-Instruct/tensorboard) |
50
-
51
 
52
  ## Dataset
 
53
  | Description | Link |
54
- |-------------|------|
55
  | Mid-training data for LLaVA-OneVision-1.5 | [πŸ€— Download (Uploading!)](https://huggingface.co/datasets/lmms-lab/LLaVA-One-Vision-1.5-Mid-Training-85M) |
56
  | SFT data for LLaVA-OneVision-1.5 | [πŸ€— Download (Uploading!)](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-1.5-Insturct-Data) |
57
 
58
  ## Evaluation Results
59
  All evaluations were conducted using [lmms_eval](https://github.com/EvolvingLMMs-Lab/lmms-eval).
60
 
61
-
62
  ![image](https://cdn-uploads.huggingface.co/production/uploads/655c70d331c4978366d4b2e6/J8oBYmQkTOC6pBNLgJn9d.png)
63
 
64
- ### Using πŸ€— Transformers to Chat
 
65
  Below is a code snippet showing how to chat with the model using `transformers` and `qwen_vl_utils`:
66
 
67
  ```python
@@ -115,20 +116,260 @@ output_text = processor.batch_decode(
115
  print(output_text)
116
  ```
117
118
  ## Citation
119
 
120
  If you find *LLaVA-OneVision-1.5* useful in your research, please consider citing the following related papers:
121
 
122
  ```
123
- @misc{an2025llavaonevision15fullyopenframework,
124
- title={LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training},
125
- author={Xiang An and Yin Xie and Kaicheng Yang and Wenkang Zhang and Xiuwei Zhao and Zheng Cheng and Yirui Wang and Songcen Xu and Changrui Chen and Chunsheng Wu and Huajie Tan and Chunyuan Li and Jing Yang and Jie Yu and Xiyao Wang and Bin Qin and Yumeng Wang and Zizhen Yan and Ziyong Feng and Ziwei Liu and Bo Li and Jiankang Deng},
126
- year={2025},
127
- eprint={2509.23661},
128
- archivePrefix={arXiv},
129
- primaryClass={cs.CV},
130
- url={https://arxiv.org/abs/2509.23661},
131
- }
132
 
133
  @inproceedings{xie2025region,
134
  title={Region-based Cluster Discrimination for Visual Representation Learning},
@@ -143,4 +384,21 @@ If you find *LLaVA-OneVision-1.5* useful in your research, please consider to ci
143
  journal={Transactions on Machine Learning Research},
144
  year={2024}
145
  }
146
- ```
1
  ---
 
 
 
 
2
  base_model:
3
  - DeepGlint-AI/rice-vit-large-patch14-560
4
  - Qwen/Qwen3-4B-Instruct-2507
5
+ datasets:
6
+ - lmms-lab/LLaVA-One-Vision-1.5-Mid-Training-85M
7
+ - lmms-lab/LLaVA-OneVision-1.5-Insturct-Data
8
  library_name: transformers
9
+ license: apache-2.0
10
+ pipeline_tag: image-text-to-text
11
  ---
12
+
13
  # LLaVA-OneVision-1.5: Fully Open-Source State-of-the-Art VLM Model
14
 
15
+ This is the official Hugging Face model card for **LLaVA-OneVision-1.5**, a novel family of Large Multimodal Models (LMMs) presented in the paper [LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training](https://huggingface.co/papers/2509.23661).
16
 
17
+ 📚 [Paper](https://huggingface.co/papers/2509.23661) | 💻 [Code](https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-1.5) | 🏠 [Project Page](https://huggingface.co/collections/lmms-lab/llava-onevision-15-68d385fe73b50bd22de23713) | 🚀 [Demo](https://huggingface.co/spaces/lmms-lab/LLaVA-OneVision-1.5)
18
 
19
+ # ✨ Key Features
 
20
 
21
+ **LLaVA-OneVision-1.5** introduces a novel family of **fully open-source** Large Multimodal Models (LMMs) that achieves **state-of-the-art performance** at substantially **lower cost** through training on **native-resolution** images.
 
 
 
22
 
23
+ 1. **Superior Performance**
24
+ A family of fully open-source large multimodal models demonstrating **superior performance** across multiple multimodal benchmarks, **outperforming Qwen2.5-VL** in most evaluation tasks.
 
 
 
 
25
 
26
+ 2. **High-Quality Data at Scale**
27
+ Meticulously curated **mid-training and SFT data** with rigorous filtering and quality control.
28
+ - Concept-balanced, highly diverse, high-quality caption data
29
+ - Comprehensive instruction fine-tuning data covering a wide range of tasks
 
 
30
 
31
+ 3. **Ultra-Efficient Training Framework**
32
+ Complete end-to-end training framework designed for maximum efficiency:
33
+ - **$16K total budget** for full model training
34
+ - **45% HFU (hardware FLOPs utilization)** on A100 GPUs at **$0.6 per GPU-hour** (see the back-of-the-envelope sketch after this list)
35
+ - Built on **MegatronLM** with support for **MoE**, **FP8**, and **long sequence parallelization**
36
+ - Optimized codebase for cost-effective scaling
37
 
38
+ 4. **Fully Open Framework** for community access and reproducibility:
+ - ✅ High-quality mid-training & SFT data
+ - ✅ Complete training framework & code
+ - ✅ Training recipes & configurations
+ - ✅ Base & instruct model checkpoints
+ - ✅ Comprehensive training logs & metrics
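+ As a rough sanity check on the cost figures in item 3 above, the back-of-the-envelope sketch below converts the quoted budget into A100 GPU-hours. It is only an illustration based on the two numbers listed here, not an official accounting.
+
+ ```python
+ # Back-of-the-envelope: translate the quoted training budget into A100 GPU-hours.
+ # Assumption: the $0.6/GPU-hour rate applies to the entire $16K budget.
+ total_budget_usd = 16_000
+ usd_per_gpu_hour = 0.6
+ gpu_hours = total_budget_usd / usd_per_gpu_hour
+ print(f"~{gpu_hours:,.0f} A100 GPU-hours")  # about 26,667 GPU-hours
+ ```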
44
 
45
  ## Models
46
 
47
+ | Model | HF Link | Training Log |
48
+ |---|---|---|
49
+ | LLaVA-OV-1.5-4B-Instruct | [🤗 HF / 4B-Instruct](https://huggingface.co/lmms-lab/LLaVA-OneVision-1.5-4B-Instruct) | [📈 Tensorboard](https://huggingface.co/lmms-lab/LLaVA-OneVision-1.5-4B-Instruct/tensorboard) |
50
+ | LLaVA-OV-1.5-8B-Instruct | [🤗 HF / 8B-Instruct](https://huggingface.co/lmms-lab/LLaVA-OneVision-1.5-8B-Instruct) | [📈 Tensorboard](https://huggingface.co/lmms-lab/LLaVA-OneVision-1.5-8B-Instruct/tensorboard) |
 
51
 
52
  ## Dataset
53
+
54
  | Description | Link |
55
+ |---|---|
56
  | Mid-training data for LLaVA-OneVision-1.5 | [🤗 Download (Uploading!)](https://huggingface.co/datasets/lmms-lab/LLaVA-One-Vision-1.5-Mid-Training-85M) |
57
  | SFT data for LLaVA-OneVision-1.5 | [🤗 Download (Uploading!)](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-1.5-Insturct-Data) |
58
 
59
  ## Evaluation Results
60
  All evaluations were conducted using [lmms_eval](https://github.com/EvolvingLMMs-Lab/lmms-eval).
61
 
 
62
  ![image](https://cdn-uploads.huggingface.co/production/uploads/655c70d331c4978366d4b2e6/J8oBYmQkTOC6pBNLgJn9d.png)
63
 
64
+ ## Quick Start with HuggingFace
65
+
66
  Below is a code snippet showing how to chat with the model using `transformers` and `qwen_vl_utils`:
67
 
68
  ```python
 
116
  print(output_text)
117
  ```
118
 
119
+ ## Evaluation
120
+ ```bash
121
+ # pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git
122
+
123
+ accelerate launch --num_processes=8 --main_process_port 12399 -m lmms_eval \
124
+ --model=llava_onevision1_5 \
125
+ --model_args=pretrained=lmms-lab/LLaVA-OneVision-1.5-8B-Instruct,attn_implementation=flash_attention_2,max_pixels=3240000 \
126
+ --tasks=mmmu_val,mmmu_pro_standard,mmbench_en_test,mmerealworld,mmerealworld_cn,ai2d,ai2d_no_mask,vstar_bench,chartqa,charxiv,docvqa_test,mathvista_testmini,mmstar,scienceqa \
127
+ --batch_size=1
128
+ ```
129
+
130
+ ## Quick Start Guide
131
+
132
+ ### 1. 🐳 Docker (Recommended)
133
+
134
+ We strongly recommend using the Docker environment for a seamless experience. The following instructions are tailored for an A100 80GB GPU environment.
135
+
136
+
137
+ ```bash
138
+ # Clone repository
139
+ git clone https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-1.5.git
140
+ cd LLaVA-OneVision-1.5
141
+
142
+ docker build -t llava_megatron:25.04 .
143
+
144
+ # Run container with -w to set working directory directly to the mounted volume
145
+ docker run -it --gpus all \
146
+ --ipc host --net host --privileged --cap-add IPC_LOCK \
147
+ --ulimit memlock=-1 --ulimit stack=67108864 --rm \
148
+ -v $(pwd):/workspace/LLaVA-OneVision-1.5 \
149
+ -w /workspace/LLaVA-OneVision-1.5 \
150
+ --name "llava_megatron_container" \
151
+ llava_megatron:25.04 /bin/bash
152
+ ```
153
+
154
+ ### 2. Checkpoint and Format Conversion
155
+
156
+ You have two options to get started with LLaVA-OneVision-1.5-stage-0:
157
+
158
+ #### Option 1: Download pre-trained model from HuggingFace
159
+ Download our `LLaVA-OneVision-1.5-4B-stage0` model directly from [HuggingFace](https://huggingface.co/lmms-lab/LLaVA-OneVision-1.5-4B-stage0).
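+ If you prefer scripting the download, here is a minimal sketch using `huggingface_hub` (the local directory name is chosen to match the path used in the conversion step below; adjust as needed):
+
+ ```python
+ # Sketch: download the stage-0 checkpoint into a local folder.
+ from huggingface_hub import snapshot_download
+
+ snapshot_download(
+     repo_id="lmms-lab/LLaVA-OneVision-1.5-4B-stage0",
+     local_dir="LLaVA-OneVision-1.5-4B-stage0",
+ )
+ ```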
160
+
161
+ #### Option 2: Merge initial weights yourself
162
+ Alternatively, you can merge the initial weights from the original ViT and LLM:
163
+ ```bash
164
+ python ds/merge_model.py \
165
+ --vit_path DeepGlint-AI/rice-vit-large-patch14-560 \
166
+ --llm_path Qwen/Qwen3-4B-Instruct-2507 \
167
+ --output LLaVA-OneVision-1.5-4B-stage0
168
+ ```
169
+ Note: When merging weights, the adapter component will be initialized with default values.
170
+
171
+ Convert the model from HuggingFace format to Megatron format:
172
+
173
+ ```bash
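+ # Usage (assumed from the example below): <HF checkpoint dir> <mcore output dir> <TP> <PP>;
+ # the trailing "1 1" set tensor- and pipeline-parallel sizes to 1 (cf. the _mcore_tp1_pp1 name).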
174
+ AIAK_TRAINING_PATH=/workspace/LLaVA-OneVision-1.5 bash examples/llava_ov_1_5/convert/convert_4b_hf_to_mcore.sh \
175
+ LLaVA-OneVision-1.5-4B-stage0 \
176
+ LLaVA-OneVision-1.5-4B-stage0_mcore_tp1_pp1 \
177
+ 1 1
178
+ ```
179
+
180
+ ### 3. Stage 1 Alignment-Training
181
+
182
+ Download the LLaVA-558K alignment data from [LLaVA-558K-Webdataset](https://huggingface.co/datasets/lmms-lab/LLaVA-558K-Webdataset).
183
+
184
+
185
+ ```bash
186
+ # ============================================================
187
+ # Required environment variables:
188
+ # AIAK_TRAINING_PATH Root directory of the AIAK-Training-LLM project
189
+ # DATA_PATH Directory with WebDataset shards (.tar) for pretraining
190
+ # TOKENIZER_PATH Hugging Face tokenizer directory
191
+ # CHECKPOINT_PATH Megatron-formatted checkpoint directory (e.g., mcore TP1/PP1)
192
+ # SAVE_CKPT_PATH Output directory for saving training checkpoints
193
+ AIAK_TRAINING_PATH=/workspace/LLaVA-OneVision-1.5 \
194
+ DATA_PATH=LLaVA-558K-Webdataset \
195
+ TOKENIZER_PATH=LLaVA-OneVision-1.5-4B-stage0 \
196
+ CHECKPOINT_PATH=LLaVA-OneVision-1.5-4B-stage0_mcore_tp1_pp1 \
197
+ bash examples/llava_ov_1_5/quick_start/stage_1_alignment_llava_ov_4b.sh
198
+ ```
199
+
200
+ ### 4. Stage 1.5 Mid-Training
201
+
202
+ Download our lightweight packed subset from [LLaVA-OneVision-1.5-Mid-Training-Quick-Start-3M-Webdataset](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-1.5-Mid-Training-Webdataset-Quick-Start-3M).
203
+
204
+ ```bash
205
+ # ============================================================
206
+ # Convert model to release format
207
+ bash examples/llava_ov_1_5/convert/convert_4b_mcore_to_release.sh \
208
+ stage_1_alignment_llava_ov_4b/iter_0002500/ \
209
+ stage_1_alignment_llava_ov_4b_release 1 1
210
+ # ============================================================
211
+ # Launch
212
+ AIAK_TRAINING_PATH=/workspace/LLaVA-OneVision-1.5 \
213
+ DATA_PATH=LLaVA-OneVision-1.5-Mid-Training-Quick-Start-3M-Webdataset \
214
+ TOKENIZER_PATH=LLaVA-OneVision-1.5-4B-stage0 \
215
+ CHECKPOINT_PATH=stage_1_alignment_llava_ov_4b_release \
216
+ bash examples/llava_ov_1_5/quick_start/stage_1.5_mid_training_llava_ov_4b.sh
217
+ ```
218
+
219
+
220
+ ### 5. Stage 2 Instruct-Training
221
+
222
+ Download the LLaVA-NeXT-780K instruction data from [LLaVA-NeXT-780K Dataset](https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-780k-webdataset).
223
+
224
+ ```bash
225
+ # ============================================================
226
+ # Convert model to release format
227
+ bash examples/llava_ov_1_5/convert/convert_4b_mcore_to_release.sh \
228
+ stage_1.5_mid_training_llava_ov_4b/iter_0020000/ \
229
+ stage_1.5_mid_training_llava_ov_4b_release 1 1
230
+ # ============================================================
231
+ # Launch
232
+ AIAK_TRAINING_PATH=/workspace/LLaVA-OneVision-1.5 \
233
+ DATA_PATH=LLaVA-NeXT-780k-Webdataset \
234
+ TOKENIZER_PATH=LLaVA-OneVision-1.5-4B-stage0 \
235
+ CHECKPOINT_PATH=stage_1.5_mid_training_llava_ov_4b_release \
236
+ bash examples/llava_ov_1_5/quick_start/stage_2_instruct_llava_ov_4b.sh
237
+ ```
238
+
239
+
240
+ ### 6. Convert the mcore Checkpoint to Hugging Face Format
241
+ ```bash
242
+ AIAK_TRAINING_PATH=/workspace/LLaVA-OneVision-1.5 \
243
+ bash examples/llava_ov_1_5/convert/convert_4b_mcore_to_hf.sh \
244
+ stage_2_instruct_llava_ov_4b/iter_0003500 \
245
+ LLaVA-OneVision-1.5-4B-3M-Mid-Training-780K-Instruct \
246
+ 1 1
247
+ # Copy non-model files (e.g., tokenizer config) to the new directory
248
+ find LLaVA-OneVision-1.5-4B-stage0/ -type f -not -iname '*safetensors*' -exec cp {} LLaVA-OneVision-1.5-4B-3M-Mid-Training-780K-Instruct/ ';'
249
+ ```
250
+
251
+ ### 7. Evaluation
252
+ ```bash
253
+ # pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git
254
+ CUDA_VISIBLE_DEVICES=4,5,6,7 accelerate launch \
255
+ --num_processes=4 --main_process_port 12399 -m lmms_eval --model=llava_onevision1_5 --batch_size=1 --tasks=mme \
256
+ --model_args=pretrained=/workspace/LLaVA-OneVision-1.5/LLaVA-OneVision-1.5-4B-3M-Mid-Training-780K-Instruct,max_pixels=3240000
257
+ ```
258
+
259
+ ## Full Reproduction Guide
260
+
261
+ > [!TIP]
262
+ > More detailed reproduction steps for the complete process will be provided after the dataset upload is completed.
263
+
264
+
265
+ ### Mid-Training
266
+
267
+ To improve model training efficiency, we implement offline sample packing:
268
+
269
+ 1. Download the [**Mid-Training-85M Dataset**](https://huggingface.co/datasets/lmms-lab/LLaVA-One-Vision-1.5-Mid-Training-85M)
270
+ 2. Pack the data into WebDataset format; see [**Offline Packing Examples**](examples_offline_packing) and [**Offline Padding-Free Data Packing**](examples/llava_ov_1_5/sample_packing/README.md), plus the illustrative sketch below
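+ The linked scripts implement the project's actual packing pipeline; the sketch below is only a minimal illustration of the idea (greedy grouping of samples up to a token budget, written out as WebDataset shards with the `webdataset` package). All names and limits in it are hypothetical.
+
+ ```python
+ # Minimal sketch of offline sample packing into WebDataset shards.
+ # Assumes each record is a JSON-serializable dict with a precomputed
+ # "num_tokens" field; a real pipeline also packs the image data.
+ import json
+ import webdataset as wds
+
+ MAX_TOKENS_PER_PACK = 8192   # hypothetical packing budget
+ SAMPLES_PER_SHARD = 1000     # hypothetical shard size
+
+ def pack_records(records, max_tokens=MAX_TOKENS_PER_PACK):
+     """Greedily group records until the token budget is reached."""
+     pack, used = [], 0
+     for rec in records:
+         if pack and used + rec["num_tokens"] > max_tokens:
+             yield pack
+             pack, used = [], 0
+         pack.append(rec)
+         used += rec["num_tokens"]
+     if pack:
+         yield pack
+
+ def write_shards(records, pattern="mid-training-%06d.tar"):
+     with wds.ShardWriter(pattern, maxcount=SAMPLES_PER_SHARD) as sink:
+         for i, pack in enumerate(pack_records(records)):
+             sink.write({"__key__": f"pack{i:09d}",
+                         "json": json.dumps(pack).encode("utf-8")})
+ ```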
271
+
272
+
273
+ ### Instruct
274
+ 1. Download the [**LLaVA-OneVision-1.5-Insturct-Data**](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-1.5-Insturct-Data)
275
+ 2. Convert the data into WebDataset format; see [**Conversion for Mixed Instruction Data**](docs/sft_data_preprocessing.md)
276
+
277
+ ## Roadmap
278
+
279
+ Q4 2025 Key Deliverables:
280
+
281
+ 1. **Ultra-efficient MoE Training**
282
+ 2. **Full Video Input LLM**
283
+
284
+
285
+ ## Contributors
286
+ Thanks so much to all of our amazing contributors!
287
+
288
+ <!-- readme: collaborators,contributors,jiankangdeng/- -start -->
289
+ <table>
290
+ <tbody>
291
+ <tr>
292
+ <td align="center">
293
+ <a href="https://github.com/fdcp">
294
+ <img src="https://avatars.githubusercontent.com/u/15667917?v=4" width="80;" alt="fdcp"/>
295
+ <br />
296
+ <sub><b>fdcp</b></sub>
297
+ </a>
298
+ </td>
299
+ <td align="center">
300
+ <a href="https://github.com/anxiangsir">
301
+ <img src="https://avatars.githubusercontent.com/u/31175974?v=4" width="80;" alt="anxiangsir"/>
302
+ <br />
303
+ <sub><b>anxiangsir</b></sub>
304
+ </a>
305
+ </td>
306
+ <td align="center">
307
+ <a href="https://github.com/yiyexy">
308
+ <img src="https://avatars.githubusercontent.com/u/35927125?v=4" width="80;" alt="yiyexy"/>
309
+ <br />
310
+ <sub><b>yiyexy</b></sub>
311
+ </a>
312
+ </td>
313
+ <td align="center">
314
+ <a href="https://github.com/wideyard">
315
+ <img src="https://avatars.githubusercontent.com/u/101321826?v=4" width="80;" alt="wideyard"/>
316
+ <br />
317
+ <sub><b>wideyard</b></sub>
318
+ </a>
319
+ </td>
320
+ <td align="center">
321
+ <a href="https://github.com/chengzheng345">
322
+ <img src="https://avatars.githubusercontent.com/u/209475443?v=4" width="80;" alt="chengzheng345"/>
323
+ <br />
324
+ <sub><b>chengzheng345</b></sub>
325
+ </a>
326
+ </td>
327
+ <td align="center">
328
+ <a href="https://github.com/killTheHostage">
329
+ <img src="https://avatars.githubusercontent.com/u/16442720?v=4" width="80;" alt="killTheHostage"/>
330
+ <br />
331
+ <sub><b>killTheHostage</b></sub>
332
+ </a>
333
+ </td>
334
+ <td align="center">
335
+ <a href="https://github.com/mathCrazyy">
336
+ <img src="https://avatars.githubusercontent.com/u/20607153?v=4" width="80;" alt="mathCrazyy"/>
337
+ <br />
338
+ <sub><b>mathCrazyy</b></sub>
339
+ </a>
340
+ </td>
341
+ <td align="center">
342
+ <a href="https://github.com/yunglechao">
343
+ <img src="https://avatars.githubusercontent.com/u/7631185?v=4" width="80;" alt="yunglechao"/>
344
+ <br />
345
+ <sub><b>yunglechao</b></sub>
346
+ </a>
347
+ </td>
348
+ </tr>
349
+ <tr>
350
+ <td align="center">
351
+ <a href="https://github.com/RobitYadda">
352
+ <img src="https://avatars.githubusercontent.com/u/6811311?v=4" width="80;" alt="RobitYadda"/>
353
+ <br />
354
+ <sub><b>RobitYadda</b></sub>
355
+ </a>
356
+ </td>
357
+ </tr>
358
+ </tbody>
359
+ </table>
360
+ <!-- readme: collaborators,contributors,jiankangdeng/- -end -->
361
+
362
  ## Citation
363
 
364
  If you find *LLaVA-OneVision-1.5* useful in your research, please consider citing the following related papers:
365
 
366
  ```
367
+ @misc{LLaVA-OneVision-1.5,
368
+ title={LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training},
369
+ author={An, Xiang and Xie, Yin and Yang, Kaicheng and Zhang, Wenkang and Zhao, Xiuwei and Cheng, Zheng and Wang, Yirui and Xu, Songcen and Chen, Changrui and Wu, Chunsheng and Tan, Huajie and Li, Chunyuan and Yang, Jing and Yu, Jie and Wang, Xiyao and Qin, Bin and Wang, Yumeng and Yan, Zizhen and Feng, Ziyong and Liu, Ziwei and Li, Bo and Deng, Jiankang},
370
+ eprint={2509.23661},
+ archivePrefix={arXiv},
+ primaryClass={cs.CV},
+ url={https://arxiv.org/abs/2509.23661},
371
+ year={2025}
372
+ }
 
 
 
373
 
374
  @inproceedings{xie2025region,
375
  title={Region-based Cluster Discrimination for Visual Representation Learning},
 
384
  journal={Transactions on Machine Learning Research},
385
  year={2024}
386
  }
387
+ ```
388
+
389
+ ## Acknowledgement
390
+
391
+ We extend our sincere gratitude to the **AIAK team** of the [**Baige AI computing platform**](https://cloud.baidu.com/product/aihc.html) from **Baidu AI Cloud** for providing an exceptional training framework. The outstanding capabilities of AIAK-Training-LLM and AIAK-Megatron significantly accelerated our training process with remarkable efficiency; these cutting-edge frameworks have been instrumental in achieving our research goals. For full AIAK support, please contact Baidu AI Cloud.
392
+
393
+
394
+ We also thank the maintainers and contributors of the following open-source projects, whose work greatly inspired and supported our research:
395
+
396
+ - LLaVA: Large Language-and-Vision Assistant – [LLaVA](https://github.com/haotian-liu/LLaVA)
+ - LLaVA-NeXT: Next-generation multi-modal assistant – [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT)
+ - lmms-eval: A standardized evaluation framework for Large Multimodal Models – [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval)
+ - Megatron-LM: Efficient, scalable training for large language models – [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)
+ - Qwen2.5-VL: Strong vision-language foundation model – [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL)
+ - InternVL: Open-source large-scale vision-language foundation model – [InternVL](https://github.com/OpenGVLab/InternVL)
+ - Qwen3: Next-generation Qwen LLM – [Qwen](https://github.com/QwenLM/Qwen)
+ - MetaCLIP: Scalable contrastive pretraining – [MetaCLIP](https://github.com/facebookresearch/MetaCLIP)
+ - FineVision: Open Data Is All You Need – [FineVision](https://huggingface.co/spaces/HuggingFaceM4/FineVision)