---
license: apache-2.0
language:
- en
base_model:
- meta-llama/Llama-3.2-11B-Vision-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
---
# Model Card for Llama-3.2V-11B-cot
<!-- Provide a quick summary of what the model is/does. -->
Llama-3.2V-11B-cot is the first version of [LLaVA-o1](https://github.com/PKU-YuanGroup/LLaVA-o1), a vision-language model capable of spontaneous, systematic reasoning.
The model was proposed in [LLaVA-o1: Let Vision Language Models Reason Step-by-Step](https://huggingface.co/papers/2411.10440).
## Model Details
<!-- Provide a longer summary of what this model is. -->
- **License:** apache-2.0
- **Finetuned from model:** meta-llama/Llama-3.2-11B-Vision-Instruct
## Benchmark Results
| MMStar | MMBench | MMVet | MathVista | AI2D | Hallusion | Average |
|--------|---------|-------|-----------|------|-----------|---------|
| 57.6 | 75.0 | 60.3 | 54.8 | 85.7 | 47.8 | 63.5 |
## Reproduction
<!-- This section describes the evaluation protocols and provides the results. -->
To reproduce our results, you should use [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) and the following settings.
| Parameter | Value |
|-------------------|---------|
| do_sample | True |
| temperature | 0.6 |
| top_p | 0.9 |
| max_new_tokens | 2048 |
You can change them in [this file](https://github.com/open-compass/VLMEvalKit/blob/main/vlmeval/vlm/llama_vision.py), lines 80-83, and adjust `max_new_tokens` throughout the file.
Note: We follow the same settings as Llama-3.2-11B-Vision-Instruct, except that we extend the max_new_tokens to 2048.
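For reference, these settings correspond to the following `generate()` keyword arguments (a minimal sketch; the variable name is illustrative and not the one used in VLMEvalKit):

```python
# Sampling settings from the table above, as generate() keyword arguments.
generation_kwargs = dict(
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
    max_new_tokens=2048,
)
# Pass them as: model.generate(**inputs, **generation_kwargs)
```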
After you get the results, you should filter the model output and **keep only the text between \<CONCLUSION\> and \</CONCLUSION\>**.
In theory this should make no difference, but in practice we observe some performance differences because the GPT-4o judge can be inaccurate at times.
Keeping only the text between \<CONCLUSION\> and \</CONCLUSION\> lets VLMEvalKit extract most answers directly, which is much less biased.
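For example, the conclusion can be pulled out with a small helper like the following (a sketch; `extract_conclusion` is an illustrative helper, not part of VLMEvalKit):

```python
import re

def extract_conclusion(output: str) -> str:
    """Return only the text between <CONCLUSION> and </CONCLUSION>,
    falling back to the raw output if the tags are missing."""
    match = re.search(r"<CONCLUSION>(.*?)</CONCLUSION>", output, re.DOTALL)
    return match.group(1).strip() if match else output.strip()
```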
## How to Get Started with the Model
You can use the same inference code as for Llama-3.2-11B-Vision-Instruct, since the model shares its architecture with the base model.
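A minimal sketch using the standard `transformers` mllama inference code (the model id and image path are placeholders; the sampling settings match the Reproduction section above):

```python
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "Llama-3.2V-11B-cot"  # placeholder: replace with this repository's model id
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")  # placeholder: your input image
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "How many objects are in the picture?"},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, return_tensors="pt").to(model.device)

# Sampling settings from the Reproduction section.
output = model.generate(
    **inputs, do_sample=True, temperature=0.6, top_p=0.9, max_new_tokens=2048
)
print(processor.decode(output[0], skip_special_tokens=True))
```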
## Training Details
### Training Data
<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
The model is trained on the LLaVA-o1-100k dataset (to be released).
### Training Procedure
<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
The model is finetuned with [llama-recipes](https://github.com/Meta-Llama/llama-recipes) using the following settings.
Using the same settings should accurately reproduce our results.
| Parameter | Value |
|-------------------------------|---------------------------------------------------|
| FSDP | enabled |
| lr | 1e-5 |
| num_epochs | 3 |
| batch_size_training | 4 |
| use_fast_kernels | True |
| run_validation | False |
| batching_strategy | padding |
| context_length | 4096 |
| gradient_accumulation_steps | 1 |
| gradient_clipping | False |
| gradient_clipping_threshold | 1.0 |
| weight_decay | 0.0 |
| gamma | 0.85 |
| seed | 42 |
| use_fp16 | False |
| mixed_precision | True |
## Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
Like other VLMs, the model may generate biased or offensive content due to limitations in its training data.
On the technical side, its performance in areas such as instruction following still falls short of leading industry models.