Training procedure
This model is fine-tuned using Supervised Fine-Tuning (SFT) with a LoRA (Low-Rank Adaptation) setup on top of Qwen3-VL-8B-Instruct.
The training is part of the Automingo project, which focuses on safety-critical driving VQA using structured, scenario-based reasoning over short temporal image sequences.
Dataset
Training is performed on the Automingo-VQA dataset, designed for structured reasoning in driving scenarios:
- 6,565 images
- 1,313 annotated events
- 5,792 question–answer pairs
- 5-frame temporal snippets centered around safety-critical events
The dataset emphasizes:
- cut-ins
- traffic light transitions
- vulnerable road users
- leading vehicle braking
- construction and lane changes
- intersections and roundabouts
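The published counts are consistent with one 5-frame snippet per event (1,313 × 5 = 6,565 images). As a rough illustration only, a single event could be represented as below. The class and field names (`DrivingEvent`, `QAPair`, `frame_paths`, `scenario`), the file paths, and the sample question are hypothetical and are not taken from the actual Automingo-VQA schema.

```python
from dataclasses import dataclass, field

FRAMES_PER_EVENT = 5  # from the dataset description: 5-frame temporal snippets


@dataclass
class QAPair:
    question: str
    answer: str


@dataclass
class DrivingEvent:
    """Hypothetical record layout; the real Automingo-VQA schema may differ."""

    scenario: str                 # e.g. "cut-in" or "roundabout"
    frame_paths: list[str]        # ordered frames centered on the event
    qa_pairs: list[QAPair] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject snippets that do not contain exactly five frames.
        if len(self.frame_paths) != FRAMES_PER_EVENT:
            raise ValueError(
                f"expected {FRAMES_PER_EVENT} frames, got {len(self.frame_paths)}"
            )


# Sanity check on the published dataset statistics.
NUM_EVENTS = 1_313
NUM_IMAGES = 6_565
NUM_QA_PAIRS = 5_792
assert NUM_EVENTS * FRAMES_PER_EVENT == NUM_IMAGES
qa_per_event = NUM_QA_PAIRS / NUM_EVENTS  # roughly 4.4 questions per event

event = DrivingEvent(
    scenario="cut-in",
    # Made-up paths, for illustration only.
    frame_paths=[f"event_0001/frame_{i}.jpg" for i in range(FRAMES_PER_EVENT)],
    # Made-up question and answer, for illustration only.
    qa_pairs=[QAPair("Is a vehicle entering the ego lane?", "Yes")],
)
```

Validating the frame count at construction time is one way to catch truncated snippets early; whether the real data loader does this is not stated in this card.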
Fine-tuning setup
LoRA adapters are applied to the following modules: q_proj, k_proj, v_proj, o_proj, up_proj, down_proj, gate_proj
Best configuration:
- Learning rate: 2e-4
- LoRA rank: 32
- Gradient accumulation: 8
- Optimizer: AdamW
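For intuition about what rank 32 means: LoRA freezes a weight matrix of shape d_out × d_in and learns a low-rank update BA, with B of shape d_out × r and A of shape r × d_in, which adds r · (d_in + d_out) trainable parameters per adapted layer. The sketch below computes this for the rank reported above. The layer dimension used is an illustrative placeholder, not a value read from the Qwen3-VL-8B-Instruct configuration, and the helper names are hypothetical.

```python
LORA_RANK = 32  # from the best configuration reported in this card

# Modules that receive LoRA adapters, as listed in this card.
TARGET_MODULES = [
    "q_proj", "k_proj", "v_proj", "o_proj",
    "up_proj", "down_proj", "gate_proj",
]


def lora_trainable_params(d_in: int, d_out: int, rank: int = LORA_RANK) -> int:
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def full_params(d_in: int, d_out: int) -> int:
    """Parameters in the frozen dense weight (bias ignored)."""
    return d_in * d_out


# Illustrative square projection; 4096 is a placeholder, not a confirmed model dimension.
d = 4096
added = lora_trainable_params(d, d)
fraction = added / full_params(d, d)
print(f"{added} trainable parameters, {fraction:.2%} of the dense layer")
```

The total adapter size for the real model depends on the actual shapes of the seven target modules in every layer, which this sketch does not assume.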
Training was conducted using a sweep-based approach to optimize hyperparameters.
Training objective
The model is trained to:
- answer structured driving-related questions
- produce reasoning aligned with safety-critical interpretation
- avoid invalid or non-actionable outputs
Evaluation and results
Evaluation setup
Evaluation is performed using the Automingo benchmark pipeline:
- Multiple-choice question (MCQ) answering
- Post-processing with structured evaluation scripts
Metrics:
- MCQ accuracy
- Invalid attempts
- Semantic score (Lingo-Judge)
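To make the first two metrics concrete, the sketch below scores MCQ outputs under assumptions that are not confirmed by the Automingo pipeline: options are labeled A to D, a response counts as an invalid attempt when it is not a single option letter (optionally wrapped as "(B)", "B." or "Answer: B"), and invalid attempts are counted as incorrect. The function names and the extraction rule are hypothetical.

```python
import re
from typing import Optional

# Assumption: four lettered options and a strict answer format.
_CHOICE_RE = re.compile(r"(?:Answer:\s*)?\(?([A-D])\)?\.?")


def extract_choice(response: str) -> Optional[str]:
    """Return the chosen option letter, or None if the response is invalid."""
    match = _CHOICE_RE.fullmatch(response.strip())
    return match.group(1) if match else None


def score_mcq(responses: list[str], gold: list[str]) -> dict:
    """Compute MCQ accuracy (in %) and the number of invalid attempts."""
    if len(responses) != len(gold):
        raise ValueError("responses and gold answers must have the same length")
    correct = 0
    invalid = 0
    for response, answer in zip(responses, gold):
        choice = extract_choice(response)
        if choice is None:
            invalid += 1      # unparseable output; also counted as incorrect
        elif choice == answer:
            correct += 1
    accuracy = 100.0 * correct / len(gold) if gold else 0.0
    return {"mcq_accuracy": accuracy, "invalid_attempts": invalid}


# Made-up responses for illustration only.
result = score_mcq(
    responses=["B", "(A)", "Answer: C", "I am not sure"],
    gold=["B", "A", "D", "C"],
)
print(result)  # {'mcq_accuracy': 50.0, 'invalid_attempts': 1}
```

A stricter or looser extraction rule changes the invalid-attempt count, so numbers from this sketch are not comparable with the table in this card.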
Benchmark results
| Model | MCQ Acc. (%) | Invalid Attempts | Lingo-Judge |
|---|---|---|---|
| Qwen3-VL-8B (base) | 81.5 | 9 | 0.556 |
| Automingo-VLM-8B (this model) | 89.3 | 43 | 0.628 |
Key improvements
- +7.8 percentage points absolute gain in MCQ accuracy over the base model (81.5% to 89.3%)
- Improved structured reasoning for safety-critical scenarios
- Higher Lingo-Judge semantic score (0.556 to 0.628)
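The 7.8 figure in the first bullet is an absolute difference in percentage points, which is not the same as the relative gain over the base score. A short check using the values from the results table:

```python
base_acc = 81.5       # Qwen3-VL-8B (base), MCQ accuracy in %
finetuned_acc = 89.3  # Automingo-VLM-8B, MCQ accuracy in %

# Absolute gain, in percentage points.
absolute_gain_pp = round(finetuned_acc - base_acc, 1)
# Relative gain, as a percentage of the base score.
relative_gain_pct = round(100 * (finetuned_acc - base_acc) / base_acc, 1)

print(f"absolute: +{absolute_gain_pp} pp, relative: +{relative_gain_pct}%")
# prints: absolute: +7.8 pp, relative: +9.6%
```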
Observations
Strong performance on:
- cut-in scenarios
- leading vehicle interactions
Remaining challenges:
- intersections
- roundabouts
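Overall, the fine-tuned model improves on the base model in both MCQ accuracy and Lingo-Judge score on the Automingo benchmark and demonstrates specialization for ADAS-style reasoning tasks. Invalid attempts, however, rise from 9 to 43, so output validity remains an open issue alongside intersections and roundabouts.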
base_model: Qwen/Qwen3-VL-8B-Instruct
library_name: peft
model_name: crisp-sweep-3_lwysu1i9
tags:
- base_model:adapter:Qwen/Qwen3-VL-8B-Instruct
- lora
- sft
- transformers
- trl
licence: license
pipeline_tag: text-generation
Model Card for crisp-sweep-3_lwysu1i9
This model is a fine-tuned version of Qwen/Qwen3-VL-8B-Instruct. It has been trained using TRL.
Quick start
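```python
from transformers import pipeline

question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"

# NOTE: "None" is an unfilled placeholder from the card template.
# Replace it with this model's Hub repository ID or a local path before running.
generator = pipeline("text-generation", model="None", device="cuda")

# Send a single-turn chat message and print only the newly generated text.
output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
print(output["generated_text"])
```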
Training procedure
This model was trained with SFT.
Framework versions
- PEFT: 0.18.1
- TRL: 0.29.0
- Transformers: 4.57.6
- PyTorch: 2.10.0+cu126
- Datasets: 4.6.0
- Tokenizers: 0.22.2
Citations
Cite TRL as:
```bibtex
@software{vonwerra2020trl,
title = {{TRL: Transformers Reinforcement Learning}},
author = {von Werra, Leandro and Belkada, Younes and Tunstall, Lewis and Beeching, Edward and Thrush, Tristan and Lambert, Nathan and Huang, Shengyi and Rasul, Kashif and Gallouédec, Quentin},
license = {Apache-2.0},
url = {https://github.com/huggingface/trl},
year = {2020}
}
```