Training procedure

This model is fine-tuned using Supervised Fine-Tuning (SFT) with a LoRA (Low-Rank Adaptation) setup on top of Qwen3-VL-8B-Instruct.

The training is part of the Automingo project, which focuses on safety-critical driving VQA using structured, scenario-based reasoning over short temporal image sequences.

Dataset

Training is performed on the Automingo-VQA dataset, designed for structured reasoning in driving scenarios:

6,565 images
1,313 annotated events
5,792 question–answer pairs
5-frame temporal snippets centered around safety-critical events
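
The figures above are mutually consistent, which the short check below makes explicit. It is only arithmetic on the reported counts, under the assumption that each annotated event contributes exactly one 5-frame snippet; the variable names are illustrative and not part of any Automingo tooling.

```python
# Dataset statistics as reported in this card.
NUM_IMAGES = 6565
NUM_EVENTS = 1313
NUM_QA_PAIRS = 5792
FRAMES_PER_SNIPPET = 5

# Assumption: one 5-frame snippet per annotated event.
images_from_events = NUM_EVENTS * FRAMES_PER_SNIPPET
qa_pairs_per_event = NUM_QA_PAIRS / NUM_EVENTS

print(images_from_events == NUM_IMAGES)  # image count matches events x frames
print(round(qa_pairs_per_event, 2))      # average QA pairs per event
```

Under that assumption the image count is fully accounted for by the events, and each event carries between four and five question–answer pairs on average.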

The dataset emphasizes:

cut-ins
traffic light transitions
vulnerable road users
leading vehicle braking
construction and lane changes
intersections and roundabouts

Fine-tuning setup

LoRA adapters are applied to the following modules: q_proj, k_proj, v_proj, o_proj, up_proj, down_proj, gate_proj

Best configuration:

Learning rate: 2e-4
LoRA rank: 32
Gradient accumulation: 8
Optimizer: AdamW

Hyperparameters were selected with a sweep; the configuration above is the best-performing run.
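
As a rough sketch, the reported settings can be collected into plain dictionaries. The key names are chosen to match the `peft.LoraConfig` parameters (`r`, `target_modules`) and the `transformers` training arguments (`learning_rate`, `gradient_accumulation_steps`, `optim`), but the actual training script is not shown in this card, so treat the mapping as an assumption. Values the card does not report (LoRA alpha, dropout, batch size) are deliberately left out, and the two helper functions are hypothetical.

```python
# Hyperparameters reported in this card for the best sweep run.
LORA_KWARGS = {
    "r": 32,  # LoRA rank
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "up_proj", "down_proj", "gate_proj",
    ],
}
TRAINING_KWARGS = {
    "learning_rate": 2e-4,
    "gradient_accumulation_steps": 8,
    "optim": "adamw_torch",  # assumption: the card only says "AdamW"
}


def effective_batch_size(per_device_batch_size: int, num_devices: int = 1) -> int:
    """Samples contributing to one optimizer step (hypothetical helper)."""
    steps = TRAINING_KWARGS["gradient_accumulation_steps"]
    return per_device_batch_size * num_devices * steps


def lora_added_params(d_in: int, d_out: int, rank: int = LORA_KWARGS["r"]) -> int:
    """Trainable parameters LoRA adds to one linear layer.

    LoRA learns A with shape (rank, d_in) and B with shape (d_out, rank).
    """
    return rank * (d_in + d_out)
```

For example, with a hypothetical per-device batch size of 2 on one GPU, `effective_batch_size(2)` gives 16 samples per optimizer step, and `lora_added_params(4096, 4096)` gives 262,144 extra parameters for a square 4096-wide projection (the layer width here is illustrative, not taken from the card).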

Training objective

The model is trained to:

answer structured driving-related questions
produce reasoning aligned with safety-critical interpretation
avoid invalid or non-actionable outputs

Evaluation and results

Evaluation setup

Evaluation is performed using the Automingo benchmark pipeline:

Multiple-choice question (MCQ) answering
Post-processing with structured evaluation scripts

Metrics:

MCQ accuracy
Invalid attempts
Semantic score (Lingo-Judge)
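
The first two metrics can be illustrated with a small scoring sketch. The Automingo evaluation scripts are not reproduced in this card, so everything below is an assumption: options are taken to be labelled A to D, the model is expected to lead with the option letter, any output that does not start with a single option letter counts as an invalid attempt, and invalid attempts count as wrong in the accuracy denominator. The function names are hypothetical.

```python
import re
from typing import Dict, List, Optional

# Accepts "B", "(B)", "B)", "B.", "B: ..." at the start of the output.
_CHOICE = re.compile(r"\s*\(?([A-D])\)?(?:[.:)]|\s*$)")


def extract_choice(text: str) -> Optional[str]:
    """Return the leading option letter, or None if the output is not parseable."""
    match = _CHOICE.match(text)
    return match.group(1) if match else None


def score_mcq(predictions: List[str], answers: List[str]) -> Dict[str, float]:
    """Compute MCQ accuracy (in percent) and the number of invalid attempts."""
    correct = 0
    invalid = 0
    for pred, gold in zip(predictions, answers):
        choice = extract_choice(pred)
        if choice is None:
            invalid += 1  # unparseable output, also counted as incorrect
        elif choice == gold:
            correct += 1
    accuracy = 100.0 * correct / len(answers) if answers else 0.0
    return {"accuracy_pct": accuracy, "invalid_attempts": invalid}


results = score_mcq(
    ["B", "(C) because the light turned red", "I am not sure", "A."],
    ["B", "C", "D", "B"],
)
print(results)
```

In this toy run two of four outputs are correct and one is unparseable. A stricter or looser parser would shift the invalid-attempt count, which is why that metric is reported alongside accuracy.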

Benchmark results

Model	MCQ Acc. (%)	Invalid Attempts	Lingo-Judge
Qwen3-VL-8B (base)	81.5	9	0.556
Automingo-VLM-8B (this model)	89.3	43	0.628

Key improvements

+7.8 percentage points in MCQ accuracy over the base model (81.5% to 89.3%)
Improved structured reasoning for safety-critical scenarios
Higher Lingo-Judge semantic score (0.556 to 0.628)
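
The deltas between the two rows of the benchmark table can be recomputed directly. The snippet below is only arithmetic on the published figures; the relative error reduction is an extra derived quantity, not a metric reported by the Automingo pipeline.

```python
# Figures copied from the benchmark table.
base = {"mcq_acc": 81.5, "invalid": 9, "lingo_judge": 0.556}
tuned = {"mcq_acc": 89.3, "invalid": 43, "lingo_judge": 0.628}

# Absolute gain in percentage points.
acc_gain_pp = round(tuned["mcq_acc"] - base["mcq_acc"], 1)

# Share of the base model's MCQ errors removed by fine-tuning.
base_err = 100.0 - base["mcq_acc"]
tuned_err = 100.0 - tuned["mcq_acc"]
rel_error_reduction = round(100.0 * (base_err - tuned_err) / base_err, 1)

lingo_gain = round(tuned["lingo_judge"] - base["lingo_judge"], 3)
invalid_delta = tuned["invalid"] - base["invalid"]

print(acc_gain_pp, rel_error_reduction, lingo_gain, invalid_delta)
```

The accuracy gain is 7.8 points, which corresponds to removing roughly 42% of the base model's MCQ errors, while the invalid-attempt count increases by 34.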

Observations

Strong performance on:
- cut-in scenarios
- leading vehicle interactions

Remaining challenges:
- intersections
- roundabouts
- invalid attempts, which rose from 9 (base) to 43 after fine-tuning

Overall, the fine-tuned model reaches 89.3% MCQ accuracy on the Automingo benchmark, up from 81.5% for the base model, and demonstrates specialization for ADAS-style reasoning tasks.

base_model: Qwen/Qwen3-VL-8B-Instruct
library_name: peft
model_name: crisp-sweep-3_lwysu1i9
tags:
- base_model:adapter:Qwen/Qwen3-VL-8B-Instruct
- lora
- sft
- transformers
- trl
licence: license
pipeline_tag: text-generation

Model Card for crisp-sweep-3_lwysu1i9

This model is a fine-tuned version of Qwen/Qwen3-VL-8B-Instruct. It has been trained using TRL.

Quick start

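from transformers import pipeline

# Placeholder left by the card template: replace "None" with this adapter's Hub repository id or a local path.
model_id = "None"

question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
generator = pipeline("text-generation", model=model_id, device="cuda")
# Chat-style input is a list of messages; return_full_text=False returns only the newly generated reply.
output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
print(output["generated_text"])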
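
The quick-start prompt is text-only, whereas the model is trained on 5-frame driving snippets. A minimal sketch of how such a query might be packaged as a chat message is shown below. The layout (image entries followed by a text entry inside one user turn) follows the multimodal chat convention commonly used with Qwen-VL models; the exact prompt format used for Automingo training is not documented in this card, so this is an assumption, and the helper name and file names are hypothetical.

```python
from typing import Any, Dict, List

FRAMES_PER_SNIPPET = 5  # snippet length stated in the dataset description


def build_snippet_messages(frame_paths: List[str], question: str) -> List[Dict[str, Any]]:
    """Build one user turn holding the frames in temporal order, then the question."""
    if len(frame_paths) != FRAMES_PER_SNIPPET:
        raise ValueError(
            f"expected {FRAMES_PER_SNIPPET} frames, got {len(frame_paths)}"
        )
    content: List[Dict[str, Any]] = [
        {"type": "image", "image": path} for path in frame_paths
    ]
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]


# Hypothetical frame files for one event, oldest first.
frames = [f"event_0001/frame_{i}.jpg" for i in range(FRAMES_PER_SNIPPET)]
messages = build_snippet_messages(frames, "Is the vehicle ahead braking?")
print(len(messages[0]["content"]))
```

The resulting `messages` list would then be passed to whichever processor or pipeline is used for inference; that step is omitted here because it depends on the deployment setup.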

Framework versions

PEFT: 0.18.1
TRL: 0.29.0
Transformers: 4.57.6
PyTorch: 2.10.0+cu126
Datasets: 4.6.0
Tokenizers: 0.22.2

Citations

Cite TRL as:
    
```bibtex
@software{vonwerra2020trl,
  title   = {{TRL: Transformers Reinforcement Learning}},
  author  = {von Werra, Leandro and Belkada, Younes and Tunstall, Lewis and Beeching, Edward and Thrush, Tristan and Lambert, Nathan and Huang, Shengyi and Rasul, Kashif and Gallouédec, Quentin},
  license = {Apache-2.0},
  url     = {https://github.com/huggingface/trl},
  year    = {2020}
}
```
