VLMPed-CoT

A LoRA fine-tuned version of Qwen2.5-VL-3B-Instruct for pedestrian crossing intention prediction, trained with Chain-of-Thought supervision.

This model is part of the ECE 228 final project at UCSD (Spring 2026): "How Do Vision Language Models Utilize Multi-Frame Temporal Information for Pedestrian Intention Prediction?"

Project repository: ece228_VLMPed-CoT


Model Details

  • Developed by: chiawen0104
  • Model type: Vision-Language Model (LoRA fine-tuned)
  • Finetuned from: Qwen/Qwen2.5-VL-3B-Instruct
  • Task: Pedestrian crossing intention prediction (binary: cross / not cross)
  • Training datasets: JAAD, PIE
  • Framework: PEFT 0.15.1

How to Get Started

from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from peft import PeftModel

base_model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct"
)
model = PeftModel.from_pretrained(base_model, "chiawen0104/VLMPed-CoT")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")

Training Details

  • Base model: Qwen2.5-VL-3B-Instruct
  • Fine-tuning method: LoRA (via PEFT)
  • Training regime: bf16 mixed precision
  • Training data: JAAD and PIE pedestrian crossing intention datasets
  • CoT supervision: โœ… Chain-of-Thought reasoning generated via Gemini API

Intended Use

This model takes multi-frame pedestrian images as input and predicts whether a pedestrian intends to cross the street. The CoT supervision encourages the model to reason step-by-step before making a prediction. It is intended for research purposes in autonomous driving and pedestrian behavior analysis.


Differences from VLMPed-wo-CoT

VLMPed-CoT VLMPed-wo-CoT
CoT supervision โœ… โŒ
Direct prediction โœ… โœ…

Reference

Framework versions

  • PEFT 0.15.1
Downloads last month
14
Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for chiawen0104/VLMPed-CoT

Adapter
(198)
this model