InternVL2-2B — LoRA fine-tuned on DriveLM (25k + 10% Pseudo-Labels) ⭐ Best model
Part of the Master's thesis:
Visual Question Answering for Autonomous Driving, by Dmytro Khursenko, Czech Technical University in Prague, Faculty of Electrical Engineering, 2026. Supervised by Ing. David Hurych, Ph.D. (Valeo) and doc. Georgios Tolias, Ph.D. (CTU FEE).
GitHub · Demo · Pseudo-label dataset
📄 Thesis PDF expected by end of June 2026, following successful defense (CTU FEE).
This is the best-performing online checkpoint from the thesis: InternVL2-2B fine-tuned with LoRA on the custom 25k i.i.d. DriveLM split augmented with 10% Qwen3-generated pseudo-labels (~9,900 additional pairs), totalling 35,694 QA pairs. It reaches a Final Score of 0.589, the best result among the models trained in the thesis without oracle visual annotations (the external Mini-DA baseline scores 0.606 on the same split).
This checkpoint contains fully merged weights: the LoRA adapter is already merged into the base model, so it can be loaded and used directly with AutoModel.
All released models
| Model | HF repo | Final |
|---|---|---|
| LoRA-25k | dkhursen/InternVL2-2b-LoRA-25k-drivelm | 0.560 |
| LoRA-300k | dkhursen/InternVL2-2b-LoRA-300k-drivelm | 0.493 |
| LoRA-25k + DL-PL 10% (this model) ⭐ | dkhursen/InternVL2-2b-LoRA-25k_plus_DL-PL-10pct | 0.589 |
| LoRA-25k + Oracle annotation | dkhursen/InternVL2-2b-LoRA-25k-drivelm-offline-redcircle-ctag-bkgd | 0.775 |
Pseudo-label dataset: dkhursen/drivelm-pseudo-labels
Results on custom test split
Evaluated on the custom i.i.d. DriveLM-nuScenes test split (3,340 QA pairs).
| Metric | Score |
|---|---|
| Final | 0.589 |
| Accuracy | 0.836 |
| ChatGPT | 0.676 |
| Language | 0.451 |
| BLEU-1 | 0.719 |
| BLEU-2 | 0.654 |
| BLEU-3 | 0.589 |
| BLEU-4 | 0.525 |
| ROUGE-L | 0.710 |
| CIDEr | 0.222 |
| Match | 0.304 |
| Coord | 0.013 |
Score definitions (all table values normalised to [0, 1])
- Accuracy: exact match on MCQ (A/B/C/D) and Yes/No questions; strict letter-only format required
- Language: (mean of BLEU-1 to BLEU-4 + ROUGE-L + CIDEr/10) / 3, i.e. the average of three terms rather than of six individual metrics
- ChatGPT: GPT-3.5-turbo semantic similarity judge (0–100 scale, ÷100 in Final)
- Match: (F1_coord × 100 + GPT_match) / 2 on prediction answers (0–100 scale, ÷100 in Final); blends spatial coordinate F1 with a GPT judge
- Coord: pure coordinate token F1 at L1 < 16 px; diagnostic only, not included in Final
- Final: 0.4 × (GPT/100) + 0.2 × Language + 0.2 × (Match/100) + 0.2 × Accuracy

Full metric breakdown: evaluation/README.md
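As a sanity check on the score definitions, the following minimal sketch recomputes Language and Final for this model from the per-metric values in the results table. The function names `language_score` and `final_score` are hypothetical and not taken from the evaluation script. The grouping of the Language mean (three terms, with the four BLEU scores averaged first) is inferred from the reported numbers. All inputs are assumed to be on the [0, 1] scale used in the table, with CIDEr passed as listed.

```python
def language_score(bleu, rouge_l, cider):
    """Average of three terms: mean BLEU-1..4, ROUGE-L, and CIDEr / 10."""
    bleu_mean = sum(bleu) / len(bleu)
    return (bleu_mean + rouge_l + cider / 10) / 3


def final_score(chatgpt, language, match, accuracy):
    """Weighted sum of the four components, all already on a [0, 1] scale."""
    return 0.4 * chatgpt + 0.2 * language + 0.2 * match + 0.2 * accuracy


# Values for this model, copied from the results table.
lang = language_score([0.719, 0.654, 0.589, 0.525], rouge_l=0.710, cider=0.222)
final = final_score(chatgpt=0.676, language=lang, match=0.304, accuracy=0.836)
print(round(lang, 3), round(final, 3))  # 0.451 0.589
```

The same two functions reproduce the Lang and Final columns for the other rows of the comparison table to within rounding.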
Note on evaluation scope. All scores are measured on the custom i.i.d. local test split (3,340 QA pairs) using the local DriveLM evaluation script. The official DriveLM evaluation server was not used as the primary reporting source: it returns only aggregate scores without per-metric breakdown, the infrastructure was intermittently unreliable (model output parsing errors could not be diagnosed directly — only via GitHub issues), and ChatGPT/Match metrics depend on OpenAI API calls that fail when the API quota for the billing period is exhausted.
Full comparison (all models, online inference):
| Model | Final | Acc | ChatGPT | Lang | B1 | B2 | B3 | B4 | RL | CIDEr | Match | Coord |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mini-DA† | 0.606 | 0.898 | 0.668 | 0.416 | 0.596 | 0.564 | 0.533 | 0.503 | 0.651 | 0.470 | 0.381 | 0.000 |
| LoRA-25k | 0.560 | 0.826 | 0.589 | 0.459 | 0.732 | 0.668 | 0.606 | 0.547 | 0.714 | 0.230 | 0.338 | 0.015 |
| LoRA-25k + DL-PL 10% (this model) | 0.589 | 0.836 | 0.676 | 0.451 | 0.719 | 0.654 | 0.589 | 0.525 | 0.710 | 0.222 | 0.304 | 0.013 |
| LoRA-25k + DL-PL 30% | 0.548 | 0.832 | 0.605 | 0.434 | 0.695 | 0.625 | 0.554 | 0.483 | 0.692 | 0.201 | 0.264 | 0.008 |
| LoRA-25k + DL-PL 50% | 0.532 | 0.832 | 0.584 | 0.433 | 0.691 | 0.622 | 0.551 | 0.481 | 0.699 | 0.171 | 0.230 | 0.007 |
| LoRA-25k + DL-PL 100% | 0.511 | 0.805 | 0.544 | 0.430 | 0.687 | 0.616 | 0.543 | 0.470 | 0.695 | 0.165 | 0.232 | 0.007 |
| LoRA-300k | 0.493 | 0.339 | 0.706 | 0.412 | 0.607 | 0.552 | 0.501 | 0.452 | 0.676 | 0.323 | 0.303 | 0.006 |
† Mini-DA = OpenGVLab/Mini-InternVL2-2B-DA-DriveLM, fine-tuned by OpenGVLab on DriveLM. DL-PL 30%, 50%, 100% variants not publicly released (lower performance without quality filtering).
Training data
Combined training data (35,694 QA pairs total):
1. Custom DriveLM-nuScenes split — 25,825 QA pairs
A scene-level 80/10/10 train/val/test repartition of DriveLM-nuScenes constructed to be distributionally aligned with the test question-type distribution. The official DriveLM split is non-i.i.d. due to QA-template distributional mismatch (~65% of training templates absent from test), causing models trained on it to underperform. See github.com/dmitrykhursen/VQA-AD-CTU for the full split construction methodology.
2. DriveLM pseudo-labels — 10% (~9,900 pairs)
VQA pairs generated by the thesis pipeline from nuScenes sensor priors: 2D/3D bounding boxes, LiDAR-derived per-object distances, and object tracking trajectories, orchestrated by Qwen3 with chain-of-thought reasoning. The full pseudo-label corpus (~100k pairs) is at dkhursen/drivelm-pseudo-labels.
Of the tested mixing ratios (10%, 30%, 50%, 100%), 10% performs best: higher ratios degrade performance without quality filtering (30%: 0.548, 50%: 0.532, 100%: 0.511).
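The two data sources described above can be combined roughly as in the sketch below. It is an illustration only, under stated assumptions: the `scene_id` field and both function names are hypothetical, the split is a plain random scene-level 80/10/10 partition (the question-type alignment used in the thesis is not reproduced), and the mixing percentage is read as a uniformly sampled fraction of the pseudo-label corpus.

```python
import random
from collections import defaultdict


def scene_level_split(qa_pairs, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split QA pairs so that all pairs from one scene land in the same subset."""
    by_scene = defaultdict(list)
    for qa in qa_pairs:
        by_scene[qa["scene_id"]].append(qa)

    scenes = sorted(by_scene)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(scenes)

    n_train = int(ratios[0] * len(scenes))
    n_val = int(ratios[1] * len(scenes))
    parts = {
        "train": scenes[:n_train],
        "val": scenes[n_train:n_train + n_val],
        "test": scenes[n_train + n_val:],
    }
    return {name: [qa for s in ids for qa in by_scene[s]] for name, ids in parts.items()}


def mix_pseudo_labels(train_pairs, pseudo_pairs, ratio, seed=0):
    """Return train_pairs plus a random `ratio` fraction of pseudo_pairs."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")
    k = round(ratio * len(pseudo_pairs))
    return list(train_pairs) + random.Random(seed).sample(pseudo_pairs, k)


# Toy placeholder records (not DriveLM data): 20 scenes with 3 QA pairs each.
toy = [{"scene_id": f"scene-{i:02d}", "question": f"q{j}"} for i in range(20) for j in range(3)]
pseudo = [{"scene_id": "pseudo", "question": f"pl{i}"} for i in range(50)]

split = scene_level_split(toy)
train = mix_pseudo_labels(split["train"], pseudo, ratio=0.10)
print({name: len(items) for name, items in split.items()}, len(train))
```

Splitting by scene rather than by QA pair keeps frames from one drive out of both train and test, and fixed seeds make the partition and the sampled pseudo-labels repeatable.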
Training configuration
| Parameter | Value |
|---|---|
| Base model | OpenGVLab/InternVL2-2B |
| Fine-tuning method | LoRA (merged into weights in this checkpoint) |
| LoRA rank | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| LoRA target modules | all linear layers in the LLM (vision backbone and MLP projector frozen) |
| Learning rate | 4e-5 (cosine schedule, warmup ratio 0.03) |
| Effective batch size | 64 (8 GPUs × 4 per GPU × 2 gradient-accumulation steps) |
| Epochs | 10 |
| Max sequence length | 8192 |
| Precision | bfloat16 |
| Hardware | 8 × NVIDIA A100 40 GB |
Full configuration: configs/finetune/internvl2_2b_lora.yaml
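The derived quantities implied by the table can be worked out as in the short sketch below. It is not the training script. It assumes that "4" is the per-GPU batch size, that "warmup 0.03" is a warmup ratio over total optimizer steps, that each QA pair is one training sample with the last partial batch kept, and that the schedule is linear warmup followed by cosine decay to zero. The helper `lr_at` is a hypothetical name.

```python
import math

# Values copied from the training configuration table.
num_gpus, per_gpu_batch, grad_accum = 8, 4, 2
lora_rank, lora_alpha = 16, 32
base_lr, warmup_ratio, epochs = 4e-5, 0.03, 10
train_pairs = 35_694

effective_batch = num_gpus * per_gpu_batch * grad_accum
lora_scaling = lora_alpha / lora_rank  # scale applied to the low-rank update in standard LoRA

# Assumption: one QA pair per sample, last partial batch kept.
steps_per_epoch = math.ceil(train_pairs / effective_batch)
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(warmup_ratio * total_steps)


def lr_at(step):
    """Linear warmup to base_lr, then cosine decay to zero (assumed schedule shape)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


print(effective_batch, lora_scaling, steps_per_epoch, total_steps, warmup_steps)
```

Under these assumptions the run amounts to a few thousand optimizer steps; the actual step count depends on how the training code batches or packs samples.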
How to use
import torch
from PIL import Image
from transformers import AutoTokenizer, AutoModel
import torchvision.transforms as T
from torchvision.transforms.functional import InterpolationMode

MODEL_ID = "dkhursen/InternVL2-2b-LoRA-25k_plus_DL-PL-10pct"
IMAGE_PATH = "path/to/stitched_6camera.jpg"
QUESTION = "<image>\nWhat are the important objects in the current scene? Those objects will be considered for the future reasoning and driving decision."

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def build_transform(input_size):
    return T.Compose([
        T.Lambda(lambda img: img.convert("RGB") if img.mode != "RGB" else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
    ])


def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float("inf")
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        ratio_diff = abs(aspect_ratio - ratio[0] / ratio[1])
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio


def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height
    target_ratios = sorted(
        {(i, j)
         for n in range(min_num, max_num + 1)
         for i in range(1, n + 1)
         for j in range(1, n + 1)
         if min_num <= i * j <= max_num},
        key=lambda x: x[0] * x[1],
    )
    best_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size
    )
    target_w = image_size * best_ratio[0]
    target_h = image_size * best_ratio[1]
    cols = best_ratio[0]
    resized = image.resize((target_w, target_h))
    tiles = []
    for i in range(best_ratio[0] * best_ratio[1]):
        col = i % cols
        row = i // cols
        box = (col * image_size, row * image_size,
               (col + 1) * image_size, (row + 1) * image_size)
        tiles.append(resized.crop(box))
    if use_thumbnail and len(tiles) != 1:
        tiles.append(image.resize((image_size, image_size)))
    return tiles


def load_image(image_path, input_size=448, max_num=12):
    image = Image.open(image_path).convert("RGB")
    transform = build_transform(input_size)
    tiles = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    return torch.stack([transform(tile) for tile in tiles])


tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True, use_fast=False)
model = AutoModel.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval().cuda()

pixel_values = load_image(IMAGE_PATH, max_num=12).to(torch.bfloat16).cuda()
generation_config = {"max_new_tokens": 512, "do_sample": False}
response = model.chat(tokenizer, pixel_values, QUESTION, generation_config)
print("Answer:", response)
Batch inference and evaluation (requires cloning the repo):
git clone https://github.com/dmitrykhursen/VQA-AD-CTU
cd VQA-AD-CTU
bash scripts/setup_env.sh && source vqa-ad-ctu-env/bin/activate
# Run inference (edit MODEL= inside the script)
sbatch scripts/05_inference.sh
# Evaluate
bash scripts/06_evaluate.sh inference/outputs/<MODEL>/local_test.json
Example predictions
Input: stitched 6-camera panoramic view (CAM_FRONT, CAM_FRONT_LEFT, CAM_FRONT_RIGHT, CAM_BACK, CAM_BACK_LEFT, CAM_BACK_RIGHT) from a single nuScenes keyframe.
| Category | Question | Ground Truth | This model | Pretrained InternVL2-2B |
|---|---|---|---|---|
| perception | What is the moving status of <c1,CAM_FRONT,441.7,485.0>? A. Turn right. B. Going ahead. C. Stopped. D. Drive backward. | B | B | C. Stopped. |
| prediction | Is <c1,CAM_FRONT,441.7,485.0> a traffic sign or a road barrier? | No. | No. | To determine if the coordinates <c1,CAM_FRONT,441.7,485.0> represent a traffic sign or a road barrier, let's analyze the image closely. The coordinates are given in the format (x, y, z) … Therefore, the coordinates <c1,CAM_FRONT,441.7,485.0> do not represent a traffic sign or road barrier. |
| planning | What actions could the ego vehicle take based on <c1,CAM_FRONT,441.7,485.0>? | The action is to brake gently to a stop because there is no safety issue, with a high probability. | The action is to keep going at the same speed. The reason is that there is no safety issue. The probability is high. | The ego vehicle, indicated by the camera's position (C1, CAM_FRONT, 441.7, 485.0), is positioned at the intersection of a two-way street. Based on this position, the ego vehicle could take several actions: … In summary, the ego vehicle has a high probability of continuing straight through the intersection, making a left turn, or continuing straight through the intersection. |
| behavior | Predict the behavior of the ego vehicle. A. Going straight, slowly. B. Steering left, very fast. C. Steering right, slowly. D. Steering left, normal speed. | A | A | C. Steering right, slowly. |
Browse 25 sampled scenes: dmitrykhursen.github.io/VQA-AD-CTU
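The object tags in the example questions can be pulled out of a string with a small parser, and a coordinate-level F1 in the spirit of the Coord metric can be computed from the extracted points. The sketch below is an assumption-laden illustration, not the official evaluation code: the tag layout is read as object id, camera name, and 2D pixel coordinates, the names `parse_tags` and `coord_f1` are hypothetical, and the greedy one-to-one matching is one possible reading of "F1 at L1 < 16 px".

```python
import re

# Matches tags such as <c1,CAM_FRONT,441.7,485.0>.
TAG_RE = re.compile(r"<(c\d+),(CAM_[A-Z_]+),([\d.]+),([\d.]+)>")


def parse_tags(text):
    """Return a list of (object_id, camera, x, y) tuples found in text."""
    return [(oid, cam, float(x), float(y)) for oid, cam, x, y in TAG_RE.findall(text)]


def coord_f1(pred_xy, gt_xy, threshold=16.0):
    """F1 over points; a prediction is a hit if its L1 distance to a
    not-yet-matched ground-truth point is below threshold (greedy matching)."""
    unmatched = list(gt_xy)
    hits = 0
    for px, py in pred_xy:
        for gt in unmatched:
            if abs(px - gt[0]) + abs(py - gt[1]) < threshold:
                unmatched.remove(gt)
                hits += 1
                break
    if hits == 0:
        return 0.0
    precision = hits / len(pred_xy)
    recall = hits / len(gt_xy)
    return 2 * precision * recall / (precision + recall)


question = "What is the moving status of <c1,CAM_FRONT,441.7,485.0>?"
tags = parse_tags(question)
print(tags)  # [('c1', 'CAM_FRONT', 441.7, 485.0)]

# One prediction lands within 16 px of the ground truth, one does not.
f1 = coord_f1([(445.0, 480.0), (900.0, 100.0)], [(441.7, 485.0)])
print(round(f1, 3))  # 0.667
```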
Intended use and limitations
Intended use:
- Autonomous driving VQA scene understanding
- Benchmarking on DriveLM
- Starting point for further fine-tuning
Limitations:
- Visual object localization is the dominant bottleneck: providing ground-truth annotations at test time raises Final Score from 0.589 to 0.775 — near-zero Coord (0.013) confirms the model struggles with precise spatial grounding
- No quality filtering was applied to pseudo-labels — every generated QA pair is used at face value; this artificially caps the usable augmentation ratio, as the degrading performance at 30–100% mixing confirms
- The pseudo-label generation templates have an ego-centric forward camera bias, disproportionately referencing objects in the forward field of view; this contributes to the performance drop at higher mixing ratios
- Not intended for deployment in real autonomous driving systems
- Trained and evaluated on nuScenes scenes from Boston and Singapore only
Resources
| Resource | Link |
|---|---|
| Code & scripts | github.com/dmitrykhursen/VQA-AD-CTU |
| Demo gallery | dmitrykhursen.github.io/VQA-AD-CTU |
| Pseudo-label dataset | dkhursen/drivelm-pseudo-labels |
| Base model | OpenGVLab/InternVL2-2B |
| DriveLM benchmark | github.com/OpenDriveLab/DriveLM |
| nuScenes dataset | nuscenes.org |
Citation
@mastersthesis{khursenko2026vqa,
author = {Khursenko, Dmytro},
title = {Visual Question Answering for Autonomous Driving},
school = {Czech Technical University in Prague, Faculty of Electrical Engineering},
year = {2026},
supervisor = {Hurych, David and Tolias, Georgios}
}
@inproceedings{sima2024drivelm,
title = {DriveLM: Driving with Graph Visual Question Answering},
author = {Sima, Chonghao and Renz, Katrin and Chitta, Kashyap and others},
booktitle = {European Conference on Computer Vision (ECCV)},
year = {2024}
}