InternVL2-2B — LoRA fine-tuned on DriveLM (25k + 10% Pseudo-Labels) ⭐ Best model

Part of the Master's thesis:

Visual Question Answering for Autonomous Driving. Dmytro Khursenko, Czech Technical University in Prague, Faculty of Electrical Engineering, 2026. Supervised by Ing. David Hurych, Ph.D. (Valeo) and doc. Georgios Tolias, Ph.D. (CTU FEE).

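GitHub: github.com/dmitrykhursen/VQA-AD-CTU · Demo: dmitrykhursen.github.io/VQA-AD-CTU · Pseudo-label dataset: dkhursen/drivelm-pseudo-labels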

📄 Thesis PDF expected by end of June 2026, following successful defense (CTU FEE).


This is the best-performing online checkpoint from the thesis: InternVL2-2B fine-tuned with LoRA on the custom 25k i.i.d. DriveLM split, augmented with 10% of the Qwen3-generated pseudo-labels (~9,900 additional pairs) for a total of 35,694 QA pairs. Its Final score of 0.589 is the best among the thesis models that do not use oracle visual annotations at test time; the third-party Mini-DA baseline reaches 0.606 in the same comparison.

This checkpoint contains fully merged weights — the LoRA adapter is already merged into the base model. Load and use it directly with AutoModel.


All released models

| Model | HF repo | Final |
|---|---|---|
| LoRA-25k | dkhursen/InternVL2-2b-LoRA-25k-drivelm | 0.560 |
| LoRA-300k | dkhursen/InternVL2-2b-LoRA-300k-drivelm | 0.493 |
| LoRA-25k + DL-PL 10% (this model) ⭐ | dkhursen/InternVL2-2b-LoRA-25k_plus_DL-PL-10pct | 0.589 |
| LoRA-25k + Oracle annotation | dkhursen/InternVL2-2b-LoRA-25k-drivelm-offline-redcircle-ctag-bkgd | 0.775 |

Pseudo-label dataset: dkhursen/drivelm-pseudo-labels


Results on custom test split

Evaluated on the custom i.i.d. DriveLM-nuScenes test split (3,340 QA pairs).

| Metric | Score |
|---|---|
| Final | 0.589 |
| Accuracy | 0.836 |
| ChatGPT | 0.676 |
| Language | 0.451 |
| BLEU-1 | 0.719 |
| BLEU-2 | 0.654 |
| BLEU-3 | 0.589 |
| BLEU-4 | 0.525 |
| ROUGE-L | 0.710 |
| CIDEr | 0.222 |
| Match | 0.304 |
| Coord | 0.013 |

Score definitions (all table values normalised to [0, 1])

  • Accuracy — exact-match on MCQ (A/B/C/D) and Yes/No questions; strict letter-only format required
  • Language — mean of BLEU-1–4, ROUGE-L, and CIDEr/10
  • ChatGPT — GPT-3.5-turbo semantic similarity judge (0–100 scale, ÷100 in Final)
  • Match = (F1_coord × 100 + GPT_match) / 2, computed on prediction answers (0–100 scale, ÷100 in Final); blends spatial coordinate F1 with a GPT judge
  • Coord — pure coordinate token F1 at L1 < 16 px; diagnostic only, not included in Final
  • Final = 0.4 × (GPT/100) + 0.2 × Language + 0.2 × (Match/100) + 0.2 × Accuracy
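
As a quick check of the Final formula above, the following minimal sketch recomputes the headline score from the table values. The function name `final_score` is illustrative and not part of the repository's evaluation script; the inputs are assumed to be already normalised to [0, 1], as in the results table.

```python
def final_score(chatgpt, language, match, accuracy):
    """Weighted Final score; all inputs are already normalised to [0, 1]."""
    return 0.4 * chatgpt + 0.2 * language + 0.2 * match + 0.2 * accuracy


# Values from the results table for this checkpoint.
score = final_score(chatgpt=0.676, language=0.451, match=0.304, accuracy=0.836)
print(round(score, 3))  # 0.589
```

The same weights reproduce the LoRA-25k row (0.560) from its ChatGPT, Language, Match, and Accuracy columns.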

Full metric breakdown: evaluation/README.md
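
To make the Coord definition more concrete, here is a rough sketch of a coordinate F1 with an L1 threshold of 16 px. It is not the evaluation script from the repository: the tag regex, the greedy one-to-one matching, and the names `extract_points` and `coord_f1` are assumptions made for illustration.

```python
import re

# Matches tags such as <c1,CAM_FRONT,441.7,485.0> and captures x and y.
TAG = re.compile(r"<[^,<>]+,[^,<>]+,(\d+(?:\.\d+)?),(\d+(?:\.\d+)?)>")


def extract_points(text):
    """Return the (x, y) pixel coordinates of every object tag in the text."""
    return [(float(x), float(y)) for x, y in TAG.findall(text)]


def coord_f1(prediction, reference, threshold=16.0):
    """F1 over coordinates; a prediction matches if its L1 distance < threshold."""
    pred = extract_points(prediction)
    gold = extract_points(reference)
    if not pred or not gold:
        return 0.0
    unmatched = list(gold)  # assumption: each reference point is matched at most once
    true_positives = 0
    for px, py in pred:
        for g in unmatched:
            if abs(px - g[0]) + abs(py - g[1]) < threshold:
                true_positives += 1
                unmatched.remove(g)
                break
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)
```

For example, a prediction that localises one of two reference objects within the threshold gets precision 1.0 and recall 0.5.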

Note on evaluation scope. All scores are measured on the custom i.i.d. local test split (3,340 QA pairs) using the local DriveLM evaluation script. The official DriveLM evaluation server was not used as the primary reporting source for three reasons: it returns only aggregate scores without a per-metric breakdown; its infrastructure was intermittently unreliable, and model output parsing errors could only be diagnosed via GitHub issues; and the ChatGPT and Match metrics depend on OpenAI API calls that fail once the API quota for the billing period is exhausted.

Full comparison (all models, online inference):

| Model | Final | Acc | ChatGPT | Lang | B1 | B2 | B3 | B4 | RL | CIDEr | Match | Coord |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mini-DA† | 0.606 | 0.898 | 0.668 | 0.416 | 0.596 | 0.564 | 0.533 | 0.503 | 0.651 | 0.470 | 0.381 | 0.000 |
| LoRA-25k | 0.560 | 0.826 | 0.589 | 0.459 | 0.732 | 0.668 | 0.606 | 0.547 | 0.714 | 0.230 | 0.338 | 0.015 |
| LoRA-25k + DL-PL 10% (this model) | 0.589 | 0.836 | 0.676 | 0.451 | 0.719 | 0.654 | 0.589 | 0.525 | 0.710 | 0.222 | 0.304 | 0.013 |
| LoRA-25k + DL-PL 30% | 0.548 | 0.832 | 0.605 | 0.434 | 0.695 | 0.625 | 0.554 | 0.483 | 0.692 | 0.201 | 0.264 | 0.008 |
| LoRA-25k + DL-PL 50% | 0.532 | 0.832 | 0.584 | 0.433 | 0.691 | 0.622 | 0.551 | 0.481 | 0.699 | 0.171 | 0.230 | 0.007 |
| LoRA-25k + DL-PL 100% | 0.511 | 0.805 | 0.544 | 0.430 | 0.687 | 0.616 | 0.543 | 0.470 | 0.695 | 0.165 | 0.232 | 0.007 |
| LoRA-300k | 0.493 | 0.339 | 0.706 | 0.412 | 0.607 | 0.552 | 0.501 | 0.452 | 0.676 | 0.323 | 0.303 | 0.006 |

† Mini-DA = OpenGVLab/Mini-InternVL2-2B-DA-DriveLM, fine-tuned by OpenGVLab on DriveLM. The DL-PL 30%, 50%, and 100% variants are not publicly released because they perform worse (no quality filtering was applied to the pseudo-labels).


Training data

Combined training data (35,694 QA pairs total):

1. Custom DriveLM-nuScenes split — 25,825 QA pairs

A scene-level 80/10/10 train/val/test repartition of DriveLM-nuScenes, constructed so that the training question-type distribution matches that of the test set. The official DriveLM split is non-i.i.d. because of a QA-template mismatch (~65% of training templates are absent from test), which causes models trained on it to underperform. See github.com/dmitrykhursen/VQA-AD-CTU for the full split construction methodology.

2. DriveLM pseudo-labels — 10% (~9,900 pairs)

VQA pairs generated by the thesis pipeline from nuScenes sensor priors: 2D/3D bounding boxes, LiDAR-derived per-object distances, and object tracking trajectories, orchestrated by Qwen3 with chain-of-thought reasoning. The full pseudo-label corpus (~100k pairs) is at dkhursen/drivelm-pseudo-labels.

Among the tested mixing ratios, 10% performs best (Final 0.589, versus 0.560 without pseudo-labels). Higher ratios degrade performance in the absence of quality filtering (30%: 0.548, 50%: 0.532, 100%: 0.511).
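
The mixing step described above can be sketched as follows. This is a minimal illustration, not the thesis pipeline: it assumes uniform random sampling from the pseudo-label corpus, and the function name `mix_pseudo_labels` and the list-of-dicts data layout are hypothetical. With the numbers in this card, 35,694 − 25,825 = 9,869 pseudo-label pairs are added at the 10% ratio.

```python
import random


def mix_pseudo_labels(base_pairs, pseudo_pairs, fraction, seed=0):
    """Add a random `fraction` of the pseudo-label corpus to the base training set."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed so the sampled subset is reproducible
    num_sampled = round(len(pseudo_pairs) * fraction)
    sampled = rng.sample(pseudo_pairs, num_sampled)  # assumption: uniform sampling
    mixed = list(base_pairs) + sampled
    rng.shuffle(mixed)
    return mixed


# Toy data standing in for the real QA pairs.
base = [{"question": f"base-{i}"} for i in range(100)]
pseudo = [{"question": f"pseudo-{i}"} for i in range(50)]
mixed = mix_pseudo_labels(base, pseudo, fraction=0.10)
print(len(mixed))  # 105
```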


Training configuration

| Parameter | Value |
|---|---|
| Base model | OpenGVLab/InternVL2-2B |
| Fine-tuning method | LoRA (merged into weights in this checkpoint) |
| LoRA rank | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| LoRA target modules | all linear layers in the LLM (vision backbone and MLP projector frozen) |
| Learning rate | 4e-5 (cosine schedule, warmup ratio 0.03) |
| Effective batch size | 64 (8 GPUs × 4 per GPU × grad-acc 2) |
| Epochs | 10 |
| Max sequence length | 8192 |
| Precision | bfloat16 |
| Hardware | 8 × NVIDIA A100 40 GB |

Full configuration: configs/finetune/internvl2_2b_lora.yaml
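
The arithmetic implied by the configuration table can be worked through in a few lines. This is an approximate sketch under stated assumptions: one QA pair per training sample, the last partial batch of each epoch kept, the warmup value 0.03 read as a fraction of total optimizer steps, and a linear warmup followed by cosine decay to zero. The actual trainer may count steps differently.

```python
import math

# Values taken from the configuration table.
NUM_GPUS, PER_GPU_BATCH, GRAD_ACCUM = 8, 4, 2
EPOCHS, BASE_LR, WARMUP_RATIO = 10, 4e-5, 0.03
LORA_RANK, LORA_ALPHA = 16, 32
TRAIN_PAIRS = 35_694  # assumption: one QA pair per training sample

effective_batch = NUM_GPUS * PER_GPU_BATCH * GRAD_ACCUM  # 64
lora_scaling = LORA_ALPHA / LORA_RANK  # 2.0, the standard LoRA factor alpha / r

# Assumption: the last partial batch of each epoch is kept.
steps_per_epoch = math.ceil(TRAIN_PAIRS / effective_batch)  # 558
total_steps = steps_per_epoch * EPOCHS  # 5580
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)  # 168


def lr_at(step):
    """Linear warmup followed by cosine decay to zero (assumed schedule shape)."""
    if step < warmup_steps:
        return BASE_LR * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(effective_batch, lora_scaling, steps_per_epoch, total_steps, warmup_steps)
```

Under these assumptions the learning rate peaks at 4e-5 after roughly 168 optimizer steps and decays over the remaining steps.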


How to use

import torch
from PIL import Image
from transformers import AutoTokenizer, AutoModel
import torchvision.transforms as T
from torchvision.transforms.functional import InterpolationMode

MODEL_ID   = "dkhursen/InternVL2-2b-LoRA-25k_plus_DL-PL-10pct"
IMAGE_PATH = "path/to/stitched_6camera.jpg"
QUESTION   = "<image>\nWhat are the important objects in the current scene? Those objects will be considered for the future reasoning and driving decision."

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD  = (0.229, 0.224, 0.225)


def build_transform(input_size):
    return T.Compose([
        T.Lambda(lambda img: img.convert("RGB") if img.mode != "RGB" else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
    ])


def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float("inf")
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        ratio_diff = abs(aspect_ratio - ratio[0] / ratio[1])
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio


def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height
    target_ratios = sorted(
        {(i, j)
         for n in range(min_num, max_num + 1)
         for i in range(1, n + 1)
         for j in range(1, n + 1)
         if min_num <= i * j <= max_num},
        key=lambda x: x[0] * x[1],
    )
    best_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size
    )
    target_w = image_size * best_ratio[0]
    target_h = image_size * best_ratio[1]
    cols = best_ratio[0]
    resized = image.resize((target_w, target_h))
    tiles = []
    for i in range(best_ratio[0] * best_ratio[1]):
        col = i % cols
        row = i // cols
        box = (col * image_size, row * image_size,
               (col + 1) * image_size, (row + 1) * image_size)
        tiles.append(resized.crop(box))
    if use_thumbnail and len(tiles) != 1:
        tiles.append(image.resize((image_size, image_size)))
    return tiles


def load_image(image_path, input_size=448, max_num=12):
    image = Image.open(image_path).convert("RGB")
    transform = build_transform(input_size)
    tiles = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    return torch.stack([transform(tile) for tile in tiles])


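# trust_remote_code=True is required: chat() is defined in the model repo's custom code.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True, use_fast=False)
model = AutoModel.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval().cuda()

# Tile the image (up to 12 tiles of 448 × 448 plus a thumbnail) and match the model dtype.
pixel_values = load_image(IMAGE_PATH, max_num=12).to(torch.bfloat16).cuda()
# Greedy decoding, so repeated runs give the same answer.
generation_config = {"max_new_tokens": 512, "do_sample": False}

response = model.chat(tokenizer, pixel_values, QUESTION, generation_config)
print("Answer:", response)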

Batch inference and evaluation (requires cloning the repo):

git clone https://github.com/dmitrykhursen/VQA-AD-CTU
cd VQA-AD-CTU
bash scripts/setup_env.sh && source vqa-ad-ctu-env/bin/activate

# Run inference (edit MODEL= inside the script)
sbatch scripts/05_inference.sh

# Evaluate
bash scripts/06_evaluate.sh inference/outputs/<MODEL>/local_test.json

Example predictions

Input: stitched 6-camera panoramic view (CAM_FRONT, CAM_FRONT_LEFT, CAM_FRONT_RIGHT, CAM_BACK, CAM_BACK_LEFT, CAM_BACK_RIGHT) from a single nuScenes keyframe.

| Category | Question | Ground Truth | This model | Pretrained InternVL2-2B |
|---|---|---|---|---|
| perception | What is the moving status of <c1,CAM_FRONT,441.7,485.0>? A. Turn right. B. Going ahead. C. Stopped. D. Drive backward. | B | B | C. Stopped. |
| prediction | Is <c1,CAM_FRONT,441.7,485.0> a traffic sign or a road barrier? | No. | No. | To determine if the coordinates <c1,CAM_FRONT,441.7,485.0> represent a traffic sign or a road barrier, let's analyze the image closely. The coordinates are given in the format (x, y, z) … Therefore, the coordinates <c1,CAM_FRONT,441.7,485.0> do not represent a traffic sign or road barrier. |
| planning | What actions could the ego vehicle take based on <c1,CAM_FRONT,441.7,485.0>? | The action is to brake gently to a stop because there is no safety issue, with a high probability. | The action is to keep going at the same speed. The reason is that there is no safety issue. The probability is high. | The ego vehicle, indicated by the camera's position (C1, CAM_FRONT, 441.7, 485.0), is positioned at the intersection of a two-way street. Based on this position, the ego vehicle could take several actions: … In summary, the ego vehicle has a high probability of continuing straight through the intersection, making a left turn, or continuing straight through the intersection. |
| behavior | Predict the behavior of the ego vehicle. A. Going straight, slowly. B. Steering left, very fast. C. Steering right, slowly. D. Steering left, normal speed. | A | A | C. Steering right, slowly. |
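
The stitched six-camera input described above could be assembled roughly as follows. This is only a sketch: the 2 × 3 grid order, the tile size, and the function name `stitch_cameras` are assumptions, and the layout actually used for training is defined in the repository. It requires Pillow.

```python
from PIL import Image

# Assumed layout: front cameras on the top row, back cameras on the bottom row.
CAMERA_GRID = [
    ["CAM_FRONT_LEFT", "CAM_FRONT", "CAM_FRONT_RIGHT"],
    ["CAM_BACK_LEFT", "CAM_BACK", "CAM_BACK_RIGHT"],
]


def stitch_cameras(images, tile_size=(448, 252)):
    """Paste six camera images (dict: camera name -> PIL image) into one grid image."""
    tile_w, tile_h = tile_size  # assumed size; keeps the 16:9 ratio of nuScenes frames
    rows, cols = len(CAMERA_GRID), len(CAMERA_GRID[0])
    canvas = Image.new("RGB", (cols * tile_w, rows * tile_h))
    for r, row in enumerate(CAMERA_GRID):
        for c, name in enumerate(row):
            tile = images[name].convert("RGB").resize((tile_w, tile_h))
            canvas.paste(tile, (c * tile_w, r * tile_h))
    return canvas


# Solid-colour placeholders standing in for the six 1600 × 900 camera frames.
colours = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 0), (0, 255, 255), (255, 0, 255)]
names = [name for row in CAMERA_GRID for name in row]
frames = {name: Image.new("RGB", (1600, 900), colour) for name, colour in zip(names, colours)}
stitched = stitch_cameras(frames)
print(stitched.size)  # (1344, 504)
```

The result can be saved with `stitched.save(...)` and passed as `IMAGE_PATH` in the usage example.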

Browse 25 sampled scenes: dmitrykhursen.github.io/VQA-AD-CTU


Intended use and limitations

Intended use:

  • Autonomous driving VQA scene understanding
  • Benchmarking on DriveLM
  • Starting point for further fine-tuning

Limitations:

  • Visual object localization is the dominant bottleneck: the variant given oracle (ground-truth) annotations at test time reaches a Final score of 0.775, versus 0.589 for this model, and the near-zero Coord score (0.013) indicates that the model struggles with precise spatial grounding
  • No quality filtering was applied to pseudo-labels — every generated QA pair is used at face value; this artificially caps the usable augmentation ratio, as the degrading performance at 30–100% mixing confirms
  • The pseudo-label generation templates have an ego-centric forward camera bias, disproportionately referencing objects in the forward field of view; this contributes to the performance drop at higher mixing ratios
  • Not intended for deployment in real autonomous driving systems
  • Trained and evaluated on nuScenes scenes from Boston and Singapore only

Resources

| Resource | Link |
|---|---|
| Code & scripts | github.com/dmitrykhursen/VQA-AD-CTU |
| Demo gallery | dmitrykhursen.github.io/VQA-AD-CTU |
| Pseudo-label dataset | dkhursen/drivelm-pseudo-labels |
| Base model | OpenGVLab/InternVL2-2B |
| DriveLM benchmark | github.com/OpenDriveLab/DriveLM |
| nuScenes dataset | nuscenes.org |

Citation

@mastersthesis{khursenko2026vqa,
  author     = {Khursenko, Dmytro},
  title      = {Visual Question Answering for Autonomous Driving},
  school     = {Czech Technical University in Prague, Faculty of Electrical Engineering},
  year       = {2026},
  supervisor = {Hurych, David and Tolias, Georgios}
}
@inproceedings{sima2024drivelm,
  title     = {DriveLM: Driving with Graph Visual Question Answering},
  author    = {Sima, Chonghao and Renz, Katrin and Chitta, Kashyap and others},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2024}
}