InternVL2-2B — LoRA fine-tuned on DriveLM (25k + 10% Pseudo-Labels) ⭐ Best model

Part of the Master's thesis:

Visual Question Answering for Autonomous Driving
Dmytro Khursenko, Czech Technical University in Prague, Faculty of Electrical Engineering, 2026. Supervised by Ing. David Hurych, Ph.D. (Valeo) and doc. Georgios Tolias, Ph.D. (CTU FEE).

GitHub · Demo · Pseudo-label dataset

📄 Thesis PDF expected by end of June 2026, following successful defense (CTU FEE).

This is the best-performing online checkpoint from the thesis: InternVL2-2B fine-tuned with LoRA on the custom 25k i.i.d. DriveLM split augmented with 10% Qwen3-generated pseudo-labels (~9,900 additional pairs), totalling 35,694 QA pairs. Its Final score of 0.589 is the best among the models fine-tuned in the thesis without oracle visual annotations (the external Mini-DA baseline reaches 0.606 on the same split).

This checkpoint contains fully merged weights — the LoRA adapter is already merged into the base model. Load and use it directly with AutoModel.

All released models

Model	HF repo	Final
LoRA-25k	dkhursen/InternVL2-2b-LoRA-25k-drivelm	0.560
LoRA-300k	dkhursen/InternVL2-2b-LoRA-300k-drivelm	0.493
LoRA-25k + DL-PL 10% (this model) ⭐	dkhursen/InternVL2-2b-LoRA-25k_plus_DL-PL-10pct	0.589
LoRA-25k + Oracle annotation	dkhursen/InternVL2-2b-LoRA-25k-drivelm-offline-redcircle-ctag-bkgd	0.775

Pseudo-label dataset: dkhursen/drivelm-pseudo-labels

Results on custom test split

Evaluated on the custom i.i.d. DriveLM-nuScenes test split (3,340 QA pairs).

Metric	Score
Final	0.589
Accuracy	0.836
ChatGPT	0.676
Language	0.451
BLEU-1	0.719
BLEU-2	0.654
BLEU-3	0.589
BLEU-4	0.525
ROUGE-L	0.710
CIDEr	0.222
Match	0.304
Coord	0.013

Score definitions (all table values normalised to [0, 1])

Accuracy — exact-match on MCQ (A/B/C/D) and Yes/No questions; strict letter-only format required

Language — mean of BLEU-1–4, ROUGE-L, and CIDEr/10

ChatGPT — GPT-3.5-turbo semantic similarity judge (0–100 scale, ÷100 in Final)

Match — (F1_coord × 100 + GPT_match) / 2 on prediction answers (0–100 scale, ÷100 in Final); blends spatial coordinate F1 with a GPT judge

Coord — pure coordinate token F1 at L1 < 16 px; diagnostic only, not included in Final

Final — 0.4 × (GPT/100) + 0.2 × Language + 0.2 × (Match/100) + 0.2 × Accuracy
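As a quick check of the Final formula, the sketch below recomputes this model's score from the per-metric values in the results table. It assumes the inputs are already normalised to [0, 1], as the table values are; the function name `final_score` is illustrative and not taken from the evaluation script.

```python
def final_score(chatgpt, language, match, accuracy):
    """Weighted Final score; all inputs are assumed normalised to [0, 1]."""
    return 0.4 * chatgpt + 0.2 * language + 0.2 * match + 0.2 * accuracy


# Values for this model, copied from the results table.
score = final_score(chatgpt=0.676, language=0.451, match=0.304, accuracy=0.836)
print(f"{score:.3f}")  # prints 0.589
```

Because ChatGPT carries twice the weight of any other component, a gain in the semantic-similarity judge moves Final more than an equal gain in Accuracy, Language, or Match.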

Full metric breakdown: evaluation/README.md
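A minimal sketch of a coordinate F1 in the spirit of the Coord definition above. It assumes that object references follow the `<c1,CAM_FRONT,441.7,485.0>` pattern seen in the example questions, that a predicted point counts as correct when its L1 (Manhattan) distance to a not-yet-matched ground-truth point is below 16 px, and that matching is greedy and one-to-one. The helper names are hypothetical, and the local evaluation script may differ in detail.

```python
import re

# Assumed tag pattern: <id,camera,x,y>
COORD_RE = re.compile(r"<[^,<>]+,[^,<>]+,(\d+(?:\.\d+)?),(\d+(?:\.\d+)?)>")


def extract_coords(text):
    """Return every (x, y) pair found in object tags inside `text`."""
    return [(float(x), float(y)) for x, y in COORD_RE.findall(text)]


def coord_f1(pred_text, gt_text, threshold=16.0):
    """F1 over coordinate tags, with greedy one-to-one L1 matching (assumption)."""
    preds = extract_coords(pred_text)
    gts = extract_coords(gt_text)
    if not preds or not gts:
        return 0.0
    unmatched = list(gts)
    true_positives = 0
    for px, py in preds:
        for gt in unmatched:
            if abs(px - gt[0]) + abs(py - gt[1]) < threshold:
                true_positives += 1
                unmatched.remove(gt)  # each ground-truth point is matched once
                break
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(preds)
    recall = true_positives / len(gts)
    return 2 * precision * recall / (precision + recall)


gt = "<c1,CAM_FRONT,441.7,485.0> and <c2,CAM_BACK,100.0,200.0>"
pred = "<c1,CAM_FRONT,445.0,480.0>"
f1 = coord_f1(pred, gt)  # one of two ground-truth points recovered
```

In this toy case precision is 1.0 and recall is 0.5, so the F1 is about 0.67. A score near zero, as reported for Coord, means that almost no predicted point lands within the threshold of a ground-truth point.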

Note on evaluation scope. All scores are measured on the custom i.i.d. local test split (3,340 QA pairs) using the local DriveLM evaluation script. The official DriveLM evaluation server was not used as the primary reporting source: it returns only aggregate scores without per-metric breakdown, the infrastructure was intermittently unreliable (model output parsing errors could not be diagnosed directly — only via GitHub issues), and ChatGPT/Match metrics depend on OpenAI API calls that fail when the API quota for the billing period is exhausted.

Full comparison (all models, online inference):

Model	Final	Acc	ChatGPT	Lang	B1	B2	B3	B4	RL	CIDEr	Match	Coord
Mini-DA†	0.606	0.898	0.668	0.416	0.596	0.564	0.533	0.503	0.651	0.470	0.381	0.000
LoRA-25k	0.560	0.826	0.589	0.459	0.732	0.668	0.606	0.547	0.714	0.230	0.338	0.015
LoRA-25k + DL-PL 10% (this model)	0.589	0.836	0.676	0.451	0.719	0.654	0.589	0.525	0.710	0.222	0.304	0.013
LoRA-25k + DL-PL 30%	0.548	0.832	0.605	0.434	0.695	0.625	0.554	0.483	0.692	0.201	0.264	0.008
LoRA-25k + DL-PL 50%	0.532	0.832	0.584	0.433	0.691	0.622	0.551	0.481	0.699	0.171	0.230	0.007
LoRA-25k + DL-PL 100%	0.511	0.805	0.544	0.430	0.687	0.616	0.543	0.470	0.695	0.165	0.232	0.007
LoRA-300k	0.493	0.339	0.706	0.412	0.607	0.552	0.501	0.452	0.676	0.323	0.303	0.006

† Mini-DA = OpenGVLab/Mini-InternVL2-2B-DA-DriveLM, fine-tuned by OpenGVLab on DriveLM. DL-PL 30%, 50%, 100% variants not publicly released (lower performance without quality filtering).

Training data

Combined training data (35,694 QA pairs total):

1. Custom DriveLM-nuScenes split — 25,825 QA pairs

A scene-level 80/10/10 train/val/test repartition of DriveLM-nuScenes, constructed so that the training question-type distribution matches that of the test set. The official DriveLM split is non-i.i.d.: its QA templates are mismatched between train and test (~65% of training templates are absent from test), so models trained on it underperform. See github.com/dmitrykhursen/VQA-AD-CTU for the full split construction methodology.

2. DriveLM pseudo-labels — 10% (~9,900 pairs)

VQA pairs generated by the thesis pipeline from nuScenes sensor priors: 2D/3D bounding boxes, LiDAR-derived per-object distances, and object tracking trajectories, orchestrated by Qwen3 with chain-of-thought reasoning. The full pseudo-label corpus (~100k pairs) is at dkhursen/drivelm-pseudo-labels.

10% is the best of the tested mixing ratios (Final 0.589, versus 0.560 with no pseudo-labels). Without quality filtering, higher ratios degrade performance (30%: 0.548, 50%: 0.532, 100%: 0.511).
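The two ingredients described in this section, a scene-level split and a fixed-ratio pseudo-label mix, can be sketched as follows. This is an illustration under stated assumptions, not the thesis pipeline: the `scene_id` field, the function names, and the toy data are hypothetical, the mixing ratio is taken as a fraction of the pseudo-label corpus, and the distribution-alignment step of the real split is omitted.

```python
import random


def split_by_scene(qa_pairs, seed=0, train=0.8, val=0.1):
    """Partition QA pairs so that every scene lands in exactly one split."""
    scenes = sorted({qa["scene_id"] for qa in qa_pairs})
    random.Random(seed).shuffle(scenes)
    n_train = int(train * len(scenes))
    n_val = int(val * len(scenes))
    bucket = {}
    for idx, scene in enumerate(scenes):
        if idx < n_train:
            bucket[scene] = "train"
        elif idx < n_train + n_val:
            bucket[scene] = "val"
        else:
            bucket[scene] = "test"
    splits = {"train": [], "val": [], "test": []}
    for qa in qa_pairs:
        splits[bucket[qa["scene_id"]]].append(qa)
    return splits


def mix_pseudo_labels(train_pairs, pseudo_pairs, ratio=0.10, seed=0):
    """Append a random `ratio` fraction of the pseudo-label corpus (assumption)."""
    k = round(ratio * len(pseudo_pairs))
    sampled = random.Random(seed).sample(pseudo_pairs, k)
    return train_pairs + sampled


# Toy data: 20 scenes with 5 QA pairs each, plus 50 pseudo-labelled pairs.
real = [{"scene_id": s, "q": f"q{s}-{i}"} for s in range(20) for i in range(5)]
pseudo = [{"scene_id": -1, "q": f"pl{i}"} for i in range(50)]

splits = split_by_scene(real)
mixed = mix_pseudo_labels(splits["train"], pseudo, ratio=0.10)
print(len(splits["train"]), len(splits["val"]), len(splits["test"]), len(mixed))
# prints 80 10 10 85
```

Splitting by scene rather than by individual QA pair keeps frames from the same drive out of both train and test, which would otherwise leak near-duplicate images across the splits.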

Training configuration

Parameter	Value
Base model	OpenGVLab/InternVL2-2B
Fine-tuning method	LoRA (merged into weights in this checkpoint)
LoRA rank	16
LoRA alpha	32
LoRA dropout	0.05
LoRA target modules	all linear layers in the LLM (vision backbone and MLP projector frozen)
Learning rate	4e-5 (cosine schedule, warmup ratio 0.03)
Effective batch size	64 (8 GPUs × 4 per GPU × 2 gradient-accumulation steps)
Epochs	10
Max sequence length	8192
Precision	bfloat16
Hardware	8 × NVIDIA A100 40 GB

Full configuration: configs/finetune/internvl2_2b_lora.yaml
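The quantities implied by the table can be worked out with a few lines of arithmetic. This sketch assumes that each of the 35,694 QA pairs is one training sample, that 0.03 is a warmup ratio over the total number of optimizer steps, and that the LoRA update is scaled by alpha / rank as in the standard LoRA formulation. The step counts are therefore estimates, not values read from the training logs.

```python
import math

NUM_GPUS = 8
PER_GPU_BATCH = 4
GRAD_ACCUM_STEPS = 2
NUM_TRAIN_PAIRS = 35_694  # assumed: one training sample per QA pair
EPOCHS = 10
WARMUP_RATIO = 0.03       # assumed: fraction of total optimizer steps
LORA_RANK, LORA_ALPHA = 16, 32

effective_batch = NUM_GPUS * PER_GPU_BATCH * GRAD_ACCUM_STEPS
steps_per_epoch = math.ceil(NUM_TRAIN_PAIRS / effective_batch)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(WARMUP_RATIO * total_steps)
lora_scale = LORA_ALPHA / LORA_RANK

print(effective_batch, steps_per_epoch, total_steps, warmup_steps, lora_scale)
# prints 64 558 5580 168 2.0
```

Under these assumptions the run takes roughly 5,600 optimizer steps, of which about 3% ramp the learning rate up to 4e-5 before the cosine decay begins.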

How to use

import torch
from PIL import Image
from transformers import AutoTokenizer, AutoModel
import torchvision.transforms as T
from torchvision.transforms.functional import InterpolationMode

MODEL_ID   = "dkhursen/InternVL2-2b-LoRA-25k_plus_DL-PL-10pct"
IMAGE_PATH = "path/to/stitched_6camera.jpg"
QUESTION   = "<image>\nWhat are the important objects in the current scene? Those objects will be considered for the future reasoning and driving decision."

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD  = (0.229, 0.224, 0.225)


def build_transform(input_size):
    # Resize each tile to input_size × input_size and normalise with ImageNet statistics.
    return T.Compose([
        T.Lambda(lambda img: img.convert("RGB") if img.mode != "RGB" else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
    ])


def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    # Pick the (cols, rows) tile grid whose aspect ratio is closest to the image's.
    best_ratio_diff = float("inf")
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        ratio_diff = abs(aspect_ratio - ratio[0] / ratio[1])
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio


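def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    # Split the image into at most max_num square tiles of side image_size;
    # optionally append a thumbnail of the whole image as an extra tile.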
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height
    target_ratios = sorted(
        {(i, j)
         for n in range(min_num, max_num + 1)
         for i in range(1, n + 1)
         for j in range(1, n + 1)
         if min_num <= i * j <= max_num},
        key=lambda x: x[0] * x[1],
    )
    best_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size
    )
    target_w = image_size * best_ratio[0]
    target_h = image_size * best_ratio[1]
    cols = best_ratio[0]
    resized = image.resize((target_w, target_h))
    tiles = []
    for i in range(best_ratio[0] * best_ratio[1]):
        col = i % cols
        row = i // cols
        box = (col * image_size, row * image_size,
               (col + 1) * image_size, (row + 1) * image_size)
        tiles.append(resized.crop(box))
    if use_thumbnail and len(tiles) != 1:
        tiles.append(image.resize((image_size, image_size)))
    return tiles


def load_image(image_path, input_size=448, max_num=12):
    # Returns a float tensor of shape (num_tiles, 3, input_size, input_size).
    image = Image.open(image_path).convert("RGB")
    transform = build_transform(input_size)
    tiles = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    return torch.stack([transform(tile) for tile in tiles])


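# trust_remote_code=True is needed because InternVL2 ships its own modeling code.
# The .cuda() calls below require a CUDA-capable GPU.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True, use_fast=False)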
model = AutoModel.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval().cuda()

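# Tile the stitched image and match the model's dtype and device.
pixel_values = load_image(IMAGE_PATH, max_num=12).to(torch.bfloat16).cuda()
# do_sample=False disables sampling, so the output is deterministic.
generation_config = {"max_new_tokens": 512, "do_sample": False}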

response = model.chat(tokenizer, pixel_values, QUESTION, generation_config)
print("Answer:", response)

Batch inference and evaluation (requires cloning the repo):

git clone https://github.com/dmitrykhursen/VQA-AD-CTU
cd VQA-AD-CTU
bash scripts/setup_env.sh && source vqa-ad-ctu-env/bin/activate

# Run inference as a Slurm batch job (edit MODEL= inside the script)
sbatch scripts/05_inference.sh

# Evaluate
bash scripts/06_evaluate.sh inference/outputs/<MODEL>/local_test.json

Example predictions

Input: stitched 6-camera panoramic view (CAM_FRONT, CAM_FRONT_LEFT, CAM_FRONT_RIGHT, CAM_BACK, CAM_BACK_LEFT, CAM_BACK_RIGHT) from a single nuScenes keyframe.

Category	Question	Ground Truth	This model	Pretrained InternVL2-2B
perception	What is the moving status of `<c1,CAM_FRONT,441.7,485.0>`? A. Turn right. B. Going ahead. C. Stopped. D. Drive backward.	B	B	C. Stopped.
prediction	Is `<c1,CAM_FRONT,441.7,485.0>` a traffic sign or a road barrier?	No.	No.	To determine if the coordinates `<c1,CAM_FRONT,441.7,485.0>` represent a traffic sign or a road barrier, let's analyze the image closely. The coordinates are given in the format (x, y, z) … Therefore, the coordinates `<c1,CAM_FRONT,441.7,485.0>` do not represent a traffic sign or road barrier.
planning	What actions could the ego vehicle take based on `<c1,CAM_FRONT,441.7,485.0>`?	The action is to brake gently to a stop because there is no safety issue, with a high probability.	The action is to keep going at the same speed. The reason is that there is no safety issue. The probability is high.	The ego vehicle, indicated by the camera's position (C1, CAM_FRONT, 441.7, 485.0), is positioned at the intersection of a two-way street. Based on this position, the ego vehicle could take several actions: … In summary, the ego vehicle has a high probability of continuing straight through the intersection, making a left turn, or continuing straight through the intersection.
behavior	Predict the behavior of the ego vehicle. A. Going straight, slowly. B. Steering left, very fast. C. Steering right, slowly. D. Steering left, normal speed.	A	A	C. Steering right, slowly.

Browse 25 sampled scenes: dmitrykhursen.github.io/VQA-AD-CTU

Intended use and limitations

Intended use:

Autonomous driving VQA scene understanding
Benchmarking on DriveLM
Starting point for further fine-tuning

Limitations:

Visual object localization is the dominant bottleneck: providing ground-truth annotations at test time raises the Final score from 0.589 to 0.775, and the near-zero Coord score (0.013) confirms that the model struggles with precise spatial grounding
No quality filtering was applied to pseudo-labels: every generated QA pair is used at face value. This likely caps the usable augmentation ratio, which is consistent with the degrading performance at 30–100% mixing
The pseudo-label generation templates have an ego-centric forward camera bias, disproportionately referencing objects in the forward field of view; this contributes to the performance drop at higher mixing ratios
Not intended for deployment in real autonomous driving systems
Trained and evaluated on nuScenes scenes from Boston and Singapore only

Resources

Resource	Link
Code & scripts	github.com/dmitrykhursen/VQA-AD-CTU
Demo gallery	dmitrykhursen.github.io/VQA-AD-CTU
Pseudo-label dataset	dkhursen/drivelm-pseudo-labels
Base model	OpenGVLab/InternVL2-2B
DriveLM benchmark	github.com/OpenDriveLab/DriveLM
nuScenes dataset	nuscenes.org

Citation

@mastersthesis{khursenko2026vqa,
  author     = {Khursenko, Dmytro},
  title      = {Visual Question Answering for Autonomous Driving},
  school     = {Czech Technical University in Prague, Faculty of Electrical Engineering},
  year       = {2026},
  supervisor = {Hurych, David and Tolias, Georgios}
}

@inproceedings{sima2024drivelm,
  title     = {DriveLM: Driving with Graph Visual Question Answering},
  author    = {Sima, Chonghao and Renz, Katrin and Chitta, Kashyap and others},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2024}
}
