---
base_model:
- OpenGVLab/InternVL2_5-8B
language:
- en
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
datasets:
- ayeshaishaq/DriveLMMo1
---

**DriveLMM-o1: A Large Multimodal Model for Autonomous Driving Reasoning**

[Paper](https://arxiv.org/abs/2503.10621)

DriveLMM-o1 is a fine-tuned large multimodal model designed for autonomous driving. Built on InternVL2.5-8B and adapted with LoRA, it takes stitched multiview images as input and produces step-by-step reasoning. This structured approach improves both final-answer accuracy and interpretability on complex driving tasks such as perception, prediction, and planning.

**Key Features:**
- **Multimodal Integration:** Combines multiview images for comprehensive scene understanding.
- **Step-by-Step Reasoning:** Produces detailed intermediate reasoning steps to explain its decisions.
- **Efficient Adaptation:** Uses dynamic image patching and LoRA fine-tuning to handle high-resolution inputs with minimal extra parameters.
- **Performance Gains:** Achieves significant improvements in both final answer accuracy and overall reasoning score compared to previous open-source models.

**Performance Comparison:**

| Model | Risk Assessment Accuracy | Traffic Rule Adherence | Scene Awareness & Object Understanding | Relevance | Missing Details | Overall Reasoning Score | Final Answer Accuracy |
|-------------------------|--------------------------|------------------------|------------------------------------------|-----------|-----------------|-------------------------|-----------------------|
| GPT-4o (Closed) | 71.32 | 80.72 | 72.96 | 76.65 | 71.43 | 72.52 | 57.84 |
| Qwen-2.5-VL-7B | 46.44 | 60.45 | 51.02 | 50.15 | 52.19 | 51.77 | 37.81 |
| Ovis1.5-Gemma2-9B | 51.34 | 66.36 | 54.74 | 55.72 | 55.74 | 55.62 | 48.85 |
| Mulberry-7B | 51.89 | 63.66 | 56.68 | 57.27 | 57.45 | 57.65 | 52.86 |
| LLaVA-CoT | 57.62 | 69.01 | 60.84 | 62.72 | 60.67 | 61.41 | 49.27 |
| LlamaV-o1 | 60.20 | 73.52 | 62.67 | 64.66 | 63.41 | 63.13 | 50.02 |
| InternVL2.5-8B | 69.02 | 78.43 | 71.52 | 75.80 | 70.54 | 71.62 | 54.87 |
| **DriveLMM-o1 (Ours)** | **73.01** | **81.56** | **75.39** | **79.42** | **74.49** | **75.24** | **62.36** |

**Usage:**

Load the model with the following snippet:

```python
from transformers import AutoModel, AutoTokenizer
import torch

path = 'ayeshaishaq/DriveLMMo1'

# Load the model in bfloat16 with FlashAttention; trust_remote_code is
# required to pull in the InternVL architecture code.
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True
).eval().cuda()

# Slow tokenizer, as in the base InternVL2.5 model card.
tokenizer = AutoTokenizer.from_pretrained(
    path,
    trust_remote_code=True,
    use_fast=False
)
```

For detailed usage instructions and additional configurations, please refer to the [OpenGVLab/InternVL2_5-8B](https://huggingface.co/OpenGVLab/InternVL2_5-8B) repository.

Code: [https://github.com/ayesha-ishaq/DriveLMM-o1](https://github.com/ayesha-ishaq/DriveLMM-o1)

**Limitations:**

While DriveLMM-o1 demonstrates strong performance on autonomous driving tasks, it is fine-tuned for domain-specific reasoning; users may need to further fine-tune or adapt the model for different driving environments.
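
**Example Inference (sketch):**

The snippet under **Usage** only loads the model. Below is a minimal sketch of how a query might be run, assuming the InternVL-style `model.chat(tokenizer, pixel_values, question, generation_config)` interface and 448×448 ImageNet-normalized inputs inherited from the base model. The image path, question text, and the simplified single-tile preprocessing are illustrative; the base InternVL2.5 repository documents a full dynamic-tiling `load_image` helper for high-resolution inputs.

```python
import torch
import torchvision.transforms as T
from PIL import Image
from torchvision.transforms.functional import InterpolationMode

# ImageNet statistics used by the InternVL image encoder.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def load_image(image_file, input_size=448):
    # Minimal single-tile preprocessing (no dynamic tiling): resize,
    # convert to tensor, and normalize with ImageNet statistics.
    transform = T.Compose([
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
    ])
    image = Image.open(image_file).convert('RGB')
    return transform(image).unsqueeze(0)  # shape: (1, 3, input_size, input_size)

# Hypothetical path to a stitched multiview driving image.
pixel_values = load_image('stitched_multiview.jpg').to(torch.bfloat16).cuda()

generation_config = dict(max_new_tokens=1024, do_sample=False)
question = ('<image>\n'
            'Analyze the driving scene and reason step by step: '
            'is it safe for the ego vehicle to change into the left lane?')

# `model` and `tokenizer` are the objects created in the Usage snippet above.
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(response)
```

The stitched multiview image is passed as a single `<image>` input; for the exact prompt and preprocessing format used during training and evaluation, refer to the GitHub repository linked above.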