# PrefDrive: LoRA DPO LLaMA2-7B for Autonomous Driving
This repository contains LoRA (Low-Rank Adaptation) parameters for a fine-tuned version of LLaMA2-7B trained with Direct Preference Optimization (DPO). The model is trained to better align with specific driving behaviors and operational requirements through preference learning, significantly improving autonomous driving performance.
## Model Details

- Base Model: meta-llama/Llama-2-7b
- Training Method: Direct Preference Optimization (DPO)
- LoRA Parameters:
  - Rank (r): 16
  - Alpha (α): 16
  - Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, down_proj, up_proj
- Training Framework: Unsloth + TRL
- Training Precision: 4-bit Quantization
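
For reference, the adapter settings above correspond roughly to the PEFT configuration below. This is a sketch, not the authors' exact setup: `lora_dropout` and `bias` are illustrative defaults that are not reported in this card.

```python
from peft import LoraConfig

# Approximate reconstruction of the adapter configuration listed above.
# lora_dropout and bias are illustrative assumptions, not reported values.
lora_config = LoraConfig(
    r=16,                      # LoRA rank
    lora_alpha=16,             # LoRA scaling factor (alpha)
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "down_proj", "up_proj",
    ],
    lora_dropout=0.0,          # assumption
    bias="none",               # assumption
    task_type="CAUSAL_LM",
)
```
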
## Training Configuration

| Parameter | Value |
|---|---|
| Base Model | LLaMA2-7B |
| Training Strategy | LoRA |
| Learning Rate | 1e-5 |
| Batch Size | 4 |
| Gradient Accumulation Steps | 2 |
| Training Epochs | 3 |
| Maximum Sequence Length | 2,048 |
| Warmup Ratio | 0.1 |
| Max Gradient Norm | 0.3 |
| DPO Beta (β) | 0.1 |
| Loss Type | Sigmoid |
| Training Data | Chosen & rejected action pairs |
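
The table maps onto TRL's DPO configuration roughly as follows. This is a sketch assuming a recent TRL release in which `DPOConfig` carries `beta` and `loss_type`; `output_dir` is a placeholder, not a value reported by the authors.

```python
from trl import DPOConfig

# Hyperparameters from the table above; output_dir is a placeholder.
training_args = DPOConfig(
    output_dir="prefdrive-dpo",          # placeholder
    learning_rate=1e-5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    num_train_epochs=3,
    max_length=2048,
    warmup_ratio=0.1,
    max_grad_norm=0.3,
    beta=0.1,
    loss_type="sigmoid",
)
```
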
## Usage

To use this model, you'll need to load both the base model and the LoRA adapter:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base model
base_model_id = "meta-llama/Llama-2-7b"
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = AutoModelForCausalLM.from_pretrained(base_model_id)

# Load LoRA adapter
peft_model_id = "[YOUR_USERNAME]/lora-dpo-llama-7b"
model = PeftModel.from_pretrained(model, peft_model_id)

# Use model for inference
inputs = tokenizer("Hello, please", return_tensors="pt")
outputs = model.generate(**inputs, max_length=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
For faster inference with Unsloth:

```python
from unsloth import FastLanguageModel

# Load the LoRA adapter directly; Unsloth resolves the base model from the
# adapter configuration and applies 4-bit quantization
model, tokenizer = FastLanguageModel.from_pretrained(
    "[YOUR_USERNAME]/lora-dpo-llama-7b",
    load_in_4bit=True,
    max_seq_length=2048,
)

# Enable Unsloth's optimized inference mode
FastLanguageModel.for_inference(model)

# Use model for inference
inputs = tokenizer("Hello, please", return_tensors="pt")
outputs = model.generate(**inputs, max_length=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Training Dataset
The model was trained on the PrefDrive dataset, a comprehensive collection of 74,040 driving sequences carefully annotated with driving preferences and driving decisions. Each entry in the dataset consists of:
- A driving scenario description (s)
- A preferred/chosen driving action with its reasoning and resulting waypoint (a_p)
- A rejected driving action with its reasoning and resulting waypoint (a_r)
This dataset captures various autonomous driving scenarios with emphasis on proper distance maintenance, trajectory smoothness, traffic rule compliance, and route adherence.
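
The exact serialization of the dataset is not reproduced in this card, but conceptually each preference pair maps onto the prompt/chosen/rejected layout used by TRL's DPO tooling. The field contents below are invented placeholders for illustration only:

```python
# Illustrative structure of one preference pair (all text is a placeholder,
# not an actual PrefDrive entry).
example = {
    "prompt": "Scenario: ego vehicle at 8 m/s, lead vehicle braking 15 m ahead, "
              "route instruction: continue straight through the intersection.",
    "chosen": "Reasoning: brake smoothly to keep a safe following distance. "
              "Waypoint: (x_p, y_p)",
    "rejected": "Reasoning: maintain speed and change lanes abruptly. "
                "Waypoint: (x_r, y_r)",
}
```
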
## Training Procedure

The model was trained using the DPO method, which directly optimizes a language model to align with driving preferences without requiring a separate reward model. The training process uses pairwise comparisons between preferred and rejected driving actions to update the model parameters, as in the sketch below.
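
For completeness, a minimal TRL training sketch might look as follows. This is not the authors' training script: the dataset file name is a placeholder, and `lora_config` and `training_args` refer to the illustrative configuration objects sketched earlier in this card.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOTrainer

# Base policy; LLaMA's tokenizer has no pad token by default
base_model_id = "meta-llama/Llama-2-7b"
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model_id)

# Placeholder path; the actual PrefDrive preference pairs are not bundled here
preference_data = load_dataset("json", data_files="prefdrive_pairs.json", split="train")

trainer = DPOTrainer(
    model=model,
    ref_model=None,                # TRL derives a frozen reference from the initial policy
    args=training_args,            # the DPOConfig sketched after the configuration table
    train_dataset=preference_data,
    processing_class=tokenizer,    # older TRL versions use `tokenizer=` instead
    peft_config=lora_config,       # the LoraConfig sketched under Model Details
)
trainer.train()
```
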
### Methodology
The PrefDrive methodology for autonomous driving is formulated as:
$\mathcal{L}_{DPO} = -\mathbb{E}_{(s,a_p,a_r)\sim\mathcal{D}}\Big[\log\sigma\Big(\beta\log\frac{\pi_\theta(a_p|s)}{\pi_{ref}(a_p|s)} - \beta\log\frac{\pi_\theta(a_r|s)}{\pi_{ref}(a_r|s)}\Big)\Big]$
where:
- $\mathcal{D}$ represents our driving preference dataset
- $s$ denotes the current driving scenario description
- $a_p$ represents the preferred (chosen) driving action with its reasoning and resulting waypoint
- $a_r$ represents the rejected driving action with its reasoning and resulting waypoint
- $\pi_\theta$ is the policy model being trained
- $\pi_{ref}$ is the initial reference model
- $\beta$ controls the preference learning sensitivity (set to 0.1)
- $\sigma$ represents the sigmoid function
This formulation shows explicitly how the model learns to favor the chosen driving action over the rejected one while keeping its deviation from the reference model's behavior within reasonable bounds (see the sketch below).
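
To make the objective concrete, here is a small PyTorch sketch of the sigmoid DPO loss above. It assumes the per-sequence log-probabilities $\log\pi(a|s)$, summed over the action tokens, have already been computed for both the policy and the reference model; it is an illustration, not the training code used for this release.

```python
import torch
import torch.nn.functional as F

def dpo_sigmoid_loss(policy_logp_chosen, policy_logp_rejected,
                     ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Sigmoid DPO loss for a batch of (s, a_p, a_r) triples.

    Each argument is a tensor of per-sequence log-probabilities log pi(a|s),
    summed over the tokens of the corresponding action.
    """
    # beta * [log pi_theta(a_p|s)/pi_ref(a_p|s) - log pi_theta(a_r|s)/pi_ref(a_r|s)]
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)) == softplus(-x), a numerically stable form
    return F.softplus(-logits).mean()
```
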
### Key Training Parameters
- Learning rate: 1e-5
- Number of epochs: 3
- DPO beta: 0.1
- Loss type: Sigmoid
- Max sequence length: 2048
## Evaluation Results
The model was evaluated in the CARLA simulator across different town environments. Here are the performance metrics:
### Town 01 Performance

| Metric | LMDrive (baseline) | PrefDrive (Ours) | Improvement |
|---|---|---|---|
| Composite Score | 53.00 | 56.12 | +5.9% |
| Penalty Score | 0.86 | 0.88 | +1.5% |
| Route Completion | 59.10 | 64.15 | +8.5% |
| Layout Collisions | 0.73 | 0.27 | -63.5% |
| Traffic Light Violations | 0.22 | 0.16 | -28.1% |
| Route Deviation | 1.32 | 1.36 | +3.0% |
| Vehicle Blocked | 0.11 | 0.00 | -100.0% |
### Town 04 Performance

| Metric | LMDrive (baseline) | PrefDrive (Ours) | Improvement |
|---|---|---|---|
| Composite Score | 60.11 | 65.93 | +9.7% |
| Penalty Score | 0.93 | 0.96 | +3.2% |
| Route Completion | 65.25 | 69.93 | +7.2% |
| Layout Collisions | 0.00 | 0.00 | 0.0% |
| Traffic Light Violations | 0.24 | 0.00 | -100.0% |
| Route Deviation | 1.86 | 1.77 | -4.8% |
| Vehicle Blocked | 0.00 | 0.00 | 0.0% |
The results demonstrate significant improvements in crucial metrics, particularly in reducing traffic light violations and layout collisions while improving route completion.
## Limitations and Biases

This model inherits the limitations and biases of the base LLaMA2 model. Additionally:
- It's optimized specifically for autonomous driving tasks and may not perform well in unrelated domains
- Performance may vary in driving environments that differ significantly from the training data
- The LoRA adaptation affects specific parameter matrices and may not fully transform the base model's capabilities
- While the model shows improved performance in simulated environments (CARLA), its behavior in real-world driving scenarios would require further validation and safety testing
- The model is designed to work with a specific autonomous driving stack and may require adaptation for different setups
## Ethical Considerations
When using this model, consider:
- The potential for generating harmful, misleading, or biased content
- The limitations in factual accuracy and reasoning abilities
- The need for appropriate content filtering in production applications
## Citation

```bibtex
@inproceedings{Li2025,
  title     = {PrefDrive: A Preference Learning Framework for Autonomous Driving with Large Language Models},
  author    = {Li, Yun and Javanmardi, Ehsan and Thompson, Simon and Katsumata, Kai and Orsholits, Alex and Tsukada, Manabu},
  booktitle = {2025 IEEE Intelligent Vehicles Symposium (IV)},
  year      = {2025},
}
```