
PrefDrive: LoRA DPO LLaMA-2-7B for Autonomous Driving

This repository contains LoRA (Low-Rank Adaptation) parameters for a version of LLaMA-2-7B fine-tuned with Direct Preference Optimization (DPO). Through preference learning, the model is aligned with specific driving behaviors and operational requirements, improving autonomous driving performance.

Model Details

  • Base Model: meta-llama/Llama-2-7b
  • Training Method: Direct Preference Optimization (DPO)
  • LoRA Parameters (see the configuration sketch after this list):
    • Rank (r): 16
    • Alpha (α): 16
    • Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, down_proj, up_proj
  • Training Framework: Unsloth + TRL
  • Training Precision: 4-bit Quantization
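
For reference, the LoRA setup above corresponds roughly to the following PEFT configuration. This is a minimal sketch: the dropout, bias, and task-type values are assumptions, since they are not listed in this card.

from peft import LoraConfig

# LoRA configuration matching the parameters listed above.
# lora_dropout, bias, and task_type are assumptions not specified in this card.
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "down_proj", "up_proj",
    ],
    lora_dropout=0.0,
    bias="none",
    task_type="CAUSAL_LM",
)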

Training Configuration

| Parameter | Value |
| --- | --- |
| Base Model | LLaMA-2-7B |
| Training Strategy | LoRA |
| Learning Rate | 1e-5 |
| Batch Size | 4 |
| Gradient Accumulation Steps | 2 |
| Training Epochs | 3 |
| Maximum Sequence Length | 2,048 |
| Warmup Ratio | 0.1 |
| Max Gradient Norm | 0.3 |
| DPO Beta (β) | 0.1 |
| Loss Type | Sigmoid |
| Training Data | Chosen & rejected action pairs |
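
As a rough sketch of how these settings map onto the Unsloth + TRL stack, the arguments below mirror the table using TRL's DPOConfig and DPOTrainer. Argument names follow recent TRL releases and may differ between versions; model, tokenizer, and train_dataset are placeholders assumed to be prepared elsewhere.

from trl import DPOConfig, DPOTrainer

# Training arguments mirroring the table above (sketch only; names may vary by TRL version).
dpo_args = DPOConfig(
    learning_rate=1e-5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    num_train_epochs=3,
    warmup_ratio=0.1,
    max_grad_norm=0.3,
    beta=0.1,              # DPO beta
    loss_type="sigmoid",   # standard sigmoid DPO loss
    max_length=2048,
    output_dir="outputs",
)

# `model`, `tokenizer`, and `train_dataset` (chosen/rejected pairs) are assumed
# to be prepared as described in the Usage and Training Dataset sections.
trainer = DPOTrainer(
    model=model,
    args=dpo_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # older TRL versions use tokenizer= instead
)
trainer.train()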

Usage

To use this model, you'll need to load both the base model and the LoRA adapter:

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base model
base_model_id = "meta-llama/Llama-2-7b"
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = AutoModelForCausalLM.from_pretrained(base_model_id)

# Load LoRA adapter
peft_model_id = "[YOUR_USERNAME]/lora-dpo-llama-7b"
model = PeftModel.from_pretrained(model, peft_model_id)

# Use model for inference
inputs = tokenizer("Hello, please", return_tensors="pt")
outputs = model.generate(**inputs, max_length=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

For faster inference with Unsloth:

from unsloth import FastLanguageModel

# Load the base model and the LoRA adapter together
# (Unsloth reads the adapter config in the repo and pulls in the base model it references)
model, tokenizer = FastLanguageModel.from_pretrained(
    "[YOUR_USERNAME]/lora-dpo-llama-7b",
    load_in_4bit=True,
    max_seq_length=2048,
)

# Enable Unsloth's optimized inference mode
FastLanguageModel.for_inference(model)

# Use model for inference
inputs = tokenizer("Hello, please", return_tensors="pt")
outputs = model.generate(**inputs, max_length=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
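
If a standalone checkpoint is preferred, the adapter can also be merged into the base weights with PEFT. This is a short sketch, assuming model is the PeftModel loaded in the first example above; the output directory name is arbitrary.

# Optional: merge the LoRA weights into the base model for standalone deployment.
# Assumes `model` is the PeftModel loaded in the first example above.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("prefdrive-llama2-7b-merged")
tokenizer.save_pretrained("prefdrive-llama2-7b-merged")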

Training Dataset

The model was trained on the PrefDrive dataset, a collection of 74,040 driving sequences annotated with preferred and rejected driving decisions. Each entry in the dataset consists of:

  • A driving scenario description (s)
  • A preferred/chosen driving action with its reasoning and resulting waypoint (a_p)
  • A rejected driving action with its reasoning and resulting waypoint (a_r)

This dataset captures various autonomous driving scenarios with emphasis on proper distance maintenance, trajectory smoothness, traffic rule compliance, and route adherence.
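
For illustration, a single preference record in the chosen/rejected format expected by DPO-style trainers might look like the sketch below. The prompt/chosen/rejected field names follow the TRL convention, and the scenario text and waypoints are invented placeholders, not actual dataset entries.

# Hypothetical example record (illustrative only; not taken from the dataset).
example = {
    # s: driving scenario description
    "prompt": "Scenario: approaching a signalized intersection, light is red, "
              "lead vehicle 20 m ahead is braking.",
    # a_p: preferred action with its reasoning and resulting waypoint
    "chosen": "Decelerate smoothly and stop behind the lead vehicle. "
              "Waypoint: (x=12.4, y=3.1)",
    # a_r: rejected action with its reasoning and resulting waypoint
    "rejected": "Maintain speed through the intersection. Waypoint: (x=25.0, y=3.2)",
}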

Training Procedure

The model was trained using the DPO method which directly optimizes a language model to align with driving preferences without requiring a reward model. The training process uses pairwise comparisons between preferred and rejected driving actions to update model parameters.

Methodology

The PrefDrive methodology for autonomous driving is formulated as:

$\mathcal{L}_{DPO} = -\mathbb{E}_{(s,a_p,a_r)\sim\mathcal{D}}\Big[\log\sigma\Big(\beta\log\frac{\pi_\theta(a_p|s)}{\pi_{ref}(a_p|s)} - \beta\log\frac{\pi_\theta(a_r|s)}{\pi_{ref}(a_r|s)}\Big)\Big]$

where:

  • $\mathcal{D}$ represents our driving preference dataset
  • $s$ denotes the current driving scenario description
  • $a_p$ represents the preferred (chosen) driving action with its reasoning and resulting waypoint
  • $a_r$ represents the rejected driving action with its reasoning and resulting waypoint
  • $\pi_\theta$ is the policy model being trained
  • $\pi_{ref}$ is the initial reference model
  • $\beta$ controls the preference learning sensitivity (set to 0.1)
  • $\sigma$ represents the sigmoid function

This formulation explicitly shows how our model learns to favor chosen driving actions over rejected ones while maintaining reasonable deviation from the reference model's behavior.
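
The same objective can be written directly in code. The snippet below is a minimal PyTorch sketch that evaluates the expression above from summed per-sequence log-probabilities; the numeric values are toy inputs for illustration, not outputs of the trained model.

import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Sigmoid DPO loss computed from summed per-sequence log-probabilities."""
    # beta * [ log pi_theta(a_p|s)/pi_ref(a_p|s) - log pi_theta(a_r|s)/pi_ref(a_r|s) ]
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_ratio - rejected_ratio)
    # -log sigmoid(logits), averaged over the batch
    return -F.logsigmoid(logits).mean()

# Toy values for illustration only.
loss = dpo_loss(
    policy_logp_chosen=torch.tensor([-42.0]),
    policy_logp_rejected=torch.tensor([-45.0]),
    ref_logp_chosen=torch.tensor([-44.0]),
    ref_logp_rejected=torch.tensor([-44.5]),
)
print(loss)  # lower when the policy prefers the chosen action more than the reference does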

Key Training Parameters

  • Learning rate: 1e-5
  • Number of epochs: 3
  • DPO beta: 0.1
  • Loss type: Sigmoid
  • Max sequence length: 2048

Evaluation Results

The model was evaluated in the CARLA simulator across different town environments. Here are the performance metrics:

Town 01 Performance

| Metric | LMDrive (baseline) | PrefDrive (Ours) | Improvement |
| --- | --- | --- | --- |
| Composite Score | 53.00 | 56.12 | +5.9% |
| Penalty Score | 0.86 | 0.88 | +1.5% |
| Route Completion | 59.10 | 64.15 | +8.5% |
| Layout Collisions | 0.73 | 0.27 | -63.5% |
| Traffic Light Violations | 0.22 | 0.16 | -28.1% |
| Route Deviation | 1.32 | 1.36 | +3.0% |
| Vehicle Blocked | 0.11 | 0.00 | -100.0% |

Town 04 Performance

| Metric | LMDrive (baseline) | PrefDrive (Ours) | Improvement |
| --- | --- | --- | --- |
| Composite Score | 60.11 | 65.93 | +9.7% |
| Penalty Score | 0.93 | 0.96 | +3.2% |
| Route Completion | 65.25 | 69.93 | +7.2% |
| Layout Collisions | 0.00 | 0.00 | 0.0% |
| Traffic Light Violations | 0.24 | 0.00 | -100.0% |
| Route Deviation | 1.86 | 1.77 | -4.8% |
| Vehicle Blocked | 0.00 | 0.00 | 0.0% |

The results demonstrate significant improvements in crucial metrics, particularly in reducing traffic light violations and layout collisions while improving route completion.

Limitations and Biases

This model inherits the limitations and biases from the base LLaMa model. Additionally:

  • It's optimized specifically for autonomous driving tasks and may not perform well in unrelated domains
  • Performance may vary in driving environments that differ significantly from the training data
  • The LoRA adaptation affects specific parameter matrices and may not fully transform the base model's capabilities
  • While the model shows improved performance in simulated environments (CARLA), its behavior in real-world driving scenarios would require further validation and safety testing
  • The model is designed to work with a specific autonomous driving stack and may require adaptation for different setups

Ethical Considerations

When using this model, consider:

  • The potential for generating harmful, misleading, or biased content
  • The limitations in factual accuracy and reasoning abilities
  • The need for appropriate content filtering in production applications

Citations

@INPROCEEDINGS{Li2025,
  title={PrefDrive: A Preference Learning Framework for Autonomous Driving with Large Language Models},
  author={Li, Yun and Javanmardi, Ehsan and Thompson, Simon and Katsumata, Kai and Orsholits, Alex and Tsukada, Manabu},
  booktitle={2025 IEEE Intelligent Vehicles Symposium (IV)},
  year={2025}
}