Geo-trax: YOLOv8s Vehicle Detector for Drone BEV Imagery

GitHub PyPI Paper arXiv Zenodo YouTube Website License

This is the default detection model for Geo-trax, a comprehensive pipeline for extracting georeferenced vehicle trajectories from high-altitude drone (bird's-eye view) video footage. The model detects vehicles in aerial imagery and underpins the results reported in the associated publication.

Geo-trax Output Visualization

🎬 This accelerated animation previews some of the capabilities of Geo-trax. Watch the full demonstration (~4 min) on YouTube.

Model Details

Property Value
Architecture YOLOv8s (HBB, horizontal bounding boxes)
Input resolution 1920 × 1920 px
Classes 6 trained (4 primary + 2 auxiliary; see below)
Parameters ~11 M
Framework Ultralytics ≥ 8.4.64
Trained on 19,339 annotated aerial images (679,306 labeled instances); multi-stage, see publication
Validated on Songdo Vision test set (1,084 images, 55,124 vehicle instances)

Classes and Detection Performance

Metrics reported on the Songdo Vision test split (1,084 images, 55,124 labeled vehicle instances). The Instances column is the per-class support in the test set. See Table 3 of the publication for full results.

ID Label Notes Instances Precision Recall mAP@50 mAP@50-95
0 Car incl. vans 49,508 0.979 0.981 0.992 0.835
1 Bus 1,759 0.952 0.977 0.988 0.826
2 Truck 3,052 0.887 0.916 0.935 0.722
3 Motorcycle 805 0.827 0.866 0.888 0.463
4 Pedestrian not evaluated n/a n/a n/a n/a n/a
5 Bicycle not evaluated n/a n/a n/a n/a n/a
All 55,124 0.911 0.935 0.951 0.711

The model reaches 0.951 mAP@50 and 0.711 mAP@50-95 overall, with near-saturated accuracy on cars and buses (mAP@50 ≥ 0.988). Trucks and especially motorcycles are harder: motorcycles are small, sparse in the test set (805 instances), and the main driver of the lower mAP@50-95.

Evaluation Plots

Precision-recall curves and the normalized confusion matrix on the Songdo Vision test set:

Precision-Recall Curve
Precision-Recall Curve
Normalized Confusion Matrix
Normalized Confusion Matrix

Note on pedestrian and bicycle classes: The model was trained on pedestrian and bicycle instances; however, these classes are not evaluated and not recommended for use. They were underrepresented in the training data, are not annotated in the Songdo Vision dataset (making reliable evaluation impossible), and achieve poor detection performance in practice.

How to Use

With Geo-trax (recommended)

This model is the default in Geo-trax and downloads automatically on first use:

pip install geo-trax
geotrax extract video.mp4           # detect, track, and stabilize; auto-downloads the model
geotrax batch   video.mp4 --no-geo  # detect, track, and stabilize; skip georeferencing
geotrax batch   video.mp4           # full pipeline including georeferencing (requires orthophotos)

See the Geo-trax GitHub README for the full pipeline, configuration options, and georeferencing.

Direct Ultralytics inference

from ultralytics import YOLO
from huggingface_hub import hf_hub_download

weights = hf_hub_download(repo_id="rfonod/geo-trax", filename="geotrax_hbb_yolov8s_1920_v1.pt")
model = YOLO(weights)

results = model("drone_frame.jpg", imgsz=1920, conf=0.25, iou=0.45, classes=[0, 1, 2, 3])
results[0].show()

Tip: The model was trained and validated at 1920 px input resolution. Downscaling to 1280 px is possible with a small accuracy trade-off; going below 960 px significantly degrades detection of small vehicles (motorcycles, distant cars). Pass classes=[0, 1, 2, 3] to restrict inference to the four evaluated classes and suppress unreliable predictions.

Training Data

Training followed a multi-stage strategy starting from YOLOv8s weights pretrained on COCO as the initial foundation. Two successive stages were applied:

Stage 1 (BASE): The model was trained on a large, diverse collection drawn from eight public aerial and drone datasets (CARPK, PUCPR+, CyCAR, UAVDT, HARPY, RAI4VD, UIT-ADrone, and VisDrone) combined with the Songdo Vision dataset, totalling 19,339 training images with 679,306 annotations across 6 vehicle classes (car, bus, truck, motorcycle, pedestrian, bicycle).

Stage 2 (FINE): The BASE-trained model was subsequently fine-tuned on a curated, high-quality subset of 9,004 images with 321,368 annotations, emphasising accurate annotations and higher-resolution images, again combined with Songdo Vision, to yield the final weights released here.

Training set composition (annotations per class):

Stage Images Annotations Car Bus Truck Motorcycle Pedestrian Bicycle
BASE 19,339 679,306 561,666 15,587 28,830 44,512 24,239 4,472
FINE 9,004 321,368 266,745 8,047 14,305 30,925 1,260 86

Songdo Vision comprises 5,419 annotated drone frames (4,335 training / 1,084 test; 80/20 split) collected during a large-scale urban traffic monitoring experiment in Songdo, South Korea. It covers four primary vehicle classes captured at 140-150 m altitude by DJI Mavic 3 drones, contributing 217,311 training and 55,124 test instances to the totals above.

Training configuration:

Setting Value
Initialization YOLOv8s pretrained on COCO
Optimizer SGD
Learning rate (initial / final factor) 0.01 / 0.01
Momentum 0.937
Weight decay 0.0005
Batch size 8
Early stopping 50-epoch patience
Input resolution 1920 × 1920 px (letterbox padding)
Mixed precision AMP enabled
Augmentation random scaling, translation, horizontal flip, mosaic, colour jitter, Gaussian/median blur, grayscale, CLAHE

See the publication for complete dataset statistics, training details, and ablation results.

Intended Use and Limitations

  • GSD assumption: The bundled Geo-trax config assumes a ground sampling distance (GSD) of ~0.027 m/px (DJI Mavic 3, 4K, 140-150 m altitude). Adjust this value in the config for different hardware or flight altitudes.
  • Supported classes: Car, bus, truck, and motorcycle (class IDs 0-3). The model was also trained on pedestrian and bicycle instances; however, these classes achieve poor detection performance and are not recommended for use (see the class table above). Geo-trax filters to the four primary classes by default; when using Ultralytics directly, pass classes=[0, 1, 2, 3] to suppress unreliable predictions.

Citation

If you use this model, please cite the associated publication:

@article{fonod2025advanced,
  title   = {Advanced computer vision for extracting georeferenced vehicle trajectories from drone imagery},
  author  = {Fonod, Robert and Cho, Haechan and Yeo, Hwasoo and Geroliminis, Nikolas},
  journal = {Transportation Research Part C: Emerging Technologies},
  volume  = {178},
  pages   = {105205},
  year    = {2025},
  doi     = {10.1016/j.trc.2025.105205}
}

If you additionally use the Geo-trax software, please also cite the specific version you used via its Zenodo record. For example, for version 1.0.0:

@software{fonod2026geo-trax,
  author  = {Fonod, Robert},
  title   = {Geo-trax: A Comprehensive Framework for Georeferenced Vehicle Trajectory Extraction from Drone Imagery},
  url     = {https://github.com/rfonod/geo-trax},
  doi     = {10.5281/zenodo.12119542},
  version = {1.0.0},
  year    = {2026}
}

License

This model is released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license; see the LICENSE file for the full terms. The Geo-trax codebase is distributed separately under the MIT License.

Downloads last month
-
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for rfonod/geo-trax

Finetuned
(162)
this model

Datasets used to train rfonod/geo-trax

Collection including rfonod/geo-trax

Paper for rfonod/geo-trax

Evaluation results