Model Card for MMLA DINOv2 8-Class Animal Pose Classifier

A lightweight viewpoint/pose classifier that predicts one of 8 canonical orientations (front, front-left, front-right, left, right, back-left, back-right, back) for an animal crop extracted from aerial drone imagery. It pairs a frozen DINOv2 vision-transformer backbone with a small trainable MLP head, and is intended for use as a downstream module in a drone-based wildlife detection-and-navigation pipeline.

Model Details

Model Description

This model takes a 224×224 RGB image crop of a single animal (typically produced by an upstream detector) and outputs a categorical prediction over 8 viewpoint classes arranged around the animal. The 8 classes form a discretization of the animal's heading relative to the camera, with adjacent classes separated by ~45°.

The DINOv2 backbone is loaded via torch.hub from facebookresearch/dinov2 and is kept frozen during training; only the MLP head is updated. This keeps the number of trainable parameters low (well under 1M for the small variant), reduces overfitting on small labeled pose datasets, and allows the same self-supervised representation to be reused for related downstream tasks.

Developed by: Imageomics Institute, Individual Identification of Zebras project (Claire Sun et al.)
Model type: Image classifier (Vision Transformer feature extractor + MLP head)
Language(s) (NLP): N/A (vision model)
License: [More Information Needed: choose a license]
Fine-tuned from model: facebookresearch/dinov2 (dinov2_vits14, dinov2_vitb14, or dinov2_vitl14)

Model Sources

Repository: Imageomics/individual-id-drones
Training script: train_pose_classifier.py
Inference wrapper: pose_classifier.py
User guide: POSE_CLASSIFIER_GUIDE.md
Paper: [More Information Needed — optional]
Demo: [More Information Needed — encouraged]

Uses

Direct Use

The model is intended to be applied to tight, single-animal crops (e.g., the output of a wildlife detector run on aerial drone frames). For each crop it returns the most likely of 8 viewpoint labels:

front, front-left, front-right, left, right, back-left, back-right, back

These labels are useful for:

Selecting frames in which an individual is best observed (e.g., side profiles for stripe-based re-identification).
Filtering training data for downstream identity models that are viewpoint-sensitive.
Behavioral analysis (e.g., orientation of herd members relative to the camera/drone).

Downstream Use

This pose classifier is a component of a larger drone navigation and individual-identification pipeline for zebras and giraffes. Downstream uses include:

Conditioning a re-identification model on viewpoint.
Informing autonomous drone-positioning policies (e.g., maneuver to obtain a side-profile view).
Producing per-track viewpoint histograms used for sighting quality scoring.

Out-of-Scope Use

Non-aerial / ground-level imagery. The model is trained on top-down/oblique drone footage; predictions on eye-level photos are unlikely to be reliable.
Species the model was not trained on. Performance has only been characterized for zebras and giraffes. Application to unrelated species is out of scope without retraining.
Continuous heading regression. The model predicts 1-of-8 discrete classes, not a continuous angle. Adjacent classes (e.g., front vs front-left) are frequently confused and should not be treated as fully independent.
Identity, species, or behavior inference. The model does not predict the identity, species, or activity of the animal.

Bias, Risks, and Limitations

Domain shift: Training data is drawn primarily from aerial drone footage at two field sites (Mpala and OPC). Performance may degrade on imagery captured at other altitudes, lighting conditions, or camera angles.
Class adjacency confusion: Because viewpoint is fundamentally continuous, errors are concentrated between neighboring classes (e.g., front ↔ front-left). The 8-class discretization is a modeling choice, not a property of the underlying phenomenon.
Species imbalance: Most training samples are zebras; giraffe coverage is smaller and per-class performance has not been independently broken out.
Occlusion sensitivity: Heavily occluded or truncated crops (animals partially out of frame, overlapping individuals) are not well represented and tend to produce less reliable predictions.
Tight-crop dependence: The model expects detector-style crops centered on a single animal. Wide-scene images will not produce meaningful predictions.

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. In particular:

Treat adjacent-class confusion (e.g., front/front-left) as expected and consider collapsing to coarser bins (front/side/back) for decisions that don't need fine resolution.
Apply temporal smoothing or majority voting across consecutive frames when classifying tracked individuals.
Apply a confidence threshold to predictions on visibly occluded crops, or withhold those predictions entirely.
Re-evaluate (or retrain) before deploying on a new site, species, or sensor.
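The first two recommendations can be sketched in a few lines of plain Python. This is an illustration only: the assignment of the diagonal classes to the coarse bins and the helper names to_coarse and majority_vote are assumptions of this sketch, not part of pose_classifier.py.

```python
from collections import Counter

# Assumed coarse grouping: the card names the bins (front/side/back) but not
# where the diagonal classes go, so this mapping is a choice made for the sketch.
COARSE_BIN = {
    "front": "front",
    "front-left": "front",
    "front-right": "front",
    "left": "side",
    "right": "side",
    "back-left": "back",
    "back-right": "back",
    "back": "back",
}


def to_coarse(label):
    """Map one of the 8 pose labels to a coarse front/side/back bin."""
    return COARSE_BIN[label]


def majority_vote(labels):
    """Return the most frequent label in a track; ties go to the label seen first."""
    if not labels:
        raise ValueError("labels must be non-empty")
    return Counter(labels).most_common(1)[0][0]


# Per-frame predictions for one tracked individual (made-up values).
track = ["left", "front-left", "left", "left", "back-left"]
print(majority_vote(track))                          # left
print(majority_vote([to_coarse(p) for p in track]))  # side
```

Voting over coarse bins rather than raw labels makes the result less sensitive to adjacent-class flicker between frames, at the cost of angular resolution.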

How to Get Started with the Model

The inference wrapper ViewPointClassifier provides a one-line interface that takes a list of PIL crops and returns a list of pose labels.

from PIL import Image
from pose_classifier import ViewPointClassifier

classifier = ViewPointClassifier(
    weight_path="checkpoints/best_pose_model.pth",
    model_size="small",          # must match the trained checkpoint
    device="cpu",                # or "cuda"
)

crops = [Image.open(p).convert("RGB") for p in ["zebra1.jpg", "zebra2.jpg"]]
poses = classifier(crops)
# e.g. ['front-left', 'back']

The wrapper handles preprocessing (resize to 256, center-crop to 224, ImageNet normalization) and accepts PIL images, NumPy arrays, or torch tensors as input.

To train a new MLP head (on top of the frozen, pretrained backbone) on a new pose-labeled dataset:

python train_pose_classifier.py \
    --data_dir ./pose_labels \
    --model_size small \
    --epochs 30 \
    --batch_size 32 \
    --lr 1e-3

See POSE_CLASSIFIER_GUIDE.md for the full guide, including the visual reference diagram for each pose class.

Training Details

Training Data

Pose-labeled crops of zebras and giraffes extracted from aerial drone footage at Mpala (Kenya) and OPC field sites. Data is organized either as a per-class folder hierarchy:

pose_labels/
  front/  front-left/  front-right/  left/  right/  back-left/  back-right/  back/

or as a CSV with image_path and pose columns. Class counts are inherently imbalanced; the imbalance is handled at the sampler level (see below).

This model was fine-tuned on the MMLA pose dataset:

Dataset: imageomics/mmla-pose

The dataset contains cropped images of zebras from MMLA drone footage labeled with one of eight pose orientations: front, front-left, front-right, left, back-left, back, back-right, and right.

Training Procedure

Preprocessing

Training-time transforms (applied per image):

Resize shorter side to 256
RandomCrop(224)
ColorJitter(brightness=0.3, contrast=0.3, saturation=0.2)
RandomRotation(±15°)
ToTensor + ImageNet normalization (mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

Validation-time transforms: Resize(256) → CenterCrop(224) → ToTensor → Normalize.

Symmetry-aware horizontal flip: with p=0.5 the crop is horizontally flipped and the label is swapped according to the canonical symmetry of the 8-class scheme:

left  ↔ right
front-left ↔ front-right
back-left ↔ back-right
front, back   unchanged

Because every image has a valid mirrored counterpart, this effectively doubles the pool of distinct training examples without breaking label semantics.
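A minimal sketch of the label swap follows. The function name symmetry_aware_flip and the flip_fn argument are hypothetical: the caller supplies whatever horizontal-flip function matches the image type (for PIL images or tensors, torchvision.transforms.functional.hflip would be one option).

```python
import random

# Label permutation induced by a horizontal flip of the image.
FLIP_LABEL = {
    "left": "right",
    "right": "left",
    "front-left": "front-right",
    "front-right": "front-left",
    "back-left": "back-right",
    "back-right": "back-left",
    "front": "front",
    "back": "back",
}


def symmetry_aware_flip(image, label, flip_fn, p=0.5, rng=random):
    """With probability p, flip the image and swap its label to match."""
    if rng.random() < p:
        return flip_fn(image), FLIP_LABEL[label]
    return image, label


# Toy example: a 2x3 "image" as nested lists, flipped by reversing each row.
toy = [[1, 2, 3], [4, 5, 6]]


def flip_rows(img):
    return [row[::-1] for row in img]


print(symmetry_aware_flip(toy, "front-left", flip_rows, p=1.0))
# ([[3, 2, 1], [6, 5, 4]], 'front-right')
```

The mapping is its own inverse, so flipping twice restores both the image and the original label.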

Class balancing: a WeightedRandomSampler with per-sample weights inversely proportional to class frequency, so that all 8 classes are drawn at approximately equal rates (in expectation) during training.
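The weighting can be illustrated without PyTorch. In the sketch below, inverse_frequency_weights is a hypothetical helper and the label counts are made up. The resulting list is what one would pass as the weights argument of torch.utils.data.WeightedRandomSampler.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return one sampling weight per sample, equal to 1 / count(its class)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


# Made-up, imbalanced label list: 6 left, 3 front, 1 back.
labels = ["left"] * 6 + ["front"] * 3 + ["back"] * 1
weights = inverse_frequency_weights(labels)

# The total weight of each class comes out the same, so sampling in proportion
# to these weights draws every class with equal probability.
per_class_mass = Counter()
for label, w in zip(labels, weights):
    per_class_mass[label] += w
print(dict(per_class_mass))

# With PyTorch installed, the weights would typically be used as:
#   sampler = torch.utils.data.WeightedRandomSampler(
#       weights, num_samples=len(weights), replacement=True)
```

Sampling is with replacement, so images from rare classes are repeated within an epoch. The augmentations above help keep those repeats from being identical.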

Training Hyperparameters

Training regime: fp16 mixed precision when running on CUDA (via torch.cuda.amp); fp32 on CPU.
Optimizer: AdamW, lr=1e-3, weight_decay=0.01 (head parameters only — backbone is frozen).
Loss: CrossEntropyLoss(label_smoothing=0.1).
LR schedule: CosineAnnealingLR(T_max=epochs).
Default epochs / batch size: 30 / 32.
Backbone: frozen DINOv2 (small = ViT-S/14, 384-dim; base = ViT-B/14, 768-dim; large = ViT-L/14, 1024-dim).
Head: LayerNorm → Linear(feat_dim, 256) → GELU → Dropout(0.3) → Linear(256, 128) → GELU → Dropout(0.3) → Linear(128, 8).

Only the MLP head is trained. For the small variant (384-dim features), the head listed above has roughly 133K trainable parameters, assuming an affine LayerNorm and Linear layers with biases.
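The head's size can be checked by hand from the layer dimensions listed above. The sketch below assumes PyTorch defaults (LayerNorm with learnable weight and bias, Linear layers with bias); head_param_count is a hypothetical helper, not a function from the repository.

```python
def head_param_count(feat_dim, hidden=(256, 128), num_classes=8):
    """Count trainable parameters of LayerNorm plus the three Linear layers."""
    total = 2 * feat_dim                      # LayerNorm weight and bias
    in_dim = feat_dim
    for out_dim in (*hidden, num_classes):
        total += in_dim * out_dim + out_dim   # Linear weight and bias
        in_dim = out_dim
    return total


for name, feat_dim in [("small", 384), ("base", 768), ("large", 1024)]:
    print(name, head_param_count(feat_dim))
# small 133256
# base 232328
# large 298376
```

Under these assumptions, the head stays well under 1M parameters for all three backbone sizes; only the first Linear layer grows with the feature dimension.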

Speeds, Sizes, Times

Checkpoint size: ~88 MB for the small variant (best_pose_model.pth), ~350 MB for base.
Inference (CPU): ~15–20 ms/image (small), ~30–40 ms/image (base).
Inference (GPU): roughly 5–10× faster than CPU.

[More Information Needed — wall-clock training time, throughput per epoch]

Evaluation

Testing Data, Factors & Metrics

Testing Data

When training from a single --data_dir, the script performs an 80/20 random split into train/val. When --train_csv and --val_csv are supplied, those are used directly.

[More Information Needed — held-out test set details, if any beyond the val split]

Factors

The natural disaggregations of interest are:

Pose class (8 categories) — adjacent-class confusion is the dominant error mode.
Species (zebra vs giraffe) — coverage and accuracy may differ.
Site / session (e.g., Mpala vs OPC sessions) — proxies for altitude, lighting, and habitat.

[More Information Needed — disaggregated numbers]

Metrics

Top-1 accuracy (overall and per-class).
8×8 confusion matrix (printed by train_pose_classifier.py at the end of training).
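For readers who want to recompute these metrics from saved predictions, here is a dependency-free sketch. The circular class order and the adjacent_error_fraction helper are assumptions of this example; the training script's own class indexing and output format may differ.

```python
# Classes in circular order around the animal, so that neighbours in this list
# (with wrap-around) are the adjacent viewpoints. The order is an assumption of
# this sketch, not necessarily the index order used by the training script.
CLASSES = ["front", "front-right", "right", "back-right",
           "back", "back-left", "left", "front-left"]
INDEX = {name: i for i, name in enumerate(CLASSES)}


def confusion_matrix(y_true, y_pred):
    """Rows are true classes, columns are predicted classes."""
    n = len(CLASSES)
    matrix = [[0] * n for _ in range(n)]
    for t, p in zip(y_true, y_pred):
        matrix[INDEX[t]][INDEX[p]] += 1
    return matrix


def per_class_accuracy(matrix):
    """Diagonal count divided by row total; None for classes with no samples."""
    return {
        name: (row[i] / sum(row) if sum(row) else None)
        for i, (name, row) in enumerate(zip(CLASSES, matrix))
    }


def adjacent_error_fraction(matrix):
    """Share of all errors that land on a neighbouring class (one bin away)."""
    n = len(CLASSES)
    errors = adjacent = 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            errors += matrix[i][j]
            if (i - j) % n in (1, n - 1):
                adjacent += matrix[i][j]
    return adjacent / errors if errors else 0.0


# Made-up labels and predictions, for illustration only.
y_true = ["front", "front", "left", "left", "back", "right"]
y_pred = ["front", "front-left", "left", "back-left", "back", "left"]
cm = confusion_matrix(y_true, y_pred)
print(sum(cm[i][i] for i in range(len(CLASSES))) / len(y_true))  # 0.5
print(per_class_accuracy(cm)["front"])                           # 0.5
print(adjacent_error_fraction(cm))
```

Reporting the adjacent-error share alongside top-1 accuracy helps separate near misses (one bin off) from gross errors such as left predicted for right.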

Results

Target performance reported in the user guide:

Overall validation accuracy: >85%
Critical front/back classes: >90%

[More Information Needed — actual measured numbers for the released checkpoint, ideally as a confusion matrix figure]

Summary

The small DINOv2 backbone with the MLP head described above is the released configuration and offers a favorable accuracy/latency trade-off for the drone-navigation use case. The base and large variants are supported by the same training script for users with more compute and labeled data.

Model Examination

[More Information Needed — saliency/feature-attribution analysis, if any]

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

Hardware Type: [More Information Needed — GPU model used for training]
Hours used: [More Information Needed]
Cloud Provider: Ohio Supercomputer Center (OSC)
Compute Region: Ohio, USA
Carbon Emitted: [More Information Needed]

Technical Specifications

Model Architecture and Objective

Input Image (224×224, RGB, ImageNet-normalized)
    │
    ▼
DINOv2 ViT (frozen)
    - small : ViT-S/14  → 384-d feature vector
    - base  : ViT-B/14  → 768-d feature vector
    - large : ViT-L/14  → 1024-d feature vector
    │
    ▼
MLP head (trainable)
    LayerNorm(feat_dim)
    Linear(feat_dim → 256) + GELU + Dropout(0.3)
    Linear(256 → 128)      + GELU + Dropout(0.3)
    Linear(128 → 8)
    │
    ▼
Logits over {front, front-left, front-right, left, right,
             back-left, back-right, back}

Training objective: cross-entropy with label smoothing (0.1), optimized only over the MLP head parameters.
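The head and objective can be sketched in PyTorch as follows. This is an approximation of the description above, not the code from train_pose_classifier.py: build_pose_head is a hypothetical name, and random tensors stand in for backbone features so that the example runs without downloading DINOv2 weights.

```python
import torch
from torch import nn


def build_pose_head(feat_dim=384, num_classes=8, dropout=0.3):
    """MLP head applied to the frozen backbone's feature vector."""
    return nn.Sequential(
        nn.LayerNorm(feat_dim),
        nn.Linear(feat_dim, 256), nn.GELU(), nn.Dropout(dropout),
        nn.Linear(256, 128), nn.GELU(), nn.Dropout(dropout),
        nn.Linear(128, num_classes),
    )


head = build_pose_head(feat_dim=384)   # 384-d features: small variant
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

# Stand-ins for a batch of 4 backbone feature vectors and their class indices.
# In the real pipeline the features would come from the frozen DINOv2 model,
# computed without gradients.
features = torch.randn(4, 384)
targets = torch.tensor([0, 3, 7, 1])

logits = head(features)
loss = criterion(logits, targets)
loss.backward()                        # gradients reach only head parameters
print(logits.shape)                    # torch.Size([4, 8])
```

Passing only head.parameters() to the optimizer, as the hyperparameter list describes, is what keeps the backbone weights fixed.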

Compute Infrastructure

Hardware

Training: a single CUDA-capable GPU is sufficient for the small variant; mixed precision is enabled automatically. Larger DINOv2 variants benefit from more GPU memory.
Inference: runs on CPU or a single GPU. CPU is viable for low-throughput on-board use; GPU is recommended for batched offline processing.

Software

Python 3.x
PyTorch (with torch.hub access to facebookresearch/dinov2)
torchvision
pandas, numpy, Pillow, tqdm

Citation

If you use this model, please cite this model repository, the MMLA pose dataset, the associated CV4Animals workshop paper, and the underlying DINOv2 backbone.

Model

@software{imageomics_mmla_dino_pose_2026,
  author = {Sun, Claire and Kline, Jenna and Pillai, Bharath and Berger-Wolf, Tanya},
  title = {MMLA DINOv2 8-Class Animal Pose Classifier},
  year = {2026},
  url = {https://huggingface.co/imageomics/mmla-dino-pose},
  note = {Fine-tuned on the MMLA pose dataset: https://huggingface.co/datasets/imageomics/mmla-pose}
}

Dataset

Please also cite the MMLA pose dataset:

@dataset{imageomics_mmla_pose_2026,
  title = {MMLA Pose Dataset},
  year = {2026},
  url = {https://huggingface.co/datasets/imageomics/mmla-pose}
}

Underlying backbone

@article{oquab2023dinov2,
  title   = {DINOv2: Learning Robust Visual Features without Supervision},
  author  = {Oquab, Maxime and Darcet, Timoth{\'e}e and Moutakanni, Th{\'e}o and Vo, Huy V. and Szafraniec, Marc and Khalidov, Vasil and Fernandez, Pierre and Haziza, Daniel and Massa, Francisco and El-Nouby, Alaaeldin and others},
  journal = {arXiv preprint arXiv:2304.07193},
  year    = {2023},
  url     = {https://arxiv.org/abs/2304.07193}
}

Acknowledgements

This work was supported by the Imageomics Institute, which is funded by the US National Science Foundation's Harnessing the Data Revolution (HDR) program under Award #2118240 (Imageomics: A New Frontier of Biological Information Powered by Knowledge-Guided Machine Learning). Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Compute was provided by the Ohio Supercomputer Center. Citation: Ohio Supercomputer Center. 1987. Ohio Supercomputer Center. Columbus OH: Ohio Supercomputer Center. https://ror.org/01apna436.

The backbone model is DINOv2 by Meta AI Research.

Glossary

Pose / viewpoint: the orientation of the animal relative to the camera, discretized here into 8 bins of ~45° each.
Frozen backbone: the DINOv2 weights are fixed during training; gradients flow only through the MLP head.
Symmetry-aware flip: horizontal-flip augmentation paired with a label swap (left↔right, front-left↔front-right, back-left↔back-right) so that flipped images carry geometrically correct labels.