Instructions to use sapoepsilon/gemma4-31b-drone-captioner with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- PEFT
How to use sapoepsilon/gemma4-31b-drone-captioner with PEFT:
from peft import PeftModel from transformers import AutoModelForCausalLM base_model = AutoModelForCausalLM.from_pretrained("/root/drone-ai/models/gemma-4-31b-it") model = PeftModel.from_pretrained(base_model, "sapoepsilon/gemma4-31b-drone-captioner") - Notebooks
- Google Colab
- Kaggle
Gemma 4 31B-it — Anti-UAV Scene Captioner (LoRA)
A LoRA adapter for google/gemma-4-31B-it trained to describe still frames from anti-UAV surveillance camera feeds — drone presence, position in frame, sky conditions, and visible scene structure.
Trained as the captioner stage of a chained drone-pipeline: YOLO detector → ByteTrack → Gemma 4 captioner (this).
Training
| base | google/gemma-4-31B-it |
| method | LoRA (4-bit nf4) — Google cookbook recipe (eager attn, bf16 quant storage) |
| LoRA r / α | 16 / 16, target_modules="all-linear" |
| training data | 658 (frame, caption) pairs from Anti-UAV-RGBT, captions produced by Qwen2.5-VL-7B teacher |
| epochs / steps | 2 / 166 |
| effective batch | 8 (1 × grad-accum 8) |
| LR | 2e-4 constant, max_grad_norm 0.3 |
| eval loss | 0.179 (down from 0.241 first eval) |
| eval token accuracy | 93.4% |
| hardware | 3× NVIDIA RTX 3090 (model parallelism via balanced device_map, ~8GB/GPU) |
Use
from peft import PeftModel
from transformers import AutoProcessor, AutoModelForImageTextToText, BitsAndBytesConfig
import torch
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_quant_storage=torch.bfloat16)
base = AutoModelForImageTextToText.from_pretrained(
"google/gemma-4-31B-it",
quantization_config=bnb, attn_implementation="eager",
dtype=torch.bfloat16, device_map="auto",
)
model = PeftModel.from_pretrained(base, "sapoepsilon/gemma4-31b-drone-captioner")
processor = AutoProcessor.from_pretrained("google/gemma-4-31B-it")
Caveats
- Captions are derived from a VLM teacher (Qwen2.5-VL-7B), not human labels — supervision is noisy and inherits the teacher's biases
- Trained on a narrow distribution: anti-UAV surveillance reticle/HUD imagery (Anti-UAV-RGBT). Out-of-distribution frames may degrade
- Style is fairly templated ("The image shows a drone presence ...") which is intentional for downstream parsing but may sound formulaic
License
Adapter weights: Apache 2.0. Base model retains its original Google Gemma license.
- Downloads last month
- 1
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support