AIOne-GeoSeg-330M
A 330M-parameter Vision Transformer with our TCAM decoder for Korean land-cover semantic segmentation on aerial and satellite imagery.
Paper (coming soon)
Model Description
AIOne-GeoSeg-330M is a semantic segmentation model that classifies every pixel of a Korean aerial / satellite ortho-image into one of 11 land-cover categories (background, building, parking lot, road, street tree, paddy, greenhouse, dry field, forest, bare land, farmland).
The model pairs a DINOv3 ViT-L/16 backbone (sat-493M pretrain) with the TCAM (Terrain-Context Adaptive Mutual-attention) Head, a segmentation decoder we designed in-house for Korean remote-sensing imagery. TCAM infers an implicit terrain context (mountain / plain / urban) from RGB alone, picks the right backbone scale per pixel, and runs bidirectional cross-attention between learnable class prototypes and spatial features so that visually similar classes such as paddy / dry field / farmland or street tree / forest can be separated reliably.
- Custom TCAM head. 6 in-house submodules: FeatureBridge, TCE, SAA, CFMA ×3, BoundaryGuidedUpsampler, Classifier. +8.4 % relative mIoU over DPT on the same backbone and data.
- DINOv3-ViT-L/16 backbone. 24 layers, 1024 hidden, 4 register tokens, satellite-pretrained (sat-493M), unfrozen during training.
- Single-shot, full-resolution mask. 512×512 RGB input → 512×512 per-class logits, no sliding window required.
- HF Transformers compatible. Loads via AutoModel with trust_remote_code=True; ships an AIOne_GeoSegImageProcessor.
Key Capabilities
- Pixel-wise classification of Korean ortho-imagery into 11 land-cover classes.
- Robust separation of similar classes (paddy / dry field / farmland, street tree / forest) via class-prototype mutual attention.
- Sharp object and parcel boundaries via boundary-guided upsampling, which is useful for buildings, roads, and field edges.
- Built-in Korean label set and per-class RGB palette for direct visualization.
- One-call loading through transformers.AutoModel / AutoImageProcessor.
Classes
| ID | Korean | English | RGB |
|---|---|---|---|
| 0 | 배경 | Background | (0, 0, 0) |
| 1 | 건물 | Building | (184, 131, 237) |
| 2 | 주차장 | Parking lot | (16, 64, 178) |
| 3 | 도로 | Road | (42, 65, 247) |
| 4 | 가로수 | Street tree | (200, 229, 155) |
| 5 | 논 | Paddy | (191, 255, 255) |
| 6 | 비닐하우스 | Greenhouse | (220, 240, 255) |
| 7 | 밭 | Dry field | (102, 249, 247) |
| 8 | 산림 | Forest | (45, 75, 42) |
| 9 | 나지 | Bare land | (255, 242, 159) |
| 10 | 농경지 | Farmland | (210, 180, 140) |
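As a small illustration of how the class IDs in the table can be used downstream, the sketch below maps IDs to English names and reports each class's share of a mask. The `CLASSES` dictionary is copied from the table; `class_shares` and the toy mask are hypothetical helpers for illustration and are not part of the released code.

```python
from collections import Counter

# ID -> (English name, RGB), copied from the class table.
CLASSES = {
    0: ("Background", (0, 0, 0)),
    1: ("Building", (184, 131, 237)),
    2: ("Parking lot", (16, 64, 178)),
    3: ("Road", (42, 65, 247)),
    4: ("Street tree", (200, 229, 155)),
    5: ("Paddy", (191, 255, 255)),
    6: ("Greenhouse", (220, 240, 255)),
    7: ("Dry field", (102, 249, 247)),
    8: ("Forest", (45, 75, 42)),
    9: ("Bare land", (255, 242, 159)),
    10: ("Farmland", (210, 180, 140)),
}


def class_shares(mask_rows):
    """Return {English name: fraction of pixels} for a 2D grid of class IDs."""
    counts = Counter(int(v) for row in mask_rows for v in row)
    total = sum(counts.values())
    return {CLASSES[i][0]: n / total for i, n in sorted(counts.items())}


# Hypothetical 2x4 mask, for illustration only.
toy_mask = [[1, 1, 3, 3],
            [8, 8, 8, 3]]
shares = class_shares(toy_mask)
print(shares)
```

The function only iterates over rows and values, so it should also accept a 2D NumPy mask such as the one produced by the inference code.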
TCAM Head
TCAM = Terrain-Context Adaptive Mutual-attention. It takes multi-layer hidden states from DINOv3-ViT-L/16 at stages {5, 11, 17, 23} and decodes them through 6 submodules:
| # | Submodule | Role | Core idea |
|---|---|---|---|
| 1 | FeatureBridge ×4 | Reshape ViT hidden states (B, N, C) to 2D maps (B, D, H, W) | CLS readout is projected and concatenated to patch tokens, then a 1×1 conv compresses channels |
| 2 | TCE (Terrain Context Estimator) | Infer an implicit terrain context (mountain / plain / urban) from the deepest feature; output FiLM parameters (γ, β) | GAP → MLP. No DEM required; γ is initialized with a +1.0 residual so that modulation starts as the identity |
| 3 | SAA (Scale-Adaptive Aggregation) | Per-pixel softmax fusion of the 4 scales | FiLM-modulated by (γ, β) so that forest pixels lean on deep layers and building / road pixels lean on shallow layers automatically |
| 4 | CFMA ×3 (Class-Feature Mutual Attention) | Bidirectional cross-attention between learnable class prototypes and the fused spatial feature | (1) Class → Feature: each class embedding queries the spatial feature; (2) Feature → Class: spatial features re-query the refreshed prototypes. Major gain on confusable classes (paddy / dry field / farmland, street tree / forest) |
| 5 | BoundaryGuidedUpsampler | Extract a boundary attention map (sigmoid) from a shallow feature and run a 4-stage ConvTranspose for 16× upsampling | Recovers sharp parcel and mountain-ridge boundaries |
| 6 | Classifier | 3×3 conv → 1×1 conv | Pixel-wise logits |
Pipeline: FeatureBridge ×4 → TCE → SAA → CFMA ×3 → BoundaryGuidedUpsampler → Classifier
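To make the spatial bookkeeping of the pipeline concrete, here is a rough shape trace using only numbers stated in this card (512-pixel input, patch size 16, 4 register tokens, 4 ConvTranspose stages). The token layout (one CLS token plus registers plus patches) and the assumption that each upsampling stage doubles the resolution are inferences, not confirmed implementation details.

```python
IMAGE_SIZE = 512      # model input side length, from the card
PATCH_SIZE = 16       # ViT-L/16 patch size
NUM_REGISTERS = 4     # register tokens, from the card
UPSAMPLE_STAGES = 4   # ConvTranspose stages in BoundaryGuidedUpsampler

grid = IMAGE_SIZE // PATCH_SIZE   # tokens per side after patch embedding
num_patches = grid * grid         # patch tokens reshaped by FeatureBridge

# Assumption: sequence = 1 CLS token + register tokens + patch tokens.
seq_len = 1 + NUM_REGISTERS + num_patches

# Assumption: every ConvTranspose stage doubles height and width.
sizes = [grid * 2 ** stage for stage in range(UPSAMPLE_STAGES + 1)]

print("token grid:", grid, "x", grid)
print("patch tokens:", num_patches, "sequence length:", seq_len)
print("upsampling path:", sizes)
```

Under these assumptions, four doublings from the 32×32 token grid reach the 512×512 output, which matches the 16× upsampling factor stated for the BoundaryGuidedUpsampler.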
Parameter Breakdown
| Module | Params |
|---|---|
| FeatureBridge ×4 | 9.98 M |
| TCE | 0.26 M |
| SAA | 0.59 M |
| CFMA ×3 | 7.11 M |
| BoundaryGuidedUpsampler | 8.10 M |
| Classifier | 1.33 M |
| TCAM Head total | 27.37 M |
| Backbone (DINOv3-ViT-L/16) | 303.13 M |
| Total | 330.50 M |
vs DPT head: +3.24 M (+13.4 % head, +0.99 % total).
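The totals and percentages in this section can be re-derived from the per-module counts. The snippet below is a plain arithmetic check; the DPT head size (24.13 M) is taken from the results table in this card, and the FP32 footprint assumes 4 bytes per parameter with no extra buffers.

```python
# Per-module parameter counts in millions, copied from the breakdown table.
head_m = {
    "FeatureBridge x4": 9.98,
    "TCE": 0.26,
    "SAA": 0.59,
    "CFMA x3": 7.11,
    "BoundaryGuidedUpsampler": 8.10,
    "Classifier": 1.33,
}
BACKBONE_M = 303.13
DPT_HEAD_M = 24.13  # DPT head size from the results table

tcam_total = round(sum(head_m.values()), 2)
model_total = round(BACKBONE_M + tcam_total, 2)
delta = round(tcam_total - DPT_HEAD_M, 2)

head_pct = 100 * delta / DPT_HEAD_M                   # growth relative to the DPT head
total_pct = 100 * delta / (BACKBONE_M + DPT_HEAD_M)   # growth relative to the full DPT model
fp32_gb = model_total * 1e6 * 4 / 1e9                 # assumes 4 bytes per parameter

print(tcam_total, model_total, delta)
print(round(head_pct, 1), round(total_pct, 2), round(fp32_gb, 2))
```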
TCAM Hyperparameters
| Field | Value |
|---|---|
| tcam_hidden_size (D) | 384 |
| tcam_num_heads (CFMA) | 12 |
| tcam_cfma_layers | 3 |
| tcam_tce_hidden_size | 192 |
| tcam_boundary_channels | 64 |
| readout_type | project |
Training
| Item | Value |
|---|---|
| Dataset | Korean aerial photography + AI Hub land-cover |
| Classes | 11 (background + 10 land-cover) |
| Input | 512 × 512, 3-channel RGB |
| Backbone | DINOv3-ViT-L/16 (sat-493M pretrain), unfrozen |
| Selected layers | {5, 11, 17, 23} |
| Patch size | 16 (32 × 32 token grid) |
| Loss | Focal (γ = 2.5) + Dice |
| Optimizer | AdamW, lr 1.3e-4, wd 0.01 |
| Scheduler | CosineAnnealing, η_min = 1e-6 |
| Batch size | 32 |
| Mixed precision | bf16 |
| Sampler | Weighted (class-balanced) |
| Epochs | 40 |
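For readers unfamiliar with the loss listed in the table, the following is a minimal pure-Python sketch of a focal term with γ = 2.5 plus a soft Dice term. How the two terms are weighted, the smoothing constant `eps`, and the averaging scheme are assumptions made for illustration; the actual training code may differ.

```python
import math

GAMMA = 2.5  # focal exponent from the training table


def focal_loss(probs, targets, gamma=GAMMA):
    """Mean focal loss; probs[i][c] is the softmax probability of class c at pixel i."""
    losses = [-(1.0 - p[t]) ** gamma * math.log(p[t]) for p, t in zip(probs, targets)]
    return sum(losses) / len(losses)


def dice_loss(probs, targets, num_classes, eps=1e-6):
    """Soft Dice loss averaged over classes (eps is an assumed smoothing term)."""
    total = 0.0
    for c in range(num_classes):
        inter = sum(p[c] for p, t in zip(probs, targets) if t == c)
        denom = sum(p[c] for p in probs) + sum(1 for t in targets if t == c)
        total += 1.0 - (2.0 * inter + eps) / (denom + eps)
    return total / num_classes


# Toy example: 3 pixels, 2 classes (hypothetical values).
probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
targets = [0, 0, 1]

# Assumption: the two terms are summed with equal weight.
loss = focal_loss(probs, targets) + dice_loss(probs, targets, num_classes=2)
print(loss)
```

The focal factor (1 - p)^γ shrinks the contribution of pixels that are already classified confidently, while Dice is computed per class and is therefore less sensitive to class frequency. That pairing is a common choice for imbalanced segmentation data.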
Results
TCAM vs DPT (same backbone, same data, 40 epochs)
| Head | Params | Best Val mIoU | Best Epoch |
|---|---|---|---|
| DPT | 24.13 M | 0.6505 | 39 |
| TCAM (ours) | 27.37 M | 0.7054 | 40 |
| Δ | +3.24 M | +0.0549 (+8.4 % rel.) | - |
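The sketch below shows one common way to compute mIoU from a confusion matrix, together with the relative gain reported in the table. The card does not state the exact evaluation protocol (for example, whether background is included or how absent classes are handled), so `mean_iou` is an assumed, generic definition and the toy matrix is hypothetical.

```python
def mean_iou(conf):
    """conf[g][p] = number of pixels with ground-truth class g predicted as class p."""
    n = len(conf)
    ious = []
    for c in range(n):
        tp = conf[c][c]
        fn = sum(conf[c]) - tp                       # class c pixels predicted as something else
        fp = sum(conf[r][c] for r in range(n)) - tp  # other pixels predicted as class c
        union = tp + fp + fn
        if union:  # skip classes absent from both prediction and ground truth
            ious.append(tp / union)
    return sum(ious) / len(ious)


# Hypothetical 2-class confusion matrix, for illustration only.
toy_conf = [[8, 2],
            [1, 9]]
toy_miou = mean_iou(toy_conf)

# Relative gain of TCAM over DPT, using the validation mIoU values from the table.
rel_gain = (0.7054 - 0.6505) / 0.6505
print(toy_miou, round(100 * rel_gain, 1))
```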
Training Dynamics
- Epochs 1–10 (warm-up). mIoU oscillates between 0.53 and 0.66; train loss falls quickly from 0.486 to 0.361.
- Epochs 11–35 (stabilization). mIoU climbs monotonically from 0.64 to 0.697; the focal + Dice combination drives minor-class learning.
- Epochs 36–40 (fine convergence). As the LR decays from 3e-5 to 1.6e-5, mIoU plateaus in the 0.694–0.705 range.
Quick Start
Installation
pip install "transformers>=4.45" torch pillow
Inference
import torch
import numpy as np
from PIL import Image
from transformers import AutoModel, AutoImageProcessor
MODEL_ID = "JDONE-Research/AIOne-GeoSeg-330M"
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True).to(device).eval()
processor = AutoImageProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
image = Image.open("aerial.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt").to(device)
with torch.no_grad():
    logits = model(pixel_values=inputs["pixel_values"]).logits  # (1, 11, 512, 512)
mask = logits.argmax(dim=1)[0].cpu().numpy()  # (512, 512) int class IDs
id2label = model.config.id2label
# Config keys may be ints (the usual Transformers behaviour) or strings, so try both.
labels = {id2label.get(int(i), id2label.get(str(i))) for i in np.unique(mask)}
print("Detected classes:", sorted(labels))
Colorized mask
palette = np.array(model.config.label_colors, dtype=np.uint8) # (11, 3)
color_mask = palette[mask] # (512, 512, 3)
Image.fromarray(color_mask).save("mask.png")
# Side-by-side overlay (50% blend)
resized = image.resize((512, 512))
overlay = (np.array(resized) * 0.5 + color_mask * 0.5).astype(np.uint8)
Image.fromarray(overlay).save("overlay.png")
Batch inference
# paths: a list of image file paths supplied by the caller
images = [Image.open(p).convert("RGB") for p in paths]
inputs = processor(images=images, return_tensors="pt").to(device)
with torch.no_grad():
    masks = model(pixel_values=inputs["pixel_values"]).logits.argmax(dim=1).cpu().numpy()
# masks: (B, 512, 512)
Model Specs
| Field | Value |
|---|---|
| Architecture | AIOne_GeoSeg |
| Backbone | DINOv3 ViT-L/16, sat-493M pretrain (24 layers, 1024 hidden, 16 heads, 4 register tokens) |
| Segmentation head | TCAM (custom): FeatureBridge ×4, TCE, SAA, CFMA ×3, BoundaryGuidedUpsampler, Classifier |
| Total parameters | 330.50 M (Backbone 303.13 M + TCAM Head 27.37 M) |
| Weights on disk | 1.3 GB (FP32) |
| Input | RGB image, 512×512 |
| Output | Per-pixel logits, shape (B, 11, 512, 512) |
| Backbone feature taps | stages [5, 11, 17, 23] |
| Number of classes | 11 (Korean land-cover) |
| Validation mIoU | 0.7054 |
| Training precision | bf16 |
| Released checkpoint precision | float32 |
| Domain | Korean aerial / satellite ortho-imagery |
Intended Use
- Korean land-cover mapping from aerial or satellite ortho-imagery.
- Change-detection pipelines (run the model on two epochs and diff the masks).
- Urban-planning, agriculture, and forestry analytics that need per-pixel Korean class labels.
- Research baseline for comparing other segmentation heads against TCAM.
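For the change-detection use case, diffing two masks can be as simple as counting class transitions per pixel. The sketch below assumes the two masks are co-registered (same footprint and resolution) and come from the same model; `class_transitions` and the toy masks are hypothetical and only illustrate the idea.

```python
from collections import Counter


def class_transitions(mask_t0, mask_t1):
    """Count (old_id, new_id) pairs for pixels whose class changed between two masks."""
    changes = Counter()
    for row0, row1 in zip(mask_t0, mask_t1):
        for old, new in zip(row0, row1):
            if old != new:
                changes[(int(old), int(new))] += 1
    return changes


# Hypothetical 2x3 masks: forest (8) and bare land (9) pixels become building (1).
before = [[8, 8, 9],
          [5, 5, 9]]
after = [[8, 1, 1],
         [5, 5, 9]]

changes = class_transitions(before, after)
num_pixels = sum(len(row) for row in before)
changed_fraction = sum(changes.values()) / num_pixels
print(dict(changes), changed_fraction)
```

In practice, small misalignments between acquisitions tend to produce spurious changes along class boundaries, so some filtering of thin or isolated change regions is usually worthwhile before interpreting the counts.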
Out-of-Scope Use
- Commercial use. This release is governed by CC-BY-NC-4.0; do not use it in revenue-generating products or services.
- Imagery from regions or sensors that differ substantially from the Korean ortho-imagery training distribution (expect degraded accuracy).
- Sole-source decision-making in legal, regulatory, or safety-critical contexts.
- Any analysis that infringes on personal privacy, property rights, or applicable geospatial-data regulations.
License
This model is released under the Creative Commons Attribution-NonCommercial 4.0 International (CC-BY-NC-4.0) license.
- Free to use, share, and adapt for non-commercial purposes with attribution.
- Not permitted for commercial use. Contact the authors for a commercial license.
- Provided "as is" without warranties of any kind.
Citation
If you use AIOne-GeoSeg in your research, please cite:
@misc{aione_geoseg_330m,
title = {AIOne-GeoSeg-330M: A DINOv3 Vision Transformer with TCAM Head for Korean Land-Cover Segmentation},
author = {JDONE Research},
year = {2026},
howpublished = {\url{https://huggingface.co/JDONE-Research/AIOne-GeoSeg-330M}}
}
A paper describing the TCAM head, training procedure, and full ablations will be released soon; citation details will be updated here when available.