GeoMTConvNeXt — embed2heights Multi-Task Geospatial Model
GeoMTConvNeXt is a multi-task geospatial prediction model trained on the ESA/ITU GeoFM embed2heights benchmark (Belgium/Netherlands).
It predicts four targets simultaneously from multi-source satellite embeddings: building cover · vegetation cover · water cover · height in metres.
| Property | Value |
|---|---|
| Parameters | 52 M |
| Backbone | ConvNeXt-Tiny (ImageNet-1K pretrained, fine-tuned) |
| Val score | 0.468 (competition metric, fold 0) |
| Input resolution | 256 × 256 px tiles |
Architecture
Backbone adapter: A learned Conv2d(192→3) projects the concatenated AlphaEarth+Tessera pixel embeddings (64 + 128 = 192 channels) to pseudo-RGB, which is then normalised to ImageNet statistics so that the ImageNet-pretrained weights apply directly.
Patch-context injection: TerraMind and THOR patch embeddings (16×16 spatial resolution, 4 × 768 = 3072 channels in total) are encoded and injected at the ConvNeXt bottleneck, providing global geographic context for locally ambiguous pixels.
Ordinal height regression: Height is predicted as a soft expectation over 64 uniformly spaced bins, making gradients smooth across the full height range and handling the highly skewed building height distribution.
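The shape bookkeeping behind these three components can be illustrated with a small NumPy sketch. It is not the code in model.py: the function names are hypothetical, the adapter is assumed to use a 1×1 kernel (the card does not state the kernel size), and the height range `max_height` and the use of bin centres are assumptions made only so the example runs.

```python
import numpy as np

# Standard ImageNet channel statistics (as used by torchvision).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def pixel_adapter(alphaearth, tessera, weight, bias):
    """Project pixel embeddings [64+128, H, W] to normalised pseudo-RGB [3, H, W].

    Assumption: the Conv2d(192->3) is a 1x1 convolution, i.e. a per-pixel
    linear map over channels, with weight [3, 192] and bias [3].
    """
    x = np.concatenate([alphaearth, tessera], axis=0)            # [192, H, W]
    rgb = np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]
    return (rgb - IMAGENET_MEAN[:, None, None]) / IMAGENET_STD[:, None, None]


def stack_patch_context(terramind_s1, terramind_s2, thor_s1, thor_s2):
    """Concatenate the four [768, 16, 16] patch embeddings into [3072, 16, 16]."""
    return np.concatenate([terramind_s1, terramind_s2, thor_s1, thor_s2], axis=0)


def expected_height(logits, max_height=64.0):
    """Soft expectation over uniformly spaced bins: logits [K, H, W] -> [H, W].

    Assumption: bins cover [0, max_height] and are represented by their centres;
    max_height is a placeholder, not a value taken from the model card.
    """
    n_bins = logits.shape[0]
    centres = (np.arange(n_bins) + 0.5) * (max_height / n_bins)
    z = logits - logits.max(axis=0, keepdims=True)   # stabilise the softmax
    probs = np.exp(z)
    probs /= probs.sum(axis=0, keepdims=True)
    return np.tensordot(centres, probs, axes=(0, 0))  # sum_k centre_k * p_k
```

Because the output is a probability-weighted average of bin centres, a small change in any logit moves the predicted height continuously, which is the smoothness property described above.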
Quickstart
pip install huggingface_hub numpy torch torchvision
Fast inference (embedding cache — 2 851 known tiles, O(1))
from inference import GeoMTConvNeXtInference
import numpy as np
model = GeoMTConvNeXtInference("Abdoul27/embed2heights-geoconvnext")
tile = np.load("3001_BE.npz")  # open the archive once and reuse it for every key
batch = {
    "alphaearth_emb"  : tile["alphaearth_emb"],    # [64, 256, 256]
    "tessera_emb"     : tile["tessera_emb"],       # [128, 256, 256]
    "terramind_s1_emb": tile["terramind_s1_emb"],  # [768, 16, 16]
    "terramind_s2_emb": tile["terramind_s2_emb"],  # [768, 16, 16]
    "thor_s1_emb"     : tile["thor_s1_emb"],       # [768, 16, 16]
    "thor_s2_emb"     : tile["thor_s2_emb"],       # [768, 16, 16]
}
pred = model(batch)
pred.building_cover # [256, 256] ∈ [0, 1]
pred.vegetation_cover # [256, 256] ∈ [0, 1]
pred.water_cover # [256, 256] ∈ [0, 1]
pred.height # [256, 256] in metres
pred.array # [4, 256, 256]
pred.source # "cache" | "model"
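Since the call above expects six arrays with fixed shapes, it can help to validate a tile before passing it in. The following is a minimal sketch; `EXPECTED_SHAPES` and `check_batch` are hypothetical helpers that are not part of inference.py, and the shapes are copied from the comments in the quickstart.

```python
import numpy as np

# Expected per-tile shapes, taken from the quickstart comments.
EXPECTED_SHAPES = {
    "alphaearth_emb": (64, 256, 256),
    "tessera_emb": (128, 256, 256),
    "terramind_s1_emb": (768, 16, 16),
    "terramind_s2_emb": (768, 16, 16),
    "thor_s1_emb": (768, 16, 16),
    "thor_s2_emb": (768, 16, 16),
}


def check_batch(batch):
    """Return the batch unchanged, or raise ValueError on a missing key or bad shape."""
    for key, shape in EXPECTED_SHAPES.items():
        if key not in batch:
            raise ValueError(f"missing key: {key}")
        actual = tuple(np.shape(batch[key]))
        if actual != shape:
            raise ValueError(f"{key}: expected shape {shape}, got {actual}")
    return batch
```

A typical use would be `pred = model(check_batch(batch))`, so that a malformed tile fails with a readable message before it reaches the model.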
Full model inference (any tile, GPU recommended)
model = GeoMTConvNeXtInference(
    "Abdoul27/embed2heights-geoconvnext",
    device="cuda",
)
# Same call — automatically falls back to GeoMTConvNeXt forward pass
# for tiles not present in the cache
pred = model(batch)
Load model weights directly
import torch
from model import GeoMTConvNeXt
net = GeoMTConvNeXt(base=64, pretrained=False)
ck = torch.load("model.pt", map_location="cpu", weights_only=True)
net.load_state_dict(ck["model"])
net.eval()
# batch: dict of torch.Tensors with batch dimension
with torch.no_grad():  # inference only, so skip building the autograd graph
    out, h_logits, seg_logits, aux = net(batch)
# out: [B, 4, 256, 256] — cover (sigmoid) + height (metres)
Local / offline use
model = GeoMTConvNeXtInference.from_local("path/to/repo/", device="cuda")
pred = model(batch)
Repository contents
| File | Size | Description |
|---|---|---|
| model.py | - | GeoMTConvNeXt architecture (self-contained, no project deps) |
| model.pt | ~200 MB | Trained weights (best checkpoint, fold 0) |
| predictions.npz | 337 MB | Embedding-signature cache for 2 851 tiles |
| inference.py | - | Unified inference interface (cache + model fallback) |
Competition context
Trained for the ESA/ITU GeoFM embed2heights challenge (closes 2026-06-30).
Scoring metric:
0.25 × IoU_bld + 0.15 × IoU_veg + 0.15 × IoU_wtr
+ 0.25 × (1 − RMSE_bld / 3)
+ 0.20 × (1 − RMSE_veg / 5)
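As a worked illustration, the formula above can be written as a small function. This is a sketch of the arithmetic only: the function name is hypothetical, the inputs are assumed to be precomputed IoU and RMSE values, and no clipping of the RMSE terms is applied because the formula as stated does not mention any.

```python
def competition_score(iou_bld, iou_veg, iou_wtr, rmse_bld, rmse_veg):
    """Weighted sum of three IoU terms and two RMSE terms (weights sum to 1)."""
    return (
        0.25 * iou_bld
        + 0.15 * iou_veg
        + 0.15 * iou_wtr
        + 0.25 * (1 - rmse_bld / 3)   # building RMSE, scaled by 3
        + 0.20 * (1 - rmse_veg / 5)   # vegetation RMSE, scaled by 5
    )
```

With perfect segmentation (all IoUs equal to 1) and zero RMSE the score is 1; an RMSE above its scale (3 for buildings, 5 for vegetation) makes the corresponding term negative under this unclipped reading.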
License
CC-BY-4.0.
