D-FINE-S — LiteRT (CompiledModel GPU)

D-FINE-S on a Pixel 8a — both transformer graphs on CompiledModel GPU

D-FINE (USTC, 2024; ustc-community/dfine-small-coco), a state-of-the-art real-time DETR, converted to LiteRT and running entirely on the CompiledModel GPU backend (ML Drift) on a phone, with no CPU/ONNX fallback.

D-FINE is a transformer detector: an HGNetV2 backbone, a hybrid AIFI/CCFM encoder, and an FDR (Fine-grained Distribution Refinement) decoder. Off the shelf it is not GPU-compatible, because the deformable grid_sample lowers to GATHER_ND and the two-stage query selection lowers to TOPK/GATHER. Here it is converted with litert-torch and split into two GPU graphs with a host step between them, so both transformer graphs run on the GPU.

Files

| File | What it is | Size (fp16) |
|---|---|---|
| `dfine_graphA_fp16.tflite` | HGNetV2 backbone + hybrid encoder + score head → `enc_class[1,8400,80]`, `memory_raw[1,8400,256]` | 13.0 MB |
| `dfine_graphB_fp16.tflite` | two-stage combine + FDR decoder + heads → `boxes[1,300,4]` (cxcywh), `logits[1,300,80]` | 8.8 MB |
| `host_params.bin` | host per-token tail weights (`enc_output` + `enc_bbox_head`), valid mask, anchors (fp32) | 0.9 MB |
| `coco_labels.txt` | 80 contiguous COCO class names (id 0–79) | n/a |

How it runs (two-graph split)

image[1,3,640,640]
  →[GPU Graph A]→ enc_class, memory_raw
  →[host: top-300 by max class score; per-token tail on the 300 selected (fp32):
          target = enc_output(valid·memory_raw)   (Linear + LayerNorm)
          ref    = enc_bbox_head(target) + anchors (3-layer MLP)]
  →[GPU Graph B  (memory_raw, target, ref)]→ boxes[1,300,4], logits[1,300,80]
  →[host: sigmoid + threshold + cxcywh→xyxy + light NMS]→ detections
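
The host step between the two graphs can be sketched in NumPy as below. This is a minimal sketch, not the shipped implementation: the parameter names (`proj_w`, `w1`, ...), the ReLU activations in the MLP, and the random demo weights are all assumptions made for illustration. The real weights live in `host_params.bin`, whose binary layout is not described here.

```python
import numpy as np

NUM_QUERIES = 300  # top-300 queries, as in the diagram above


def layer_norm(x, gamma, beta, eps=1e-5):
    """LayerNorm over the last axis."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gamma + beta


def host_step(enc_class, memory_raw, valid_mask, anchors, p):
    """enc_class [1,N,80], memory_raw [1,N,256], valid_mask [1,N,1], anchors [1,N,4]."""
    scores = enc_class[0].max(axis=-1)                      # best class score per token
    top = np.argsort(-scores, kind="stable")[:NUM_QUERIES]  # indices of the top-300 tokens
    mem = (valid_mask[0, top] * memory_raw[0, top]).astype(np.float32)
    # enc_output: Linear + LayerNorm (hypothetical parameter names)
    target = layer_norm(mem @ p["proj_w"] + p["proj_b"], p["ln_g"], p["ln_b"])
    # enc_bbox_head: 3-layer MLP (ReLU between layers is an assumption)
    h = np.maximum(target @ p["w1"] + p["b1"], 0.0)
    h = np.maximum(h @ p["w2"] + p["b2"], 0.0)
    ref = h @ p["w3"] + p["b3"] + anchors[0, top]
    return top, target[None], ref[None]


# Demo with random stand-ins for Graph A outputs and for host_params.bin.
rng = np.random.default_rng(0)
N, D = 8400, 256


def rand(*shape):
    return (rng.standard_normal(shape) * 0.05).astype(np.float32)


params = {
    "proj_w": rand(D, D), "proj_b": rand(D),
    "ln_g": np.ones(D, np.float32), "ln_b": np.zeros(D, np.float32),
    "w1": rand(D, D), "b1": rand(D),
    "w2": rand(D, D), "b2": rand(D),
    "w3": rand(D, 4), "b3": rand(4),
}
enc_class = rand(1, N, 80)
memory_raw = rand(1, N, D)
valid_mask = np.ones((1, N, 1), np.float32)
anchors = rand(1, N, 4)

top, target, ref = host_step(enc_class, memory_raw, valid_mask, anchors, params)
print(target.shape, ref.shape)
```

`target` and `ref` are then fed to Graph B together with the unmodified `memory_raw`.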

The on-device gate — a Mali 3D-sequence fan-out bug (NOT the FDR decoder)

A naïve Graph A that emitted enc_class, enc_coord, output_memory, and memory_raw together produced 0 detections on device. At first this looked like the FDR decoder collapsing in fp16, but that was a red herring. The real cause is a Mali delegate bug: a 3-D token tensor [1,N,256] (from conv.flatten(2).transpose(1,2)) that is both a graph output and an input to another node, or that fans out to several consumers, is silently clobbered on the longer branch. 4-D conv-map outputs are unaffected. Here the raw memory output, which is Graph B's cross-attention input, came back as garbage (device corr −0.02), so the decoder cross-attended to noise and produced no detections.

Fix: Graph A emits only the two fp16-clean leaves (enc_class + memory_raw×2), and the per-token tail (enc_output + enc_bbox_head) runs on the host over the 300 selected tokens. This is exact, because per-token ops commute with the gather. With clean memory the FDR decoder works correctly. Note that correlation is not the ship criterion; real-image detection IoU is.
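
The claim that per-token ops commute with the gather can be checked numerically. The sketch below uses a random Linear + LayerNorm as a stand-in for the per-token tail (the weights and the index set are made up for the check) and compares "apply to all 8400 tokens, then gather 300" against "gather 300, then apply".

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 8400, 256, 300

memory = rng.standard_normal((N, D)).astype(np.float32)
W = (rng.standard_normal((D, D)) / np.sqrt(D)).astype(np.float32)
b = rng.standard_normal(D).astype(np.float32)


def per_token(x):
    """Linear + LayerNorm; every row (token) is processed independently."""
    y = x @ W + b
    mu = y.mean(axis=-1, keepdims=True)
    var = y.var(axis=-1, keepdims=True)
    return (y - mu) / np.sqrt(var + 1e-5)


idx = rng.choice(N, size=K, replace=False)  # stand-in for the top-300 indices

all_then_gather = per_token(memory)[idx]   # what an in-graph tail would compute
gather_then_all = per_token(memory[idx])   # what the host-side tail computes

max_abs_diff = float(np.abs(all_then_gather - gather_then_all).max())
print(max_abs_diff)
```

The two results agree up to floating-point rounding, while the host-side version touches only 300 rows instead of 8400.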

On-device (Pixel 8a, Tensor G3 — verified)

Both graphs run 100% GPU-resident (LITERT_CL): Graph A 511/511, Graph B 850/850. On a COCO val image (giraffe + cows) the device chain reproduces the PyTorch detections at IoU 0.99–1.00 with matching class and score. End-to-end latency is ~450 ms/frame: accurate and fully on GPU, but not real-time on this device. The deformable decoder over the 8400 tokens (80×80 + 40×40 + 20×20 feature levels) is GPU-compute-bound, because the GATHER-free tent-matmul grid_sample replaces an O(points) gather with an O(H·W) matmul. For a real-time camera DETR see RF-DETR Nano.
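
To illustrate why the tent-matmul costs O(H·W) per sampling point, here is a minimal NumPy sketch of the idea, not the converted graph itself. It assumes sampling locations are already in pixel coordinates and that out-of-range samples read as zero. Bilinear weights are written as a "tent" function max(0, 1 − |t|) evaluated against every row and column index, so the lookup becomes a dense matmul with no gather. The function names are hypothetical.

```python
import numpy as np


def bilinear_gather(feat, x, y):
    """Reference: classic 4-neighbour bilinear sampling with zero padding."""
    H, W, C = feat.shape
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    out = np.zeros((len(x), C), dtype=np.float64)
    for dy in (0, 1):
        for dx in (0, 1):
            xi, yi = x0 + dx, y0 + dy
            w = (1.0 - np.abs(x - xi)) * (1.0 - np.abs(y - yi))
            ok = (xi >= 0) & (xi < W) & (yi >= 0) & (yi < H)
            out[ok] += w[ok, None] * feat[yi[ok], xi[ok]]
    return out


def bilinear_tent_matmul(feat, x, y):
    """Gather-free variant: dense tent weights [P, H*W] times features [H*W, C]."""
    H, W, C = feat.shape
    wy = np.maximum(0.0, 1.0 - np.abs(y[:, None] - np.arange(H)[None, :]))  # [P, H]
    wx = np.maximum(0.0, 1.0 - np.abs(x[:, None] - np.arange(W)[None, :]))  # [P, W]
    weights = (wy[:, :, None] * wx[:, None, :]).reshape(len(x), H * W)
    return weights @ feat.reshape(H * W, C)


rng = np.random.default_rng(0)
H, W, C, P = 20, 20, 8, 64
feat = rng.standard_normal((H, W, C)).astype(np.float32)
x = rng.uniform(-1.5, W + 0.5, size=P)  # includes out-of-range points
y = rng.uniform(-1.5, H + 0.5, size=P)

ref_out = bilinear_gather(feat, x, y)
tent_out = bilinear_tent_matmul(feat, x, y)
print(np.abs(ref_out - tent_out).max())
```

Each point has at most four non-zero weights, but the matmul still multiplies through all H·W entries. That is the price of avoiding GATHER_ND, and it grows with the 80×80 level.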

Preprocessing / outputs

Input: square resize to 640×640, RGB, [0,1] rescale only (no ImageNet normalization), NCHW.
Output: Graph B boxes are cxcywh normalized to [0,1]; logits are 80-way (contiguous COCO id 0–79). Host applies sigmoid + score threshold + cxcywh→xyxy + light NMS.
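
A hedged NumPy sketch of the pre- and post-processing described above. The thresholds (`score_thr`, `iou_thr`), the per-query arg-max class choice, and the class-aware greedy NMS are assumptions for illustration; the sample app may pick different values or a different NMS variant. Because the input is a plain square resize, normalized boxes map back to the original image by scaling with its width and height.

```python
import numpy as np


def preprocess(rgb_640):
    """uint8 HWC image already resized to 640x640 -> float32 NCHW in [0,1]."""
    return (rgb_640.astype(np.float32) / 255.0).transpose(2, 0, 1)[None]


def iou(a, b):
    """IoU of two xyxy boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def postprocess(boxes, logits, img_w, img_h, score_thr=0.5, iou_thr=0.7):
    """boxes [1,Q,4] cxcywh in [0,1], logits [1,Q,80] -> list of (x1,y1,x2,y2,score,cls)."""
    prob = 1.0 / (1.0 + np.exp(-logits[0].astype(np.float32)))  # sigmoid
    cls = prob.argmax(axis=-1)
    score = prob.max(axis=-1)
    keep = score >= score_thr
    cls, score = cls[keep], score[keep]
    cx, cy, w, h = boxes[0][keep].T
    xyxy = np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=-1)
    xyxy = xyxy * np.array([img_w, img_h, img_w, img_h], dtype=np.float64)
    kept = []
    for i in np.argsort(-score):  # greedy, highest score first
        if all(cls[i] != cls[j] or iou(xyxy[i], xyxy[j]) <= iou_thr for j in kept):
            kept.append(i)
    return [(*xyxy[i], float(score[i]), int(cls[i])) for i in kept]


# Tiny synthetic check: two near-duplicate boxes of class 3 and one low-score box.
boxes = np.array([[[0.50, 0.5, 0.2, 0.2],
                   [0.51, 0.5, 0.2, 0.2],
                   [0.20, 0.2, 0.1, 0.1]]], dtype=np.float32)
logits = np.full((1, 3, 80), -10.0, dtype=np.float32)
logits[0, 0, 3] = 4.0
logits[0, 1, 3] = 2.0

dets = postprocess(boxes, logits, img_w=640, img_h=480)
inp = preprocess(np.zeros((640, 640, 3), dtype=np.uint8))
print(dets, inp.shape)
```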

Conversion notes

Converted with litert-torch, which preserves NCHW (onnx2tf destroys ViT attention). The following parts were re-authored, with a per-graph tflite-vs-torch correlation of 1.0: deformable grid_sample as a GATHER/CAST-free tent-matmul; multi-level MSDeformAttn kept to ≤4D tensors; the FDR LQE prob.topk as an iterative max-and-mask; distance2bbox with stack replaced by cat; a baked AIFI sine pos-embed; a down-scaled, fp16-safe LayerNorm; and the 3D-fan-out fix above (emit clean leaves + host-side per-token tail).
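
As an illustration of the "iterative max-and-mask" replacement for topk, here is a minimal NumPy sketch. It is one plausible way to obtain the k largest values with only max, compare, cumulative-sum and select ops. The first-occurrence tie handling and the `mask_value` sentinel are assumptions, not necessarily what the converted graph does.

```python
import numpy as np


def topk_values_max_and_mask(x, k, mask_value=-np.inf):
    """Return the k largest values along the last axis, in descending order.

    Each round takes the row-wise max, then masks out exactly one occurrence
    of it (the first), so duplicated values are still counted separately.
    """
    work = x.astype(np.float32).copy()
    vals = []
    for _ in range(k):
        m = work.max(axis=-1, keepdims=True)
        vals.append(m)
        is_max = work == m
        first = is_max & (np.cumsum(is_max, axis=-1) == 1)  # first max per row
        work = np.where(first, mask_value, work)
    return np.concatenate(vals, axis=-1)


rng = np.random.default_rng(0)
prob = rng.random((2, 5, 33)).astype(np.float32)  # made-up shape for the demo
got = topk_values_max_and_mask(prob, k=4)
want = np.sort(prob, axis=-1)[..., ::-1][..., :4]

ties = np.array([[0.1, 0.5, 0.5, 0.2]], dtype=np.float32)
got_ties = topk_values_max_and_mask(ties, k=3)
print(got.shape, got_ties)
```

The loop unrolls into k rounds of elementwise ops, which is practical only when k is small.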

A runnable Android sample (CompiledModel GPU) and the conversion scripts are in the official ai-edge-litert/litert-samples object_detection example.

License

Apache-2.0, inherited from Peterande/D-FINE.
