D-FINE-S — LiteRT (CompiledModel GPU)
This is D-FINE-S (USTC, 2024; ustc-community/dfine-small-coco), a state-of-the-art real-time DETR,
converted to LiteRT. It runs 100% on the CompiledModel GPU (ML Drift) on a phone, with no CPU/ONNX
fallback.
D-FINE is a transformer detector: an HGNetV2 backbone + a hybrid AIFI/CCFM encoder + an FDR (Fine-grained
Distribution Refinement) decoder. Off the shelf it is GPU-incompatible: the deformable grid_sample lowers to
GATHER_ND, and two-stage query selection lowers to TOPK/GATHER. Here it is converted with litert-torch and
split into two GPU graphs with a host step between them, so both transformer graphs run on the GPU.
Files
| File | What it is | Size (fp16) |
|---|---|---|
| dfine_graphA_fp16.tflite | HGNetV2 backbone + hybrid encoder + score head → enc_class[1,8400,80], memory_raw[1,8400,256] | 13.0 MB |
| dfine_graphB_fp16.tflite | two-stage combine + FDR decoder + heads → boxes[1,300,4] (cxcywh), logits[1,300,80] | 8.8 MB |
| host_params.bin | host per-token tail weights (enc_output + enc_bbox_head), valid mask, anchors (fp32) | 0.9 MB |
| coco_labels.txt | 80 contiguous COCO class names (id 0–79) | - |
How it runs (two-graph split)
image[1,3,640,640]
→[GPU Graph A]→ enc_class, memory_raw
→[host: top-300 by max class score; per-token tail on the 300 selected (fp32):
target = enc_output(valid·memory_raw) (Linear + LayerNorm)
ref = enc_bbox_head(target) + anchors (3-layer MLP)]
→[GPU Graph B (memory_raw, target, ref)]→ boxes[1,300,4], logits[1,300,80]
→[host: sigmoid + threshold + cxcywh→xyxy + light NMS]→ detections
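The host step between the two graphs can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the shipped implementation: the parameter names in `params` are hypothetical, the weights are random stand-ins (the layout of host_params.bin is not described here), the ReLU activations in the 3-layer MLP are an assumption, and the batch dimension of 1 is dropped.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Standard LayerNorm over the last (channel) axis.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

def host_step(enc_class, memory_raw, valid, anchors, p, k=300):
    """enc_class [N,C], memory_raw [N,D], valid [N,1], anchors [N,4].
    p: dict of fp32 weights (hypothetical names). Returns idx, target, ref."""
    scores = enc_class.max(axis=-1)                    # best class score per token
    idx = np.argsort(-scores, kind="stable")[:k]       # top-k token indices
    mem = (valid[idx] * memory_raw[idx]).astype(np.float32)
    # enc_output: Linear + LayerNorm, applied only to the k selected tokens.
    target = layer_norm(mem @ p["w_out"] + p["b_out"], p["ln_g"], p["ln_b"])
    # enc_bbox_head: 3-layer MLP (ReLU between layers is an assumption).
    h = np.maximum(target @ p["w1"] + p["b1"], 0.0)
    h = np.maximum(h @ p["w2"] + p["b2"], 0.0)
    ref = h @ p["w3"] + p["b3"] + anchors[idx]
    return idx, target, ref

# Random stand-ins with the shapes from the pipeline above.
rng = np.random.default_rng(0)
N, C, D = 8400, 80, 256
f32 = lambda *s: rng.standard_normal(s).astype(np.float32)
params = {
    "w_out": f32(D, D) / 16, "b_out": f32(D),
    "ln_g": np.ones(D, np.float32), "ln_b": np.zeros(D, np.float32),
    "w1": f32(D, D) / 16, "b1": f32(D),
    "w2": f32(D, D) / 16, "b2": f32(D),
    "w3": f32(D, 4) / 16, "b3": f32(4),
}
enc_class, memory_raw = f32(N, C), f32(N, D)
valid, anchors = np.ones((N, 1), np.float32), f32(N, 4)
idx, target, ref = host_step(enc_class, memory_raw, valid, anchors, params)
```

Graph B would then receive `memory_raw`, `target`, and `ref`, each with the batch dimension restored.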
The on-device gate — a Mali 3D-sequence fan-out bug (NOT the FDR decoder)
A naïve Graph A (emitting enc_class/enc_coord/output_memory/memory_raw together) gave 0 detections
on device, and it first looked like the FDR decoder collapsing in fp16. That was a red herring. The real
cause is a Mali delegate bug: a 3-D token tensor [1,N,256] (from conv.flatten(2).transpose(1,2)) that
is both a graph output and consumed by another node — or that fans out to several consumers — is silently
clobbered on the longer branch (4-D conv-map outputs are fine). Here the raw memory output (Graph B's
cross-attention input) was garbage (device corr −0.02) → the decoder cross-attended to noise → no detections.
Fix: Graph A emits only the two fp16-clean leaves (enc_class + memory_raw×2), and the per-token tail
(enc_output + enc_bbox_head) runs on the host over the 300 selected tokens. This is exact, since per-token
ops commute with the gather. With clean memory the FDR decoder behaves correctly. Note that correlation is
not the ship criterion; real-image detection IoU is.
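The claim that per-token ops commute with the gather can be checked numerically. The sketch below uses a random Linear + LayerNorm as a stand-in for the per-token tail (the weights and the selected indices are illustrative, not taken from the model) and compares "apply to all tokens, then gather" with "gather, then apply to the subset". The two agree up to floating-point rounding.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 8400, 256, 300
memory = rng.standard_normal((N, D)).astype(np.float32)
W = (rng.standard_normal((D, D)) / np.sqrt(D)).astype(np.float32)
b = rng.standard_normal(D).astype(np.float32)

def per_token(x):
    # Linear + LayerNorm: each output row depends only on the same input row.
    y = x @ W + b
    mu = y.mean(axis=-1, keepdims=True)
    var = y.var(axis=-1, keepdims=True)
    return (y - mu) / np.sqrt(var + 1e-5)

idx = rng.choice(N, size=K, replace=False)      # stand-in for the top-300 indices
all_then_gather = per_token(memory)[idx]        # tail over 8400 tokens, then select
gather_then_tail = per_token(memory[idx])       # select, then tail over 300 tokens
max_diff = float(np.abs(all_then_gather - gather_then_tail).max())
print(f"max abs difference: {max_diff:.2e}")
```

Running the tail after the gather also means it touches 300 tokens instead of 8400.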
On-device (Pixel 8a, Tensor G3 — verified)
Both graphs run 100% GPU-resident (LITERT_CL): Graph A 511/511, Graph B 850/850. On a COCO val image
(giraffe + cows) the device chain reproduces the PyTorch detections at IoU 0.99–1.00 with matching class
and score. End-to-end latency is ~450 ms/frame: accurate and fully on the GPU, but not real-time on this
device. The deformable decoder over the 8400 tokens / 80×80 levels is GPU-compute-bound; the GATHER-free
tent-matmul grid_sample turns an O(points) gather into an O(H·W) matmul. For a real-time camera DETR see
RF-DETR Nano.
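To illustrate why the GATHER-free sampling costs O(H·W) per point, here is a minimal NumPy sketch of bilinear sampling written as a matmul with tent (hat) weights, max(0, 1 − |x − i|), checked against a conventional gather-based version. It works in pixel coordinates with zero padding; the mapping from grid_sample's normalized coordinates and the exact formulation used in the converted model are not shown here and may differ.

```python
import numpy as np

def tent_weights(coord, size):
    # coord: [P] pixel coordinates. Returns [P, size]; only the two
    # neighbouring integer positions get non-zero (bilinear) weights.
    grid = np.arange(size, dtype=np.float32)
    return np.maximum(0.0, 1.0 - np.abs(coord[:, None] - grid[None, :]))

def sample_matmul(feat, px, py):
    # feat: [H, W, C]. Dense [P, H*W] weight matrix times flattened features:
    # no gather, but every point pays for all H*W positions.
    H, W, C = feat.shape
    wy, wx = tent_weights(py, H), tent_weights(px, W)
    weights = (wy[:, :, None] * wx[:, None, :]).reshape(len(px), H * W)
    return weights @ feat.reshape(H * W, C)

def sample_gather(feat, px, py):
    # Reference: classic 4-corner bilinear gather with zero padding.
    H, W, C = feat.shape
    x0, y0 = np.floor(px).astype(int), np.floor(py).astype(int)
    out = np.zeros((len(px), C), dtype=np.float32)
    for dy in (0, 1):
        for dx in (0, 1):
            xi, yi = x0 + dx, y0 + dy
            w = (1 - np.abs(px - xi)) * (1 - np.abs(py - yi))
            inside = (xi >= 0) & (xi < W) & (yi >= 0) & (yi < H)
            vals = feat[np.clip(yi, 0, H - 1), np.clip(xi, 0, W - 1)]
            out += (w * inside)[:, None] * vals
    return out

rng = np.random.default_rng(0)
H, W, C, P = 20, 20, 8, 64
feat = rng.standard_normal((H, W, C)).astype(np.float32)
px = rng.uniform(-1, W, P).astype(np.float32)   # includes out-of-bounds points
py = rng.uniform(-1, H, P).astype(np.float32)
dense = sample_matmul(feat, px, py)
ref = sample_gather(feat, px, py)
```

The weight matrix has H·W columns per sampling point, which is the cost the latency figure above reflects.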
Preprocessing / outputs
- Input: square resize to 640×640, RGB, [0,1] rescale only (no ImageNet normalization), NCHW.
- Output: Graph B boxes are cxcywh normalized to [0,1]; logits are 80-way (contiguous COCO id 0–79). Host applies sigmoid + score threshold + cxcywh→xyxy + light NMS.
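The preprocessing and host post-processing described above can be sketched in NumPy. This is an illustration under assumptions, not the sample app's code: the thresholds (`score_thr=0.5`, `iou_thr=0.7`) are placeholder values, taking the best class per query and making NMS class-aware are both assumptions about what "light NMS" means, and the input image is assumed to be already resized to 640×640.

```python
import numpy as np

def preprocess(rgb_640):
    # rgb_640: uint8 [640,640,3], already square-resized. Rescale to [0,1], NCHW.
    x = rgb_640.astype(np.float32) / 255.0
    return x.transpose(2, 0, 1)[None]            # [1,3,640,640]

def iou(a, b):
    # IoU of two xyxy boxes.
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def postprocess(boxes, logits, img_w, img_h, score_thr=0.5, iou_thr=0.7):
    """boxes [300,4] cxcywh in [0,1], logits [300,80] (batch dim dropped)."""
    prob = 1.0 / (1.0 + np.exp(-logits))          # sigmoid, not softmax
    cls, score = prob.argmax(axis=-1), prob.max(axis=-1)
    cx, cy, w, h = boxes.T
    xyxy = np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=1)
    xyxy = xyxy * np.array([img_w, img_h, img_w, img_h], dtype=np.float32)
    selected = []
    for i in np.argsort(-score):                  # highest score first
        if score[i] < score_thr:
            break
        # Greedy NMS: drop a box that overlaps a kept box of the same class.
        if all(cls[i] != cls[j] or iou(xyxy[i], xyxy[j]) <= iou_thr for j in selected):
            selected.append(i)
    sel = np.array(selected, dtype=np.int64)
    return xyxy[sel], score[sel], cls[sel]

# Tiny example: two near-duplicate boxes of class 3 and one box of class 7.
boxes = np.array([[0.5, 0.5, 0.2, 0.2], [0.51, 0.5, 0.2, 0.2], [0.2, 0.2, 0.1, 0.1]])
logits = np.full((3, 80), -10.0)
logits[0, 3], logits[1, 3], logits[2, 7] = 3.0, 2.0, 1.0
det_boxes, det_scores, det_classes = postprocess(boxes, logits, 640, 640)
```

Because the resize is square (no letterboxing), scaling normalized boxes by the original image width and height maps them back to source pixels.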
Conversion notes
Converted with litert-torch (NCHW preserved — onnx2tf destroys ViT attention). Re-authoring (per-graph
tflite-vs-torch correlation 1.0): deformable grid_sample → a GATHER/CAST-free tent-matmul, multi-level
MSDeformAttn ≤4D, the FDR LQE prob.topk → iterative max-and-mask, distance2bbox stack→cat, baked AIFI
sine pos-embed, a down-scaled fp16-safe LayerNorm, and the 3D-fan-out fix above (emit clean leaves +
host-side per-token tail).
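The "iterative max-and-mask" replacement for topk can be sketched as below. This is a values-only NumPy illustration, assuming only the top-k values (not the indices) are needed and that the entries along the last axis are distinct; tied maxima would be masked together in one step. The fill value, the 33-bin width, and k=4 are illustrative choices, not taken from the converted model.

```python
import numpy as np

def topk_values_max_and_mask(x, k, fill=-1e4):
    # Repeatedly take the max along the last axis, then overwrite it with a
    # large negative fill so the next pass finds the next-largest value.
    # Uses only max / compare / select: no TOPK or GATHER op.
    # fill must be below every real input; -1e4 is representable in fp16.
    work = x.astype(np.float32).copy()
    out = []
    for _ in range(k):
        m = work.max(axis=-1, keepdims=True)      # current maximum, [..., 1]
        out.append(m)
        work = np.where(work == m, fill, work)    # mask it out
    return np.concatenate(out, axis=-1)           # [..., k], descending

rng = np.random.default_rng(0)
prob = rng.random((5, 33)).astype(np.float32)     # stand-in for per-bin probabilities
top4 = topk_values_max_and_mask(prob, 4)
expected = -np.sort(-prob, axis=-1)[:, :4]        # reference via full sort
```

The loop unrolls into k max-reduce and select ops at conversion time, so the cost grows linearly with k.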
A runnable Android sample (CompiledModel GPU) and the conversion scripts are in the official
ai-edge-litert/litert-samples object_detection example.
License
Apache-2.0, inherited from Peterande/D-FINE.
Model tree for litert-community/D-FINE-S-LiteRT
Base model
ustc-community/dfine-small-coco