RF-DETR Nano: LiteRT (CompiledModel GPU)

RF-DETR Nano on a Pixel 8a: both transformer graphs on CompiledModel GPU

RF-DETR (Roboflow 2025, an LW-DETR derivative) object detection, converted to LiteRT and running entirely on the CompiledModel GPU (ML Drift) on a phone. It is the first transformer/DETR detector to run on the LiteRT GPU API with no CPU/ONNX fallback.

RF-DETR is a transformer detector (windowed DINOv2-S backbone + deformable-attention DETR decoder). Off-the-shelf it is GPU-incompatible (deformable grid_sample → GATHER_ND, windowed attention → 5D/6D tensors, two-stage query selection → TOPK/GATHER). Here it is converted with litert-torch and split into two GPU graphs with a tiny host step between them, so the whole detector runs on the GPU.
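To illustrate the grid_sample point: bilinear sampling normally gathers four neighbouring pixels per sample, but the same result can be written as a dense matmul against "tent" weights max(0, 1 - |x - i|). The NumPy sketch below shows that general idea only. The function names are hypothetical, coordinates are assumed to be in pixel units with zero padding outside the map, and the formulation in the actual conversion scripts may differ.

```python
import numpy as np

def tent_weights(coords: np.ndarray, size: int) -> np.ndarray:
    """[Q] continuous pixel coordinates -> [Q, size] bilinear (tent) weights."""
    grid = np.arange(size, dtype=np.float32)
    return np.maximum(0.0, 1.0 - np.abs(coords[:, None] - grid[None, :]))

def sample_tent_matmul(feat: np.ndarray, xs: np.ndarray, ys: np.ndarray) -> np.ndarray:
    """feat [H, W, C], xs/ys [Q] pixel coords -> [Q, C], using only mul + matmul."""
    h, w, c = feat.shape
    wy = tent_weights(ys, h)                                  # [Q, H]
    wx = tent_weights(xs, w)                                  # [Q, W]
    # Outer product of the separable 1D weights gives the 2D bilinear kernel.
    w2d = (wy[:, :, None] * wx[:, None, :]).reshape(len(xs), h * w)
    return w2d @ feat.reshape(h * w, c)

def sample_gather(feat: np.ndarray, xs: np.ndarray, ys: np.ndarray) -> np.ndarray:
    """Reference: classic 4-neighbour bilinear gather with zero padding."""
    h, w, c = feat.shape
    x0 = np.floor(xs).astype(int)
    y0 = np.floor(ys).astype(int)
    out = np.zeros((len(xs), c), dtype=np.float32)
    for dy in (0, 1):
        for dx in (0, 1):
            xi, yi = x0 + dx, y0 + dy
            wgt = (1.0 - np.abs(xs - xi)) * (1.0 - np.abs(ys - yi))
            valid = (xi >= 0) & (xi < w) & (yi >= 0) & (yi < h)
            px = feat[np.clip(yi, 0, h - 1), np.clip(xi, 0, w - 1)]  # the gather
            out += (wgt * valid)[:, None].astype(np.float32) * px
    return out
```

The matmul form touches every pixel for every sample, so it trades extra arithmetic for an op set the GPU delegate supports.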

Files

File | What it is | Size (fp16)
rfdetr_graphA_fp16.tflite | backbone + encoder + proposal heads → enc_class[1,576,91], enc_coord[1,576,4], memory[1,576,256] | 48.6 MB
rfdetr_graphB_fp16.tflite | two-stage combine + decoder + heads → boxes[1,300,4] (cxcywh), logits[1,300,91] | 7.6 MB

How it runs (two-graph split)

image[1,3,384,384]
  →[GPU Graph A]→ enc_class, enc_coord, memory
  →[host: top-300 by max class score → gather coords]→ refpoint_ts[1,300,4]
  →[GPU Graph B (memory, refpoint_ts)]→ boxes[1,300,4], logits[1,300,91]
  →[host: sigmoid + threshold + cxcywh→xyxy + per-class NMS]→ detections

The two-stage query selection (TOPK/GATHER) has no GPU op, but the proposal grid is image-independent, so the model splits at exactly that point: the standard two-stage-DETR edge split. Both graphs are 100% GPU-resident.
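The host step between the two graphs can be sketched in NumPy as follows. This is a minimal illustration of "top-300 by max class score, then gather coords" using the tensor shapes listed above. The function name `select_queries` is hypothetical and the shipped Android sample may implement the step differently.

```python
import numpy as np

def select_queries(enc_class: np.ndarray, enc_coord: np.ndarray, k: int = 300) -> np.ndarray:
    """Pick the k proposals with the highest max-class logit and gather their coords.

    enc_class: [1, N, C] proposal class logits from Graph A.
    enc_coord: [1, N, 4] proposal boxes from Graph A.
    Returns refpoint_ts with shape [1, k, 4], ordered by descending score.
    """
    # Sigmoid is monotonic, so ranking raw logits equals ranking probabilities.
    scores = enc_class[0].max(axis=-1)                      # [N]
    # argpartition finds the k best in O(N); only those k are then sorted.
    top = np.argpartition(-scores, k - 1)[:k]
    top = top[np.argsort(-scores[top], kind="stable")]
    return enc_coord[:, top, :].astype(np.float32)
```

With N = 576 proposals this is a few microseconds of host work, which is why the split costs little compared with the two GPU graphs.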

On-device (Pixel 8a, Tensor G3, verified)

graph | nodes on GPU | time
Graph A | 1381/1381 LITERT_CL | ~22 ms
Graph B | 404/404 LITERT_CL | ~5 ms

Full pipeline ≈ 27 ms (model) / ~100 ms end-to-end incl. host pre/post-processing. On a real image the device chain reproduces the PyTorch detections at IoU 0.98–0.99 with matching class and score.

Preprocessing / outputs

  • Input: square resize to 384×384, RGB, ImageNet mean/std ([0.485,0.456,0.406]/[0.229,0.224,0.225]), NCHW.
  • Output: Graph B boxes are cxcywh normalized to [0,1]; logits are 91-way (index = COCO category id). Host applies sigmoid + score threshold + cxcywh→xyxy + per-class NMS.
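The two bullets above can be sketched in NumPy as below. Several details are assumptions for illustration rather than facts from this card: the image is assumed already resized to 384×384, pixels are scaled to [0,1] before the mean/std step, each query keeps only its best class, and the thresholds (0.5 score, 0.5 IoU) are placeholder values. All function names are hypothetical.

```python
import numpy as np

MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(rgb_u8: np.ndarray) -> np.ndarray:
    """[384, 384, 3] uint8 RGB -> [1, 3, 384, 384] float32 (NCHW)."""
    x = rgb_u8.astype(np.float32) / 255.0       # assumed [0,1] scaling
    x = (x - MEAN) / STD
    return np.transpose(x, (2, 0, 1))[None]

def cxcywh_to_xyxy(b: np.ndarray) -> np.ndarray:
    cx, cy, w, h = b[:, 0], b[:, 1], b[:, 2], b[:, 3]
    return np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=1)

def iou_one_to_many(box: np.ndarray, boxes: np.ndarray) -> np.ndarray:
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def postprocess(boxes, logits, score_thr=0.5, iou_thr=0.5):
    """boxes [1,300,4] cxcywh, logits [1,300,91] -> list of (xyxy, class_id, score)."""
    probs = 1.0 / (1.0 + np.exp(-logits[0]))    # sigmoid, not softmax
    cls = probs.argmax(axis=1)
    score = probs.max(axis=1)
    keep = score >= score_thr
    xyxy, cls, score = cxcywh_to_xyxy(boxes[0][keep]), cls[keep], score[keep]
    out = []
    for c in np.unique(cls):                    # NMS runs separately per class
        idx = np.where(cls == c)[0]
        idx = idx[np.argsort(-score[idx])]
        while idx.size:
            best, rest = idx[0], idx[1:]
            out.append((xyxy[best], int(c), float(score[best])))
            idx = rest[iou_one_to_many(xyxy[best], xyxy[rest]) < iou_thr]
    return out
```

The returned boxes are still normalized to [0,1]; multiply by the original image width and height to get pixel coordinates.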

Conversion notes

Converted with litert-torch (NCHW preserved; onnx2tf destroys ViT attention). Re-authoring (per-graph tflite-vs-torch correlation 1.0): windowed DINOv2 backbone (6D window-partition → ≤4D, SDPA → manual attention), deformable grid_sample → a GATHER/CAST-free tent-matmul, MSDeformAttn ≤4D, baked sine pos-embed, and a down-scaled fp16-safe LayerNorm in the projector and decoder (the Mali delegate computes in fp16, and those LayerNorm channel-sums otherwise overflow). The two-stage topk/gather runs on the host between the two graphs.
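The LayerNorm overflow can be reproduced on a desktop with NumPy's float16. The sketch below is an illustration of the failure mode and of the down-scaling idea, not the converted graph itself: the scale factor 64 and the function names are assumptions, the affine gamma/beta terms are omitted, and epsilon is left unscaled for simplicity. LayerNorm output is unchanged by a uniform input scale (up to epsilon), so dividing the input first keeps the channel-sum of squares inside the fp16 range (max 65504).

```python
import numpy as np

def _layernorm_fp16(x16: np.ndarray, eps: float) -> np.ndarray:
    """LayerNorm over the last axis with every intermediate stored as float16."""
    with np.errstate(all="ignore"):             # silence the expected overflow warnings
        n = np.float16(x16.shape[-1])
        mean = x16.sum(-1, keepdims=True, dtype=np.float16) / n
        d = x16 - mean
        var = (d * d).sum(-1, keepdims=True, dtype=np.float16) / n
        return d / np.sqrt(var + np.float16(eps))

def layernorm_fp16_naive(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    return _layernorm_fp16(x.astype(np.float16), eps)

def layernorm_fp16_scaled(x: np.ndarray, scale: float = 64.0, eps: float = 1e-5) -> np.ndarray:
    # Hypothetical scale: shrink activations before any sum is taken.
    return _layernorm_fp16(x.astype(np.float16) / np.float16(scale), eps)

def layernorm_fp32(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    x = x.astype(np.float32)
    d = x - x.mean(-1, keepdims=True)
    return d / np.sqrt((d * d).mean(-1, keepdims=True) + eps)
```

For example, with 256 channels and activations of standard deviation around 60, the sum of squared deviations is roughly 256 × 3600 ≈ 9.2e5, far above 65504, so the naive fp16 variance becomes infinite and the output collapses. The scaled variant sums values near 256 × 0.88 ≈ 225 and tracks the fp32 result.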

A runnable Android sample (CompiledModel GPU) and the conversion scripts are in the official ai-edge-litert/litert-samples object_detection example.

License

Apache-2.0, inherited from roboflow/rf-detr.
