RF-DETR Nano — LiteRT (CompiledModel GPU)

RF-DETR Nano on a Pixel 8a — both transformer graphs on CompiledModel GPU

RF-DETR (Roboflow, 2025; an LW-DETR derivative) object detection, converted to LiteRT, with every graph node running on the CompiledModel GPU backend (ML Drift) on a phone. It is the first transformer/DETR detector to run on the LiteRT GPU API with no CPU/ONNX fallback.

RF-DETR is a transformer detector (windowed DINOv2-S backbone + deformable-attention DETR decoder). Off the shelf it is GPU-incompatible for three reasons: deformable grid_sample lowers to GATHER_ND, windowed attention produces 5D/6D tensors, and two-stage query selection needs TOPK/GATHER. Here it is converted with litert-torch and split into two GPU graphs with a tiny host step between them, so every op inside the two graphs runs on the GPU.

Files

File	What it is	Size (fp16)
`rfdetr_graphA_fp16.tflite`	backbone + encoder + proposal heads → `enc_class[1,576,91]`, `enc_coord[1,576,4]`, `memory[1,576,256]`	48.6 MB
`rfdetr_graphB_fp16.tflite`	two-stage combine + decoder + heads → `boxes[1,300,4]` (cxcywh), `logits[1,300,91]`	7.6 MB

How it runs (two-graph split)

image[1,3,384,384]
  →[GPU Graph A]→ enc_class, enc_coord, memory
  →[host: top-300 by max class score → gather coords]→ refpoint_ts[1,300,4]
  →[GPU Graph B  (memory, refpoint_ts)]→ boxes[1,300,4], logits[1,300,91]
  →[host: sigmoid + threshold + cxcywh→xyxy + per-class NMS]→ detections

The two-stage query selection (TOPK/GATHER) has no GPU op. Because the proposal grid is image-independent, the model can be split at exactly that point, which is the standard two-stage-DETR edge split. Both graphs are 100% GPU-resident; only the selection itself runs on the host.
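The host step between the two graphs can be sketched in NumPy as follows. This is a minimal illustration of "top-300 by max class score → gather coords", not the sample app's code; the function name `select_queries` is hypothetical, and the sketch assumes the gathered coordinates are passed to Graph B unchanged. Ranking on raw logits or on sigmoid scores gives the same order, since sigmoid is monotonic.

```python
import numpy as np

def select_queries(enc_class, enc_coord, k=300):
    """Pick the k proposals with the highest per-proposal max class score.

    enc_class: [1, 576, 91] class scores from Graph A
    enc_coord: [1, 576, 4]  proposal boxes from Graph A
    returns:   [1, k, 4]    reference points for Graph B (refpoint_ts)
    """
    best = enc_class[0].max(axis=-1)              # [576] best class score per proposal
    top = np.argsort(-best, kind="stable")[:k]    # indices of the k highest scores
    return enc_coord[:, top, :]                   # gather the matching coordinates
```

With 576 proposals and k = 300 the sort is negligible next to the two graph invocations, so a full `argsort` is simpler than a partial selection.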

On-device (Pixel 8a, Tensor G3 — verified)

Graph	Nodes on GPU	Time
Graph A	`1381/1381` LITERT_CL	~22 ms
Graph B	`404/404` LITERT_CL	~5 ms

Model time is ≈ 27 ms (Graph A + Graph B); end-to-end time is ~100 ms including host pre- and post-processing. On a real image the device chain reproduces the PyTorch detections at IoU 0.98–0.99 with matching class and score.

Preprocessing / outputs

Input: square resize to 384×384, RGB, ImageNet mean/std ([0.485,0.456,0.406]/[0.229,0.224,0.225]), NCHW.
Output: Graph B boxes are cxcywh normalized to [0,1]; logits are 91-way (index = COCO category id). Host applies sigmoid + score threshold + cxcywh→xyxy + per-class NMS.
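The input and output conventions above can be sketched in NumPy as below. This is an illustration under stated assumptions, not the sample app's code: the function names are hypothetical, the image is assumed to be already resized to 384×384, pixel values are assumed to be scaled to [0,1] before normalization, the thresholds `score_thr` and `iou_thr` are placeholder values, and every (query, class) pair above the score threshold is treated as a candidate.

```python
import numpy as np

MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(rgb_u8):
    """[384,384,3] uint8 RGB -> [1,3,384,384] float32, ImageNet-normalized, NCHW."""
    x = rgb_u8.astype(np.float32) / 255.0     # assumption: scale to [0,1] first
    x = (x - MEAN) / STD
    return x.transpose(2, 0, 1)[None]

def cxcywh_to_xyxy(b):
    cx, cy, w, h = b[..., 0], b[..., 1], b[..., 2], b[..., 3]
    return np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=-1)

def iou_one_to_many(box, boxes):
    """IoU of one xyxy box against an [N,4] array of xyxy boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter + 1e-9)

def nms(boxes, scores, iou_thr):
    """Greedy NMS; returns indices of kept boxes, highest score first."""
    order = np.argsort(-scores)
    keep = []
    while order.size:
        i, rest = order[0], order[1:]
        keep.append(int(i))
        order = rest[iou_one_to_many(boxes[i], boxes[rest]) <= iou_thr]
    return keep

def postprocess(boxes, logits, score_thr=0.5, iou_thr=0.5):
    """boxes [1,300,4] cxcywh, logits [1,300,91] -> list of (class_id, score, xyxy)."""
    scores = 1.0 / (1.0 + np.exp(-logits[0]))     # sigmoid, [300,91]
    xyxy = cxcywh_to_xyxy(boxes[0])               # [300,4], still normalized to [0,1]
    q, c = np.nonzero(scores > score_thr)         # candidate (query, class) pairs
    dets = []
    for cls in np.unique(c):                      # NMS independently per class
        sel = q[c == cls]
        cls_boxes, cls_scores = xyxy[sel], scores[sel, cls]
        for k in nms(cls_boxes, cls_scores, iou_thr):
            dets.append((int(cls), float(cls_scores[k]), cls_boxes[k]))
    return dets
```

The returned boxes stay normalized to [0,1]; multiply by the original image width and height to get pixel coordinates.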

Conversion notes

Converted with litert-torch, which preserves NCHW layout (onnx2tf destroys ViT attention). The following parts were re-authored, with a per-graph tflite-vs-torch correlation of 1.0: the windowed DINOv2 backbone (6D window-partition → ≤4D, SDPA → manual attention); deformable grid_sample → a GATHER/CAST-free tent-matmul; MSDeformAttn rewritten with ≤4D tensors; a baked sine pos-embed; and a down-scaled, fp16-safe LayerNorm in the projector and decoder (the Mali delegate computes in fp16, and those LayerNorm channel-sums otherwise overflow). The two-stage topk/gather runs on the host between the two graphs.
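One way to read "tent-matmul" is that bilinear sampling can be written as a dense matrix product whose weights are tent (triangle) functions of the sample position, which needs no gather and no integer cast. The NumPy sketch below illustrates that idea and checks it against a gather-based bilinear sampler with zero padding. It is not the re-authored op from the conversion scripts: the function names are hypothetical, and sample positions are assumed to be in pixel coordinates rather than grid_sample's normalized [-1,1] range.

```python
import numpy as np

def tent_weights(coords, size):
    """[N] pixel coordinates -> [N,size] weights max(0, 1 - |coord - i|)."""
    grid = np.arange(size, dtype=np.float32)
    return np.maximum(0.0, 1.0 - np.abs(coords[:, None] - grid[None, :]))

def sample_tent_matmul(feat, xs, ys):
    """Bilinear sampling as one matmul. feat [H,W,C], xs/ys [N] -> [N,C]."""
    H, W, C = feat.shape
    wy = tent_weights(ys, H)                                   # [N,H]
    wx = tent_weights(xs, W)                                   # [N,W]
    w = (wy[:, :, None] * wx[:, None, :]).reshape(len(xs), H * W)
    return w @ feat.reshape(H * W, C)                          # no gather, no cast

def sample_gather(feat, xs, ys):
    """Reference: gather the four neighbours, zero padding outside the map."""
    H, W, C = feat.shape
    x0, y0 = np.floor(xs).astype(int), np.floor(ys).astype(int)
    out = np.zeros((len(xs), C))
    for dy in (0, 1):
        for dx in (0, 1):
            xi, yi = x0 + dx, y0 + dy
            wgt = (1 - np.abs(xs - xi)) * (1 - np.abs(ys - yi))
            valid = (xi >= 0) & (xi < W) & (yi >= 0) & (yi < H)
            neigh = feat[np.clip(yi, 0, H - 1), np.clip(xi, 0, W - 1)]
            out += (wgt * valid)[:, None] * neigh
    return out
```

The trade-off is extra arithmetic: the weight matrix has H·W columns per sample point instead of four taps, which is affordable on small feature maps and keeps the graph within ops the GPU backend supports.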

A runnable Android sample (CompiledModel GPU) and the conversion scripts are in the official ai-edge-litert/litert-samples object_detection example.
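The fp16 LayerNorm overflow mentioned in the conversion notes can be reproduced in NumPy. LayerNorm output is unchanged when its input is multiplied by a constant (apart from the epsilon term), so scaling activations down before the reduction keeps the channel sums inside fp16 range. The sketch below is an illustration only: the scale factor 1/64, the input statistics, and the function names are assumptions, and the model card does not state the factor actually used.

```python
import numpy as np

def layer_norm_fp32(x, eps=1e-5):
    """float32 reference, without affine parameters."""
    x = x.astype(np.float32)
    mean = x.mean(axis=-1, keepdims=True)
    var = ((x - mean) ** 2).mean(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def layer_norm_fp16(x, eps=1e-5, pre_scale=1.0):
    """LayerNorm with every intermediate held in float16.

    pre_scale < 1 shrinks the input first; the result is the same up to
    the (negligible) effect of eps, but the sums stay below 65504.
    """
    h = x.astype(np.float16) * np.float16(pre_scale)
    n = np.float16(h.shape[-1])
    with np.errstate(over="ignore", invalid="ignore"):
        mean = np.sum(h, axis=-1, keepdims=True) / n
        centered = h - mean
        sq_sum = np.sum(centered * centered, axis=-1, keepdims=True)  # overflows if large
        var = sq_sum / n
        return centered / np.sqrt(var + np.float16(eps))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 100.0, size=(4, 256)).astype(np.float32)   # large activations

ref = layer_norm_fp32(x)
naive = layer_norm_fp16(x)                          # sum of squares becomes inf
scaled = layer_norm_fp16(x, pre_scale=1.0 / 64.0)   # stays finite

print("naive  max error:", np.max(np.abs(naive.astype(np.float32) - ref)))
print("scaled max error:", np.max(np.abs(scaled.astype(np.float32) - ref)))
```

In the naive path the variance becomes infinite and the normalized output collapses, while the pre-scaled path tracks the float32 reference to fp16 precision.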

License

Apache-2.0, inherited from roboflow/rf-detr.
