RF-DETR Nano: LiteRT (CompiledModel GPU)
RF-DETR (Roboflow 2025, an LW-DETR derivative) object detection, converted to LiteRT and running 100% on the CompiledModel GPU (ML Drift) on a phone. It is the first transformer/DETR detector to run on the LiteRT GPU API with no CPU/ONNX fallback.
RF-DETR is a transformer detector (windowed DINOv2-S backbone + deformable-attention DETR decoder).
Off the shelf it is GPU-incompatible: deformable grid_sample lowers to GATHER_ND, windowed attention
produces 5D/6D tensors, and two-stage query selection needs TOPK/GATHER. Here it is converted with
litert-torch and split into two GPU graphs with a tiny host step between them, so the whole detector
runs on the GPU.
Files
| File | What it is | Size (fp16) |
|---|---|---|
| `rfdetr_graphA_fp16.tflite` | backbone + encoder + proposal heads → `enc_class[1,576,91]`, `enc_coord[1,576,4]`, `memory[1,576,256]` | 48.6 MB |
| `rfdetr_graphB_fp16.tflite` | two-stage combine + decoder + heads → `boxes[1,300,4]` (cxcywh), `logits[1,300,91]` | 7.6 MB |
How it runs (two-graph split)
image[1,3,384,384]
→ [GPU Graph A] → enc_class, enc_coord, memory
→ [host: top-300 by max class score, gather coords] → refpoint_ts[1,300,4]
→ [GPU Graph B (memory, refpoint_ts)] → boxes[1,300,4], logits[1,300,91]
→ [host: sigmoid + threshold + cxcywh→xyxy + per-class NMS] → detections
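The final host step in the chain above can be sketched in plain Python. This is a minimal sketch, not the sample app's code: it assumes Graph B's `boxes` and `logits` have been copied into nested lists with the batch dimension dropped, keeps only each query's best class, and uses hypothetical default thresholds of 0.5 for score and IoU. `postprocess` and its helpers are illustrative names. Because the input is a plain square resize (no letterboxing), normalized boxes map back to the original image by scaling with its width and height.

```python
import math
from typing import List, Sequence, Tuple

# (x1, y1, x2, y2, score, class_id), in pixels of the original image.
Detection = Tuple[float, float, float, float, float, int]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def cxcywh_to_xyxy(box: Sequence[float], width: float,
                   height: float) -> Tuple[float, float, float, float]:
    """Normalized (cx, cy, w, h) -> absolute corner box (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return ((cx - w / 2) * width, (cy - h / 2) * height,
            (cx + w / 2) * width, (cy + h / 2) * height)


def iou(a: Sequence[float], b: Sequence[float]) -> float:
    """IoU of two xyxy boxes; fields after index 3 are ignored."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def postprocess(boxes: Sequence[Sequence[float]], logits: Sequence[Sequence[float]],
                width: float, height: float,
                score_threshold: float = 0.5,  # hypothetical default
                iou_threshold: float = 0.5) -> List[Detection]:  # hypothetical default
    candidates: List[Detection] = []
    for box, row in zip(boxes, logits):
        # Assumption: one class per query. Sigmoid is monotonic, so argmax over logits suffices.
        cls = max(range(len(row)), key=row.__getitem__)
        score = sigmoid(row[cls])
        if score >= score_threshold:
            candidates.append((*cxcywh_to_xyxy(box, width, height), score, cls))
    # Greedy per-class NMS: visit boxes by descending score and keep one unless an
    # already-kept box of the same class overlaps it by more than iou_threshold.
    candidates.sort(key=lambda d: d[4], reverse=True)
    kept: List[Detection] = []
    for det in candidates:
        if all(k[5] != det[5] or iou(k, det) <= iou_threshold for k in kept):
            kept.append(det)
    return kept
```

Since the logit index is the COCO category id, `class_id` in each returned tuple can be looked up directly in a COCO label table.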
The two-stage query selection (TOPK/GATHER) has no GPU op, but the proposal grid is
image-independent, so the model splits at exactly that point; this is the standard two-stage-DETR edge split.
Both graphs are 100% GPU-resident.
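Because the selection is just a top-k followed by a row gather, the host step between the graphs stays small. A minimal sketch, assuming `enc_class` and `enc_coord` arrive as nested lists with the batch dimension dropped, and that Graph B expects the gathered `enc_coord` rows unchanged and in descending score order (both assumptions; the conversion scripts are authoritative). `select_queries` is a hypothetical name.

```python
import heapq
from typing import List, Sequence


def select_queries(enc_class: Sequence[Sequence[float]],
                   enc_coord: Sequence[Sequence[float]],
                   num_queries: int = 300) -> List[List[float]]:
    """Return refpoint_ts as [num_queries][4]: coords of the best-scoring proposals.

    enc_class: [576][91] proposal logits from Graph A.
    enc_coord: [576][4] proposal boxes from Graph A, passed through unchanged.
    """
    if len(enc_class) != len(enc_coord):
        raise ValueError("enc_class and enc_coord must have the same number of proposals")
    # Score each proposal by its max class logit; a sigmoid would not change the ranking.
    scores = [max(row) for row in enc_class]
    # TOPK: indices of the highest scores, in descending order (ties keep index order).
    top = heapq.nlargest(num_queries, range(len(scores)), key=scores.__getitem__)
    # GATHER: copy the matching coordinate rows.
    return [list(enc_coord[i]) for i in top]
```

The result is what the diagram calls `refpoint_ts`, fed to Graph B together with `memory`.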
On-device (Pixel 8a, Tensor G3; verified)
| Graph | Nodes on GPU | Time |
|---|---|---|
| Graph A | 1381/1381 (LITERT_CL) | ~22 ms |
| Graph B | 404/404 (LITERT_CL) | ~5 ms |
Full pipeline: about 27 ms for the two models and ~100 ms end-to-end including host pre/post-processing. On a real image, the on-device chain reproduces the PyTorch detections at IoU 0.98–0.99 with matching class and score.
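The card does not say exactly how device and PyTorch detections were paired for the IoU figure above, so the following is only a sketch of one plausible check: for every reference detection, take the best IoU against same-class device detections. `best_match_ious` is a hypothetical helper, and boxes are assumed to be corner-format `xyxy`.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two corner-format boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def best_match_ious(reference: Sequence[Tuple[Box, int]],
                    device: Sequence[Tuple[Box, int]]) -> List[float]:
    """Best same-class IoU for each reference (PyTorch) detection; 0.0 if no match exists."""
    return [max((box_iou(ref_box, dev_box) for dev_box, dev_cls in device if dev_cls == ref_cls),
                default=0.0)
            for ref_box, ref_cls in reference]
```

For example, a 10×10 reference box matched by a device box that is 0.2 px taller scores 100/102 ≈ 0.98.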
Preprocessing / outputs
- Input: square resize to 384×384, RGB, ImageNet mean/std (`[0.485, 0.456, 0.406]` / `[0.229, 0.224, 0.225]`), NCHW.
- Output: Graph B `boxes` are `cxcywh` normalized to `[0, 1]`; `logits` are 91-way (index = COCO category id). The host applies sigmoid + score threshold + `cxcywh`→`xyxy` + per-class NMS.
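A plain-Python sketch of the input contract, assuming the image arrives as an H×W×3 list of 0–255 RGB values and that pixels are scaled to [0, 1] before the ImageNet normalization (the usual convention for these mean/std values). It uses nearest-neighbour resizing for brevity; the resampling filter of the reference pipeline is not stated here, so a production port should match it. `preprocess` is an illustrative name; it returns a flat float list in NCHW order, sized for a `[1,3,384,384]` input buffer.

```python
from typing import List, Sequence

MEAN = (0.485, 0.456, 0.406)  # ImageNet mean, RGB order
STD = (0.229, 0.224, 0.225)   # ImageNet std, RGB order


def preprocess(image: Sequence[Sequence[Sequence[int]]], size: int = 384) -> List[float]:
    """image: H x W x 3 RGB in 0..255. Returns 3*size*size floats in NCHW order."""
    h, w = len(image), len(image[0])
    out: List[float] = []
    for c in range(3):                       # channel-major (the "C" in NCHW)
        for y in range(size):
            src_row = image[y * h // size]   # nearest-neighbour row (assumption)
            for x in range(size):
                value = src_row[x * w // size][c] / 255.0  # scale to [0, 1]
                out.append((value - MEAN[c]) / STD[c])
    return out
```

The square resize ignores aspect ratio, which is why normalized output boxes can be mapped back by scaling with the original width and height.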
Conversion notes
Converted with litert-torch, which preserves NCHW (onnx2tf breaks the ViT attention). Re-authoring steps
(per-graph TFLite-vs-torch correlation 1.0): the windowed DINOv2 backbone (6D window-partition → ≤4D,
SDPA → manual attention), deformable grid_sample → a GATHER/CAST-free tent-matmul, MSDeformAttn
reduced to ≤4D, baked sine pos-embed, and a down-scaled fp16-safe LayerNorm in the projector and decoder
(the Mali delegate computes in fp16, and those LayerNorm channel sums otherwise overflow). The two-stage
topk/gather runs on the host between the two graphs.
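The tent-matmul idea rests on the fact that bilinear interpolation weights are a tent (triangle) kernel, max(0, 1 - |d|), applied in x and y. Evaluating that kernel against every grid cell turns sampling into one dense matrix product that needs only subtract, abs, max, multiply, and add: no floor, no integer CAST, no GATHER. The sketch below illustrates that idea and is not the conversion script: it uses pixel coordinates with cell centers at integers and zero padding (assumptions; grid_sample's normalized coordinates and `align_corners` setting would have to be mapped onto this), and checks the result against a conventional four-neighbour gather. `tent_sample` and `gather_sample` are hypothetical names.

```python
import math
from typing import List, Sequence, Tuple

Feature = Sequence[Sequence[Sequence[float]]]  # H x W x C
Point = Tuple[float, float]  # (x, y) in pixel units, cell centers at integers


def tent(d: float) -> float:
    """Triangle kernel: bilinear weight of a cell at distance d from the sample."""
    return max(0.0, 1.0 - abs(d))


def tent_sample(feature: Feature, points: Sequence[Point]) -> List[List[float]]:
    """Bilinear sampling as one dense matmul: [P, H*W] tent weights @ [H*W, C] features."""
    height, width = len(feature), len(feature[0])
    # Each weight row covers all H*W cells, so no index arithmetic or gather is needed.
    weights = [[tent(x - j) * tent(y - i) for i in range(height) for j in range(width)]
               for x, y in points]
    flat = [feature[i][j] for i in range(height) for j in range(width)]
    channels = len(flat[0])
    return [[sum(w_row[k] * flat[k][c] for k in range(len(flat))) for c in range(channels)]
            for w_row in weights]


def gather_sample(feature: Feature, points: Sequence[Point]) -> List[List[float]]:
    """Reference bilinear sampling that indexes the 4 neighbours (the GATHER path).

    Out-of-range neighbours contribute zero, like padding_mode="zeros".
    """
    height, width = len(feature), len(feature[0])
    channels = len(feature[0][0])
    out = []
    for x, y in points:
        x0, y0 = math.floor(x), math.floor(y)  # the floor/CAST the tent form avoids
        acc = [0.0] * channels
        for i, j in ((y0, x0), (y0, x0 + 1), (y0 + 1, x0), (y0 + 1, x0 + 1)):
            if 0 <= i < height and 0 <= j < width:
                w = (1.0 - abs(x - j)) * (1.0 - abs(y - i))
                for c in range(channels):
                    acc[c] += w * feature[i][j][c]
        out.append(acc)
    return out
```

The dense `[P, H*W]` weight matrix trades memory and FLOPs for GPU friendliness; how the re-authored MSDeformAttn keeps that cost in check is not described here, so treat this only as an illustration of the equivalence.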
A runnable Android sample (CompiledModel GPU) and the conversion scripts are in the official ai-edge-litert/litert-samples object_detection example.
License
Apache-2.0, inherited from roboflow/rf-detr.