RF-DETR: Core AI (.aimodel)

RF-DETR (Roboflow's real-time detection transformer, COCO-pretrained) converted to Apple Core AI for iOS 27 / macOS 27, the answer to apple/coreai-models#14. DETR family = no NMS: post-processing is one sigmoid + top-k.

RF-DETR medium on Core AI

Files

| file | input | params | M4 Max GPU | iPhone 17 Pro GPU |
|---|---|---|---|---|
| rfdetr-nano_float32.aimodel | 384×384 | 30.5M | 8.6 ms (~116 FPS) | ~25 ms (33–39 FPS live) |
| rfdetr-small_float32.aimodel | 512×512 | 32.1M | 12.0 ms (~83 FPS) | - |
| rfdetr-medium_float32.aimodel | 576×576 | 33.7M | 14.8 ms (~68 FPS) | 56–63 ms (15–17 FPS live) |
| rfdetr-large_float32.aimodel | 704×704 | 33.9M | 19.1 ms (~52 FPS) | - |

iPhone numbers are end-to-end live-camera measurements from the CoreAIKit DetectCamera example (Release; zero-copy capture pipeline: AVCaptureVideoPreviewLayer display, hardware-scaled 32BGRA buffers, vImage preprocessing overlapped with GPU inference). Peak measured 39.6 FPS ≈ the nano model ceiling; sustained max-load throughput drops on a hot chassis (thermal).

fp32 is the shipping dtype: it passes the detection-set-exact gate against the PyTorch fp32 reference on both CPU and GPU (per confident detection: same class, IoU ≥ 0.999 measured, score within 2e-3), and fp16 bought only ~7% latency on M4 Max while adding near-tie ranking noise.
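
The per-detection gate described above can be sketched roughly as follows. This is an illustrative check, not the author's test harness: the function names are hypothetical, detections are assumed to be paired by position in two equally ordered lists, and the figures quoted above (IoU 0.999, score tolerance 2e-3) are used as thresholds by assumption.

```python
def iou_cxcywh(a, b):
    """IoU of two boxes given as (cx, cy, w, h) in the same coordinate system."""
    ax0, ay0 = a[0] - a[2] / 2, a[1] - a[3] / 2
    ax1, ay1 = a[0] + a[2] / 2, a[1] + a[3] / 2
    bx0, by0 = b[0] - b[2] / 2, b[1] - b[3] / 2
    bx1, by1 = b[0] + b[2] / 2, b[1] + b[3] / 2
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))   # overlap width
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))   # overlap height
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def passes_gate(ref, test, min_iou=0.999, score_tol=2e-3):
    """ref/test: lists of (class_id, score, cxcywh), confident detections only.

    Assumption: both lists are in the same order, so item i is compared with item i.
    """
    if len(ref) != len(test):
        return False
    for (rc, rs, rb), (tc, ts, tb) in zip(ref, test):
        if rc != tc:                        # same class
            return False
        if abs(rs - ts) > score_tol:        # score within tolerance
            return False
        if iou_cxcywh(rb, tb) < min_iou:    # boxes effectively identical
            return False
    return True
```

Pairing by position is the simplest choice; a real harness might instead match detections by best IoU before comparing class and score.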

Graph contract

input  "image"  [1, 3, R, R]  float32, RGB in [0, 1]  (ImageNet mean/std folded in-graph)
output "dets"   [1, 300, 4]   boxes, cxcywh normalized to [0, 1]
output "labels" [1, 300, 91]  raw class logits; column index = ORIGINAL COCO id (0 unused, 1=person … 17=cat … 90)
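
As a rough illustration of the input side of this contract, the sketch below turns an already-resized RGB frame into the `[1, 3, R, R]` float32 tensor. The helper name `to_model_input` is hypothetical, and it assumes resizing to R×R has happened upstream; only the divide-by-255 scaling is applied, since the contract states that ImageNet mean/std are folded into the graph.

```python
import numpy as np


def to_model_input(frame_rgb_u8: np.ndarray) -> np.ndarray:
    """HWC uint8 RGB (already resized to R x R) -> [1, 3, R, R] float32 in [0, 1]."""
    if frame_rgb_u8.ndim != 3 or frame_rgb_u8.shape[2] != 3:
        raise ValueError("expected an HWC RGB image")
    h, w, _ = frame_rgb_u8.shape
    if h != w:
        raise ValueError("the graph expects a square R x R input")
    chw = frame_rgb_u8.astype(np.float32) / 255.0   # scale only; mean/std live in the graph
    chw = chw.transpose(2, 0, 1)                    # HWC -> CHW
    return np.ascontiguousarray(chw[None])          # add the batch dimension
```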

Python decode sketch (Swift is the same three steps):

import asyncio
import numpy as np
import coreai.runtime as rt

async def detect(path, rgb01, threshold=0.5):
    # rgb01: float32 [1, 3, R, R], RGB in [0, 1]
    model = await rt.AIModel.load(path, rt.SpecializationOptions.default())
    fn = model.load_function("main")
    out = await fn({"image": rt.NDArray(rgb01)})
    prob = 1 / (1 + np.exp(-out["labels"].numpy()[0]))    # sigmoid -> [300, 91]
    scores, classes = prob.max(-1), prob.argmax(-1)       # column index IS the COCO id
    boxes = out["dets"].numpy()[0]                        # cxcywh in [0, 1]; multiply by image W/H
    keep = scores > threshold                             # done, no NMS
    return boxes[keep], scores[keep], classes[keep]

# boxes, scores, classes = asyncio.run(detect(path, rgb01))
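
The "one sigmoid + top-k" variant, plus the conversion to pixel corner boxes, might look like the sketch below. It works on plain NumPy arrays so it runs without the Core AI runtime. The function name `decode_topk`, the default `k=100`, and taking top-k over flattened (query, class) pairs, as is common in DETR-family post-processing, are assumptions rather than something the source specifies.

```python
import numpy as np


def decode_topk(logits, boxes_cxcywh, img_w, img_h, k=100, threshold=0.5):
    """logits: [Q, C] raw class logits; boxes_cxcywh: [Q, 4] normalized to [0, 1].

    Returns (xyxy pixel boxes [N, 4], scores [N], COCO class ids [N]), best first.
    """
    prob = 1.0 / (1.0 + np.exp(-logits))               # sigmoid, per query and class
    flat = prob.ravel()
    k = min(k, flat.size)
    top = np.argsort(-flat)[:k]                        # indices of the k highest scores
    scores = flat[top]
    queries, classes = np.divmod(top, prob.shape[1])   # flat index -> (query, COCO id)
    cx, cy, w, h = boxes_cxcywh[queries].T
    xyxy = np.stack([(cx - w / 2) * img_w, (cy - h / 2) * img_h,
                     (cx + w / 2) * img_w, (cy + h / 2) * img_h], axis=-1)
    keep = scores > threshold                          # drop low-confidence pairs; no NMS
    return xyxy[keep], scores[keep], classes[keep]
```

Unlike the per-query max in the sketch above, flattened top-k can return the same query twice under two different classes; which behavior is preferable depends on the application.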

RF-DETR-Seg (instance segmentation)

rfdetr-seg-{nano,small,medium,large,xlarge,2xlarge}_float32.aimodel: same contract plus masks [1, Q, R/4, R/4], per-query FULL-FRAME logit planes at stride 4 (host: sigmoid > 0.5; no ROI plumbing, no NMS). All six gate on CPU and GPU with binary-mask IoU 1.000 on stable scenes. M4 Max GPU: seg-nano 312² 10.7 ms → seg-2xlarge 768² 59.1 ms.
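
A minimal host-side sketch of the mask step, under stated assumptions: the helper names are hypothetical, the kept-query mask comes from the detection decode, and nearest-neighbour upsampling back to input resolution is one possible choice (bilinear interpolation of the logits before thresholding is another). Since sigmoid(x) > 0.5 exactly when x > 0, the sigmoid itself can be skipped.

```python
import numpy as np


def decode_masks(mask_logits, keep, stride=4):
    """mask_logits: [Q, R/4, R/4] logits; keep: [Q] bool. Returns [N, R, R] bool masks."""
    binary = mask_logits[keep] > 0.0          # sigmoid(x) > 0.5  <=>  x > 0
    # Nearest-neighbour upsample: repeat each cell `stride` times along both axes.
    return binary.repeat(stride, axis=1).repeat(stride, axis=2)


def mask_iou(a, b):
    """Binary-mask IoU; two empty masks count as identical."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return 1.0 if union == 0 else float(inter) / float(union)
```

`mask_iou` is the kind of comparison a binary-mask IoU gate could use when checking a converted model against a reference.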

RF-DETR-Seg nano on Core AI

Split deployment (split/)

split/rfdetr-{nano,medium}_{backbone,head}.aimodel separate the pure-ViT backbone (image → features) from the deformable head (features → dets/labels; position encodings baked in). The chain is bit-exact vs the monolith. Purpose: per-stage compute-unit preferences, e.g. backbone on the Neural Engine. Measured honestly: on iOS 27 beta the runtime still executes the backbone on the GPU delegate even under .neuralEngine preference (identical detection fingerprint, no ANE-compile pause), so today the monolith on GPU is the fastest config; the split exists so ANE placement can be adopted the moment the runtime honors it. Regenerate with export_rf_detr.py --variant <v> --split.

Conversion

Exported with conversion/export_rf_detr.py from rfdetr==1.7.1 weights. The port surfaced four Core AI converter/runtime bugs (float-arg arange abort, int64-comparison buffer clobber, GPU-delegate floor/trunc/ceil = identity, cast-pair cancellation), each worked around numerically identically; details and minimal repros in zoo/rf-detr.md.

License: Apache-2.0 (upstream RF-DETR code and COCO-pretrained weights are Apache-2.0).
