Faster Segment Anything (MobileSAM)

MobileSAM performs on par with the original SAM (at least visually) and keeps exactly the same pipeline, except for a change to the image encoder. Specifically, we replace the original heavyweight ViT-H encoder (632M parameters) with a much smaller Tiny-ViT (5M parameters). On a single GPU, MobileSAM runs in around 12ms per image: 8ms for the image encoder and 4ms for the mask decoder.
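To make the "same pipeline, different encoder" idea concrete, here is a minimal structural sketch. It is not the MobileSAM code: `SegmentationPipeline`, `heavy_encoder`, `light_encoder`, and `shared_decoder` are hypothetical toy stand-ins, and an "image" is just a list of numbers. It only illustrates that the image encoder can be swapped while the prompt-guided decoder object stays untouched.

```python
from dataclasses import dataclass, replace
from typing import Callable, List

# Hypothetical toy types: an image and its embedding are plain lists of floats.
Image = List[float]
Embedding = List[float]


@dataclass(frozen=True)
class SegmentationPipeline:
    """Toy two-stage pipeline: image encoder, then prompt-guided mask decoder."""
    image_encoder: Callable[[Image], Embedding]
    mask_decoder: Callable[[Embedding, int], List[int]]

    def segment(self, image: Image, prompt_index: int) -> List[int]:
        embedding = self.image_encoder(image)
        return self.mask_decoder(embedding, prompt_index)


def heavy_encoder(image: Image) -> Embedding:
    # Stand-in for a large encoder (assumption: identity mapping for the demo).
    return [float(p) for p in image]


def light_encoder(image: Image) -> Embedding:
    # Stand-in for a small encoder producing a compatible embedding.
    return [float(p) for p in image]


def shared_decoder(embedding: Embedding, prompt_index: int) -> List[int]:
    # Toy "prompt": mark every position at least as large as the prompted one.
    threshold = embedding[prompt_index]
    return [1 if value >= threshold else 0 for value in embedding]


original = SegmentationPipeline(heavy_encoder, shared_decoder)
# Swap only the encoder; the decoder is reused as-is.
mobile = replace(original, image_encoder=light_encoder)

mask = mobile.segment([0.2, 0.8, 0.5], prompt_index=2)
```

Because the decoder only sees the embedding, any encoder that produces a compatible embedding can be dropped in without touching the rest of the pipeline.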

The ViT-based image encoders compare as follows:

| Image Encoder | Original SAM | MobileSAM |
|---------------|--------------|-----------|
| Parameters    | 611M         | 5M        |
| Speed         | 452ms        | 8ms       |

The original SAM and MobileSAM have exactly the same prompt-guided mask decoder:

| Mask Decoder | Original SAM | MobileSAM |
|--------------|--------------|-----------|
| Parameters   | 3.876M       | 3.876M    |
| Speed        | 4ms          | 4ms       |

The whole pipelines compare as follows:

| Whole Pipeline (Enc+Dec) | Original SAM | MobileSAM |
|--------------------------|--------------|-----------|
| Parameters               | 615M         | 9.66M     |
| Speed                    | 456ms        | 12ms      |
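The whole-pipeline speed figures are simply the encoder and decoder timings added together, and the speedups follow by division. The short sketch below redoes that arithmetic using only the numbers reported above; the variable names are illustrative, not part of any MobileSAM API.

```python
# Timings in milliseconds, copied from the comparison figures above.
encoder_ms = {"Original SAM": 452, "MobileSAM": 8}
decoder_ms = {"Original SAM": 4, "MobileSAM": 4}

# Whole pipeline = image encoder + mask decoder.
pipeline_ms = {name: encoder_ms[name] + decoder_ms[name] for name in encoder_ms}

# Speedup of MobileSAM relative to the original SAM.
encoder_speedup = encoder_ms["Original SAM"] / encoder_ms["MobileSAM"]
pipeline_speedup = pipeline_ms["Original SAM"] / pipeline_ms["MobileSAM"]

# Original SAM parameter count in millions: encoder + decoder, rounded.
original_params_m = round(611 + 3.876)

print(pipeline_ms)          # {'Original SAM': 456, 'MobileSAM': 12}
print(encoder_speedup)      # 56.5
print(pipeline_speedup)     # 38.0
print(original_params_m)    # 615
```

By these figures, the encoder alone is about 56 times faster and the full pipeline about 38 times faster, since the shared 4ms decoder becomes a larger share of MobileSAM's total runtime.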


