
The inference speed of MobileCLIP-S2's image encoder is slower than OpenCLIP's ViT-B-32-256 model on both CPU and GPU

#1
by Kinfai - opened

I used the following code to measure inference time. The inference speed of MobileCLIP-S2's image encoder is slower than OpenCLIP's ViT-B-32-256 model on both CPU (12th Gen Intel(R) Core(TM) i7-12700K) and GPU (NVIDIA GeForce RTX 3090). Is this expected?

import time  # used by the commented-out CPU timing path

import torch
from PIL import Image

import mobileclip

device = "cuda:0"
# device = "cpu"

model, _, preprocess = mobileclip.create_model_and_transforms('mobileclip_s2',
                                                              pretrained='checkpoints/mobileclip_s2.pt',
                                                              device=device)
model.eval()

image = Image.open("docs/fig_accuracy_latency.png").convert('RGB')
image = preprocess(image).unsqueeze(0).to(device)

# Time 10,000 forward passes with CUDA events
# (switch to the commented-out time.time() calls when benchmarking on CPU).
infer_t = 0
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
for _ in range(10000):
    with torch.no_grad(), torch.amp.autocast('cuda'):
        start.record()
        # start_t = time.time()
        image_features = model.encode_image(image)
        end.record()
        torch.cuda.synchronize()
        # end_t = time.time()
    infer_t += start.elapsed_time(end)
    # infer_t += end_t - start_t
print(f'inference speed: {infer_t / 10000} ms/frame')

# Repeat the same benchmark with OpenCLIP's ViT-B-32-256 model.
import open_clip
model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32-256',
                                                             pretrained='datacomp_s34b_b86k',
                                                             device=device)
model.eval()

image = Image.open("docs/fig_accuracy_latency.png").convert('RGB')
image = preprocess(image).unsqueeze(0).to(device)

infer_t = 0
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
for _ in range(10000):
    with torch.no_grad(), torch.amp.autocast('cuda'):
        start.record()
        # start_t = time.time()
        image_features = model.encode_image(image)
        end.record()
        torch.cuda.synchronize()
        # end_t = time.time()
    infer_t += start.elapsed_time(end)
    # infer_t += end_t - start_t
print(f'inference speed: {infer_t / 10000} ms/frame')

mobileclip_s2
torch_gpu 18.90664233722687 ms/frame
torch_cpu 170.6237115383148 ms/frame

openclip_vit_b_32_256
torch_gpu 6.30794669418335 ms/frame
torch_cpu 114.05081667900086 ms/frame

Please see the response here:

Apologies for the delayed response. We benchmarked our models on the Neural Engine of the iPhone 12 Pro Max using Core ML. To achieve optimal performance on NVIDIA GPUs, I recommend using TensorRT, as its kernels appear to be more effectively optimized for depthwise/grouped convolutions.

https://huggingface.co/apple/MobileCLIP-S2-OpenCLIP/discussions/3#67588d31d56dc18df9f60f38
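Below is a minimal sketch of that TensorRT route, assuming the same checkpoint path as in the benchmark above: it wraps the image encoder in a small module, exports it to ONNX with torch.onnx.export, and the resulting graph can then be built and profiled with NVIDIA's trtexec tool. The ONNX file name and the trtexec flags shown are illustrative assumptions, not part of the original discussion.

import torch
import mobileclip

class ImageEncoder(torch.nn.Module):
    """Thin wrapper so that only encode_image is traced and exported."""
    def __init__(self, clip_model):
        super().__init__()
        self.clip_model = clip_model

    def forward(self, x):
        return self.clip_model.encode_image(x)

model, _, _ = mobileclip.create_model_and_transforms(
    'mobileclip_s2', pretrained='checkpoints/mobileclip_s2.pt')
encoder = ImageEncoder(model).eval()

# MobileCLIP-S2 takes 256x256 RGB inputs (matching the preprocess above).
dummy = torch.randn(1, 3, 256, 256)
torch.onnx.export(encoder, dummy, "mobileclip_s2_image_encoder.onnx",
                  input_names=["image"], output_names=["image_features"],
                  opset_version=17)

# Build and profile an FP16 TensorRT engine from the exported graph, e.g.:
#   trtexec --onnx=mobileclip_s2_image_encoder.onnx --fp16

The latency reported by trtexec (or the TensorRT runtime) should reflect the optimized depthwise/grouped convolution kernels rather than the eager PyTorch CUDA kernels timed above.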

fartashf changed discussion status to closed
