The inference speed of MobileCLIP-S2's image encoder is slower than OpenCLIP's ViT-B-32-256 model on both CPU and GPU
Using the following code to measure inference time, I find that MobileCLIP-S2's image encoder is slower than OpenCLIP's ViT-B-32-256 model on both CPU (12th Gen Intel(R) Core(TM) i7-12700K) and GPU (NVIDIA GeForce RTX 3090). Is this expected?
device = "cuda:0"
# device = "cpu"
model, _, preprocess = mobileclip.create_model_and_transforms('mobileclip_s2',
pretrained='checkpoints/mobileclip_s2.pt',
device=device)
model.eval()
image = Image.open("docs/fig_accuracy_latency.png").convert('RGB')
image = preprocess(image).unsqueeze(0).to(device)
infer_t = 0
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
for _ in range(10000):
with torch.no_grad(), torch.amp.autocast('cuda'):
start.record()
# start_t = time.time()
image_features = model.encode_image(image)
end.record()
torch.cuda.synchronize()
# end_t = time.time()
infer_t += start.elapsed_time(end)
# infer_t += end_t - start_t
print(f'inference speed: {infer_t / 10000} ms/frame')
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32-256',
                                                             pretrained='datacomp_s34b_b86k',
                                                             device=device)
model.eval()

image = Image.open("docs/fig_accuracy_latency.png").convert('RGB')
image = preprocess(image).unsqueeze(0).to(device)

infer_t = 0
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
for _ in range(10000):
    with torch.no_grad(), torch.amp.autocast('cuda'):
        start.record()
        # start_t = time.time()
        image_features = model.encode_image(image)
        end.record()
        torch.cuda.synchronize()
        # end_t = time.time()
        infer_t += start.elapsed_time(end)
        # infer_t += end_t - start_t
print(f'inference speed: {infer_t / 10000} ms/frame')
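A note on the timing method: the CUDA events only apply to the GPU runs, so the CPU numbers below presumably come from the commented time.time() fallback. A single device-agnostic helper could cover both cases; this is only a sketch, and the benchmark function, its defaults, and the warmup count are assumptions, not part of the original script:

import time
import torch

def benchmark(encode_fn, image, device, iters=10000, warmup=50):
    # Returns average latency in ms/frame for either CPU or GPU runs.
    with torch.no_grad():
        for _ in range(warmup):               # warm up caches / cuDNN autotuning before timing
            encode_fn(image)
        if device.startswith("cuda"):
            torch.cuda.synchronize()          # flush queued GPU work before starting the clock
        start = time.perf_counter()
        for _ in range(iters):
            encode_fn(image)
        if device.startswith("cuda"):
            torch.cuda.synchronize()          # wait for all iterations before stopping the clock
        return (time.perf_counter() - start) / iters * 1000.0

# e.g. print(f'inference speed: {benchmark(model.encode_image, image, device):.2f} ms/frame')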
Results (average over 10,000 iterations):

mobileclip_s2
    torch_gpu  18.91 ms/frame
    torch_cpu  170.62 ms/frame

openclip_vit_b_32_256
    torch_gpu  6.31 ms/frame
    torch_cpu  114.05 ms/frame
Please see the response here: https://huggingface.co/apple/MobileCLIP-S2-OpenCLIP/discussions/3#67588d31d56dc18df9f60f38

"Apologies for the delayed response. We benchmarked our models on the neural engine of the iPhone 12 Pro Max using Core ML. For optimal performance on NVIDIA GPUs, I recommend using TensorRT, as its kernels appear to be more effectively optimized for depthwise/grouped convolutions."
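Following the TensorRT suggestion, one possible route is to export the image encoder to ONNX and then benchmark the resulting engine with trtexec. This is only a sketch under assumptions: the ImageEncoder wrapper, the file names, the opset version, and the 256x256 input resolution are illustrative, not from the thread.

import torch
import mobileclip

model, _, _ = mobileclip.create_model_and_transforms(
    'mobileclip_s2', pretrained='checkpoints/mobileclip_s2.pt')
model.eval()

class ImageEncoder(torch.nn.Module):
    # Wrapper so that ONNX export traces only the image tower (encode_image).
    def __init__(self, clip_model):
        super().__init__()
        self.clip_model = clip_model

    def forward(self, x):
        return self.clip_model.encode_image(x)

dummy = torch.randn(1, 3, 256, 256)  # assumed MobileCLIP-S2 input resolution
torch.onnx.export(ImageEncoder(model), dummy, "mobileclip_s2_image.onnx",
                  input_names=["image"], output_names=["features"],
                  opset_version=17)

# On the RTX 3090 machine, the exported model can then be timed with, e.g.:
#   trtexec --onnx=mobileclip_s2_image.onnx --fp16
# which reports per-inference latency using TensorRT's fused kernels.

Note also that the latencies cited in the response were measured with Core ML on the iPhone's neural engine rather than with PyTorch, which may explain why they differ from the torch GPU/CPU figures above.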