---
license: mit
language:
- en
---
# CLIP ViT-B/32 in OpenVINO™ format

## Original model details
The CLIP model was developed by researchers at OpenAI to learn about what contributes to robustness in computer vision tasks. The model was also developed to test the ability of models to generalize to arbitrary image classification tasks in a zero-shot manner. It was not developed for general model deployment; to deploy models like CLIP, researchers will first need to carefully study their capabilities in relation to the specific context they're being deployed within.

### Model type
The model uses a ViT-B/32 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss.
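Because the two encoders embed images and text into a shared space, the model can be used for zero-shot classification by comparing an image embedding against the embeddings of candidate label prompts. The sketch below illustrates this with the original PyTorch weights via the `transformers` library; the `openai/clip-vit-base-patch32` checkpoint, the example image URL, and the candidate labels are assumptions for illustration.

```python
# Minimal zero-shot classification sketch with the original PyTorch CLIP weights.
from PIL import Image
import requests
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Example image and candidate labels (placeholders for illustration).
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
labels = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity scores, turned into probabilities over the labels.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```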
## OpenVINO optimization

To speed up inference, the model was optimized with the OpenVINO™ toolkit. The table below compares inference times for the OpenVINO™ model and the original PyTorch implementation:
| Metric             | PyTorch inference time (seconds) | OpenVINO inference time (seconds) | Similarity |
|--------------------|----------------------------------|-----------------------------------|------------|
| Mean               | 0.518564                         | 0.461107                          | 1          |
| Standard deviation | 0.109119                         | 0.0917191                         | 0          |
| Min                | 0.390102                         | 0.360006                          | 1          |
| Max                | 0.699677                         | 0.620042                          | 1          |
The results indicate that the OpenVINO™ model is consistently faster (roughly 11% lower mean inference time) while producing outputs that match the original PyTorch model, as indicated by a similarity score of 1 across all runs.
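For reference, the sketch below shows one way a PyTorch CLIP model can be converted to OpenVINO IR with `ov.convert_model`. This is an assumed workflow, not necessarily the exact steps used to produce the files in this repository; the `openai/clip-vit-base-patch32` checkpoint, the example inputs, and the output file name are illustrative.

```python
# Illustrative PyTorch -> OpenVINO IR conversion sketch (assumed workflow).
import openvino as ov
import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# torchscript=True makes the model return plain tuples, which keeps tracing simple.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32", torchscript=True)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# Dummy inputs for tracing; any representative image/text pair will do.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(text=["a photo of a cat"], images=image,
                   return_tensors="pt", padding=True)

# Convert to OpenVINO IR and save it to disk.
ov_model = ov.convert_model(
    model,
    example_input={
        "input_ids": inputs["input_ids"],
        "pixel_values": inputs["pixel_values"],
        "attention_mask": inputs["attention_mask"],
    },
)
ov.save_model(ov_model, "clip-vit-base-patch32.xml")  # hypothetical output name
```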
## Usage

You can use this optimized model for faster inference in latency-sensitive environments. Make sure the necessary libraries and dependencies are installed, most importantly the OpenVINO™ runtime (`openvino`).
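The sketch below shows one way to run the OpenVINO™ model with the runtime API, reusing the original CLIP preprocessing from `transformers`. The IR file name, the input tensor names, and the output ordering are assumptions; adjust them to the files actually shipped in this repository.

```python
# Minimal inference sketch with the OpenVINO runtime (file and tensor names assumed).
import numpy as np
import openvino as ov
import requests
from PIL import Image
from transformers import CLIPProcessor

core = ov.Core()
compiled = core.compile_model("clip-vit-base-patch32.xml", "CPU")  # hypothetical file name

# Reuse the original CLIP preprocessing for images and text.
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
labels = ["a photo of a cat", "a photo of a dog"]
inputs = processor(text=labels, images=image, return_tensors="np", padding=True)

result = compiled({
    "input_ids": inputs["input_ids"],
    "pixel_values": inputs["pixel_values"],
    "attention_mask": inputs["attention_mask"],
})

# Assuming the first output corresponds to logits_per_image, as in the PyTorch model.
logits_per_image = result[compiled.output(0)]
probs = np.exp(logits_per_image) / np.exp(logits_per_image).sum(-1, keepdims=True)
print(dict(zip(labels, probs[0].tolist())))
```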