Convert the model to TorchScript or ONNX

I would like to run the model, feature extractor and tokenizer in C++.
So I am looking to convert them to TorchScript. I load them with the parameter torchscript=True as below.

model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning", torchscript=True)
feature_extractor = ViTFeatureExtractor.from_pretrained("nlpconnect/vit-gpt2-image-captioning", torchscript=True)
tokenizer = AutoTokenizer.from_pretrained("nlpconnect/vit-gpt2-image-captioning", torchscript=True)

But I can't seem to figure out what parameters to pass to the trace function.
traced_model = torch.jit.trace(model, WHAT_INPUT_TO_PASS)

I did try passing the pixel_values generated by feature_extractor and a random tensor of shape (1, 16), but the traced model seems to be incorrect.

I also tried converting it to ONNX, but "Export a custom model for an unsupported architecture" seemed very confusing.

Any guidance will be deeply appreciated.

Prabesh Khadka

Hi @prabeshkhadka

First of all, my suggestion would be to read this blog:
It will help you understand more about training and inference with vision encoder-decoder models.

This is how you can do it.

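import requests
import torch
from PIL import Image

url = ""
image = Image.open(requests.get(url, stream=True).raw)
pixel_values = feature_extractor(image, return_tensors="pt").pixel_values

labels = tokenizer(
     "an image of two cats chilling on a couch",
     return_tensors="pt",
).input_ids

# trace with example inputs (image tensor and token ids), then save
traced_model = torch.jit.trace(model, [pixel_values, labels])
torch.jit.save(traced_model, "")

# load model
loaded_model = torch.jit.load("")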
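
Since the original concern was that the traced model "seems to be incorrect", it may help to compare the reloaded module against the eager model on the same inputs. Below is a minimal sketch of that check using a hypothetical toy two-input module (ToyCaptioner is made up for illustration and is not the real vit-gpt2 model); the same trace, save, load and compare pattern should carry over to the real model, assuming its outputs are plain tensors or tuples of tensors.

```python
import os
import tempfile

import torch
from torch import nn


class ToyCaptioner(nn.Module):
    """Hypothetical stand-in for an encoder-decoder that takes two inputs."""

    def __init__(self, vocab_size=50, hidden=8):
        super().__init__()
        self.encoder = nn.Linear(3 * 4 * 4, hidden)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, pixel_values, decoder_input_ids):
        image_state = self.encoder(pixel_values.flatten(1))       # (batch, hidden)
        token_state = self.embed(decoder_input_ids)               # (batch, seq, hidden)
        return self.head(token_state + image_state.unsqueeze(1))  # (batch, seq, vocab)


torch.manual_seed(0)
model = ToyCaptioner().eval()

# Example inputs: one tensor per positional argument of forward().
pixel_values = torch.rand(1, 3, 4, 4)
decoder_input_ids = torch.randint(0, 50, (1, 16))

traced = torch.jit.trace(model, (pixel_values, decoder_input_ids))

# Round-trip through disk, as you would before loading from C++.
path = os.path.join(tempfile.mkdtemp(), "toy_traced.pt")
torch.jit.save(traced, path)
loaded = torch.jit.load(path)

with torch.no_grad():
    eager_out = model(pixel_values, decoder_input_ids)
    loaded_out = loaded(pixel_values, decoder_input_ids)

outputs_match = torch.allclose(eager_out, loaded_out, atol=1e-5)
print("traced output matches eager output:", outputs_match)
```

If the outputs differ on the real model, a common cause is data-dependent control flow, which tracing records only for the example inputs it was given.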

You can also see a running Colab notebook:

Thank you so much. This works like a charm.

Hi, what if I have a ViT image classifier like nateraw/vit-age-classifier? How would I convert it to TorchScript?
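
For a plain image classifier the same idea should apply, except that pixel_values is the only input. Here is a minimal sketch, assuming the checkpoint loads with ViTForImageClassification. To keep it runnable without a download it uses a tiny, randomly initialised ViTConfig (all sizes here are made up). Instead of relying on torchscript=True, it wraps the model in a hypothetical LogitsOnly module so that forward returns a plain tensor, which is a swapped-in approach and not the one from the thread.

```python
import torch
from torch import nn
from transformers import ViTConfig, ViTForImageClassification


class LogitsOnly(nn.Module):
    """Hypothetical wrapper: returns only the logits tensor so tracing sees a plain tensor."""

    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, pixel_values):
        return self.model(pixel_values=pixel_values, return_dict=False)[0]


# Tiny random model for illustration only. For the real checkpoint, the
# assumption is that you would build it with
# ViTForImageClassification.from_pretrained("nateraw/vit-age-classifier").
config = ViTConfig(
    image_size=32,
    patch_size=8,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
    num_labels=3,
)
model = ViTForImageClassification(config).eval()
wrapped = LogitsOnly(model).eval()

# One example image batch: (batch, channels, height, width).
pixel_values = torch.rand(1, 3, config.image_size, config.image_size)

traced = torch.jit.trace(wrapped, pixel_values)

with torch.no_grad():
    eager_logits = wrapped(pixel_values)
    traced_logits = traced(pixel_values)

logits_match = torch.allclose(eager_logits, traced_logits, atol=1e-5)
print("traced logits match eager logits:", logits_match)
```

With a real checkpoint, pixel_values would come from the feature extractor rather than torch.rand, and the traced module can be saved with torch.jit.save in the same way as above.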

