GPT2

This repository contains GPT2 ONNX models compatible with TensorRT:

  • gpt2-xl.onnx - GPT2-XL ONNX model for building FP32 or FP16 engines
  • gpt2-xl-i8.onnx - GPT2-XL ONNX model for building INT8+FP32 engines

Quantization was performed with the ENOT-AutoDL framework. The code for building the TensorRT engines, together with usage examples, is published on GitHub.
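
For orientation, a FP16 engine can be built from gpt2-xl.onnx with the TensorRT Python API roughly as sketched below. This is a minimal illustration only: the input tensor name and the shape ranges are assumptions, and the actual build code in the repository linked under "How to use" is authoritative.

```python
# Minimal sketch (assumptions: the ONNX input is named "input_ids" and the
# usable sequence-length range is 1..1024; see the ENOT-transformers repo
# for the actual build code).
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("gpt2-xl.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse the ONNX model")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # drop this flag for a pure FP32 engine

# Dynamic sequence length requires an optimization profile
# (tensor name and shape ranges here are illustrative assumptions).
profile = builder.create_optimization_profile()
profile.set_shape("input_ids", min=(1, 1), opt=(1, 64), max=(1, 1024))
config.add_optimization_profile(profile)

serialized = builder.build_serialized_network(network, config)
with open("gpt2-xl-fp16.engine", "wb") as f:
    f.write(serialized)
```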

Metrics:

GPT2-XL

Metric        TensorRT INT8+FP32   torch FP16
Lambada Acc   72.11%               71.43%
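
The LAMBADA figure for the torch FP16 baseline can be approximated with a simplified last-token check like the sketch below. This is an illustration only: last-word accuracy (what the table reports) differs slightly from last-token accuracy, and the evaluation script actually used lives in the ENOT-transformers repository.

```python
# Simplified LAMBADA last-token accuracy check for the torch FP16 baseline.
import torch
from datasets import load_dataset
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl", torch_dtype=torch.float16).cuda().eval()

dataset = load_dataset("lambada", split="test")
correct = total = 0
for sample in dataset.select(range(200)):  # small subset for a quick check
    ids = tokenizer(sample["text"], return_tensors="pt").input_ids.cuda()
    with torch.inference_mode():
        logits = model(ids[:, :-1]).logits
    pred = logits[0, -1].argmax().item()
    correct += int(pred == ids[0, -1].item())
    total += 1
print(f"last-token accuracy: {correct / total:.2%}")
```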

Test environment

  • GPU RTX 4090
  • CPU 11th Gen Intel(R) Core(TM) i7-11700K
  • TensorRT 8.5.3.1
  • pytorch 1.13.1+cu116

Latency:

GPT2-XL

Input sequence length   Number of generated tokens   TensorRT INT8+FP32, ms   torch FP16, ms   Acceleration
64                      64                           462                      1190             2.58
64                      128                          920                      2360             2.54
64                      256                          1890                     4710             2.54
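
Acceleration is the ratio of torch FP16 latency to TensorRT INT8+FP32 latency (for example, 1190 ms / 462 ms ≈ 2.58). A rough way to reproduce the torch FP16 side of the table is sketched below; the timing methodology here is an assumption, and the benchmark actually used is in the ENOT-transformers repository.

```python
# Sketch of timing GPT2-XL FP16 generation in torch (assumed methodology).
import time
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2-xl", torch_dtype=torch.float16).cuda().eval()
input_ids = torch.randint(0, model.config.vocab_size, (1, 64), device="cuda")  # 64-token prompt

def time_generation(new_tokens: int, warmup: int = 3, runs: int = 10) -> float:
    with torch.inference_mode():
        for _ in range(warmup):
            model.generate(input_ids, max_new_tokens=new_tokens, min_new_tokens=new_tokens, do_sample=False)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(runs):
            model.generate(input_ids, max_new_tokens=new_tokens, min_new_tokens=new_tokens, do_sample=False)
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs * 1000  # ms per generation

print(f"64 new tokens: {time_generation(64):.0f} ms")
```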

Test environment

  • GPU RTX 4090
  • CPU 11th Gen Intel(R) Core(TM) i7-11700K
  • TensorRT 8.5.3.1
  • pytorch 1.13.1+cu116

How to use

An inference example and an accuracy test are published on GitHub:

git clone https://github.com/ENOT-AutoDL/ENOT-transformers
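
After building an engine (see the sketch above), it can be deserialized for inference roughly as follows. Buffer allocation and the token-generation loop are omitted here, since the repository's scripts cover the full pipeline; the engine filename is an assumption.

```python
# Minimal sketch of loading a serialized TensorRT engine in Python.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open("gpt2-xl-fp16.engine", "rb") as f:  # filename is an assumption
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()
# Input/output device buffers are then bound (e.g. via pycuda or polygraphy)
# before calling context.execute_v2 or context.execute_async_v2.
```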