metadata

license: mit
pipeline_tag: text-generation
tags:
  - ONNX
  - DML
  - ONNXRuntime
  - phi3
  - nlp
  - conversational
  - custom_code
inference: false

Phi-3 Small-128K-Instruct ONNX CUDA models

This repository hosts optimized versions of Phi-3-small-128k-instruct that accelerate inference with ONNX Runtime on machines with NVIDIA GPUs.

Phi-3 Small is a lightweight, state-of-the-art, 7B-parameter open model trained with the Phi-3 datasets, which include both synthetic data and filtered publicly available website data, with a focus on high-quality and reasoning-dense content. The model belongs to the Phi-3 family. The Small version comes in two variants, 8K and 128K, named for the context length (in tokens) that each supports.

The base model has undergone a post-training process that incorporates both supervised fine-tuning and direct preference optimization for instruction following and safety. When assessed against benchmarks testing common sense, language understanding, math, code, long context, and logical reasoning, Phi-3-Small-128K-Instruct showed robust, state-of-the-art performance among models of the same size and the next size up.

Optimized variants of the Phi-3 Small models are published here in ONNX format and run with ONNX Runtime on GPU across devices, including server platforms, Windows, and Linux.

ONNX Models

Here are some of the optimized configurations we have added:

ONNX model for FP16 CUDA: FP16-precision ONNX model for NVIDIA GPUs.
ONNX model for INT4 CUDA: ONNX model for NVIDIA GPUs, quantized to INT4 with round-to-nearest (RTN).

Note: If disk space is limited, you can use the Hugging Face CLI to download individual subfolders instead of all models. The FP16 model is recommended for larger batch sizes, while the INT4 model is optimized for smaller batch sizes.

Example:

# Download just the FP16 model
$ huggingface-cli download microsoft/Phi-3-small-128k-instruct-onnx-cuda --include "cuda-fp16/*" --local-dir . --local-dir-use-symlinks False
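
The same selective download can be sketched in Python with the `huggingface_hub` library, which also provides the CLI used above. This is a minimal sketch rather than the repository's official instructions: `patterns_for` and `download_variant` are hypothetical helper names, and only the `cuda-fp16` subfolder name is taken from the example above.

```python
from typing import List

REPO_ID = "microsoft/Phi-3-small-128k-instruct-onnx-cuda"


def patterns_for(subfolder: str) -> List[str]:
    """Build the glob pattern that selects one subfolder of the repository."""
    return [f"{subfolder.strip('/')}/*"]


def download_variant(subfolder: str, local_dir: str = ".") -> str:
    """Download only the files under `subfolder` and return the local path.

    Requires network access and `pip install huggingface_hub`.
    """
    # Imported here so the helper above stays usable without the package.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=patterns_for(subfolder),  # same role as --include
        local_dir=local_dir,                     # same role as --local-dir
    )
```

Calling `download_variant("cuda-fp16")` would fetch just the FP16 files into the current directory, mirroring the CLI command.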

How to Get Started with the Model

To support the Phi-3 models across a range of devices, platforms, and EP backends, we introduce a new API to wrap several aspects of generative AI inferencing. This API makes it easy to drag and drop LLMs straight into your app. To run the early version of these models with ONNX, follow the steps here. You can also test the models with this chat app.

Hardware Supported

The ONNX models are tested on:

1 A100 GPU, SKU: Standard_ND96amsr_A100_v4 (CUDA)

Minimum Configuration Required:

CUDA: compute capability (SM version) >= 7.0, i.e. sm_70 (V100 or newer)
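
One way to check a machine against this minimum is to query the GPU's CUDA compute capability, where "70" corresponds to capability 7.0 (the V100 generation). The sketch below assumes PyTorch is installed with CUDA support. `meets_minimum` and `detect_capability` are hypothetical helper names, not part of any library.

```python
from typing import Optional, Tuple

# Minimum taken from the requirement above: sm_70, i.e. compute capability 7.0.
MIN_CAPABILITY: Tuple[int, int] = (7, 0)


def meets_minimum(capability: Tuple[int, int],
                  minimum: Tuple[int, int] = MIN_CAPABILITY) -> bool:
    """Compare (major, minor) capability tuples lexicographically."""
    return tuple(capability) >= tuple(minimum)


def detect_capability(device_index: int = 0) -> Optional[Tuple[int, int]]:
    """Return the (major, minor) capability of a GPU, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(device_index)


if __name__ == "__main__":
    cap = detect_capability()
    if cap is None:
        print("No CUDA device detected (or PyTorch is not installed).")
    else:
        status = "meets" if meets_minimum(cap) else "is below"
        print(f"Compute capability {cap[0]}.{cap[1]} {status} the 7.0 minimum.")
```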

Model Description

Developed by: Microsoft
Model type: ONNX
Language(s) (NLP): Python, C, C++
License: MIT
Model Description: This is a conversion of the Phi-3 Small-128K-Instruct model for ONNX Runtime inference.

Additional Details

Performance Metrics

Phi-3 Small-128K-Instruct runs faster with ONNX Runtime than with PyTorch for every batch size and prompt length combination tested. For FP16 CUDA, ORT is up to 5X faster than PyTorch, while for INT4 CUDA it is up to 5.9X faster.

The tables below show the average throughput, in tokens per second (tps), of the first 256 tokens generated for FP16 and INT4 precisions on CUDA, as measured on 1 A100 80GB GPU, SKU: Standard_ND96amsr_A100_v4.

Batch Size, Prompt Length	ORT FP16 CUDA (tps)	PyTorch Eager FP16 CUDA (tps)	Speed Up ORT/PyTorch
1, 16	73.60	14.88	4.95
4, 16	287.60	66.25	4.34
16, 16	1025.44	66.25	4.38

Batch Size, Prompt Length	ORT INT4 CUDA (tps)	PyTorch Eager INT4 CUDA (tps)	Speed Up ORT/PyTorch
1, 16	68.26	11.57	5.90
4, 16	151.79	40.18	3.78
16, 16	577.41	148.17	3.90
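
The speed-up column is the ratio of ORT throughput to PyTorch throughput. As a small worked example, the sketch below recomputes that ratio for the INT4 rows using the throughput values from the table. The names `INT4_ROWS` and `speedup` are illustrative only.

```python
# (batch size, prompt length), ORT INT4 tps, PyTorch Eager INT4 tps
INT4_ROWS = [
    ((1, 16), 68.26, 11.57),
    ((4, 16), 151.79, 40.18),
    ((16, 16), 577.41, 148.17),
]


def speedup(ort_tps: float, pytorch_tps: float) -> float:
    """Return how many times faster ORT is than PyTorch."""
    return ort_tps / pytorch_tps


if __name__ == "__main__":
    for (batch, prompt_len), ort_tps, pt_tps in INT4_ROWS:
        ratio = speedup(ort_tps, pt_tps)
        print(f"batch={batch}, prompt={prompt_len}: {ratio:.2f}x")
```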

Package Versions

Pip package name	Version
torch	2.3.0
triton	2.3.0
onnxruntime-gpu	1.18.0
transformers	4.40.2
bitsandbytes	0.43.1
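
To compare a local environment against the versions listed above, a short check with the standard library's `importlib.metadata` may help. This is a sketch under assumptions: the listed versions are treated as the benchmark environment rather than as hard requirements, and `installed_version`, `matches`, and `version_report` are hypothetical helper names.

```python
from importlib import metadata
from typing import Dict, Optional, Tuple

# Versions copied from the table above.
BENCHMARK_VERSIONS: Dict[str, str] = {
    "torch": "2.3.0",
    "triton": "2.3.0",
    "onnxruntime-gpu": "1.18.0",
    "transformers": "4.40.2",
    "bitsandbytes": "0.43.1",
}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a pip package, or None if missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches(wanted: str, installed: Optional[str]) -> bool:
    """Compare versions, ignoring a local build suffix such as '+cu121'."""
    if installed is None:
        return False
    return installed.split("+", 1)[0] == wanted


def version_report(
    pins: Dict[str, str] = BENCHMARK_VERSIONS,
) -> Dict[str, Tuple[str, Optional[str], bool]]:
    """Map each package to (listed version, installed version, match flag)."""
    report = {}
    for name, wanted in pins.items():
        found = installed_version(name)
        report[name] = (wanted, found, matches(wanted, found))
    return report


if __name__ == "__main__":
    for name, (wanted, found, ok) in version_report().items():
        print(f"{name}: listed {wanted}, installed {found}, match={ok}")
```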

Appendix

Model Card Contact

parinitarahi, kvaishnavi, natke

Contributors

Kunal Vaishnavi, Sunghoon Choi, Yufeng Li, Tianlei Wu, Sheetal Arun Kadam, Rui Ren, Baiju Meswani, Natalie Kershaw, Parinita Rahi