Gemma-7B-Instruct-ONNX

Model Summary

This repository contains optimized versions of the gemma-7b-it model, designed to accelerate inference using ONNX Runtime. These optimizations are specifically tailored for CPU and DirectML. DirectML is a high-performance, hardware-accelerated DirectX 12 library for machine learning, offering GPU acceleration across a wide range of supported hardware and drivers, including those from AMD, Intel, NVIDIA, and Qualcomm.

ONNX Models

Here are some of the optimized configurations we have added:

  • ONNX model for int4 DirectML: ONNX model for AMD, Intel, and NVIDIA GPUs on Windows, quantized to int4 using AWQ.
  • ONNX model for int4 CPU and Mobile: ONNX model for CPU and mobile using int4 quantization via RTN. There are two versions uploaded to balance latency vs. accuracy. Acc=1 is targeted at improved accuracy, while Acc=4 is for improved performance. For mobile devices, we recommend using the model with acc-level-4.

Usage

Installation and Setup

To use the Gemma-7B-Instruct-ONNX model on Windows with DirectML, follow these steps:

  1. Create and activate a Conda environment:
conda create -n onnx python=3.10
conda activate onnx
  1. Install Git LFS:
winget install -e --id GitHub.GitLFS
  1. Install Hugging Face CLI:
pip install huggingface-hub[cli]
  1. Download the model:
huggingface-cli download EmbeddedLLM/gemma-7b-it-onnx --include="onnx/directml/gemma-7b-it-int4/*" --local-dir .\gemma-7b-it-onnx
  1. Install necessary Python packages:
pip install numpy==1.26.4
pip install onnxruntime-directml
pip install --pre onnxruntime-genai-directml
  1. Install Visual Studio 2015 runtime:
conda install conda-forge::vs2015_runtime
  1. Download the example script:
Invoke-WebRequest -Uri "https://raw.githubusercontent.com/microsoft/onnxruntime-genai/main/examples/python/phi3-qa.py" -OutFile "phi3-qa.py"
  1. Run the example script:
python phi3-qa.py -m .\gemma-7b-it-onnx

Hardware Requirements

Minimum Configuration:

  • Windows: DirectX 12-capable GPU (AMD/Nvidia)
  • CPU: x86_64 / ARM64

Tested Configurations:

  • GPU: AMD Ryzen 8000 Series iGPU (DirectML)
  • CPU: AMD Ryzen CPU

Model Page: Gemma

This model card corresponds to the 7B instruct version of the Gemma model. You can also visit the model card of the 2B base model, 7B base model, and 2B instruct model.

Resources and Technical Documentation:

Terms of Use: Terms

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Examples
Inference API (serverless) has been turned off for this model.

Collection including EmbeddedLLM/gemma-7b-it-onnx