Qwen2.5-Coder-32B-Instruct-int4-ov

Community conversion (not an official Intel / OpenVINO release). This repo exists because the two community OpenVINO IRs of this model already on the Hub (Echo9Zulu/Qwen2.5-Coder-32B-Instruct-int4_sym-awq-ov, AIFunOver/Qwen2.5-Coder-32B-Instruct-openvino-4bit) ship the graph (openvino_model.xml) without the weights (openvino_model.bin) — they are not loadable. This IR is complete (.xml and a ~16.4 GB .bin) and has been verified to load and generate under OpenVINO Model Server.

Description

This is the Qwen2.5-Coder-32B-Instruct model converted to the OpenVINO™ IR (Intermediate Representation) format with weights compressed to INT4 by NNCF.

Quantization Parameters

Weight compression was performed with the following parameters:

  • mode: INT4_SYM
  • ratio: 1.0
  • group_size: 128

The model was exported with Optimum Intel:

optimum-cli export openvino \
  --model Qwen/Qwen2.5-Coder-32B-Instruct \
  --task text-generation-with-past \
  --weight-format int4 --sym --ratio 1.0 --group-size 128 \
  Qwen2.5-Coder-32B-Instruct-int4-ov

For more information on quantization, check the OpenVINO model optimization guide.

Compatibility

The provided OpenVINO™ IR model is compatible with:

  • OpenVINO version 2026.2.0 and higher
  • Optimum Intel 2.0.0 and higher

Conversion toolchain: optimum-intel 2.0.0, openvino 2026.2.1, openvino-tokenizers 2026.2.1, nncf 3.2.0, transformers 5.0.0, torch 2.12.1 (Python 3.12).

Running Model Inference with Optimum Intel

  1. Install packages required for using Optimum Intel integration with the OpenVINO backend:
pip install optimum[openvino]
  1. Run model inference:
from transformers import AutoTokenizer
from optimum.intel.openvino import OVModelForCausalLM

model_id = "exzile/Qwen2.5-Coder-32B-Instruct-int4-ov"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = OVModelForCausalLM.from_pretrained(model_id)

prompt = "Write a Python function that reverses a singly linked list."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Running Model Inference with OpenVINO GenAI

  1. Install packages required for using OpenVINO GenAI:
pip install openvino-genai huggingface_hub
  1. Download the model from HuggingFace Hub:
import huggingface_hub as hf_hub

model_id = "exzile/Qwen2.5-Coder-32B-Instruct-int4-ov"
model_path = "Qwen2.5-Coder-32B-Instruct-int4-ov"

hf_hub.snapshot_download(model_id, local_dir=model_path)
  1. Run model inference:
import openvino_genai as ov_genai

device = "GPU"  # or "CPU"
pipe = ov_genai.LLMPipeline(model_path, device)
print(pipe.generate("Write a Python function that reverses a singly linked list.", max_new_tokens=256))

More GenAI usage examples can be found in the OpenVINO GenAI library docs and samples.

Verified

  • openvino_model.bin weights present and intact (~16.4 GB INT4) — the point of this re-upload.
  • Loads and generates under OpenVINO Model Server (OVMS) on an Intel Arc Pro B70 (32 GB), GPU device, confirmed with a real chat/completions generation probe.

Limitations

Check the original model card for limitations. This derivative applies INT4 weight-only quantization; outputs may differ slightly from the full-precision base model. No fine-tuning, retraining, or behavioral modification was performed.

Legal information

The original model is distributed under the Apache 2.0 license. This is a redistributed quantized derivative; see the accompanying NOTICE file. All credit for the model belongs to the Qwen team, Alibaba Cloud. More details can be found in the original model card.

Downloads last month
24
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for exzile/Qwen2.5-Coder-32B-Instruct-int4-ov

Base model

Qwen/Qwen2.5-32B
Quantized
(125)
this model