Qwen2.5-Coder-32B-Instruct-int4-ov

Model creator: Qwen
Original model: Qwen2.5-Coder-32B-Instruct

Community conversion (not an official Intel / OpenVINO release). This repo exists because the two community OpenVINO IRs of this model already on the Hub (Echo9Zulu/Qwen2.5-Coder-32B-Instruct-int4_sym-awq-ov, AIFunOver/Qwen2.5-Coder-32B-Instruct-openvino-4bit) ship the graph (openvino_model.xml) without the weights (openvino_model.bin) — they are not loadable. This IR is complete (.xml and a ~16.4 GB .bin) and has been verified to load and generate under OpenVINO Model Server.

Description

This is the Qwen2.5-Coder-32B-Instruct model converted to the OpenVINO™ IR (Intermediate Representation) format with weights compressed to INT4 by NNCF.

Quantization Parameters

Weight compression was performed with the following parameters:

mode: INT4_SYM
ratio: 1.0
group_size: 128

The model was exported with Optimum Intel:

optimum-cli export openvino \
  --model Qwen/Qwen2.5-Coder-32B-Instruct \
  --task text-generation-with-past \
  --weight-format int4 --sym --ratio 1.0 --group-size 128 \
  Qwen2.5-Coder-32B-Instruct-int4-ov

For more information on quantization, check the OpenVINO model optimization guide.

Compatibility

The provided OpenVINO™ IR model is compatible with:

OpenVINO version 2026.2.0 and higher
Optimum Intel 2.0.0 and higher

Conversion toolchain: optimum-intel 2.0.0, openvino 2026.2.1, openvino-tokenizers 2026.2.1, nncf 3.2.0, transformers 5.0.0, torch 2.12.1 (Python 3.12).

Running Model Inference with Optimum Intel

Install packages required for using Optimum Intel integration with the OpenVINO backend:

pip install optimum[openvino]

Run model inference:

from transformers import AutoTokenizer
from optimum.intel.openvino import OVModelForCausalLM

model_id = "exzile/Qwen2.5-Coder-32B-Instruct-int4-ov"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = OVModelForCausalLM.from_pretrained(model_id)

prompt = "Write a Python function that reverses a singly linked list."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Running Model Inference with OpenVINO GenAI

Install packages required for using OpenVINO GenAI:

pip install openvino-genai huggingface_hub

Download the model from HuggingFace Hub:

import huggingface_hub as hf_hub

model_id = "exzile/Qwen2.5-Coder-32B-Instruct-int4-ov"
model_path = "Qwen2.5-Coder-32B-Instruct-int4-ov"

hf_hub.snapshot_download(model_id, local_dir=model_path)

Run model inference:

import openvino_genai as ov_genai

device = "GPU"  # or "CPU"
pipe = ov_genai.LLMPipeline(model_path, device)
print(pipe.generate("Write a Python function that reverses a singly linked list.", max_new_tokens=256))

More GenAI usage examples can be found in the OpenVINO GenAI library docs and samples.

Verified

openvino_model.bin weights present and intact (~16.4 GB INT4) — the point of this re-upload.
Loads and generates under OpenVINO Model Server (OVMS) on an Intel Arc Pro B70 (32 GB), GPU device, confirmed with a real chat/completions generation probe.

Limitations

Check the original model card for limitations. This derivative applies INT4 weight-only quantization; outputs may differ slightly from the full-precision base model. No fine-tuning, retraining, or behavioral modification was performed.

Legal information

The original model is distributed under the Apache 2.0 license. This is a redistributed quantized derivative; see the accompanying NOTICE file. All credit for the model belongs to the Qwen team, Alibaba Cloud. More details can be found in the original model card.

Downloads last month: 24

Model tree for exzile/Qwen2.5-Coder-32B-Instruct-int4-ov

Base model

Qwen/Qwen2.5-32B

Finetuned

Qwen/Qwen2.5-Coder-32B

Finetuned

Qwen/Qwen2.5-Coder-32B-Instruct

Quantized

(125)

this model