Qwen2.5-Coder-32B-Instruct-int4-ov
- Model creator: Qwen
- Original model: Qwen2.5-Coder-32B-Instruct
Community conversion (not an official Intel / OpenVINO release). This repo exists because the two community OpenVINO IRs of this model already on the Hub (
Echo9Zulu/Qwen2.5-Coder-32B-Instruct-int4_sym-awq-ov,AIFunOver/Qwen2.5-Coder-32B-Instruct-openvino-4bit) ship the graph (openvino_model.xml) without the weights (openvino_model.bin) — they are not loadable. This IR is complete (.xmland a ~16.4 GB.bin) and has been verified to load and generate under OpenVINO Model Server.
Description
This is the Qwen2.5-Coder-32B-Instruct model converted to the OpenVINO™ IR (Intermediate Representation) format with weights compressed to INT4 by NNCF.
Quantization Parameters
Weight compression was performed with the following parameters:
- mode: INT4_SYM
- ratio: 1.0
- group_size: 128
The model was exported with Optimum Intel:
optimum-cli export openvino \
--model Qwen/Qwen2.5-Coder-32B-Instruct \
--task text-generation-with-past \
--weight-format int4 --sym --ratio 1.0 --group-size 128 \
Qwen2.5-Coder-32B-Instruct-int4-ov
For more information on quantization, check the OpenVINO model optimization guide.
Compatibility
The provided OpenVINO™ IR model is compatible with:
- OpenVINO version 2026.2.0 and higher
- Optimum Intel 2.0.0 and higher
Conversion toolchain: optimum-intel 2.0.0, openvino 2026.2.1, openvino-tokenizers 2026.2.1,
nncf 3.2.0, transformers 5.0.0, torch 2.12.1 (Python 3.12).
Running Model Inference with Optimum Intel
- Install packages required for using Optimum Intel integration with the OpenVINO backend:
pip install optimum[openvino]
- Run model inference:
from transformers import AutoTokenizer
from optimum.intel.openvino import OVModelForCausalLM
model_id = "exzile/Qwen2.5-Coder-32B-Instruct-int4-ov"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = OVModelForCausalLM.from_pretrained(model_id)
prompt = "Write a Python function that reverses a singly linked list."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Running Model Inference with OpenVINO GenAI
- Install packages required for using OpenVINO GenAI:
pip install openvino-genai huggingface_hub
- Download the model from HuggingFace Hub:
import huggingface_hub as hf_hub
model_id = "exzile/Qwen2.5-Coder-32B-Instruct-int4-ov"
model_path = "Qwen2.5-Coder-32B-Instruct-int4-ov"
hf_hub.snapshot_download(model_id, local_dir=model_path)
- Run model inference:
import openvino_genai as ov_genai
device = "GPU" # or "CPU"
pipe = ov_genai.LLMPipeline(model_path, device)
print(pipe.generate("Write a Python function that reverses a singly linked list.", max_new_tokens=256))
More GenAI usage examples can be found in the OpenVINO GenAI library docs and samples.
Verified
openvino_model.binweights present and intact (~16.4 GB INT4) — the point of this re-upload.- Loads and generates under OpenVINO Model Server (OVMS) on an Intel Arc Pro B70 (32 GB), GPU device, confirmed with a real chat/completions generation probe.
Limitations
Check the original model card for limitations. This derivative applies INT4 weight-only quantization; outputs may differ slightly from the full-precision base model. No fine-tuning, retraining, or behavioral modification was performed.
Legal information
The original model is distributed under the Apache 2.0
license. This is a redistributed quantized derivative; see the accompanying NOTICE file. All credit for
the model belongs to the Qwen team, Alibaba Cloud. More details can be found in the original
model card.
- Downloads last month
- 24
Model tree for exzile/Qwen2.5-Coder-32B-Instruct-int4-ov
Base model
Qwen/Qwen2.5-32B