Converting to litertlm with the STATIC_WI8_AI16 quantization recipe requires QSVs
#1
by 4ntoine - opened
I'm trying to convert this model to a litertlm model:
litert-torch export_hf \
--model=$model \
--output_dir="./static_wi8_ai16" \
--quantization_recipe="static_wi8_ai16" \
--bundle_litert_lm=true
and it fails:
./convert.sh
W0523 12:00:38.290000 4684 torch/distributed/elastic/multiprocessing/redirects.py:35] NOTE: Redirects are currently not supported in MacOs.
W0523 12:00:38.305000 4684 torch/utils/_pytree.py:630] <enum 'KernelPreference'> is an Enum subclass and is now natively supported by torch.compile as an opaque value type. Calling register_constant() on Enum subclasses is deprecated and will be an error in a future release.
W0523 12:00:39.225000 4684 torch/utils/_pytree.py:630] <enum 'ScaleCalculationMode'> is an Enum subclass and is now natively supported by torch.compile as an opaque value type. Calling register_constant() on Enum subclasses is deprecated and will be an error in a future release.
============== Export Configuration ==============
aot_backend : None
aot_compilation_config_dict : None
aot_soc_model : None
auto_model_override : None
batch_size : 1
bundle_litert_lm : 'true'
cache_implementation : 'LiteRTLMCache'
cache_length : 4096
cache_length_dim : None
enable_dynamic_shape : False
experimental_lightweight_conversion : False
experimental_use_mixed_precision : False
export_vision_encoder : False
externalize_embedder : False
externalize_rope : False
extra_kwargs : {}
jinja_chat_template_override : None
k_ts_idx : 2
keep_temporary_files : False
litert_lm_llm_metadata_override : None
litert_lm_model_type_override : None
model : 'Qwen/Qwen2.5-Coder-3B-Instruct'
output_dir : './static_wi8_ai16'
prefill_length_dim : None
prefill_lengths : [128]
quantization_recipe : 'static_wi8_ai16'
single_token_embedder : False
split_cache : False
task : <ExportTask.TEXT_GENERATION: 'text_generation'>
trust_remote_code : False
use_jinja_template : True
v_ts_idx : 3
vision_encoder_quantization_recipe : 'dynamic_wi8_afp32'
work_dir : './static_wi8_ai16/tmptlcjhpwr'
==================================================
(00:00) [START] LiteRT GenAI Export
(00:00) [START] LiteRT GenAI Export > Load source model
Loading weights: 100%|██████████| 434/434 [00:01<00:00, 266.46it/s]
(00:05) [ DONE] LiteRT GenAI Export > Load source model (+00:05)
(00:05) [START] LiteRT GenAI Export > Export text prefill-decode model
(00:05) [START] LiteRT GenAI Export > Export text prefill-decode model > LiteRT-Torch Convert
(00:05) [START] LiteRT GenAI Export > Export text prefill-decode model > LiteRT-Torch Convert > Torch Export: prefill_128
(00:07) [START] LiteRT GenAI Export > Export text prefill-decode model > LiteRT-Torch Convert > Torch Export: prefill_128 > ExportedProgram Run Decompositions
(00:10) [ DONE] LiteRT GenAI Export > Export text prefill-decode model > LiteRT-Torch Convert > Torch Export: prefill_128 > ExportedProgram Run Decompositions (+00:03)
(00:10) [ DONE] LiteRT GenAI Export > Export text prefill-decode model > LiteRT-Torch Convert > Torch Export: prefill_128 (+00:05)
(00:10) [START] LiteRT GenAI Export > Export text prefill-decode model > LiteRT-Torch Convert > Torch Export: decode
(00:12) [START] LiteRT GenAI Export > Export text prefill-decode model > LiteRT-Torch Convert > Torch Export: decode > ExportedProgram Run Decompositions
(00:15) [ DONE] LiteRT GenAI Export > Export text prefill-decode model > LiteRT-Torch Convert > Torch Export: decode > ExportedProgram Run Decompositions (+00:03)
(00:15) [ DONE] LiteRT GenAI Export > Export text prefill-decode model > LiteRT-Torch Convert > Torch Export: decode (+00:05)
(00:15) [START] LiteRT GenAI Export > Export text prefill-decode model > LiteRT-Torch Convert > Run FX Passes
(00:15) [START] LiteRT GenAI Export > Export text prefill-decode model > LiteRT-Torch Convert > Run FX Passes > ExportedProgram Run Decompositions
(00:15) [ DONE] LiteRT GenAI Export > Export text prefill-decode model > LiteRT-Torch Convert > Run FX Passes > ExportedProgram Run Decompositions (+00:00)
(00:16) [START] LiteRT GenAI Export > Export text prefill-decode model > LiteRT-Torch Convert > Run FX Passes > ExportedProgram Run Decompositions
(00:16) [ DONE] LiteRT GenAI Export > Export text prefill-decode model > LiteRT-Torch Convert > Run FX Passes > ExportedProgram Run Decompositions (+00:00)
(00:16) [ DONE] LiteRT GenAI Export > Export text prefill-decode model > LiteRT-Torch Convert > Run FX Passes (+00:00)
(00:16) [START] LiteRT GenAI Export > Export text prefill-decode model > LiteRT-Torch Convert > Lower to MLIR: prefill_128
(00:16) [START] LiteRT GenAI Export > Export text prefill-decode model > LiteRT-Torch Convert > Lower to MLIR: prefill_128 > ExportedProgram Run Decompositions
(00:20) [ DONE] LiteRT GenAI Export > Export text prefill-decode model > LiteRT-Torch Convert > Lower to MLIR: prefill_128 > ExportedProgram Run Decompositions (+00:03)
(00:20) [START] LiteRT GenAI Export > Export text prefill-decode model > LiteRT-Torch Convert > Lower to MLIR: prefill_128 > ExportedProgram Run Decompositions
(00:20) [ DONE] LiteRT GenAI Export > Export text prefill-decode model > LiteRT-Torch Convert > Lower to MLIR: prefill_128 > ExportedProgram Run Decompositions (+00:00)
(00:20) [START] LiteRT GenAI Export > Export text prefill-decode model > LiteRT-Torch Convert > Lower to MLIR: prefill_128 > Create MLIR Module
(00:26) [ DONE] LiteRT GenAI Export > Export text prefill-decode model > LiteRT-Torch Convert > Lower to MLIR: prefill_128 > Create MLIR Module (+00:06)
(00:26) [ DONE] LiteRT GenAI Export > Export text prefill-decode model > LiteRT-Torch Convert > Lower to MLIR: prefill_128 (+00:10)
(00:26) [START] LiteRT GenAI Export > Export text prefill-decode model > LiteRT-Torch Convert > Lower to MLIR: decode
(00:26) [START] LiteRT GenAI Export > Export text prefill-decode model > LiteRT-Torch Convert > Lower to MLIR: decode > ExportedProgram Run Decompositions
(00:30) [ DONE] LiteRT GenAI Export > Export text prefill-decode model > LiteRT-Torch Convert > Lower to MLIR: decode > ExportedProgram Run Decompositions (+00:03)
(00:30) [START] LiteRT GenAI Export > Export text prefill-decode model > LiteRT-Torch Convert > Lower to MLIR: decode > ExportedProgram Run Decompositions
(00:30) [ DONE] LiteRT GenAI Export > Export text prefill-decode model > LiteRT-Torch Convert > Lower to MLIR: decode > ExportedProgram Run Decompositions (+00:00)
(00:30) [START] LiteRT GenAI Export > Export text prefill-decode model > LiteRT-Torch Convert > Lower to MLIR: decode > Create MLIR Module
(00:33) [ DONE] LiteRT GenAI Export > Export text prefill-decode model > LiteRT-Torch Convert > Lower to MLIR: decode > Create MLIR Module (+00:03)
(00:33) [ DONE] LiteRT GenAI Export > Export text prefill-decode model > LiteRT-Torch Convert > Lower to MLIR: decode (+00:06)
(00:33) [START] LiteRT GenAI Export > Export text prefill-decode model > LiteRT-Torch Convert > Merge MLIR Modules
(00:33) [ DONE] LiteRT GenAI Export > Export text prefill-decode model > LiteRT-Torch Convert > Merge MLIR Modules (+00:00)
(00:33) [START] LiteRT GenAI Export > Export text prefill-decode model > LiteRT-Torch Convert > Run LiteRT Converter Passes
(02:52) [ DONE] LiteRT GenAI Export > Export text prefill-decode model > LiteRT-Torch Convert > Run LiteRT Converter Passes (+02:19)
(02:52) [ DONE] LiteRT GenAI Export > Export text prefill-decode model > LiteRT-Torch Convert (+02:47)
(02:54) [START] LiteRT GenAI Export > Export text prefill-decode model > Write Model to ./static_wi8_ai16/tmptlcjhpwr/model.tflite
Module size is greater than 2GB
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1779519814.308279 269705 flatbuffer_export.cc:4346] Estimated count of arithmetic ops: 719.641 G ops, equivalently 359.820 G MACs
(03:01) [ DONE] LiteRT GenAI Export > Export text prefill-decode model > Write Model to ./static_wi8_ai16/tmptlcjhpwr/model.tflite (+00:06)
(03:02) [START] LiteRT GenAI Export > Export text prefill-decode model > Quantize model
(03:02) [ FAIL] LiteRT GenAI Export > Export text prefill-decode model > Quantize model
(03:02) [ FAIL] LiteRT GenAI Export > Export text prefill-decode model
(03:02) [ FAIL] LiteRT GenAI Export
Traceback (most recent call last):
File "/opt/homebrew/bin/litert-torch", line 6, in <module>
sys.exit(main())
^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/litert_torch/cli.py", line 30, in main
fire.Fire(CLI())
File "/opt/homebrew/lib/python3.11/site-packages/fire/core.py", line 135, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/fire/core.py", line 468, in _Fire
component, remaining_args = _CallAndUpdateTrace(
^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/litert_torch/generative/export_hf/export.py", line 194, in export
exported_model_artifacts = run_export_tasks(
^^^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/python@3.11/3.11.15_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/contextlib.py", line 81, in inner
return func(*args, **kwds)
^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/litert_torch/generative/export_hf/export.py", line 67, in run_export_tasks
exported_model_artifacts = export_task(
^^^^^^^^^^^^
File "/opt/homebrew/Cellar/python@3.11/3.11.15_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/contextlib.py", line 81, in inner
return func(*args, **kwds)
^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/litert_torch/generative/export_hf/core/export_lib.py", line 353, in export_text_prefill_decode_model
model_path = maybe_quantize_model(model_path, recipe)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/litert_torch/generative/export_hf/core/export_lib.py", line 369, in maybe_quantize_model
return quantize_model(model_path, quantization_recipe)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/python@3.11/3.11.15_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/contextlib.py", line 81, in inner
return func(*args, **kwds)
^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/litert_torch/generative/export_hf/core/export_lib.py", line 394, in quantize_model
qt.quantize().export_model(quantized_model_path, overwrite=True)
^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/ai_edge_quantizer/quantizer.py", line 470, in quantize
quant_params = self._get_quantization_params(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/ai_edge_quantizer/quantizer.py", line 562, in _get_quantization_params
return params_generator_instance.generate_quantization_parameters(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/ai_edge_quantizer/params_generator.py", line 91, in generate_quantization_parameters
raise RuntimeError(
RuntimeError: Model quantization statistics values (QSVs) are required for the input recipe. This can be obtained by running calibration on sample dataset.
Anybody?
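For context on the final RuntimeError: a static recipe fixes activation ranges ahead of time, so the quantizer needs per-tensor statistics (the "QSVs" in the message) gathered by running sample inputs through the float model. The sketch below only illustrates that idea in plain Python. It is not ai_edge_quantizer's implementation, and every name in it (TensorStats, calibrate, symmetric_int16_scale, quantize) is hypothetical; it assumes simple min/max statistics and symmetric int16 activations.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence


@dataclass
class TensorStats:
    """Running min/max for one tensor: a toy stand-in for one QSV entry."""
    min_val: float = float("inf")
    max_val: float = float("-inf")

    def update(self, values: Sequence[float]) -> None:
        # Widen the recorded range to cover this batch of observed values.
        self.min_val = min(self.min_val, min(values))
        self.max_val = max(self.max_val, max(values))


def calibrate(batches: Iterable[Sequence[float]]) -> TensorStats:
    """Collect statistics over a sample dataset (hypothetical helper)."""
    stats = TensorStats()
    for batch in batches:
        stats.update(batch)
    return stats


def symmetric_int16_scale(stats: TensorStats) -> float:
    """Derive a symmetric int16 scale from the calibrated range."""
    abs_max = max(abs(stats.min_val), abs(stats.max_val))
    return abs_max / 32767.0 if abs_max > 0 else 1.0


def quantize(value: float, scale: float) -> int:
    """Map a float to int16, clamping values outside the calibrated range."""
    q = round(value / scale)
    return max(-32767, min(32767, q))


# Made-up activation samples standing in for a calibration dataset.
sample_batches = [[-0.5, 0.25, 1.0], [2.0, -4.0, 0.0]]
stats = calibrate(sample_batches)
scale = symmetric_int16_scale(stats)
```

Without the calibration pass there is no range to derive `scale` from, which is presumably what the check in params_generator.py guards against. Recipes that keep activations in float, such as the `dynamic_wi8_afp32` name that appears in the export configuration above, should not need these statistics, though I have not confirmed that this export path accepts it for the text model.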