OPEA/Qwen2.5-3B-Instruct-int4-sym-inc

!auto-round
--model Qwen/Qwen2.5-3B-Instruct
--device 0
--group_size 64
--nsamples 128
--bits 4
--iters 500
--disable_eval
--model_dtype "fp16"
--format 'auto_gptq,auto_round'
--output_dir "/content/a"

COLAB T4

2024-12-18 23:32:38.980328: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-12-18 23:32:39.000587: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-12-18 23:32:39.007262: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-12-18 23:32:39.022033: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-12-18 23:32:40.284679: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-12-18 23:32:46,270 INFO config.py L54: PyTorch version 2.5.1+cu121 available.
2024-12-18 23:32:46,274 INFO config.py L66: Polars version 1.9.0 available.
2024-12-18 23:32:46,276 INFO config.py L77: Duckdb version 1.1.3 available.
2024-12-18 23:32:46,277 INFO config.py L112: TensorFlow version 2.17.1 available.
2024-12-18 23:32:46,278 INFO config.py L125: JAX version 0.4.33 available.
2024-12-18 23:32:46 INFO llm.py L318: start to quantize Qwen/Qwen2.5-3B-Instruct
2024-12-18 23:32:46 INFO utils.py L573: Using GPU device
Loading checkpoint shards: 100% 2/2 [00:04<00:00, 2.22s/it]
2024-12-18 23:33:35 INFO autoround.py L230: using torch.float16 for quantization tuning
2024-12-18 23:33:35 INFO autoround.py L300: start to cache block inputs
Filter: 20% 2000/10000 [00:02<00:09, 887.47 examples/s] ^C

OPEA
/

Qwen2.5-3B-Instruct-int4-sym-inc

STOP