(modelopt) PS E:\ModelOpt_Windows_Scripts_2\modelopt-windows-scripts\ONNX_PTQ> python quantize_script.py --model_name=nvidia/Nemotron-Mini-4B-Instruct  --onnx_path=E:\model_store\genai\nemotron-mini-4b-instruct-fp16-dml-genai\opset_21\model.onnx --output_path="E:\model_store\genai\nemotron-mini-4b-instruct-fp16-dml-genai\opset_21\default_quant_dml_ep_calib\model.onnx"

--Quantize-Script-- algo=awq_lite, dataset=cnn, calib_size=32, batch_size=1, block_size=128, add-position-ids=True, past-kv=True, rcalib=False, device=cpu, use_zero_point=False



--Quantize-Script-- awqlite_alpha_step=0.1, awqlite_fuse_nodes=False, awqlite_run_per_subgraph=False, awqclip_alpha_step=0.05, awqclip_alpha_min=0.5, awqclip_bsz_col=1024, calibration_eps=['dml']
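The `awqlite_alpha_step=0.1` setting above refers to the grid step used when searching for a per-channel scaling exponent. As a rough illustration of what an AWQ-style scale search does (the function names `awq_scale_candidates` and `find_best_alpha` are hypothetical, not ModelOpt's real API; the scale formula is the commonly published AWQ form, assumed here):

```python
import numpy as np

def awq_scale_candidates(act_amax, w_amax, alpha_step=0.1):
    """Yield (alpha, per-channel scale) pairs for the grid search.

    scale = act_amax**alpha / w_amax**(1 - alpha), the usual AWQ form;
    alpha is swept from 0 to 1 in steps of alpha_step (0.1 above).
    """
    for alpha in np.arange(0.0, 1.0 + 1e-9, alpha_step):
        scale = np.power(act_amax, alpha) / np.power(w_amax, 1.0 - alpha)
        yield alpha, np.clip(scale, 1e-4, 1e4)

def find_best_alpha(weight, act_amax, quantize_fn, alpha_step=0.1):
    """Pick the alpha whose scale-quantize-rescale round trip minimizes
    mean weight reconstruction error (a simple proxy loss)."""
    w_amax = np.abs(weight).max(axis=0) + 1e-8  # per output channel
    best_alpha, best_err = None, np.inf
    for alpha, s in awq_scale_candidates(act_amax, w_amax, alpha_step):
        w_scaled = weight * s
        err = np.abs(quantize_fn(w_scaled) / s - weight).mean()
        if err < best_err:
            best_alpha, best_err = alpha, err
    return best_alpha
```

With `alpha_step=0.1` this evaluates 11 candidate scales per weight, which is why the per-node scale search below dominates the run time (~1.6 s/it over 192 nodes).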

C:\Users\vrl\miniconda3\envs\modelopt\Lib\site-packages\transformers\models\auto\configuration_auto.py:1002: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.
  warnings.warn(
C:\Users\vrl\miniconda3\envs\modelopt\Lib\site-packages\transformers\models\auto\tokenization_auto.py:809: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.
  warnings.warn(

--Quantize-Script-- number_of_batched_samples=32, batch-input-ids-list-len=32, batched_attention_mask=32


--Quantize-Script-- number of batched inputs = 32
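The batching stage logged above produces 32 calibration feeds at `batch_size=1`, each carrying `input_ids`, `attention_mask`, and, since `add-position-ids=True`, explicit `position_ids`. A minimal sketch of that shape (the helper `build_calib_inputs` is illustrative, not the script's actual code; the `cumsum(mask) - 1` convention for position ids is an assumption, albeit a common one for GenAI ONNX models):

```python
import numpy as np

def build_calib_inputs(token_id_lists, add_position_ids=True):
    """Turn tokenized calibration samples into per-sample ONNX feed dicts."""
    feeds = []
    for ids in token_id_lists:
        input_ids = np.asarray([ids], dtype=np.int64)   # batch_size=1
        attention_mask = np.ones_like(input_ids)
        feed = {"input_ids": input_ids, "attention_mask": attention_mask}
        if add_position_ids:
            # cumulative-sum convention: position i gets index i for a
            # fully unmasked sequence
            feed["position_ids"] = attention_mask.cumsum(axis=1) - 1
        feeds.append(feed)
    return feeds
```

Each feed dict can then be passed directly to an ONNX Runtime session during calibration.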

INFO:root:
Quantizing the model....

INFO:root:Quantization Mode: int4
INFO:root:Finding quantizable weights and augmenting graph output with input activations
INFO:root:Augmenting took 0.03900003433227539 seconds
INFO:root:Saving the model took 35.37520098686218 seconds
2024-11-05 06:08:38.8247274 [W:onnxruntime:, session_state.cc:1168 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-11-05 06:08:38.8385074 [W:onnxruntime:, session_state.cc:1170 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
Getting activation names maps...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 192/192 [00:00<?, ?it/s]
Running AWQ scale search per node...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 192/192 [05:08<00:00,  1.61s/it]
INFO:root:AWQ scale search took 308.7233784198761 seconds
Quantizing the weights...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 192/192 [00:05<00:00, 32.75it/s]
INFO:root:Quantizing actual weights took 5.864110231399536 seconds
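The weight-quantization step above uses `block_size=128` with `use_zero_point=False`, i.e. symmetric INT4 where each block of 128 elements shares one scale. A hedged sketch of that scheme (function names are illustrative; ModelOpt's actual packing differs, e.g. two int4 values per byte):

```python
import numpy as np

def quantize_int4_blocks(w, block_size=128):
    """Symmetric INT4 block quantization of a flat weight array.

    Each block of `block_size` values shares one scale; values map to
    the signed 4-bit range [-8, 7] with no zero point.
    """
    pad = (-len(w)) % block_size
    w = np.pad(w, (0, pad))
    blocks = w.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0 + 1e-8
    q = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize_int4_blocks(q, scales):
    """Recover the approximate float weights (what the DQ node computes)."""
    return (q.astype(np.float32) * scales).reshape(-1)
```

The `DequantizeLinear` nodes inserted in the next step perform exactly this per-block rescaling at load time, which is why the exported graph stores int4 weights plus a small scale tensor per initializer.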
INFO:root:Inserting DQ nodes and input_pre_quant_scale node using quantized weights and scales ...
INFO:root:Inserting nodes took 0.1272134780883789 seconds
INFO:root:Exporting the quantized graph ...
Loading extension modelopt_round_and_pack_ext...

INFO:root:Exporting took 33.892990589141846 seconds
INFO:root:
Quantization process took 394.4490396976471 seconds
INFO:root:Saving to E:\model_store\genai\nemotron-mini-4b-instruct-fp16-dml-genai\opset_21\default_quant_dml_ep_calib\model.onnx took 33.43196678161621 seconds

Done

(modelopt) PS E:\ModelOpt_Windows_Scripts_2\modelopt-windows-scripts\ONNX_PTQ>