(venv2) D:\modelopt-windows-scripts\ONNX_PTQ>python D:\opset21_patrice.py --onnx_path="D:\GenAI\models\FP16_Mistral-Nemo-Instruct-2407_ONNX\model.onnx" --output_path="D:\GenAI\models\FP16_Mistral-Nemo-Instruct-2407_ONNX\opset_21\model.onnx"
Printing opset info of given input model...
Domain:                Version: 14
Domain: com.microsoft  Version: 1
Printing opset info of output model...
Domain:                Version: 21
Domain: com.microsoft  Version: 1

(venv2) D:\modelopt-windows-scripts\ONNX_PTQ>python quantize_script.py --model_name=mistralai/Mistral-Nemo-Instruct-2407 --onnx_path=D:\GenAI\models\FP16_Mistral-Nemo-Instruct-2407_ONNX\opset_21\model.onnx --output_path="D:\GenAI\models\FP16_Mistral-Nemo-Instruct-2407_ONNX\opset_21\default_quant_cuda_ep_calib\model.onnx" --calibration_eps=cuda
--Quantize-Script-- algo=awq_lite, dataset=cnn, calib_size=32, batch_size=1, block_size=128, add-position-ids=True, past-kv=True, rcalib=False, device=cpu, use_zero_point=False
--Quantize-Script-- awqlite_alpha_step=0.1, awqlite_fuse_nodes=False, awqlite_run_per_subgraph=False, awqclip_alpha_step=0.05, awqclip_alpha_min=0.5, awqclip_bsz_col=1024, calibration_eps=['cuda']
D:\venv2\Lib\site-packages\transformers\models\auto\configuration_auto.py:1002: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.
  warnings.warn(
D:\venv2\Lib\site-packages\transformers\models\auto\tokenization_auto.py:809: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.
  warnings.warn(
--Quantize-Script-- number_of_batched_samples=32, batch-input-ids-list-len=32, batched_attention_mask=32
--Quantize-Script-- number of batched inputs = 32
INFO:root: Quantizing the model....
INFO:root:Quantization Mode: int4
INFO:root:Finding quantizable weights and augmenting graph output with input activations
INFO:root:Augmenting took 0.031656503677368164 seconds
INFO:root:Saving the model took 60.20284128189087 seconds
2024-11-05 22:37:34.5783341 [W:onnxruntime:, transformer_memcpy.cc:74 onnxruntime::MemcpyTransformer::ApplyImpl] 11 Memcpy nodes are added to the graph main_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2024-11-05 22:37:34.5949880 [W:onnxruntime:, session_state.cc:1168 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-11-05 22:37:34.6026375 [W:onnxruntime:, session_state.cc:1170 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
Getting activation names maps...: 100%|██████████████████████████████████████████████████████| 280/280 [00:00<?, ?it/s]
Running AWQ scale search per node...: 100%|██████████████████████████████████████████| 280/280 [17:50<00:00,  3.82s/it]
INFO:root:AWQ scale search took 1070.4731740951538 seconds
Quantizing the weights...: 100%|█████████████████████████████████████████████████████| 280/280 [00:15<00:00, 17.78it/s]
INFO:root:Quantizing actual weights took 15.744078636169434 seconds
INFO:root:Inserting DQ nodes and input_pre_quant_scale node using quantized weights and scales ...
INFO:root:Inserting nodes took 0.17318105697631836 seconds
INFO:root:Exporting the quantized graph ...
Loading extension modelopt_round_and_pack_ext...
INFO:root:Exporting took 59.45134162902832 seconds
INFO:root: Quantization process took 1223.9775414466858 seconds
INFO:root:Saving to D:\GenAI\models\FP16_Mistral-Nemo-Instruct-2407_ONNX\opset_21\default_quant_cuda_ep_calib\model.onnx took 9.476586818695068 seconds
Done

(venv2) D:\modelopt-windows-scripts\ONNX_PTQ>pip list
Package              Version
-------------------- -------------------------
aiohappyeyeballs     2.4.3
aiohttp              3.10.10
aiosignal            1.3.1
annotated-types      0.7.0
attrs                24.2.0
certifi              2024.8.30
charset-normalizer   3.4.0
cloudpickle          3.1.0
colorama             0.4.6
coloredlogs          15.0.1
cppimport            22.8.2
cupy-cuda12x         13.3.0
datasets             3.1.0
dill                 0.3.8
fastrlock            0.8.2
filelock             3.16.1
flatbuffers          24.3.25
frozenlist           1.5.0
fsspec               2024.9.0
huggingface-hub      0.26.2
humanfriendly        10.0
idna                 3.10
Jinja2               3.1.4
Mako                 1.3.6
markdown-it-py       3.0.0
MarkupSafe           3.0.2
mdurl                0.1.2
mpmath               1.3.0
multidict            6.1.0
multiprocess         0.70.16
networkx             3.4.2
ninja                1.11.1.1
numpy                1.26.4
nvidia-modelopt      0.20.1.dev20+g299b7f8a098
onnx                 1.16.0
onnx-graphsurgeon    0.5.2
onnxconverter-common 1.14.0
onnxmltools          1.12.0
onnxruntime-gpu      1.20.0
packaging            24.1
pandas               2.2.3
pip                  24.0
propcache            0.2.0
protobuf             3.20.2
pyarrow              18.0.0
pybind11             2.13.6
pydantic             2.9.2
pydantic_core        2.23.4
Pygments             2.18.0
pyreadline3          3.5.4
python-dateutil      2.9.0.post0
pytz                 2024.2
PyYAML               6.0.2
regex                2024.9.11
requests             2.32.3
rich                 13.9.4
safetensors          0.4.5
scipy                1.14.1
setuptools           65.5.0
six                  1.16.0
sympy                1.13.3
tokenizers           0.20.2
torch                2.4.0
tqdm                 4.66.6
transformers         4.46.1
typing_extensions    4.12.2
tzdata               2024.2
urllib3              2.2.3
xxhash               3.5.0
yarl                 1.17.1

[notice] A new release of pip is available: 24.0 -> 24.3.1
[notice] To update, run: python.exe -m pip install --upgrade pip

(venv2) D:\modelopt-windows-scripts\ONNX_PTQ>
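
Note: neither opset21_patrice.py nor quantize_script.py is included in this log, so the snippet below only sketches the two pieces that the output itself implies: printing a model's opset imports (as in the "Printing opset info ..." lines), and re-opening the INT4 output on the CUDA execution provider with session_options.log_severity_level=1, which the onnxruntime warning suggests for seeing which nodes fall back to CPU. The paths are copied from the log; the helper name print_opset_info and everything else here are illustrative assumptions, not the actual scripts.

    # Sketch only; assumes onnx 1.16 and onnxruntime-gpu 1.20 as listed above.
    import onnx
    import onnxruntime as ort

    OPSET21_MODEL = r"D:\GenAI\models\FP16_Mistral-Nemo-Instruct-2407_ONNX\opset_21\model.onnx"
    QUANT_MODEL = (r"D:\GenAI\models\FP16_Mistral-Nemo-Instruct-2407_ONNX"
                   r"\opset_21\default_quant_cuda_ep_calib\model.onnx")

    def print_opset_info(path: str) -> None:
        """Print domain/version pairs, like the 'Printing opset info' lines in the log."""
        model = onnx.load(path, load_external_data=False)  # headers only, weights not needed
        for opset in model.opset_import:
            print(f"Domain: {opset.domain}  Version: {opset.version}")

    print_opset_info(OPSET21_MODEL)

    # Inspect how the quantized graph is partitioned between the CUDA and CPU EPs.
    so = ort.SessionOptions()
    so.log_severity_level = 1  # INFO level; shows the node assignments hinted at in the warnings
    session = ort.InferenceSession(
        QUANT_MODEL,
        sess_options=so,
        providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
    )
    print([i.name for i in session.get_inputs()])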