Time costs
How long did it take you to quantize your model? I have spent several hours...
I don't remember exactly but I would expect about 30 minutes. Several hours sounds bad. What are you using to quantise, AutoGPTQ or GPTQ-for-Llama? And what stage of the process are you on?
I have found some systems where the "packing model" stage seems to take forever, because the CPU is too weak. On those systems I usually abort it and start again on another system.
Tell me what the HW is and show me a screenshot of the output and I can advise you more.
Out of interest, why are you quantising your own? Did you want different quantisation parameters? This is one model where I didn't go back and add extra quantisation options (like group_size + desc_act), but I could do that if there is demand for them.
I set bits to 4 and 8 separately, and group_size to 16, when using AutoGPTQ. The model seems to be loaded on cuda:0, and it takes about 3 hours to finish quantizing.
I want to quantize with different parameters so I can compare inference speeds. By the way, have you tested the speed of starcoderplus-GPTQ?
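(For reference, bits=4 with group_size=16 in AutoGPTQ corresponds roughly to the sketch below; the model path and calibration text are placeholders, not the actual setup used in this thread.)

```python
# Minimal AutoGPTQ quantisation sketch: 4-bit weights, group_size 16.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "/path/to/source/model"  # placeholder
quantized_model_dir = "/path/to/output-gptq"    # placeholder

quantize_config = BaseQuantizeConfig(
    bits=4,          # 4-bit weights (use 8 for the 8-bit run)
    group_size=16,   # smaller groups: better accuracy, larger files, slower packing
    desc_act=False,
)

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
examples = [tokenizer("Placeholder calibration text.")]  # real runs use a proper dataset

model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)                 # GPTQ pass on GPU, then packing on CPU
model.save_quantized(quantized_model_dir)
```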
Try doing this before running your Python quant script:
```
OMP_NUM_THREADS=16 OPENBLAS_NUM_THREADS=16 MKL_NUM_THREADS=16 VECLIB_MAXIMUM_THREADS=16 NUMEXPR_NUM_THREADS=16
export OMP_NUM_THREADS OPENBLAS_NUM_THREADS MKL_NUM_THREADS VECLIB_MAXIMUM_THREADS NUMEXPR_NUM_THREADS
```
Or add the equivalent `os.environ[...] = ...` calls to the Python code. But if you do that, you must put them at the very top of the script, before any other imports are done (except `os`); they only take effect if they are set before the code that reads them is imported.
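For example, a minimal sketch (the thread count of 16 just mirrors the shell example above):

```python
# These must be the very first statements in the script, before torch /
# transformers / auto_gptq are imported, or the thread limits are ignored.
import os

for var in (
    "OMP_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "MKL_NUM_THREADS",
    "VECLIB_MAXIMUM_THREADS",
    "NUMEXPR_NUM_THREADS",
):
    os.environ[var] = "16"

# Only now import the heavy libraries.
import torch  # noqa: E402
```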
If you're using my `quant_autogptq.py` script, you can do this:
```
OMP_NUM_THREADS=16 OPENBLAS_NUM_THREADS=16 MKL_NUM_THREADS=16 VECLIB_MAXIMUM_THREADS=16 NUMEXPR_NUM_THREADS=16 CUDA_VISIBLE_DEVICES=0 python3 /workspace/ProcessQuantization/quant_autogptq.py /path/to/source/model /path/to/output-gptq wikitext --bits 4 --group_size 128 --desc_act 1 --damp 0.1 --dtype float16 --seqlen 4096 --num_samples 128 --use_fast --cache_examples 1
```
This often helps because, by default, the code will try to use all available CPU cores during the packing stage. But on Runpod you usually only get a fraction of the host's CPUs: e.g. if the host has 128 CPU cores and 8 GPUs and you rent 1 GPU, you get 128/8 = 16 CPU cores.
The problem is that inside Docker, any call that asks for the total number of CPU cores will see the number of cores on the host, not the number allocated to the container.
The result is that the packing stage can over-saturate the CPU by trying to use e.g. 128 threads when only 16 cores are actually available. The environment variables above tell it to use a specific number of threads instead.
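You can see the mismatch from inside the container with a quick check like this (Linux only; note that depending on how the pod limits CPU, via cpuset vs. quota, the second number may or may not reflect the real allocation):

```python
import os

# What most libraries see inside the container: the host's core count.
print("os.cpu_count():", os.cpu_count())                  # e.g. 128

# The cores this process is actually allowed to run on.
print("sched_getaffinity:", len(os.sched_getaffinity(0)))  # e.g. 16 on a 1-GPU pod
```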
Thanks a lot for your help! I have spent a fortune trying to understand this, and it makes a lot of sense now.