Can I specify the number of threads used for CPU inference?

#13 opened by byzp

CPU inference seems to use only half of the available CPU threads by default. Can I increase it to get faster speed?

Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University org

Of course you can.
Pass parallel_num=your_threads_num to quantize() when quantizing.
Or, if your model has already been loaded, call quantize() again to reset the number of CPU cores used by the quantization kernels:

from transformers import AutoModel

model = AutoModel.from_pretrained("THUDM/chatglm-6b-int4", trust_remote_code=True).cpu().float()
model = model.quantize(bits=4, parallel_num=your_threads_num)  # your_threads_num: desired CPU thread count

However, an inappropriate parallel_num can hurt efficiency; it is not recommended to exceed the number of cores.
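For example, a minimal sketch of picking a value automatically, assuming os.cpu_count() reflects the CPUs actually available to the process (it reports logical CPUs, which on hyperthreaded machines may be higher than the physical core count):

import os

# Use the reported logical CPU count as an upper bound for parallel_num
parallel_num = os.cpu_count() or 1
model = model.quantize(bits=4, parallel_num=parallel_num)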
