Avoid Loading the CPU Kernel if the User Has a GPU and CUDA Environment

#8
by rdo4920 - opened

Thank you for providing this model for low-GPU-memory users.

There is room for improvement: I ran into several issues while setting up the environment on a Windows 10 machine. It turns out the model can run in a plain Win10 environment without a gcc compiler or WSL support; you just need to skip the CPU kernel loading process.

To achieve this, modify the 'load_cpu_kernel' function in the 'quantization.py' file located in the model folder 'chatglm-6b-int4', so that the actual loading is skipped when a GPU device is available:

def load_cpu_kernel(**kwargs):
    if not torch.cuda.is_available():  # only load the CPU kernel when no CUDA device is available
        global cpu_kernels
        cpu_kernels = CPUKernel(**kwargs)
        assert cpu_kernels.load

If the user does not have a GPU, the normal CPU kernel loading process still runs, which requires gcc and WSL on a Windows machine.

After making the above modification, users can load the model from any front-end code without hitting the 'assert cpu_kernels.load' error. For example, in the chatglm-webui project, simply download the model folder, name it 'chatglm-6b-int4', and load it on a Windows 10 machine with the following command (assuming CUDA and the required Python libraries are already installed):

python webui.py --model-path chatglm-6b-int4 --precision int4

I successfully loaded the model on a Win10 machine with a 2080 (8 GB) GPU, without gcc or WSL installed.
Thanks again for this awesome model!

Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University org

It is possible that someone has a GPU but wants to use CPU inference (for example, when GPU memory is not enough to load the model).
Currently, if the load_cpu_kernel method fails, the exception is caught and the program only prints a warning; the program fails only if both the CPU kernel loading and the GPU kernel loading fail. Therefore, I don't think it is necessary to skip CPU kernel loading based on CUDA availability.
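For reference, the fallback behavior described here amounts to roughly the following control flow (a minimal sketch only; init_quantization_kernels, load_gpu_kernel, and the messages are illustrative placeholders, not the actual code in quantization.py):

def init_quantization_kernels(**kwargs):
    cpu_ok = True
    try:
        load_cpu_kernel(**kwargs)  # may fail, e.g. no gcc on Windows
    except Exception:
        print("Warning: failed to load the CPU quantization kernel")
        cpu_ok = False
    gpu_ok = True
    try:
        load_gpu_kernel()  # hypothetical GPU counterpart
    except Exception:
        gpu_ok = False
    # the program only aborts when both paths fail
    if not (cpu_ok or gpu_ok):
        raise RuntimeError("failed to load both the CPU and GPU kernels")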

zxdu20 changed discussion status to closed

Not all load_cpu_kernel calls are wrapped in a try/except block, though.

For example, the call on line 1430 of modeling_chatglm.py triggers the loading directly, which raises an AssertionError on 'assert cpu_kernels.load'.
An issue has already been reported about the same error: https://github.com/THUDM/ChatGLM-6B/issues/676

A possible fix would be to place the try/except block inside the 'load_cpu_kernel' function itself, as sketched below.
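Something along these lines (a sketch only; the exact warning message and cleanup behavior are illustrative):

def load_cpu_kernel(**kwargs):
    global cpu_kernels
    try:
        cpu_kernels = CPUKernel(**kwargs)
        assert cpu_kernels.load
    except Exception:
        # swallow the failure here so callers that only need the GPU
        # kernel are not aborted by a CPU-related AssertionError
        cpu_kernels = None
        print("Warning: failed to load the CPU quantization kernel")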

It's not a major issue, but it is quite confusing: the error appears to be CPU-related, yet the user may not want to use the CPU at all.
Just my two cents.

rdo4920 changed discussion status to open
Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University org

Thank you for your advice. I removed the assert in load_cpu_kernel.
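Presumably the updated function now behaves roughly like this (a sketch inferred from the description above, not the exact committed code):

def load_cpu_kernel(**kwargs):
    global cpu_kernels
    cpu_kernels = CPUKernel(**kwargs)
    # without 'assert cpu_kernels.load', a failed compilation no longer
    # raises an AssertionError here; GPU-only users can proceed, and the
    # problem surfaces only if CPU inference is actually attempted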

rdo4920 changed discussion status to closed
