THUDM/CogVLM · Error when loading the model (error: Killed)

Oct 25, 2023

GPU: A800 (80 GB)
System memory (RAM): 64 GB
The following log appears while loading the model, and the process ends with "Killed":
'''
[2023-10-25 09:14:47,988] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-10-25 09:14:50,747] [INFO] building CogVLMModel model ...
[2023-10-25 09:14:50,749] [INFO] [RANK 0] > initializing model parallel with size 1
[2023-10-25 09:14:50,750] [INFO] [RANK 0] You are using model-only mode.
For torch.distributed users or loading model parallel models, set environment variables RANK, WORLD_SIZE and LOCAL_RANK.
[2023-10-25 09:15:05,366] [INFO] [RANK 0] > number of parameters on model parallel rank 0: 17639685376
[2023-10-25 09:15:11,894] [INFO] [RANK 0] global rank 0 is loading checkpoint /home/user/CogVLM/main/CogVLM-main/cogvlm-chat/1/mp_rank_00_model_states.pt
Killed
'''
How should this situation be handled?

YoYo1234Qwerty

Oct 26, 2023

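Sounds like OOM (out of memory). A bare "Killed" with no Python traceback usually means the Linux OOM killer terminated the process because host RAM ran out, not GPU memory.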
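A rough back-of-the-envelope check, using the parameter count printed in the log above. The thread does not say which dtype the checkpoint is stored in or how much RAM was actually free, so the bytes-per-parameter values below are assumptions, and the "two copies" case assumes the freshly built model and the loaded checkpoint both sit in host RAM for a moment during loading.

```python
# Parameter count taken from the log line
# "number of parameters on model parallel rank 0: 17639685376".
NUM_PARAMS = 17_639_685_376
HOST_RAM_GIB = 64  # from the original post; treated here as GiB


def weights_gib(num_params: int, bytes_per_param: int) -> float:
    """Size of the raw weights in GiB for a given storage width."""
    return num_params * bytes_per_param / 2**30


# Assumed storage widths: 2 bytes (fp16/bf16) and 4 bytes (fp32).
fp16_gib = weights_gib(NUM_PARAMS, 2)
fp32_gib = weights_gib(NUM_PARAMS, 4)

# Assumption: initialized model + checkpoint held in RAM at the same time.
fp16_two_copies_gib = 2 * fp16_gib

for label, size in [
    ("fp16, one copy", fp16_gib),
    ("fp16, two copies", fp16_two_copies_gib),
    ("fp32, one copy", fp32_gib),
]:
    verdict = "exceeds" if size > HOST_RAM_GIB else "fits in"
    print(f"{label}: {size:.1f} GiB ({verdict} {HOST_RAM_GIB} GiB of RAM)")
```

Under these assumptions, a single half-precision copy fits in 64G, but a full-precision copy or two half-precision copies do not, which would be consistent with the process being killed while the checkpoint is read.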

LittleGreen

Oct 27, 2023

After troubleshooting, I confirmed that an OOM did occur.
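For anyone hitting the same thing, here is a minimal sketch of one way to check headroom before loading. It is not the troubleshooting used in this thread, just an illustration: the helper names `parse_meminfo` and `fits_in_ram` are hypothetical, and it assumes a Linux host where `/proc/meminfo` reports a `MemAvailable` field in kB.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field name: value in KiB}."""
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        # Keep only lines that look like "Name:   12345 kB".
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def fits_in_ram(required_bytes: int, meminfo_text: str) -> bool:
    """True if required_bytes is no larger than the reported MemAvailable."""
    available_kib = parse_meminfo(meminfo_text)["MemAvailable"]
    return required_bytes <= available_kib * 1024


# Illustrative sample text only, not values from the machine in this thread.
sample = "MemTotal:       67108864 kB\nMemAvailable:   58720256 kB\n"
print(fits_in_ram(40 * 2**30, sample))  # 40 GiB needed, 56 GiB available
print(fits_in_ram(66 * 2**30, sample))  # 66 GiB needed, 56 GiB available
```

On a real Linux machine, the text can be read with `open("/proc/meminfo").read()` and the required size estimated from the parameter count and the checkpoint dtype.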

LittleGreen changed discussion status to closed Oct 27, 2023