Error when loading the model (error: Killed)

#2
by LittleGreen - opened

GPU: A800 (80 GB)
Memory: 64 GB
The following error log appears while loading the model:
'''
[2023-10-25 09:14:47,988] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-10-25 09:14:50,747] [INFO] building CogVLMModel model ...
[2023-10-25 09:14:50,749] [INFO] [RANK 0] > initializing model parallel with size 1
[2023-10-25 09:14:50,750] [INFO] [RANK 0] You are using model-only mode.
For torch.distributed users or loading model parallel models, set environment variables RANK, WORLD_SIZE and LOCAL_RANK.
[2023-10-25 09:15:05,366] [INFO] [RANK 0] > number of parameters on model parallel rank 0: 17639685376
[2023-10-25 09:15:11,894] [INFO] [RANK 0] global rank 0 is loading checkpoint /home/user/CogVLM/main/CogVLM-main/cogvlm-chat/1/mp_rank_00_model_states.pt
Killed
'''
How should this situation be handled?

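Sounds like OOM (out of memory). A bare "Killed" with no Python traceback right after the checkpoint starts loading usually means the OS killed the process because the host ran out of RAM, not GPU memory.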
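To make the OOM guess concrete, here is a rough back-of-the-envelope sketch in Python. The parameter count is taken from the log above; the bytes-per-parameter values and the reading of "64G" as 64 GiB of system RAM are assumptions, since the thread does not state the checkpoint dtype. It only counts the weights themselves, so the real peak during loading may be higher.

```python
# Rough estimate of the host RAM needed just to hold the model weights.
# PARAM_COUNT comes from the log line "number of parameters on model parallel rank 0".
PARAM_COUNT = 17_639_685_376
HOST_RAM_GIB = 64  # assumption: the "64G" in the post means 64 GiB of system RAM


def weights_gib(num_params: int, bytes_per_param: int) -> float:
    """Size of the raw weights in GiB for a given storage width."""
    return num_params * bytes_per_param / 2**30


def fits_in_ram(num_params: int, bytes_per_param: int, ram_gib: float) -> bool:
    """True if the raw weights alone are smaller than the available RAM."""
    return weights_gib(num_params, bytes_per_param) < ram_gib


if __name__ == "__main__":
    # Assumed dtypes; the thread does not say which one the checkpoint uses.
    for name, width in (("fp32", 4), ("fp16/bf16", 2)):
        size = weights_gib(PARAM_COUNT, width)
        verdict = "fits" if fits_in_ram(PARAM_COUNT, width, HOST_RAM_GIB) else "does not fit"
        print(f"{name}: {size:.1f} GiB of weights, {verdict} in {HOST_RAM_GIB} GiB")
```

If the checkpoint is stored in 32-bit floats, the weights alone exceed 64 GiB, which would be consistent with the process being killed while reading mp_rank_00_model_states.pt.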

After troubleshooting, I confirmed that an OOM did occur.
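The thread does not say how the OOM was confirmed. For anyone hitting the same "Killed" message, one possible check on Linux is to look at host memory before and during loading. The sketch below reads /proc/meminfo; the helper names are made up for illustration, and it assumes a Linux host. The kernel log (for example via dmesg) normally also records when the OOM killer terminates a process.

```python
# Minimal sketch: report total and available host RAM from /proc/meminfo (Linux only).
# parse_meminfo and read_meminfo are illustrative helper names, not part of any library.


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo text into {field: value}; kB values are converted to bytes."""
    fields = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if not parts:
            continue
        value = int(parts[0])
        if len(parts) > 1 and parts[1] == "kB":
            value *= 1024
        fields[key.strip()] = value
    return fields


def read_meminfo(path: str = "/proc/meminfo") -> dict[str, int]:
    """Read and parse the memory statistics exposed by the Linux kernel."""
    with open(path) as f:
        return parse_meminfo(f.read())


if __name__ == "__main__":
    info = read_meminfo()
    print(f"MemTotal:     {info['MemTotal'] / 2**30:.1f} GiB")
    print(f"MemAvailable: {info['MemAvailable'] / 2**30:.1f} GiB")
```

Running this in a second terminal while the model loads shows whether available memory drops toward zero just before the process is killed.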

LittleGreen changed discussion status to closed
