Strange error while running model

#3
by bezale

@TheBloke, maybe you know a quick fix for

ggml_new_object: not enough space in the context's memory pool (needed 1638880, available 1638544)

which I get while trying to run the goliath-120b.Q2_K.gguf model with llama-cpp-python?
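
For context, here is roughly the loading call that triggers it; a minimal sketch, assuming a local model path, with n_ctx and n_gpu_layers taken from the log below:

from llama_cpp import Llama

# assumed local path; n_ctx and n_gpu_layers mirror the values in the log below
# the ggml_new_object error fires inside this constructor
llm = Llama(
    model_path="./goliath-120b.Q2_K.gguf",
    n_ctx=2048,
    n_gpu_layers=130,
)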

Below is the model loading log:

llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 137
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 28672
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = mostly Q2_K
llm_load_print_meta: model params     = 117.75 B
llm_load_print_meta: model size       = 46.22 GiB (3.37 BPW) 
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.45 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  = 2691.22 MB
llm_load_tensors: offloading 130 repeating layers to GPU
llm_load_tensors: offloaded 130/140 layers to GPU
llm_load_tensors: VRAM used: 44638.75 MB
....................................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size  = 1096.00 MB
ggml_new_object: not enough space in the context's memory pool (needed 1638880, available 1638544)

I've tried changing the number of GPU layers and the context length, but nothing helps; it's always the same error with the same numbers.

Thanks!

Same exact error attempting this model on RunPod. I genuinely have no clue what's causing it. It works on my main machine...

Same problem here: running goliath-120b.Q6_K.gguf with ctransformers on a 2x Xeon, 128 GB RAM, 8 GB NVIDIA GPU.

It seems to me that the same value that was increased in llama.cpp (presumably LLAMA_MAX_NODES, the graph node limit) needs to be increased somewhere in the ctransformers library as well.
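
For reference, this is roughly the ctransformers invocation in question; a minimal sketch, assuming a local file path and a small gpu_layers value for the 8 GB card:

from ctransformers import AutoModelForCausalLM

# assumed local path; gpu_layers is ctransformers' counterpart to llama.cpp's n_gpu_layers
llm = AutoModelForCausalLM.from_pretrained(
    "./goliath-120b.Q6_K.gguf",
    model_type="llama",
    gpu_layers=10,        # kept small to fit in 8 GB of VRAM
    context_length=2048,
)
print(llm("Hello", max_new_tokens=16))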

Problem solved using llama-cpp-python, without any changes to the llama.cpp source code. Now I have to figure out how to send some layers to the GPU... noob issues :) Thanks!
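
In case it helps others, a minimal sketch of the offloading part with llama-cpp-python; the path is an assumption, and n_gpu_layers is the knob for sending layers to the GPU:

from llama_cpp import Llama

llm = Llama(
    model_path="./goliath-120b.Q2_K.gguf",  # assumed local path
    n_ctx=2048,
    n_gpu_layers=-1,  # -1 offloads all layers; use a smaller count if VRAM is limited
)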

