Maximum context length mismatch

#1
by rayyd - opened

The maximum context length is shown as 4096 in the original model card, and I'm not sure how they achieved that. What is the maximum context length for this GGML version?

I believe it's also 4096, although I do get a warning when I try that:

[pytorch2] ubuntu@h100:/workspace/process $ /workspace/git/llama.cpp/main -c 4096 -m airoboros-13b/ggml/airoboros-13b-gpt4.ggmlv3.q4_0.bin -n 4096 -p "USER: write a story about llamas\nASSISTANT:"
main: warning: model does not support context sizes greater than 2048 tokens (4096 specified);expect poor results
main: build = 611 (dcb2ed4)
main: seed  = 1685887419
llama.cpp: loading model from airoboros-13b/ggml/airoboros-13b-gpt4.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 4096
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.09 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  = 9031.70 MB (+ 1608.00 MB per state)
llama_model_load_internal: offloading 0 layers to GPU
llama_model_load_internal: total VRAM used: 0 MB
.
llama_init_from_file: kv self size  = 3200.00 MB

system_info: n_threads = 13 / 26 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 4096, n_batch = 512, n_predict = 4096, n_keep = 0


 USER: write a story about llamas\nASSISTANT: Once upon a time, i .......

But that may just be because the GGML file carries no metadata telling llama.cpp that it's a 4096-context model.
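For anyone curious, here's a minimal sketch (not an official tool) that dumps the GGJT v3 header the way llama.cpp reads it, assuming the layout used by this build: a uint32 magic, a uint32 version, then seven uint32 hyperparameters. Notably there is no context-length field at all, which is why the n_ctx printed in the log comes purely from the -c flag and why the 2048 warning is hard-coded.

```python
import struct
import sys

# Sketch: print the GGJT v3 hyperparameters from a llama.cpp GGML file.
# Assumed header layout: uint32 magic ('ggjt'), uint32 version, then
# n_vocab, n_embd, n_mult, n_head, n_layer, n_rot, ftype (all uint32).
path = sys.argv[1]  # e.g. airoboros-13b-gpt4.ggmlv3.q4_0.bin

with open(path, "rb") as f:
    magic, version = struct.unpack("<II", f.read(8))
    assert magic == 0x67676A74, "not a GGJT file"  # 'ggjt'
    n_vocab, n_embd, n_mult, n_head, n_layer, n_rot, ftype = struct.unpack(
        "<7I", f.read(28)
    )

print(f"ggjt v{version}: n_vocab={n_vocab} n_embd={n_embd} n_mult={n_mult}")
print(f"n_head={n_head} n_layer={n_layer} n_rot={n_rot} ftype={ftype}")
# Note: no n_ctx here -- the file cannot advertise a 4096 training context,
# so llama.cpp only knows whatever is passed on the command line.
```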

I will do some testing to try to work out whether it really does have a 4096 context.
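One rough way to test this (just a sketch, assuming llama-cpp-python built against the same llama.cpp; I haven't verified it against this exact model) is to feed a prompt well past 2048 tokens at n_ctx=4096 and see whether the continuation stays coherent:

```python
from llama_cpp import Llama  # assumes llama-cpp-python wrapping the same llama.cpp build

# Load with a 4096 context; a 2048-only model should degrade badly past 2048.
llm = Llama(model_path="airoboros-13b-gpt4.ggmlv3.q4_0.bin", n_ctx=4096)

# Build a prompt comfortably longer than 2048 tokens (roughly 2,500-3,000 here).
filler = "The llama walked slowly across the field, chewing grass. " * 220
prompt = f"USER: {filler}\nSummarise the story above in one sentence.\nASSISTANT:"

out = llm(prompt, max_tokens=64, temperature=0.8)
print(out["choices"][0]["text"])  # gibberish here suggests no real 4096 support
```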

Actually maybe not. I keep getting complete gibberish with -n 4096.

I think maybe GGML doesn't support this right now. I will ask the llama.cpp team about it.

I've put a note in the README to that effect.

Thank you for your work!

rayyd changed discussion status to closed
