Maximum context length mismatch

#1
by rayyd - opened

The maximum context length is shown as 4096 in the original model card, and I'm not sure how they achieved that. What is the maximum context length for this GGML version?

I believe it's also 4096, although I do get a warning when I try that:

[pytorch2] ubuntu@h100:/workspace/process $ /workspace/git/llama.cpp/main -c 4096 -m airoboros-13b/ggml/airoboros-13b-gpt4.ggmlv3.q4_0.bin -n 4096 -p "USER: write a story about llamas\nASSISTANT:"
main: warning: model does not support context sizes greater than 2048 tokens (4096 specified);expect poor results
main: build = 611 (dcb2ed4)
main: seed  = 1685887419
llama.cpp: loading model from airoboros-13b/ggml/airoboros-13b-gpt4.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 4096
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.09 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  = 9031.70 MB (+ 1608.00 MB per state)
llama_model_load_internal: offloading 0 layers to GPU
llama_model_load_internal: total VRAM used: 0 MB
.
llama_init_from_file: kv self size  = 3200.00 MB

system_info: n_threads = 13 / 26 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 4096, n_batch = 512, n_predict = 4096, n_keep = 0


 USER: write a story about llamas\nASSISTANT: Once upon a time, i .......

But that may just be because the GGML file carries no metadata telling llama.cpp that it's a 4096-context model.
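For anyone curious, here's a minimal sketch (not an official tool) that dumps the GGJT v3 header the way llama.cpp reads it, assuming the layout used by this build: a uint32 magic, a uint32 version, then seven uint32 hyperparameters. Notably there is no context-length field at all, which is why the n_ctx printed in the log comes purely from the -c flag and why the 2048 warning is hard-coded.

```python
import struct
import sys

# Sketch: print the GGJT v3 hyperparameters from a llama.cpp GGML file.
# Assumed header layout: uint32 magic ('ggjt'), uint32 version, then
# n_vocab, n_embd, n_mult, n_head, n_layer, n_rot, ftype (all uint32).
path = sys.argv[1]  # e.g. airoboros-13b-gpt4.ggmlv3.q4_0.bin

with open(path, "rb") as f:
    magic, version = struct.unpack("<II", f.read(8))
    assert magic == 0x67676A74, "not a GGJT file"  # 'ggjt'
    n_vocab, n_embd, n_mult, n_head, n_layer, n_rot, ftype = struct.unpack(
        "<7I", f.read(28)
    )

print(f"ggjt v{version}: n_vocab={n_vocab} n_embd={n_embd} n_mult={n_mult}")
print(f"n_head={n_head} n_layer={n_layer} n_rot={n_rot} ftype={ftype}")
# Note: no n_ctx here -- the file cannot advertise a 4096 training context,
# so llama.cpp only knows whatever is passed on the command line.
```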

I will do some testing to try to work out whether it really does have a 4096 context.
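One rough way to test this (just a sketch, assuming llama-cpp-python built against the same llama.cpp; I haven't verified it against this exact model) is to feed a prompt well past 2048 tokens at n_ctx=4096 and see whether the continuation stays coherent:

```python
from llama_cpp import Llama  # assumes llama-cpp-python wrapping the same llama.cpp build

# Load with a 4096 context; a 2048-only model should degrade badly past 2048.
llm = Llama(model_path="airoboros-13b-gpt4.ggmlv3.q4_0.bin", n_ctx=4096)

# Build a prompt comfortably longer than 2048 tokens (roughly 2,500-3,000 here).
filler = "The llama walked slowly across the field, chewing grass. " * 220
prompt = f"USER: {filler}\nSummarise the story above in one sentence.\nASSISTANT:"

out = llm(prompt, max_tokens=64, temperature=0.8)
print(out["choices"][0]["text"])  # gibberish here suggests no real 4096 support
```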

Actually maybe not. I keep getting complete gibberish with -n 4096.

I think maybe GGML doesn't support this right now. I will ask the llama.cpp team about it.

I've put a note in the README to that effect.

Thank you for your work!

rayyd changed discussion status to closed
