Can there be a 64K model? I have 128GB of system RAM but this model is throwing errors in the latest text-generation-webui:

#1
by CR2022 - opened

2023-11-10 22:04:01 ERROR:Failed to load the model.
Traceback (most recent call last):
File "K:\AI\text-generation-webui\modules\ui_model_menu.py", line 210, in load_model_wrapper
shared.model, shared.tokenizer = load_model(shared.model_name, loader)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "K:\AI\text-generation-webui\modules\models.py", line 85, in load_model
output = load_func_map[loader](model_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "K:\AI\text-generation-webui\modules\models.py", line 242, in llamacpp_loader
model, tokenizer = LlamaCppModel.from_pretrained(model_file)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "K:\AI\text-generation-webui\modules\llamacpp_model.py", line 91, in from_pretrained
result.model = Llama(**params)
^^^^^^^^^^^^^^^
File "K:\AI\text-generation-webui\installer_files\env\Lib\site-packages\llama_cpp\llama.py", line 422, in init
self.scores: npt.NDArray[np.single] = np.ndarray(
^^^^^^^^^^^
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 47.7 GiB for an array with shape (200000, 64000) and data type float32

Exception ignored in: <function LlamaCppModel.__del__ at 0x000002266B0CA340>
Traceback (most recent call last):
File "K:\AI\text-generation-webui\modules\llamacpp_model.py", line 49, in del
self.model.del()
^^^^^^^^^^
Models with these large context sizes are an amazing idea, but if they do not work with 128GB of system RAM then I wonder how much RAM they need in order to work.

I tried both the 6B and 34B models with 200K context and both throw the same error. It is weird how they can run out of memory when Task Manager in Windows 11 shows they are not even close to reaching that amount of RAM usage.
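
For what it's worth, the 47.7 GiB in the error lines up with the self.scores buffer in the traceback: llama-cpp-python apparently allocates one float32 per vocabulary entry per context position, so that array grows linearly with the context size you request. A quick back-of-the-envelope check in plain Python (the numbers are taken from the error above; the 8192 line is just a hypothetical smaller context for comparison):

n_vocab = 64_000            # second dimension of the array in the error
bytes_per_float32 = 4

def scores_gib(n_ctx: int) -> float:
    # size of an (n_ctx, n_vocab) float32 array, in GiB
    return n_ctx * n_vocab * bytes_per_float32 / 2**30

print(scores_gib(200_000))  # ~47.68 GiB - matches the "47.7 GiB" in the error
print(scores_gib(8_192))    # ~1.95 GiB  - a much smaller context is far cheaper

So lowering n_ctx in the loader settings, rather than letting it default to the model's full 200K, should shrink that allocation dramatically.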

That looks like a Python-specific error, not llama.cpp itself. The error is coming from numpy, which I assume is called by llama-cpp-python - the Python library that provides llama.cpp inference.

I don't know why you're getting that error specifically, given you have plenty of RAM - it's failing at 47GB apparently, well before your RAM limit. I have seen issues before on Windows where you need to allocate a large pagefile, even if you have plenty of RAM. So you could try allocating a 128GB pagefile and see if that helps.

If that doesn't help, please raise it as an issue on the llama-cpp-python GitHub, and maybe they can help.

In the meantime you might want to try using llama.cpp directly, either via the command line or via its server, which provides an API you can hit from Python code. That won't suffer from this issue and will be able to use your full system RAM (and some GPU VRAM as well, if you have a suitable GPU and want to use it), so you can see how much context you can get.
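
A minimal sketch of that last suggestion, assuming a llama.cpp server started with something like ./server -m yi-6b-200k.Q4_K_M.gguf -c 8192 and listening on the default port; the /completion endpoint and field names are the ones documented in examples/server, so adjust if your build differs:

import requests  # third-party HTTP client

resp = requests.post(
    "http://127.0.0.1:8080/completion",
    json={
        "prompt": "Summarise the following document:\n...",
        "n_predict": 256,    # maximum number of tokens to generate
        "temperature": 0.7,
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["content"])  # the generated text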

Thank you for your detailed reply and the suggestions :)

@cr2022 You must be either setting the context size to 200K or maybe letting it default to that. I ran the 34B 200K on my 64GB system without issues. But see below: there's an issue with these models.

@TheBloke I generated the 34B earlier today and found that the token id metadata wasn't getting correctly read. There's a fix in https://github.com/ggerganov/llama.cpp/pull/3981

It also contains a script that will let you dump metadata and see the problem:

$ python gguf-py/scripts/gguf-dump.py yi-6b-200k.Q4_K_M.gguf

* Loading: yi-6b-200k.Q4_K_M.gguf
* File is LITTLE endian, script is running on a LITTLE endian host.

* Dumping 20 key/value pair(s)
      1: UINT32     |        1 | GGUF.version = 3
      2: UINT64     |        1 | GGUF.tensor_count = 291
      3: UINT64     |        1 | GGUF.kv_count = 17
      4: STRING     |        1 | general.architecture = 'llama'
      5: STRING     |        1 | general.name = '01-ai_yi-6b-200k'
      6: UINT32     |        1 | llama.context_length = 200000
      7: UINT32     |        1 | llama.embedding_length = 4096
      8: UINT32     |        1 | llama.block_count = 32
      9: UINT32     |        1 | llama.feed_forward_length = 11008
     10: UINT32     |        1 | llama.rope.dimension_count = 128
     11: UINT32     |        1 | llama.attention.head_count = 32
     12: UINT32     |        1 | llama.attention.head_count_kv = 4
     13: FLOAT32    |        1 | llama.attention.layer_norm_rms_epsilon = 9.999999747378752e-06
     14: FLOAT32    |        1 | llama.rope.freq_base = 5000000.0
     15: UINT32     |        1 | general.file_type = 15
     16: STRING     |        1 | tokenizer.ggml.model = 'llama'
     17: [STRING]   |    64000 | tokenizer.ggml.tokens
     18: [FLOAT32]  |    64000 | tokenizer.ggml.scores
     19: [INT32]    |    64000 | tokenizer.ggml.token_type
     20: UINT32     |        1 | general.quantization_version = 2

There's no tokenizer.ggml.bos_token_id, etc. I believe the metadata for the non-200K models matches, so that also would be a problem for those (but I think it used to be different since I generated models that did have that metadata).
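
If you'd rather check this programmatically than eyeball the dump, here is a rough sketch using the GGUFReader from the refactored gguf-py package in the llama.cpp tree (the exact import path and reader API may differ depending on which checkout you have):

from gguf import GGUFReader  # gguf-py package from the llama.cpp tree

reader = GGUFReader("yi-6b-200k.Q4_K_M.gguf")

expected = [
    "tokenizer.ggml.bos_token_id",
    "tokenizer.ggml.eos_token_id",
]
missing = [key for key in expected if key not in reader.fields]
print("missing token id metadata:", missing or "none")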

Thank you for the reply, I will look into it :)

CR2022 changed discussion status to closed

Thank you @KerfuffleV2 - I am re-generating 6B-200K and 34B-200K GGUFs with the latest llama.cpp, with your refactored gguf-py code.

TheBloke changed discussion status to open

@TheBloke No problem. You might also want to double check the non-200K ones to make sure they have tokenizer.ggml.bos_token_id and tokenizer.ggml.eos_token_id.

Another thing you may want to do is set tokenizer.ggml.bos_token_id to 2: llama.cpp doesn't respect add_bos_token in the HF metadata, so it always adds the BOS token. However, those models weren't trained with an initial BOS token, so it really hurts their quality, while a leading EOS seems (mostly) OK. With the refactor pull we now add a flag to the GGUF metadata saying whether to add BOS/EOS, but nothing actually uses it yet, so the BOS still gets added.

python gguf-py/scripts/gguf-set-metadata.py model-filename.gguf tokenizer.ggml.bos_token_id 2

The only downside I can think of is if someone wants to fine-tune based off those GGUF models and somehow needs the BOS id to be the original value, but that seems fairly unlikely.
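
And to confirm the edit took, a rough way to read the value back with the same GGUFReader (this assumes the ReaderField layout where parts[data[0]] holds the scalar payload; running gguf-dump.py again works just as well):

from gguf import GGUFReader

reader = GGUFReader("model-filename.gguf")
field = reader.fields.get("tokenizer.ggml.bos_token_id")
if field is None:
    print("tokenizer.ggml.bos_token_id is still missing")
else:
    # for a scalar key/value entry, the value lives in the part indexed by data[0]
    print("bos_token_id =", int(field.parts[field.data[0]][0]))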
