Couldn't run in Colab

#1
by eramax - opened

I thought I would be able to load this model on a Colab T4 GPU instance, since its size is less than the 15 GB of VRAM available, but it didn't fit and I got this crash:

!python examples/chat.py -m /content/model -mode llama

 -- Model: /content/model
 -- Options: ['rope_scale 1.0', 'rope_alpha 1.0']
 -- Loading model...
 -- Loading tokenizer...
Traceback (most recent call last):
  File "/content/exllamav2/examples/chat.py", line 126, in <module>
    cache = ExLlamaV2Cache(model, lazy = not model.loaded)
  File "/content/exllamav2/exllamav2/cache.py", line 133, in __init__
    self.create_state_tensors(copy_from, lazy)
  File "/content/exllamav2/exllamav2/cache.py", line 45, in create_state_tensors
    p_key_states = torch.zeros(self.batch_size, self.max_seq_len, self.num_key_value_heads, self.head_dim, dtype = self.dtype, device = self.model.cache_map[i]).contiguous()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 392.00 MiB. GPU 0 has a total capacty of 14.75 GiB of which 209.06 MiB is free. Process 270593 has 14.54 GiB memory in use. Of the allocated memory 14.32 GiB is allocated by PyTorch, and 111.15 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I am using the experimental branch of exllamav2.
I would like to understand the VRAM requirements for each quantization, since it seems to require much more than the model size alone.

Best,

You need to account for the context length. Try restricting that to a very low number first to see if it will load. Otherwise, you're going to need a lower bpw model or more VRAM.
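For a rough sense of how much the context adds on top of the weights, here is a back-of-envelope sketch of the FP16 KV cache for a 34B-class GQA model (the layer and head counts are assumptions for illustration; the real values live in the model's config.json):

# Rough FP16 KV-cache estimate; all architecture numbers below are assumed, not read from the model
num_layers = 60        # num_hidden_layers (assumed, Yi-34B-style)
num_kv_heads = 8       # num_key_value_heads (assumed, GQA)
head_dim = 128         # hidden_size // num_attention_heads (assumed)
bytes_per_element = 2  # FP16
batch_size = 1

def kv_cache_gib(seq_len):
    # keys + values, one pair of state tensors per layer
    total = 2 * num_layers * batch_size * seq_len * num_kv_heads * head_dim * bytes_per_element
    return total / 1024**3

for seq_len in (2048, 4096, 8192):
    print(f"context {seq_len}: ~{kv_cache_gib(seq_len):.2f} GiB of KV cache")

That works out to very roughly 0.5, 0.9 and 1.9 GiB respectively, and it has to fit alongside the weights, activation buffers and the CUDA context, which on a 15 GiB T4 can easily be the difference between loading and the OOM above.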

@LoneStriker, could you please give me the arguments I should use with the command !python examples/chat.py -m /content/model -mode llama to make this model fit in the 15 GB of T4 VRAM?

Regards,

One sec, sorry, I was answering a different question.

Try using the -resc and -maxr options and setting them lower than the 250 and 1000 defaults. There's not a whole lot of room to scale down though, so you might need a lower bpw model.

You can see the full options here:

$ python ./examples/chat.py --help
usage: chat.py [-h] [-dm DRAFT_MODEL_DIR] [-nds] [-modes] [-mode {raw,llama,codellama,chatml,tinyllama,zephyr,deepseek,solar}]
               [-un USERNAME] [-bn BOTNAME] [-sp SYSTEM_PROMPT] [-temp TEMPERATURE] [-topk TOP_K] [-topp TOP_P] [-typical TYPICAL]
               [-repp REPETITION_PENALTY] [-maxr MAX_RESPONSE_TOKENS] [-resc RESPONSE_CHUNK] [-ncf] [-c8] [-pt] [-amnesia]
               [-m MODEL_DIR] [-gs GPU_SPLIT] [-l LENGTH] [-rs ROPE_SCALE] [-ra ROPE_ALPHA] [-nfa] [-lm]

Simple Llama2 chat example for ExLlamaV2

options:
  -h, --help            show this help message and exit
  -dm DRAFT_MODEL_DIR, --draft_model_dir DRAFT_MODEL_DIR
                        Path to draft model directory
  -nds, --no_draft_scale
                        If draft model has smaller context size than model, don't apply alpha (NTK) scaling to extend it
  -modes, --modes       List available modes and exit.
  -mode {raw,llama,codellama,chatml,tinyllama,zephyr,deepseek,solar}, --mode {raw,llama,codellama,chatml,tinyllama,zephyr,deepseek,solar}
                        Chat mode. Use llama for Llama 1/2 chat finetunes.
  -un USERNAME, --username USERNAME
                        Username when using raw chat mode
  -bn BOTNAME, --botname BOTNAME
                        Bot name when using raw chat mode
  -sp SYSTEM_PROMPT, --system_prompt SYSTEM_PROMPT
                        Use custom system prompt
  -temp TEMPERATURE, --temperature TEMPERATURE
                        Sampler temperature, default = 0.95 (1 to disable)
  -topk TOP_K, --top_k TOP_K
                        Sampler top-K, default = 50 (0 to disable)
  -topp TOP_P, --top_p TOP_P
                        Sampler top-P, default = 0.8 (0 to disable)
  -typical TYPICAL, --typical TYPICAL
                        Sampler typical threshold, default = 0.0 (0 to disable)
  -repp REPETITION_PENALTY, --repetition_penalty REPETITION_PENALTY
                        Sampler repetition penalty, default = 1.05 (1 to disable)
  -maxr MAX_RESPONSE_TOKENS, --max_response_tokens MAX_RESPONSE_TOKENS
                        Max tokens per response, default = 1000
  -resc RESPONSE_CHUNK, --response_chunk RESPONSE_CHUNK
                        Space to reserve in context for reply, default = 250
  -ncf, --no_code_formatting
                        Disable code formatting/syntax highlighting
  -c8, --cache_8bit     Use 8-bit cache
  -pt, --print_timings  Output timings after each prompt
  -amnesia, --amnesia   Forget context after every response
  -m MODEL_DIR, --model_dir MODEL_DIR
                        Path to model directory
  -gs GPU_SPLIT, --gpu_split GPU_SPLIT
                        "auto", or VRAM allocation per GPU in GB
  -l LENGTH, --length LENGTH
                        Maximum sequence length
  -rs ROPE_SCALE, --rope_scale ROPE_SCALE
                        RoPE scaling factor
  -ra ROPE_ALPHA, --rope_alpha ROPE_ALPHA
                        RoPE alpha value (NTK)
  -nfa, --no_flash_attn
                        Disable Flash Attention
  -lm, --low_mem        Enable VRAM optimizations, potentially trading off speed


Try this 2.8bpw model:
https://huggingface.co/LoneStriker/Nous-Capybara-limarpv3-34B-2.8bpw-h6-exl2-2

You can also try these options:

  -c8, --cache_8bit     Use 8-bit cache
  -lm, --low_mem        Enable VRAM optimizations, potentially trading off speed
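For scale, reusing the assumed 34B geometry from the cache sketch above (60 layers, 8 KV heads, head_dim 128, all of them assumptions), the 8-bit cache roughly halves the cache part of the budget:

# 8-bit vs FP16 cache size, same assumed geometry as the earlier sketch
def kv_cache_gib(seq_len, bytes_per_element):
    return 2 * 60 * seq_len * 8 * 128 * bytes_per_element / 1024**3

for seq_len in (2048, 4096):
    print(f"{seq_len}: {kv_cache_gib(seq_len, 2):.2f} GiB fp16 -> {kv_cache_gib(seq_len, 1):.2f} GiB 8-bit")

And --low_mem, per the help text above, trades some speed for additional VRAM savings.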

Thanks @LoneStriker, I tried tuning the arguments but still couldn't run it:
https://gist.github.com/eramax/732ab060b8c3adddd12a2ad6f0741d5a

As a last attempt, you can try loading this 2.4bpw model:
https://huggingface.co/LoneStriker/Nous-Capybara-limarpv3-34B-2.4bpw-h6-exl2-2

Thanks @LoneStriker, same issue; it still couldn't run because of VRAM:
https://gist.github.com/eramax/732ab060b8c3adddd12a2ad6f0741d5a

The key point is that, in order to judge the effectiveness of a quantization with respect to VRAM size, speed, and perplexity, we need a clear picture of the VRAM each quantized model requires in order to run.
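My rough mental model so far: weights ≈ parameters × bpw / 8 bytes, plus the KV cache and some working buffers. A ballpark sketch with assumed numbers (none of them read from an actual config), please correct me if this is off:

# Ballpark VRAM estimate for an EXL2 quant; every number here is an assumption
n_params = 34.4e9     # approximate parameter count of a Yi-34B-class model
bpw = 2.4             # average bits per weight of the quant
kv_cache_gib = 0.5    # FP16 cache at 2048 context (see the earlier sketch)
overhead_gib = 1.5    # rough allowance for buffers and the CUDA context

weights_gib = n_params * bpw / 8 / 1024**3
print(f"weights ~{weights_gib:.1f} GiB, total ~{weights_gib + kv_cache_gib + overhead_gib:.1f} GiB")

If that is roughly right, a 2.4bpw 34B needs about 9.6 GiB for the weights and 11 to 12 GiB in total at short context, which leaves only a couple of GiB of headroom on a T4.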

Looks like it's the -l option that you're missing. You should be able to load one of the bigger models, like 2.8 or 3.0 bpw. Loading the 2.4bpw model, I'm using 12384 MiB of memory.

$ python examples/chat.py -m /models/Nous-Capybara-limarpv3-34B-2.4bpw-h6-exl2-2 --mode llama -l 2048
 -- Model: /models/models/hf/Nous-Capybara-limarpv3-34B-2.4bpw-h6-exl2-2
 -- Options: ['length: 2048', 'rope_scale 1.0', 'rope_alpha 1.0']
 -- Loading model...
 -- Loading tokenizer...
 -- Prompt format: llama
 -- System prompt:

You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

User:
$ nvidia-smi
Sat Dec 16 12:16:52 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        Off | 00000000:01:00.0  On |                  Off |
|  0%   45C    P8              38W / 450W |  12384MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
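If you'd rather drive it from a notebook cell than through chat.py, a minimal loading sketch with the exllamav2 Python API should look roughly like this (treat the exact calls as a sketch against the current API; the path and the 2048 limit mirror the command above):

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

model_dir = "/content/model"    # same directory you pass to chat.py -m

config = ExLlamaV2Config()
config.model_dir = model_dir
config.prepare()
config.max_seq_len = 2048       # equivalent of chat.py's -l 2048

model = ExLlamaV2(config)
model.load()
tokenizer = ExLlamaV2Tokenizer(config)
cache = ExLlamaV2Cache(model)   # ExLlamaV2Cache_8bit(model) should correspond to -c8

generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.95     # chat.py defaults from the help output above
settings.top_k = 50
settings.top_p = 0.8
settings.token_repetition_penalty = 1.05

print(generator.generate_simple("Once upon a time,", settings, 100))

Capping max_seq_len is what shrinks the cache allocation that OOM'd in your first traceback.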

Now it worked, thanks for the help. Unfortunately the model doesn't generate any output, maybe because of the prompt; I will try another one.

I tried different models, and all of them loaded successfully following your recommendation to set the length (I think the length is the context size).
Some of the models didn't produce any output, like this one:

 -- Model: /content/model
 -- Options: ['length: 2048', 'rope_scale 1.0', 'rope_alpha 1.0']
 -- Loading model...
 -- Loading tokenizer...
 -- Prompt format: llama
 -- System prompt:

You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

User: who are u



User: hello



User: 
^C

Others gave me the error "Response exceeded 2000 tokens and was cut short". I adjusted maxr from 1000 to 2000, and even then it didn't give me any output:

 -- Model: /content/model
 -- Options: ['length: 4096', 'rope_scale 1.0', 'rope_alpha 1.0']
 -- Loading model...
 -- Loading tokenizer...
 -- Prompt format: deepseek
 -- System prompt:

You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.

User: who are u


 !! Response exceeded 2000 tokens and was cut short.

User: 

When I used --low_mem, I got this error:

 -- Model: /content/model
 -- Options: ['length: 2048', 'rope_scale 1.0', 'rope_alpha 1.0', 'low_mem']
 -- Loading model...
 -- Loading tokenizer...
 -- Prompt format: llama
 -- System prompt:

You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

User: who are u

Traceback (most recent call last):
  File "/content/exllamav2/examples/chat.py", line 242, in <module>
    generator.begin_stream(active_context, settings)
  File "/content/exllamav2/exllamav2/generator/streaming.py", line 88, in begin_stream
    self._gen_begin_reuse(input_ids, gen_settings)
  File "/content/exllamav2/exllamav2/generator/streaming.py", line 267, in _gen_begin_reuse
    self._gen_begin(in_tokens, gen_settings)
  File "/content/exllamav2/exllamav2/generator/streaming.py", line 253, in _gen_begin
    self.model.forward(self.sequence_ids[:, :-1], self.cache, preprocess_only = True, loras = self.active_loras)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/content/exllamav2/exllamav2/model.py", line 589, in forward
    r, ls = self._forward(input_ids = input_ids[:, chunk_begin : chunk_end],
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/content/exllamav2/exllamav2/model.py", line 665, in _forward
    x = module.forward(x, cache = cache, attn_mask = attn_mask, past_len = past_len, loras = loras, position_offsets = position_offsets)
  File "/content/exllamav2/exllamav2/attn.py", line 497, in forward
    pass_loras,
UnboundLocalError: local variable 'pass_loras' referenced before assignment

I should also mention that, before I specified the length, I had an issue that no longer happens now that I set it: the chat.py script would spend around 90 to 120 seconds doing nothing (Colab showed it sitting in a sleep call, with no activity on the CPU or GPU during that period), and only after that delay would it start printing and loading the model. This only happened when I didn't specify the length.

Best,

It's most likely the prompt format. Try raw mode; it seems to generate text for me:

$ python examples/chat.py -m /models/Nous-Capybara-limarpv3-34B-2.4bpw-h6-exl2-2 -l 2048 --mode raw
 -- Model: /models/Nous-Capybara-limarpv3-34B-2.4bpw-h6-exl2-2
 -- Options: ['length: 2048', 'rope_scale 1.0', 'rope_alpha 1.0']
 -- Loading model...
 -- Loading tokenizer...
 -- Prompt format: raw
 -- System prompt:

This is a conversation between a helpful AI assistant named Chatbort and a user.

User: Write a story about llamas

Chatbort: Once upon a time in a land far, far away, there was a village known as Llama-topia. The villagers were an eclectic mix of animals, all of whom coexisted harmoniously. Among them were the kind-hearted llamas. These gentle creatures had become the heart and soul of the village, their wool providing warmth for everyone during the harsh winter months.

One day, a young llama named Lupita noticed that some of her fellow villagers were shivering from the cold. She decided to organize a group of volunteers to gather more wool. They worked tirelessly, shearing the soft wool from their bodies and spinning it into yarn. The yarn was then knitted into beautiful blankets by another group of volunteers.

As days passed, the village became warmer and warmer, with every llama in the village enjoying the warmth provided by the blankets. Lupita and her team felt overjoyed at the sight of their work bringing comfort to those who needed it most. Their spirits lifted even more when they saw how much happier and healthier everyone seemed to be due to their efforts.

From that day forward, Lupita and her fellow llamas continued their work, ensuring that no one in Llama-topia ever had to endure the biting cold again. Their selfless actions created a lasting bond between all villagers, proving that when we work together, anything can be achieved.</s>
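In case it helps to see the difference concretely, this is roughly what the two modes do to your text before it reaches the model (the templates below are assumptions based on common Llama-2-chat and raw conventions, not copied from chat.py):

# Two ways the same question can be wrapped before generation; both templates are assumptions
system = "This is a conversation between a helpful AI assistant named Chatbort and a user."
question = "Write a story about llamas"

llama2_style = f"[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{question} [/INST]"  # roughly what -mode llama builds
raw_style = f"{system}\n\nUser: {question}\n\nChatbort:"                    # roughly what -mode raw builds

print(llama2_style)
print(raw_style)

A model fine-tuned on a different template can reply with an empty string or keep going without ever emitting a stop token, which may explain both the blank replies and the "Response exceeded 2000 tokens" message you saw.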

It worked successfully, thank you. I was even able to run LoneStriker/Toppy-Mix-4x7B-4.0bpw-h6-exl2-2 smoothly and quickly without any issues:
https://gist.github.com/eramax/485513cd4b1c5d1698fa89ff817b705f
https://gist.github.com/eramax/732ab060b8c3adddd12a2ad6f0741d5a

Maybe you could add the ipynb to the examples or the tutorial links so people can benefit from it. The links are publicly accessible.

Things we could improve next:

  1. Fix the delay when loading the model on the first run; it still occurs even when I specify the length :(
  2. The output in the Jupyter notebook has an issue where the CR and LF control characters don't render correctly (check the links above).
  3. Some models I tried didn't generate any output (I haven't retried all of them, but I think the model should generate something even without the exact prompt format), and some gave me the error "!! Response exceeded 2000 tokens and was cut short." with no output at all.
  4. Enabling offloading of layers (or the tokenizer) to the CPU would be very helpful.

Finally, the model size is much better than GGUF; I haven't measured the speed yet for a comparison.
Best,
