Apply for community grant: Personal project (gpu and storage)

#1
by leonardlin - opened
AUGMXNT org

I'm training a SOTA JA/EN bilingual open model (it already beats the current best JA model, stabilityai's 70B, in both JA fluency and benchmarks). I'm doing a proper training run this week and will be releasing the model, datasets, and code soon, and figured it might be nice to set up a Space so people can easily try it out (at least at launch).

(Attached screenshot: Screenshot from 2023-11-16 15-04-29.png)

Hi @randomfoo, we've assigned a t4-small to this Space with a 15-minute sleep time for now, as the Space isn't ready yet. Let us know when it's ready so we can change the sleep time to 1 hour or so.

AUGMXNT org

Hiya,

So I switched to Gradio since it seemed easier to set up a chat interface (sort of, but holy crap the docs are bad and it took waaaaay too much time to get working). Still, I finally got it working while developing on my local system. A few questions on this Space:

  • I originally started with Streamlit; can I switch this Space to a Gradio instance? I didn't see that in Settings. If not, do I start a new Space and follow up again? I guess the convos are attached per Space? I'll probably also rename this Space; just wondering if that'll cause problems.

  • I'm trying to run a 7B and it seems to be running out of RAM or VRAM; I tried load_in_4bit w/ bnb but it's still not going well. In theory, Q4 should be <7GB of VRAM, right?

  • Is there any way to cache the model locally? Every time the Space restarts it pulls the model again; is that right? (I'm just pulling mistral-7b right now.)

Thanks!

I originally started with Streamlit; can I switch this Space to a Gradio instance? I didn't see that in Settings. If not, do I start a new Space and follow up again? I guess the convos are attached per Space? I'll probably also rename this Space; just wondering if that'll cause problems.

You can change the SDK from streamlit to gradio by updating your README.md. https://huggingface.co/spaces/augmxnt/test7b/blob/d3878745e30ebbebfb3521bbbea4c830d68319e7/README.md?code=true#L6-L7
I think it'll be fine to rename your Space.
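For reference, the change is just in the YAML front matter at the top of README.md; a minimal sketch (the sdk_version value here is only a placeholder, keep whatever the linked file shows):

```yaml
---
sdk: gradio          # was: streamlit
sdk_version: 4.3.0   # placeholder; pin to the Gradio version you actually use
app_file: app.py
---
```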

I'm trying to run a 7B and it seems to be running out of RAM or VRAM; I tried load_in_4bit w/ bnb but it's still not going well. In theory, Q4 should be <7GB of VRAM, right?

I think you should test your demo on your local machine or Google Colab, etc. first. A T4 is usually enough for 7B models if you load them in 4-bit or 8-bit.
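If it helps, here's a minimal sketch of 4-bit loading with transformers + bitsandbytes (the model id is just a placeholder; swap in your own checkpoint):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-v0.1"  # placeholder; use your own model

# NF4 4-bit quantization keeps a 7B model comfortably within the T4's 16GB of VRAM
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # T4 has no bf16 support
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # let accelerate place the weights on the GPU
)
```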

Is there any way to cache the model locally? Every time the Space restarts it pulls the model again; is that right? (I'm just pulling mistral-7b right now.)

You can attach persistent storage to your Space and set an environment variable HF_HOME=/data. But generally speaking, it's a good idea to debug your demo in your local environment before deploying it to Spaces.
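For example, something along these lines (assuming the persistent storage is mounted at /data, which is the default mount point):

```python
import os

# Must run before transformers/huggingface_hub are imported, otherwise the
# default cache path has already been resolved. Setting HF_HOME=/data in the
# Space's settings avoids this ordering issue entirely.
os.environ.setdefault("HF_HOME", "/data")

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")  # placeholder id
```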

AUGMXNT org

Hi, it took a little longer than expected, but we've launched our model now. Wondering if it'd be possible to get this Space's spec upped with a longer sleep time, and maybe some local storage, for a while? Our model: https://huggingface.co/augmxnt/shisa-7b-v1

AUGMXNT org

(Also, did the original grant expire? It looks like the Space is back on CPU, so it doesn't run, sadly.)

Hi @leonardlin, sorry, it looks like I missed your comment. I've assigned a t4-small and set the sleep time to 1 hour.

AUGMXNT org

Ah thanks, just updated the README.md; first time using a Space so still learning the ropes :)

Thanks!

@leonardlin I just noticed that your Space is not working properly due to CUDA OOM. I've upgraded the hardware to an a10g-small for now, but it would be nice if you could look into it.

logs:

===== Application Startup at 2023-12-07 05:07:31 =====


tokenizer_config.json:   0%|          | 0.00/11.4k [00:00<?, ?B/s]
tokenizer_config.json: 100%|██████████| 11.4k/11.4k [00:00<00:00, 37.2MB/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]
tokenizer.model: 100%|██████████| 493k/493k [00:00<00:00, 101MB/s]

tokenizer.json:   0%|          | 0.00/6.14M [00:00<?, ?B/s]
tokenizer.json: 100%|██████████| 6.14M/6.14M [00:00<00:00, 75.5MB/s]

added_tokens.json:   0%|          | 0.00/1.42k [00:00<?, ?B/s]
added_tokens.json: 100%|██████████| 1.42k/1.42k [00:00<00:00, 6.24MB/s]

special_tokens_map.json:   0%|          | 0.00/552 [00:00<?, ?B/s]
special_tokens_map.json: 100%|██████████| 552/552 [00:00<00:00, 2.32MB/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

config.json:   0%|          | 0.00/605 [00:00<?, ?B/s]
config.json: 100%|██████████| 605/605 [00:00<00:00, 2.68MB/s]

model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]
model.safetensors.index.json: 100%|██████████| 23.9k/23.9k [00:00<00:00, 63.9MB/s]

Downloading shards:   0%|          | 0/5 [00:00<?, ?it/s]
model-00001-of-00005.safetensors:   0%|          | 0.00/3.92G [00:00<?, ?B/s]
model-00001-of-00005.safetensors:   0%|          | 10.5M/3.92G [00:03<19:25, 3.35MB/s]
model-00001-of-00005.safetensors:  27%|██▋       | 1.06G/3.92G [00:05<00:12, 235MB/s]
model-00001-of-00005.safetensors:  54%|█████▍    | 2.13G/3.92G [00:06<00:04, 435MB/s]
model-00001-of-00005.safetensors:  71%|███████   | 2.77G/3.92G [00:08<00:02, 389MB/s]
model-00001-of-00005.safetensors:  84%|████████▎ | 3.28G/3.92G [00:09<00:01, 381MB/s]
model-00001-of-00005.safetensors: 100%|█████████▉| 3.92G/3.92G [00:10<00:00, 385MB/s]

Downloading shards:  20%|██        | 1/5 [00:10<00:41, 10.46s/it]
model-00002-of-00005.safetensors:   0%|          | 0.00/3.93G [00:00<?, ?B/s]
model-00002-of-00005.safetensors:   0%|          | 10.5M/3.93G [00:01<09:46, 6.67MB/s]
model-00002-of-00005.safetensors:   1%|▏         | 52.4M/3.93G [00:02<02:53, 22.4MB/s]
model-00002-of-00005.safetensors:   4%|▎         | 147M/3.93G [00:03<01:22, 46.0MB/s]
model-00002-of-00005.safetensors:   5%|▌         | 210M/3.93G [00:04<01:12, 51.3MB/s]
model-00002-of-00005.safetensors:  10%|█         | 398M/3.93G [00:06<00:42, 82.6MB/s]
model-00002-of-00005.safetensors:  26%|██▌       | 1.01G/3.93G [00:07<00:12, 229MB/s]
model-00002-of-00005.safetensors:  32%|███▏      | 1.26G/3.93G [00:10<00:18, 142MB/s]
model-00002-of-00005.safetensors:  37%|███▋      | 1.46G/3.93G [00:12<00:18, 135MB/s]
model-00002-of-00005.safetensors:  48%|████▊     | 1.90G/3.93G [00:13<00:10, 195MB/s]
model-00002-of-00005.safetensors:  55%|█████▌    | 2.16G/3.93G [00:15<00:10, 162MB/s]
model-00002-of-00005.safetensors:  60%|██████    | 2.37G/3.93G [00:18<00:12, 126MB/s]
model-00002-of-00005.safetensors:  65%|██████▍   | 2.54G/3.93G [00:19<00:10, 133MB/s]
model-00002-of-00005.safetensors:  76%|███████▌  | 2.99G/3.93G [00:20<00:04, 194MB/s]
model-00002-of-00005.safetensors:  83%|████████▎ | 3.25G/3.93G [00:22<00:03, 180MB/s]
model-00002-of-00005.safetensors:  89%|████████▊ | 3.48G/3.93G [00:24<00:02, 167MB/s]
model-00002-of-00005.safetensors: 100%|█████████▉| 3.93G/3.93G [00:24<00:00, 160MB/s]

Downloading shards:  40%|████      | 2/5 [00:35<00:56, 18.99s/it]
model-00003-of-00005.safetensors:   0%|          | 0.00/3.93G [00:00<?, ?B/s]
model-00003-of-00005.safetensors:   0%|          | 10.5M/3.93G [00:03<19:15, 3.39MB/s]
model-00003-of-00005.safetensors:   3%|▎         | 126M/3.93G [00:04<01:39, 38.2MB/s]
model-00003-of-00005.safetensors:   8%|▊         | 325M/3.93G [00:05<00:41, 87.7MB/s]
model-00003-of-00005.safetensors:  17%|█▋        | 682M/3.93G [00:06<00:19, 170MB/s]
model-00003-of-00005.safetensors:  24%|██▎       | 923M/3.93G [00:07<00:15, 188MB/s]
model-00003-of-00005.safetensors:  29%|██▉       | 1.15G/3.93G [00:11<00:26, 105MB/s]
model-00003-of-00005.safetensors:  34%|███▎      | 1.32G/3.93G [00:12<00:24, 107MB/s]
model-00003-of-00005.safetensors:  45%|████▍     | 1.76G/3.93G [00:13<00:12, 171MB/s]
model-00003-of-00005.safetensors:  51%|█████     | 2.00G/3.93G [00:14<00:10, 181MB/s]
model-00003-of-00005.safetensors:  57%|█████▋    | 2.24G/3.93G [00:18<00:14, 115MB/s]
model-00003-of-00005.safetensors:  66%|██████▌   | 2.58G/3.93G [00:19<00:09, 149MB/s]
model-00003-of-00005.safetensors:  72%|███████▏  | 2.81G/3.93G [00:21<00:06, 163MB/s]
model-00003-of-00005.safetensors:  77%|███████▋  | 3.04G/3.93G [00:22<00:05, 175MB/s]
model-00003-of-00005.safetensors:  83%|████████▎ | 3.27G/3.93G [00:23<00:04, 157MB/s]
model-00003-of-00005.safetensors:  91%|█████████▏| 3.59G/3.93G [00:24<00:01, 192MB/s]
model-00003-of-00005.safetensors: 100%|█████████▉| 3.93G/3.93G [00:25<00:00, 156MB/s]

Downloading shards:  60%|██████    | 3/5 [01:00<00:43, 21.98s/it]
model-00004-of-00005.safetensors:   0%|          | 0.00/3.17G [00:00<?, ?B/s]
model-00004-of-00005.safetensors:   0%|          | 10.5M/3.17G [00:02<14:19, 3.68MB/s]
model-00004-of-00005.safetensors:   5%|▌         | 168M/3.17G [00:03<00:55, 54.5MB/s]
model-00004-of-00005.safetensors:  18%|█▊        | 577M/3.17G [00:04<00:15, 168MB/s]
model-00004-of-00005.safetensors:  26%|██▌       | 818M/3.17G [00:07<00:19, 124MB/s]
model-00004-of-00005.safetensors:  33%|███▎      | 1.06G/3.17G [00:09<00:18, 115MB/s]
model-00004-of-00005.safetensors:  38%|███▊      | 1.22G/3.17G [00:11<00:15, 122MB/s]
model-00004-of-00005.safetensors:  55%|█████▍    | 1.73G/3.17G [00:12<00:06, 208MB/s]
model-00004-of-00005.safetensors:  63%|██████▎   | 2.00G/3.17G [00:14<00:06, 175MB/s]
model-00004-of-00005.safetensors:  70%|███████   | 2.23G/3.17G [00:18<00:08, 113MB/s]
model-00004-of-00005.safetensors: 100%|█████████▉| 3.17G/3.17G [00:19<00:00, 165MB/s]

Downloading shards:  80%|████████  | 4/5 [01:20<00:21, 21.01s/it]
model-00005-of-00005.safetensors:   0%|          | 0.00/984M [00:00<?, ?B/s]
model-00005-of-00005.safetensors:   1%|          | 10.5M/984M [00:01<02:48, 5.79MB/s]
model-00005-of-00005.safetensors:   6%|▋         | 62.9M/984M [00:02<00:35, 26.1MB/s]
model-00005-of-00005.safetensors: 100%|█████████▉| 984M/984M [00:03<00:00, 264MB/s]

Downloading shards: 100%|██████████| 5/5 [01:24<00:00, 14.89s/it]
Downloading shards: 100%|██████████| 5/5 [01:24<00:00, 16.90s/it]

Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]
Loading checkpoint shards:  20%|██        | 1/5 [00:11<00:44, 11.09s/it]
Loading checkpoint shards:  40%|████      | 2/5 [00:23<00:35, 11.67s/it]
Loading checkpoint shards:  60%|██████    | 3/5 [00:35<00:23, 11.84s/it]
Loading checkpoint shards:  80%|████████  | 4/5 [00:44<00:11, 11.01s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:47<00:00,  8.11s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:47<00:00,  9.58s/it]

generation_config.json:   0%|          | 0.00/133 [00:00<?, ?B/s]
generation_config.json: 100%|██████████| 133/133 [00:00<00:00, 531kB/s]
Running on local URL:  http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py:322: UserWarning: MatMul8bitLt: inputs will be cast from torch.bfloat16 to float16 during quantization
  warnings.warn(f"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization")
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Exception in thread Thread-11 (generate):
Traceback (most recent call last):
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/transformers/generation/utils.py", line 1719, in generate
    return self.sample(
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/transformers/generation/utils.py", line 2801, in sample
    outputs = self(
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 1009, in forward
    outputs = self.model(
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 897, in forward
    layer_outputs = decoder_layer(
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 639, in forward
    hidden_states = self.mlp(hidden_states)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 175, in forward
    return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 452, in forward
    out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 562, in matmul
    return MatMul8bitLt.apply(A, B, out, bias, state)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/autograd/function.py", line 539, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 386, in forward
    state.subB = (outliers * state.SCB.view(-1, 1) / 127.0).t().contiguous().to(A.dtype)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 200.00 MiB. GPU 0 has a total capacty of 14.58 GiB of which 161.56 MiB is free. Process 791650 has 14.42 GiB memory in use. Of the allocated memory 11.95 GiB is allocated by PyTorch, and 2.33 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Exception in thread Thread-12 (generate):
Traceback (most recent call last):
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/transformers/generation/utils.py", line 1719, in generate
    return self.sample(
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/transformers/generation/utils.py", line 2801, in sample
    outputs = self(
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 1009, in forward
    outputs = self.model(
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 897, in forward
    layer_outputs = decoder_layer(
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 626, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 244, in forward
    query_states = self.q_proj(hidden_states)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 452, in forward
    out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 562, in matmul
    return MatMul8bitLt.apply(A, B, out, bias, state)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/autograd/function.py", line 539, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 421, in forward
    output += torch.matmul(subA, state.subB)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (57x2083 and 2080x4096)
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Exception in thread Thread-13 (generate):
Traceback (most recent call last):
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/transformers/generation/utils.py", line 1719, in generate
    return self.sample(
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/transformers/generation/utils.py", line 2801, in sample
    outputs = self(
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 1009, in forward
    outputs = self.model(
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 897, in forward
    layer_outputs = decoder_layer(
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 639, in forward
    hidden_states = self.mlp(hidden_states)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 175, in forward
    return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 452, in forward
    out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 562, in matmul
    return MatMul8bitLt.apply(A, B, out, bias, state)
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/autograd/function.py", line 539, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/user/.pyenv/versions/3.10.13/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 386, in forward
    state.subB = (outliers * state.SCB.view(-1, 1) / 127.0).t().contiguous().to(A.dtype)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 222.00 MiB. GPU 0 has a total capacty of 14.58 GiB of which 25.56 MiB is free. Process 791650 has 14.55 GiB memory in use. Of the allocated memory 12.16 GiB is allocated by PyTorch, and 2.25 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
AUGMXNT org

Hi, I'll take a look. I have it set up w/ load_in_8bit and it spun up fine originally, so I'm not sure why it ran out. I'll test it out locally on my dev box tomorrow just to sanity-check the size!

AUGMXNT org

@hysts OK, figured out the issue. I was testing with Mistral 7B before, but our model uses more memory (because of the tokenizer?) and goes way over. I switched the code to load_in_4bit and it should load in ~5GB of VRAM, although it grows w/ context... On my local box I was using use_flash_attention_2, which saves some memory, but when I put it in my requirements the build complained about not having torch. Is there a way to stage library installs w/ Spaces?
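(For reference, the loading call now looks roughly like this; a simplified sketch with the generation settings omitted:)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "augmxnt/shisa-7b-v1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    use_flash_attention_2=True,  # requires flash-attn to be installed
    torch_dtype=torch.float16,
    device_map="auto",
)
```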

@leonardlin Thanks for looking into this. Hmm, not sure, but maybe you can try adding torch to pre-requirements.txt?
https://huggingface.co/docs/hub/spaces-dependencies#adding-your-own-dependencies
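i.e. roughly this split (flash-attn here is just an example of a package that needs torch present at build time; pre-requirements.txt is installed before requirements.txt):

```
# pre-requirements.txt  -- installed first
torch

# requirements.txt
transformers
accelerate
bitsandbytes
flash-attn
```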

AUGMXNT org

OK, pre-requirements.txt done, FA2 running; that was an adventure. Looks like the EN announcement is starting to percolate through JA LLM Twitter, so it'll be good to see the response: https://twitter.com/webbigdata/status/1733044645687595382

BTW, made some interesting discoveries along the way, in case you guys are going to make default docs/templates for deploying LLM demos: the Gradio default chat example uses a streamer in a thread pool, but the streamer is actually not thread-safe and will end up leaking context between sessions. Also, as of 4.3.0, the docs say the examples should be passed as a list, but if you have additional_inputs, then it breaks and has to be a list of lists. I have no idea why 😅 Also, concurrency_limit seemed to be another heisenbug (e.g. it seemed to work locally but not on the Space), so that was sort of an adventure!
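The fix on my end was basically to create a fresh streamer per request instead of sharing one across sessions; a rough sketch (tokenizer/model are globals here, and the prompt formatting and generation kwargs are trimmed way down):

```python
from threading import Thread
from transformers import TextIteratorStreamer

def chat_fn(message, history):
    # New streamer per call: TextIteratorStreamer keeps internal state, so a
    # shared instance ends up mixing output between concurrent sessions.
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    inputs = tokenizer(message, return_tensors="pt").to(model.device)
    Thread(target=model.generate,
           kwargs=dict(**inputs, streamer=streamer, max_new_tokens=512)).start()
    partial = ""
    for new_text in streamer:
        partial += new_text
        yield partial
```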

Anyway, thanks again for all the help w/ my first HF Spaces experience!

leonardlin changed discussion status to closed

@leonardlin Thanks for the feedback! I'll share this internally.

Hello @leonardlin, thank you for your feedback. We greatly appreciate hearing from our users.

Also, concurrency_limit seemed to be another heisenbug (e.g. it seemed to work locally but not on the Space),

How do you mean? During my testing with Colab and Spaces demo, I found that the chat interface is respecting the queue size.

the docs say the examples should be passed as a list, but if you have additional_inputs, then it breaks and has to be a list of lists.

We can maybe make our documentation clearer by providing examples that include additional inputs.
When calling a function (for text generation in this case) with multiple inputs (text prompt and additional inputs in this case), we need to provide example values for all of them in a list. If we have more than one example, we need to pass in a list of lists of example inputs. You can find more information on this in our Docs (https://www.gradio.app/docs/examples#initialization) and guides (https://www.gradio.app/guides/more-on-examples#providing-examples).
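For example, something along these lines (the generate function and the temperature slider are just illustrative):

```python
import gradio as gr

demo = gr.ChatInterface(
    fn=generate,  # signature: generate(message, history, temperature)
    additional_inputs=[gr.Slider(0.0, 1.0, value=0.7, label="Temperature")],
    # One inner list per example: the message plus a value for each additional input
    examples=[
        ["What is the weather like in Tokyo?", 0.7],
        ["Translate 'hello' into Japanese.", 0.3],
    ],
)
demo.launch()
```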

Heya @ysharma

How do you mean? During my testing with Colab and Spaces demo, I found that the chat interface is respecting the queue size.

Well, heisenbug because per the documentation, and locally, I was able to set concurrency_limit, but on the HF Space it barfed. Since it was a clean rebuild I assumed it'd be pulling the same version, but maybe not?

the docs say the examples should be passed as a list, but if you have additional_inputs, then it breaks and has to be a list of lists.

The documentation for https://www.gradio.app/docs/chatinterface#initialization says:

examples: list[str] | None (default: None)

sample inputs for the function; if provided, appear below the chatbot and can be clicked to populate the chatbot input.

and list[str] does in fact work, but if you also include additional_inputs, it needs to be changed to list[list[str]]. If it's supposed to mirror https://www.gradio.app/docs/examples it might make sense to link there; maybe there's a better way for the docs to stay in sync if all Examples() are supposed to behave the same way.
