65B

#2
opened by nxnhjrjtbjfzhrovwl

any chance for 65b?

I'd like to. VRAM is the challenge given the sequence lengths in this dataset. I'll see what I can do.

+1 here! Would love to see a 65b version. I have high hopes for it. It could be a strong contender.
Another idea: a 65b version with 4k context. I wonder how that would stack up against GPT-3.5-Turbo (which also has a 4k-long context). Perhaps 4k (a middle ground) in combination with 65b params could be the sweet-spot!

Actually, Ycros went ahead and did this! I haven't experimented with it personally, but he basically took my code verbatim and trained a 65b. See his repo ycros/airoboros-65b-gpt4-1.4.1-PI-8192-4bit-32g-actorder

I have done several experiments since building this model that show substantial improvements. The model I uploaded yesterday incorporates a long sequence pretraining phase and attempts to extend to 16k tokens. Despite the longer context, it outperforms this model even at shorter context lengths. https://huggingface.co/bhenrym14/airoboros-33b-gpt4-1.4.1-lxctx-PI-16384-GPTQ

I'll likely train a 65b parameter model incorporating all these improvements soon. I like the idea of only extending to 4k, as it may be possible to do so with minimal damage to short context performance, and on the hardware I have.
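
For anyone following along: the "PI" in these model names is RoPE position interpolation, i.e. positions are compressed back into the range the model was trained on rather than extrapolated past it. Here's a minimal sketch of the idea, assuming the standard LLaMA rotary parameters (head_dim=128, base=10000); it's just an illustration, not the exact training code:

```python
import torch

def rope_angles(seq_len, head_dim=128, base=10000.0, trained_len=2048, target_len=4096):
    """Rotary embedding angles with linear position interpolation (PI).

    Rather than feeding the model positions it never saw during training
    (extrapolation), PI rescales positions by trained_len / target_len so
    every inference position falls inside the trained range.
    """
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float() * (trained_len / target_len)  # e.g. x0.5 for 4k
    return torch.outer(positions, inv_freq)  # (seq_len, head_dim // 2) angles
```

A 4k extension only compresses positions by 2x, versus 4x for the 8k models, which is presumably why the short-context damage should be smaller.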

Thanks! I have already tried Ycros' model, but unfortunately it didn't work in Oobabooga with either ExLlama or AutoGPTQ (unlike your 33b versions; my experiments were done on 2 x A6000s).
What type of hardware would you require to do the 65b 4k training vs the 65b 8k training?

I've heard of issues from others as well. Weird.

The different hardware requirements are entirely down to the long sequence pre-training. I have a single RTX 6000 Ada (48gb) and it's enough to pretrain 65b to 2700 tokens. Based on my 33b experiments I think this is almost certainly enough for a 4k context extension. 8k might leave more on the table. I don't know. A proper 8k pretrain would likely require at least another 24gb of VRAM; probably much more. Maybe 2x A6000 could do it but there's certainly no guarantee.

How far beyond 4k are you able to go with ExLLama and only one of your A6000s?

Oh, interesting.
If "how far beyond 4k" refers to inference (I haven't done any finetuning/training yet), then the highest I've been able to successfully load in ExLlama is 16k, with your bhenrym14_airoboros-33b-gpt4-1.4.1-NTK-16384-GPTQ model. It consumes 42530MiB / 49140MiB on a single A6000. However, it doesn't produce very nice results: it stops early (tends to generate only 6-7 words at a time) and also makes grammar errors. The bhenrym14_airoboros-33b-gpt4-1.4.1-PI-8192-GPTQ one seems to be working fine, though (consuming around 30968MiB on a single A6000).

Ycros' 65B 8K model, on the other hand, does not even load: for some reason, even though I set gpu-split to something like 46,46 in Oobabooga's ExLlama configuration, it seems to completely ignore the second GPU. If I try to load it with AutoGPTQ instead, it loads but simply produces nonsensical output.

Your experience with the 16k NTK model is consistent with mine and others'. Thanks for the feedback on that. The alpha scaling parameter doesn't appear to correspond to the theoretical scaling multiple very well. Seems like it only manages out to 8-10k before it falls apart. These guys are doing good work on this. They've incorporated each scaling method in a more transparent and sophisticated way that should improve this behavior and more.
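
For reference on why alpha is hard to interpret: NTK-aware scaling changes the RoPE base rather than compressing positions, and the stretch it produces differs across frequency bands, so the nominal alpha doesn't map cleanly onto a context-length multiple. A rough sketch of the commonly circulated formulation (my reading of the public recipe, not necessarily what any given loader does internally):

```python
import torch

def ntk_scaled_inv_freq(alpha, head_dim=128, base=10000.0):
    """NTK-aware RoPE: scale the base by alpha**(d/(d-2)) instead of
    scaling positions (as linear PI does).

    Low-frequency components get stretched by roughly alpha, but the
    high-frequency ones barely move, so alpha=8 does not translate into a
    clean 8x of usable context -- consistent with the 16k NTK model
    degrading around 8-10k.
    """
    scaled_base = base * alpha ** (head_dim / (head_dim - 2))
    return 1.0 / (scaled_base ** (torch.arange(0, head_dim, 2).float() / head_dim))
```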

Inference is indeed what I meant. I'm just trying to get a sense of whether 65b at 8k would even be usable for many people (including myself!). I'll do some tests here when I get a chance; once that's done, I can kick off 65b training.

Sounds good. Meanwhile, I'll try to figure out what the deal is with ExLlama in Oobabooga ignoring the 2nd GPU (i.e. isolate whether the issue is due to ExLlama, Oobabooga, or both), which prevents me from loading a 65b with 8k context.

On the other hand, if a 65b with 4k context would fit in a single A6000, that could be a killer combo.
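
To put very rough numbers on that, here's a back-of-envelope estimate of 4-bit weights plus an fp16 KV cache, ignoring activation buffers, quantization group overhead and fragmentation, and assuming the standard LLaMA-1 shapes (33b: 60 layers x 6656 hidden; 65b: 80 layers x 8192 hidden). It lands near the ~42.5GiB I observed for the 33b model at 16k, so it's probably in the right ballpark:

```python
def vram_estimate_gib(n_params, n_layers, hidden, seq_len, weight_bits=4, kv_bytes=2):
    """Crude lower bound: quantized weight storage + fp16 K/V cache (batch 1)."""
    weights = n_params * weight_bits / 8
    kv_cache = 2 * n_layers * hidden * seq_len * kv_bytes  # K and V per layer
    return (weights + kv_cache) / 1024**3

print(f"33b @ 16k: {vram_estimate_gib(32.5e9, 60, 6656, 16384):.1f} GiB")  # ~39.5, vs ~41.5 observed
print(f"65b @ 4k:  {vram_estimate_gib(65.2e9, 80, 8192, 4096):.1f} GiB")   # ~40, headroom on 48GiB
print(f"65b @ 8k:  {vram_estimate_gib(65.2e9, 80, 8192, 8192):.1f} GiB")   # ~50, over a single A6000
```

So on this rough math a 65b at 4k plausibly fits on one A6000 with a few GiB to spare, while 8k pushes past 48GiB and needs the second GPU.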

In my experience the whole of attention is dropped onto GPU1, and if there isn't enough room you get an instant OOM. I have to do an 8,24 load on my dual 3090-esque system in order to get 30b 8k to work; otherwise the attention and the model weights on GPU1 together cause it to crash out. See what happens if you cut back the layer count on GPU0.

Wow, thanks, it worked now!!
Can you please explain in a bit more detail why that happens?

No clue. I only know the context cache seems to all go on GPU1. Llama 2 70b uses grouped-query attention, which shrinks the context cache to a fraction of its old size, so it's much less of an issue there.
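
To put rough numbers on that, assuming the published configs (80 layers, head_dim 128, fp16 cache, 64 K/V heads for LLaMA-1 65b versus 8 shared K/V heads for Llama 2 70b), the cache comparison looks like this:

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, seq_len, bytes_per=2):
    """fp16 K+V cache for batch size 1."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per / 1024**3

print(kv_cache_gib(80, 64, 128, 8192))  # LLaMA-1 65b @ 8k: 20.0 GiB, one K/V pair per head
print(kv_cache_gib(80, 8, 128, 8192))   # Llama 2 70b @ 8k:  2.5 GiB, 8 query heads share each K/V
```

So the cache that previously had to land entirely on one GPU becomes dramatically smaller.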
