Opinion about this model.

#1
by mirek190 - opened

I can say this 65B model writes stories much better (complex and long sentences) than dromedary-lora-65B-q5_1 ... far better.
Something closer to alpaca-lora 65B, which is a bit worse than this model anyway... actually reading now what I got from gpt4-alpaca-lora_mlp-65B ... damn, this is good :D

Even with math problems it does well.
Negative numbers are difficult for smaller llama models, but here they are handled ... well.

solve this equation and explain each step 2Y-12=-16
The equation is already in standard form. We can solve it by adding 12 to both sides of the equation, isolating Y on one side, and then applying the inverse operation of subtraction. Here are the steps:

  1. Start with 2Y - 12 = -16
  2. Adding 12 to both sides, we get: 2Y = -4
  3. Now dividing both sides by 2, we have: Y = -2.

** Note that the inverse operation of subtraction is addition and the inverse operation of multiplication is division. **
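
If you want to double-check answers like this programmatically, here is a minimal sketch using sympy (my own addition, not part of the model's output; the equation is taken straight from the prompt above):

```python
# Quick sanity check of the model's algebra with sympy.
# Hypothetical helper snippet, not from the original thread.
from sympy import Eq, Symbol, solve

Y = Symbol("Y")
equation = Eq(2 * Y - 12, -16)   # 2Y - 12 = -16

roots = solve(equation, Y)
print(roots)  # [-2], matching the model's answer Y = -2
```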

In conclusion:
Dromedary-lora-65B is not even worth keeping on my SSD :P
Alpaca-lora 65B is better than dromedary-lora-65B and has no restrictions at all, zero censorship... REALLY zero. Even some small models with "uncensored" in the name have limits... I tried hard... For test purposes, such questions would probably send me straight to jail... :D
Gpt4-alpaca-lora_mlp-65B is the best so far, but it is censored - not badly, but censored... NSFW works at least :P

I'm happy something is happening with 65B models that are more capable in everything, plus better reasoning.

How do you always know I've uploaded models before I've even told anyone? :)

Anyway, thanks for the feedback! I've not even tried it myself yet. Glad to hear it's good for creative writing, because Dromedary was awful for that IMHO.

Does it work with 32GB RAM, even if super slow?

I can't say for sure about this model, but usually 65B requires ~38GB RAM; however, you'd better ask @mirek190 .

Does it work with 32GB RAM, even if super slow?

I'd say with a huge amount of swap memory (i.e. your SSD) it's not impossible, but hardly practical.

You can see the RAM requirements in my README

[image: RAM requirements table from the README]

You technically could load the model with 32GB RAM if you have enough swap space. But it will be unbelievably, unpleasantly slow. I would not recommend it at all. It would take tens of minutes to load the model, and might take a minute or more to generate a single token.

Try a 30B model instead.

Or if you really want to use 65B, you could try it in the cloud. Microsoft Azure and Google Compute both give free credits to new users - $200 from Azure, $300 from Google Compute. You can't get GPU machines for free, but you can get decent CPU-only servers, e.g. a Linux server with 16 cores and 128GB RAM. That would be usable with this model.
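
For a rough sense of why 32GB isn't enough, here's a back-of-envelope sketch (my own numbers, not from the README; the effective bits-per-weight values for GGML quant formats are approximations):

```python
# Rough RAM estimate for a quantized 65B GGML model.
# Bits-per-weight figures include block scales and are approximate;
# check the README table for the exact file sizes.
BITS_PER_WEIGHT = {"q4_0": 4.5, "q5_1": 6.0, "q8_0": 8.5}

def estimate_ram_gb(n_params_billion: float, quant: str, overhead_gb: float = 2.0) -> float:
    """Model weights plus a couple of GB for context/KV cache and runtime overhead."""
    weight_gb = n_params_billion * 1e9 * BITS_PER_WEIGHT[quant] / 8 / 1e9
    return weight_gb + overhead_gb

for quant in ("q4_0", "q5_1"):
    print(quant, round(estimate_ram_gb(65, quant), 1), "GB")
# q4_0 ~38.6 GB, q5_1 ~50.8 GB -- well past 32GB either way.
```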

Hello TheBloke

I'm getting almost 1 token/s on my CPU, an i9 9900K with 4-channel DDR4 3800 MHz.
I'm curious how many tokens/s you're getting on 2x 4090 with that model.

With the 4bit GPTQ model + 2 x 4090, testing on the Dromedary 65B model (but it should be the same for this one), I got:

Output generated in 95.16 seconds (4.97 tokens/s, 473 tokens, context 33, seed 2085900431)
Output generated in 424.31 seconds (3.54 tokens/s, 1504 tokens, context 33, seed 1719700952)

That's with streaming enabled; it'd probably be a bit quicker without streaming.

That second query ran out of memory after 1500 tokens returned. So I can't use the full context size of the model.

I tried to enable CPU offloading so it wouldn't run out of memory, but I couldn't get that working. I think there may be some bugs with CPU offloading + multiple GPUs at the moment.

This will hopefully be fixed in the future, e.g. once AutoGPTQ is ready.

I'm interested in what was more of a bottleneck - RAM bandwidth or core count?

Hmm, 4-5 tokens/s on the most powerful consumer GPU available for now ... 4x-5x faster than my (old) CPU.
Thanks

just about 1 token/s on Ryzen 5900X + 3090 Ti using the new GPU offloading in llama.cpp (oobabooga webui, Windows 11, q4_0, --n_gpu_layers 41). Quite slow (1 t/s), but for coding tasks it works absolutely best of all the models I've tried. Maybe I should try it on Linux.
edit: I moved to Linux and now it "runs" at 1.7 t/s... not great, but already usable
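
For reference, the same kind of partial offload can also be done from Python via the llama-cpp-python bindings - a minimal sketch, assuming a CUDA-enabled build; the model filename and layer count below are placeholders:

```python
# Minimal llama-cpp-python sketch of partial GPU offloading.
# Assumes llama-cpp-python was built with CUDA support; path and values are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="gpt4-alpaca-lora_mlp-65B.ggml.q4_0.bin",  # hypothetical filename
    n_ctx=2048,        # context window
    n_gpu_layers=41,   # layers offloaded to the GPU, the rest stay in RAM
    n_threads=8,       # CPU threads for the non-offloaded layers
)

out = llm("Write a short story about a dragon who learns to code.", max_tokens=256)
print(out["choices"][0]["text"])
```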

Today I tested it on an A6000 48GB with llama.cpp's new offloading feature and got 6.86 tokens/s (145.73 ms per run) - pretty cool!

Testing the GPTQ version, I got 12.46 tokens/s with CUDA AutoGPTQ, and 6.12 tok/s with Triton AutoGPTQ. But Triton uses less VRAM and won't OOM in 48GB, where CUDA does at around 1600 tokens returned.
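
For anyone who wants to reproduce the CUDA vs Triton comparison, here's a rough AutoGPTQ loading sketch (the model directory is a placeholder, not the actual repo layout; use_triton is the switch between the two backends):

```python
# Rough AutoGPTQ loading sketch -- toggle use_triton to compare backends.
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

model_dir = "gpt4-alpaca-lora_mlp-65B-GPTQ"  # hypothetical local directory

tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_dir,
    device="cuda:0",
    use_triton=False,   # False = CUDA kernels (faster here), True = Triton (lower VRAM)
    use_safetensors=True,
)

inputs = tokenizer("Tell me about alpacas.", return_tensors="pt").to("cuda:0")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=128)[0]))
```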

Impressive!

Do the requirements state RAM, not VRAM? If so, can my 3090 run the model in 5-bit as long as I have 64GB RAM?

YES

I have an RTX 3090 as well.
With 65B q5_1 you can put almost 40 layers on the GPU. I get 700 ms/t.

@nulled I was able to run the q3_K_S model on 32GB RAM with GPU offloading of 20 layers. It was quite slow though - total time was about 21 minutes for the response. I let it do its thing while I took a coffee break, but it's great to see I can run a 65B model locally.

[screenshot: gpt4Alpaca65B_21mins.PNG]

At least it works ;)
A few months ago, even running a 7B model on a local PC was nearly impossible.

Very true!
