Minimum VRAM / GPU specs?

#30
by sociopathic-hamster - opened

Will this run on 4 x 3090 (96GB total VRAM)?

(and can it be RoPE scaled?)

Will this run on 4 x 3090 (96GB total VRAM)?

Yes, it will run on 96 GB of VRAM.
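
For reference, here's a rough sketch (my assumption of one way to do it, not an official recipe) of loading a 70B model across four 3090s with 4-bit quantization via `transformers` + `bitsandbytes`. At fp16 the weights alone are ~140 GB, while at 4-bit they drop to roughly 35-40 GB, which fits in 96 GB with headroom for the KV cache:

```python
# Rough sketch: shard a 4-bit-quantized 70B model across 4 x 3090.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "stabilityai/StableBeluga2"  # assumed repo id; swap in whatever checkpoint you're using

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # splits the layers across all visible GPUs
    # Recent transformers versions also accept a rope_scaling config here,
    # e.g. rope_scaling={"type": "linear", "factor": 2.0}, for context extension.
)

prompt = "### User:\nWill this run on 4 x 3090?\n\n### Assistant:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

GPTQ or GGML/llama.cpp quantizations are alternative routes to roughly the same VRAM budget.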

Amazing, thanks! I wonder what the performance will be like... I mean, quantized versions of these SOTA 70B releases are an amazing advance and totally usable for many downstream applications, but the inference quality is noticeably degraded compared to the original models, some of which subjectively "feel" equal or superior to ChatGPT (the chat versions) or text-davinci-003 (the Llama 2 70B base model), at least for workloads that don't involve heavy code generation.

If it really is possible to run Stable Beluga 2 and other SOTA finetunes of llama-2-70b on consumer-grade hardware like this, the path to a GPT-4 killer becomes quite straightforward:

Step 1: deploy a number of 70B models in parallel: some generalists like SB2, some other models finetuned for various workloads, and a bunch of RWKV finetunes for certain tasks (RWKV handles things like creative writing shockingly well and fits in 16 GB of VRAM).
Step 2: put them behind a fast 13B model fine-tuned for high-performance classification: a super-classifier, if you will, which delegates requests to the most appropriate large model(s), assembles the result, and presents it to the user (a toy sketch of this routing follows the list below). Since any of the currently trending 70B generalist models is ALREADY on par with GPT-3.5 / ChatGPT, even a naive implementation of this architecture, with no significant finetuning or training beyond prompt engineering, that simply directs each request based on the gross strengths of the various models (e.g. code writing vs. mathematical reasoning vs. multi-step CoT), will be unarguably superior to GPT-3.5 and nipping at the heels of GPT-4.
Step 2a: add SOTA SDXL and various CV models to the system, making it fully multimodal like GPT-4 supposedly is (but they censored it).
Step 3: finetune the 70B models so they know how and when to ask for help from their peers, creating a mesh topology that MAY not be all that different from GPT-4's current architecture (we know it's made from multiple models; we don't know their individual characteristics or how they work together). For bonus points, integrate a web scraper completely behind the scenes that the models call upon when they need more context, just like a human would when they google something... and then save the scraped data in a vector database that serves as long-term memory shared by ALL the models, so that next time they don't have to go to the web for answers (and periodically we can use the data stored in that retrieval system to finetune the models themselves).
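
For what it's worth, here's a toy sketch of the Step 2 "super-classifier" routing. The `classify` heuristic stands in for the fast 13B classifier, and the backend names and `route_request` helper are hypothetical placeholders rather than any real API:

```python
# Toy router: a cheap classifier picks which specialist backend serves a request.
from typing import Callable, Dict

# Hypothetical backends: in practice each would be an HTTP call to a separate
# inference server hosting a different 70B finetune (or an RWKV model).
BACKENDS: Dict[str, Callable[[str], str]] = {
    "general":  lambda prompt: f"[StableBeluga2] {prompt}",
    "code":     lambda prompt: f"[code-finetune] {prompt}",
    "math":     lambda prompt: f"[math-finetune] {prompt}",
    "creative": lambda prompt: f"[RWKV-creative] {prompt}",
}

def classify(prompt: str) -> str:
    """Stand-in for the 13B classifier; here just keyword heuristics."""
    lowered = prompt.lower()
    if any(k in lowered for k in ("def ", "function", "bug", "compile")):
        return "code"
    if any(k in lowered for k in ("prove", "integral", "solve for")):
        return "math"
    if any(k in lowered for k in ("story", "poem", "lyrics")):
        return "creative"
    return "general"

def route_request(prompt: str) -> str:
    """Delegate the prompt to the most appropriate large model."""
    return BACKENDS[classify(prompt)](prompt)

print(route_request("Write a short story about a beluga whale."))
```

In a real deployment the classifier would itself be a finetuned model emitting a route label, and the router could fan out to several backends and merge their answers.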

At that point, presuming the models are capable of admitting they don't know an answer (or we create a specialist model that predicts whether another model's answer is hallucinated), we have a multi-tiered knowledge engine (sketched in code after the list below):

Do I know the answer?
Does my colleague know the answer?
None of us know the answer; do we have anything in our knowledge base to guide us to it?
We don't? Scraper, please google that for me and return cleaned-up text to give me some context.
OK, now we know the answer; please save this context to the knowledge base.
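
Roughly, that flow could look like the toy sketch below, where `ask_model`, `scrape_web`, and the in-memory `KNOWLEDGE_BASE` dict are hypothetical stubs standing in for the real models, scraper, and shared vector database:

```python
# Toy sketch of the tiered flow: self -> peer -> knowledge base -> web scraper,
# with newly scraped context written back as shared long-term memory.
from typing import Dict, List, Optional

KNOWLEDGE_BASE: Dict[str, str] = {}  # stands in for a shared vector database

def ask_model(name: str, question: str, context: str = "") -> Optional[str]:
    """Stub for querying one of the 70B models; None means 'I don't know'."""
    if context:
        return f"[{name}] answer grounded in: {context[:40]}..."
    return None  # simulate the model admitting it lacks the answer

def scrape_web(question: str) -> str:
    """Stub for the behind-the-scenes scraper returning cleaned-up text."""
    return f"cleaned web text about: {question}"

def answer(question: str, models: List[str]) -> str:
    # 1. Do I / does my colleague know the answer?
    for name in models:
        reply = ask_model(name, question)
        if reply is not None:
            return reply
    # 2. Anything in the shared knowledge base?
    context = KNOWLEDGE_BASE.get(question)
    if context is None:
        # 3. Scraper, please google that and give me some context.
        context = scrape_web(question)
        KNOWLEDGE_BASE[question] = context  # save it for next time
    return ask_model(models[0], question, context=context) or ""

print(answer("What is RWKV?", models=["StableBeluga2", "code-specialist"]))
```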

Steps 1 and 2 can be done by a normal (reasonably good) full-stack engineer who is familiar with LLMs and doesn't mind spending $500 on cheap GPU compute on vast.ai to prove the concept...
Step 3 should be doable in a few months by a team of 5 guys, including at least 1 senior ML engineer. Server-wise you're looking at an 8xA100 rig for inference, another for finetuning, and a bunch of smaller / cheaper instances for various ancillary models and classical computing. Probably $30k to surpass GPT-4 on reproducible benchmarks (not including salary for the 5 guys, lol).

BTW, I fit into the "reasonably good full-stack engineer" category... I'm building cool apps that use LLMs, multimodal pipelines, etc., but I don't have the academic background in neural network design, so I might be missing something here :)

Stable Beluga 2 will never beat ChatGPT because it is being trained on ChatGPT outputs.

It's being fine-tuned on ChatGPT outputs, true... but it was pretrained on far more than that.

I do see your point. What I'm saying is that finetuning has become so easy and computationally cheap that it will be possible to achieve great results by using multiple fine-tuned variants of this and other contemporary models, if the orchestration between them is done right.

Do you mean MoE?
