Any way I can run it on my low-to-mid tier HP desktop? Specs attached as a .png. BTW, I know it's probably a long shot.

#18 opened by vgrowhouse

image1.png

100% stock, no upgraded RAM. Also, if you're reading this, could my old GTS 450 run this?

The simple answer is no; it's like trying to fit a train into a car, or rather onto a bike.
It's on HuggingChat, so use it there instead.

WTF, are you seriously asking about those systems, lol! Bro, you can't even run this on Kaggle or Colab (the best freely available notebooks).

A refurbished Mac Studio M1 Ultra with 128GB RAM can be found on eBay for $2.5k-$3k and can run 70B models at Q8 at ~7.5 tokens/sec, which IMO is perfect for chatting (slightly above my reading speed). Up to 8k tokens of context it is still OK at ~5 tokens/sec.
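To see why 128GB is the comfortable tier for this, here's a rough back-of-envelope estimate of resident weight size at different quantizations. The bits-per-weight figures are approximate effective rates for common llama.cpp-style quant formats, and the 1.1 overhead factor is an assumption; real GGUF files vary:

```python
# Back-of-envelope weight-memory estimate for a 70B model.
# Bits-per-weight values are approximate effective rates for
# common llama.cpp quant formats; the overhead factor is a guess.

def model_size_gb(params_b: float, bits_per_weight: float, overhead: float = 1.1) -> float:
    """Approximate resident size of the weights in GB."""
    return params_b * 1e9 * bits_per_weight / 8 * overhead / 1e9

for name, bits in [("Q4_K_M", 4.8), ("Q8_0", 8.5), ("FP16", 16.0)]:
    print(f"70B @ {name}: ~{model_size_gb(70, bits):.0f} GB")
```

At Q8 that lands around 80GB for the weights alone, which fits in 128GB of unified memory with room left for the OS and the KV cache, but rules out any 16GB or 32GB machine.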

It can also fit a 64k context in VRAM if you mess around with iogpu.wired_limit_mb (increasing the max VRAM allocation), but with 32k tokens in the context the speed drops to around 2 tokens/sec, which is not good for interactive chat but still usable if you are not in a rush (e.g. ask it to summarize a big document and go for a walk).
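The context is what eats the extra headroom. A minimal sketch of the KV-cache math, assuming the published Llama-2-70B shape (80 layers, GQA with 8 KV heads, head dim 128) and an fp16 cache; substitute your model's config if it differs:

```python
# KV-cache size: keys + values, per layer, per KV head, per token.
# Shape numbers assume Llama-2-70B (80 layers, 8 KV heads via GQA,
# head_dim 128) with an fp16 cache (2 bytes per element).

def kv_cache_gb(seq_len: int, n_layers: int = 80, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

for ctx in (8_192, 32_768, 65_536):
    print(f"{ctx:>6} tokens: ~{kv_cache_gb(ctx):.1f} GB")
```

At 64k tokens that's roughly 21GB on top of the ~80GB of weights, which is why you end up raising the default VRAM cap in the first place.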

Even better, you can get an M2 or M3 Mac mini for about $600-$800 and use it solely for this purpose.

Yes, a Mac mini can fit a 70B model in VRAM, but its memory bandwidth and GPU performance don't compare with a Mac Studio's Ultra chip. Here's a video of someone running a 70B model on a Mac mini: https://www.youtube.com/watch?v=xyKEQjUzfAk (it works, but very slowly).
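The gap is easy to quantify: single-stream decoding is roughly memory-bandwidth bound, since each generated token streams the full weight set through the GPU. A rough upper bound, using Apple's published bandwidth specs (real throughput lands below this, and the 75GB model size is the Q8 weight estimate from above):

```python
# Upper-bound decode speed ~= memory bandwidth / bytes read per token.
# For dense models each token touches all weights, so bytes per token
# ~= model size. Bandwidth figures are Apple's published specs (GB/s).

MODEL_GB = 75  # ~70B at Q8, weights only (estimate)

for chip, bandwidth_gbps in [("M2 Mac mini", 100),
                             ("M2 Pro Mac mini", 200),
                             ("M1 Ultra Mac Studio", 800)]:
    print(f"{chip:>20}: <= {bandwidth_gbps / MODEL_GB:.1f} tokens/sec")
```

The ~10.7 tokens/sec ceiling for the M1 Ultra is consistent with the ~7.5 observed above; the base M2 mini's ~1.3 tokens/sec ceiling is why the video looks so painful.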
