Inference takes roughly 3 minutes on a 4090

#4
by macadeliccc - opened

Edit: It doesn't fit on a 4090 at all. I had just made an assumption based on every other 7B model, but the demo code wasn't using CUDA because the model didn't fit.

I made it work on a 3050 Ti Laptop, so it's probably something with the settings.

Honestly, that's really weird; I have not had that issue with any other 7B model. Are you explicitly putting the model and tokenizer onto the GPU? If not, the demo code is likely to just use system memory.
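For reference, a minimal sketch of what I mean (the repo id is a placeholder; adjust the dtype to whatever your GPU supports):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/your-7b-model"  # placeholder, use the actual repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Load in half precision and place the weights on the GPU explicitly;
# without .to("cuda") (or a device_map), generation runs on the CPU out of system RAM.
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda")

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```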

@macadeliccc The model is loaded into system memory, not GPU memory; GPU memory handles the compute. I am running it on 61 GB of RAM and it occupies roughly 97% of system memory, so you would need something around that to do inference alongside a 4090.

@kreouzisv Thank you. I have just been using the 8-bit quants from TheBloke with llama.cpp and GPU acceleration. It seems to be much more efficient than the raw model.
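For anyone else landing here, this is roughly the setup via llama-cpp-python (the GGUF file name and context size are just examples; `n_gpu_layers=-1` offloads all layers, assuming a CUDA-enabled build):

```python
from llama_cpp import Llama

# Path and quant name are examples; point this at whichever 8-bit GGUF you downloaded from TheBloke.
llm = Llama(
    model_path="./models/model-7b.Q8_0.gguf",
    n_gpu_layers=-1,  # offload all layers to the GPU
    n_ctx=4096,
)

out = llm("Q: What is the capital of France? A:", max_tokens=32)
print(out["choices"][0]["text"])
```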
