I just wanna thank everyone that worked on this

#15
by Ryann - opened

Amazing job. This is really uncanny. I'm running this on an RTX 2070 and an i5 from years ago on my second computer. This is nuts: I'm getting 6 tokens per second. This opens up a whole other world of possibilities for running local applications for me. Of course it's not 100% perfect, but what do you expect with a model this size? How incredibly far we've gotten.

Hey Ryann, would you mind sharing how you did that?
I'm trying on an RTX 3070 but failing with this error:

```
Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit
the quantized model. If you want to dispatch the model on the CPU or the disk while keeping
these modules in 32-bit, you need to set load_in_8bit_fp32_cpu_offload=True and pass a custom
device_map to from_pretrained. Check
https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu
for more details.
```
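The docs linked in the error describe the fix: enable fp32 CPU offload in the quantization config and pass an explicit device_map. Here's a minimal sketch based on that page, assuming an 8-bit bitsandbytes load. Note that recent transformers versions spell the flag from the error message as llm_int8_enable_fp32_cpu_offload on BitsAndBytesConfig, and the module names below come from the docs' BLOOM example, so they will differ for other architectures:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Keep modules that don't fit on the GPU in fp32 on the CPU.
# (The error's load_in_8bit_fp32_cpu_offload flag is exposed as
# llm_int8_enable_fp32_cpu_offload on BitsAndBytesConfig in recent versions.)
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,
)

# Custom device_map: module names here are from the docs' BLOOM example;
# inspect model.named_modules() to find the right names for your model.
device_map = {
    "transformer.word_embeddings": 0,
    "transformer.word_embeddings_layernorm": 0,
    "lm_head": "cpu",
    "transformer.h": 0,
    "transformer.ln_f": 0,
}

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-1b7",  # example model from the linked docs, not this repo's model
    device_map=device_map,
    quantization_config=quantization_config,
)
```

The idea is that anything mapped to "cpu" stays in fp32 and gets offloaded, while the rest is quantized to 8-bit on GPU 0, so you'd adjust the map until the on-GPU pieces fit in the 3070's 8 GB.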

@Ryann How much GPU RAM did it take?
