General tips around inference speed?

#3
by jloganolson - opened

Maybe I just haven't run a model this large before, but I'm blown away by how much slower this 16B model is than a 12B-parameter model like Pythia. Are there any speed-up tips? Things I'm doing so far (rough loading sketch after the list):

  • On PyTorch 2.0
  • Confirmed it's running on CUDA (device=0 and VRAM is soaked)
  • torch_dtype=torch.bfloat16
  • Even tried loading in 8-bit, with no noticeable speed-up (but I guess that one is more about alleviating memory than speeding things up).

Any other ideas?
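
For reference, a minimal sketch of the loading path described above — the model ID is a placeholder, not the exact checkpoint, and the arguments just mirror the bullet list:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/your-16b-model"  # placeholder, substitute the actual 16B checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half-precision weights, as in the bullet list
    device_map="auto",           # puts the model on the GPU (device=0, VRAM soaked)
    # load_in_8bit=True,         # 8-bit loading mainly saves memory rather than latency
)

inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```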

Did PyTorch 2.0 give you a noticeable improvement?

I actually didn't try PyTorch 1.x, so I don't know! The last thing I was going to try is DeepSpeed inference (per this tutorial: https://www.deepspeed.ai/tutorials/inference-tutorial/), but I don't know how much improvement I'll see on a single-GPU machine.
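
The single-GPU pattern from that tutorial looks roughly like this — a sketch only, and the `init_inference` argument names may differ between DeepSpeed versions:

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/your-16b-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

# Wrap the model in DeepSpeed's inference engine. On a single GPU, mp_size=1;
# the fused CUDA kernels (kernel injection) are where most of the speed-up comes from.
ds_engine = deepspeed.init_inference(
    model,
    mp_size=1,
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)
model = ds_engine.module

inputs = tokenizer("Hello, world", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```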

Try editing the config.json file to say use_cache: true. That will help.
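
If you'd rather not edit the file by hand, the same setting can be flipped in code — a sketch, continuing from the loading snippet earlier in the thread (so `model` and `inputs` are assumed to exist):

```python
# Equivalent to editing config.json: enable the KV cache so generate()
# reuses past key/value states instead of recomputing attention over the
# whole prefix at every decoding step.
model.config.use_cache = True

# It can also be passed per call:
outputs = model.generate(**inputs, max_new_tokens=64, use_cache=True)
```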

Hugging Face H4 org

cc @lewtun

> Try editing the config.json file to say use_cache: true. That will help.

Thanks! I've noticed 40% faster inference by using use_cache: true.
