Inference speed on A100
Hey @teknium - loved your work both here and on Twitter.
Since Phi-1.5 needs only 3.16 GB of VRAM for fp16 inference, can we run approximately 24 copies of it on an A100-80GB GPU?
If that is possible, and the 3 ms per token claimed in the Phi-1.5 technical paper is also achievable with flash attention, can we generate about 7200 tokens per second (24 copies × ~300 tokens per second) on an A100-80GB GPU?
I'm a non-technical guy, just asking out of curiosity. Thanks. 🙏🏼
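For anyone curious, here is a minimal sketch of the back-of-envelope arithmetic behind the question, using the question's own figures (3.16 GB fp16 footprint, 3 ms/token). It ignores KV-cache memory, activations, and scheduling overhead, so treat it as a theoretical upper bound rather than a real benchmark:

```python
# Back-of-envelope throughput estimate for many Phi-1.5 copies on one A100.
# Assumes the 3.16 GB fp16 footprint and 3 ms/token latency hold, and
# ignores KV-cache memory, activations, and batching overhead.

gpu_vram_gb = 80.0      # A100-80GB
model_vram_gb = 3.16    # Phi-1.5 weights at fp16 (figure from the question)
ms_per_token = 3.0      # per-token latency claimed in the Phi-1.5 paper

copies = int(gpu_vram_gb // model_vram_gb)   # 25 copies by weights alone
tok_per_sec_each = 1000.0 / ms_per_token     # ~333 tokens/s per copy
aggregate = copies * tok_per_sec_each

print(f"{copies} copies x {tok_per_sec_each:.0f} tok/s = {aggregate:.0f} tok/s")
# -> "25 copies x 333 tok/s = 8333 tok/s"; the question conservatively
#    rounds down to 24 copies x 300 tok/s = 7200 tok/s.
```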
Not sure. It's actually been fairly slow for me lol