Inference speed on A100
Hey @teknium - loved your work both here and on Twitter.
Since Phi-1.5 needs only 3.16 GB of VRAM for fp16 inference, can we run approximately 24 copies of it on an A100-80GB GPU?
If that is possible, and the 3 ms per token claimed in the Phi-1.5 technical paper is also achievable with flash attention, can we generate about 7200 tokens per second (24 copies × ~300 tokens per second) on an A100-80GB GPU?
I'm a non-technical guy, just asking out of curiosity. Thanks. 🙏🏼
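For anyone curious, here is a minimal sketch of the back-of-envelope arithmetic behind the question, using the question's own figures (3.16 GB fp16 footprint, 3 ms/token). It ignores KV-cache memory, activations, and scheduling overhead, so treat it as a theoretical upper bound rather than a real benchmark:

```python
# Back-of-envelope throughput estimate for many Phi-1.5 copies on one A100.
# Assumes the 3.16 GB fp16 footprint and 3 ms/token latency hold, and
# ignores KV-cache memory, activations, and batching overhead.

gpu_vram_gb = 80.0      # A100-80GB
model_vram_gb = 3.16    # Phi-1.5 weights at fp16 (figure from the question)
ms_per_token = 3.0      # per-token latency claimed in the Phi-1.5 paper

copies = int(gpu_vram_gb // model_vram_gb)   # 25 copies by weights alone
tok_per_sec_each = 1000.0 / ms_per_token     # ~333 tokens/s per copy
aggregate = copies * tok_per_sec_each

print(f"{copies} copies x {tok_per_sec_each:.0f} tok/s = {aggregate:.0f} tok/s")
# -> "25 copies x 333 tok/s = 8333 tok/s"; the question conservatively
#    rounds down to 24 copies x 300 tok/s = 7200 tok/s.
```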
Not sure. It's actually been fairly slow for me lol