
Phi3 Mini 128k 4 Bit Quantized


Flash Attention

  • The Phi-3 family supports Flash Attention 2, a mechanism that speeds up inference and reduces memory use.
  • When quantizing Phi-3 on an RTX 4090 (24 GB) with Flash Attention disabled, quantization failed due to insufficient VRAM.
  • Enabling Flash Attention allowed quantization to complete with roughly 10 gigabytes of VRAM still free on the GPU (see the sketch below).
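As a rough illustration (not the exact commands used to produce this card), the following sketch loads Phi-3 Mini in 4-bit with Flash Attention 2 enabled via the Hugging Face transformers and bitsandbytes libraries; the model ID, prompt, and quantization settings here are assumptions.

```python
# Hypothetical sketch: 4-bit load of Phi-3 Mini 128k with Flash Attention 2,
# the combination that kept VRAM low enough to fit on a 24 GB RTX 4090.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "microsoft/Phi-3-mini-128k-instruct"  # assumed base model

# 4-bit NF4 quantization via bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",  # requires the flash-attn package
    torch_dtype=torch.float16,
    device_map="auto",
)

inputs = tokenizer("Hello, Phi-3!", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```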

Metrics

Total size:
  • Before: 7.64 GB
  • After: 2.28 GB
VRAM usage:
  • Before: 11.47 GB
  • After: 6.57 GB
Average inference time:
  • Before: 12 ms/token
  • After: 5 ms/token
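A ms/token figure like the one above can be obtained by timing generation and dividing by the number of new tokens. The snippet below is a minimal sketch, assuming the `model` and `tokenizer` from the loading example above and a CUDA device; the prompt, warm-up policy, and token count are assumptions.

```python
# Rough per-token latency measurement (assumes model/tokenizer loaded above).
import time
import torch

prompt = "Explain Flash Attention in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Warm-up run so CUDA kernels are initialized before timing
model.generate(**inputs, max_new_tokens=8)

new_tokens = 128
torch.cuda.synchronize()
start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=new_tokens)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

generated = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{elapsed / generated * 1000:.1f} ms/token over {generated} tokens")
```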