GGUF quants for AI-MO/NuminaMath-7B-TIR using llama.cpp

Terms of Use: Please check the original model


Quants

  • q2_k: Uses Q4_K for the attention.wv and feed_forward.w2 tensors, Q2_K for the other tensors.
  • q3_k_s: Uses Q3_K for all tensors
  • q3_k_m: Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K
  • q3_k_l: Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K
  • q4_0: Original quant method, 4-bit.
  • q4_1: Higher accuracy than q4_0 but not as high as q5_0. However, it has quicker inference than the q5 quants.
  • q4_k_s: Uses Q4_K for all tensors
  • q4_k_m: Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K
  • q5_0: Higher accuracy than q4_1, with higher resource usage and slower inference.
  • q5_1: Even higher accuracy than q5_0, with higher resource usage and slower inference.
  • q5_k_s: Uses Q5_K for all tensors
  • q5_k_m: Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K
  • q6_k: Uses Q6_K for all tensors
  • q8_0: Almost indistinguishable from float16. High resource use and slow. Not recommended for most users.
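
To make the "4-bit" entries above more concrete, here is a minimal sketch of block quantization in the spirit of q4_0: a group of weights shares one scale, and each weight is stored as a small integer code. This is a simplified illustration, not llama.cpp's actual kernel. The block size of 32, the symmetric rounding to [-8, 7], and the function names are all assumptions made for the example.

```python
# Simplified sketch of 4-bit block quantization, loosely modeled on q4_0.
# Assumptions (not taken from llama.cpp source): symmetric rounding, one
# Python float scale per block, and no bit packing of the codes.

BLOCK_SIZE = 32  # number of weights that share one scale (assumed)


def quantize_block(weights):
    """Return (scale, codes) where every code is an integer in [-8, 7]."""
    max_abs = max(abs(w) for w in weights)
    # Map the largest magnitude onto code 7; fall back to 1.0 for all-zero blocks.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate weights from the scale and integer codes."""
    return [scale * c for c in codes]


if __name__ == "__main__":
    # Deterministic toy block with values in [-1, 1).
    block = [((i * 37) % 64 - 32) / 32 for i in range(BLOCK_SIZE)]
    scale, codes = quantize_block(block)
    restored = dequantize_block(scale, codes)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print(f"scale={scale:.4f} worst-case error={worst:.4f}")
```

In this sketch the reconstruction error per weight is at most half the scale. That is the basic trade-off behind the list above: more bits per code give finer steps and higher accuracy, at the cost of larger files.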

Format: GGUF
Model size: 6.91B params
Architecture: llama
Available precisions: 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, 8-bit, 16-bit
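
Given the 6.91B parameter count listed above, a rough file size for the non-K quants can be estimated from their bits per weight. The sketch below assumes the commonly described block layouts: 32 weights per block, a 16-bit scale, and an additional 16-bit minimum for the `_1` variants. The function names are hypothetical. The K-quants are left out because they mix tensor types, as the list above describes.

```python
# Rough size estimate for the legacy (non-K) quants.
# Assumptions: 32 weights per block, fp16 scale per block, plus an fp16
# minimum for the *_1 variants. Real files will differ somewhat, since
# they also carry metadata and may store some tensors at other precisions.

PARAMS = 6.91e9  # parameter count listed on this model card

LEGACY_LAYOUTS = {
    # name: (bits per weight code, extra bits stored once per block)
    "q4_0": (4, 16),
    "q4_1": (4, 32),
    "q5_0": (5, 16),
    "q5_1": (5, 32),
    "q8_0": (8, 16),
}


def bits_per_weight(name, block_size=32):
    """Effective bits per weight, including the per-block overhead."""
    code_bits, overhead_bits = LEGACY_LAYOUTS[name]
    return (block_size * code_bits + overhead_bits) / block_size


def estimated_gb(name, params=PARAMS):
    """Approximate file size in gigabytes (10**9 bytes)."""
    return params * bits_per_weight(name) / 8 / 1e9


if __name__ == "__main__":
    for name in LEGACY_LAYOUTS:
        print(f"{name}: {bits_per_weight(name):.2f} bpw, ~{estimated_gb(name):.2f} GB")
```

These figures are only an approximation for choosing a quant that fits in available memory; check the actual file sizes in the repository before downloading.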

