
Quantizations for Gryphe/MythoMax-L2-13b in the EXL2 format

| Quant | Mem 4k | Mem 4k8 | Δp | test wiki | test pippa |
|---|---|---|---|---|---|
| 4k_hb8_b8_pippa | 17.2GB | 15.7GB | -0.0183 | 5.7781 | 4.4221 |
| 4k_hb6_b5_pippa | 12.6GB | 11.0GB | -0.0197 | 6.0141 | 4.4252 |
| 2k_hb8_b8_pippa | 17.2GB | 15.7GB | -0.0246? | 5.7762 | 4.4238 |
| 2k_hb6_b5_pippa | 12.6GB | 11.0GB | -0.0121 | 6.0823 | 4.4363 |
| 4k_hb8_b8 | 17.2GB | 15.7GB | | 5.7459 | 4.4247 |
| 4k_hb6_b6 | 15GB | | | | |
| 4k_hb6_b5 | 12.6GB | 11.0GB | | 5.7699 | 4.4514 |
| 2k_hb8_b8 | 17.2GB | 15.7GB | | 5.7572 | 4.4242 |
| 2k_hb6_b4.125 | | | | | |

Breaking down the names (an example conversion command follows this list):

  • 4k means the quant was calibrated with a row length of 4096 rather than the default 2048
  • hb8 means the head (output layer) is kept at 8 bits
  • b8 means an average of 8.0 bits per model weight
  • pippa means the quant was calibrated on pippa-llama2-chat instead of wikitext
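
For reference, a quant named like 4k_hb8_b8_pippa would come from a conversion call roughly along these lines. This is a minimal sketch assuming exllamav2's convert.py and its -l/-b/-hb/-c options; the paths are placeholders and flag spellings may differ between exllamav2 versions.

```bash
# Sketch only: build a quant named like 4k_hb8_b8_pippa.
# Paths are placeholders; exllamav2 flag spellings may vary by version.
python convert.py \
    -i /models/MythoMax-L2-13b \
    -o /tmp/exl2-work \
    -cf /models/MythoMax-L2-13b-exl2-4k_hb8_b8_pippa \
    -c pippa-llama2-chat.parquet \
    -l 4096 \
    -b 8.0 \
    -hb 8
# -l 4096 -> "4k"  (calibration row length)
# -hb 8   -> "hb8" (head bit depth)
# -b 8.0  -> "b8"  (average bits per weight)
# -c ...  -> "pippa" (calibration dataset)
```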

Additional analysis:

  • Δp | (quant_perplexity - base_perplexity) as reported in stdout during quant creation. Unsure whether this is useful
  • test wiki | Perplexity reported by test_inference.py against wikitext with 4096 length (example invocation below)
  • test pippa | Perplexity reported by test_inference.py against pippa chat with 4096 length
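
As a rough illustration, the perplexity numbers above correspond to a test_inference.py run of this shape (a sketch assuming its -m/-ed/-el evaluation flags; the dataset filename is a placeholder):

```bash
# Sketch only: measure perplexity of a finished quant at 4096 length.
# Flag spellings and dataset paths are illustrative.
python test_inference.py \
    -m /models/MythoMax-L2-13b-exl2-4k_hb8_b8_pippa \
    -ed wikitext-test.parquet \
    -el 4096
```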

(Possible) Takeaways:

  • The difference between 2k and 4k 8-bit calibration is about 0.01% averaged across both tests for the pippa calibration and about 0.1% for the wiki calibration. Wiki 4k was also measured with 4x the row count, which likely accounts for part of the variance.
  • Relative to their 8-bit counterparts, the 2k 5-bit pippa quant degrades substantially more than the 4k 5-bit pippa quant. Larger calibration lengths (or possibly just more rows) are likely strongly preferable for aggressive quants.
  • Dropping from 8 to 5 bits results in a 4% increase in wiki perplexity for the pippa-calibrated quants, but only around 0.2% for the wiki-calibrated ones.
  • Both calibrations increase by less than 0.1% going from 8 to 5 bits when tested against pippa chat.

Caveats:

  • Real world usage may produce more visible differences than a hundredth of a percent on a small test
  • I have not tested using more calibration rows versus using a greater row length
  • I have not tested increasing the row count for the final quantization pass while reusing the same measurement
  • It's unclear if or how Δp should be used

With this rather superficial methodology, wikitext with 4k settings seems like the safest general-purpose quant in terms of the speed/quality tradeoff. However, real world usage would probably favor picking calibration datasets more closely related to the tune; all pippa-calibrated quants performed somewhat better on the pippa test than their wiki-calibrated counterparts.

All quantizations were calibrated with wikitext-2 unless otherwise specified.

You can run a model calibrated at 2k with a 4k context or vice versa. The actual difference between 2k and 4k calibrations appears to be very small.

VRAM estimates were taken with an extremely long chatlog in the oobabooga webui on a 7900 XTX, using nvtop to monitor PyTorch usage only. Systems with lots of extra background processes may use more. Additionally, NVIDIA-based systems with flash attention 2 will use less VRAM than estimated here.

The measurement files are provided in the main branch so you can make your own quants at other bit depths without going through the 2-3 hours of measuring.
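
As a sketch (assuming convert.py accepts an existing measurement via -m; filenames and paths here are placeholders), reusing the provided measurement file to build, say, a 6-bit quant could look like:

```bash
# Sketch only: skip the measurement pass by reusing a provided measurement file.
# Paths, filenames, and flag spellings are illustrative.
python convert.py \
    -i /models/MythoMax-L2-13b \
    -o /tmp/exl2-work \
    -cf /models/MythoMax-L2-13b-exl2-4k_hb6_b6 \
    -m measurement.json \
    -c wikitext-train.parquet \
    -l 4096 \
    -b 6.0 \
    -hb 6
```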
