Higher PPL than Mixtral?

#11
by Thireus - opened

I ran a PPL eval on wikitext and noticed that the PPL is much higher than that of the original Mixtral model.

  • LoneStriker_dolphin-2.5-mixtral-8x7b-6.0bpw-h6-exl2-2 - 4.464363098144531
  • turboderp_Mixtral-8x7B-instruct-exl2_8.0bpw - 3.7087724208831774

I was wondering if this is expected.

For ref, 70b dolphin models give me PPLs just below 4:

  • LoneStriker_dolphin-2.2-70b-6.0bpw-h6-exl2-2 - 3.965563297271729
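
For anyone who wants to reproduce a comparable number, here is a minimal sliding-window perplexity sketch over wikitext-2 using transformers. The repo id, 2048-token window, and 512-token stride are assumptions, and the absolute values will not match exllamav2's own eval script exactly, but it is enough to compare checkpoints under identical settings.

```python
# Minimal sliding-window perplexity on wikitext-2 (transformers, fp16).
# Assumptions: repo id, 2048-token window, 512-token stride.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "cognitivecomputations/dolphin-2.5-mixtral-8x7b"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
).eval()

test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
ids = tokenizer("\n\n".join(test["text"]), return_tensors="pt").input_ids

window, stride = 2048, 512
nlls, prev_end = [], 0
for start in range(0, ids.size(1), stride):
    end = min(start + window, ids.size(1))
    target_len = end - prev_end                  # score only tokens not seen before
    chunk = ids[:, start:end].to(model.device)
    labels = chunk.clone()
    labels[:, :-target_len] = -100               # mask the overlapping context
    with torch.no_grad():
        nlls.append(model(chunk, labels=labels).loss * target_len)
    prev_end = end
    if end == ids.size(1):
        break

print(f"PPL: {torch.exp(torch.stack(nlls).sum() / prev_end).item():.4f}")
```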

You're comparing 6.0bpw to 8.0bpw, so yes, it's expected that the 6.0bpw quant will have somewhat higher perplexity. With exl2, quality can also vary depending on the calibration dataset used during quantization.

@HiroseKoichi, it's a 0.76 PPL jump.

If someone can share the PPL of the non-quantized version, I'd be interested to see how far it is from the original Mixtral model.

Taking a look at Turboderp's page, it looks like your test is the outlier here and dolphin is right in line with the expected numbers: https://huggingface.co/turboderp/Mixtral-8x7B-instruct-exl2
[Screenshot: perplexity table from turboderp/Mixtral-8x7B-instruct-exl2]
I'm by no means an expert on exl2 quantization, but wikitext is a popular calibration dataset for exl2 quants, which could explain why the perplexity is much lower for Mixtral-Instruct. Try running it through a different dataset.
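
If you want to rule out wikitext favouring the Instruct quant, one quick check is to rerun the same evaluation loop on a corpus that is unlikely to have been used for calibration. A possible swap, assuming the sketch above and a small slice of C4:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Assumption: reuse the sliding-window loop from the sketch above; only the
# test corpus changes. Stream a small slice of C4 so the run stays tractable.
tokenizer = AutoTokenizer.from_pretrained("cognitivecomputations/dolphin-2.5-mixtral-8x7b")
c4 = load_dataset("allenai/c4", "en", split="validation", streaming=True)
text = "\n\n".join(row["text"] for _, row in zip(range(2000), c4))
ids = tokenizer(text, return_tensors="pt").input_ids
```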
