πŸΊπŸ¦β€β¬› LLM Comparison/Test: Phi-4, Qwen2 VL 72B Instruct, Aya Expanse 32B in my updated MMLU-Pro CS benchmark

Community Article · Published January 10, 2025

Top Models @ 2025-01-10

Introduction

I've updated my MMLU-Pro Computer Science LLM benchmark results with new data from recently tested models: three Phi-4 variants (Microsoft's official weights, plus Unsloth's fixed HF and GGUF versions), Qwen2 VL 72B Instruct, and Aya Expanse 32B.

While adding these new models, I streamlined the graph by removing all QwQ-32B-Preview variants except the main model, which consistently showed superior performance. I also incorporated the results of a third evaluation run for claude-3-5-sonnet-20241022, gemini-1.5-pro-002, and Athene-V2-Chat, which shifted Athene's rank downward due to fluctuations in its scores.

Here's an additional visualization that represents each model as a 3D bar: the height shows the MMLU-Pro score (%), the depth represents the model size in billions of parameters, and for quantized models the bar is split into a full-color front section proportional to the quantized size and a lighter-colored back section showing the memory saved compared to running the model at full precision (16-bit):

Top Models (3D) @ 2025-01-10
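To make the split-bar idea concrete, here's a minimal matplotlib sketch of how such a chart can be drawn. It is not the script that produced the image above, and the model names and numbers in it are placeholders:

```python
# Minimal sketch of the split 3D bars: height = score, depth = model size,
# front section = memory actually used by the quantized weights,
# back section = memory saved versus full precision (16-bit).
# All values below are made-up placeholders for illustration.
import matplotlib.pyplot as plt

models = ["Model A (4.65bpw)", "Model B (8.0bpw)", "Model C (16-bit)"]
scores = [79.5, 69.0, 82.9]              # benchmark score in % -> bar height
params_b = [72, 32, 14]                  # parameter count in billions -> total bar depth
quant_frac = [4.65 / 16, 8.0 / 16, 1.0]  # fraction of full-precision memory actually used

fig = plt.figure()
ax = fig.add_subplot(projection="3d")

for i, (score, size, frac) in enumerate(zip(scores, params_b, quant_frac)):
    used = size * frac       # depth of the full-color front section
    saved = size - used      # lighter back section = savings vs. 16-bit
    ax.bar3d(i, 0, 0, 0.6, used, score, color="tab:blue", shade=True)
    if saved > 0:
        ax.bar3d(i, used, 0, 0.6, saved, score, color="lightblue", shade=True)

ax.set_xticks(range(len(models)))
ax.set_xticklabels(models, rotation=45, ha="right")
ax.set_ylabel("Size (B params)")
ax.set_zlabel("Score (%)")
plt.tight_layout()
plt.show()
```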

New Models Tested

  1. Phi-4:

    • Unsloth's fixed Transformers implementation showed minimal performance differences in benchmarks, with the GGUF version achieving marginally higher accuracy. Further testing would be needed to determine whether this improvement represents a statistically significant trend or random variation.
    • Temperature settings had a notable impact on performance: at 0, responses were consistent but repetitive, while at 1, outputs became erratic and unpredictable (see the request sketch after this list).
    • German language performance improved substantially compared to previous versions. Although it's a small model and occasionally produces overly literal translations, the overall quality of its German output is good enough for most purposes.
    • Censorship can be completely circumvented through basic prompt engineering techniques.
    • I'm still undecided on how good Phi-4 is for general-purpose tasks in real use, but I generally recommend running a bigger, better model if you can.
  2. Qwen2 VL 72B Instruct:

    • Given its relatively low scores and reliance on the older Qwen2 series rather than the superior 2.5 architecture, I look forward to the release of a Qwen2.5 VL 72B model.
  3. Aya Expanse 32B:

    • While this model shows the lowest score on the graph, keep in mind that I only included models scoring above 50%. There are other, worse models that fell below this threshold and didn't make it onto the visualization at all.
    • Its main advantage is its support for 23 languages, making it a solid choice when you need those multilingual capabilities and have no better alternative. Of course, if your target language is supported by a better model, use that instead.
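Regarding the temperature note under Phi-4 above, here's a hypothetical request sketch for comparing temperature settings against a local OpenAI-compatible endpoint (both llama.cpp's server and TabbyAPI expose one). The URL, port, and model name are placeholders; adjust them to your own setup:

```python
# Hypothetical sketch: same prompt at temperature 0 and 1 against a local
# OpenAI-compatible server (e.g. llama.cpp's llama-server or TabbyAPI).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed-locally")

question = "Explain the difference between a process and a thread in two sentences."

for temperature in (0.0, 1.0):
    response = client.chat.completions.create(
        model="phi-4",  # placeholder model name
        messages=[{"role": "user", "content": question}],
        temperature=temperature,
        max_tokens=200,
    )
    print(f"--- temperature={temperature} ---")
    print(response.choices[0].message.content)
```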

About the Benchmark

The MMLU-Pro benchmark is a comprehensive evaluation of large language models across various categories, including computer science, mathematics, physics, chemistry, and more. It's designed to assess a model's ability to understand and apply knowledge across a wide range of subjects, providing a broad measure of general knowledge and reasoning ability. It is still a multiple-choice test, but instead of the 4 answer options of its predecessor MMLU, there are now 10 options per question, which drastically reduces the probability of getting answers right by chance. Additionally, the focus is increasingly on complex reasoning tasks rather than pure factual knowledge.
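As a quick back-of-the-envelope illustration of what "by chance" means here (my own calculation, not part of the benchmark): with 4 options, pure guessing yields an expected accuracy of 25%, with 10 options only 10%, which on a 410-question subset is roughly 41 correct answers:

```python
# Chance baseline for a 410-question multiple-choice test (pure guessing).
from math import sqrt

n = 410
for options in (4, 10):                # 4 options: MMLU, 10 options: MMLU-Pro
    p = 1 / options
    expected = n * p                   # expected number of correct answers
    sigma = sqrt(n * p * (1 - p))      # binomial standard deviation
    print(f"{options} options: expected {expected:.1f}/410 ({100 * p:.0f}%), "
          f"std dev {sigma:.1f} questions")
```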

For my benchmarks, I currently limit myself to the Computer Science category with its 410 questions. This pragmatic decision is based on several factors: First, computer science is the domain closest to my daily work, where I use these models most, so responses in this area carry particular weight for me. Second, with local models running on consumer hardware, there are practical constraints on computation time: a single run already takes several hours with larger models, and I generally conduct at least two runs per model to ensure consistency.

Unlike typical benchmarks that only report single scores, I conduct multiple test runs for each model to capture performance variability. Running each benchmark at least twice per model gives a more accurate and nuanced picture of both its performance level and its consistency. The results therefore include error bars showing the standard deviation across runs.
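For reference, here's a minimal sketch of how such error bars are derived, using the three claude-3-5-sonnet-20241022 runs from the table below as input. I'm assuming the sample standard deviation here; the actual plotting script may differ in detail:

```python
# Mean and sample standard deviation over multiple benchmark runs,
# which is what the error bars in the chart represent.
from statistics import mean, stdev

runs = [82.93, 82.44, 82.20]   # % correct in the three claude-3-5-sonnet-20241022 runs
print(f"mean = {mean(runs):.2f}%, std dev = {stdev(runs):.2f} percentage points")
```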

The benchmarks for this study alone required over 103 hours of runtime. With additional categories or runs, testing on the available hardware would have taken so long that the evaluated models would have been outdated by the time the results were complete. Setting practical boundaries is therefore essential to produce meaningful results within a reasonable timeframe.

Detailed Results

Here's the complete table, including results from previous reports:

| Model | HF Main Model Name | HF Draft Model Name (speculative decoding) | Size | Format | API | GPU | GPU Mem | Run | Duration | Total | % | Correct Random Guesses | Prompt tokens | tk/s | Completion tokens | tk/s |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| claude-3-5-sonnet-20241022 | - | - | - | - | Anthropic | - | - | 1/3 | 31m 50s | 340/410 | 82.93% | | 694458 | 362.78 | 97438 | 50.90 |
| claude-3-5-sonnet-20241022 | - | - | - | - | Anthropic | - | - | 2/3 | 31m 39s | 338/410 | 82.44% | | 694458 | 364.82 | 97314 | 51.12 |
| 🆕 claude-3-5-sonnet-20241022 | - | - | - | - | Anthropic | - | - | 3/3 | 28m 56s | 337/410 | 82.20% | | 867478 | 498.45 | 84785 | 48.72 |
| gemini-1.5-pro-002 | - | - | - | - | Gemini | - | - | 1/3 | 31m 7s | 335/410 | 81.71% | | 648675 | 346.82 | 78311 | 41.87 |
| 🆕 gemini-1.5-pro-002 | - | - | - | - | Gemini | - | - | 2/3 | 29m 52s | 333/410 | 81.22% | | 648675 | 361.38 | 77030 | 42.91 |
| gemini-1.5-pro-002 | - | - | - | - | Gemini | - | - | 3/3 | 30m 40s | 327/410 | 79.76% | | 648675 | 351.73 | 76063 | 41.24 |
| QwQ-32B-Preview (8.0bpw EXL2, max_tokens=16384) | bartowski/QwQ-32B-Preview-exl2_8_0 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 38436MiB | 1/2 | 2h 3m 30s | 325/410 | 79.27% | 0/2, 0.00% | 656716 | 88.58 | 327825 | 44.22 |
| QwQ-32B-Preview (8.0bpw EXL2, max_tokens=16384) | bartowski/QwQ-32B-Preview-exl2_8_0 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 38436MiB | 2/2 | 2h 3m 35s | 324/410 | 79.02% | | 656716 | 88.52 | 343440 | 46.29 |
| Athene-V2-Chat (72B, 4.65bpw EXL2, Q4 cache) | wolfram/Athene-V2-Chat-4.65bpw-h6-exl2 | - | 72B | EXL2 | TabbyAPI | RTX 6000 | 44496MiB | 1/3 | 2h 13m 5s | 326/410 | 79.51% | | 656716 | 82.21 | 142256 | 17.81 |
| Athene-V2-Chat (72B, 4.65bpw EXL2, Q4 cache) | wolfram/Athene-V2-Chat-4.65bpw-h6-exl2 | - | 72B | EXL2 | TabbyAPI | RTX 6000 | 44496MiB | 2/3 | 2h 14m 53s | 317/410 | 77.32% | | 656716 | 81.11 | 143659 | 17.74 |
| 🆕 Athene-V2-Chat (72B, 4.65bpw EXL2, Q4 cache) | wolfram/Athene-V2-Chat-4.65bpw-h6-exl2 | - | 72B | EXL2 | TabbyAPI | RTX 6000 | 44496MiB | 3/3 | 1h 49m 40s | 312/410 | 76.10% | | 805136 | 122.30 | 115284 | 17.51 |
| Qwen2.5-72B-Instruct (4.65bpw EXL2, Q4 cache) | LoneStriker/Qwen2.5-72B-Instruct-4.65bpw-h6-exl2 | - | 72B | EXL2 | TabbyAPI | 2x RTX 3090 | 41150MiB | 1/2 | 3h 7m 58s | 320/410 | 78.05% | | 656716 | 58.21 | 139499 | 12.36 |
| Qwen2.5-72B-Instruct (4.65bpw EXL2, Q4 cache) | LoneStriker/Qwen2.5-72B-Instruct-4.65bpw-h6-exl2 | - | 72B | EXL2 | TabbyAPI | 2x RTX 3090 | 41150MiB | 2/2 | 3h 5m 19s | 319/410 | 77.80% | | 656716 | 59.04 | 138135 | 12.42 |
| DeepSeek-V3 | deepseek-ai/DeepSeek-V3 | - | 671B | HF | DeepSeek | - | - | 1/4 | 20m 22s | 320/410 | 78.05% | | 628029 | 512.38 | 66807 | 54.50 |
| DeepSeek-V3 | deepseek-ai/DeepSeek-V3 | - | 671B | HF | DeepSeek | - | - | 2/4 | 27m 43s | 320/410 | 78.05% | | 628029 | 376.59 | 66874 | 40.10 |
| DeepSeek-V3 | deepseek-ai/DeepSeek-V3 | - | 671B | HF | DeepSeek | - | - | 3/4 | 19m 45s | 319/410 | 77.80% | | 628029 | 528.39 | 64470 | 54.24 |
| DeepSeek-V3 | deepseek-ai/DeepSeek-V3 | - | 671B | HF | DeepSeek | - | - | 4/4 | 19m 45s | 319/410 | 77.80% | | 628029 | 375.73 | 69531 | 41.60 |
| gpt-4o-2024-08-06 | - | - | - | - | OpenAI | - | - | 1/2 | 34m 54s | 320/410 | 78.05% | 1/2, 50.00% | 631448 | 300.79 | 99103 | 47.21 |
| gpt-4o-2024-08-06 | - | - | - | - | OpenAI | - | - | 2/2 | 42m 41s | 316/410 | 77.07% | 1/3, 33.33% | 631448 | 246.02 | 98466 | 38.36 |
| mistral-large-2407 (123B) | mistralai/Mistral-Large-Instruct-2407 | - | 123B | HF | Mistral | - | - | 1/2 | 40m 23s | 310/410 | 75.61% | | 696798 | 287.13 | 79444 | 32.74 |
| mistral-large-2407 (123B) | mistralai/Mistral-Large-Instruct-2407 | - | 123B | HF | Mistral | - | - | 2/2 | 46m 55s | 308/410 | 75.12% | 0/1, 0.00% | 696798 | 247.21 | 75971 | 26.95 |
| Llama-3.1-405B-Instruct-FP8 | meta-llama/Llama-3.1-405B-Instruct-FP8 | - | 405B | HF | IONOS | - | - | 1/2 | 2h 5m 28s | 311/410 | 75.85% | | 648580 | 86.11 | 79191 | 10.51 |
| Llama-3.1-405B-Instruct-FP8 | meta-llama/Llama-3.1-405B-Instruct-FP8 | - | 405B | HF | IONOS | - | - | 2/2 | 2h 10m 19s | 307/410 | 74.88% | | 648580 | 82.90 | 79648 | 10.18 |
| mistral-large-2411 (123B) | mistralai/Mistral-Large-Instruct-2411 | - | 123B | HF | Mistral | - | - | 1/2 | 41m 46s | 302/410 | 73.66% | 1/3, 33.33% | 696798 | 277.70 | 82028 | 32.69 |
| mistral-large-2411 (123B) | mistralai/Mistral-Large-Instruct-2411 | - | 123B | HF | Mistral | - | - | 2/2 | 32m 47s | 300/410 | 73.17% | 0/1, 0.00% | 696798 | 353.53 | 77998 | 39.57 |
| chatgpt-4o-latest @ 2024-11-18 | - | - | - | - | OpenAI | - | - | 1/2 | 28m 17s | 302/410 | 73.66% | 2/4, 50.00% | 631448 | 371.33 | 146558 | 86.18 |
| chatgpt-4o-latest @ 2024-11-18 | - | - | - | - | OpenAI | - | - | 2/2 | 28m 31s | 298/410 | 72.68% | 2/2, 100.00% | 631448 | 368.19 | 146782 | 85.59 |
| gpt-4o-2024-11-20 | - | - | - | - | OpenAI | - | - | 1/2 | 25m 35s | 296/410 | 72.20% | 1/7, 14.29% | 631448 | 410.38 | 158694 | 103.14 |
| gpt-4o-2024-11-20 | - | - | - | - | OpenAI | - | - | 2/2 | 26m 10s | 294/410 | 71.71% | 1/7, 14.29% | 631448 | 400.95 | 160378 | 101.84 |
| Llama-3.3-70B-Instruct (4.0bpw EXL2) | LoneStriker/Llama-3.3-70B-Instruct-4.0bpw-h6-exl2 | - | 70B | EXL2 | TabbyAPI | RTX 6000 | 47148MiB | 1/2 | 2h 2m 33s | 293/410 | 71.46% | | 648580 | 88.15 | 87107 | 11.84 |
| Llama-3.3-70B-Instruct (4.0bpw EXL2) | LoneStriker/Llama-3.3-70B-Instruct-4.0bpw-h6-exl2 | - | 70B | EXL2 | TabbyAPI | RTX 6000 | 47148MiB | 2/2 | 1h 33m 59s | 293/410 | 71.46% | | 534360 | 94.70 | 89510 | 15.86 |
| Llama-3.1-70B-Instruct | meta-llama/Llama-3.1-70B-Instruct | - | 70B | HF | IONOS | - | - | 1/2 | 41m 12s | 291/410 | 70.98% | 3/12, 25.00% | 648580 | 261.88 | 102559 | 41.41 |
| Llama-3.1-70B-Instruct | meta-llama/Llama-3.1-70B-Instruct | - | 70B | HF | IONOS | - | - | 2/2 | 39m 48s | 287/410 | 70.00% | 3/14, 21.43% | 648580 | 271.12 | 106644 | 44.58 |
| Llama-3.1-Nemotron-70B-Instruct (4.25bpw EXL2) | bartowski/Llama-3.1-Nemotron-70B-Instruct-HF-exl2_4_25 | - | 70B | EXL2 | TabbyAPI | RTX 6000 | 40104MiB | 1/2 | 2h 13m 3s | 290/410 | 70.73% | | 640380 | 80.18 | 157235 | 19.69 |
| Llama-3.1-Nemotron-70B-Instruct (4.25bpw EXL2) | bartowski/Llama-3.1-Nemotron-70B-Instruct-HF-exl2_4_25 | - | 70B | EXL2 | TabbyAPI | RTX 6000 | 40104MiB | 2/2 | 2h 13m 15s | 287/410 | 70.00% | 0/1, 0.00% | 640380 | 80.07 | 157471 | 19.69 |
| QVQ-72B-Preview (4.65bpw EXL2, max_tokens=16384) | wolfram/QVQ-72B-Preview-4.65bpw-h6-exl2 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 72B | EXL2 | TabbyAPI | RTX 6000 | 46260MiB | 1/2 | 3h 43m 12s | 290/410 | 70.73% | 1/3, 33.33% | 656716 | 49.02 | 441187 | 32.93 |
| QVQ-72B-Preview (4.65bpw EXL2, max_tokens=16384) | wolfram/QVQ-72B-Preview-4.65bpw-h6-exl2 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 72B | EXL2 | TabbyAPI | RTX 6000 | 46260MiB | 2/2 | 3h 47m 29s | 284/410 | 69.27% | 0/2, 0.00% | 656716 | 48.10 | 450363 | 32.99 |
| gemini-1.5-flash-002 | - | - | - | - | Gemini | - | - | 1/2 | 13m 19s | 288/410 | 70.24% | 1/6, 16.67% | 648675 | 808.52 | 80535 | 100.38 |
| gemini-1.5-flash-002 | - | - | - | - | Gemini | - | - | 2/2 | 22m 30s | 285/410 | 69.51% | 2/7, 28.57% | 648675 | 479.42 | 80221 | 59.29 |
| Llama-3.2-90B-Vision-Instruct | meta-llama/Llama-3.2-90B-Vision-Instruct | - | 90B | HF | Azure | - | - | 1/2 | 33m 6s | 289/410 | 70.49% | 4/7, 57.14% | 640380 | 321.96 | 88997 | 44.74 |
| Llama-3.2-90B-Vision-Instruct | meta-llama/Llama-3.2-90B-Vision-Instruct | - | 90B | HF | Azure | - | - | 2/2 | 31m 31s | 281/410 | 68.54% | 2/5, 40.00% | 640380 | 338.10 | 85381 | 45.08 |
| Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) | bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 | Qwen/Qwen2.5-Coder-3B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 45880MiB | 1/7 | 41m 59s | 289/410 | 70.49% | | 656716 | 260.29 | 92126 | 36.51 |
| Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) | bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 40036MiB | 2/7 | 34m 24s | 286/410 | 69.76% | | 656716 | 317.48 | 89487 | 43.26 |
| Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) | bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 | Qwen/Qwen2.5-Coder-3B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 45880MiB | 3/7 | 41m 27s | 283/410 | 69.02% | 0/1, 0.00% | 656716 | 263.62 | 90349 | 36.27 |
| Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) | bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 | bartowski/Qwen2.5-Coder-7B-Instruct-exl2_8_0 | 32B | EXL2 | TabbyAPI | RTX 6000 | 43688MiB | 4/7 | 42m 32s | 283/410 | 69.02% | 0/1, 0.00% | 656716 | 256.77 | 90899 | 35.54 |
| Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) | bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 | bartowski/Qwen2.5-Coder-7B-Instruct-exl2_8_0 | 32B | EXL2 | TabbyAPI | RTX 6000 | 43688MiB | 5/7 | 44m 34s | 282/410 | 68.78% | 0/1, 0.00% | 656716 | 245.24 | 96470 | 36.03 |
| Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) | bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 | - | 32B | EXL2 | TabbyAPI | RTX 6000 | 38620MiB | 6/7 | 1h 2m 8s | 282/410 | 68.78% | | 656716 | 175.98 | 92767 | 24.86 |
| Qwen2.5-Coder-32B-Instruct (8.0bpw EXL2) | bartowski/Qwen2.5-Coder-32B-Instruct-exl2_8_0 | Qwen/Qwen2.5-Coder-0.5B-Instruct | 32B | EXL2 | TabbyAPI | RTX 6000 | 40036MiB | 7/7 | 34m 56s | 280/410 | 68.29% | | 656716 | 312.66 | 91926 | 43.76 |
| Mistral-Large-Instruct-2411 (123B, 3.0bpw EXL2) | MikeRoz/mistralai_Mistral-Large-Instruct-2411-3.0bpw-h6-exl2 | - | 123B | EXL2 | TabbyAPI | RTX 6000 | 47068MiB | 1/2 | 1h 26m 26s | 284/410 | 69.27% | 1/3, 33.33% | 696798 | 134.23 | 79925 | 15.40 |
| Mistral-Large-Instruct-2411 (123B, 3.0bpw EXL2) | MikeRoz/mistralai_Mistral-Large-Instruct-2411-3.0bpw-h6-exl2 | - | 123B | EXL2 | TabbyAPI | RTX 6000 | 47068MiB | 2/2 | 1h 26m 10s | 275/410 | 67.07% | 0/2, 0.00% | 696798 | 134.67 | 79778 | 15.42 |
| 🆕 Phi-4 (14B, Unsloth, GGUF) | unsloth/phi-4-GGUF | - | 14B | GGUF | llama.cpp | RTX 6000 | 31978MiB | 1/2 | 1h 19m 51s | 278/410 | 67.80% | 1/6, 16.67% | 639591 | 133.40 | 133610 | 27.87 |
| 🆕 Phi-4 (14B, Unsloth, GGUF) | unsloth/phi-4-GGUF | - | 14B | GGUF | llama.cpp | RTX 6000 | 31978MiB | 2/2 | 1h 19m 41s | 278/410 | 67.80% | 1/6, 16.67% | 639591 | 133.67 | 133610 | 27.92 |
| 🆕 Phi-4 (14B, Unsloth, HF) | unsloth/phi-4 | - | 14B | HF | TabbyAPI | RTX 6000 | | 1/2 | 1h 38m 29s | 274/410 | 66.83% | 1/3, 33.33% | 635081 | 107.42 | 113731 | 19.24 |
| 🆕 Phi-4 (14B, Unsloth, HF) | unsloth/phi-4 | - | 14B | HF | TabbyAPI | RTX 6000 | | 2/2 | 1h 39m 32s | 273/410 | 66.59% | 1/3, 33.33% | 635081 | 106.29 | 113712 | 19.03 |
| 🆕 Phi-4 (14B, Microsoft, HF) | microsoft/phi-4 | - | 14B | HF | TabbyAPI | RTX 6000 | 31394MiB | 1/2 | 1h 7m 44s | 272/410 | 66.34% | 1/3, 33.33% | 635081 | 156.15 | 113358 | 27.87 |
| 🆕 Phi-4 (14B, Microsoft, HF) | microsoft/phi-4 | - | 14B | HF | TabbyAPI | RTX 6000 | 31394MiB | 2/2 | 1h 7m 44s | 271/410 | 66.10% | 1/3, 33.33% | 635081 | 156.10 | 113384 | 27.87 |
| Mistral-Large-Instruct-2407 (123B, 2.75bpw EXL2) | turboderp/Mistral-Large-Instruct-2407-123B-exl2_2.75bpw | - | 123B | EXL2 | TabbyAPI | RTX 6000 | 45096MiB | 1/2 | 1h 8m 8s | 271/410 | 66.10% | | 696798 | 170.29 | 66670 | 16.29 |
| Mistral-Large-Instruct-2407 (123B, 2.75bpw EXL2) | turboderp/Mistral-Large-Instruct-2407-123B-exl2_2.75bpw | - | 123B | EXL2 | TabbyAPI | RTX 6000 | 45096MiB | 2/2 | 1h 10m 38s | 268/410 | 65.37% | 1/3, 33.33% | 696798 | 164.23 | 69182 | 16.31 |
| 🆕 Qwen2-VL-72B-Instruct (4.5bpw EXL2) | turboderp/Qwen2-VL-72B-Instruct-exl2_4.5bpw | - | 72B | EXL2 | TabbyAPI | RTX 6000 | 43554MiB | 1/2 | 1h 10m 51s | 255/410 | 62.20% | 30/3, 0.00% | 656716 | 154.36 | 71752 | 16.87 |
| 🆕 Qwen2-VL-72B-Instruct (4.5bpw EXL2) | turboderp/Qwen2-VL-72B-Instruct-exl2_4.5bpw | - | 72B | EXL2 | TabbyAPI | RTX 6000 | 43554MiB | 2/2 | 1h 26m 40s | 255/410 | 62.20% | 1/6, 16.67% | 656716 | 126.20 | 88249 | 16.96 |
| Mistral-Large-Instruct-2411 (123B, 2.75bpw EXL2) | wolfram/Mistral-Large-Instruct-2411-2.75bpw-h6-exl2 | - | 123B | EXL2 | TabbyAPI | RTX 6000 | 45096MiB | 1/2 | 1h 11m 50s | 267/410 | 65.12% | 1/4, 25.00% | 696798 | 161.53 | 70538 | 16.35 |
| Mistral-Large-Instruct-2411 (123B, 2.75bpw EXL2) | wolfram/Mistral-Large-Instruct-2411-2.75bpw-h6-exl2 | - | 123B | EXL2 | TabbyAPI | RTX 6000 | 45096MiB | 2/2 | 1h 13m 50s | 243/410 | 59.27% | 0/4, 0.00% | 696798 | 157.18 | 72718 | 16.40 |
| Falcon3-10B-Instruct | tiiuae/Falcon3-10B-Instruct | - | 10B | HF | Ollama | RTX 6000 | 20906MiB | 1/2 | 35m 15s | 251/410 | 61.22% | 2/5, 40.00% | 702578 | 331.57 | 75501 | 35.63 |
| Falcon3-10B-Instruct | tiiuae/Falcon3-10B-Instruct | - | 10B | HF | Ollama | RTX 6000 | 20906MiB | 2/2 | 35m 21s | 251/410 | 61.22% | 2/5, 40.00% | 702578 | 330.66 | 75501 | 35.53 |
| mistral-small-2409 (22B) | mistralai/Mistral-Small-Instruct-2409 | - | 22B | HF | Mistral | - | - | 1/2 | 25m 3s | 243/410 | 59.27% | 1/4, 25.00% | 696798 | 462.38 | 73212 | 48.58 |
| mistral-small-2409 (22B) | mistralai/Mistral-Small-Instruct-2409 | - | 22B | HF | Mistral | - | - | 2/2 | 20m 45s | 239/410 | 58.29% | 1/4, 25.00% | 696798 | 558.10 | 76017 | 60.89 |
| 🆕 Aya-Expanse-32B (8.0bpw EXL2) | lucyknada/CohereForAI_aya-expanse-32b-exl2_8.0bpw | - | 32B | EXL2 | TabbyAPI | RTX 6000 | 33686MiB | 1/2 | 43m 18s | 212/410 | 51.71% | 0/1, 0.00% | 661930 | 254.04 | 60728 | 23.31 |
| 🆕 Aya-Expanse-32B (8.0bpw EXL2) | lucyknada/CohereForAI_aya-expanse-32b-exl2_8.0bpw | - | 32B | EXL2 | TabbyAPI | RTX 6000 | 33686MiB | 2/2 | 42m 27s | 211/410 | 51.46% | 0/4, 0.00% | 661930 | 259.50 | 59557 | 23.35 |
  • Model: Model name (with relevant parameter and setting details)
  • HF Main Model Name: Full name of the tested model as listed on Hugging Face
  • HF Draft Model Name (speculative decoding): Draft model used for speculative decoding (if applicable)
  • Size: Parameter count
  • Format: Model format type (HF, EXL2, etc.)
  • API: Service provider (TabbyAPI indicates local deployment)
  • GPU: Graphics card used for this benchmark run
  • GPU Mem: VRAM allocated to model and configuration
  • Run: Benchmark run sequence number
  • Duration: Total runtime of benchmark
  • Total: Number of correct answers (determines ranking!)
  • %: Percentage of correct answers
  • Correct Random Guesses: When MMLU-Pro cannot definitively identify a model's answer choice, it falls back to a random guess and reports both the number of these random guesses and their accuracy; a high proportion of random guessing indicates problems with following the response format (see the sketch after this list)
  • Prompt tokens: Token count of input text
  • tk/s: Prompt tokens processed per second
  • Completion tokens: Token count of generated response
  • tk/s: Completion tokens generated per second
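To illustrate the random-guess fallback behind the "Correct Random Guesses" column, here's a simplified sketch of the kind of answer extraction the MMLU-Pro evaluation script performs. The real script uses more elaborate regex patterns, so treat this purely as an illustration:

```python
# Simplified sketch of MMLU-Pro-style answer extraction with a random-guess fallback.
import random
import re

def extract_choice(response: str, num_options: int = 10) -> tuple[str, bool]:
    """Return (choice_letter, was_random_guess)."""
    letters = "ABCDEFGHIJ"[:num_options]
    # Look for an explicit statement like "The answer is (C)" or "answer: C".
    match = re.search(rf"answer\s*(?:is)?\s*:?\s*\(?([{letters}])\)?",
                      response, re.IGNORECASE)
    if match:
        return match.group(1).upper(), False
    # No parsable answer: fall back to a random guess. These fallbacks are
    # what the "Correct Random Guesses" column counts and scores.
    return random.choice(letters), True

print(extract_choice("Let's think step by step... The answer is (D)."))  # ('D', False)
print(extract_choice("I am not sure."))  # random letter, True
```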

Wolfram Ravenwolf is a German AI Engineer and an internationally active consultant and renowned researcher who's particularly passionate about local language models. You can follow him on X and Bluesky, read his previous LLM tests and comparisons on HF and Reddit, check out his models on Hugging Face, tip him on Ko-fi, or book him for a consultation.