some benchmark results for ZebraLogic

#10
by khanh2023 - opened

I tried to benchmark on a subset of ZebraLogic consists of all 3x3 and 4x4 examples. In total, there are 80 data points. I used the recommended sampling parameters from the technical report. Moreover, I also set the maximum completion tokens to be 40k which is plenty for ZebraLogic. On average, VibeThinker-3B use 7000 tokens for each question including refining json output through iteratively asking it to output the correct format.

In total 80 examples, the bf16 version output correctly 54 times first try. There are 3 examples where it cannot output the correct json format after 5 retries, even if we give it 3 of them correct, the accuracy of VibeThinker-3B on reach 85%, similar to Qwen3.5 4B q4

Below is the benchmark numbers for some of the models I have access to.

Screenshot 2026-06-19 at 12.21.12

WeiboAI org

Thanks for running this benchmark β€” this is very useful.

One thing I’d be curious about is how much of the result comes from the mixed_4_6 quantization. VibeThinker-3B seems quite sensitive to quantization, especially on tasks that require both long reasoning and strict structured output like JSON.

Would you be willing to also test the unquantized HF version if possible? That would help us understand whether the JSON failures are mainly from the base model behavior, the quantization, or the inference/template setup.

Either way, the self-correction after parse errors is an interesting signal. Thanks again for sharing the numbers.

I have updated the results for both mixed_4_6 quantization and bf16, VibeThinker-3B is no where near frontier level in logical reasoning and even the full precision model doesn't surpass Qwen3.5 q4

Sign up or log in to comment