AIDC-AI/Marco-o1 · explain why the performance differs between the En and Zh MGSM

We believe the performance gap between Zh and En comes from a limitation of our current reward design. In the current work, MCTS scores reasoning paths using the model's own output probabilities, and these may be strongly influenced by the model's higher baseline performance in English.
Note that our MCTS results are not a sampling metric like Pass@K. We report only the single reasoning path that MCTS ranks as optimal, so the result depends heavily on the reward. In fact, we have observed that other paths found during MCTS sometimes contain better outputs.
We expect this issue to be resolved later, once MCTS is combined with a reward model.
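
The difference between reporting only the top-reward path and a Pass@K-style metric can be sketched in a few lines of Python. This is a minimal illustration, not the actual Marco-o1 code: the `Path` class, the function names, the log-probability values, and the choice of mean token probability as the reward are all assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Path:
    """One candidate reasoning path (hypothetical structure, for illustration)."""
    answer: str
    token_logprobs: list[float]


def confidence_reward(path: Path) -> float:
    """Assumed reward: mean token probability along the path."""
    probs = [math.exp(lp) for lp in path.token_logprobs]
    return sum(probs) / len(probs)


def best_path_correct(paths: list[Path], gold: str) -> bool:
    """Best-path evaluation: only the single highest-reward path counts."""
    best = max(paths, key=confidence_reward)
    return best.answer == gold


def any_path_correct(paths: list[Path], gold: str) -> bool:
    """Pass@K-style evaluation: success if any explored path is correct."""
    return any(p.answer == gold for p in paths)


# Made-up numbers: the wrong path is more "confident" than the correct one.
paths = [
    Path(answer="17", token_logprobs=[-0.05, -0.10, -0.02]),
    Path(answer="18", token_logprobs=[-0.40, -0.35, -0.50]),
]
gold = "18"

print(best_path_correct(paths, gold))  # False: the top-reward path is wrong
print(any_path_correct(paths, gold))   # True: a correct path was explored
```

In this toy case the search did find the correct answer, but the probability-based reward ranks a wrong path above it, so a best-path evaluation counts the problem as failed while a Pass@K-style evaluation would count it as solved.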