igitman committed (verified) · Commit f6a7903 · 1 Parent(s): 06e0cc4

Update README.md

Files changed (1): README.md (+27 -17)

README.md CHANGED
@@ -22,37 +22,48 @@ GOVERNING TERMS: Use of the models listed above are governed by the [Creative Co
 
 ## Scores on Reasoning Benchmarks
 
- | **Model** | **AritificalAnalysisIndex** | **GPQA** | **MMLU-PRO** | **HLE** | **LiveCodeBench** | **SciCode** | **AIME24** | **AIME25** | **HMMT FEB 25** | **BRUMO25** |
- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
- | **1.5B**| - | 31.6 | 47.5 | 5.5 | 28.6 | 2.2 | 55.5 | 45.6 | 31.5 | 50.6 |
- | **7B** | 54.7 | 61.1 | 71.9 | 8.3 | 63.3 | 16.2 | 84.7 | 78.2 | 63.5 | 80.3 |
- | **14B** | 60.9 | 71.6 | 77.5 | 10.1 | 67.8 | 23.5 | 87.8 | 82.0 | 71.2 | 87.7 |
- | **32B** | 64.3 | 73.1 | 80.0 | 11.9 | 70.2 | 28.5 | 89.2 | 84.0 | 73.8 | 88.0 |
 
- ## Scores for Math Reasoning Benchmarks with GenSelect
 
- | **Model** | **Pass@1 (Avg@64)** | **Majority@64** | **GenSelect@64** |
 | :--- | :--- | :--- | :--- |
 | **1.5B** | | | |
 | **AIME24** | 55.5 | 76.7 | 76.7 |
 | **AIME25** | 45.6 | 70.0 | 70.0 |
 | **HMMT Feb 25** | 31.5 | 46.7 | 53.3 |
- | **BRUNO25** | 50.6 | 70.0 | 73.3 |
 | **7B** | | | |
 | **AIME24** | 84.7 | 93.3 | 93.3 |
 | **AIME25** | 78.2 | 86.7 | 93.3 |
 | **HMMT Feb 25** | 63.5 | 83.3 | 90.0 |
- | **BRUNO25** | 80.3 | 93.3 | 96.7 |
 | **14B** | | | |
 | **AIME24** | 87.8 | 93.3 | 93.3 |
 | **AIME25** | 82.0 | 90.0 | 90.0 |
 | **HMMT Feb 25** | 71.2 | 86.7 | 93.3 |
- | **BRUNO25** | 87.7 | 96.7 | 96.7 |
 | **32B** | | | |
 | **AIME24** | 89.2 | 93.3 | 93.3 |
 | **AIME25** | 84.0 | 90.0 | 93.3 |
 | **HMMT Feb 25** | 73.8 | 86.7 | 96.7 |
- | **BRUNO25** | 88.0 | 96.7 | 100.0 |
 
 
  ## How to use the models?
@@ -80,8 +91,7 @@ You must use ```python for just the final solution code block with the following
 """
 
 # Math generation prompt
- # prompt = """
- # Solve the following math problem. Make sure to put the answer (and only answer) inside \boxed{{}}.
 #
 # {user}
 # """
@@ -155,9 +165,9 @@ This model is intended for developers and researchers who work on competitive ma
 Huggingface [07/16/2025] via https://huggingface.co/nvidia/OpenReasoning-Nemotron-14B/ <br>
 
 ## Reference(s):
- [2504.01943] OpenCodeReasoning: Advancing Data Distillation for Competitive Coding
- [2504.01943] OpenCodeReasoning: Advancing Data Distillation for Competitive Coding
- [2504.16891] AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset
 <br>
 
 ## Model Architecture: <br>
 
 
 ## Scores on Reasoning Benchmarks
 
+ ![Evaluation Results with pass@1](https://raw.githubusercontent.com/NVIDIA/NeMo-Skills/main/docs/releases/openreasoning/pass-1.png)
+
+ Our models demonstrate exceptional performance across a suite of challenging reasoning benchmarks. The 7B, 14B, and 32B models consistently set new state-of-the-art records for their size classes.
+
+ | **Model** | **ArtificialAnalysisIndex*** | **GPQA** | **MMLU-PRO** | **HLE** | **LiveCodeBench*** | **SciCode** | **AIME24** | **AIME25** | **HMMT FEB 25** |
+ | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
+ | **1.5B** | 31.0 | 31.6 | 47.5 | 5.5 | 28.6 | 2.2 | 55.5 | 45.6 | 31.5 |
+ | **7B** | 54.7 | 61.1 | 71.9 | 8.3 | 63.3 | 16.2 | 84.7 | 78.2 | 63.5 |
+ | **14B** | 60.9 | 71.6 | 77.5 | 10.1 | 67.8 | 23.5 | 87.8 | 82.0 | 71.2 |
+ | **32B** | 64.3 | 73.1 | 80.0 | 11.9 | 70.2 | 28.5 | 89.2 | 84.0 | 73.8 |
+
+ \* This is our estimation of the Artificial Analysis Intelligence Index, not an official score.
+
+ \* LiveCodeBench version 6, date range 2408-2505.
+
+ ## Combining the work of multiple agents
+ OpenReasoning-Nemotron models can be used in a "heavy" mode by launching multiple generations in parallel and combining them via [generative solution selection (GenSelect)](https://arxiv.org/abs/2504.16891). To add this "skill", we follow the original GenSelect training pipeline, except that instead of training on the selection summary we train on the full reasoning trace of DeepSeek R1 0528 671B. We train the models to select the best solution only for math problems, yet surprisingly find that this capability generalizes directly to code and science questions! With this "heavy" GenSelect inference mode, the OpenReasoning-Nemotron-32B model surpasses O3 (High) on math and coding benchmarks. A rough sketch of this inference pattern is included after the results table below.
+
+ ![Evaluation Results with GenSelect](https://raw.githubusercontent.com/NVIDIA/NeMo-Skills/main/docs/releases/openreasoning/genselect.png)
+
+ | **Model** | **Pass@1 (Avg@64)** | **Majority@64** | **GenSelect** |
 | :--- | :--- | :--- | :--- |
 | **1.5B** | | | |
 | **AIME24** | 55.5 | 76.7 | 76.7 |
 | **AIME25** | 45.6 | 70.0 | 70.0 |
 | **HMMT Feb 25** | 31.5 | 46.7 | 53.3 |
 | **7B** | | | |
 | **AIME24** | 84.7 | 93.3 | 93.3 |
 | **AIME25** | 78.2 | 86.7 | 93.3 |
 | **HMMT Feb 25** | 63.5 | 83.3 | 90.0 |
+ | **LCB v6 2408-2505** | 63.4 | n/a | 67.7 |
 | **14B** | | | |
 | **AIME24** | 87.8 | 93.3 | 93.3 |
 | **AIME25** | 82.0 | 90.0 | 90.0 |
 | **HMMT Feb 25** | 71.2 | 86.7 | 93.3 |
+ | **LCB v6 2408-2505** | 67.9 | n/a | 69.1 |
 | **32B** | | | |
 | **AIME24** | 89.2 | 93.3 | 93.3 |
 | **AIME25** | 84.0 | 90.0 | 93.3 |
 | **HMMT Feb 25** | 73.8 | 86.7 | 96.7 |
+ | **LCB v6 2408-2505** | 70.2 | n/a | 75.3 |
+ | **HLE** | 11.8 | 13.4 | 15.5 |
 
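Below is a minimal, self-contained sketch of this "heavy" inference pattern: sample several candidate solutions, then ask the same model to judge which one is most likely correct. It is an illustration only, not the official pipeline; the endpoint URL, model name, number of candidates, sampling settings, and the wording of the selection prompt are all assumptions rather than the exact GenSelect prompts used in training.

```python
# Illustrative GenSelect-style "heavy" inference (not the official pipeline).
# Assumptions: an OpenAI-compatible server (e.g. a local vLLM instance) is serving
# the model at BASE_URL, and the judge prompt below is a stand-in for the real one.
from openai import OpenAI

BASE_URL = "http://localhost:8000/v1"        # assumed local endpoint
MODEL = "nvidia/OpenReasoning-Nemotron-32B"  # any OpenReasoning-Nemotron size works
NUM_CANDIDATES = 8                           # the reported results use up to 64 generations

client = OpenAI(base_url=BASE_URL, api_key="EMPTY")


def generate_candidates(problem: str) -> list[str]:
    """Sample several independent solutions for the same problem in one request."""
    prompt = (
        "Solve the following math problem. Make sure to put the answer "
        f"(and only answer) inside \\boxed{{}}.\n\n{problem}"
    )
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        n=NUM_CANDIDATES,
        temperature=0.6,
        top_p=0.95,
        max_tokens=32768,  # reasoning traces are long, so use a generous budget
    )
    return [choice.message.content for choice in response.choices]


def genselect(problem: str, candidates: list[str]) -> str:
    """Ask the same model to judge which candidate is most likely correct."""
    numbered = "\n\n".join(
        f"Solution {i}:\n{solution}" for i, solution in enumerate(candidates)
    )
    judge_prompt = (
        f"You are given a math problem and {len(candidates)} candidate solutions.\n\n"
        f"Problem:\n{problem}\n\n{numbered}\n\n"
        "Decide which solution is most likely correct and output only its index "
        "inside \\boxed{}."
    )
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": judge_prompt}],
        temperature=0.0,
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    problem = "What is the sum of the first 100 positive integers?"
    candidates = generate_candidates(problem)
    print(genselect(problem, candidates))
```

For comparison, the Majority@64 column above corresponds to plain majority voting: take the most common `\boxed{}` answer across the same pool of sampled solutions, with no second judging call.
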
  ## How to use the models?
 
 """
 
 # Math generation prompt
+ # prompt = """Solve the following math problem. Make sure to put the answer (and only answer) inside \\boxed{}.
 #
 # {user}
 # """
 
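As a rough, self-contained illustration of how this math prompt can be filled in and passed to the model (the checkpoint size, dtype, toy question, and generation settings below are assumptions, not recommended values):

```python
# Illustrative use of the math generation prompt with Hugging Face transformers.
# Assumptions: the 7B checkpoint, bfloat16 weights, and the sampling settings shown here.
import torch
import transformers

model_id = "nvidia/OpenReasoning-Nemotron-7B"
pipe = transformers.pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

prompt = """Solve the following math problem. Make sure to put the answer (and only answer) inside \\boxed{}.

{user}
"""

question = "What is 17 * 23?"  # toy example
messages = [{"role": "user", "content": prompt.replace("{user}", question)}]

outputs = pipe(messages, max_new_tokens=4096, do_sample=True, temperature=0.6, top_p=0.95)
# The pipeline returns the chat history with the assistant reply appended last.
print(outputs[0]["generated_text"][-1]["content"])
```
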
 Huggingface [07/16/2025] via https://huggingface.co/nvidia/OpenReasoning-Nemotron-14B/ <br>
 
 ## Reference(s):
+ * [2504.01943] OpenCodeReasoning: Advancing Data Distillation for Competitive Coding
+ * [2504.16891] AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset
 <br>
 
 ## Model Architecture: <br>