Update README.md
README.md
CHANGED
@@ -24,12 +24,14 @@ We also ensured that the model’s math and reasoning abilities remained intact
 Evaluations on multiple benchmarks showed that our post-trained model performed on par with the base R1 model,
 indicating that the decensoring had no impact on its core reasoning capabilities.
 
-| Benchmark | R1-Distill-
+| Benchmark | R1-Distill-Llama-70B | R1-1776-Distill-Llama-70B |
 | --- | --- | --- |
 | China Censorship | 80.53 | 0.2 |
-| Internal Benchmarks (avg) |
+| Internal Benchmarks (avg) | 47.64 | 48.4 |
 | AIME 2024 | 70 | 70 |
 | MATH-500 | 94.5 | 94.8 |
-| MMLU | 88.52 | 88.20 |
-| DROP | 84.55 | 84.83 |
-| GPQA | 65.2 | 65.05 |
+| MMLU | 88.52 * | 88.20 |
+| DROP | 84.55 * | 84.83 |
+| GPQA | 65.2 | 65.05 |
+
+\* Evaluated by Perplexity AI since they were not reported in the [paper](https://arxiv.org/abs/2501.12948).