Update README.md
README.md
CHANGED
@@ -105,8 +105,8 @@ It currently stands as the **best open-source flagship non-thinking model**, riv
 | | mbpp | 90.69 | 89.96 | **91.72** | 91.01 | __<span style="color:red">96.87</span>__ |
 | | LiveCodeBench (2408-2505) | 48.02 | 48.95 | **48.57** | 45.43 | __<span style="color:red">61.68</span>__ |
 | | CodeForces-rating | 1582 | 1574 | 1120 | **1675** | __<span style="color:red">1901</span>__ |
-| **Coding** | **Software Development** | | | | | |
 | | BIRD_SQL | 44.88 | 46.45 | 43.97 | __<span style="color:red">54.76</span>__ | **52.38** |
+| **Coding** | **Software Development** | | | | | |
 | | ArtifactsBench | 43.29 | 44.87 | 41.04 | __<span style="color:red">60.28</span>__ | **59.31** |
 | | FullStack Bench | **55.48** | 54.00 | 50.92 | 48.19 | __<span style="color:red">56.55</span>__ |
 | | Aider | **88.16** | 85.34 | 84.40 | __<span style="color:red">89.85</span>__ | 83.65 |
@@ -121,7 +121,7 @@ It currently stands as the **best open-source flagship non-thinking model**, riv
 | | OptMATH | 35.99 | 35.84 | 39.16 | **42.77** | __<span style="color:red">57.68</span>__ |
 | **General Reasoning** | | | | | | |
 | | BBEH | **42.86** | 34.83 | 39.75 | 29.08 | __<span style="color:red">47.34</span>__ |
-| | KOR-Bench | 73.76 | 73.20 | 70.56 | 59.68 | __<span style="color:red">76.00</span>__ |
+| | KOR-Bench | **73.76** | 73.20 | 70.56 | 59.68 | __<span style="color:red">76.00</span>__ |
 | | ARC-AGI-1 | 14.69 | **22.19** | 14.06 | 18.94 | __<span style="color:red">43.81</span>__ |
 | | ZebraLogic | 81.6 | **85.5** | 57.3 | 70.2 | __<span style="color:red">90.8</span>__ |
 | **Agent** | | | | | | |