---
license: mit
language:
- en
tags:
- chemistry
- physics
- math
- biology
- science
pretty_name: open-rl
size_categories:
- n<1K
task_categories:
- question-answering
---

# Open-RL

[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)
[![Turing](https://img.shields.io/badge/Org-Turing-blue)](https://turing.com)

---

## Dataset Summary

This dataset contains **self-contained, verifiable, and unambiguous STEM reasoning problems** across Physics, Mathematics, Biology, and Chemistry.

Each problem:

* Requires multi-step reasoning
* Involves symbolic manipulation and/or numerical computation
* Has a deterministic, objectively verifiable final answer

The problems were evaluated against contemporary large language models. Observed pass rates indicate that the tasks are **non-trivial yet solvable**, placing them within reach of advanced models while still exposing meaningful reasoning gaps.

This makes the dataset particularly suitable for:

* Reinforcement learning (RL) fine-tuning
* Reward modeling
* Outcome-supervised training
* Verifiable reasoning benchmarks

---

## Dataset Structure

| Field             | Type   | Description                               |
| ----------------- | ------ | ----------------------------------------- |
| `conversation_id` | string | Unique identifier for each QA pair.       |
| `domain`          | string | Physics, Math, Chemistry, or Biology.     |
| `sub_domain`      | string | Specific sub-discipline within the domain. |
| `question`        | string | STEM problem statement (LaTeX supported). |
| `answer`          | string | Deterministic ground-truth solution.      |

---

## Example

```json
{
  "conversation_id": "217998",
  "domain": "Physics",
  "sub_domain": "Astrophysics",
  "question": "Consider a Navarro–Frenk–White (NFW) dark matter halo profile where...",
  "answer": "\\( \\frac{4GM_{0}}{r_{0}} + \\frac{16\\pi Gk}{r_{0}}\\left[ \\ln\\left(\\frac{r_{0}}{r_{s}}\\right) + 0.31 \\right] \\)"
}
```
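
A record like the one above can be sanity-checked with the standard library alone. This is a minimal sketch, not an official loader; `validate_record` and `REQUIRED_FIELDS` are names introduced here for illustration:

```python
import json

# The five fields documented in the Dataset Structure table above.
REQUIRED_FIELDS = {"conversation_id", "domain", "sub_domain", "question", "answer"}

def validate_record(raw: str) -> dict:
    """Parse a JSON record and verify all dataset fields are present string values."""
    record = json.loads(raw)
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    for field in REQUIRED_FIELDS:
        if not isinstance(record[field], str):
            raise TypeError(f"field {field!r} must be a string")
    return record
```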

---

## Verifiability and Automatic Grading

A core design principle of this dataset is **objective verifiability**.

Each problem is constructed such that:

* The final answer is deterministic
* Correctness can be evaluated programmatically
* No subjective interpretation is required
* There is a clear separation between reasoning steps and final outcome

### Answer Types

The dataset includes answers that are:

* Closed-form symbolic expressions
* Numerical scalars
* Algebraic identities
* Simplified analytic forms
* Canonical LaTeX representations

Because answers are deterministic, evaluation can be performed via:

* Exact string matching (after normalization)
* Symbolic equivalence checking (e.g., SymPy)
* Numerical tolerance comparison
* Unit consistency validation (where applicable)
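
As a minimal sketch of the first and third strategies (the normalization here is an assumption for illustration, not the dataset's official grader; a SymPy equivalence check could replace the numerical fallback for symbolic answers):

```python
import math
import re

def normalize(answer: str) -> str:
    """Strip inline-math delimiters \\( \\) and all whitespace, then lowercase."""
    stripped = re.sub(r"\\[()]", "", answer)
    return re.sub(r"\s+", "", stripped).lower()

def is_correct(predicted: str, gold: str, rel_tol: float = 1e-6) -> bool:
    """Exact match after normalization, falling back to numerical tolerance."""
    if normalize(predicted) == normalize(gold):
        return True
    try:
        return math.isclose(float(predicted), float(gold), rel_tol=rel_tol)
    except ValueError:
        # Non-numeric strings that differ after normalization are marked wrong.
        return False
```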

---

## Data Quality Assurance Process

To ensure the scientific validity of each answer, all tasks are prepared and then reviewed twice by PhD experts.

Key quality rubrics include:

* Prompt and answer accuracy
* Clarity of the prompt and its underlying reasoning
* Expert verification that model-breaking cases stem from incorrect model reasoning
* Google-proof originality validation

---

## Reinforcement Learning and Outcome Supervision

This dataset is designed to support **outcome-based reinforcement learning** for reasoning models.

In contrast to preference-based RL (RLHF), which relies on subjective ranking signals, this dataset enables:

* Outcome-supervised reinforcement learning (OSRL)
* Deterministic reward assignment
* Binary or graded correctness rewards
* Scalable automated evaluation

### Example RL Setup

Given:

* Prompt: `question`
* Model output: predicted final answer

Reward can be computed as:

* `+1` if the final answer matches the ground truth
* `0` or `-1` otherwise
* Optional partial credit via symbolic or numerical closeness
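
The reward scheme above can be sketched as follows. The partial-credit rule (linear decay with relative error) is an assumed scheme for illustration, not something specified by the dataset:

```python
def answer_reward(predicted: str, gold: str) -> float:
    """+1 on exact match, graded credit for close numerical answers, -1 otherwise."""
    if predicted.strip() == gold.strip():
        return 1.0
    try:
        p, g = float(predicted), float(gold)
    except ValueError:
        return -1.0  # non-numeric mismatch: no credit
    rel_err = abs(p - g) / max(abs(g), 1e-12)
    # Assumed partial-credit scheme: linearly decay reward with relative error.
    return 1.0 - rel_err if rel_err < 1.0 else -1.0
```

In practice the exact-match branch would call whatever verifier (string, symbolic, or numerical) the evaluation harness uses.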

This allows:

* Policy gradient methods (e.g., PPO)
* Direct optimization against correctness signals
* Reward model bootstrapping
* Iterative self-improvement pipelines

### Calibration Regime

The problems were stress-tested against advanced language models and found to be:

* Not trivially solved
* Not universally failed
* Within the capability frontier of modern LLMs

This places them in a **learning-efficient regime**:

* Hard enough to produce a gradient signal
* Solvable enough to avoid reward sparsity
* Suitable for curriculum-style training

---

## Future Directions: NuRL and Structured Nudging

We plan to extend this dataset with additional problem sets and a structured **"nudge" augmentation layer** inspired by the paper *["Nudging the Boundaries of LLM Reasoning"](https://arxiv.org/html/2509.25666v1)*.

### Motivation

Standard online RL algorithms (e.g., GRPO-style approaches) can only learn from problems where the model occasionally produces correct rollouts. For sufficiently difficult problems with a **0% pass rate**, no reward signal is generated, and therefore no gradient updates occur. As a result, such problems cannot contribute to expanding the model's reasoning frontier.

### NuRL-Style Nudging

To address this limitation, future versions of this dataset will include:

* Abstract, high-level **hints ("nudges")**
* Hints generated conditioned on the gold answer
* Carefully designed cues that reduce problem difficulty without revealing the solution

Under a NuRL-style training pipeline:

1. Rollouts are first generated without hints.
2. If the pass rate is above 0%, standard RL proceeds.
3. If the pass rate is 0%, a structured hint is injected.
4. A new batch of trajectories is generated with the hint.
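
The four steps above amount to gating hint injection on the unhinted pass rate. A minimal sketch (all function names here are hypothetical placeholders, not an actual NuRL implementation):

```python
from typing import Callable, List

def nurl_rollouts(
    question: str,
    hint: str,
    generate: Callable[[str], List[str]],
    is_correct: Callable[[str], bool],
) -> List[str]:
    """Gate hint injection on the unhinted pass rate, per the four steps above."""
    rollouts = generate(question)                 # 1. rollouts without hints
    if any(is_correct(r) for r in rollouts):      # 2. pass rate > 0%: standard RL
        return rollouts
    hinted = f"{question}\n\nHint: {hint}"        # 3. pass rate = 0%: inject hint
    return generate(hinted)                       # 4. fresh trajectories with the hint
```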

This enables:

* Previously unsolvable samples to produce non-zero rewards
* Learning signal from frontier-level problems
* Expansion of the model's upper reasoning bound

### Design Principles for Effective Nudges

Planned nudges will follow empirical findings from prior work:

* Hints should be **abstract and knowledge-oriented**, not answer-revealing
* Hints should preserve distributional alignment with base-policy reasoning
* Hints should be injected only when necessary
* Nudges are most effective after base RL convergence

---

This evolution positions the dataset not only as a verifiable benchmark, but also as a controlled testbed for **upper-bound expansion in reinforcement learning for reasoning models**.

---

## Citation

```bibtex
@dataset{turing_2026_open_rl,
  title  = {Open-RL},
  author = {Saurabh Patil and Anshuman Lall and Marko Pavlovic and Chinmayee Shukla and Seetesh Pande and Tejass Mohan Ukarde and Amanda Gollo Bertollo and Mahesh Joshi and Kihwan Han},
  year   = {2026},
  url    = {https://huggingface.co/datasets/TuringEnterprises/Open-RL}
}
```