skatzR committed on
Commit 37f456e · verified · 1 Parent(s): 378335c

Update README.md

  1. README.md +331 -3
README.md CHANGED
@@ -1,3 +1,331 @@
1
- ---
2
- license: mit
3
- ---
1
+ ---
2
+ license: mit
3
+ base_model:
4
+ - FacebookAI/xlm-roberta-large
5
+ language:
6
+ - ru
7
+ tags:
8
+ - reasoning
9
+ - logical-analysis
10
+ - text-classification
11
+ - ai-safety
12
+ - evaluation
13
+ - judge-model
14
+ - argumentation
15
+ pipeline_tag: text-classification
16
+ ---
17
+
18
+ # RQA — Reasoning Quality Analyzer (R2)
19
+
20
+ **RQA-R2** is a **judge model** for reasoning-quality evaluation.
21
+ It does **not** generate, rewrite, or explain text. Instead, it determines whether a text contains a reasoning problem, whether that problem is **hidden** or **explicit**, and which explicit error types are present.
22
+
23
+ > RQA is a judge, not a teacher and not a generator.
24
+
25
+ ---
26
+
27
+ ## What Is New in R2 Compared to R1
28
+
29
+ R2 is not just a retrain of R1. It is a full methodological upgrade.
30
+
31
+ ### Core differences
32
+
33
+ - **R1** used a more limited 2-signal setup.
34
+ - **R2** uses a strict **3-head ontology**:
35
+ - `has_issue`
36
+ - `is_hidden`
37
+ - `error_types`
38
+
39
+ ### Key improvements in R2
40
+
41
+ - explicit hidden-problem modeling instead of weaker implicit logic
42
+ - strict `logical / hidden / explicit` inference contract
43
+ - honest `train / val / calib / test` split
44
+ - separate calibration split for temperatures and thresholds
45
+ - per-class thresholds for error types
46
+ - uncertainty-aware inference with `status=uncertain` and `review_required`
47
+ - duplicate and conflict-duplicate filtering in the loader
48
+ - truncation audit and richer evaluation reports
49
+ - better optimizer setup for transformer fine-tuning
50
+ - staged encoder fine-tuning with freeze/unfreeze
51
+ - stronger schema/version safety for inference artifacts
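The list above mentions duplicate and conflict-duplicate filtering in the loader without showing it. The following is a minimal sketch of what such a filter could look like, not the actual loader: the function name `filter_duplicates`, the `(text, label)` sample shape, and the whitespace/case normalization are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def _normalize(text: str) -> str:
    # Assumed normalization: lowercase and collapse whitespace.
    return " ".join(text.lower().split())


def filter_duplicates(samples: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep one copy of duplicated texts; drop texts whose copies disagree on the label."""
    labels_by_text: Dict[str, Set[str]] = defaultdict(set)
    for text, label in samples:
        labels_by_text[_normalize(text)].add(label)

    kept: List[Tuple[str, str]] = []
    seen: Set[str] = set()
    for text, label in samples:
        key = _normalize(text)
        if len(labels_by_text[key]) > 1:
            continue  # conflict duplicate: same text, different labels
        if key in seen:
            continue  # plain duplicate: keep only the first copy
        seen.add(key)
        kept.append((text, label))
    return kept


samples = [
    ("A causes B.", "explicit"),
    ("a causes  b.", "explicit"),  # duplicate after normalization
    ("X is Y.", "logical"),
    ("X is Y.", "hidden"),         # conflicting labels for the same text
    ("Z.", "logical"),
]
cleaned = filter_duplicates(samples)
```

In this sketch conflicting copies are dropped entirely rather than resolved, on the assumption that a contradictory label is less harmful when absent than when guessed.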
52
+
53
+ In short:
54
+
55
+ > **R1** was a strong prototype.
56
+ > **R2** is the first version that behaves like a full training + calibration + inference pipeline.
57
+
58
+ ---
59
+
60
+ ## What Problem RQA-R2 Solves
61
+
62
+ Texts written by humans or LLMs can:
63
+
64
+ - sound coherent
65
+ - use correct vocabulary
66
+ - appear persuasive
67
+
68
+ ...while still containing **reasoning problems** that are:
69
+
70
+ - subtle
71
+ - structural
72
+ - hidden in argumentation
73
+
74
+ RQA-R2 focuses specifically on **reasoning quality**, not on style, grammar, sentiment, or factual correctness.
75
+
76
+ ---
77
+
78
+ ## Model Overview
79
+
80
+ | Property | Value |
81
+ |---|---|
82
+ | Model Type | Judge / Evaluator |
83
+ | Base Encoder | [XLM-RoBERTa Large](https://huggingface.co/FacebookAI/xlm-roberta-large) |
84
+ | Pooling | Mean pooling |
85
+ | Heads | 3 (`has_issue`, `is_hidden`, `error_types`) |
86
+ | Language | Russian |
87
+ | License | MIT |
88
+
89
+ ---
90
+
91
+ ## What the Model Predicts
92
+
93
+ RQA-R2 predicts three connected outputs.
94
+
95
+ ### 1. Logical Issue Detection
96
+
97
+ - `has_logical_issue ∈ {false, true}`
98
+ - calibrated probability available
99
+
100
+ ### 2. Hidden Problem Detection
101
+
102
+ - `is_hidden_problem ∈ {false, true}`
103
+ - evaluated only when a reasoning issue exists
104
+
105
+ ### 3. Explicit Error Type Classification
106
+
107
+ If the text is classified as `explicit`, the model may assign one or more of the following error types:
108
+
109
+ - `false_causality`
110
+ - `unsupported_claim`
111
+ - `overgeneralization`
112
+ - `missing_premise`
113
+ - `contradiction`
114
+ - `circular_reasoning`
115
+
116
+ This is a **multi-label** prediction head.
117
+
118
+ ---
119
+
120
+ ## Ontology
121
+
122
+ R2 uses a strict three-class reasoning ontology.
123
+
124
+ ### `logical`
125
+
126
+ - no reasoning issue
127
+ - no hidden problem
128
+ - no explicit errors
129
+
130
+ ### `hidden`
131
+
132
+ - reasoning problem exists
133
+ - no explicit labeled fallacy
134
+ - the issue is structural, implicit, or argumentative
135
+
136
+ ### `explicit`
137
+
138
+ - reasoning problem exists
139
+ - at least one explicit error type is present
140
+
141
+ This ontology is enforced in both training and inference.
142
+
143
+ ---
144
+
145
+ ## Inference Contract
146
+
147
+ RQA-R2 uses gated inference:
148
+
149
+ - if `has_issue = false` -> class is `logical`, no errors are returned
150
+ - if `has_issue = true` and `is_hidden = true` -> class is `hidden`, no explicit errors are returned
151
+ - if `has_issue = true` and `is_hidden = false` -> class is `explicit`, explicit errors may be returned
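The three gating rules above can be sketched as a small function. This is an illustrative sketch that assumes the three heads have already produced calibrated probabilities; the function name `gate_prediction` and the default thresholds of `0.5` are placeholders, not values from the released artifacts.

```python
from typing import Dict, List, Union


def gate_prediction(
    p_issue: float,
    p_hidden: float,
    p_errors: Dict[str, float],
    t_issue: float = 0.5,    # placeholder threshold (assumption)
    t_hidden: float = 0.5,   # placeholder threshold (assumption)
    t_errors: float = 0.5,   # placeholder threshold (assumption)
) -> Dict[str, Union[str, List[str]]]:
    """Map head probabilities to `logical`, `hidden`, or `explicit`."""
    # Rule 1: no issue -> logical, and no errors are returned.
    if p_issue < t_issue:
        return {"class": "logical", "errors": []}
    # Rule 2: issue + hidden -> hidden, explicit errors are suppressed.
    if p_hidden >= t_hidden:
        return {"class": "hidden", "errors": []}
    # Rule 3: issue + not hidden -> explicit, errors may be returned.
    errors = sorted(name for name, p in p_errors.items() if p >= t_errors)
    return {"class": "explicit", "errors": errors}


result = gate_prediction(0.97, 0.05, {"contradiction": 0.91, "missing_premise": 0.20})
```

Because the gate runs top to bottom, a high error-type probability can never surface for a text that the first two heads route to `logical` or `hidden`.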
152
+
153
+ R2 also supports:
154
+
155
+ - calibrated thresholds
156
+ - `uncertain` mode
157
+ - `review_required` for borderline cases
158
+
159
+ ---
160
+
161
+ ## Architecture
162
+
163
+ RQA-R2 is built on top of **XLM-RoBERTa Large** with:
164
+
165
+ - mean pooling
166
+ - separate projections per task
167
+ - separate dropout per head
168
+ - 3 task-specific heads
169
+ - uncertainty-weighted multi-task training
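Mean pooling, the first item in the list above, averages the encoder's token vectors while ignoring padding positions. The sketch below shows the idea in plain Python on toy vectors; the function name `mean_pool` is illustrative, and a real implementation would operate on batched tensors from the encoder.

```python
from typing import List


def mean_pool(token_vectors: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose attention mask is 1, skipping padding."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    count = max(count, 1)  # guard against an all-padding sequence
    return [t / count for t in total]


# Two real tokens followed by one padding token.
pooled = mean_pool([[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]], [1, 1, 0])
```

The padding vector does not influence the result, which is the reason the attention mask is part of the pooling step.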
170
+
171
+ Training is hierarchical:
172
+
173
+ - `has_issue` is trained on all samples
174
+ - `is_hidden` is trained only on problem samples
175
+ - `error_types` are trained only on explicit samples
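The hierarchical rule above amounts to a per-head sample mask: each head's loss is averaged only over the samples it is allowed to see. The following is a minimal sketch under that reading; `head_masks` and `masked_mean` are hypothetical helper names, and the per-sample loss values are toy numbers.

```python
from typing import Dict, List


def head_masks(classes: List[str]) -> Dict[str, List[bool]]:
    """For each head, mark which samples contribute to its loss."""
    return {
        "has_issue": [True for _ in classes],                      # all samples
        "is_hidden": [c in ("hidden", "explicit") for c in classes],  # problem samples
        "error_types": [c == "explicit" for c in classes],         # explicit samples
    }


def masked_mean(losses: List[float], mask: List[bool]) -> float:
    """Mean of the losses selected by the mask; 0.0 if nothing is selected."""
    kept = [loss for loss, keep in zip(losses, mask) if keep]
    return sum(kept) / len(kept) if kept else 0.0


masks = head_masks(["logical", "hidden", "explicit"])
hidden_loss = masked_mean([1.0, 2.0, 4.0], masks["is_hidden"])
```

With this masking, a `logical` sample never pushes gradients through the `is_hidden` or `error_types` heads.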
176
+
177
+ ---
178
+
179
+ ## Training and Calibration
180
+
181
+ R2 uses an honest experimental structure:
182
+
183
+ - `train` for fitting
184
+ - `val` for model selection
185
+ - `calib` for temperature scaling and threshold tuning
186
+ - `test` for final held-out evaluation
187
+
188
+ Calibration includes:
189
+
190
+ - issue temperature
191
+ - hidden temperature
192
+ - per-class error temperatures
193
+ - threshold selection for `has_issue`
194
+ - threshold selection for `is_hidden`
195
+ - per-class thresholds for error types
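To make the calibration items above concrete, the sketch below applies temperature scaling to a logit and then compares the result with a per-class threshold. It assumes sigmoid outputs for the error-type head, which is the usual choice for multi-label classification but is not stated in this README; the temperatures and logits are made-up numbers, and only the `0.54` threshold is borrowed from the output example.

```python
import math
from typing import Dict


def calibrated_probability(logit: float, temperature: float) -> float:
    """Temperature scaling: divide the logit by T before the sigmoid."""
    return 1.0 / (1.0 + math.exp(-logit / temperature))


def select_errors(
    logits: Dict[str, float],
    temperatures: Dict[str, float],
    thresholds: Dict[str, float],
) -> Dict[str, float]:
    """Return the error types whose calibrated probability clears its own threshold."""
    selected = {}
    for name, logit in logits.items():
        p = calibrated_probability(logit, temperatures[name])
        if p >= thresholds[name]:
            selected[name] = p
    return selected


picked = select_errors(
    logits={"missing_premise": 3.0, "contradiction": -1.0},        # toy logits
    temperatures={"missing_premise": 1.5, "contradiction": 1.0},   # toy temperatures
    thresholds={"missing_premise": 0.54, "contradiction": 0.5},
)
```

A temperature above 1 pulls probabilities toward 0.5, so the same logit yields a less confident score; the threshold is then tuned on the calibrated scale rather than on raw logits.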
196
+
197
+ ---
198
+
199
+ ## Held-Out Synthetic Benchmark
200
+
201
+ The following metrics were obtained on the current held-out synthetic test split used for R2:
202
+
203
+ - `Issue`: `F1 = 0.988`, `FPR = 0.029`, `PR-AUC = 0.999`
204
+ - `Hidden`: `F1 = 0.960`, `PR-AUC = 0.994`
205
+ - `Errors`: `macro-F1 = 0.822`, `micro-F1 = 0.813`, `samples-F1 = 0.838`
206
+ - `Top-level class macro-F1 = 0.964`
207
+ - `Coverage = 95.6%`
208
+ - `Uncertain rate = 4.4%`
209
+
210
+ These are strong results for a model trained and evaluated on the current, primarily synthetic dataset.
211
+
212
+ Important:
213
+
214
+ > These metrics are measured on a held-out split from the current synthetic dataset.
215
+ > They demonstrate that the R2 design works very well in-distribution, but they should not be interpreted as proof of universal real-world reasoning performance.
216
+
217
+ ---
218
+
219
+ ## Training Data
220
+
221
+ RQA-R2 was trained on a custom reasoning-quality dataset with:
222
+
223
+ - `7292` total samples
224
+ - `3150` logical texts
225
+ - `4142` problematic texts
226
+ - `1242` hidden problems
227
+ - `2900` explicit cases
228
+
229
+ Error-label counts:
230
+
231
+ - `false_causality`: `518`
232
+ - `unsupported_claim`: `524`
233
+ - `overgeneralization`: `599`
234
+ - `missing_premise`: `537`
235
+ - `contradiction`: `475`
236
+ - `circular_reasoning`: `540`
237
+
238
+ Multi-label explicit cases:
239
+
240
+ - `293`
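The counts above are mutually consistent, which the following short check makes explicit. It uses only the numbers listed in this section; the interpretation in the final comment is an inference from the arithmetic, not a statement from the dataset documentation.

```python
class_counts = {"logical": 3150, "hidden": 1242, "explicit": 2900}
error_label_counts = {
    "false_causality": 518,
    "unsupported_claim": 524,
    "overgeneralization": 599,
    "missing_premise": 537,
    "contradiction": 475,
    "circular_reasoning": 540,
}
multi_label_cases = 293

total = sum(class_counts.values())                             # all samples
problematic = class_counts["hidden"] + class_counts["explicit"]  # texts with an issue
label_total = sum(error_label_counts.values())                 # assigned error labels
extra_labels = label_total - class_counts["explicit"]          # labels beyond one per case

# extra_labels equals multi_label_cases, which is what one would expect
# if each multi-label case carries exactly two error types (an inference).
```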
241
+
242
+ The current dataset is sufficient for training and benchmarking R2, but it is still primarily **synthetic** and should be expanded with real-world data in future versions.
243
+
244
+ ---
245
+
246
+ ## Intended Use
247
+
248
+ ### Recommended for
249
+
250
+ - reasoning-quality evaluation
251
+ - LLM output auditing
252
+ - AI safety pipelines
253
+ - judge/reranker pipelines
254
+ - pre-filtering for downstream review
255
+ - analytical tooling around argument structure
256
+
257
+ ### Not intended for
258
+
259
+ - text generation
260
+ - explanation generation
261
+ - automatic rewriting or correction
262
+ - factual verification
263
+ - legal or scientific truth adjudication
264
+
265
+ ---
266
+
267
+ ## Output Example
268
+
269
+ ```json
270
+ {
271
+ "class": "explicit",
272
+ "status": "ok",
273
+ "review_required": false,
274
+ "has_logical_issue": true,
275
+ "has_issue_probability": 0.9993,
276
+ "is_hidden_problem": false,
277
+ "hidden_probability": 0.021,
278
+ "errors": [
279
+ {
280
+ "type": "missing_premise",
281
+ "probability": 0.923,
282
+ "threshold": 0.54
283
+ }
284
+ ]
285
+ }
286
+ ```
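A downstream pipeline might consume this output as follows. This is a hedged sketch: the `route` function and the routing labels `human_review`, `accept`, and `flag` are hypothetical, and only the JSON field names come from the example above.

```python
import json

EXAMPLE = """
{
  "class": "explicit",
  "status": "ok",
  "review_required": false,
  "has_logical_issue": true,
  "has_issue_probability": 0.9993,
  "is_hidden_problem": false,
  "hidden_probability": 0.021,
  "errors": [
    {"type": "missing_premise", "probability": 0.923, "threshold": 0.54}
  ]
}
"""


def route(result_json: str) -> str:
    """Decide what to do with one judged text (hypothetical routing labels)."""
    result = json.loads(result_json)
    # Borderline or uncertain predictions go to a person first.
    if result.get("review_required") or result.get("status") == "uncertain":
        return "human_review"
    if result["class"] == "logical":
        return "accept"
    return "flag"  # `hidden` or `explicit`


decision = route(EXAMPLE)
error_types = [e["type"] for e in json.loads(EXAMPLE)["errors"]]
```

Checking `status` and `review_required` before the class keeps uncertain predictions from being acted on automatically.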
287
+
288
+ ---
289
+
290
+ ## Limitations
291
+
292
+ RQA-R2 still has important limits:
293
+
294
+ - it evaluates reasoning structure, not factual truth
295
+ - hidden problems remain partly subjective by nature
296
+ - the current benchmark is still synthetic and in-distribution
297
+ - real human-written texts and outputs from other LLMs may be harder
298
+ - the model should still be validated externally before being treated as a fully general reasoning judge
299
+
300
+ Also note:
301
+
302
+ - R2 was optimized toward low false positives, but on the current held-out synthetic test set the observed `Issue FPR` is `2.9%`, not `1.0%`
303
+ - if strict false-positive control is critical, threshold tuning may need to be tightened further for the target deployment environment
304
+
305
+ ---
306
+
307
+ ## Recommended Next Step
308
+
309
+ The best next step after R2 is external validation on:
310
+
311
+ - human-written argumentative texts
312
+ - outputs from other LLM families
313
+ - paraphrased and adversarially reworded samples
314
+ - harder hidden-problem cases
315
+
316
+ That is the correct way to turn a strong in-distribution result into a robust real-world system.
317
+
318
+ ---
319
+
320
+ ## Summary
321
+
322
+ RQA-R2 is a major upgrade over R1:
323
+
324
+ - better ontology
325
+ - better training logic
326
+ - better calibration
327
+ - better inference safety
328
+ - stronger held-out synthetic performance
329
+
330
+ R1 proved the idea.
331
+ **R2 is the first version that validates it end to end on held-out synthetic data.**