skatzR/RQA-R2

+---
+license: mit
+base_model:
+  - FacebookAI/xlm-roberta-large
+language:
+  - ru
+tags:
+  - reasoning
+  - logical-analysis
+  - text-classification
+  - ai-safety
+  - evaluation
+  - judge-model
+  - argumentation
+pipeline_tag: text-classification
+---
+# RQA — Reasoning Quality Analyzer (R2)
+**RQA-R2** is a **judge model** for reasoning-quality evaluation.
+It does **not** generate, rewrite, or explain text. Instead, it determines whether a text contains a reasoning problem, whether that problem is **hidden** or **explicit**, and which explicit error types are present.
+> RQA is a judge, not a teacher and not a generator.
+---
+## What Is New in R2 Compared to R1
+R2 is not just a retrain of R1. It is a full methodological upgrade.
+### Core differences
+- **R1** used a more limited 2-signal setup.
+- **R2** uses a strict **3-head ontology**:
+  - `has_issue`
+  - `is_hidden`
+  - `error_types`
+### Key improvements in R2
+- explicit hidden-problem modeling instead of weaker implicit logic
+- strict `logical / hidden / explicit` inference contract
+- four-way `train / val / calib / test` split, with `test` held out for final evaluation only
+- separate calibration split for temperatures and thresholds
+- per-class thresholds for error types
+- uncertainty-aware inference with `status=uncertain` and `review_required`
+- duplicate and conflict-duplicate filtering in the loader
+- truncation audit and richer evaluation reports
+- better optimizer setup for transformer fine-tuning
+- staged encoder fine-tuning with freeze/unfreeze
+- stronger schema/version safety for inference artifacts
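The card does not describe how the loader filters duplicates and conflict-duplicates. A minimal sketch, assuming duplicates are matched on case- and whitespace-normalized text and that a text appearing with conflicting labels is dropped entirely; `normalize` and `filter_duplicates` are hypothetical names, not the project's API:

```python
from collections import defaultdict


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())


def filter_duplicates(samples):
    """Keep one copy of plain duplicates; drop texts whose copies disagree on the label.

    `samples` is a list of (text, label) pairs. Returns (kept, conflicts).
    """
    groups = defaultdict(list)
    for text, label in samples:
        groups[normalize(text)].append((text, label))

    kept, conflicts = [], []
    for group in groups.values():
        labels = {label for _, label in group}
        if len(labels) == 1:
            kept.append(group[0])    # plain duplicate: keep the first copy
        else:
            conflicts.extend(group)  # conflict-duplicate: drop every copy
    return kept, conflicts


samples = [
    ("A implies B.", "logical"),
    ("a implies  b.", "logical"),     # duplicate after normalization
    ("C, therefore C.", "explicit"),
    ("C, therefore C.", "logical"),   # same text, conflicting label
]
kept, conflicts = filter_duplicates(samples)
```

Dropping conflicting copies rather than picking one is only one possible policy; it avoids training on a label that the data itself contradicts.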
+In short:
+> **R1** was a strong prototype.
+> **R2** is the first version that behaves like a full training + calibration + inference pipeline.
+---
+## What Problem RQA-R2 Solves
+Texts written by humans or LLMs can:
+- sound coherent
+- use correct vocabulary
+- appear persuasive
+...while still containing **reasoning problems** that are:
+- subtle
+- structural
+- hidden in argumentation
+RQA-R2 focuses specifically on **reasoning quality**, not on style, grammar, sentiment, or factual correctness.
+---
+## Model Overview
+| Property | Value |
+|---|---|
+| Model Type | Judge / Evaluator |
+| Base Encoder | [XLM-RoBERTa Large](https://huggingface.co/FacebookAI/xlm-roberta-large) |
+| Pooling | Mean pooling |
+| Heads | 3 (`has_issue`, `is_hidden`, `error_types`) |
+| Language | Russian |
+| License | MIT |
+---
+## What the Model Predicts
+RQA-R2 predicts three connected outputs.
+### 1. Logical Issue Detection
+- `has_logical_issue ∈ {false, true}`
+- calibrated probability available
+### 2. Hidden Problem Detection
+- `is_hidden_problem ∈ {false, true}`
+- evaluated only when a reasoning issue exists
+### 3. Explicit Error Type Classification
+If the text is classified as `explicit`, the model may assign one or more of the following error types:
+- `false_causality`
+- `unsupported_claim`
+- `overgeneralization`
+- `missing_premise`
+- `contradiction`
+- `circular_reasoning`
+This is a **multi-label** prediction head.
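Multi-label decoding with per-class thresholds could look roughly like the sketch below. The threshold values are assumptions for illustration (only `0.54` for `missing_premise` appears in this card), and `decode_error_types` is a hypothetical helper:

```python
ERROR_TYPES = [
    "false_causality", "unsupported_claim", "overgeneralization",
    "missing_premise", "contradiction", "circular_reasoning",
]

# Hypothetical thresholds; only 0.54 for missing_premise is shown in this card.
THRESHOLDS = {name: 0.5 for name in ERROR_TYPES}
THRESHOLDS["missing_premise"] = 0.54


def decode_error_types(probabilities: dict) -> list:
    """Return every error type whose probability reaches its own threshold."""
    return [
        {"type": name, "probability": probabilities[name], "threshold": THRESHOLDS[name]}
        for name in ERROR_TYPES
        if probabilities[name] >= THRESHOLDS[name]
    ]


probs = {name: 0.1 for name in ERROR_TYPES}
probs["missing_premise"] = 0.923
probs["contradiction"] = 0.61
errors = decode_error_types(probs)
```

Because each label is compared against its own threshold, several error types (or none) can be returned for the same text.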
+---
+## Ontology
+R2 uses a strict three-class reasoning ontology.
+### `logical`
+- no reasoning issue
+- no hidden problem
+- no explicit errors
+### `hidden`
+- reasoning problem exists
+- no explicit labeled fallacy
+- the issue is structural, implicit, or argumentative
+### `explicit`
+- reasoning problem exists
+- at least one explicit error type is present
+This ontology is enforced in both training and inference.
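The three class definitions above can be written as a small label validator. This is a sketch of the rules as stated in this section, not the project's actual enforcement code; `ontology_class` is a hypothetical name:

```python
def ontology_class(has_issue: bool, is_hidden: bool, error_types: list) -> str:
    """Map a label triple to `logical`, `hidden`, or `explicit`; reject invalid triples."""
    if not has_issue:
        if is_hidden or error_types:
            raise ValueError("a text without an issue cannot be hidden or carry errors")
        return "logical"
    if is_hidden:
        if error_types:
            raise ValueError("a hidden problem cannot carry explicit error types")
        return "hidden"
    if not error_types:
        raise ValueError("an explicit problem needs at least one error type")
    return "explicit"
```

For example, `ontology_class(True, False, ["contradiction"])` returns `"explicit"`, while a hidden sample that also lists an error type raises `ValueError`.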
+---
+## Inference Contract
+RQA-R2 uses gated inference:
+- if `has_issue = false` -> class is `logical`, no errors are returned
+- if `has_issue = true` and `is_hidden = true` -> class is `hidden`, no explicit errors are returned
+- if `has_issue = true` and `is_hidden = false` -> class is `explicit`, explicit errors may be returned
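The three gates above can be sketched as a single function over the thresholded head decisions. `apply_gates` is a hypothetical name, and the inputs are assumed to be decisions already taken for each head:

```python
def apply_gates(has_issue: bool, is_hidden: bool, errors: list) -> dict:
    """Apply the gates in order: issue head first, then hidden head, then error heads."""
    if not has_issue:
        return {"class": "logical", "errors": []}
    if is_hidden:
        return {"class": "hidden", "errors": []}
    # Explicit class: errors may be returned, and the list may also be empty.
    return {"class": "explicit", "errors": list(errors)}
```

The gating means that error-head outputs are discarded whenever an earlier gate closes, so a `logical` or `hidden` result never carries error types.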
+R2 also supports:
+- calibrated thresholds
+- `uncertain` mode
+- `review_required` for borderline cases
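The card does not state how a case is judged borderline. One plausible rule, shown here purely as an assumption, is to mark a decision `uncertain` when the calibrated probability falls within a margin of its threshold; `decision_status` and the `margin` value are hypothetical:

```python
def decision_status(probability: float, threshold: float, margin: float = 0.05) -> dict:
    """Flag a decision as uncertain when the probability is close to the threshold."""
    borderline = abs(probability - threshold) < margin
    return {
        "status": "uncertain" if borderline else "ok",
        "review_required": borderline,
    }
```

Under this rule a probability of `0.9993` against a threshold of `0.5` yields `status=ok`, while `0.52` against the same threshold yields `status=uncertain` with `review_required` set.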
+---
+## Architecture
+RQA-R2 is built on top of **XLM-RoBERTa Large** with:
+- mean pooling
+- separate projections per task
+- separate dropout per head
+- 3 task-specific heads
+- uncertainty-weighted multi-task training
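Mean pooling can be illustrated without any framework. The sketch below assumes the usual masked variant, where padding positions are excluded from the average; the card does not say whether R2 masks padding, and `mean_pool` is a hypothetical name:

```python
def mean_pool(token_vectors, attention_mask):
    """Average token vectors over positions where the mask is 1 (padding is ignored)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    return [value / max(count, 1) for value in total]


tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]  # last row stands in for padding
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

In the real model the pooled vector would feed the per-task projections; here it only shows that the padded row does not affect the result.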
+Training is hierarchical:
+- `has_issue` is trained on all samples
+- `is_hidden` is trained only on problem samples
+- `error_types` is trained only on explicit samples
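The hierarchical scheme amounts to giving each head its own sample mask and then combining the per-head losses. The sketch below assumes one common simplified form of uncertainty weighting, `exp(-s) * loss + s` with a learned scalar `s` per task; the exact formulation used in R2 is not given in this card, and all function names are hypothetical:

```python
import math


def head_masks(has_issue, is_hidden):
    """Build per-head sample masks from the gold labels of a batch."""
    issue_mask = [1] * len(has_issue)                # all samples
    hidden_mask = [int(h) for h in has_issue]        # problem samples only
    error_mask = [int(h and not hid) for h, hid in zip(has_issue, is_hidden)]  # explicit only
    return issue_mask, hidden_mask, error_mask


def masked_mean(losses, mask):
    """Average per-sample losses over the samples selected by the mask."""
    selected = sum(mask)
    return sum(l * m for l, m in zip(losses, mask)) / selected if selected else 0.0


def uncertainty_weighted(task_losses, log_vars):
    """Combine task losses as sum_i exp(-s_i) * L_i + s_i (assumed formulation)."""
    return sum(math.exp(-s) * loss + s for loss, s in zip(task_losses, log_vars))


# Batch of three samples: logical, hidden, explicit.
masks = head_masks([False, True, True], [False, True, False])
```

With masks like these, a logical sample never contributes gradient to the `is_hidden` or `error_types` heads, which matches the three bullets above.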
+---
+## Training and Calibration
+R2 keeps fitting, model selection, calibration, and final evaluation on separate splits:
+- `train` for fitting
+- `val` for model selection
+- `calib` for temperature scaling and threshold tuning
+- `test` for final held-out evaluation
+Calibration includes:
+- issue temperature
+- hidden temperature
+- per-class error temperatures
+- threshold selection for `has_issue`
+- threshold selection for `is_hidden`
+- per-class thresholds for error types
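Temperature scaling for a single binary head can be sketched as follows. This is a generic illustration using a grid search over the temperature, not the project's calibration code; the grid, the toy logits, and the function names are assumptions:

```python
import math


def calibrated_probability(logit: float, temperature: float) -> float:
    """Sigmoid of the logit divided by the temperature."""
    return 1.0 / (1.0 + math.exp(-logit / temperature))


def nll(logits, labels, temperature):
    """Mean binary negative log-likelihood at a given temperature."""
    eps = 1e-12
    total = 0.0
    for z, y in zip(logits, labels):
        p = calibrated_probability(z, temperature)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(logits)


def fit_temperature(logits, labels, grid=None):
    """Pick the temperature on the grid that minimizes NLL on the calibration split."""
    grid = grid or [0.5 + 0.05 * i for i in range(71)]  # 0.5 .. 4.0
    return min(grid, key=lambda t: nll(logits, labels, t))


# Toy calibration data with one confident mistake (logit 5.0, label 0).
toy_logits = [4.0, 3.0, -4.0, 5.0, -3.0, 4.0]
toy_labels = [1, 1, 0, 0, 0, 1]
temperature = fit_temperature(toy_logits, toy_labels)
```

On this toy data the fitted temperature is above 1, which softens overconfident probabilities. Thresholds would then be chosen on the calibrated probabilities of the same `calib` split.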
+---
+## Held-Out Synthetic Benchmark
+The following metrics were obtained on the current held-out synthetic test split used for R2:
+- `Issue`: `F1 = 0.988`, `FPR = 0.029`, `PR-AUC = 0.999`
+- `Hidden`: `F1 = 0.960`, `PR-AUC = 0.994`
+- `Errors`: `macro-F1 = 0.822`, `micro-F1 = 0.813`, `samples-F1 = 0.838`
+- `Top-level class macro-F1 = 0.964`
+- `Coverage = 95.6%`
+- `Uncertain rate = 4.4%`
+These are strong results for the current data regime.
+Important:
+> These metrics are measured on a held-out split from the current synthetic dataset.
+> They demonstrate that the R2 design works very well in-distribution, but they should not be interpreted as proof of universal real-world reasoning performance.
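For readers who want to reproduce this kind of report on their own data, the metric definitions are sketched below. The counts in the example are toy values chosen for illustration and are not taken from the R2 evaluation:

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives, and false negatives."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def false_positive_rate(fp: int, tn: int) -> float:
    """Share of truly negative samples that were flagged."""
    return fp / (fp + tn) if (fp + tn) else 0.0


def macro_and_micro_f1(per_class):
    """`per_class` is a list of (tp, fp, fn) tuples, one per error type."""
    macro = sum(f1_score(*counts) for counts in per_class) / len(per_class)
    tp, fp, fn = (sum(column) for column in zip(*per_class))
    return macro, f1_score(tp, fp, fn)


# Toy counts for two classes, not taken from the R2 evaluation.
toy = [(8, 2, 2), (1, 1, 3)]
macro, micro = macro_and_micro_f1(toy)
```

Macro-F1 averages per-class scores, so rare classes weigh as much as frequent ones, while micro-F1 pools the counts first. Coverage and uncertain rate are complementary: `95.6% + 4.4% = 100%`.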
+---
+## Training Data
+RQA-R2 was trained on a custom reasoning-quality dataset with:
+- `7292` total samples
+- `3150` logical texts
+- `4142` problematic texts
+- `1242` hidden problems
+- `2900` explicit cases
+Error-label counts:
+- `false_causality`: `518`
+- `unsupported_claim`: `524`
+- `overgeneralization`: `599`
+- `missing_premise`: `537`
+- `contradiction`: `475`
+- `circular_reasoning`: `540`
+Multi-label explicit cases:
+- `293`
+The current dataset is useful and already strong enough for training and benchmarking R2, but it is still primarily **synthetic** and should be expanded with real-world data in future versions.
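The counts above are internally consistent, which can be checked with a few lines of arithmetic. The variable names are illustrative; the numbers are the ones listed in this section:

```python
total, logical, problematic = 7292, 3150, 4142
hidden, explicit = 1242, 2900
multi_label_cases = 293
label_counts = {
    "false_causality": 518,
    "unsupported_claim": 524,
    "overgeneralization": 599,
    "missing_premise": 537,
    "contradiction": 475,
    "circular_reasoning": 540,
}

splits_add_up = (logical + problematic == total) and (hidden + explicit == problematic)

# Every explicit case has at least one label, so any labels beyond
# one per case must come from the multi-label cases.
extra_labels = sum(label_counts.values()) - explicit

print(splits_add_up, sum(label_counts.values()), extra_labels)  # True 3193 293
```

Since the 293 extra labels equal the 293 multi-label cases, the counts imply that each multi-label case carries exactly two error types.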
+---
+## Intended Use
+### Recommended for
+- reasoning-quality evaluation
+- LLM output auditing
+- AI safety pipelines
+- judge/reranker pipelines
+- pre-filtering for downstream review
+- analytical tooling around argument structure
+### Not intended for
+- text generation
+- explanation generation
+- automatic rewriting or correction
+- factual verification
+- legal or scientific truth adjudication
+---
+## Output Example
+```json
+{
+  "class": "explicit",
+  "status": "ok",
+  "review_required": false,
+  "has_logical_issue": true,
+  "has_issue_probability": 0.9993,
+  "is_hidden_problem": false,
+  "hidden_probability": 0.021,
+  "errors": [
+    {
+      "type": "missing_premise",
+      "probability": 0.923,
+      "threshold": 0.54
+    }
+  ]
+}
+```
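A downstream pipeline might consume this output as sketched below. The routing labels (`human_review`, `accept`, `flag`) and the `route` function are hypothetical and only illustrate how `status`, `review_required`, and `class` can drive a decision:

```python
import json


def route(result: dict) -> str:
    """Decide what to do with a text based on the judge's output."""
    if result["status"] == "uncertain" or result["review_required"]:
        return "human_review"
    if result["class"] == "logical":
        return "accept"
    return "flag"  # hidden or explicit reasoning problem


raw = """
{
  "class": "explicit",
  "status": "ok",
  "review_required": false,
  "has_logical_issue": true,
  "has_issue_probability": 0.9993,
  "is_hidden_problem": false,
  "hidden_probability": 0.021,
  "errors": [
    {"type": "missing_premise", "probability": 0.923, "threshold": 0.54}
  ]
}
"""
result = json.loads(raw)
decision = route(result)
error_names = [error["type"] for error in result["errors"]]
```

Checking `status` and `review_required` before `class` keeps borderline cases out of any automated accept or reject path.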
+---
+## Limitations
+RQA-R2 still has important limits:
+- it evaluates reasoning structure, not factual truth
+- hidden problems remain partly subjective by nature
+- the current benchmark is still synthetic and in-distribution
+- real human-written texts and outputs from other LLMs may be harder
+- the model should still be validated externally before being treated as a fully general reasoning judge
+Also note:
+- R2 was optimized for a low false-positive rate, but on the current held-out synthetic test set the observed `Issue FPR` is `2.9%` rather than `1.0%`
+- if strict false-positive control is critical, threshold tuning may need to be tightened further for the target deployment environment
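Tightening the `has_issue` threshold for a target false-positive rate can be sketched as follows, assuming a set of calibrated scores for texts known to be logical in the target environment and a decision rule that flags a text when `score > threshold`. Both function names are hypothetical:

```python
def threshold_for_fpr(negative_scores, target_fpr):
    """Return a threshold t so that at most `target_fpr` of negatives have score > t."""
    if not negative_scores:
        raise ValueError("need at least one negative example")
    ranked = sorted(negative_scores, reverse=True)
    allowed = int(target_fpr * len(ranked))  # negatives that may still be flagged
    if allowed >= len(ranked):
        return float("-inf")  # every negative may be flagged, so no constraint applies
    return ranked[allowed]


def observed_fpr(negative_scores, threshold):
    """Share of negatives that would be flagged at this threshold."""
    return sum(score > threshold for score in negative_scores) / len(negative_scores)


# Toy scores for 100 logical texts, evenly spread over 0.00 .. 0.99.
scores = [i / 100 for i in range(100)]
threshold = threshold_for_fpr(scores, target_fpr=0.01)
```

A stricter threshold lowers false positives at the cost of recall on real issues, so the trade-off should be checked on problematic texts from the same environment.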
+---
+## Recommended Next Step
+The best next step after R2 is external validation on:
+- human-written argumentative texts
+- outputs from other LLM families
+- paraphrased and adversarially reworded samples
+- harder hidden-problem cases
+That is the correct way to turn a strong in-distribution result into a robust real-world system.
+---
+## Summary
+RQA-R2 is a major upgrade over R1:
+- better ontology
+- better training logic
+- better calibration
+- better inference safety
+- stronger held-out synthetic performance
+R1 proved the idea.
+**R2 is the first version that validates it end to end on held-out synthetic data.**