DialFactSum: RoSE Benchmark Evaluation (SAMSum Subset)
This repository contains the evaluation results and model checkpoints for DialFactSum, a dialogue summarization framework optimized via ACU-driven Group Relative Policy Optimization (GRPO).
Our models are evaluated on the SAMSum subset of the RoSE (Robust Summarization Evaluation) benchmark, where DialFactSum-ACU-8B achieves the highest ACU recall, precision, F1, and length-normalized ACU among the compared models.
Key Improvements
- Strategic Token Reallocation: Unlike standard SFT models that suffer from over-truncation, DialFactSum-ACU-8B learns to expand sequence length strategically to capture more facts while maintaining high precision.
- State-of-the-Art Performance: Outperforms strong baselines (like Ctrl-DiaSumm and MV-BART) across all factual metrics.
- Unified Evaluation: All results are reported using a single GPT-4o (G-Eval) protocol, so scores are directly comparable across models.
Quantitative Results
We compare our DialFactSum-ACU-8B (Stage-2 GRPO) against its SFT base and several established sequence-to-sequence and LLM baselines.
| Model | ACU Recall ↑ | ACU Prec. ↑ | ACU F1 ↑ | Norm. ACU ↑ | Avg Words |
|---|---|---|---|---|---|
| DialFactSum-ACU-8B (GRPO) | 0.6929 | 0.5398 | 0.5685 | 0.4635 | 30.1 |
| DialFactSum-Base-8B (Stage-1 SFT) | 0.5095 | 0.5092 | 0.4593 | 0.4366 | 18.7 |
| Qwen3-8B (Standard SFT) | 0.5162 | 0.5170 | 0.4673 | 0.4402 | 18.6 |
| Ctrl-DiaSumm | 0.5123 | 0.5318 | 0.4698 | 0.4391 | 19.6 |
| MV-BART | 0.5247 | 0.4914 | 0.4629 | 0.4315 | 20.1 |
| BART | 0.4494 | 0.4784 | 0.4119 | 0.3948 | 16.8 |
| PEGASUS | 0.3884 | 0.4564 | 0.3639 | 0.3571 | 14.7 |
| UniLM | 0.3346 | 0.4014 | 0.3101 | 0.2984 | 16.1 |
Note: Zero-Shot models (not shown in this table) exhibit high Recall but suffer from extreme verbosity and low Precision. DialFactSum-ACU-8B provides the best equilibrium between length and factual density.
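The ACU F1 column is lower than the harmonic mean of the Recall and Precision columns: for the GRPO row, 2 × 0.6929 × 0.5398 / (0.6929 + 0.5398) ≈ 0.607, versus the reported 0.5685. One explanation consistent with this, though not stated in this card, is that F1 is computed per summary and then averaged. The sketch below illustrates that effect with made-up counts; `acu_scores` and every number in it are illustrative assumptions, not the evaluation code used here.

```python
from statistics import mean


def acu_scores(matched_ref, total_ref, supported_sum, total_sum):
    """Return (recall, precision, f1) for a single summary.

    matched_ref / total_ref: reference ACUs covered by the summary.
    supported_sum / total_sum: summary ACUs supported by the reference.
    (Assumed definitions; the card does not spell them out.)
    """
    recall = matched_ref / total_ref if total_ref else 0.0
    precision = supported_sum / total_sum if total_sum else 0.0
    denom = recall + precision
    f1 = 2 * recall * precision / denom if denom else 0.0
    return recall, precision, f1


# Hypothetical counts for two dialogues.
samples = [
    (4, 4, 2, 4),  # recall 1.00, precision 0.50
    (1, 4, 3, 4),  # recall 0.25, precision 0.75
]

per_sample = [acu_scores(*s) for s in samples]
avg_recall = mean(r for r, _, _ in per_sample)
avg_precision = mean(p for _, p, _ in per_sample)
avg_f1 = mean(f for _, _, f in per_sample)  # F1 averaged per summary

# F1 computed from the already-averaged recall and precision.
harmonic_of_means = 2 * avg_recall * avg_precision / (avg_recall + avg_precision)

print(f"averaged F1: {avg_f1:.4f}, harmonic mean of averages: {harmonic_of_means:.4f}")
```

When recall and precision vary across summaries, the averaged per-summary F1 falls below the harmonic mean of the averaged columns, which matches the pattern in the table.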
Visual Comparison
The following chart illustrates the performance of DialFactSum-ACU-8B compared to various baselines. Our model (Stage-2 GRPO) demonstrates a significant leap in both raw factual coverage (ACU Recall) and length-normalized density (Norm. ACU).
(Caption: Comparison of ACU and Normalized ACU scores across different models on the RoSE SAMSum test set.)
Key Findings from Experiments
- Resolution of the SFT Bottleneck: Standard SFT models converge to a conservative length (~18 words), which limits their factual recall. Our GRPO policy escapes this "truncation trap" by learning to use ~30 words to encapsulate significantly more atomic facts without sacrificing precision.
- Superior Factual Consistency: DialFactSum-ACU-8B achieves the highest ACU F1 (0.5685) and Normalized ACU (0.4635) among the compared models, which suggests that the bidirectional ACU reward effectively mitigates hallucinations and structural errors.
- Quality Preservation: Beyond factual metrics, our model maintains high linguistic quality. Evaluation via UniEval shows that DialFactSum-ACU-8B scores higher on Coherence (0.9507) and Relevance (0.9041) than its SFT predecessor, avoiding the "alignment tax" often seen after reinforcement learning.
- Bidirectional Constraint Necessity: Ablation studies confirm that removing the backward verification (hallucination penalty) or length penalty leads to either ungrounded generation or reward hacking through verbosity.
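The trade-off between length and recall described above is what a length-normalized score is meant to capture. As a minimal sketch, assuming an exponential penalty of the kind described for RoSE's normalized ACU (summaries longer than the reference are discounted, shorter ones are not), the idea looks like the following. The exact formula and constant behind the Norm. ACU column are not given in this card; `normalized_acu` and `alpha=6.0` are hypothetical.

```python
import math


def normalized_acu(acu_recall, summary_len, reference_len, alpha=6.0):
    """Discount ACU recall for summaries longer than the reference.

    alpha is an assumed constant controlling how quickly the penalty grows;
    the value 6.0 is purely illustrative.
    """
    # The exponent is 0 (no penalty) when the summary is no longer than
    # the reference, and negative otherwise.
    exponent = min(0.0, (1.0 - summary_len / reference_len) / alpha)
    return math.exp(exponent) * acu_recall


# Same recall, different lengths against a 20-word reference.
short = normalized_acu(0.6, summary_len=18, reference_len=20)
long_ = normalized_acu(0.6, summary_len=30, reference_len=20)
print(f"18 words: {short:.4f}, 30 words: {long_:.4f}")
```

Under a penalty of this shape, a longer summary only comes out ahead if the extra words add enough covered facts to outweigh the discount.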
Model Details
- Base Model: Qwen3-8B
- Training Stages:
  - Stage-1 SFT: Fine-tuned on distilled rationale trajectories.
  - Stage-2 GRPO: Optimized with a composite reward function ($R_{ACU} + R_{len} + R_{BERT}$).
- Evaluation: GPT-4o G-Eval (T=0) for ACU parsing and verification.
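To make the Stage-2 objective concrete, here is a minimal sketch of how a composite reward of the form $R_{ACU} + R_{len} + R_{BERT}$ could feed GRPO's group-relative advantage, where each sampled summary is scored against the mean and standard deviation of its own group. The function names, the unweighted sum, the use of the population standard deviation, and all reward values are assumptions for illustration; the card does not specify term weights or implementation details.

```python
from statistics import mean, pstdev


def composite_reward(r_acu, r_len, r_bert):
    """Unweighted sum of the three reward terms (weights are an assumption)."""
    return r_acu + r_len + r_bert


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (assumed choice)
    # eps keeps the division defined when all rewards in the group are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four summaries sampled from one dialogue.
group = [
    composite_reward(0.70, -0.05, 0.30),  # good coverage, slightly long
    composite_reward(0.50, 0.00, 0.28),   # concise, misses some facts
    composite_reward(0.80, -0.30, 0.31),  # high coverage, heavily penalized length
    composite_reward(0.40, 0.00, 0.25),   # concise, low coverage
]

advantages = group_relative_advantages(group)
for reward, adv in zip(group, advantages):
    print(f"reward={reward:.2f} advantage={adv:+.3f}")
```

In this toy group, the first sample gets the largest advantage: it covers more facts than the concise samples without incurring the length penalty that drags down the third one.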