DialFactSum: RoSE Benchmark Evaluation (SAMSum Subset)

This repository contains the evaluation results and model checkpoints for DialFactSum, a dialogue summarization framework optimized via Group Relative Policy Optimization (GRPO) with a reward built on Atomic Content Units (ACUs).

Our models are evaluated on the SAMSum subset of the RoSE (Robust Summarization Evaluation) benchmark, where DialFactSum-ACU-8B achieves the highest ACU Recall, Precision, F1, and Normalized ACU among the compared systems.

🚀 Key Improvements

Strategic Token Reallocation: Unlike standard SFT models, which over-truncate (about 19 words per summary), DialFactSum-ACU-8B learns to lengthen its summaries (about 30 words) to capture more facts without losing precision.
State-of-the-Art Performance: Outperforms strong baselines (such as Ctrl‑DiaSumm and MV‑BART) on ACU Recall, Precision, F1, and Normalized ACU.
Unified Evaluation: All results are reported under a single GPT-4o (G-Eval) protocol, so every system is scored by the same judge with the same settings.

📊 Quantitative Results

We compare our DialFactSum-ACU-8B (Stage-2 GRPO) against its SFT base and several established sequence-to-sequence and LLM baselines.

Model	ACU Recall ↑	ACU Precision ↑	ACU F1 ↑	Norm. ACU ↑	Avg. Words
DialFactSum-ACU-8B (GRPO)	0.6929	0.5398	0.5685	0.4635	30.1
DialFactSum-Base-8B (Stage-1 SFT)	0.5095	0.5092	0.4593	0.4366	18.7
Qwen3-8B (Standard SFT)	0.5162	0.5170	0.4673	0.4402	18.6
Ctrl‑DiaSumm	0.5123	0.5318	0.4698	0.4391	19.6
MV‑BART	0.5247	0.4914	0.4629	0.4315	20.1
BART	0.4494	0.4784	0.4119	0.3948	16.8
PEGASUS	0.3884	0.4564	0.3639	0.3571	14.7
UniLM	0.3346	0.4014	0.3101	0.2984	16.1

Note: Zero-shot models (not shown in this table) exhibit high Recall but produce much longer summaries with low Precision. Among the models listed, DialFactSum-ACU-8B offers the best balance between length and factual density.
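The table does not say how the corpus-level scores are aggregated. The sketch below is one plausible reading, not the authors' evaluation code: it assumes ACU Recall is the share of reference ACUs a summary covers, ACU Precision is the share of the summary's own ACUs that the judge marks as supported, and each column is the mean of per-dialogue scores. `AcuCounts` and the toy counts are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class AcuCounts:
    """Judge verdicts for one dialogue (hypothetical container)."""
    ref_total: int      # ACUs extracted from the reference summary
    ref_covered: int    # reference ACUs the system summary expresses
    sys_total: int      # ACUs extracted from the system summary
    sys_supported: int  # system ACUs the judge marks as supported


def _ratio(num, den):
    # Guard against empty ACU sets or a zero P + R denominator.
    return num / den if den else 0.0


def example_scores(c):
    """Recall, precision, and F1 for a single dialogue."""
    recall = _ratio(c.ref_covered, c.ref_total)
    precision = _ratio(c.sys_supported, c.sys_total)
    f1 = _ratio(2 * precision * recall, precision + recall)
    return recall, precision, f1


def corpus_scores(examples):
    """Assumed aggregation: average each metric over dialogues."""
    per_example = [example_scores(c) for c in examples]
    recall, precision, f1 = (mean(col) for col in zip(*per_example))
    return {"recall": recall, "precision": precision, "f1": f1}


# Toy data: one summary with full recall but half precision, one the reverse.
toy = [AcuCounts(4, 4, 4, 2), AcuCounts(4, 1, 2, 2)]
scores = corpus_scores(toy)
f1_of_means = _ratio(
    2 * scores["precision"] * scores["recall"],
    scores["precision"] + scores["recall"],
)
print(scores, f1_of_means)
```

In this toy case the averaged per-dialogue F1 is lower than the harmonic mean of the averaged Recall and Precision. If the table was aggregated the same way, that would explain why the reported ACU F1 for DialFactSum-ACU-8B (0.5685) sits below the harmonic mean of its reported Recall and Precision (about 0.607).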

🖼️ Visual Comparison

The following chart compares DialFactSum-ACU-8B with the baselines listed above. Relative to its SFT base, our Stage-2 GRPO model improves both raw factual coverage (ACU Recall, 0.5095 to 0.6929) and length-normalized density (Norm. ACU, 0.4366 to 0.4635).

Figure: Comparison of ACU and Normalized ACU scores across different models on the RoSE SAMSum test set.

🔍 Key Findings from Experiments

Resolution of the SFT Bottleneck: Standard SFT models converge to a conservative length (~19 words), which limits their factual recall. Our GRPO policy escapes this "truncation trap" by using ~30 words to cover substantially more atomic facts (ACU Recall 0.6929 vs. 0.5095 for the SFT base), while precision also rises slightly (0.5398 vs. 0.5092).
Superior Factual Consistency: DialFactSum-ACU-8B achieves the highest ACU F1 (0.5685) and Normalized ACU (0.4635) among the compared models, which is consistent with the bidirectional ACU reward mitigating hallucinations and structural errors.
Quality Preservation: Beyond factual metrics, our model maintains high linguistic quality. Under UniEval, DialFactSum-ACU-8B scores higher than its SFT predecessor on Coherence (0.9507) and Relevance (0.9041), avoiding the "alignment tax" often seen after reinforcement learning.
Bidirectional Constraint Necessity: Ablation studies confirm that both constraints are needed: removing the backward verification (hallucination penalty) leads to ungrounded generation, and removing the length penalty leads to reward hacking through verbosity.
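The length versus coverage trade-off in the first finding can be checked against the results table. The snippet below only re-derives relative changes between the reported Stage-1 SFT and Stage-2 GRPO rows; the dictionary keys are illustrative names, and no new measurements are introduced.

```python
# Values copied from the results table (SAMSum subset of RoSE).
SFT = {"recall": 0.5095, "precision": 0.5092, "f1": 0.4593,
       "norm_acu": 0.4366, "avg_words": 18.7}
GRPO = {"recall": 0.6929, "precision": 0.5398, "f1": 0.5685,
        "norm_acu": 0.4635, "avg_words": 30.1}


def relative_change_pct(before, after):
    """Percentage change from `before` to `after`."""
    return 100.0 * (after - before) / before


changes = {key: relative_change_pct(SFT[key], GRPO[key]) for key in SFT}

for name, pct in changes.items():
    print(f"{name:>10}: {pct:+.1f}%")
```

On these numbers, summaries become about 61% longer while ACU Recall rises by about 36% and ACU F1 by about 24%. Precision and Normalized ACU each improve by roughly 6%, so the extra words add covered facts without lowering length-normalized density.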

🛠️ Model Details

Base Model: Qwen3-8B
Training Stages:
1. Stage-1 SFT: Fine-tuned on distilled rationale trajectories.
2. Stage-2 GRPO: Optimized with a composite reward function ($R_{ACU} + R_{len} + R_{BERT}$).
Evaluation: GPT-4o G-Eval (T=0) for ACU parsing and verification.
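As a rough illustration of how a Stage-2 objective of this shape could be wired together, the sketch below sums three reward terms and then applies the group-relative normalization that gives GRPO its name (each sampled summary's reward is standardized against the other samples for the same dialogue). Only the three-term sum comes from the description above. The form of `length_reward`, the 30-word budget, the unweighted sum, and all numbers are assumptions made for the example, not the released training code.

```python
from statistics import mean, pstdev


def length_reward(n_words, budget=30):
    """Hypothetical length term: 0 within the budget, negative beyond it."""
    return -max(0, n_words - budget) / budget


def composite_reward(r_acu, r_len, r_bert):
    """Unweighted sum R_ACU + R_len + R_BERT (weights are an assumption)."""
    return r_acu + r_len + r_bert


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same prompt.

    Uses the population standard deviation; implementations differ on this.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# (r_acu, n_words, r_bert) for four sampled summaries of one dialogue.
# All values are made up for illustration.
group = [(0.60, 28, 0.90), (0.45, 18, 0.88), (0.65, 55, 0.91), (0.30, 15, 0.85)]
rewards = [composite_reward(a, length_reward(n), b) for a, n, b in group]
advantages = group_relative_advantages(rewards)

for (a, n, b), r, adv in zip(group, rewards, advantages):
    print(f"acu={a:.2f} words={n:>2} bert={b:.2f} reward={r:.3f} adv={adv:+.2f}")
```

In this toy group the 55-word sample has the highest ACU term but receives the lowest advantage, because the hypothetical length term outweighs its coverage gain. This is the kind of pressure a length penalty is meant to put on reward hacking through verbosity.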

Format: Safetensors
Model size: 8B params
Tensor type: BF16
