DialFactSum: RoSE Benchmark Evaluation (SAMSum Subset)

This repository contains the evaluation results and model checkpoints for DialFactSum, a dialogue summarization framework optimized via ACU-driven Group Relative Policy Optimization (GRPO).

Our models are evaluated on the SAMSum subset of the RoSE (Robust Summarization Evaluation) benchmark, where they achieve the highest factual coverage and density among the systems compared below.

🚀 Key Improvements

  • Strategic Token Reallocation: Unlike standard SFT models that suffer from over-truncation, DialFactSum-ACU-8B learns to expand sequence length strategically to capture more facts while maintaining high precision.
  • State-of-the-Art Performance: Outperforms strong baselines (such as Ctrl‑DiaSumm and MV‑BART) on every reported ACU metric.
  • Unified Evaluation: All results are reported using a single GPT-4o (G-Eval) protocol, so every system is scored under the same conditions.

📊 Quantitative Results

We compare our DialFactSum-ACU-8B (Stage-2 GRPO) against its SFT base and several established sequence-to-sequence and LLM baselines.

| Model | ACU Recall ↑ | ACU Prec. ↑ | ACU F1 ↑ | Norm. ACU ↑ | Avg Words |
|---|---|---|---|---|---|
| DialFactSum-ACU-8B (GRPO) | 0.6929 | 0.5398 | 0.5685 | 0.4635 | 30.1 |
| DialFactSum-Base-8B (Stage-1 SFT) | 0.5095 | 0.5092 | 0.4593 | 0.4366 | 18.7 |
| Qwen3-8B (Standard SFT) | 0.5162 | 0.5170 | 0.4673 | 0.4402 | 18.6 |
| Ctrl‑DiaSumm | 0.5123 | 0.5318 | 0.4698 | 0.4391 | 19.6 |
| MV‑BART | 0.5247 | 0.4914 | 0.4629 | 0.4315 | 20.1 |
| BART | 0.4494 | 0.4784 | 0.4119 | 0.3948 | 16.8 |
| PEGASUS | 0.3884 | 0.4564 | 0.3639 | 0.3571 | 14.7 |
| UniLM | 0.3346 | 0.4014 | 0.3101 | 0.2984 | 16.1 |

Note: Zero-Shot models (not shown in this table) exhibit high Recall but suffer from extreme verbosity and low Precision. DialFactSum-ACU-8B provides the best equilibrium between length and factual density.
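The reported ACU F1 values are lower than the harmonic mean of the reported Recall and Precision (for DialFactSum-ACU-8B, 2 · 0.6929 · 0.5398 / (0.6929 + 0.5398) ≈ 0.607, versus 0.5685 in the table). One reading consistent with this is that scores are computed per summary and then averaged. The sketch below illustrates that reading only. It assumes recall is the fraction of reference ACUs found in the summary and precision is the fraction of summary ACUs supported by the reference. `AcuCounts`, `acu_scores`, `macro_average`, and the toy counts are hypothetical and are not taken from the project's evaluation code.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class AcuCounts:
    """Per-summary ACU match counts (hypothetical container for illustration)."""
    matched_ref: int  # reference ACUs found in the summary
    total_ref: int    # all reference ACUs
    matched_sum: int  # summary ACUs supported by the reference
    total_sum: int    # all summary ACUs


def acu_scores(c: AcuCounts) -> tuple[float, float, float]:
    """Return (recall, precision, f1) for one summary; empty sets score 0."""
    recall = c.matched_ref / c.total_ref if c.total_ref else 0.0
    precision = c.matched_sum / c.total_sum if c.total_sum else 0.0
    denom = recall + precision
    f1 = 2 * recall * precision / denom if denom else 0.0
    return recall, precision, f1


def macro_average(samples: list[AcuCounts]) -> tuple[float, ...]:
    """Average each metric over summaries (assumed aggregation scheme)."""
    per_sample = [acu_scores(c) for c in samples]
    return tuple(mean(column) for column in zip(*per_sample))


if __name__ == "__main__":
    # Toy data: one high-recall/low-precision and one low-recall/high-precision summary.
    toy = [AcuCounts(4, 4, 1, 4), AcuCounts(1, 4, 2, 2)]
    print(macro_average(toy))
```

On the toy data, averaged recall and precision are both 0.625 while averaged F1 is 0.4, which shows how a per-summary F1 can sit below the harmonic mean of the averaged columns.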


🖼️ Visual Comparison

The following chart illustrates the performance of DialFactSum-ACU-8B compared to various baselines. Relative to its Stage-1 SFT base, our model (Stage-2 GRPO) raises raw factual coverage (ACU Recall) from 0.5095 to 0.6929 and length-normalized density (Norm. ACU) from 0.4366 to 0.4635.

Figure: Model Performance Comparison. Comparison of ACU and Normalized ACU scores across different models on the RoSE SAMSum test set.


πŸ” Key Findings from Experiments

  1. Resolution of the SFT Bottleneck: Standard SFT models converge to a conservative length (~18 words), which limits their factual recall. Our GRPO policy escapes this "truncation trap" by learning to use ~30 words to encapsulate significantly more atomic facts without sacrificing precision.
  2. Superior Factual Consistency: DialFactSum-ACU-8B achieves the highest ACU F1 (0.5685) and Normalized ACU (0.4635) among the compared systems, which suggests that the bidirectional ACU reward helps mitigate hallucinations and structural errors.
  3. Quality Preservation: Beyond factual metrics, our model maintains high linguistic quality. Evaluation via UniEval shows that DialFactSum-ACU-8B improves in Coherence (0.9507) and Relevance (0.9041) compared to its SFT predecessor, circumventing the common "alignment tax" seen in reinforcement learning.
  4. Bidirectional Constraint Necessity: Ablation studies confirm that removing the backward verification (hallucination penalty) or length penalty leads to either ungrounded generation or reward hacking through verbosity.
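The ablation finding above can be illustrated with a toy reward. This is a minimal sketch under stated assumptions, not the reward used in training. It assumes the forward term is reference-ACU coverage, the backward term subtracts the fraction of unsupported summary ACUs, and the length term is a linear penalty beyond a word budget. The function names, the `budget` of 30 words, and the `slope` of 0.02 are all hypothetical.

```python
def forward_coverage(matched_ref: int, total_ref: int) -> float:
    """Fraction of reference ACUs covered by the summary."""
    return matched_ref / total_ref if total_ref else 0.0


def hallucination_penalty(supported_sum: int, total_sum: int) -> float:
    """Fraction of summary ACUs NOT supported by the reference (backward check)."""
    return 1.0 - supported_sum / total_sum if total_sum else 0.0


def length_penalty(num_words: int, budget: int = 30, slope: float = 0.02) -> float:
    """Linear penalty for words beyond a budget (hypothetical form and values)."""
    return -slope * max(0, num_words - budget)


def reward(matched_ref: int, total_ref: int, supported_sum: int, total_sum: int,
           num_words: int, use_backward: bool = True, use_length: bool = True) -> float:
    """Toy bidirectional reward with switches that mimic the two ablations."""
    r = forward_coverage(matched_ref, total_ref)
    if use_backward:
        r -= hallucination_penalty(supported_sum, total_sum)
    if use_length:
        r += length_penalty(num_words)
    return r


if __name__ == "__main__":
    concise = dict(matched_ref=3, total_ref=4, supported_sum=3, total_sum=3, num_words=28)
    verbose = dict(matched_ref=4, total_ref=4, supported_sum=4, total_sum=10, num_words=80)
    # Forward-only reward: both constraints removed.
    print(reward(**concise, use_backward=False, use_length=False),
          reward(**verbose, use_backward=False, use_length=False))
    # Full reward: backward check and length penalty enabled.
    print(reward(**concise), reward(**verbose))
```

With both constraints disabled, the verbose candidate scores higher because it covers every reference ACU. With them enabled, the concise candidate wins, which is the behavior the ablation describes.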

🛠️ Model Details

  • Base Model: Qwen3-8B
  • Training Stages:
    1. Stage-1 SFT: Fine-tuned on distilled rationale trajectories.
    2. Stage-2 GRPO: Optimized with a composite reward function ($R_{ACU} + R_{len} + R_{BERT}$).
  • Evaluation: GPT-4o G-Eval (T=0) for ACU parsing and verification.
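For readers unfamiliar with GRPO, the sketch below shows how a composite reward of the form $R_{ACU} + R_{len} + R_{BERT}$ could feed the group-relative advantage that gives the method its name: each sampled summary's reward is standardized against the other samples drawn for the same dialogue. The unweighted sum follows the formula above. The function names, the use of population standard deviation, the `eps` value, and the example numbers are assumptions for illustration and do not describe the actual training code.

```python
from statistics import mean, pstdev


def composite_reward(r_acu: float, r_len: float, r_bert: float) -> float:
    """Unweighted sum of the three reward terms, as written in the card."""
    return r_acu + r_len + r_bert


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standardize rewards within one group of samples for the same prompt.

    eps avoids division by zero when all samples receive the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Three hypothetical summaries sampled for one dialogue.
    group = [
        composite_reward(0.7, 0.0, 0.3),
        composite_reward(0.5, -0.2, 0.3),
        composite_reward(0.9, -0.4, 0.2),
    ]
    print(group_relative_advantages(group))
```

Because advantages are relative within the group, a summary is reinforced only when it beats its sibling samples, so a long summary with high coverage can still receive a negative advantage if its length term drags the total below the group mean.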
Checkpoint: Safetensors format, 8B parameters, BF16 tensors.