Argument Quality Ranking - RoBERTa v3 (Best Model)

Fine-tuned RoBERTa-base for pairwise argument quality ranking, trained with margin ranking loss and evaluated with test-time pair flipping. It reaches 65.7% accuracy on both the in-domain and cross-topic test sets, within 0.8 points of GPT-5.5 zero-shot (66.5%) while using a 50x smaller model.

Model Details

Base model: roberta-base
Task: Pairwise argument quality classification (A wins / B wins)
Training data: IBM ArgQ corpus (3,587 pairs, 60 topics)
Input format: [CLS] topic [SEP] arg_a [SEP] arg_b
Inference: Test-time pair flipping (predict both orderings, average scores)

Key Improvements over v2

Margin ranking loss (margin=0.3) replaces cross-entropy, directly optimising the score gap between winner and loser
Test-time pair flipping scores both orderings of each pair and averages them, eliminating positional bias at inference
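
As a rough illustration of the margin ranking objective, the sketch below computes the loss for a single pair in plain Python. The function name and the example scores are hypothetical, and how the training code derives a scalar score per argument is not stated in this card. The formula itself is the standard one (the same quantity `torch.nn.MarginRankingLoss` computes for a target of 1), with the margin of 0.3 given above.

```python
def margin_ranking_loss(score_winner, score_loser, margin=0.3):
    """Hinge-style ranking loss for one pair.

    The loss is zero once the winner outscores the loser by at least
    `margin`; otherwise it grows linearly with the shortfall.
    """
    return max(0.0, margin - (score_winner - score_loser))


# Hypothetical scores, for illustration only.
well_separated = margin_ranking_loss(0.9, 0.2)  # gap 0.7 >= 0.3, no penalty
too_close = margin_ranking_loss(0.5, 0.4)       # gap 0.1 < 0.3, penalised
wrong_order = margin_ranking_loss(0.2, 0.6)     # loser scored higher, larger penalty
print(well_separated, too_close, wrong_order)
```

Unlike cross-entropy, this objective stops pushing on a pair once the gap exceeds the margin, so training effort concentrates on pairs that are close or mis-ordered.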

Performance

Split	Accuracy	F1	Precision	Recall
In-domain	65.7%	0.644	0.673	0.616
Cross-topic	65.7%	0.673	0.669	0.677

Accuracy is identical on the in-domain and cross-topic splits (65.7%), so there is no generalization gap in accuracy. This is the only model in our experiments to achieve this.
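
The F1 column can be sanity-checked against the precision and recall columns with the usual harmonic-mean formula. This is a small sketch using the rounded values from the table, so the last digit can differ slightly from the reported F1 (the in-domain value comes out just under the reported 0.644, which is consistent with rounding of the inputs).

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded precision/recall values taken from the table above.
in_domain = f1_score(0.673, 0.616)
cross_topic = f1_score(0.669, 0.677)
print(f"in-domain F1 ~ {in_domain:.3f}, cross-topic F1 ~ {cross_topic:.3f}")
```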

Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("SambhavSBU/argument-quality-roberta-v3")
model = AutoModelForSequenceClassification.from_pretrained("SambhavSBU/argument-quality-roberta-v3")

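def predict(topic, arg_a, arg_b):
    """Return "A" or "B" for the argument the model ranks higher."""
    def score(a, b):
        # Score one ordering: a positive value favours the first argument
        # (logit of class 1 minus logit of class 0).
        inp = tokenizer(topic + " [SEP] " + a, b,
                        return_tensors="pt", truncation=True, max_length=256)
        with torch.no_grad():
            logits = model(**inp).logits
        return (logits[0, 1] - logits[0, 0]).item()

    # test-time pair flipping: average both orderings
    margin = (score(arg_a, arg_b) - score(arg_b, arg_a)) / 2
    # an exact tie (margin == 0) falls through to "B"
    return "A" if margin > 0 else "B"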

topic = "We should ban social media"
arg_a = "Social media spreads misinformation at an unprecedented scale."
arg_b = "Social media connects people across the world."
print(f"Higher quality argument: {predict(topic, arg_a, arg_b)}")
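
To see why the flipping step in `predict` helps, here is a toy sketch that needs no model. The scorer, the quality values, and the bias constant are all hypothetical. It assumes the positional bias is a constant added to whichever argument sits in the first slot, in which case averaging the two orderings cancels it exactly. A real model's bias need not be this simple.

```python
def flipped_margin(score, arg_a, arg_b):
    """Average both orderings; a positive result means arg_a is preferred."""
    return (score(arg_a, arg_b) - score(arg_b, arg_a)) / 2


# Hypothetical toy scorer: true quality gap plus a constant bonus
# for whichever argument appears first.
QUALITY = {"strong": 0.4, "weak": 0.1}
POSITION_BIAS = 0.5


def biased_score(first, second):
    return QUALITY[first] - QUALITY[second] + POSITION_BIAS


# A single pass with "weak" in the first slot is dominated by the bias.
single_pass = biased_score("weak", "strong")
# Flipping cancels the bias and recovers the true quality gap.
flipped = flipped_margin(biased_score, "weak", "strong")
print(f"single pass: {single_pass:+.2f}, flipped: {flipped:+.2f}")
```

In this toy setup the single pass is positive (wrongly preferring the weaker argument), while the flipped margin equals the true quality difference and is negative.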

Citation

Code and full experiments: https://github.com/Sambhav101/Argument-Quality-Ranking

Model size: 0.1B params
Tensor type: F32
Format: Safetensors