Argument Quality Ranking - RoBERTa v3 (Best Model)

Fine-tuned RoBERTa-base for pairwise argument quality ranking, trained with margin ranking loss and evaluated with test-time pair flipping. Achieves 65.7% accuracy on both the in-domain and cross-topic test sets, within 0.8 percentage points of GPT-5.5 zero-shot (66.5%), with a 50x smaller model.

Model Details

  • Base model: roberta-base
  • Task: Pairwise argument quality classification (A wins / B wins)
  • Training data: IBM ArgQ corpus (3,587 pairs, 60 topics)
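  • Input format: [CLS] topic [SEP] arg_a [SEP] arg_b (schematic; the RoBERTa tokenizer inserts its own <s> and </s> special tokens around the two segments)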
  • Inference: Test-time pair flipping (predict both orderings, average scores)

Key Improvements over v2

  • Margin ranking loss (margin=0.3) replaces cross-entropy, directly optimising the score gap between winner and loser
  • Test-time pair flipping eliminates positional bias at inference
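
As a rough illustration of the first point, the per-pair margin ranking loss can be sketched in plain Python. This is the standard hinge form (the same per-pair formula as torch.nn.MarginRankingLoss with target +1), not a copy of the training code; the function name and the example scores are hypothetical, and how the scores for winner and loser are produced during training is not stated here.

```python
def margin_ranking_loss(score_winner, score_loser, margin=0.3):
    """Hinge-style ranking loss for one pair.

    The loss is zero once the winner's score exceeds the loser's
    by at least `margin`, and grows linearly otherwise.
    """
    return max(0.0, margin - (score_winner - score_loser))


# Hypothetical scores, for illustration only.
examples = [
    (1.0, 0.2),  # winner leads by more than the margin: no loss
    (0.5, 0.4),  # correct order, but the gap is smaller than the margin
    (0.1, 0.6),  # wrong order: penalised by the gap plus the margin
]
for winner, loser in examples:
    loss = margin_ranking_loss(winner, loser)
    print(f"winner={winner}, loser={loser}, loss={loss:.3f}")
```

Unlike cross-entropy, this objective stops pushing once a pair is separated by the margin, so the gradient concentrates on pairs that are misordered or only narrowly correct.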

Performance

Split        Accuracy  F1     Precision  Recall
In-domain    65.7%     0.644  0.673      0.616
Cross-topic  65.7%     0.673  0.669      0.677

Accuracy is identical on the in-domain and cross-topic splits, i.e. a zero generalization gap in accuracy -- the only model in our experiments to achieve this.
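
As a quick sanity check on the table, F1 can be recomputed from the reported precision and recall using the usual harmonic-mean definition. The helper name f1_from is hypothetical, and because the inputs are already rounded to three decimals, the recomputed values should only be expected to agree with the reported F1 to within rounding.

```python
def f1_from(precision, recall):
    """F1 as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
reported = {
    "In-domain": (0.673, 0.616, 0.644),
    "Cross-topic": (0.669, 0.677, 0.673),
}

for split, (precision, recall, f1) in reported.items():
    recomputed = f1_from(precision, recall)
    print(f"{split}: recomputed F1 = {recomputed:.3f}, reported F1 = {f1:.3f}")
```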

Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("SambhavSBU/argument-quality-roberta-v3")
model = AutoModelForSequenceClassification.from_pretrained("SambhavSBU/argument-quality-roberta-v3")

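def predict(topic, arg_a, arg_b):
    """Return "A" or "B" for the higher-quality argument on the given topic."""
    def score(a, b):
        # Encode (topic + first argument, second argument) as a sequence pair.
        inp = tokenizer(topic + " [SEP] " + a, b,
                        return_tensors="pt", truncation=True, max_length=256)
        with torch.no_grad():
            logits = model(**inp).logits
        # Logit gap in favour of class 1; positive means the first argument is preferred.
        return (logits[0, 1] - logits[0, 0]).item()

    # Test-time pair flipping: score both orderings and average them.
    # The flipped score is negated because its sign refers to the swapped pair.
    margin = (score(arg_a, arg_b) - score(arg_b, arg_a)) / 2
    return "A" if margin > 0 else "B"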

topic = "We should ban social media"
arg_a = "Social media spreads misinformation at an unprecedented scale."
arg_b = "Social media connects people across the world."
print(f"Higher quality argument: {predict(topic, arg_a, arg_b)}")
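
To see why averaging both orderings helps, the toy sketch below replaces the model with a hypothetical scorer whose output is the true quality gap plus a constant bias toward whichever argument comes first. The names biased_score and flipped_margin, the quality values, and the bias of 0.4 are all made up for illustration. Under this assumption the bias cancels exactly in the flipped margin; a real model's positional bias need not be purely additive, so the cancellation there is only approximate.

```python
def biased_score(quality_first, quality_second, position_bias=0.4):
    """Toy stand-in for the model: quality gap plus a bias toward the first slot."""
    return (quality_first - quality_second) + position_bias


def flipped_margin(quality_a, quality_b, position_bias=0.4):
    """Average of the forward score and the negated backward score."""
    forward = biased_score(quality_a, quality_b, position_bias)
    backward = biased_score(quality_b, quality_a, position_bias)
    return (forward - backward) / 2


# Hypothetical qualities: argument B is the better one.
quality_a, quality_b = 0.5, 0.75

single_pass = biased_score(quality_a, quality_b)
both_orders = flipped_margin(quality_a, quality_b)

print("single pass picks:", "A" if single_pass > 0 else "B")
print("pair flipping picks:", "A" if both_orders > 0 else "B")
```

In this toy setting the single forward pass is pulled toward A by the bias, while the flipped margin recovers the underlying quality gap regardless of the bias value.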

Citation

Code and full experiments: https://github.com/Sambhav101/Argument-Quality-Ranking
