eXTC — ContractNLI (3-class legal NLI)

Anonymized artifact for a paper under double-blind review. Author identity and institution will be revealed at camera-ready.

This is the final-stage checkpoint of eXTC (eXplainable Text Classifier) for 3-way natural language inference over non-disclosure-agreement (NDA) clauses, from the ContractNLI benchmark.

  • Input: a contract clause paired with a hypothesis.
  • Label: entailment, contradiction, or not_mentioned.
  • Output: a free-text reasoning trace followed by a final LABEL: <label> line — the reasoning serves as a local, inspectable explanation of the prediction.
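
Downstream code usually needs the label separated from the reasoning trace. The sketch below shows one way to do that. It assumes the final line has the exact form `LABEL: <label>` on its own line; the helper name `parse_output` and the regular expression are illustrative and not part of the release.

```python
import re
from typing import Optional, Tuple

LABELS = ("entailment", "contradiction", "not_mentioned")

# Assumed pattern: the label sits on its own line as "LABEL: <label>".
_LABEL_RE = re.compile(r"^\s*LABEL:\s*(\w+)\s*$", re.MULTILINE)


def parse_output(text: str) -> Tuple[str, Optional[str]]:
    """Split a generation into (reasoning, label); label is None if invalid."""
    matches = list(_LABEL_RE.finditer(text))
    if not matches:
        return text.strip(), None
    last = matches[-1]  # use the last LABEL line if the trace contains several
    label = last.group(1).lower()
    if label not in LABELS:
        return text.strip(), None
    return text[: last.start()].strip(), label
```

A generation with no parseable label returns `None`, which is one plausible way to count invalid outputs.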

eXTC pipeline

eXTC is a three-stage explainable classifier. This checkpoint is the output of all three stages:

Qwen3-4B (base)
  │
  ├─ Stage I — SOP Learning (structured prompt optimization)
  │     A natural-language rulebook (Standard Operating Procedure) is learned
  │     via a structured prompt-optimization algorithm; used only to ground the
  │     teacher in Stage II (not present at inference).
  │
  ├─ Stage II — SOP-Grounded Reasoning Distillation (R-SFT)
  │     Teacher: gpt-4.1-mini, prompted with <SOP, input>, rejection sampling
  │     (M=4 traces/example, keep first trace whose label is correct).
  │     Student: Qwen3-4B fine-tuned with LoRA (r=64, alpha=128, 2 epochs) on the
  │     accepted reasoning+label traces, with class-balanced upsampling.
  │
  └─ Stage III — Beyond SOP via RL (BD-GRPO)
        Balanced Dynamic GRPO: per-class oversampling, then drop zero-advantage
        (homogeneous-rollout) groups and keep a class-balanced batch of
        informative groups, with a binary label-correctness reward.
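
The two filtering steps in Stages II and III can be sketched as follows. This is a minimal illustration under stated assumptions, not the training code: `sample_trace`, `per_class`, and the function names are hypothetical, and the per-class oversampling step and the exact balancing rule are not specified here, so the sketch simply caps the number of kept groups per class.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Group = Tuple[str, List[int]]  # (gold class, binary rewards of the rollouts)


def first_correct_trace(
    sample_trace: Callable[[], Tuple[str, str]], gold: str, m: int = 4
) -> Optional[str]:
    """Stage II sketch: draw up to m teacher traces, keep the first with the gold label."""
    for _ in range(m):
        reasoning, label = sample_trace()
        if label == gold:
            return reasoning
    return None  # the example is rejected if no sampled trace is correct


def is_informative(rewards: Sequence[int]) -> bool:
    """Group-relative advantages are all zero when every rollout gets the same reward."""
    return len(set(rewards)) > 1


def balanced_informative_batch(
    groups: List[Group], per_class: int, seed: int = 0
) -> List[Group]:
    """Stage III sketch: drop homogeneous groups, keep at most per_class groups per class."""
    rng = random.Random(seed)
    by_class: Dict[str, List[Group]] = defaultdict(list)
    for gold, rewards in groups:
        if is_informative(rewards):
            by_class[gold].append((gold, rewards))
    batch: List[Group] = []
    for gold in sorted(by_class):
        kept = by_class[gold]
        rng.shuffle(kept)  # assumed: random choice among informative groups
        batch.extend(kept[:per_class])
    return batch
```

With a binary reward, a group where every rollout is correct (or every rollout is wrong) contributes no gradient signal under group-normalized advantages, which is why such groups can be dropped.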

The released checkpoint is the one with the best validation macro-F1 along the RL training trajectory. Test metrics are reported for this selected checkpoint on the held-out test set.

Test metrics

ContractNLI 3-class test set (n=2091), greedy decoding (temperature=0):

Metric               Value
Balanced accuracy    0.8824
Macro F1             0.8494
Accuracy             0.8871
Invalid output rate  0.001
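
For reference, the four metrics can be computed from gold and predicted labels as sketched below. This uses the standard definitions (balanced accuracy as mean per-class recall, macro F1 as the unweighted mean of per-class F1). How invalid outputs were scored in the reported numbers is not stated; the sketch assumes a missing label (`None`) counts as an error for every class, and the function name `evaluate` is illustrative.

```python
from typing import Dict, List, Optional

LABELS = ["entailment", "contradiction", "not_mentioned"]


def evaluate(gold: List[str], pred: List[Optional[str]]) -> Dict[str, float]:
    """pred[i] is None when no valid label was parsed (assumed to count as an error)."""
    n = len(gold)
    recalls, f1s = [], []
    for c in LABELS:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return {
        "balanced_accuracy": sum(recalls) / len(LABELS),
        "macro_f1": sum(f1s) / len(LABELS),
        "accuracy": sum(g == p for g, p in zip(gold, pred)) / n,
        "invalid_output_rate": sum(p is None for p in pred) / n,
    }
```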

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

repo = "extc-anon/extc-contractnli"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, dtype=torch.bfloat16, device_map="auto")

prompt = (
    "Premise: The Receiving Party shall not disclose Confidential Information to "
    "any third party without prior written consent.\n"
    "Hypothesis: The Receiving Party may share Confidential Information with its "
    "external auditors without consent.\n\n"
    "Classify the hypothesis as entailment, contradiction, or not_mentioned. "
    "Provide your reasoning and then the label."
)
text = tok.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True, tokenize=False,
)
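# Tokenize and keep the attention mask so generate() does not have to infer it.
enc = tok(text, return_tensors="pt").to(model.device)
out = model.generate(**enc, max_new_tokens=1024, do_sample=False)
# Decode only the newly generated tokens (reasoning trace plus final LABEL line).
print(tok.decode(out[0][enc["input_ids"].shape[1]:], skip_special_tokens=True))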

Format

  • Standard HuggingFace transformers (safetensors, bfloat16, ~7.5 GB).
  • Architecture: Qwen3ForCausalLM, 4.02B parameters.
  • Test numbers above use greedy decoding (do_sample=False).
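
As a rough consistency check on the checkpoint size, the arithmetic below assumes the weights dominate the file size and every parameter is stored in bfloat16 (2 bytes). Under that assumption the quoted ~7.5 figure corresponds to binary gigabytes (GiB) rather than decimal GB.

```python
params = 4.02e9          # parameter count quoted above
bytes_per_param = 2      # bfloat16 uses 2 bytes per parameter

total_bytes = params * bytes_per_param
gb = total_bytes / 1e9       # decimal gigabytes
gib = total_bytes / 2**30    # binary gigabytes (GiB)

print(f"{gb:.2f} GB, {gib:.2f} GiB")  # prints "8.04 GB, 7.49 GiB"
```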

License

Apache 2.0 (matches the Qwen3 base model).

Citation

Anonymous paper citation will be added at camera-ready.
