SciJudge-4B-2605

SciJudge-4B-2605 is a Qwen3-4B-Instruct-2507 model fine-tuned for scientific paper evaluation. Given the titles, abstracts, and publication dates of two papers, it predicts which of the two has the higher citation impact.

This release is part of "AI Can Learn Scientific Taste" (arXiv:2603.14473). The larger companion model is SciJudge-30B-2605, and the benchmark dataset is SciJudgeBench.

Resources: Project page and GitHub repository.

Usage

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "OpenMOSS-Team/SciJudge-4B-2605"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant. You first think about the reasoning process in your mind and then provide the user with the answer."},
    {"role": "user", "content": "Today is 2025-12-10. Based on the titles, abstracts, and publication dates of the following two papers A and B, determine which paper has a higher citation count.\nShow your reasoning process in <reason> </reason> tags. And return the final answer in <answer> </answer> tags. The final answer should contain only 'A' or 'B'.\n\nPaper A:\nTitle: ...\nAbstract: ...\nDate: ...\n\nPaper B:\nTitle: ...\nAbstract: ...\nDate: ..."}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
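# do_sample=True ensures the sampling settings below take effect even if the
# loaded generation config does not already enable sampling.
outputs = model.generate(**inputs, max_new_tokens=2048, do_sample=True, temperature=0.7, top_p=0.8, top_k=20)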
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(response)
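
The prompt asks the model to put its verdict inside <answer> tags, so downstream code usually has to build the user message from paper metadata and then pull the letter back out of the generated text. The following is a minimal sketch under that assumption. `build_user_prompt` and `parse_answer` are hypothetical helper names, not part of the release, and the parser returns None when the response does not follow the requested format.

```python
import re
from typing import Dict, Optional

# Same wording as the user message in the usage example, with placeholders.
PROMPT_TEMPLATE = (
    "Today is {today}. Based on the titles, abstracts, and publication dates of the "
    "following two papers A and B, determine which paper has a higher citation count.\n"
    "Show your reasoning process in <reason> </reason> tags. And return the final answer "
    "in <answer> </answer> tags. The final answer should contain only 'A' or 'B'.\n\n"
    "Paper A:\nTitle: {title_a}\nAbstract: {abstract_a}\nDate: {date_a}\n\n"
    "Paper B:\nTitle: {title_b}\nAbstract: {abstract_b}\nDate: {date_b}"
)


def build_user_prompt(today: str, paper_a: Dict[str, str], paper_b: Dict[str, str]) -> str:
    """Fill the template from two dicts with 'title', 'abstract', and 'date' keys (hypothetical schema)."""
    return PROMPT_TEMPLATE.format(
        today=today,
        title_a=paper_a["title"], abstract_a=paper_a["abstract"], date_a=paper_a["date"],
        title_b=paper_b["title"], abstract_b=paper_b["abstract"], date_b=paper_b["date"],
    )


def parse_answer(response: str) -> Optional[str]:
    """Return 'A' or 'B' from the last <answer> tag, or None if no well-formed tag is found."""
    matches = re.findall(r"<answer>\s*([AB])\s*</answer>", response)
    return matches[-1] if matches else None
```

With these helpers, the user message in `messages` can be produced by `build_user_prompt(...)`, and `parse_answer(response)` turns the decoded output into a label. Because generation is sampled, treating a None result as "no prediction" rather than guessing is the safer default.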

Training Details

  • Base model: Qwen/Qwen3-4B-Instruct-2507
  • Training method: GRPO with DAPO loss
  • Reward: external preference reward for citation-based pairwise judgment
  • Training data: 720,341 preference pairs from SciJudgeBench
  • Precision: bfloat16
  • KL coefficient: 0.03
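
For readers unfamiliar with GRPO, its core idea is to sample several responses per prompt and score each one relative to the others in its group. The sketch below illustrates only that group-relative normalization. It is not the training code for this model: the binary correctness reward is an assumption made for illustration (the card says only that an external preference reward is used), and the function names, the `eps` value, and the use of the population standard deviation are likewise illustrative choices.

```python
from statistics import mean, pstdev
from typing import List, Optional


def binary_reward(predicted: Optional[str], gold: str) -> float:
    """Hypothetical reward: 1.0 if the parsed answer matches the label, else 0.0."""
    return 1.0 if predicted == gold else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of responses sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one pair whose correct answer is 'A'.
rewards = [binary_reward(p, "A") for p in ["A", "B", None, "A"]]
advantages = group_relative_advantages(rewards)
```

When every response in a group receives the same reward, all advantages are zero and that group contributes no learning signal.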

Evaluation

Accuracy (%) on the SciJudgeBench test split, the 1,000-example MAIN_1000 in-domain evaluation set:

Model                    CS     Math   Physics  Others  Avg.
Qwen3-4B-Instruct-2507   58.06  71.08  51.76    55.77   58.1
SciJudge-4B-2605         78.63  82.84  74.12    75.48   77.3
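
The Avg. column is not the unweighted mean of the four category columns (for SciJudge-4B-2605 that mean would be about 77.77, not 77.3), which suggests it is overall accuracy across all examples. That reading is an inference from the numbers, not something the card states. A minimal sketch of both quantities follows, assuming each evaluation record is a (category, predicted, gold) tuple; the record layout and function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

Record = Tuple[str, Optional[str], str]  # (category, predicted label, gold label)


def accuracy_by_category(records: Iterable[Record]) -> Tuple[Dict[str, float], float]:
    """Return per-category accuracy and overall (example-weighted) accuracy, both in percent."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for category, predicted, gold in records:
        total[category] += 1
        # A missing prediction (None) never equals the gold label, so it counts as wrong.
        correct[category] += int(predicted == gold)
    per_category = {c: 100.0 * correct[c] / total[c] for c in total}
    n = sum(total.values())
    overall = 100.0 * sum(correct.values()) / n if n else 0.0
    return per_category, overall


records = [("CS", "A", "A"), ("CS", "B", "A"), ("Math", "A", "A"), ("Physics", None, "B")]
per_category, overall = accuracy_by_category(records)
```

Counting unparseable responses as errors, as done here, is one possible convention; the evaluation protocol behind the reported numbers may handle them differently.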

Citation

@misc{tong2026ailearnscientifictaste,
      title={AI Can Learn Scientific Taste}, 
      author={Jingqi Tong and Mingzhe Li and Hangcheng Li and Yongzhuo Yang and Yurong Mou and Weijie Ma and Zhiheng Xi and Hongji Chen and Xiaoran Liu and Qinyuan Cheng and Ming Zhang and Qiguang Chen and Weifeng Ge and Qipeng Guo and Tianlei Ying and Tianxiang Sun and Yining Zheng and Xinchi Chen and Jun Zhao and Ning Ding and Xuanjing Huang and Yugang Jiang and Xipeng Qiu},
      year={2026},
      eprint={2603.14473},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2603.14473}, 
}