nielsr (HF Staff) committed
Commit 6361c8c · verified · 1 Parent(s): 7b8ca86

Improve model card: Add pipeline tag, library name, paper link, abstract, and update GitHub/citation


This PR significantly improves the model card for `TMLR-Group-HF/Entropy-Qwen3-4B-Base` by:

* Adding `pipeline_tag: text-generation` so the model is discoverable under text-generation models.
* Adding `library_name: transformers` to enable the "Use in Transformers" widget and provide a clear usage indicator.
* Linking directly to the Hugging Face paper: [Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models](https://huggingface.co/papers/2508.00410).
* Including the full paper abstract to provide comprehensive context on the model's background and methodology.
* Updating the GitHub repository link to point to the correct repository (`tmlr-group/Co-rewarding`) and formatting it as a proper Markdown link.
* Updating the BibTeX citation to match the one provided in the official GitHub repository.

These updates will make the model more accessible, discoverable, and informative for researchers and developers on the Hugging Face Hub.
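As a rough illustration of what the added `library_name: transformers` metadata points users to (this snippet is not part of the PR itself), the checkpoint would typically be loaded with the standard Transformers auto classes; the prompt and generation settings below are placeholders:

```python
# Minimal sketch, assuming the checkpoint loads with the standard
# AutoModelForCausalLM / AutoTokenizer classes; the prompt and generation
# settings are illustrative, not taken from the model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TMLR-Group-HF/Entropy-Qwen3-4B-Base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

prompt = "Question: What is 12 * 17? Let's think step by step."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```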

Files changed (1):
README.md (+16, -10)
README.md CHANGED
@@ -1,19 +1,25 @@
  ---
  license: mit
+ pipeline_tag: text-generation
+ library_name: transformers
  ---
- ## TMLR-Group-HF/Entropy-Qwen3-4B-Base
-
- This is the Qwen3-4B-Base model trained by Entropy Minimization method using MATH training set.
-
- If you are interested in Co-Reward, you can find more details on our Github Repo [https://github.com/tmlr-group/Co-Reward].
-
- ## Citation
-
- ```
- @article{zhang2025coreward,
- title={Co-Reward: Self-supervised Reinforcement Learning for Large Language Model Reasoning via Contrastive Agreement},
- author={Zizhuo Zhang and Jianing Zhu and Xinmu Ge and Zihua Zhao and Zhanke Zhou and Xuan Li and Xiao Feng and Jiangchao Yao and Bo Han},
- journal={arXiv preprint arXiv:2508.00410}
- year={2025},
+
+ # Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models
+
+ This is the **TMLR-Group-HF/Entropy-Qwen3-4B-Base** model. It is the Qwen3-4B-Base model trained by the Entropy Minimization method using the MATH training set, as presented in the paper [Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models](https://huggingface.co/papers/2508.00410).
+
+ For more details on the Co-rewarding framework, code, and other checkpoints, please refer to the official GitHub repository: [https://github.com/tmlr-group/Co-rewarding](https://github.com/tmlr-group/Co-rewarding).
+
+ ## Abstract
+ While reinforcement learning with verifiable rewards (RLVR) is effective to improve the reasoning ability of large language models (LLMs), its reliance on human-annotated labels leads to the scaling up dilemma, especially for complex tasks. Recent self-rewarding methods investigate a label-free alternative to unlock the reasoning capabilities of LLMs, yet they frequently encounter the non-negligible training collapse issue, as the single-view supervision signal easily forms the self-consistent illusion, yielding the reward hacking. Inspired by the success of self-supervised learning, we propose *Co-rewarding*, a novel self-supervised RL framework that improves training stability by seeking complementary supervision from another views. Specifically, we instantiate Co-rewarding in two ways: (1) *Co-rewarding-I* is a data-side instantiation that derives reward signals from contrastive agreement across semantically analogous questions; and (2) *Co-rewarding-II* is a model-side instantiation that maintains a slowly-updated reference teacher with pseudo labels to realize self-distillation. Intuitively, such instantiations introduce different levels of discrepancy to increase the difficulty of training collapse on trivial reasoning solutions. Empirically, Co-rewarding exhibits stable training across various setups, and outperforms other self-rewarding baselines by +3.31% improvements on average on multiple mathematical reasoning benchmarks, especially by +7.49% on Llama-3.2-3B-Instruct. Notably, Co-rewarding reaches or even surpasses RLVR with ground-truth (GT) label in several cases, such as a Pass@1 of 94.01% on GSM8K with Qwen3-8B-Base remarkably higher than GT.
+
+ ## Citation
+ If you use our datasets or models, please cite our paper!
+ ```bibtex
+ @article{zhang2025co,
+ title={Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models},
+ author={Zhang, Zizhuo and Zhu, Jianing and Ge, Xinmu and Zhao, Zihua and Zhou, Zhanke and Li, Xuan and Feng, Xiao and Yao, Jiangchao and Han, Bo},
+ journal={arXiv preprint arXiv:2508.00410},
+ year={2025}
  }
  ```
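Outside the diff itself, for readers skimming the abstract above: Co-rewarding-II is described as keeping a slowly-updated reference teacher whose pseudo labels supervise training. Below is a heavily simplified sketch of one way such a teacher could be maintained; the exponential-moving-average rule, the `tau` value, and all names are illustrative assumptions, not the paper's actual implementation:

```python
# Rough sketch of a slowly-updated reference teacher (as described in the
# abstract for Co-rewarding-II). The EMA update rule and `tau` are assumed
# for illustration only.
import copy

import torch


def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, tau: float = 0.999) -> None:
    """Move each teacher parameter a small step toward the student's parameters."""
    with torch.no_grad():
        for t_param, s_param in zip(teacher.parameters(), student.parameters()):
            t_param.mul_(tau).add_(s_param, alpha=1.0 - tau)


# Hypothetical usage: the teacher starts as a frozen copy of the policy
# being trained and is nudged toward it after each update step, e.g.
#   teacher = copy.deepcopy(student).eval()
#   ... compute rewards from teacher pseudo labels, update student ...
#   ema_update(teacher, student)
```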