SciGLM: Training Scientific Language Models with Self-Reflective Instruction Annotation and Tuning

📃 [SciGLM] [GitHub]

SciGLM is a suite of scientific language models able to conduct college-level scientific reasoning. Central to our approach is a novel self-reflective instruction annotation framework to address the data scarcity challenge in the science domain. This framework leverages existing LLMs to generate step-by-step reasoning for unlabelled scientific questions, followed by a process of self-reflective critic-and-revise. Applying this framework, we curated SciInstruct, a diverse and high-quality dataset encompassing physics, chemistry, math, and formal proofs.
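The annotate, critique, and revise loop described above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the `annotate` function, the `Annotation` record, the three callables (`generate`, `critique`, `revise`), and the `max_rounds` cap are all hypothetical names introduced here, and each callable stands in for a call to an existing LLM.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Annotation:
    question: str
    solution: str
    revisions: int  # number of revise passes that were applied


def annotate(
    question: str,
    generate: Callable[[str], str],
    critique: Callable[[str, str], Optional[str]],
    revise: Callable[[str, str, str], str],
    max_rounds: int = 2,  # hypothetical cap, not taken from the paper
) -> Annotation:
    """Draft a step-by-step solution, then critique and revise it."""
    solution = generate(question)  # initial step-by-step reasoning
    revisions = 0
    for _ in range(max_rounds):
        feedback = critique(question, solution)
        if feedback is None:  # the critic found nothing to fix
            break
        solution = revise(question, solution, feedback)
        revisions += 1
    return Annotation(question, solution, revisions)
```

Passing the LLM calls in as plain callables keeps the loop independent of any particular model or API, so the same sketch can be exercised with simple stand-in functions before wiring in a real model.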

SciInstruct

SciInstruct is composed as follows:

| Subject | Math | Physics & Chemistry | Formal Proofs (Lean) | Total |
| --- | --- | --- | --- | --- |
| Number of instructions | 89,934 | 123,869 | 40,248 | 254,051 |
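As a quick sanity check on the figures above, the short snippet below confirms that the three subject counts add up to the stated total and prints each subject's share. The `counts` dictionary is just a transcription of the table, not part of the released dataset tooling.

```python
# Subject counts transcribed from the SciInstruct table.
counts = {
    "Math": 89_934,
    "Physics & Chemistry": 123_869,
    "Formal Proofs (Lean)": 40_248,
}

total = sum(counts.values())
assert total == 254_051  # matches the Total column

# Print each subject's count and its share of the whole dataset.
for subject, n in counts.items():
    print(f"{subject}: {n:,} ({n / total:.1%})")
```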

We release our data and model for public use. If you wish to use SciInstruct or SciGLM, you can download them from the following links.

Download data: [Google Drive] [Tsinghua Cloud]

Download model: [Hugging Face]

Training & Inference

Fine-tuning

You can use the SciGLM model through Hugging Face's Transformers library. First, clone the repository and install its dependencies:

git clone https://github.com/THUDM/SciGLM.git
cd SciGLM
pip install -r requirements.txt

To train the 6B model, run:

bash /path/training/finetune.sh

Inference

cd /path/to/inference
python cli_demo.py
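As an alternative to the CLI demo, loading the model directly through Transformers might look like the sketch below. It rests on several assumptions: the repo id `zd21/SciGLM-6B` is taken from this model card's page, the repo ships custom code (hence `trust_remote_code=True`), and that custom code is assumed to expose a ChatGLM-style `chat` method. The helper names `load_sciglm` and `ask` are hypothetical. Check the repository before relying on any of this.

```python
MODEL_ID = "zd21/SciGLM-6B"  # repo id as shown on this model card's page


def load_sciglm(model_id: str = MODEL_ID):
    """Load tokenizer and model; needs `transformers`, `torch`, and a CUDA GPU."""
    # Imported lazily so the module can be inspected without the heavy deps.
    from transformers import AutoModel, AutoTokenizer

    # trust_remote_code=True runs code from the model repo; review it first.
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
    return tokenizer, model.half().cuda().eval()


def ask(tokenizer, model, question: str) -> str:
    """Send one question and return the model's reply."""
    # ASSUMPTION: the repo's custom code provides a ChatGLM-style
    # `chat(tokenizer, query, history=...)` returning (response, history).
    response, _history = model.chat(tokenizer, question, history=[])
    return response


# Example usage (downloads the weights, so it is left commented out):
# tokenizer, model = load_sciglm()
# print(ask(tokenizer, model, "State Newton's second law."))
```

If the `chat` method turns out not to exist, `cli_demo.py` in the repository is the place to check how the authors actually call the model.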

Citation

If you find our work helpful, please kindly cite our paper:

@misc{zhang2024sciglm,
      title={SciGLM: Training Scientific Language Models with Self-Reflective Instruction Annotation and Tuning}, 
      author={Dan Zhang and Ziniu Hu and Sining Zhoubian and Zhengxiao Du and Kaiyu Yang and Zihan Wang and Yisong Yue and Yuxiao Dong and Jie Tang},
      year={2024},
      eprint={2401.07950},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
