Papers
arxiv:2402.06852

ChemLLM: A Chemical Large Language Model

Published on Feb 10
· Featured in Daily Papers on Feb 13
Authors:
,
,
,
,
,
,
,

Abstract

Large language models (LLMs) have made impressive progress in chemistry applications, including molecular property prediction, molecular generation, experimental protocol design, etc. However, the community lacks a dialogue-based model specifically designed for chemistry. The challenge arises from the fact that most chemical data and scientific knowledge are primarily stored in structured databases, and the direct use of these structured data compromises the model's ability to maintain coherent dialogue. To tackle this issue, we develop a novel template-based instruction construction method that transforms structured knowledge into plain dialogue, making it suitable for language model training. By leveraging this approach, we develop ChemLLM, the first large language model dedicated to chemistry, capable of performing various tasks across chemical disciplines with smooth dialogue interaction. ChemLLM beats GPT-3.5 on all three principal tasks in chemistry, i.e., name conversion, molecular caption, and reaction prediction, and surpasses GPT-4 on two of them. Remarkably, ChemLLM also shows exceptional adaptability to related mathematical and physical tasks despite being trained mainly on chemical-centric corpora. Furthermore, ChemLLM demonstrates proficiency in specialized NLP tasks within chemistry, such as literature translation and cheminformatic programming. ChemLLM opens up a new avenue for exploration within chemical studies, while our method of integrating structured chemical knowledge into dialogue systems sets a new frontier for developing LLMs across various scientific fields. Codes, Datasets, and Model weights are publicly accessible at hf.co/AI4Chem/ChemLLM-7B-Chat.

Community

Hi! Is there any plan to release datasets and ChemBench?

·
Paper author

https://huggingface.co/datasets/AI4Chem/ChemPref-DPO-for-Chemistry-data-en
More Data will be available in weeks, All hail to open source community!

Paper author
edited 28 days ago

Hi! Is there any plan to release datasets and ChemBench?

ChemLLM datasets is all open source now!
https://huggingface.co/papers/2402.06852
700K of SFT Dataset, ChemData700K For Chemistry of LLM!
https://huggingface.co/datasets/AI4Chem/ChemData700K
10K of DPO Dataset, ChemPref-10K, both English and Chinese!
https://huggingface.co/datasets/AI4Chem/ChemPref-DPO-for-Chemistry-data-en
https://huggingface.co/datasets/AI4Chem/ChemPref-DPO-for-Chemistry-data-cn
ChemBench-4K of 4100 high-quality single-choice benchmark for nine core Chemistry tasks!
https://huggingface.co/datasets/AI4Chem/ChemBench4K
C-MHChem, 600 real test questions written and checked manually, from 25 years of Chinese National Middle school chemistry Test!
https://huggingface.co/datasets/AI4Chem/C-MHChem-Benchmark-Chinese-Middle-high-school-Chemistry-Test
All hail to Open-source community!🤗

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Paper author
This comment has been hidden

Sign up or log in to comment

Models citing this paper 3

Datasets citing this paper 5

Browse 5 datasets citing this paper

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2402.06852 in a Space README.md to link it from this page.

Collections including this paper 6