KALE-LM for Science
Knowledge And Logic Enhanced Large Model for Science
We are thrilled to present Llama3-KALE-LM-Chem-8B, our first open-source KALE-LM, which specializes in chemistry.
Starting from Llama 3, we continually pre-trained the model on a large amount of domain-relevant data and then post-trained it through supervised fine-tuning (SFT).
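The exact training recipe and data are described in our paper; purely as an illustration of the first stage of this recipe, the sketch below shows what a continued pre-training step can look like with the Hugging Face `Trainer` API. The corpus file, sequence length, and all hyperparameters are placeholders, not the values used for KALE-LM.

```python
# Illustrative sketch of continued pre-training for a causal LM.
# The corpus path and all hyperparameters below are placeholders,
# not the actual KALE-LM training configuration.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype="auto")

# Placeholder corpus: one document per line in a plain-text file
corpus = load_dataset("text", data_files={"train": "chem_corpus.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

# mlm=False selects the standard next-token (causal) objective
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="kale-lm-cpt",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=1,
        learning_rate=1e-5,
        bf16=True,
    ),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```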
Performance on open benchmarks:

| Models | ChemBench | MMLU | MMLU-Chem | SciQ | IE (Acc) | IE (LS) |
|---|---|---|---|---|---|---|
| GPT-3.5 | 47.15 | 69.75 | 53.32 | 89.60 | 52.98 | 68.28 |
| GPT-4 | 53.72 | 78.67 | 63.70 | 94.10 | 54.20 | 69.74 |
| Llama3-8B-Instruct | 46.02 | 68.30 | 51.10 | 93.30 | 45.83 | 61.22 |
| LlaSMol | 28.47 | 54.47 | 33.24 | 72.30 | 2.16 | 3.23 |
| ChemDFM | 44.44 | 58.11 | 45.60 | 86.70 | 7.61 | 11.49 |
| ChemLLM-7B-Chat | 34.16 | 61.79 | 48.39 | 94.00 | 29.66 | 39.17 |
| ChemLLM-7B-Chat-1.5-SFT | 42.75 | 63.56 | 49.63 | 95.10 | 14.96 | 19.61 |
| Llama3-KALE-LM-Chem-8B | 52.40 | 68.74 | 53.83 | 91.50 | 67.50 | 78.37 |
ChemBench results by subtask (the first PP column is property prediction, the second is product prediction):

| Models | NC | PP | M2C | C2M | PP | RS | YP | TP | SP | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-3.5 | 46.93 | 56.98 | 85.28 | 38.25 | 43.67 | 42.33 | 30.33 | 42.57 | 38.00 | 47.15 |
| GPT-4 | 54.82 | 65.02 | 92.64 | 52.88 | 62.67 | 52.67 | 42.33 | 24.75 | 35.67 | 53.72 |
| Llama3-8B-Instruct | 51.31 | 27.79 | 90.30 | 40.88 | 34.00 | 30.00 | 45.33 | 60.89 | 33.67 | 46.02 |
| LlaSMol | 27.78 | 29.34 | 31.44 | 23.38 | 25.67 | 24.00 | 37.33 | 34.65 | 22.67 | 28.47 |
| ChemDFM | 36.92 | 55.57 | 83.95 | 42.00 | 40.00 | 37.33 | 39.00 | 33.17 | 32.00 | 44.44 |
| ChemLLM-7B-Chat | 41.05 | 29.76 | 85.28 | 26.12 | 26.00 | 24.00 | 20.00 | 24.26 | 31.00 | 34.16 |
| ChemLLM-7B-Chat-1.5-SFT | 50.06 | 49.51 | 85.28 | 38.75 | 38.00 | 26.67 | 28.33 | 31.68 | 33.67 | 42.44 |
| Llama3-KALE-LM-Chem-8B | 63.58 | 58.39 | 92.98 | 44.50 | 48.67 | 38.33 | 46.33 | 44.55 | 34.33 | 52.41 |
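As a point of reference, multiple-choice benchmarks of this kind are often scored by comparing the log-likelihood the model assigns to each answer option. The sketch below illustrates that general technique; it is an assumption for exposition, not the evaluation harness that produced the numbers above.

```python
# Generic log-likelihood scoring of multiple-choice options.
# Illustrative only; not the harness used for the tables above.
import torch

def score_option(model, tokenizer, question: str, option: str) -> float:
    """Sum of log-probabilities the model assigns to the option's tokens."""
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    full_ids = tokenizer(question + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids.to(model.device)).logits
    # logits at position i predict token i + 1
    log_probs = torch.log_softmax(logits[0, :-1].float(), dim=-1)
    # Assumes the prompt tokenization is a prefix of the full tokenization,
    # which holds for most tokenizers on natural-language text.
    option_positions = range(prompt_ids.shape[1] - 1, full_ids.shape[1] - 1)
    return sum(log_probs[pos, full_ids[0, pos + 1]].item() for pos in option_positions)

def predict(model, tokenizer, question: str, options: list[str]) -> int:
    """Index of the highest-scoring option."""
    scores = [score_option(model, tokenizer, question, o) for o in options]
    return max(range(len(options)), key=scores.__getitem__)
```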
You can chat with the model via the Hugging Face transformers library:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # the device to place the input tensors on

# Load the model and tokenizer from the Hugging Face Hub
model = AutoModelForCausalLM.from_pretrained(
    "USTC-KnowledgeComputingLab/Llama3-KALE-LM-Chem-8B",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("USTC-KnowledgeComputingLab/Llama3-KALE-LM-Chem-8B")

prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]

# Render the conversation with the model's chat template
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=2048
)
# Strip the prompt tokens so only the newly generated answer remains
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
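For interactive use, tokens can be streamed to stdout as they are generated with transformers' `TextStreamer`, reusing the `model`, `tokenizer`, and `model_inputs` from the snippet above:

```python
from transformers import TextStreamer

# Prints decoded tokens as they are generated, skipping the echoed prompt
streamer = TextStreamer(tokenizer, skip_prompt=True)
model.generate(model_inputs.input_ids, max_new_tokens=512, streamer=streamer)
```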
If this model is useful to you, please cite:

```bibtex
@article{dai2024kale,
  title={KALE-LM: Unleash The Power Of AI For Science Via Knowledge And Logic Enhanced Large Model},
  author={Dai, Weichen and Chen, Yezeng and Dai, Zijie and Huang, Zhijie and Liu, Yubo and Pan, Yixuan and Song, Baiyang and Zhong, Chengli and Li, Xinhe and Wang, Zeyu and others},
  journal={arXiv preprint arXiv:2409.18695},
  year={2024}
}
```
Base model: meta-llama/Meta-Llama-3-8B-Instruct