Model Card for LLaMat-2-Chat

Overview

LLaMat-2-Chat is a specialized large language model designed to serve as an AI copilot for materials research. Finetuned from LLaMat-2, this model is adapted for tasks such as information extraction from material science text and tabular data. It provides advanced capabilities in scientific data processing, assisting researchers in analyzing and interpreting material science literature, reports, and datasets.

For more details, refer to our paper: Foundational Large Language Models for Materials Research.

Model Details

Model Type: Large Language Model (LLM)
Base Model: LLaMat-2 (continued pretraining of LLaMA-2 on material science data)
Language: English
License: LLaMA-2 License
Tags: Material Science, Domain Adaptation, Table Understanding, Scientific Data Parsing, Materials Copilot
Developed by: M3RG, IIT Delhi & DAIR, IIT Delhi

Key Features

Instruction Following Abilities: Optimized for understanding and processing instructions in the material science domain.
Domain-Specific Expertise: Pretrained on material science tokens, enabling high performance in scientific applications.
Applications: information extraction, table understanding, and parsing data for research tasks.

Intended Use

LLaMat-2-Chat is designed to assist researchers, scientists, and industry professionals in:

Extracting structured information from material science texts and tables.
Analyzing experimental results and processing large datasets.
Assisting in literature review and knowledge discovery.
Supporting research-driven natural language queries related to material science.

This model is intended for academic and industrial research purposes.

Technical Specification

Hardware Infrastructure

Pretraining: 2 Cerebras CS-2 Wafer-Scale Engines (WSE-2)
Finetuning: 8 NVIDIA A100 80GB GPUs
Inferencing: 1 NVIDIA A100 80GB GPU

Software Stack

Frameworks: PyTorch, Hugging Face Transformers, Meditron-LLM Library

Training Data

LLaMat-2-Chat was trained on a curated corpus of material science literature, scientific papers, structured datasets, and technical reports. The training set includes:

material science research papers published in journals of Elsevier and Springer.
Material science community discourse
Redpajama dataset
Openorca instruction finetuning dataset
mathQA dataset
MatSciNLP benchmark dataset
task specific datasets (mentioned in Table A.2 in Foundational Large Language Models for Materials Research.)

Results

detailed results and comparison with existing models can be read from Foundational Large Language Models for Materials Research.

Development and Support

Developed by: M3RG, IIT Delhi & DAIR, IIT Delhi
Compute Support:
- IIT Delhi High-Performance Computing Cluster: Supported fine-tuning and inference stages.
- Edinburgh International Data Facility (EIDF): EIDF Cerebras CS Clusters provided access to Cerebras CS2 clusters for pretraining.

Repository with training and evaluation code

Repository: LLaMat-2 on GitHub

Citation

If you use LLaMat-2-Chat in your research, please cite our work:

@article{LLaMat-2,
  author    = {Vaibhav Mishra and Somaditya Singh and Dhruv Ahlawat and Mohd Zaki and Vaibhav Bihani and Hargun Singh Grover and Biswajit Mishra and Santiago Miret and Mausam and N. M. Anoop Krishnan},
  title     = {Foundational Large Language Models for Materials Research},
  journal   = {arXiv preprint arXiv:2412.09560},
  year      = {2024},
  url       = {https://arxiv.org/abs/2412.09560}
}

m3rg-iitd
/

llamat-2-chat