---
language: en
tags:
- transformers
- feature-extraction
- materials
license: other
---

# MaterialsBERT

This model is a fine-tuned version of the [PubMedBERT model](https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext) on a dataset of 2.4 million materials science abstracts. It was introduced in [this paper](https://www.nature.com/articles/s41524-023-01003-w). This model is uncased.

## Model description

Domain-specific fine-tuning has been [shown](https://arxiv.org/abs/2007.15779) to improve downstream performance on a variety of NLP tasks. MaterialsBERT fine-tunes PubMedBERT, a language model pre-trained on biomedical literature; PubMedBERT was chosen as the starting point because the biomedical domain is close to the materials science domain. When further fine-tuned on a variety of downstream sequence labeling tasks in materials science, MaterialsBERT outperformed the other baseline language models tested on three out of five datasets.

## Intended uses & limitations

You can use the raw model for either masked language modeling or next sentence prediction, but it is mostly intended to be fine-tuned on downstream tasks relevant to materials science. Note that this model is primarily aimed at being fine-tuned on tasks that use a sentence or a paragraph (potentially masked) to make decisions, such as sequence classification, token classification, or question answering.

## How to use

Here is how to use this model to get the features of a given text in PyTorch:

```python
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('pranav-s/MaterialsBERT')
model = BertModel.from_pretrained('pranav-s/MaterialsBERT')

text = "Enter any text you like"
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
# Token-level features are in output.last_hidden_state
# (shape: [batch_size, sequence_length, hidden_size])
```

## Training data

A fine-tuning corpus of 2.4 million materials science abstracts was used. The DOIs of the journal articles used are provided in the file `training_DOI.txt`.

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 3.0
- mixed_precision_training: Native AMP

A sketch of how these settings map onto `transformers.TrainingArguments` is included at the end of this card.

### Framework versions

- Transformers 4.17.0
- PyTorch 1.10.2
- Datasets 1.18.3
- Tokenizers 0.11.0

## Citation

If you find MaterialsBERT useful in your research, please cite the following paper:

```latex
@article{materialsbert,
  title={A general-purpose material property data extraction pipeline from large polymer corpora using natural language processing},
  author={Shetty, Pranav and Rajan, Arunkumar Chitteth and Kuenneth, Chris and Gupta, Sonakshi and Panchumarti, Lakshmi Prerana and Holm, Lauren and Zhang, Chao and Ramprasad, Rampi},
  journal={npj Computational Materials},
  volume={9},
  number={1},
  pages={52},
  year={2023},
  publisher={Nature Publishing Group UK London}
}
```
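
## Hyperparameters as `TrainingArguments`

For reference, the sketch below shows one way the hyperparameters listed under "Training procedure" could be expressed with `transformers.TrainingArguments`. This is not the authors' original training script: the output directory is a placeholder, and the Adam settings are passed explicitly even though they match the library defaults.

```python
from transformers import TrainingArguments

# A sketch mirroring the hyperparameters listed under "Training procedure";
# not the original training script.
training_args = TrainingArguments(
    output_dir="materialsbert-finetune",  # placeholder output directory
    learning_rate=5e-05,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-08,
    lr_scheduler_type="linear",
    num_train_epochs=3.0,
    fp16=True,  # Native AMP mixed precision (requires a CUDA GPU)
)
```

These arguments can then be passed to a `transformers.Trainer` together with the model, tokenizer, and a dataset for the downstream task of interest.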