
MGE-LLMs/SteelBERT

SteelBERT was pre-trained on the DeBERTa architecture using a corpus of approximately 4.2 million materials abstracts and 55,000 full-text steel articles (approximately 0.96 billion words). Pre-training used Masked Language Modeling (MLM), a universal and effective self-supervised objective for a wide range of NLP tasks: 15% of the tokens were masked, and SteelBERT learned to predict the representations of the masked tokens by adjusting the parameters of its network layers. 95% of the corpus was allocated for training and 5% for validation, maintaining the same ratio as the original DeBERTa. The validation loss reached 1.158 after 840 hours of training.
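
The pretrained MLM head can be queried directly through the Hugging Face `transformers` library. The sketch below is illustrative only: it assumes `transformers` and `torch` are installed, and the example sentence and predicted tokens are not taken from this card.

```python
# Minimal sketch: query SteelBERT's masked-language-modeling head.
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

model_name = "MGE-LLMs/SteelBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Use the tokenizer's own mask token so the placeholder matches the vocabulary.
masked = f"The yield strength of the {tokenizer.mask_token} steel was measured at room temperature."
for prediction in fill_mask(masked):
    print(prediction["token_str"], round(prediction["score"], 4))
```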

We chose the DeBERTa architecture to pretrain SteelBERT. DeBERTa introduces a paradigm shift in language representation models, offering a disentangled attention mechanism for handling the long-range dependencies critical to comprehending complex material interactions. While the original DeBERTa model has an extensive sub-word vocabulary, this abundance could introduce noise when tokenizing the materials training corpus, leading to inconsistent word splitting. Consequently, a specialized tokenizer was trained on our corpus, starting from the DeBERTa tokenizer, to construct a vocabulary specific to the steel domain. Although our training corpus is only about 6.7% of the size of the original DeBERTa corpus, we kept the vocabulary at the same scale of 128,100 words to ensure the precise capture of latent domain knowledge.
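
A domain tokenizer of this kind can be retrained with `train_new_from_iterator` from `transformers`. The sketch below is a rough illustration under stated assumptions, not the exact procedure used for SteelBERT: the `microsoft/deberta-v3-base` base checkpoint, the tiny `corpus_texts` list, and the output directory name are all assumptions for the example.

```python
# Sketch: derive a steel-domain tokenizer from a DeBERTa tokenizer.
from transformers import AutoTokenizer

corpus_texts = [
    "Austenitic stainless steels exhibit excellent corrosion resistance.",
    "The martensite transformation temperature depends on carbon content.",
]  # placeholder; in practice, the ~4.2M abstracts and 55,000 full-text articles

# Assumed base checkpoint; any DeBERTa tokenizer with a fast backend would do.
base_tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")

def batch_iterator(batch_size=1000):
    for i in range(0, len(corpus_texts), batch_size):
        yield corpus_texts[i:i + batch_size]

# Keep the vocabulary size at 128,100, matching the original DeBERTa scale.
steel_tokenizer = base_tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=128100)
steel_tokenizer.save_pretrained("steel-deberta-tokenizer")
```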

SteelBERT has 188 million parameters and is constructed from 12 stacked Transformer encoders, with each hidden layer incorporating 12 attention heads. We used the original DeBERTa code to train SteelBERT on our corpus with a similar configuration and model size. We set a maximum sequence length of 512 tokens and trained the model until the training loss stopped decreasing. Pre-training used 8 NVIDIA A100 40GB GPUs for 840 hours, with a batch size of 576 sequences.
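
For downstream use, the encoder can be loaded with `AutoModel` and its configuration inspected against the figures above. In the sketch below, the CLS-token pooling for a sentence embedding is our illustrative choice, not something specified in this card, and the input sentence is invented for the example.

```python
# Sketch: load SteelBERT as a 12-layer, 12-head encoder and extract an embedding.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("MGE-LLMs/SteelBERT")
model = AutoModel.from_pretrained("MGE-LLMs/SteelBERT")

print(model.config.num_hidden_layers, model.config.num_attention_heads)  # expected: 12, 12

text = "Tempering at 600 °C improved the toughness of the quenched steel."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)

# Use the hidden state of the first token as a simple sentence-level embedding.
embedding = outputs.last_hidden_state[:, 0, :]
print(embedding.shape)  # (1, hidden_size)
```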
