---
library_name: transformers
tags:
- bert
license: apache-2.0
language: en
---
<br><br>
<p align="center">
<a href="https://atomic-canyon.com/"><svg id="Layer_1" data-name="Layer 1" xmlns="http://www.w3.org/2000/svg" width="450" viewBox="0 0 548.18 92.96"> <defs> <style> .cls-1 { stroke-width: 0px; } </style> </defs> <g> <path class="cls-1" d="m144.69,56.27h-16.94l-2.54,6.14h-7.06l14.26-32.63h7.71l14.31,32.63h-7.2l-2.54-6.14Zm-2.4-5.82l-6.05-14.59-6.05,14.59h12.09Z"/> <path class="cls-1" d="m171.91,62.4h-6.74v-26.31h-12.37v-6h31.48v6h-12.37v26.31Z"/> <path class="cls-1" d="m203.99,63.05c-10.15,0-17.12-6.92-17.12-16.76s7.02-16.85,17.26-16.85,17.08,6.92,17.08,16.76-7.06,16.85-17.22,16.85Zm.05-27.83c-6.09,0-10.29,4.57-10.29,10.99s4.25,11.08,10.34,11.08,10.29-4.57,10.29-10.99-4.29-11.08-10.34-11.08Z"/> <path class="cls-1" d="m261.22,37.57l-10.11,24.88h-6.46l-10.02-24.79-1.48,24.74h-6.51l2.12-32.31h9.32l9.88,24.6,9.97-24.6h9.28l2.03,32.31h-6.65l-1.38-24.83Z"/> <path class="cls-1" d="m276.67,62.4V30.09h6.74v32.31h-6.74Z"/> <path class="cls-1" d="m307.73,62.96c-10.48,0-17.86-6.46-17.86-16.71s7.75-16.71,17.72-16.71c4.43,0,8.12.88,11.59,2.31l-1.52,5.91c-2.95-1.29-6.23-2.22-9.79-2.22-6.55,0-11.12,4.34-11.12,10.62,0,6.6,4.62,10.8,11.45,10.8,3.18,0,6.37-.79,9.6-2.22l1.52,5.63c-3.6,1.71-7.57,2.58-11.59,2.58Z"/> <path class="cls-1" d="m351.39,62.96c-10.48,0-17.86-6.46-17.86-16.71s7.75-16.71,17.72-16.71c4.43,0,8.12.88,11.59,2.31l-1.52,5.91c-2.95-1.29-6.23-2.22-9.79-2.22-6.55,0-11.12,4.34-11.12,10.62,0,6.6,4.62,10.8,11.45,10.8,3.18,0,6.37-.79,9.6-2.22l1.52,5.63c-3.6,1.71-7.57,2.58-11.59,2.58Z"/> <path class="cls-1" d="m392.56,56.27h-16.94l-2.54,6.14h-7.06l14.26-32.63h7.71l14.31,32.63h-7.2l-2.54-6.14Zm-2.4-5.82l-6.05-14.59-6.05,14.59h12.09Z"/> <path class="cls-1" d="m413.92,39.88v22.52h-6.55V30.09h7.25l16.62,22.2v-22.2h6.51v32.31h-6.92l-16.89-22.52Z"/> <path class="cls-1" d="m462.61,62.4h-6.69v-12.14l-13.66-20.17h7.57l9.51,14.4,9.46-14.4h7.48l-13.66,20.12v12.19Z"/> <path class="cls-1" 
d="m494.32,63.05c-10.16,0-17.12-6.92-17.12-16.76s7.02-16.85,17.26-16.85,17.08,6.92,17.08,16.76-7.06,16.85-17.22,16.85Zm.05-27.83c-6.09,0-10.29,4.57-10.29,10.99s4.25,11.08,10.34,11.08,10.29-4.57,10.29-10.99-4.29-11.08-10.34-11.08Z"/> <path class="cls-1" d="m524.36,39.88v22.52h-6.55V30.09h7.25l16.62,22.2v-22.2h6.51v32.31h-6.92l-16.89-22.52Z"/> </g> <path class="cls-1" d="m66.85,4.93l-3.14,5.24s-15.72-8.38-34.06,0c0,0-23.06,9.43-22.53,38.25,0,0-.26,16.51,15.46,29.61,0,0,17.29,15.2,40.87,4.19l3.67,5.76s-22.53,13.62-48.73-4.19c0,0-17.55-13.36-18.34-35.63C.04,48.16-2.05,21.96,23.62,5.71c0,0,19.65-12.31,43.23-.79Z"/> <path class="cls-1" d="m70,13.84l3.67-5.76s17.29,11,18.86,30.39c0,0,7.07,26.46-18.6,46.37l-27.25-46.11-9.43,16.77h11l3.67,6.81h-26.2l20.96-36.68,28.3,48.21s11-6.81,12.05-27.77c0,0,1.05-20.44-17.03-32.23Z"/></svg></a>
</p>
<br><br>
# fermi-bert-1024: Pretrained BERT for Nuclear Power
A BERT model optimized for the nuclear energy domain, `fermi-bert-1024` is pretrained on a combination of Wikipedia (2023), Books3, and a subset of the U.S. Nuclear Regulatory Commission’s ADAMS database. It is specifically designed to handle the complex technical jargon and regulatory language unique to the nuclear industry. Trained on the Oak Ridge National Laboratory [Frontier supercomputer](https://www.olcf.ornl.gov/frontier/) using 128 AMD MI250X GPUs over a 10-hour period, this model provides a robust foundation for fine-tuning on nuclear-related applications.
## Training
`fermi-bert-1024` is a BERT model pretrained on `wikipedia (2023)`, [Books3](https://arxiv.org/pdf/2201.07311), and [ADAMS](https://www.nrc.gov/reading-rm/adams.html) with a max sequence length of 1024.
We make several modifications to the standard BERT training procedure:
* We use a custom nuclear-optimized WordPiece tokenizer to better represent the unique jargon and technical terminology specific to the nuclear industry.
* We train on a subset of U.S. Nuclear Regulatory Commission’s Agency-wide Documents Access and Management System (ADAMS).
* We train on Books3 rather than BookCorpus.
* We use a larger batch size and other improved hyperparameters as described in [RoBERTa](https://arxiv.org/abs/1907.11692).
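The nuclear-optimized WordPiece tokenizer mentioned above can be approximated with the Hugging Face `tokenizers` library. The sketch below is illustrative only: the toy corpus, vocabulary size, and special-token list are assumptions, not the actual training setup.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Toy nuclear-flavored corpus (illustrative; the real tokenizer was trained
# on a much larger domain corpus).
corpus = [
    "reactor coolant system pressure boundary",
    "zircaloy cladding oxidation under loss-of-coolant conditions",
    "NRC licensee event report for the emergency diesel generator",
]

# Train a small WordPiece tokenizer with the standard BERT special tokens.
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.WordPieceTrainer(
    vocab_size=200,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train_from_iterator(corpus, trainer)

# Domain terms tokenize into subwords learned from the domain corpus,
# rather than fragmenting into generic English WordPieces.
print(tokenizer.encode("zircaloy cladding").tokens)
```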
## Evaluation
We evaluate the quality of fermi-bert-1024 on the standard [GLUE](https://gluebenchmark.com/) benchmark ([script](https://github.com/huggingface/transformers/blob/main/examples/pytorch/text-classification/run_glue.py)). We find it performs comparably to other BERT models but with the advantage of performing better on documents in the nuclear energy space as demonstrated by our downstream [fine-tuning](https://huggingface.co/atomic-canyon/fermi-bert-1024).
| Model | Bsz | Steps | Seq | Avg | CoLA | SST-2 | MRPC | STS-B | QQP | MNLI | QNLI | RTE |
| ----------------- | --- | ----- | ---- | ---- | ---- | ----- | ---- | ----- | ---- | ---- | ---- | ---- |
| bert-base-uncased | 256 | 1M | 512 | 0.81 | 0.56 | 0.82 | 0.86 | 0.88 | 0.91 | 0.84 | 0.91 | 0.67 |
| roberta-base | 8k | 500k | 512 | 0.84 | 0.56 | 0.94 | 0.88 | 0.90 | 0.92 | 0.88 | 0.92 | 0.74 |
| fermi-bert-512 | 4k | 100k | 512 | 0.83 | 0.60 | 0.93 | 0.88 | 0.89 | 0.91 | 0.87 | 0.91 | 0.68 |
| fermi-bert-1024 | 4k | 100k | 1024 | 0.83 | 0.60 | 0.93 | 0.86 | 0.89 | 0.91 | 0.86 | 0.92 | 0.69 |
### Pretraining Data
We train on 40% Wikipedia, 30% Books3, 30% ADAMS. We pack and tokenize the sequences to 1024 tokens. If a document is shorter than 1024 tokens, we append another document until it is 1024 tokens. If a document is longer than 1024 tokens we split it into multiple documents. For 10% of the Wikipedia documents, we do not concatenate short documents. See [M2-Bert](https://arxiv.org/pdf/2402.07440v2) for rationale behind including short documents.
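The packing scheme above can be sketched in a few lines of Python. This is a hypothetical helper, not the actual data pipeline; the greedy concatenation order and the `pad_id` value are assumptions for illustration.

```python
def pack_documents(docs, max_len=1024, pad_id=0):
    """Pack tokenized documents into fixed-length sequences.

    Short documents are concatenated until a sequence reaches `max_len`
    tokens; documents longer than `max_len` are split across sequences.
    The final partial sequence is padded with `pad_id`.
    """
    sequences = []
    buffer = []
    for doc in docs:
        buffer.extend(doc)
        # Emit full sequences whenever the buffer holds at least max_len tokens.
        while len(buffer) >= max_len:
            sequences.append(buffer[:max_len])
            buffer = buffer[max_len:]
    if buffer:
        sequences.append(buffer + [pad_id] * (max_len - len(buffer)))
    return sequences

# Two short documents are packed together; the long one is split.
docs = [[1] * 600, [2] * 600, [3] * 2500]
packed = pack_documents(docs, max_len=1024)
print(len(packed))  # 3700 tokens total -> 4 sequences of 1024
```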
# Usage
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

# `fermi-bert` uses a nuclear-specific tokenizer, so load it alongside the model
tokenizer = AutoTokenizer.from_pretrained('atomic-canyon/fermi-bert-1024')
model = AutoModelForMaskedLM.from_pretrained('atomic-canyon/fermi-bert-1024')

# Use the model directly for masked language modeling
classifier = pipeline('fill-mask', model=model, tokenizer=tokenizer, device="cpu")
print(classifier("I [MASK] to the store yesterday."))
```
# Acknowledgement
This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725.