---
license: apache-2.0
datasets:
- wikimedia/wikipedia
- bookcorpus
- nomic-ai/nomic-bert-2048-pretraining-data
language:
- en
inference: false
---
# nomic-bert-2048: A 2048 Sequence Length Pretrained BERT

`nomic-bert-2048` is a BERT model pretrained on Wikipedia and BookCorpus with a max sequence length of 2048.
We make several modifications to our BERT training procedure, similar to MosaicBERT. Namely, we:
- Use Rotary Position Embeddings to allow for context length extrapolation.
- Use SwiGLU activations, which have been shown to improve model performance (a minimal sketch follows this list).
- Set dropout to 0.
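
The SwiGLU change can be illustrated with a small, self-contained PyTorch module. This is a minimal sketch of the general SwiGLU idea; the layer names and shapes below are illustrative and are not taken from the `nomic-bert-2048` implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Gated feed-forward block with a SiLU ("Swish") gate.

    Illustrative sketch only; not the exact layer used in nomic-bert-2048.
    """

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.gate_proj = nn.Linear(dim, hidden_dim)  # produces the gate
        self.up_proj = nn.Linear(dim, hidden_dim)    # produces the value
        self.down_proj = nn.Linear(hidden_dim, dim)  # projects back to model dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU(x) = (SiLU(x W_gate) * (x W_up)) W_down
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
```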
We evaluate the quality of nomic-bert-2048 on the standard GLUE benchmark. We find it performs comparably to other BERT models but with the advantage of a significantly longer context length.
| Model | Batch Size | Steps | Seq Len | Avg | CoLA | SST-2 | MRPC | STS-B | QQP | MNLI | QNLI | RTE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NomicBERT | 4k | 100k | 2048 | 0.84 | 0.50 | 0.93 | 0.88 | 0.90 | 0.92 | 0.86 | 0.92 | 0.82 |
| RobertaBase | 8k | 500k | 512 | 0.86 | 0.64 | 0.95 | 0.90 | 0.91 | 0.92 | 0.88 | 0.93 | 0.79 |
| JinaBERTBase | 4k | 100k | 512 | 0.83 | 0.51 | 0.95 | 0.88 | 0.90 | 0.81 | 0.86 | 0.92 | 0.79 |
| MosaicBERT | 4k | 178k | 128 | 0.85 | 0.59 | 0.94 | 0.89 | 0.90 | 0.92 | 0.86 | 0.91 | 0.83 |
## Pretraining Data
We use BookCorpus and a 2023 dump of Wikipedia. We tokenize and pack the sequences to 2048 tokens. If a document is shorter than 2048 tokens, we append another document until the sequence reaches 2048 tokens. If a document is longer than 2048 tokens, we split it across multiple sequences. We release the dataset as [nomic-ai/nomic-bert-2048-pretraining-data](https://huggingface.co/datasets/nomic-ai/nomic-bert-2048-pretraining-data).
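
A rough sketch of this greedy packing scheme; the `pack_documents` helper and its list-of-token-ids input format are illustrative, not the actual preprocessing code:

```python
def pack_documents(token_docs, seq_len=2048):
    """Greedily pack tokenized documents into fixed-length sequences.

    `token_docs` is an iterable of token-id lists. Documents longer than
    `seq_len` are split across sequences; shorter ones are concatenated
    with the following document(s) until a full sequence is reached.
    Illustrative only.
    """
    packed, buffer = [], []
    for doc in token_docs:
        buffer.extend(doc)
        while len(buffer) >= seq_len:
            packed.append(buffer[:seq_len])
            buffer = buffer[seq_len:]
    if buffer:  # final partial sequence (could be padded or dropped)
        packed.append(buffer)
    return packed
```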
## Usage
```python
from transformers import AutoModelForMaskedLM, AutoConfig, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')  # `nomic-bert-2048` uses the standard BERT tokenizer

config = AutoConfig.from_pretrained('nomic-ai/nomic-bert-2048', trust_remote_code=True)  # the config needs to be passed in
model = AutoModelForMaskedLM.from_pretrained('nomic-ai/nomic-bert-2048', config=config, trust_remote_code=True)

# To use this model directly for masked language modeling
classifier = pipeline('fill-mask', model=model, tokenizer=tokenizer, device="cpu")

print(classifier("I [MASK] to the store yesterday."))
```
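
Because the model was pretrained with a 2048-token context, long inputs can be tokenized up to that length. A minimal sketch, assuming the remote-code model returns the standard `MaskedLMOutput` with `logits`; the example text is a placeholder:

```python
import torch

long_text = " ".join(["long document text"] * 600)  # placeholder for a long document
inputs = tokenizer(long_text, truncation=True, max_length=2048, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

print(outputs.logits.shape)  # expected: (1, sequence_length, vocab_size)
```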
To fine-tune the model for a sequence classification task, you can use the following snippet:
```python
from transformers import AutoConfig, AutoModelForSequenceClassification

model_path = "nomic-ai/nomic-bert-2048"
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)

# strict needs to be False here since we're initializing some new params
model = AutoModelForSequenceClassification.from_pretrained(model_path, config=config, trust_remote_code=True, strict=False)
```
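
As a quick sanity check that the newly initialized classification head is wired up, one can run a single forward pass with a dummy label. This sketch assumes the remote-code classification model follows the standard `transformers` interface (returning `loss` and `logits`); the example sentence and label are placeholders:

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # standard BERT tokenizer, as above

batch = tokenizer("nomic-bert-2048 handles long documents.", return_tensors="pt")
labels = torch.tensor([1])  # placeholder label

outputs = model(**batch, labels=labels)
print(outputs.loss, outputs.logits.shape)  # training loss and (1, num_labels) logits
```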
## Join the Nomic Community
- Nomic: https://nomic.ai
- Discord: https://discord.gg/myY5YDR8z8
- Twitter: https://twitter.com/nomic_ai