
nomic-bert-2048: A 2048 Sequence Length Pretrained BERT

nomic-bert-2048 is a BERT model pretrained on Wikipedia and BookCorpus with a maximum sequence length of 2048 tokens.

We make several modifications to our BERT training procedure similar to MosaicBERT. Namely, we:

- Use Rotary Position Embeddings to allow for context length extrapolation.
- Use SwiGLU activations, which have been shown to improve model performance (see the sketch below).
- Set dropout to 0.
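For reference, here is a minimal sketch of a SwiGLU feed-forward block in the standard formulation (Shazeer, 2020). The module and parameter names are illustrative only and do not necessarily match the remote code shipped with this repository.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    # Standard SwiGLU formulation: down_proj(silu(x @ W_gate) * (x @ W_up))
    def __init__(self, hidden_dim: int, intermediate_dim: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_dim, intermediate_dim)
        self.up_proj = nn.Linear(hidden_dim, intermediate_dim)
        self.down_proj = nn.Linear(intermediate_dim, hidden_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # silu(gate) elementwise-multiplied with the up projection, then projected back down
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))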

We evaluate the quality of nomic-bert-2048 on the standard GLUE benchmark. We find it performs comparably to other BERT models but with the advantage of a significantly longer context length.

| Model        | Bsz | Steps | Seq Len | Avg  | CoLA | SST-2 | MRPC | STS-B | QQP  | MNLI | QNLI | RTE  |
|--------------|-----|-------|---------|------|------|-------|------|-------|------|------|------|------|
| NomicBERT    | 4k  | 100k  | 2048    | 0.84 | 0.50 | 0.93  | 0.88 | 0.90  | 0.92 | 0.86 | 0.92 | 0.82 |
| RobertaBase  | 8k  | 500k  | 512     | 0.86 | 0.64 | 0.95  | 0.90 | 0.91  | 0.92 | 0.88 | 0.93 | 0.79 |
| JinaBERTBase | 4k  | 100k  | 512     | 0.83 | 0.51 | 0.95  | 0.88 | 0.90  | 0.81 | 0.86 | 0.92 | 0.79 |
| MosaicBERT   | 4k  | 178k  | 128     | 0.85 | 0.59 | 0.94  | 0.89 | 0.90  | 0.92 | 0.86 | 0.91 | 0.83 |

Pretraining Data

We use BookCorpus and a 2023 dump of Wikipedia. We tokenize and pack the sequences to 2048 tokens. If a document is shorter than 2048 tokens, we append another document until the sequence reaches 2048 tokens. If a document is longer than 2048 tokens, we split it across multiple sequences. We release the dataset here.
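The packing step can be sketched roughly as follows. This is a simplified illustration; pack_documents and tokenize are hypothetical helper names, not the tooling used to produce the released dataset.

MAX_LEN = 2048

def pack_documents(documents, tokenize):
    """Greedily pack tokenized documents into sequences of at most MAX_LEN tokens."""
    packed, buffer = [], []
    for doc in documents:
        tokens = tokenize(doc)
        # Documents longer than MAX_LEN are split into MAX_LEN-sized chunks
        chunks = [tokens[i:i + MAX_LEN] for i in range(0, len(tokens), MAX_LEN)]
        for chunk in chunks:
            if len(buffer) + len(chunk) <= MAX_LEN:
                buffer.extend(chunk)      # shorter documents are appended together
            else:
                packed.append(buffer)     # current sequence is full; start a new one
                buffer = list(chunk)
    if buffer:
        packed.append(buffer)
    return packed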

Usage

from transformers import AutoModelForMaskedLM, AutoConfig, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased') # `nomic-bert-2048` uses the standard BERT tokenizer

config = AutoConfig.from_pretrained('nomic-ai/nomic-bert-2048', trust_remote_code=True) # the config needs to be passed in
model = AutoModelForMaskedLM.from_pretrained('nomic-ai/nomic-bert-2048', config=config, trust_remote_code=True)

# To use this model directly for masked language modeling
classifier = pipeline('fill-mask', model=model, tokenizer=tokenizer, device="cpu")

print(classifier("I [MASK] to the store yesterday."))

To fine-tune the model for a sequence classification task, you can use the following snippet:

from transformers import AutoConfig, AutoModelForSequenceClassification
model_path = "nomic-ai/nomic-bert-2048"
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
# strict needs to be false here since we're initializing some new params
model = AutoModelForSequenceClassification.from_pretrained(model_path, config=config, trust_remote_code=True, strict=False)
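As a quick sanity check of the extended context window, you can run a forward pass on a long input, reusing the tokenizer loaded above. The example text is made up, and the classification head is randomly initialized until you fine-tune it.

import torch

long_text = "the quick brown fox jumps over the lazy dog " * 300  # hypothetical long document
inputs = tokenizer(long_text, truncation=True, max_length=2048, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.shape)  # (1, num_labels)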

Join the Nomic Community
