license: apache-2.0
datasets:
- c4
language:
- en
MosaicBERT base model
Our goal in developing MosaicBERT was to greatly reduce pretraining time.
Model description
To build MosaicBERT, we adopted architectural choices from the recent transformer literature. These include FlashAttention (Dao et al. 2022), ALiBi (Press et al. 2021), training in an unpadded manner, low-precision LayerNorm, and Gated Linear Units (Shazeer 2020).
Modifications to the Attention Mechanism
- FlashAttention: Attention layers are core components of the transformer architecture. The recently proposed FlashAttention layer reduces the number of read/write operations between the GPU HBM (high bandwidth memory, i.e. long-term memory) and the GPU SRAM (i.e. short-term memory) (Dao et al. 2022). We used the FlashAttention module built by Hazy Research with OpenAI’s Triton library.
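To give a feel for why FlashAttention needs fewer memory round trips, here is a minimal pure-Python sketch of the blockwise idea described in Dao et al. 2022: keys and values are visited one block at a time and a running softmax is maintained, so the full vector of attention scores is never stored. This is only an illustration for a single query vector, not the Triton kernel we used; the function names and the `block_size` parameter are hypothetical.

```python
import math


def naive_attention(q, keys, values):
    """Reference version: computes and stores every score before the softmax."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


def blockwise_attention(q, keys, values, block_size=2):
    """Same result, computed one key/value block at a time.

    Only a running max, a running normalizer, and a running weighted sum
    are kept between blocks (block_size is a hypothetical tuning knob).
    """
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the previous max.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            for j in range(dim):
                acc[j] += w * v[j]
        running_max = new_max
    return [a / running_sum for a in acc]


# Toy inputs: one query of width 4, five keys, five values of width 3.
q = [0.1, -0.2, 0.3, 0.4]
keys = [[0.5, 0.1, -0.3, 0.2], [0.0, 0.4, 0.1, -0.1], [1.0, -1.0, 0.5, 0.3],
        [-0.2, 0.2, 0.2, 0.9], [0.3, 0.3, -0.6, 0.0]]
values = [[1.0, 0.0, 2.0], [0.5, 1.5, -1.0], [0.0, 0.0, 1.0],
          [2.0, -1.0, 0.5], [-0.5, 0.5, 0.5]]
out_naive = naive_attention(q, keys, values)
out_block = blockwise_attention(q, keys, values, block_size=2)
```

Both functions return the same output up to floating-point error. On a GPU, the blockwise form lets each block be processed in fast on-chip memory, which is where the reduction in HBM reads and writes comes from.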
How to use
Training data
MosaicBERT is pretrained using a standard Masked Language Modeling (MLM) objective: the model is given a sequence of text with some tokens hidden, and it has to predict these masked tokens. MosaicBERT is trained on the English “Colossal, Cleaned, Common Crawl” C4 dataset, which contains roughly 365 million curated text documents scraped from the internet (equivalent to 156 billion tokens). We used this more modern dataset in place of traditional BERT pretraining corpora like English Wikipedia and BooksCorpus.
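The MLM objective can be illustrated with a small sketch of how a training example is built: some tokens are replaced by a mask token, and the originals are kept as labels for the model to predict. The code below is a simplified illustration, not our data pipeline. The `mask_tokens` helper, the whitespace tokenization, and the `mask_prob` value are assumptions made for the example (0.15 is the rate used in the original BERT paper; this card does not state the rate used here), and BERT's random-replacement variants are omitted.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Build one masked-language-modeling example.

    Returns (masked, labels). labels[i] is the original token where a mask
    was applied and None elsewhere, so a loss would only be computed at the
    hidden positions. mask_prob is an assumed value, not taken from the card.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


# Whitespace splitting stands in for a real subword tokenizer.
tokens = "the quick brown fox jumps over the lazy dog".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=0)
```

During pretraining, the model sees `masked` as input and is scored only on recovering the tokens stored in `labels`.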
Training procedure
Evaluation results
When fine-tuned on downstream tasks, this model achieves the following results:
GLUE test results:
Task | MNLI-(m/mm) | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | Average |
---|---|---|---|---|---|---|---|---|---|