|
--- |
|
license: apache-2.0 |
|
datasets: |
|
- c4 |
|
language: |
|
- en |
|
--- |
|
|
|
# MosaicBERT base model |
|
Our goal in developing MosaicBERT was to greatly reduce pretraining time. |
|
|
|
## Model description |
|
|
|
To build MosaicBERT, we adopted architectural choices from the recent transformer literature. These include [FlashAttention (Dao et al. 2022)](https://arxiv.org/pdf/2205.14135.pdf), [ALiBi (Press et al. 2021)](https://arxiv.org/abs/2108.12409), unpadded training, low-precision LayerNorm, and [Gated Linear Units (Shazeer 2020)](https://arxiv.org/abs/2002.05202).
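
As a concrete illustration of one of these choices, ALiBi drops learned position embeddings and instead adds a head-specific, distance-proportional penalty to the attention scores. The sketch below is a minimal reconstruction of that bias in the symmetric form suitable for a bidirectional encoder; it is illustrative only and is not MosaicBERT's actual implementation (head count and sequence length are arbitrary).

```python
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Build the (n_heads, seq_len, seq_len) additive ALiBi attention bias."""
    # Per-head slopes form a geometric sequence (power-of-two head counts shown;
    # Press et al. 2021 describe the general case).
    slopes = torch.tensor([2.0 ** (-8.0 * (i + 1) / n_heads) for i in range(n_heads)])
    # |i - j| distance between query position i and key position j; a bidirectional
    # encoder penalizes distance symmetrically rather than causally.
    pos = torch.arange(seq_len)
    dist = (pos[None, :] - pos[:, None]).abs()
    # Larger distance -> more negative bias -> less attention, with a per-head slope.
    return -slopes[:, None, None] * dist[None, :, :]

bias = alibi_bias(n_heads=8, seq_len=16)  # added to Q @ K^T / sqrt(d) before softmax
```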
|
|
|
1. Modifications to the Attention Mechanism

**FlashAttention:** Attention layers are core components of the transformer architecture. The recently proposed FlashAttention layer reduces the number of read/write operations between the GPU HBM (high bandwidth memory, i.e. long-term memory) and the GPU SRAM (i.e. short-term memory) [[Dao et al. 2022]](https://arxiv.org/pdf/2205.14135.pdf). We used the FlashAttention module built by [Hazy Research](https://github.com/HazyResearch/flash-attention) with [OpenAI’s Triton library](https://github.com/openai/triton).
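
MosaicBERT itself uses the Triton-based FlashAttention module linked above. As a rough, framework-level illustration of the same memory-aware idea (and not of MosaicBERT's actual code path), PyTorch's `torch.nn.functional.scaled_dot_product_attention` fuses the attention computation and can dispatch to a FlashAttention-style kernel on supported GPUs:

```python
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# Illustrative shapes only: (batch, n_heads, seq_len, head_dim).
q = torch.randn(8, 12, 128, 64, device=device, dtype=dtype)
k = torch.randn(8, 12, 128, 64, device=device, dtype=dtype)
v = torch.randn(8, 12, 128, 64, device=device, dtype=dtype)

# The fused kernel computes softmax(Q K^T / sqrt(d)) V without materializing the
# full seq_len x seq_len attention matrix in HBM, which is the core FlashAttention idea.
out = F.scaled_dot_product_attention(q, k, v)  # (8, 12, 128, 64)
```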
|
|
|
|
|
## How to use
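
The snippet below is a minimal sketch of loading this model with the Hugging Face `transformers` library. The repository id `mosaicml/mosaic-bert-base` and the `trust_remote_code=True` flag (assumed to be needed because the architecture relies on custom modeling code rather than the stock BERT implementation) should be checked against this repository's files.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

model_id = "mosaicml/mosaic-bert-base"  # assumed repo id; adjust if it differs

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True)

# Masked-token prediction with the fill-mask pipeline.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask("The capital of France is [MASK]."))
```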
|
|
|
## Training data |
|
|
|
MosaicBERT is pretrained using a standard Masked Language Modeling (MLM) objective: some of the tokens in an input text sequence are masked out, and the model must predict them. MosaicBERT is trained on the English [“Colossal, Cleaned, Common Crawl” C4 dataset](https://github.com/allenai/allennlp/discussions/5056), which contains roughly 365 million curated text documents scraped from the internet (equivalent to 156 billion tokens). We used this more modern dataset in place of traditional BERT pretraining corpora such as English Wikipedia and BooksCorpus.
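
To make the MLM objective concrete, the sketch below shows the standard dynamic-masking setup with `transformers.DataCollatorForLanguageModeling`; the 15% masking probability and the `bert-base-uncased` tokenizer are the classic BERT defaults, used here purely for illustration rather than as a statement of MosaicBERT's exact pretraining configuration.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Any BERT-style WordPiece tokenizer illustrates the point; bert-base-uncased is assumed here.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# The collator dynamically replaces a fraction of tokens with [MASK] (plus a few random
# or unchanged tokens, per the standard BERT recipe) and sets labels to -100 everywhere
# except the masked positions, so the loss is computed only on tokens to be reconstructed.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

batch = collator([tokenizer("MosaicBERT is pretrained with masked language modeling.")])
print(batch["input_ids"])  # some ids replaced by tokenizer.mask_token_id
print(batch["labels"])     # -100 except at the masked positions
```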
|
|
|
## Training procedure |
|
|
|
## Evaluation results |
|
|
|
When fine-tuned on downstream tasks, this model achieves the following results: |
|
|
|
GLUE test results: |
|
|
|
| Task | MNLI-(m/mm) | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | Average | |
|
|:----:|:-----------:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|:-------:| |
|
| | | | | | | | | | | |
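
A minimal sketch of how one might fine-tune this checkpoint on a GLUE task follows. It assumes the repository's custom code can be loaded through `AutoModelForSequenceClassification` with `trust_remote_code=True`; if it cannot, the base encoder can instead be wrapped with a small classification head. The task (MRPC), repo id, and hyperparameters are illustrative only.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "mosaicml/mosaic-bert-base"  # assumed repo id; adjust if it differs

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, num_labels=2, trust_remote_code=True
)

# MRPC: sentence-pair paraphrase detection, one of the GLUE tasks listed above.
raw = load_dataset("glue", "mrpc")

def tokenize(batch):
    return tokenizer(batch["sentence1"], batch["sentence2"],
                     truncation=True, max_length=128)

data = raw.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mosaicbert-mrpc",
                           per_device_train_batch_size=32,
                           learning_rate=2e-5,
                           num_train_epochs=3),
    train_dataset=data["train"],
    eval_dataset=data["validation"],
    tokenizer=tokenizer,
)
trainer.train()
print(trainer.evaluate())  # reports eval loss; add compute_metrics for accuracy/F1
```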
|
|
|
## Intended uses & limitations |