---
license: apache-2.0
datasets:
- c4
language:
- en
inference: false
---

# mosaic-bert-base-seqlen-512 model [MosaicBERT Family]

MosaicBERT-Base is a new BERT architecture and training recipe optimized for fast pretraining.
MosaicBERT trains faster and achieves higher pretraining and finetuning accuracy when benchmarked against
Hugging Face's [bert-base-uncased](https://huggingface.co/bert-base-uncased).

__This model was trained with [ALiBi](https://arxiv.org/abs/2108.12409) and a sequence length of 512 tokens.__

It is part of the family of MosaicBERT-Base models:

* [mosaic-bert-base](https://huggingface.co/mosaicml/mosaic-bert-base) (trained on a sequence length of 128 tokens)
* mosaic-bert-base-seqlen-512 (this model)
* mosaic-bert-base-seqlen-1024
* mosaic-bert-base-seqlen-2048 (coming soon)

Note that ALiBi allows a model trained with a sequence length n to extrapolate to sequence lengths greater than 2n.

## Model Date

April 2023

## Documentation

* [Blog post](https://www.mosaicml.com/blog/mosaicbert)
* [Github (mosaicml/examples/bert repo)](https://github.com/mosaicml/examples/tree/main/examples/bert)

## How to use

```python
from transformers import AutoModelForMaskedLM
mlm = AutoModelForMaskedLM.from_pretrained('mosaicml/mosaic-bert-base', trust_remote_code=True)
```

The tokenizer for this model is simply the Hugging Face `bert-base-uncased` tokenizer.

```python
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
```

To use this model directly for masked language modeling, use `pipeline`:

```python
from transformers import AutoModelForMaskedLM, BertTokenizer, pipeline

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
mlm = AutoModelForMaskedLM.from_pretrained('mosaicml/mosaic-bert-base', trust_remote_code=True)

classifier = pipeline('fill-mask', model=mlm, tokenizer=tokenizer)

classifier("I [MASK] to the store yesterday.")
```
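
You can also skip `pipeline` and run the model directly. The snippet below is a minimal sketch of a manual fill-mask forward pass; it assumes the remote-code model returns standard Hugging Face masked-LM outputs with a `.logits` field. Inputs up to the 512-token pretraining sequence length are supported, and ALiBi allows extrapolation beyond that.

```python
import torch
from transformers import AutoModelForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
mlm = AutoModelForMaskedLM.from_pretrained('mosaicml/mosaic-bert-base', trust_remote_code=True)

inputs = tokenizer("I [MASK] to the store yesterday.", return_tensors='pt')

with torch.no_grad():
    logits = mlm(**inputs).logits  # assumes a standard MaskedLMOutput from the remote code

# Most likely token at the [MASK] position.
mask_positions = (inputs['input_ids'] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
print(tokenizer.decode(logits[0, mask_positions].argmax(dim=-1)))
```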

**To continue MLM pretraining**, follow the [MLM pre-training section of the mosaicml/examples/bert repo](https://github.com/mosaicml/examples/tree/main/examples/bert#mlm-pre-training).

**To fine-tune this model for classification**, follow the [Single-task fine-tuning section of the mosaicml/examples/bert repo](https://github.com/mosaicml/examples/tree/main/examples/bert#single-task-fine-tuning).

### Remote Code

This model requires that `trust_remote_code=True` be passed to the `from_pretrained` method. This is because we train using [FlashAttention (Dao et al. 2022)](https://arxiv.org/pdf/2205.14135.pdf), which is not part of the `transformers` library and depends on [Triton](https://github.com/openai/triton) and some custom PyTorch code. Since this involves executing arbitrary code, you should consider passing a git `revision` argument that specifies the exact commit of the code, for example:

```python
mlm = AutoModelForMaskedLM.from_pretrained(
   'mosaicml/mosaic-bert-base',
   trust_remote_code=True,
   revision='24512df',
)
```

Note, however, that pinning a revision means you will need to check for updates to this model or its code yourself and update the commit hash manually.

## MosaicBERT Model description

In order to build MosaicBERT, we adopted architectural choices from the recent transformer literature.
These include [FlashAttention (Dao et al. 2022)](https://arxiv.org/pdf/2205.14135.pdf), [ALiBi (Press et al. 2021)](https://arxiv.org/abs/2108.12409),
and [Gated Linear Units (Shazeer 2020)](https://arxiv.org/abs/2002.05202). In addition, we remove padding inside the transformer block,
and apply LayerNorm with low precision.

### Modifications to the Attention Mechanism
1. **FlashAttention**: Attention layers are core components of the transformer architecture. The recently proposed FlashAttention layer
reduces the number of read/write operations between the GPU HBM (high bandwidth memory, i.e. long-term memory) and the GPU SRAM
(i.e. short-term memory) [[Dao et al. 2022]](https://arxiv.org/pdf/2205.14135.pdf). We used the FlashAttention module built by
[Hazy Research](https://github.com/HazyResearch/flash-attention) with [OpenAI’s Triton library](https://github.com/openai/triton).

2. **Attention with Linear Biases (ALiBi)**: In most BERT models, the positions of tokens in a sequence are encoded with a position embedding layer;
this embedding allows subsequent layers to keep track of the order of tokens in a sequence. ALiBi eliminates position embeddings and
instead conveys this information using a bias matrix in the attention operation. It modifies the attention mechanism such that nearby
tokens strongly attend to one another [[Press et al. 2021]](https://arxiv.org/abs/2108.12409). In addition to improving the performance of the final model, ALiBi helps the
model to handle sequences longer than it saw during training. Details on our ALiBi implementation can be found [in the mosaicml/examples repo here](https://github.com/mosaicml/examples/blob/d14a7c94a0f805f56a7c865802082bf6d8ac8903/examples/bert/src/bert_layers.py#L425); a simplified sketch of the bias construction appears after this list.

3. **Unpadding**: Standard NLP practice is to combine text sequences of different lengths into a batch, and pad the sequences with empty
tokens so that all sequence lengths are the same. During training, however, this can lead to many superfluous operations on those
padding tokens. In MosaicBERT, we take a different approach: we concatenate all the examples in a minibatch into a single sequence
of batch size 1. Results from NVIDIA and others have shown that this approach leads to speed improvements during training, since
operations are not performed on padding tokens (see for example [Zeng et al. 2022](https://arxiv.org/pdf/2208.08124.pdf)).
Details on our “unpadding” implementation can be found [in the mosaicml/examples repo here](https://github.com/mosaicml/examples/blob/main/examples/bert/src/bert_padding.py); a toy illustration also follows this list.

4. **Low Precision LayerNorm**: This small tweak forces LayerNorm modules to run in float16 or bfloat16 precision instead of float32, improving utilization.
Our implementation can be found [in the Composer documentation here](https://docs.mosaicml.com/en/v0.12.1/method_cards/low_precision_layernorm.html).
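
To make item 2 concrete, here is a simplified sketch of how an ALiBi bias can be constructed for a bidirectional encoder. This is illustrative only and is not MosaicBERT's exact implementation (which lives in `bert_layers.py`, linked above): each head gets a fixed slope, and attention scores are penalized in proportion to the distance between query and key positions.

```python
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    # One slope per head; this geometric sequence is exact when n_heads is a power of two
    # (see Press et al. 2021 for the general case).
    slopes = torch.tensor([2.0 ** (-(i + 1) * 8.0 / n_heads) for i in range(n_heads)])
    positions = torch.arange(seq_len)
    # Symmetric |i - j| distances, since BERT attention is bidirectional.
    distances = (positions[None, :] - positions[:, None]).abs()
    return -slopes[:, None, None] * distances[None, :, :]  # (n_heads, seq_len, seq_len)

# The bias is simply added to the raw attention scores before the softmax, e.g.
# scores = q @ k.transpose(-2, -1) / head_dim ** 0.5 + alibi_bias(n_heads, seq_len)
```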
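
And for item 3, a toy illustration of unpadding (again a sketch, not the actual `bert_padding.py` code): the real tokens of a padded minibatch are packed into a single batch-size-1 sequence, with cumulative sequence lengths recording where each example begins and ends.

```python
import torch

# Two padded examples (pad token id 0), batch shape (2, 6).
input_ids = torch.tensor([
    [101, 2023, 2003, 102, 0, 0],    # length 4
    [101, 7592, 2088, 999, 102, 0],  # length 5
])
attention_mask = input_ids != 0

# Keep only the real tokens, packed into one sequence of shape (9,).
packed = input_ids[attention_mask]
seq_lens = attention_mask.sum(dim=1)  # tensor([4, 5])
cu_seqlens = torch.cat([torch.zeros(1, dtype=torch.long), seq_lens.cumsum(0)])
print(packed.shape, cu_seqlens)       # torch.Size([9]) tensor([0, 4, 9])
```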

### Modifications to the Feedforward Layers

5. **Gated Linear Units (GLU)**: We used Gated Linear Units for the feedforward sublayer of a transformer. GLUs were first proposed in 2016 [[Dauphin et al. 2016]](https://arxiv.org/abs/1612.08083),
and incorporate an extra learnable matrix that “gates” the outputs of the feedforward layer. More recent work has shown that
GLUs can improve quality in transformers [[Shazeer, 2020](https://arxiv.org/abs/2002.05202), [Narang et al. 2021](https://arxiv.org/pdf/2102.11972.pdf)]. We used the GeLU (Gaussian Error Linear Unit)
activation function with GLU, which is sometimes referred to as GeGLU. The GeLU activation function is a smooth, fully differentiable
approximation to ReLU; we found that this led to a nominal improvement over ReLU. More details on our GLU implementation can be found in the [mosaicml/examples repo](https://github.com/mosaicml/examples/tree/main/examples/bert), and a simplified GeGLU block is sketched below.
The extra gating matrix in a GLU model potentially adds additional parameters to a model; we chose to augment our BERT-Base model with
additional parameters due to GLU modules as it leads to a Pareto improvement across all timescales (which is not true of all larger
models such as BERT-Large). While BERT-Base has 110 million parameters, MosaicBERT-Base has 137 million parameters. Note that
MosaicBERT-Base trains faster than BERT-Base despite having more parameters.
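
As a rough sketch of the idea (dimensions are illustrative, and this is not the exact MosaicBERT module), a GeGLU feedforward block looks like this:

```python
import torch
import torch.nn as nn

class GeGLUFeedForward(nn.Module):
    """GLU-style feedforward block with a GeLU gate (GeGLU)."""

    def __init__(self, d_model: int = 768, d_hidden: int = 3072):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_hidden)  # gated branch
        self.up_proj = nn.Linear(d_model, d_hidden)    # the extra learnable matrix
        self.down_proj = nn.Linear(d_hidden, d_model)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Element-wise product of the GeLU-activated gate and the linear branch.
        return self.down_proj(self.act(self.gate_proj(x)) * self.up_proj(x))
```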

## Training data

MosaicBERT is pretrained using a standard Masked Language Modeling (MLM) objective: the model is given a sequence of
text with some tokens hidden, and it has to predict these masked tokens. MosaicBERT is trained on
the English [“Colossal, Cleaned, Common Crawl” C4 dataset](https://github.com/allenai/allennlp/discussions/5056), which contains roughly 365 million curated text documents scraped
from the internet (equivalent to 156 billion tokens). We used this more modern dataset in place of traditional BERT pretraining
corpora like English Wikipedia and BooksCorpus.
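
To make the MLM objective concrete, the snippet below uses the Hugging Face data collator to mask a sequence. This is only an illustration (MosaicBERT's actual data pipeline lives in the mosaicml/examples repo); the 30% masking ratio matches the recipe described under Pretraining Optimizations below.

```python
from transformers import BertTokenizer, DataCollatorForLanguageModeling

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# mlm_probability=0.3 reflects the 30% masking ratio used for MosaicBERT
# (the library default is 0.15).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.3)

batch = collator([tokenizer("MosaicBERT is pretrained on the C4 dataset.")])
# batch['input_ids'] now contains [MASK] tokens; batch['labels'] holds the original
# ids at masked positions and -100 everywhere else.
print(tokenizer.decode(batch['input_ids'][0]))
```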

## Pretraining Optimizations

Many of the pretraining optimizations below were informed by our [BERT results for the MLPerf v2.1 speed benchmark](https://www.mosaicml.com/blog/mlperf-nlp-nov2022).

1. **MosaicML Streaming Dataset**: As part of our efficiency pipeline, we converted the C4 dataset to [MosaicML’s StreamingDataset format](https://www.mosaicml.com/blog/mosaicml-streamingdataset) and used this
for both MosaicBERT-Base and the baseline BERT-Base. For all BERT-Base models, we chose the training duration to be 286,720,000 samples of sequence length 128; this covers 78.6% of C4.

2. **Higher Masking Ratio for the Masked Language Modeling Objective**: We used the standard Masked Language Modeling (MLM) pretraining objective.
While the original BERT paper also included a Next Sentence Prediction (NSP) task in the pretraining objective,
subsequent papers have shown this to be unnecessary [Liu et al. 2019](https://arxiv.org/abs/1907.11692).
We also found that a 30% masking ratio (rather than the standard 15%) led to slight accuracy improvements in both pretraining MLM and downstream GLUE performance.
We therefore included this simple change as part of our MosaicBERT training recipe. Recent studies have also found that this simple
change can lead to downstream improvements [Wettig et al. 2022](https://arxiv.org/abs/2202.08005).

3. **Bfloat16 Precision**: We use [bf16 (bfloat16) mixed precision training](https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus) for all the models, where a matrix multiplication layer uses bf16
for the multiplication and 32-bit IEEE floating point for gradient accumulation. We found this to be more stable than using float16 mixed precision.

4. **Vocab Size as a Multiple of 64**: We increased the vocab size to be a multiple of 8 as well as 64 (i.e. from 30,522 to 30,528).
This small constraint is something of [a magic trick among ML practitioners](https://twitter.com/karpathy/status/1621578354024677377), and leads to a throughput speedup (see the one-line padding calculation after this list).

5. **Hyperparameters**: For all models, we use Decoupled AdamW with Beta_1=0.9 and Beta_2=0.98, and a weight decay value of 1.0e-5.
The learning rate schedule begins with a warmup to a maximum learning rate of 5.0e-4 followed by a linear decay to zero.
Warmup lasted for 6% of the full training duration. Global batch size was set to 4096 with a microbatch size of 128; at this global batch size, the full pretraining duration of 286,720,000 samples corresponds to 70,000 batches.
We set the maximum sequence length during pretraining to 128, and we used the standard embedding dimension of 768.
For MosaicBERT, we applied 0.1 dropout to the feedforward layers but no dropout to the FlashAttention module, as this was not possible with the OpenAI Triton implementation.
Full configuration details for pretraining MosaicBERT-Base can be found in the configuration yamls [in the mosaicml/examples repo here](https://github.com/mosaicml/examples/tree/main/bert/yamls/main); a minimal sketch of the optimizer and learning rate schedule also follows this list.
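
For item 4 above, the padded vocabulary size is just the original BERT vocab size rounded up to the nearest multiple of 64:

```python
import math

orig_vocab_size = 30522
padded_vocab_size = math.ceil(orig_vocab_size / 64) * 64  # also a multiple of 8
print(padded_vocab_size)  # 30528
```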
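
And for item 5, a minimal sketch of the optimizer and learning rate schedule, assuming MosaicML's Composer library (class names and signatures may differ across Composer versions; the configuration yamls linked above are authoritative):

```python
import torch.nn as nn
from composer.optim import DecoupledAdamW
from composer.optim.scheduler import LinearWithWarmupScheduler

model = nn.Linear(768, 768)  # stand-in for the MosaicBERT model being pretrained

optimizer = DecoupledAdamW(
    model.parameters(),
    lr=5.0e-4,            # maximum learning rate
    betas=(0.9, 0.98),
    weight_decay=1.0e-5,
)
# Warm up for 6% of the training duration, then decay linearly to zero.
scheduler = LinearWithWarmupScheduler(t_warmup='0.06dur', alpha_f=0.0)
```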

## Intended uses & limitations

This model is intended to be finetuned on downstream tasks.

## Citation

Please cite this model using the following format:

```
@online{Portes2023MosaicBERT,
    author  = {Jacob Portes and Alex Trott and Daniel King and Sam Havens},
    title   = {MosaicBERT: Pretraining BERT from Scratch for \$20},
    year    = {2023},
    url     = {https://www.mosaicml.com/blog/mosaicbert},
    note    = {Accessed: 2023-03-28}, % change this date
    urldate = {2023-03-28} % change this date
}
```