jacobfulano committed
Commit d6f27fe • Parent(s): 3bb233a
Clarify how to load model in README

README.md CHANGED
@@ -34,42 +34,62 @@ The primary use case of these models is for research on efficient pretraining an
April 2023

## Model Date

April 2023

## Documentation

* [Project Page (mosaicbert.github.io)](https://mosaicbert.github.io)
* [Github (mosaicml/examples/tree/main/examples/benchmarks/bert)](https://github.com/mosaicml/examples/tree/main/examples/benchmarks/bert)
* [Paper (NeurIPS 2023)](https://openreview.net/forum?id=5zipcfLC2Z)
* Colab Tutorials:
  * [MosaicBERT Tutorial Part 1: Load Pretrained Weights and Experiment with Sequence Length Extrapolation Using ALiBi](https://colab.research.google.com/drive/1r0A3QEbu4Nzs2Jl6LaiNoW5EumIVqrGc?usp=sharing)
* [Blog Post (March 2023)](https://www.mosaicml.com/blog/mosaicbert)
## How to use
```python
import torch
import transformers
from transformers import AutoModelForMaskedLM, BertTokenizer, pipeline

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')  # MosaicBERT uses the standard BERT tokenizer

config = transformers.BertConfig.from_pretrained('mosaicml/mosaic-bert-base-seqlen-1024')  # the config needs to be passed in
mosaicbert = AutoModelForMaskedLM.from_pretrained('mosaicml/mosaic-bert-base-seqlen-1024', config=config, trust_remote_code=True)

# To use this model directly for masked language modeling
mosaicbert_classifier = pipeline('fill-mask', model=mosaicbert, tokenizer=tokenizer, device="cpu")
mosaicbert_classifier("I [MASK] to the store yesterday.")
```
Note that the tokenizer for this model is simply the Hugging Face `bert-base-uncased` tokenizer.

In order to take advantage of ALiBi by extrapolating to longer sequence lengths, simply change the `alibi_starting_size` flag in the config file and reload the model.
```python
config = transformers.BertConfig.from_pretrained('mosaicml/mosaic-bert-base-seqlen-1024')
config.alibi_starting_size = 2048  # maximum sequence length updated to 2048 from config default of 1024

mosaicbert = AutoModelForMaskedLM.from_pretrained('mosaicml/mosaic-bert-base-seqlen-1024', config=config, trust_remote_code=True)
```

This simply presets the non-learned linear bias matrix in every attention block to 2048 tokens (note that this particular model was trained with a sequence length of 1024 tokens).
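As a rough illustration of running inference at an extrapolated length, here is a minimal sketch. It reuses the `torch`, `tokenizer`, and `mosaicbert` objects from the snippets above, uses made-up filler text as the long input, and assumes the remote-code model returns a standard masked-LM output with a `logits` field:

```python
# Minimal sketch: feed the extrapolated model a sequence longer than 1024 tokens.
# Filler text is illustrative; any input up to alibi_starting_size (2048) tokens should fit.
long_text = "The quick brown fox jumps over the lazy dog. " * 120  # roughly 1,200 tokens
inputs = tokenizer(long_text + "I [MASK] to the store yesterday.",
                   return_tensors="pt", truncation=True, max_length=2048)

with torch.no_grad():
    logits = mosaicbert(**inputs).logits

# Decode the highest-scoring prediction at the [MASK] position
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
print(tokenizer.decode(logits[0, mask_pos].argmax(dim=-1)))
```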
**To continue MLM pretraining**, follow the [MLM pre-training section of the mosaicml/examples/bert repo](https://github.com/mosaicml/examples/tree/main/examples/bert#mlm-pre-training).

**To fine-tune this model for classification**, follow the [Single-task fine-tuning section of the mosaicml/examples/bert repo](https://github.com/mosaicml/examples/tree/main/examples/bert#single-task-fine-tuning).
### [Update 1/2/2024] Triton Flash Attention with ALiBi

Note that by default, Triton Flash Attention is **not** enabled or required. In order to enable our custom implementation of Triton Flash Attention with ALiBi from March 2023, set `attention_probs_dropout_prob: 0.0`. We are currently working on supporting Flash Attention 2 (see [PR here](https://github.com/mosaicml/examples/pull/440)).
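A minimal sketch of setting that flag when loading the model (this assumes the checkpoint's remote code switches to the Triton Flash Attention path when attention dropout is 0.0, as described above; the variable name `mosaicbert_flash` is illustrative):

```python
import transformers
from transformers import AutoModelForMaskedLM

config = transformers.BertConfig.from_pretrained('mosaicml/mosaic-bert-base-seqlen-1024')
config.attention_probs_dropout_prob = 0.0  # required for the custom Triton Flash Attention path

mosaicbert_flash = AutoModelForMaskedLM.from_pretrained(
    'mosaicml/mosaic-bert-base-seqlen-1024', config=config, trust_remote_code=True)
```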
### Remote Code
This model requires that `trust_remote_code=True` be passed to the `from_pretrained` method. This is because we train using [FlashAttention (Dao et al. 2022)](https://arxiv.org/pdf/2205.14135.pdf), which is not part of the `transformers` library and depends on [Triton](https://github.com/openai/triton) and some custom PyTorch code. Since this involves executing arbitrary code, you should consider passing a git `revision` argument that specifies the exact commit of the code, for example:
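A sketch of what that can look like (the `revision` value below is a placeholder, not an actual commit hash of this repository; substitute the exact commit you have reviewed):

```python
import transformers
from transformers import AutoModelForMaskedLM

config = transformers.BertConfig.from_pretrained('mosaicml/mosaic-bert-base-seqlen-1024')
mosaicbert = AutoModelForMaskedLM.from_pretrained(
    'mosaicml/mosaic-bert-base-seqlen-1024',
    config=config,
    trust_remote_code=True,
    revision='<commit-hash>',  # placeholder: pin to an exact, audited commit of this repo
)
```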