henhenhahi111112 committed
Commit 6b51659 · verified · 1 Parent(s): af6e330

Update README.md

Files changed (1):
  1. README.md +9 -289

README.md CHANGED
@@ -1,290 +1,10 @@
  # OpenLM
-
- ![](/plots/logo.png)
-
- OpenLM is a minimal but performative language modeling (LM) repository, aimed at facilitating research on medium-sized LMs. We have verified the performance of OpenLM up to 7B parameters and 256 GPUs.
- In contrast with other repositories such as Megatron, we depend only on PyTorch, XFormers, or Triton for our core modeling code.
-
- # Contents
- [Release Notes](#release-notes)
- [Quickstart](#quickstart)
- [Setup](#setup)
- [Process training data](#process-training-data)
- [Run training](#run-training)
- [Evaluate Model](#evaluate-model)
- [Generate Text](#generate-text)
- [Pretrained Models](#pretrained-models)
- [Team and Acknowledgements](#team-and-acknowledgements)
-
- # Release Notes
- 09/26/23: Public release and featured on [Laion Blog](https://laion.ai/blog/open-lm/)
- 08/18/23: Updated README.md
- # Quickstart
- Here we'll go over a basic example where we start from a fresh install, download and preprocess some training data, and train a model.
-
- ## Setup
- We require Python >= 3.9 and a current installation of PyTorch, as well as several other packages. The full list of requirements is contained in `requirements.txt` and can be installed in your Python environment via
- ```>>> pip install -r requirements.txt```
- Next, to access `open_lm` everywhere in your virtual environment, install it using pip (from within the top-level GitHub repo):
- ```>>> pip install --editable . ```
- Some considerations:
- - We like [WandB](https://wandb.ai/) and [tensorboard](https://www.tensorflow.org/tensorboard) for logging. We specify how to use these during training below.
-
- ## Process Training Data
- Next you must specify a collection of tokenized data. For the purposes of this example, we will use a recent dump of English Wikipedia, available on HuggingFace. To download this locally, we've included a script located at [open_lm/datapreprocess/wiki_download.py](open_lm/datapreprocess/wiki_download.py). All you have to do is specify an output directory where the raw data should be stored:
- ```
- python open_lm/datapreprocess/wiki_download.py --output-dir path/to/raw_data
- ```
-
- Next, we process our training data by running it through a BPE tokenizer and splitting it into chunks of the appropriate length. By default we use the tokenizer from [GPT-NeoX-20B](https://github.com/EleutherAI/gpt-neox). To do this, use the script `datapreprocess/make_2048.py`:
- ```
- >>> python open_lm/datapreprocess/make_2048.py \
- --input-files path_to_raw_data/*.jsonl \
- --output-dir preproc_data \
- --num-workers 32 \
- --num-consumers 1
- ```
- The `--input-files` argument passes all of its (possibly many) arguments through the Python `glob` module, allowing for wildcards. Optionally, data can be stored in S3 by setting the environment variable `S3_BASE` and passing the flag `--upload-to-s3` to the script. This saves sharded data to the given bucket with a prefix of `S3_BASE`. E.g.
- ```
- >>> export S3_BASE=preproc_data-v1/
- >>> python open_lm/datapreprocess/make_2048.py --upload-to-s3 ... # same arguments as before
- ```
-
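To sanity-check the preprocessing output, the resulting shards can be opened directly with the [webdataset](https://github.com/webdataset/webdataset) package, the same package the training loader uses. The snippet below is a minimal sketch rather than part of the OpenLM tooling; the shard path and the `txt` key (matching the `--data-key txt` flag used in the training commands below) are assumptions based on the examples in this README.

```python
# Minimal sketch, not part of the OpenLM tooling: open one preprocessed shard
# and print a few sample keys to sanity-check the output of make_2048.py.
import webdataset as wds

shard_path = "preproc_data/2048-v1/0/0000000.tar"  # example path; adjust to a shard you created

for i, sample in enumerate(wds.WebDataset(shard_path)):
    # Each sample is a dict: "__key__" identifies the sample, and the remaining
    # keys (such as "txt") map to the raw bytes stored for that entry.
    print(sample["__key__"], sorted(k for k in sample if not k.startswith("__")))
    if i >= 2:
        break
```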
- ## Run Training
- Tokenized data can now be passed to the main training script, `open_lm/main.py`. Distributed computation is handled via `torchrun`, and hyperparameters are specified by a variety of keyword arguments. We highlight several of the most important ones here:
- - `train-data`: location of the sharded tokenized training data. If locally generated and stored, this will point to a directory containing files like `preproc_data/2048-v1/0/XXXXXXX.tar`. Data are processed using the [webdataset](https://github.com/webdataset/webdataset) package, which supports wildcards such as `preproc_data/2048-v1/0/{0000000..0000099}.tar` to select the first 100 `.tar` files.
- - `model`: which model to use. See the table below for valid options and parameter sizes.
- - `train-num-samples`: how many samples to use from the specified training dataset
- - `name`: name of this particular training run for logging purposes
- - `report-to`: if present, can be `wandb`, `tensorboard`, or `all` to send logging information to WandB and/or TensorBoard.
-
-
- Model choices are contained in the following table, where, for instance, `11m` indicates an 11 million parameter model and `1b` indicates a 1 billion parameter model.
- <center>
-
- | Model Name |
- |---------------|
- | `open_lm_11m` |
- | `open_lm_25m` |
- | `open_lm_87m` |
- | `open_lm_160m`|
- | `open_lm_411m`|
- | `open_lm_830m`|
- | `open_lm_1b` |
- | `open_lm_3b` |
- | `open_lm_7b` |
-
- </center>
-
- An example training run can be called as follows:
- ```
- >>> export CUDA_VISIBLE_DEVICES=0,1,2,3
- >>> torchrun --nproc-per-node 4 -m open_lm.main \
- --model open_lm_3b \
- --train-data /preproc_data/shard-{0000000..0000099}.tar \
- --train-num-samples 1000000000 \
- --workers 8 \
- --dataset-resampled \
- --precision amp_bfloat16 \
- --batch-size 8 \
- --grad-checkpointing \
- --log-every-n-steps 100 \
- --grad-clip-norm 1 \
- --data-key txt \
- --lr 3e-4 \
- --fsdp --fsdp-amp \
- --warmup 2000 \
- --wd 0.1 \
- --beta2 0.95 \
- --epochs 100 \
- --report-to wandb \
- --wandb-project-name open_lm_example \
- --name open_lm_ex_$RANDOM \
- --resume latest \
- --logs path/to/logging/dir/
- ```
- Checkpoints and final model weights will be saved to the specified logs directory.
-
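The saved `.pt` files can be opened directly with PyTorch for a quick sanity check. This is a minimal sketch, not part of the OpenLM scripts; the path below is hypothetical and follows the `checkpoints/epoch_N.pt` layout that appears in the usage example at the end of this page, and the exact keys stored depend on the training configuration.

```python
# Minimal sketch: inspect what a training run actually wrote to disk.
import torch

# Hypothetical path following the logs/<run name>/checkpoints/epoch_N.pt layout.
ckpt = torch.load(
    "path/to/logging/dir/open_lm_ex_1234/checkpoints/epoch_1.pt",
    map_location="cpu",
)

# The checkpoint is a plain Python object (typically a dict); print its
# top-level keys to see what was saved for this particular run.
print(type(ckpt))
if isinstance(ckpt, dict):
    print(list(ckpt.keys()))
```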
- During training, the above command will pick shards to train on via sampling with replacement. Training can also be done by picking shards via sampling without replacement. To do this, the input dataset(s) must first be preprocessed using the following command:
- ```
- python -m open_lm.utils.make_wds_manifest --data-dir /preproc_data/
- ```
- This will create a file called `manifest.jsonl` under `/preproc_data`. Training can then be done by sampling without replacement via the following example commands:
- ```
- >>> export CUDA_VISIBLE_DEVICES=0,1,2,3
- >>> torchrun --nproc-per-node 4 -m open_lm.main \
- --model open_lm_3b \
- --dataset-manifest /preproc_data/manifest.jsonl \
- --train-num-samples 1000000000 \
- --workers 8 \
- --precision amp_bfloat16 \
- --batch-size 8 \
- --grad-checkpointing \
- --log-every-n-steps 100 \
- --grad-clip-norm 1 \
- --data-key txt \
- --lr 3e-4 \
- --fsdp --fsdp-amp \
- --warmup 2000 \
- --wd 0.1 \
- --beta2 0.95 \
- --epochs 100 \
- --report-to wandb \
- --wandb-project-name open_lm_example \
- --name open_lm_ex_$RANDOM \
- --resume latest \
- --logs path/to/logging/dir/
- ```
-
- ### Dataset manifest
-
- The manifest created with `open_lm/utils/make_wds_manifest.py` is a `jsonl` file describing the dataset. Each line in this file corresponds to a shard of the dataset and is a `json` object containing two fields:
-
- - `"shard"`: the name of a shard in the dataset.
- - `"num_sequences"`: the number of sequences contained in the shard. Each sequence contains a fixed number of tokens.
-
- This manifest file provides auxiliary information about the dataset, and is assumed to be in the same directory as the shards.
-
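Because the manifest is plain `jsonl`, it is easy to inspect programmatically. The sketch below is illustrative only: it totals the sequences listed in a manifest, assuming the two fields described above and assuming 2048 tokens per sequence (implied by `make_2048.py`; adjust if you chunk differently).

```python
# Illustrative sketch: summarize a dataset from its manifest.jsonl.
import json

TOKENS_PER_SEQUENCE = 2048  # assumption based on make_2048.py preprocessing

num_shards = 0
num_sequences = 0
with open("/preproc_data/manifest.jsonl") as f:
    for line in f:
        entry = json.loads(line)  # one shard per line: {"shard": ..., "num_sequences": ...}
        num_shards += 1
        num_sequences += entry["num_sequences"]

print(f"{num_shards} shards, {num_sequences} sequences, "
      f"~{num_sequences * TOKENS_PER_SEQUENCE:,} tokens")
```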
- ## Evaluate Model
- Once trained, we can evaluate the model. This requires [LLM Foundry](https://github.com/mosaicml/llm-foundry), which can be installed via `pip install llm-foundry`. Next, some configuration needs to be passed to the evaluator: a skeleton of these parameters is located at [eval/in_memory_hf_eval.yaml](eval/in_memory_hf_eval.yaml). Then just run the following script, making sure to point it at the checkpoint of your trained model (and its corresponding config .json file):
- ```
- cd eval
-
- python eval_openlm_ckpt.py \
- --eval-yaml in_memory_hf_eval.yaml \
- --model open_lm_1b \
- --checkpoint /path/to/openlm_checkpoint.pt \
- --positional-embedding-type head_rotary
-
- ```
- Note that `--positional-embedding-type head_rotary` is only necessary if using the pretrained `open_lm_1b` model hosted below. See the discussion in the next section about this.
-
- ## Generate Text
- One can also use a trained model to generate text. This is accessible via the script located at [scripts/generate.py](scripts/generate.py). The parameters are similar to those used in evaluation:
- ```
- cd scripts
-
- python generate.py \
- --model open_lm_1b \
- --checkpoint /path/to/openlm_checkpoint.pt \
- --positional-embedding-type head_rotary \
- --input-text "Please give me a recipe for chocolate chip cookies"
- ```
-
- Again, note that `--positional-embedding-type head_rotary` is only necessary for the pretrained `open_lm_1b` model hosted below.
-
- # Pretrained Models
-
- ## [OpenLM 1B](https://huggingface.co/mlfoundations/open_lm_1B)
- OpenLM 1B is a ~1 billion parameter model trained on a 1.6T-token dataset that consists of a mix of RedPajama, Pile, S2ORC, The Pile of Law, Deepmind Math, and RealNews (the full mixture of training data is described in [more detail here](https://docs.google.com/spreadsheets/d/1YW-_1vGsSPmVtEt2oeeJOecH6dYX2SuEuhOwZyGwy4k/edit?usp=sharing)).
- The model checkpoint can be downloaded from [HuggingFace here](https://huggingface.co/mlfoundations/open_lm_1B/tree/main).
- The script used to train this model (for config-copying purposes) is [located here](https://github.com/mlfoundations/open_lm/blob/main/scripts/train_example.sh).
- Once this checkpoint has been downloaded, you can evaluate it by following the directions in the [Evaluate Model](#evaluate-model) section above and passing `--positional-embedding-type head_rotary` or setting `"positional_embedding_type": "head_rotary"` in the model config (see note below).
-
- Note: We trained this model with rotary embeddings applied to the _head_
- dimension, which is the default in xformers as of 09/01/2023. Since these models
- were trained, we have updated OpenLM to correctly apply the rotary embeddings to
- the sequence dimension (see
- [this issue](https://github.com/mlfoundations/open_lm/issues/4) and [this
- issue](https://github.com/facebookresearch/xformers/issues/841) for details).
- To evaluate these models, ensure you set `"positional_embedding_type": "head_rotary"` in the model config.
-
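If you prefer editing the config file over passing the command-line flag, the change amounts to adding the key from the note above to the checkpoint's config `.json`. The snippet below is a hedged sketch: the config filename is a placeholder, and only the `positional_embedding_type` key and value come from this README.

```python
# Hedged sketch: patch a downloaded model config so rotary embeddings are
# applied to the head dimension, as described in the note above.
import json

config_path = "open_lm_1b.json"  # placeholder: use the config .json shipped with your checkpoint

with open(config_path) as f:
    config = json.load(f)

# Key and value taken from the note above.
config["positional_embedding_type"] = "head_rotary"

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```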
- | **OpenLM-1B** | **250B Tokens** | **500B Tokens** | **750B Tokens** | **1T Tokens** | **1.25T Tokens** | **1.5T Tokens** | **1.6T Tokens** |
- |----------------|-----------------|-----------------|-----------------|---------------|------------------|-----------------|-----------------|
- | | | | | | | | |
- | arc_challenge | 0.27 | 0.28 | 0.29 | 0.28 | 0.29 | 0.31 | 0.31 |
- | arc_easy | 0.49 | 0.50 | 0.51 | 0.53 | 0.54 | 0.56 | 0.56 |
- | boolq | 0.60 | 0.61 | 0.62 | 0.62 | 0.65 | 0.64 | 0.65 |
- | copa | 0.71 | 0.70 | 0.70 | 0.78 | 0.71 | 0.73 | 0.70 |
- | hellaswag | 0.50 | 0.54 | 0.54 | 0.57 | 0.59 | 0.61 | 0.61 |
- | lambada_openai | 0.56 | 0.57 | 0.61 | 0.61 | 0.65 | 0.65 | 0.66 |
- | piqa | 0.70 | 0.70 | 0.71 | 0.72 | 0.73 | 0.74 | 0.74 |
- | triviaqa | | | | | | | |
- | winogrande | 0.55 | 0.57 | 0.58 | 0.59 | 0.61 | 0.60 | 0.60 |
- | MMLU | 0.24 | 0.24 | 0.24 | 0.23 | 0.26 | 0.24 | 0.25 |
- | Jeopardy | 0.01 | 0.02 | 0.01 | 0.01 | 0.04 | 0.09 | 0.10 |
- | Winograd | 0.75 | 0.77 | 0.77 | 0.79 | 0.81 | 0.80 | 0.79 |
- | | | | | | | | |
- | **Average** | **0.49** | **0.50** | **0.51** | **0.52** | **0.53** | **0.54** | **0.54** |
-
-
- | **1B Baselines** | **OPT-1.3B** | **Pythia-1B** | **Neox-1.3B** | **OPT-IML-1.3B** |
- |------------------|-------------:|--------------:|--------------:|-----------------:|
- | arc_challenge | 0.27 | 0.26 | 0.26 | 0.30 |
- | arc_easy | 0.49 | 0.51 | 0.47 | 0.58 |
- | boolq | 0.58 | 0.61 | 0.62 | 0.72 |
- | copa | 0.75 | 0.68 | 0.72 | 0.73 |
- | hellaswag | 0.54 | 0.49 | 0.48 | 0.54 |
- | lambada_openai | 0.59 | 0.58 | 0.57 | 0.57 |
- | piqa | 0.72 | 0.70 | 0.72 | 0.73 |
- | triviaqa | | | | |
- | winogrande | 0.59 | 0.53 | 0.55 | 0.59 |
- | MMLU | 0.25 | 0.26 | 0.26 | 0.30 |
- | Jeopardy | 0.01 | 0.00 | 0.00 | 0.12 |
- | Winograd | 0.74 | 0.71 | 0.75 | 0.73 |
- | **Average** | **0.50** | **0.48** | **0.49** | **0.54** |
-
-
- ## [OpenLM 7B](https://huggingface.co/mlfoundations/open_lm_7B_1.25T)
- OpenLM 7B is not yet done training, but we've released a checkpoint at 1.25T tokens. The details are the same as for OpenLM-1B above, including the note pertaining to rotary embeddings.
-
-
- | **OpenLM-7B** | **275B Tokens** | **500B Tokens** | **675B Tokens** | **775B Tokens** | **1T Tokens** | **1.25T Tokens** | **1.5T Tokens** | **1.6T Tokens** | **LLAMA-7B** | **MPT-7B** |
- |-----------------|-----------------|-----------------|-----------------|-----------------|---------------|------------------|-----------------|-----------------|--------------|------------|
- | arc_challenge | 0.35 | 0.35 | 0.36 | 0.37 | 0.39 | 0.39 | | | 0.41 | 0.39 |
- | arc_easy | 0.60 | 0.61 | 0.62 | 0.62 | 0.63 | 0.66 | | | 0.65 | 0.67 |
- | boolq | 0.67 | 0.66 | 0.69 | 0.69 | 0.70 | 0.70 | | | 0.77 | 0.75 |
- | copa | 0.75 | 0.79 | 0.75 | 0.80 | 0.80 | 0.78 | | | 0.78 | 0.81 |
- | hellaswag | 0.64 | 0.67 | 0.68 | 0.68 | 0.69 | 0.70 | | | 0.75 | 0.76 |
- | lambada_openai | 0.67 | 0.68 | 0.69 | 0.70 | 0.70 | 0.70 | | | 0.74 | 0.70 |
- | piqa | 0.75 | 0.76 | 0.76 | 0.76 | 0.77 | 0.77 | | | 0.79 | 0.80 |
- | triviaqa | | | | | | | | | | |
- | winogrande | 0.62 | 0.65 | 0.65 | 0.65 | 0.67 | 0.67 | | | 0.68 | 0.68 |
- | MMLU-0 shot | 0.25 | 0.25 | 0.27 | 0.27 | 0.28 | 0.30 | | | 0.30 | 0.30 |
- | Jeopardy | 0.15 | 0.18 | 0.23 | 0.22 | 0.16 | 0.21 | | | 0.33 | 0.31 |
- | Winograd | 0.82 | 0.81 | 0.84 | 0.84 | 0.85 | 0.86 | | | 0.81 | 0.88 |
- | | | | | | | | | | | |
- | **Average** | **0.57** | **0.58** | **0.60** | **0.60** | **0.60** | **0.61** | | | **0.64** | **0.64** |
- | **MMLU-5 shot** | | | | | | **0.34** | | | **0.34** | |
-
- # Unit tests
-
- For unit tests, we use `pytest`. Either
-
- ```
- pip install pytest
- ```
- or create the `open_lm_tests` conda environment by running
- ```
- conda env create --file environment-tests.yml
- ```
-
- Tests live in the `tests/` folder.
-
- To run the tests, make sure you are on a machine with a GPU and run:
- ```
- pytest tests/
- ```
-
- # Team and acknowledgements
-
- Team (so far, * = equal contribution): Suchin Gururangan*, Mitchell Wortsman*, Samir Yitzhak Gadre*, Achal Dave*, Maciej Kilian, Weijia Shi, Jean Mercat, Georgios Smyrnis, Gabriel Ilharco, Matt Jordan, Reinhard Heckel, Alex Dimakis, Ali Farhadi, Vaishaal Shankar*, Ludwig Schmidt.
-
- The code is based heavily on [open-clip](https://github.com/mlfoundations/open_clip), developed by a team including Ross Wightman, Romain Beaumont, Cade Gordon, Mehdi Cherti, and Jenia Jitsev, and on [open-flamingo](https://github.com/mlfoundations/open_flamingo), developed by a team including Anas Awadalla and Irena Gao. Additional inspiration is from [lit-llama](https://github.com/Lightning-AI/lit-llama).
- We are grateful to stability.ai for resource support.
- OpenLM is developed by researchers from various affiliations including the [RAIVN Lab](https://raivn.cs.washington.edu/) at the University of Washington, [UWNLP](https://nlp.washington.edu/), [Toyota Research Institute](https://www.tri.global/), [Columbia University](https://www.columbia.edu/), and more.
-
- Citation
- --------
-
- If you use this model in your work, please use the following BibTeX citation:
- ```bibtex
- @misc{open_lm,
- author = {Gururangan, Suchin and Wortsman, Mitchell and Gadre, Samir Yitzhak and Dave, Achal and Kilian, Maciej and Shi, Weijia and Mercat, Jean and Smyrnis, Georgios and Ilharco, Gabriel and Jordan, Matt and Heckel, Reinhard and Dimakis, Alex and Farhadi, Ali and Shankar, Vaishaal and Schmidt, Ludwig},
- title = {{open_lm}: a minimal but performative language modeling (LM) repository},
- year = {2023},
- note = {GitHub repository},
- url = {https://github.com/mlfoundations/open_lm/}
- }
- ```
-
  # OpenLM
+ ## Environment
+ ```shell
+ cd henhenhahimodel
+ pip install -r requirements.txt
+ ```
+ ## Usage
+ ```shell
+ python3 scripts/generate.py --checkpoint logs/test_alpaca_7b_1p25_240612/checkpoints/epoch_1.pt --positional-embedding-type head_rotary --input-text '{"instruction":"Using the provided data, what is the most common pet in this household?","input":"The household has 3 cats, 2 dogs, and 1 rabbit."}'
+ ```
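The same invocation can also be driven from Python, which makes it easier to build the JSON prompt programmatically. This is a minimal sketch rather than part of the repository: it rebuilds the prompt from the usage example above and shells out to `scripts/generate.py` with the same checkpoint and flags.

```python
# Minimal sketch: construct the instruction-style JSON prompt from the usage
# example above and invoke scripts/generate.py with the same arguments.
import json
import subprocess

prompt = {
    "instruction": "Using the provided data, what is the most common pet in this household?",
    "input": "The household has 3 cats, 2 dogs, and 1 rabbit.",
}

subprocess.run(
    [
        "python3",
        "scripts/generate.py",
        "--checkpoint", "logs/test_alpaca_7b_1p25_240612/checkpoints/epoch_1.pt",
        "--positional-embedding-type", "head_rotary",
        "--input-text", json.dumps(prompt),
    ],
    check=True,
)
```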