Add generate module and update README.

#1
Files changed (2)
  1. README.md +12 -37
  2. modeling_openelm.py +3 -3
README.md CHANGED
@@ -8,9 +8,9 @@ license_link: LICENSE
 
 *Sachin Mehta, Mohammad Hossein Sekhavat, Qingqing Cao, Maxwell Horton, Yanzi Jin, Chenfan Sun, Iman Mirzadeh, Mahyar Najibi, Dmitry Belenko, Peter Zatloukal, Mohammad Rastegari*
 
-We introduce **OpenELM**, a family of **Open** **E**fficient **L**anguage **M**odels. OpenELM uses a layer-wise scaling strategy to efficiently allocate parameters within each layer of the transformer model, leading to enhanced accuracy. We pretrained OpenELM models using the [CoreNet](https://github.com/apple/corenet) library. We release both pretrained and instruction tuned models with 270M, 450M, 1.1B and 3B parameters.
+We introduce **OpenELM**, a family of **Open**-source **E**fficient **L**anguage **M**odels. We release both pretrained and instruction tuned models with 270M, 450M, 1.1B and 3B parameters.
 
-Our pre-training dataset contains RefinedWeb, deduplicated PILE, a subset of RedPajama, and a subset of Dolma v1.6, totaling approximately 1.8 trillion tokens. Please check license agreements and terms of these datasets before using them.
+Our pre-training dataset contains RefinedWeb, deduplicated PILE, a subset of RedPajama, and a subset of Dolma v1.6, totaling approximately 1.8 trillion tokens.
 
 
 
@@ -28,11 +28,12 @@ Additional arguments to the hugging face generate function can be passed via `ge
 ```
 python generate_openelm.py --model apple/OpenELM-270M --hf_access_token [HF_ACCESS_TOKEN] --prompt 'Once upon a time there was' --generate_kwargs repetition_penalty=1.2 prompt_lookup_num_tokens=10
 ```
-Alternatively, try model-wise speculative generation with an [assistive model](https://huggingface.co/blog/assisted-generation) by passing a smaller model through the `assistant_model` argument, for example:
+Alternatively, model-wise speculative generation with an [assistive model](https://huggingface.co/blog/assisted-generation) can also be tried by passing a smaller model through the `assistant_model` argument, for example:
 ```
 python generate_openelm.py --model apple/OpenELM-270M --hf_access_token [HF_ACCESS_TOKEN] --prompt 'Once upon a time there was' --generate_kwargs repetition_penalty=1.2 --assistant_model [SMALLER_MODEL]
 ```
 
+
 ## Main Results
 
 ### Zero-Shot
@@ -106,10 +107,9 @@ pip install tokenizers>=0.15.2 transformers>=4.38.2 sentencepiece>=0.2.0
 ```bash
 
 # OpenELM-270M
-hf_model=apple/OpenELM-270M
+hf_model=OpenELM-270M
 
-# this flag is needed because lm-eval-harness set add_bos_token to False by default, but OpenELM uses LLaMA tokenizer which requires add_bos_token to be True
-tokenizer=meta-llama/Llama-2-7b-hf
+# this flag is needed because lm-eval-harness sets add_bos_token to False by default, but OpenELM uses the LLaMA tokenizer, which requires add_bos_token to be True
 add_bos_token=True
 batch_size=1
 
@@ -118,7 +118,7 @@ mkdir lm_eval_output
 shot=0
 task=arc_challenge,arc_easy,boolq,hellaswag,piqa,race,winogrande,sciq,truthfulqa_mc2
 lm_eval --model hf \
-    --model_args pretrained=${hf_model},trust_remote_code=True,add_bos_token=${add_bos_token},tokenizer=${tokenizer} \
+    --model_args pretrained=${hf_model},trust_remote_code=True,add_bos_token=${add_bos_token} \
     --tasks ${task} \
     --device cuda:0 \
    --num_fewshot ${shot} \
@@ -128,7 +128,7 @@ lm_eval --model hf \
 shot=5
 task=mmlu,winogrande
 lm_eval --model hf \
-    --model_args pretrained=${hf_model},trust_remote_code=True,add_bos_token=${add_bos_token},tokenizer=${tokenizer} \
+    --model_args pretrained=${hf_model},trust_remote_code=True,add_bos_token=${add_bos_token} \
     --tasks ${task} \
     --device cuda:0 \
    --num_fewshot ${shot} \
@@ -138,7 +138,7 @@ lm_eval --model hf \
 shot=25
 task=arc_challenge,crows_pairs_english
 lm_eval --model hf \
-    --model_args pretrained=${hf_model},trust_remote_code=True,add_bos_token=${add_bos_token},tokenizer=${tokenizer} \
+    --model_args pretrained=${hf_model},trust_remote_code=True,add_bos_token=${add_bos_token} \
     --tasks ${task} \
     --device cuda:0 \
    --num_fewshot ${shot} \
@@ -148,7 +148,7 @@ lm_eval --model hf \
 shot=10
 task=hellaswag
 lm_eval --model hf \
-    --model_args pretrained=${hf_model},trust_remote_code=True,add_bos_token=${add_bos_token},tokenizer=${tokenizer} \
+    --model_args pretrained=${hf_model},trust_remote_code=True,add_bos_token=${add_bos_token} \
     --tasks ${task} \
     --device cuda:0 \
    --num_fewshot ${shot} \
@@ -160,30 +160,5 @@ lm_eval --model hf \
 
 ## Bias, Risks, and Limitations
 
-The release of OpenELM models aims to empower and enrich the open research community by providing access to state-of-the-art language models. Trained on publicly available datasets, these models are made available without any safety guarantees. Consequently, there exists the possibility of these models producing outputs that are inaccurate, harmful, biased, or objectionable in response to user prompts. Thus, it is imperative for users and developers to undertake thorough safety testing and implement appropriate filtering mechanisms tailored to their specific requirements.
-
-## Citation
-
-If you find our work useful, please cite:
-
-```BibTex
-@article{mehtaOpenELMEfficientLanguage2024,
-    title = {{OpenELM}: {An} {Efficient} {Language} {Model} {Family} with {Open} {Training} and {Inference} {Framework}},
-    shorttitle = {{OpenELM}},
-    url = {https://arxiv.org/abs/2404.14619v1},
-    language = {en},
-    urldate = {2024-04-24},
-    journal = {arXiv.org},
-    author = {Mehta, Sachin and Sekhavat, Mohammad Hossein and Cao, Qingqing and Horton, Maxwell and Jin, Yanzi and Sun, Chenfan and Mirzadeh, Iman and Najibi, Mahyar and Belenko, Dmitry and Zatloukal, Peter and Rastegari, Mohammad},
-    month = apr,
-    year = {2024},
-}
-
-@inproceedings{mehta2022cvnets,
-    author = {Mehta, Sachin and Abdolhosseini, Farzad and Rastegari, Mohammad},
-    title = {CVNets: High Performance Library for Computer Vision},
-    year = {2022},
-    booktitle = {Proceedings of the 30th ACM International Conference on Multimedia},
-    series = {MM '22}
-}
-```
+Our OpenELM models are not trained with any safety guarantees; the model outputs can potentially be inaccurate, harmful, biased, or otherwise objectionable in response to user prompts. Therefore, users and developers should conduct extensive safety testing and apply filtering suited to their specific needs.
+
 
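For reference, the speculative-generation option shown in the README diff above corresponds to the standard Hugging Face assisted-generation API. The snippet below is a minimal sketch, not part of this diff: the model ids (`apple/OpenELM-450M`, `apple/OpenELM-270M`), the prompt, and `HF_ACCESS_TOKEN` are placeholder assumptions, and it presumes access to the gated `meta-llama/Llama-2-7b-hf` tokenizer that OpenELM reuses.

```python
# Minimal sketch of assisted (speculative) generation with transformers.
# Model ids, the prompt, and HF_ACCESS_TOKEN are placeholder assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

token = "HF_ACCESS_TOKEN"  # needed for the gated meta-llama tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", token=token)

# Main model and a smaller assistant model from the same family.
model = AutoModelForCausalLM.from_pretrained("apple/OpenELM-450M", trust_remote_code=True)
assistant = AutoModelForCausalLM.from_pretrained("apple/OpenELM-270M", trust_remote_code=True)

inputs = tokenizer("Once upon a time there was", return_tensors="pt")

# Passing `assistant_model` turns on assisted generation: the assistant drafts
# candidate tokens that the main model then verifies in a single forward pass.
outputs = model.generate(
    **inputs,
    assistant_model=assistant,
    repetition_penalty=1.2,
    max_new_tokens=64,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The `generate_openelm.py` script referenced in the diff exposes the same idea on the command line through its `--assistant_model` flag.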
 
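The `add_bos_token` comment in the hunk above can be checked directly: the Llama-2 tokenizer that OpenELM relies on prepends a BOS token when special tokens are enabled, which is exactly the behaviour the lm-eval-harness flag toggles. A minimal sketch, not part of the diff; the access token is a placeholder:

```python
# Quick check of the BOS behaviour behind the `add_bos_token` flag above.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", token="HF_ACCESS_TOKEN")

with_special = tok("Once upon a time", add_special_tokens=True).input_ids
without_special = tok("Once upon a time", add_special_tokens=False).input_ids

# With special tokens enabled the sequence starts with the BOS id; without, it does not.
print(tok.bos_token_id)                     # 1
print(with_special[0] == tok.bos_token_id)  # True
print(without_special[0] == tok.bos_token_id)  # False
```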
 
 
 
modeling_openelm.py CHANGED
@@ -783,7 +783,7 @@ class OpenELMModel(OpenELMPreTrainedModel):
         )
 
         if self.config._attn_implementation == "sdpa" and attention_mask is not None:
-            # For dynamo, rather use a check on fullgraph=True once this is possible (https://github.com/pytorch/pytorch/pull/120400).
+            # TODO: For dynamo, rather use a check on fullgraph=True once this is possible (https://github.com/pytorch/pytorch/pull/120400).
             is_tracing = (
                 torch.jit.is_tracing()
                 or isinstance(input_tensor, torch.fx.Proxy)
@@ -967,7 +967,7 @@ class OpenELMForCausalLM(OpenELMPreTrainedModel):
             input_ids = input_ids[:, past_length:]
             position_ids = position_ids[:, past_length:]
 
-        # we should only keep a `cache_position` in generate, and do +=1.
+        # TODO @gante we should only keep a `cache_position` in generate, and do +=1.
         # same goes for position ids. Could also help with continued generation.
         cache_position = torch.arange(
             past_length,
@@ -981,7 +981,7 @@ class OpenELMForCausalLM(OpenELMPreTrainedModel):
         else:
             # The `contiguous()` here is necessary to have a static stride during decoding. torchdynamo otherwise
             # recompiles graphs as the stride of the inputs is a guard. Ref: https://github.com/huggingface/transformers/pull/29114
-            # We could use `next_tokens` directly instead.
+            # TODO: use `next_tokens` directly instead.
             model_inputs = {"input_ids": input_ids.contiguous()}
 
         model_inputs.update(
 
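On the `cache_position` TODO in the hunk above: `cache_position` holds the absolute positions of the current input tokens within the KV cache, which is why the comment suggests it could be reduced to a single counter incremented by one per decoding step. A standalone sketch with placeholder values, not code from this file:

```python
import torch

past_length = 7                              # tokens already stored in the KV cache
input_ids = torch.tensor([[311, 892, 278]])  # placeholder ids for the new chunk

# Same shape of computation as in prepare_inputs_for_generation above.
cache_position = torch.arange(past_length, past_length + input_ids.shape[1], dtype=torch.long)
print(cache_position)  # tensor([7, 8, 9])

# During one-token-at-a-time decoding this reduces to a counter that advances by 1.
cache_position = cache_position[-1:] + 1
print(cache_position)  # tensor([10])
```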
 
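The `contiguous()` comment in the last hunk is about stride stability: during generation the full `input_ids` buffer grows every step, a slice of it inherits the parent's row stride, and torchdynamo guards on input strides, so a changing stride forces recompilation. Calling `.contiguous()` re-packs the slice so its stride depends only on its own shape. A standalone sketch of the effect, with placeholder shapes:

```python
import torch

# A (batch, seq) buffer of token ids, as produced during generation.
input_ids = torch.randint(0, 32000, (2, 16))
print(input_ids.stride(), input_ids.is_contiguous())  # (16, 1) True

# Slicing off the already-processed prefix keeps the parent's row stride,
# so the stride would change every step as the buffer grows.
sliced = input_ids[:, 5:]
print(sliced.stride(), sliced.is_contiguous())        # (16, 1) False

# contiguous() copies into a freshly packed tensor with a shape-derived stride.
fixed = sliced.contiguous()
print(fixed.stride(), fixed.is_contiguous())          # (11, 1) True
```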