TheBloke committed on
Commit
bc2f454
1 Parent(s): a505914

Add config.json, Llama modelling code and monkey patch

Files changed (4)
  1. README.md +21 -210
  2. config.json +7 -2
  3. llama_rope_scaled_monkey_patch.py +65 -0
  4. modelling_llama.py +894 -0
README.md CHANGED
@@ -1,12 +1,6 @@
1
  ---
2
  inference: false
3
  license: other
4
- datasets:
5
- - databricks/databricks-dolly-15k
6
- - OpenAssistant/oasst1
7
- - sahil2801/CodeAlpaca-20k
8
- language:
9
- - en
10
  ---
11
 
12
  <!-- header start -->
@@ -23,62 +17,40 @@ language:
23
  </div>
24
  <!-- header end -->
25
 
26
- # Allen AI's Tulu 30B merged with Kaio Ken's SuperHOT 8K - GPTQ
27
 
28
- These files are GPTQ 4bit model files for [Allen AI's Tulu 30B](https://huggingface.co/allenai/tulu-30b) merged with [Kaio Ken's SuperHOT 30B 8K LoRA](https://huggingface.co/kaiokendev/superhot-30b-8k-no-rlhf-test) to produce a model capable of 8K context.
29
 
30
  It is the result of quantising to 4bit using [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa).
31
 
32
- **This is an experimental new GPTQ which offers up to 8K context size**
33
-
34
- The increased context is tested to work with [ExLlama](https://github.com/turboderp/exllama), via the latest release of [text-generation-webui](https://github.com/oobabooga/text-generation-webui).
35
-
36
- It has also been tested from Python code using AutoGPTQ, and `trust_remote_code=True`.
37
-
38
- Code credits:
39
- - Original concept and code for increasing context length: [kaiokendev](https://huggingface.co/kaiokendev)
40
- - Updated Llama modelling code that includes this automatically via trust_remote_code: [emozilla](https://huggingface.co/emozilla).
41
-
42
- Please read carefully below to see how to use it.
43
-
44
- **NOTE**: Using the full 8K context on a 30B model will exceed 24GB VRAM.
45
-
46
- GGML versions are not yet provided, as there is not yet support for SuperHOT in llama.cpp. This is being investigated and will hopefully come soon.
47
-
48
  ## Repositories available
49
 
50
  * [4-bit GPTQ models for GPU inference](https://huggingface.co/TheBloke/Tulu-30B-SuperHOT-8K-GPTQ)
 
51
  * [Unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/TheBloke/Tulu-30B-SuperHOT-8K-fp16)
52
 
53
- GGML quants are not yet provided, as there is not yet support for SuperHOT in llama.cpp. This is being investigated and will hopefully come soon.
54
-
55
  ## How to easily download and use this model in text-generation-webui
56
 
57
  Please make sure you're using the latest version of text-generation-webui
58
 
59
  1. Click the **Model tab**.
60
- 2. Under **Download custom model or LoRA**, enter `TheBloke/Tulu-30B-SuperHOT-8K-GPTQ.
61
  3. Click **Download**.
62
  4. The model will start downloading. Once it's finished it will say "Done"
63
- 5. Untick **Autoload the model**
64
- 6. In the top left, click the refresh icon next to **Model**.
65
- 7. In the **Model** dropdown, choose the model you just downloaded: `Tulu-30B-SuperHOT-8K-GPTQ`
66
- 8. To use the increased context, set the **Loader** to **ExLlama**, set **max_seq_len** to 8192 or 4096, and set **compress_pos_emb** to **4** for 8192 context, or to **2** for 4096 context.
67
- 9. Now click **Save Settings** followed by **Reload**
68
- 10. The model will automatically load, and is now ready for use!
69
- 11. Once you're ready, click the **Text Generation tab** and enter a prompt to get started!
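The two pairings in step 8 (8192 with 4, 4096 with 2) are consistent with dividing the target context by the 2048-token context that `config.json` previously declared. A minimal sketch of that arithmetic follows; the helper name `compress_pos_emb_for` is hypothetical and is not part of text-generation-webui or ExLlama.

```python
BASE_CONTEXT = 2048  # assumption: the original context length (old max_position_embeddings)


def compress_pos_emb_for(max_seq_len: int, base_context: int = BASE_CONTEXT) -> int:
    """Return the compress_pos_emb value that matches a target max_seq_len."""
    if max_seq_len < base_context or max_seq_len % base_context != 0:
        raise ValueError(
            f"max_seq_len should be a multiple of {base_context}, got {max_seq_len}"
        )
    return max_seq_len // base_context


if __name__ == "__main__":
    for target in (4096, 8192):
        print(f"max_seq_len={target} -> compress_pos_emb={compress_pos_emb_for(target)}")
```

Only the two settings named in the instructions (4096 and 8192) are documented as tested; other multiples are untested here.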
70
 
71
- ## How to use this GPTQ model from Python code with AutoGPTQ
72
 
73
- First make sure you have AutoGPTQ and Einops installed:
74
 
75
- ```
76
- pip3 install einops auto-gptq
77
- ```
78
 
79
- Then run the following code. Note that in order to get this to work, `config.json` has been hardcoded to a sequence length of 8192.
80
-
81
- If you want to try 4096 instead to reduce VRAM usage, please manually edit `config.json` to set `max_position_embeddings` to the value you want.
82
 
83
  ```python
84
  from transformers import AutoTokenizer, pipeline, logging
@@ -95,13 +67,11 @@ tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
95
  model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
96
  model_basename=model_basename,
97
  use_safetensors=True,
98
- trust_remote_code=True,
99
- device_map='auto',
100
  use_triton=use_triton,
101
  quantize_config=None)
102
 
103
- model.seqlen = 8192
104
-
105
  # Note: check the prompt template is correct for this model.
106
  prompt = "Tell me about AI"
107
  prompt_template=f'''USER: {prompt}
@@ -132,13 +102,6 @@ pipe = pipeline(
132
  print(pipe(prompt_template)[0]['generated_text'])
133
  ```
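The note before this example says that `config.json` is hardcoded to 8192 and can be edited by hand to set `max_position_embeddings` to 4096. If you would rather script that edit, the following is a minimal sketch using only the standard library. The helper name `set_max_position_embeddings` is hypothetical, and the demo runs against a throwaway file rather than a real model directory.

```python
import json
import tempfile
from pathlib import Path


def set_max_position_embeddings(config_path, value):
    """Rewrite max_position_embeddings in a config.json file and return the new config."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["max_position_embeddings"] = value
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


if __name__ == "__main__":
    # Demo on a temporary file; point this at your downloaded model's config.json instead.
    with tempfile.TemporaryDirectory() as tmp:
        demo = Path(tmp) / "config.json"
        demo.write_text(json.dumps({"max_position_embeddings": 8192}), encoding="utf-8")
        updated = set_max_position_embeddings(demo, 4096)
        print(updated["max_position_embeddings"])
```

All other keys in the file are left as they were; only the one value changes.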
134
 
135
- ## Using other UIs: monkey patch
136
-
137
- Provided in the repo is `llama_rope_scaled_monkey_patch.py`, written by @kaiokendev.
138
-
139
- It can theoretically be added to any Python UI or custom code to enable the same result as `trust_remote_code=True`. I have not tested this, and it should be superseded by using `trust_remote_code=True`, but I include it for completeness and for interest.
140
-
141
-
142
  ## Provided files
143
 
144
  **tulu-30b-superhot-8k-GPTQ-4bit--1g.act.order.safetensors**
@@ -148,9 +111,9 @@ This will work with AutoGPTQ, ExLlama, and CUDA versions of GPTQ-for-LLaMa. Ther
148
  It was created without group_size to lower VRAM requirements, and with --act-order (desc_act) to boost inference accuracy as much as possible.
149
 
150
  * `tulu-30b-superhot-8k-GPTQ-4bit--1g.act.order.safetensors`
151
- * Designed for use with ExLlama with increased context (4096 or 8192)
152
- * Should work with AutoGPTQ in CUDA or Triton modes, but without increased context - TBC.
153
- * Should work with GPTQ-for-LLaMa in CUDA mode, but without increased context - TBC. May have issues with GPTQ-for-LLaMa Triton mode.
154
  * Works with text-generation-webui, including one-click-installers.
155
  * Parameters: Groupsize = -1. Act Order / desc_act = True.
156
 
@@ -182,158 +145,6 @@ Thank you to all my generous patrons and donaters!
182
 
183
  <!-- footer end -->
184
 
185
- # Original model card: Kaio Ken's SuperHOT 30B 8K
186
-
187
-
188
- ### SuperHOT Prototype 2 w/ 8K Context
189
-
190
- This is a second prototype of SuperHOT, this time 30B with 8K context and no RLHF, using the same technique described in [the github blog](https://kaiokendev.github.io/til#extending-context-to-8k).
191
- Tests have shown that the model does indeed leverage the extended context at 8K.
192
-
193
- You will need to **use either the monkeypatch** or, if you are already using the monkeypatch, **change the scaling factor to 0.25 and the maximum sequence length to 8192**
194
-
195
- #### Looking for Merged & Quantized Models?
196
- - 30B 4-bit CUDA: [tmpupload/superhot-30b-8k-4bit-safetensors](https://huggingface.co/tmpupload/superhot-30b-8k-4bit-safetensors)
197
- - 30B 4-bit CUDA 128g: [tmpupload/superhot-30b-8k-4bit-128g-safetensors](https://huggingface.co/tmpupload/superhot-30b-8k-4bit-128g-safetensors)
198
-
199
-
200
- #### Training Details
201
- I trained the LoRA with the following configuration:
202
- - 1200 samples (~400 samples over 2048 sequence length)
203
- - learning rate of 3e-4
204
- - 3 epochs
205
- - The exported modules are:
206
- - q_proj
207
- - k_proj
208
- - v_proj
209
- - o_proj
210
- - no bias
211
- - Rank = 4
212
- - Alpha = 8
213
- - no dropout
214
- - weight decay of 0.1
215
- - AdamW beta1 of 0.9 and beta2 0.99, epsilon of 1e-5
216
- - Trained on 4-bit base model
217
-
218
- # Original model card: Allen AI's Tulu 30B
219
-
220
-
221
- # Tulu 30B
222
-
223
- This model is a 30B LLaMa model finetuned on a mixture of instruction datasets (FLAN V2, CoT, Dolly, Open Assistant 1, GPT4-Alpaca, Code-Alpaca, and ShareGPT).
224
- *Please note this is a model diff - see below for usage instructions*.
225
-
226
- This was trained as part of the paper [How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources](https://arxiv.org/abs/2306.04751).
227
- The codebase used to train and evaluate this model can be found at [https://github.com/allenai/open-instruct](https://github.com/allenai/open-instruct).
228
-
229
- This model is licensed under the AI model license given in LICENSE.txt along with the original Llama license (llama_license.txt).
230
-
231
- ## Usage
232
-
233
- We assume you have access to a LLaMa model in HF format already. You can find details on getting access and converting the model here:
234
- [https://huggingface.co/docs/transformers/main/model_doc/llama](https://huggingface.co/docs/transformers/main/model_doc/llama)
235
-
236
- Clone [https://github.com/allenai/open-instruct](https://github.com/allenai/open-instruct) and install the required dependencies, or just copy `scripts/weight_diff.py`
237
- and install the minimal requirements listed in `weight-diff-requirements.txt`. Then download or clone this model diff to the same machine.
238
 
239
- Then, run:
240
- ```bash
241
- python scripts/weight_diff.py recover --path_raw ${hf_llama_path} --path_tuned ${output_path} --path_diff ${diff_location}
242
- ```
243
-
244
- And you will have a recovered model! Note this takes up a decent amount of RAM, especially for the larger models.
245
-
246
- ## Input Format
247
-
248
- The model is trained to use the following format (note the newlines):
249
- ```
250
- <|user|>
251
- Your message here!
252
- <|assistant|>
253
- ```
254
-
255
- For best results, format all inputs in this manner. **Make sure to include a newline after `<|assistant|>`, this can affect generation quality quite a bit.**
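Since the trailing newline after `<|assistant|>` is easy to drop by accident, it can help to build the prompt in one place. A minimal single-turn sketch follows; the function name `format_tulu_prompt` is hypothetical and not part of the Tulu codebase, and multi-turn formatting is not covered because the card does not describe it.

```python
def format_tulu_prompt(message: str) -> str:
    """Wrap a single user message in the Tulu format, ending with the required newline."""
    return f"<|user|>\n{message}\n<|assistant|>\n"


if __name__ == "__main__":
    # repr() makes the newlines visible so the format can be checked by eye.
    print(repr(format_tulu_prompt("Tell me about AI")))
```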
256
-
257
- ## Performance
258
-
259
- Here is the performance of this model across benchmarks explored in our paper [How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources](https://arxiv.org/abs/2306.04751):
260
-
261
- | MMLU 0-shot | MMLU 5-shot | GSM Direct | GSM CoT | BBH Direct | BBH CoT | TydiQA Gold-Passage | TydiQA Closed-book | Codex-Eval Pass@1 | Codex-Eval Pass@10 | AlpacaFarm vs Davinci-003 | Average |
262
- |:-----------:|:-----------:|:----------:|:-------:|:----------:|:-------:|:-------------------:|:------------------:|:-----------------:|:------------------:|:-------------------------:|---------|
263
- | 57.7 | 58.4 | 6.0 | 51.0 | 45.8 | 48.7 | 58.2 | 12.3 | 25.4 | 46.0 | 63.5 | 44.7 |
264
-
265
- If you use this model, please cite our work, the llama paper, and the original datasets:
266
-
267
- ```
268
- @misc{wang2023far,
269
- title={How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources},
270
- author={Yizhong Wang and Hamish Ivison and Pradeep Dasigi and Jack Hessel and Tushar Khot and Khyathi Raghavi Chandu and David Wadden and Kelsey MacMillan and Noah A. Smith and Iz Beltagy and Hannaneh Hajishirzi},
271
- year={2023},
272
- eprint={2306.04751},
273
- archivePrefix={arXiv},
274
- primaryClass={cs.CL}
275
- }
276
- ```
277
-
278
- ```
279
- @misc{touvron2023llama,
280
- title={LLaMA: Open and Efficient Foundation Language Models},
281
- author={Hugo Touvron and Thibaut Lavril and Gautier Izacard and Xavier Martinet and Marie-Anne Lachaux and Timothée Lacroix and Baptiste Rozière and Naman Goyal and Eric Hambro and Faisal Azhar and Aurelien Rodriguez and Armand Joulin and Edouard Grave and Guillaume Lample},
282
- year={2023},
283
- eprint={2302.13971},
284
- archivePrefix={arXiv},
285
- primaryClass={cs.CL}
286
- }
287
- ```
288
-
289
- ```
290
- @misc{dolly,
291
- author = {Databricks},
292
- title = {Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM},
293
- year = {2023},
294
- publisher = {GitHub},
295
- journal = {GitHub repository},
296
- howpublished = {Blog post},
297
- url = {https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm}
298
- }
299
- ```
300
-
301
- ```
302
- @article{longpre2023flan,
303
- title={The Flan Collection: Designing Data and Methods for Effective Instruction Tuning},
304
- author={Longpre, Shayne and Hou, Le and Vu, Tu and Webson, Albert and Chung, Hyung Won and Tay, Yi and Zhou, Denny and Le, Quoc V and Zoph, Barret and Wei, Jason and others},
305
- journal={arXiv preprint arXiv:2301.13688},
306
- year={2023}
307
- }
308
- ```
309
-
310
- ```
311
- @misc{köpf2023openassistant,
312
- title={OpenAssistant Conversations -- Democratizing Large Language Model Alignment},
313
- author={Andreas Köpf and Yannic Kilcher and Dimitri von Rütte and Sotiris Anagnostidis and Zhi-Rui Tam and Keith Stevens and Abdullah Barhoum and Nguyen Minh Duc and Oliver Stanley and Richárd Nagyfi and Shahul ES and Sameer Suri and David Glushkov and Arnav Dantuluri and Andrew Maguire and Christoph Schuhmann and Huu Nguyen and Alexander Mattick},
314
- year={2023},
315
- eprint={2304.07327},
316
- archivePrefix={arXiv},
317
- primaryClass={cs.CL}
318
- }
319
- ```
320
-
321
- ```
322
- @article{peng2023instruction,
323
- title={Instruction Tuning with GPT-4},
324
- author={Peng, Baolin and Li, Chunyuan and He, Pengcheng and Galley, Michel and Gao, Jianfeng},
325
- journal={arXiv preprint arXiv:2304.03277},
326
- year={2023}
327
- }
328
- ```
329
-
330
- ```
331
- @misc{codealpaca,
332
- author = {Sahil Chaudhary},
333
- title = {Code Alpaca: An Instruction-following LLaMA model for code generation},
334
- year = {2023},
335
- publisher = {GitHub},
336
- journal = {GitHub repository},
337
- howpublished = {\url{https://github.com/sahil280114/codealpaca}},
338
- }
339
- ```
 
1
  ---
2
  inference: false
3
  license: other
4
  ---
5
 
6
  <!-- header start -->
 
17
  </div>
18
  <!-- header end -->
19
 
20
+ # Panchovix's merge of Tulu 30B and SuperHOT 8K GPTQ
21
 
22
+ These files are GPTQ 4bit model files for [Panchovix's merge of Tulu 30B and SuperHOT 8K](https://huggingface.co/Panchovix/tulu-30B-SuperHOT-8k).
23
 
24
  It is the result of quantising to 4bit using [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa).
25
 
26
  ## Repositories available
27
 
28
  * [4-bit GPTQ models for GPU inference](https://huggingface.co/TheBloke/Tulu-30B-SuperHOT-8K-GPTQ)
29
+ * [2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference](https://huggingface.co/none)
30
  * [Unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/TheBloke/Tulu-30B-SuperHOT-8K-fp16)
31
 
 
 
32
  ## How to easily download and use this model in text-generation-webui
33
 
34
  Please make sure you're using the latest version of text-generation-webui
35
 
36
  1. Click the **Model tab**.
37
+ 2. Under **Download custom model or LoRA**, enter `TheBloke/Tulu-30B-SuperHOT-8K-GPTQ`.
38
  3. Click **Download**.
39
  4. The model will start downloading. Once it's finished it will say "Done"
40
+ 5. In the top left, click the refresh icon next to **Model**.
41
+ 6. In the **Model** dropdown, choose the model you just downloaded: `Tulu-30B-SuperHOT-8K-GPTQ`
42
+ 7. The model will automatically load, and is now ready for use!
43
+ 8. If you want any custom settings, set them and then click **Save settings for this model** followed by **Reload the Model** in the top right.
44
+ * Note that you do not need to and should not set manual GPTQ parameters any more. These are set automatically from the file `quantize_config.json`.
45
+ 9. Once you're ready, click the **Text Generation tab** and enter a prompt to get started!
 
46
 
47
+ ## How to use this GPTQ model from Python code
48
 
49
+ First make sure you have [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) installed:
50
 
51
+ `pip install auto-gptq`
 
 
52
 
53
+ Then try the following example code:
 
 
54
 
55
  ```python
56
  from transformers import AutoTokenizer, pipeline, logging
 
67
  model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
68
  model_basename=model_basename,
69
  use_safetensors=True,
70
+ trust_remote_code=False,
71
+ device="cuda:0",
72
  use_triton=use_triton,
73
  quantize_config=None)
74
 
 
 
75
  # Note: check the prompt template is correct for this model.
76
  prompt = "Tell me about AI"
77
  prompt_template=f'''USER: {prompt}
 
102
  print(pipe(prompt_template)[0]['generated_text'])
103
  ```
104
 
105
  ## Provided files
106
 
107
  **tulu-30b-superhot-8k-GPTQ-4bit--1g.act.order.safetensors**
 
111
  It was created without group_size to lower VRAM requirements, and with --act-order (desc_act) to boost inference accuracy as much as possible.
112
 
113
  * `tulu-30b-superhot-8k-GPTQ-4bit--1g.act.order.safetensors`
114
+ * Works with AutoGPTQ in CUDA or Triton modes.
115
+ * LLaMa models also work with [ExLlama](https://github.com/turboderp/exllama), which usually provides much higher performance, and uses less VRAM, than AutoGPTQ.
116
+ * Works with GPTQ-for-LLaMa in CUDA mode. May have issues with GPTQ-for-LLaMa Triton mode.
117
  * Works with text-generation-webui, including one-click-installers.
118
  * Parameters: Groupsize = -1. Act Order / desc_act = True.
119
 
 
145
 
146
  <!-- footer end -->
147
 
148
+ # Original model card: Panchovix's merge of Tulu 30B and SuperHOT 8K
149
 
150
+ [tulu-30b](https://huggingface.co/allenai/tulu-30b) merged with kaiokendev's [33b SuperHOT 8k LoRA](https://huggingface.co/kaiokendev/superhot-30b-8k-no-rlhf-test), without quant. (Full FP16 model)
config.json CHANGED
@@ -3,13 +3,18 @@
3
  "architectures": [
4
  "LlamaForCausalLM"
5
  ],
6
  "bos_token_id": 1,
7
  "eos_token_id": 2,
8
  "hidden_act": "silu",
9
  "hidden_size": 6656,
10
  "initializer_range": 0.02,
11
  "intermediate_size": 17920,
12
- "max_position_embeddings": 2048,
13
  "model_type": "llama",
14
  "num_attention_heads": 52,
15
  "num_hidden_layers": 60,
@@ -20,4 +25,4 @@
20
  "transformers_version": "4.30.0.dev0",
21
  "use_cache": true,
22
  "vocab_size": 32001
23
- }
 
3
  "architectures": [
4
  "LlamaForCausalLM"
5
  ],
6
+ "auto_map": {
7
+ "AutoModel": "modelling_llama.LlamaModel",
8
+ "AutoModelForCausalLM": "modelling_llama.LlamaForCausalLM",
9
+ "AutoModelForSequenceClassification": "modelling_llama.LlamaForSequenceClassification"
10
+ },
11
  "bos_token_id": 1,
12
  "eos_token_id": 2,
13
  "hidden_act": "silu",
14
  "hidden_size": 6656,
15
  "initializer_range": 0.02,
16
  "intermediate_size": 17920,
17
+ "max_position_embeddings": 8192,
18
  "model_type": "llama",
19
  "num_attention_heads": 52,
20
  "num_hidden_layers": 60,
 
25
  "transformers_version": "4.30.0.dev0",
26
  "use_cache": true,
27
  "vocab_size": 32001
28
+ }
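The new `max_position_embeddings` value matters beyond the config file: in `modelling_llama.py`, `LlamaAttention` derives its rotary scale as `2048 / self.max_position_embeddings`. The sketch below reproduces only that arithmetic on a parsed config fragment; the function name `rope_scale_from_config` and the inline JSON fragment are illustrative, not part of the commit.

```python
import json

# Illustrative fragment; the real config.json contains many more keys.
CONFIG_FRAGMENT = '{"max_position_embeddings": 8192, "model_type": "llama"}'


def rope_scale_from_config(config: dict, base_context: int = 2048) -> float:
    """Mirror the modelling code's formula: scale = 2048 / max_position_embeddings."""
    return base_context / config["max_position_embeddings"]


if __name__ == "__main__":
    config = json.loads(CONFIG_FRAGMENT)
    print(rope_scale_from_config(config))  # 2048 / 8192
```

With 8192 this gives 0.25, the same value the monkey patch hardcodes as `1 / 4`; setting the value back to 2048 gives 1.0, which leaves positions unscaled.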
llama_rope_scaled_monkey_patch.py ADDED
@@ -0,0 +1,65 @@
+ import torch
+ import transformers
+ import transformers.models.llama.modeling_llama
+ from einops import rearrange
+ import random
+
+ # This monkey patch file is not needed if using ExLlama, or if using `trust_remote_code=True`
+
+ class ScaledRotaryEmbedding(torch.nn.Module):
+     def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None):
+         super().__init__()
+         inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float().to(device) / dim))
+         self.register_buffer("inv_freq", inv_freq)
+
+         max_position_embeddings = 8192
+
+         # Build here to make `torch.jit.trace` work.
+         self.max_seq_len_cached = max_position_embeddings
+         t = torch.arange(
+             self.max_seq_len_cached,
+             device=self.inv_freq.device,
+             dtype=self.inv_freq.dtype,
+         )
+
+         self.scale = 1 / 4
+         t *= self.scale
+
+         freqs = torch.einsum("i,j->ij", t, self.inv_freq)
+         # Different from paper, but it uses a different permutation in order to obtain the same calculation
+         emb = torch.cat((freqs, freqs), dim=-1)
+         self.register_buffer(
+             "cos_cached", emb.cos()[None, None, :, :], persistent=False
+         )
+         self.register_buffer(
+             "sin_cached", emb.sin()[None, None, :, :], persistent=False
+         )
+
+     def forward(self, x, seq_len=None):
+         # x: [bs, num_attention_heads, seq_len, head_size]
+         # This `if` block is unlikely to be run after we build sin/cos in `__init__`. Keep the logic here just in case.
+         if seq_len > self.max_seq_len_cached:
+             self.max_seq_len_cached = seq_len
+             t = torch.arange(
+                 self.max_seq_len_cached, device=x.device, dtype=self.inv_freq.dtype
+             )
+             t *= self.scale
+             freqs = torch.einsum("i,j->ij", t, self.inv_freq)
+             # Different from paper, but it uses a different permutation in order to obtain the same calculation
+             emb = torch.cat((freqs, freqs), dim=-1).to(x.device)
+             self.register_buffer(
+                 "cos_cached", emb.cos()[None, None, :, :], persistent=False
+             )
+             self.register_buffer(
+                 "sin_cached", emb.sin()[None, None, :, :], persistent=False
+             )
+         return (
+             self.cos_cached[:, :, :seq_len, ...].to(dtype=x.dtype),
+             self.sin_cached[:, :, :seq_len, ...].to(dtype=x.dtype),
+         )
+
+
+ def replace_llama_rope_with_scaled_rope():
+     transformers.models.llama.modeling_llama.LlamaRotaryEmbedding = (
+         ScaledRotaryEmbedding
+     )
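To make the effect of `self.scale = 1 / 4` concrete, here is a dependency-free sketch of the same angle computation in plain Python, without torch. The helpers `inv_frequencies` and `rotary_angles` are hypothetical names written for illustration; they mirror the `inv_freq` and `t *= self.scale` lines in the patch for a single position. The head size of 128 is taken from `config.json` (`hidden_size` 6656 divided by 52 attention heads).

```python
from typing import List


def inv_frequencies(dim: int, base: float = 10000.0) -> List[float]:
    # Mirrors: 1.0 / (base ** (torch.arange(0, dim, 2) / dim))
    return [1.0 / (base ** (i / dim)) for i in range(0, dim, 2)]


def rotary_angles(position: int, dim: int, scale: float = 1.0) -> List[float]:
    # Mirrors: t *= scale, then the outer product of t with inv_freq, for one position.
    t = position * scale
    return [t * f for f in inv_frequencies(dim)]


if __name__ == "__main__":
    head_dim = 128  # 6656 hidden size / 52 heads
    scaled = rotary_angles(8188, head_dim, scale=0.25)
    unscaled = rotary_angles(2047, head_dim)
    # Position 8188 at scale 0.25 lands on the same angles as unscaled position 2047.
    print(scaled == unscaled)
```

Under this scaling, every position below 8192 maps to an effective position below 2048, which is the range the previous `max_position_embeddings` value covered.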
modelling_llama.py ADDED
@@ -0,0 +1,894 @@
+ # coding=utf-8
+ # Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
+ #
+ # This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
+ # and OPT implementations in this library. It has been modified from its
+ # original forms to accommodate minor architectural differences compared
+ # to GPT-NeoX and OPT used by the Meta AI team that trained the model.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ """ PyTorch LLaMA model."""
+ import math
+ from typing import List, Optional, Tuple, Union
+
+ import torch
+ import torch.utils.checkpoint
+ from torch import nn
+ from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
+
+ from transformers.activations import ACT2FN
+ from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast, SequenceClassifierOutputWithPast
+ from transformers.modeling_utils import PreTrainedModel
+ from transformers.utils import add_start_docstrings, add_start_docstrings_to_model_forward, logging, replace_return_docstrings
+ from transformers.models.llama.modeling_llama import LlamaConfig
+
+ logger = logging.get_logger(__name__)
+
+ _CONFIG_FOR_DOC = "LlamaConfig"
+
+
+ # Copied from transformers.models.bart.modeling_bart._make_causal_mask
+ def _make_causal_mask(
+     input_ids_shape: torch.Size, dtype: torch.dtype, device: torch.device, past_key_values_length: int = 0
+ ):
+     """
+     Make causal mask used for bi-directional self-attention.
+     """
+     bsz, tgt_len = input_ids_shape
+     mask = torch.full((tgt_len, tgt_len), torch.tensor(torch.finfo(dtype).min, device=device), device=device)
+     mask_cond = torch.arange(mask.size(-1), device=device)
+     mask.masked_fill_(mask_cond < (mask_cond + 1).view(mask.size(-1), 1), 0)
+     mask = mask.to(dtype)
+
+     if past_key_values_length > 0:
+         mask = torch.cat([torch.zeros(tgt_len, past_key_values_length, dtype=dtype, device=device), mask], dim=-1)
+     return mask[None, None, :, :].expand(bsz, 1, tgt_len, tgt_len + past_key_values_length)
+
+
+ # Copied from transformers.models.bart.modeling_bart._expand_mask
+ def _expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: Optional[int] = None):
+     """
+     Expands attention_mask from `[bsz, seq_len]` to `[bsz, 1, tgt_seq_len, src_seq_len]`.
+     """
+     bsz, src_len = mask.size()
+     tgt_len = tgt_len if tgt_len is not None else src_len
+
+     expanded_mask = mask[:, None, None, :].expand(bsz, 1, tgt_len, src_len).to(dtype)
+
+     inverted_mask = 1.0 - expanded_mask
+
+     return inverted_mask.masked_fill(inverted_mask.to(torch.bool), torch.finfo(dtype).min)
+
+
+ class LlamaRMSNorm(nn.Module):
+     def __init__(self, hidden_size, eps=1e-6):
+         """
+         LlamaRMSNorm is equivalent to T5LayerNorm
+         """
+         super().__init__()
+         self.weight = nn.Parameter(torch.ones(hidden_size))
+         self.variance_epsilon = eps
+
+     def forward(self, hidden_states):
+         input_dtype = hidden_states.dtype
+         variance = hidden_states.to(torch.float32).pow(2).mean(-1, keepdim=True)
+         hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
+
+         return (self.weight * hidden_states).to(input_dtype)
+
+
+ class LlamaRotaryEmbedding(torch.nn.Module):
+     def __init__(self, dim, max_position_embeddings=2048, base=10000, scale=1, device=None):
+         super().__init__()
+         inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float().to(device) / dim))
+         self.register_buffer("inv_freq", inv_freq)
+
+         # Build here to make `torch.jit.trace` work.
+         self.max_seq_len_cached = max_position_embeddings
+         t = torch.arange(self.max_seq_len_cached, device=self.inv_freq.device, dtype=self.inv_freq.dtype)
+
+         self.scale = scale
+         t *= self.scale
+
+         freqs = torch.einsum("i,j->ij", t, self.inv_freq)
+         # Different from paper, but it uses a different permutation in order to obtain the same calculation
+         emb = torch.cat((freqs, freqs), dim=-1)
+         dtype = torch.get_default_dtype()
+         self.register_buffer("cos_cached", emb.cos()[None, None, :, :].to(dtype), persistent=False)
+         self.register_buffer("sin_cached", emb.sin()[None, None, :, :].to(dtype), persistent=False)
+
+     def forward(self, x, seq_len=None):
+         # x: [bs, num_attention_heads, seq_len, head_size]
+         # This `if` block is unlikely to be run after we build sin/cos in `__init__`. Keep the logic here just in case.
+         if seq_len > self.max_seq_len_cached:
+             self.max_seq_len_cached = seq_len
+             t = torch.arange(self.max_seq_len_cached, device=x.device, dtype=self.inv_freq.dtype)
+             freqs = torch.einsum("i,j->ij", t, self.inv_freq)
+             # Different from paper, but it uses a different permutation in order to obtain the same calculation
+             emb = torch.cat((freqs, freqs), dim=-1).to(x.device)
+             self.register_buffer("cos_cached", emb.cos()[None, None, :, :].to(x.dtype), persistent=False)
+             self.register_buffer("sin_cached", emb.sin()[None, None, :, :].to(x.dtype), persistent=False)
+         return (
+             self.cos_cached[:, :, :seq_len, ...].to(dtype=x.dtype),
+             self.sin_cached[:, :, :seq_len, ...].to(dtype=x.dtype),
+         )
+
+
+ def rotate_half(x):
+     """Rotates half the hidden dims of the input."""
+     x1 = x[..., : x.shape[-1] // 2]
+     x2 = x[..., x.shape[-1] // 2 :]
+     return torch.cat((-x2, x1), dim=-1)
+
+
+ def apply_rotary_pos_emb(q, k, cos, sin, position_ids):
+     # The first two dimensions of cos and sin are always 1, so we can `squeeze` them.
+     cos = cos.squeeze(1).squeeze(0)  # [seq_len, dim]
+     sin = sin.squeeze(1).squeeze(0)  # [seq_len, dim]
+     cos = cos[position_ids].unsqueeze(1)  # [bs, 1, seq_len, dim]
+     sin = sin[position_ids].unsqueeze(1)  # [bs, 1, seq_len, dim]
+     q_embed = (q * cos) + (rotate_half(q) * sin)
+     k_embed = (k * cos) + (rotate_half(k) * sin)
+     return q_embed, k_embed
+
+
+ class LlamaMLP(nn.Module):
+     def __init__(
+         self,
+         hidden_size: int,
+         intermediate_size: int,
+         hidden_act: str,
+     ):
+         super().__init__()
+         self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
+         self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)
+         self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
+         self.act_fn = ACT2FN[hidden_act]
+
+     def forward(self, x):
+         return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
+
+
+ class LlamaAttention(nn.Module):
+     """Multi-headed attention from 'Attention Is All You Need' paper"""
+
+     def __init__(self, config: LlamaConfig):
+         super().__init__()
+         self.config = config
+         self.hidden_size = config.hidden_size
+         self.num_heads = config.num_attention_heads
+         self.head_dim = self.hidden_size // self.num_heads
+         self.max_position_embeddings = config.max_position_embeddings
+         self.position_embeddings_scale = 2048 / self.max_position_embeddings
+
+         if (self.head_dim * self.num_heads) != self.hidden_size:
+             raise ValueError(
+                 f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}"
+                 f" and `num_heads`: {self.num_heads})."
+             )
+         self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=False)
+         self.k_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=False)
+         self.v_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=False)
+         self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False)
+         self.rotary_emb = LlamaRotaryEmbedding(self.head_dim, max_position_embeddings=self.max_position_embeddings, scale=self.position_embeddings_scale)
+
+     def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
+         return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous()
+
+     def forward(
+         self,
+         hidden_states: torch.Tensor,
+         attention_mask: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.LongTensor] = None,
+         past_key_value: Optional[Tuple[torch.Tensor]] = None,
+         output_attentions: bool = False,
+         use_cache: bool = False,
+     ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
+         bsz, q_len, _ = hidden_states.size()
+
+         query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
+         key_states = self.k_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
+         value_states = self.v_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
+
+         kv_seq_len = key_states.shape[-2]
+         if past_key_value is not None:
+             kv_seq_len += past_key_value[0].shape[-2]
+         cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
+         query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
+         # [bsz, nh, t, hd]
+
+         if past_key_value is not None:
+             # reuse k, v, self_attention
+             key_states = torch.cat([past_key_value[0], key_states], dim=2)
+             value_states = torch.cat([past_key_value[1], value_states], dim=2)
+
+         past_key_value = (key_states, value_states) if use_cache else None
+
+         attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
+
+         if attn_weights.size() != (bsz, self.num_heads, q_len, kv_seq_len):
+             raise ValueError(
+                 f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is"
+                 f" {attn_weights.size()}"
+             )
+
+         if attention_mask is not None:
+             if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
+                 raise ValueError(
+                     f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}"
+                 )
+             attn_weights = attn_weights + attention_mask
231
+ attn_weights = torch.max(
232
+ attn_weights, torch.tensor(torch.finfo(attn_weights.dtype).min, device=attn_weights.device)
233
+ )
234
+
235
+ # upcast attention to fp32
236
+ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
237
+ attn_output = torch.matmul(attn_weights, value_states)
238
+
239
+ if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
240
+ raise ValueError(
241
+ f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is"
242
+ f" {attn_output.size()}"
243
+ )
244
+
245
+ attn_output = attn_output.transpose(1, 2)
246
+ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
247
+
248
+ attn_output = self.o_proj(attn_output)
249
+
250
+ if not output_attentions:
251
+ attn_weights = None
252
+
253
+ return attn_output, attn_weights, past_key_value
254
+
255
+
256
+ class LlamaDecoderLayer(nn.Module):
257
+ def __init__(self, config: LlamaConfig):
258
+ super().__init__()
259
+ self.hidden_size = config.hidden_size
260
+ self.self_attn = LlamaAttention(config=config)
261
+ self.mlp = LlamaMLP(
262
+ hidden_size=self.hidden_size,
263
+ intermediate_size=config.intermediate_size,
264
+ hidden_act=config.hidden_act,
265
+ )
266
+ self.input_layernorm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
267
+ self.post_attention_layernorm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
268
+
269
+ def forward(
270
+ self,
271
+ hidden_states: torch.Tensor,
272
+ attention_mask: Optional[torch.Tensor] = None,
273
+ position_ids: Optional[torch.LongTensor] = None,
274
+ past_key_value: Optional[Tuple[torch.Tensor]] = None,
275
+ output_attentions: Optional[bool] = False,
276
+ use_cache: Optional[bool] = False,
277
+ ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
278
+ """
279
+ Args:
280
+ hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
281
+ attention_mask (`torch.FloatTensor`, *optional*): attention mask of size
282
+ `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values.
283
+ output_attentions (`bool`, *optional*):
284
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under
285
+ returned tensors for more detail.
286
+ use_cache (`bool`, *optional*):
287
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
288
+ (see `past_key_values`).
289
+ past_key_value (`Tuple(torch.FloatTensor)`, *optional*): cached past key and value projection states
290
+ """
291
+
292
+ residual = hidden_states
293
+
294
+ hidden_states = self.input_layernorm(hidden_states)
295
+
296
+ # Self Attention
297
+ hidden_states, self_attn_weights, present_key_value = self.self_attn(
298
+ hidden_states=hidden_states,
299
+ attention_mask=attention_mask,
300
+ position_ids=position_ids,
301
+ past_key_value=past_key_value,
302
+ output_attentions=output_attentions,
303
+ use_cache=use_cache,
304
+ )
305
+ hidden_states = residual + hidden_states
306
+
307
+ # Fully Connected
308
+ residual = hidden_states
309
+ hidden_states = self.post_attention_layernorm(hidden_states)
310
+ hidden_states = self.mlp(hidden_states)
311
+ hidden_states = residual + hidden_states
312
+
313
+ outputs = (hidden_states,)
314
+
315
+ if output_attentions:
316
+ outputs += (self_attn_weights,)
317
+
318
+ if use_cache:
319
+ outputs += (present_key_value,)
320
+
321
+ return outputs
322
+
323
+
324
+ LLAMA_START_DOCSTRING = r"""
325
+ This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
326
+ library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
327
+ etc.)
328
+
329
+ This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
330
+ Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
331
+ and behavior.
332
+
333
+ Parameters:
334
+ config ([`LlamaConfig`]):
335
+ Model configuration class with all the parameters of the model. Initializing with a config file does not
336
+ load the weights associated with the model, only the configuration. Check out the
337
+ [`~PreTrainedModel.from_pretrained`] method to load the model weights.
338
+ """
339
+
340
+
341
+ @add_start_docstrings(
342
+ "The bare LLaMA Model outputting raw hidden-states without any specific head on top.",
343
+ LLAMA_START_DOCSTRING,
344
+ )
345
+ class LlamaPreTrainedModel(PreTrainedModel):
346
+ config_class = LlamaConfig
347
+ base_model_prefix = "model"
348
+ supports_gradient_checkpointing = True
349
+ _no_split_modules = ["LlamaDecoderLayer"]
350
+ _skip_keys_device_placement = "past_key_values"
351
+ _keys_to_ignore_on_load_unexpected = [r"decoder\.version"]
352
+
353
+ def _init_weights(self, module):
354
+ std = self.config.initializer_range
355
+ if isinstance(module, nn.Linear):
356
+ module.weight.data.normal_(mean=0.0, std=std)
357
+ if module.bias is not None:
358
+ module.bias.data.zero_()
359
+ elif isinstance(module, nn.Embedding):
360
+ module.weight.data.normal_(mean=0.0, std=std)
361
+ if module.padding_idx is not None:
362
+ module.weight.data[module.padding_idx].zero_()
363
+
364
+ def _set_gradient_checkpointing(self, module, value=False):
365
+ if isinstance(module, LlamaModel):
366
+ module.gradient_checkpointing = value
367
+
368
+
369
+ LLAMA_INPUTS_DOCSTRING = r"""
370
+ Args:
371
+ input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
372
+ Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
373
+ it.
374
+
375
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
376
+ [`PreTrainedTokenizer.__call__`] for details.
377
+
378
+ [What are input IDs?](../glossary#input-ids)
379
+ attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
380
+ Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
381
+
382
+ - 1 for tokens that are **not masked**,
383
+ - 0 for tokens that are **masked**.
384
+
385
+ [What are attention masks?](../glossary#attention-mask)
386
+
387
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
388
+ [`PreTrainedTokenizer.__call__`] for details.
389
+
390
+ If `past_key_values` is used, optionally only the last `decoder_input_ids` have to be input (see
391
+ `past_key_values`).
392
+
393
+ If you want to change padding behavior, you should read [`modeling_opt._prepare_decoder_attention_mask`]
394
+ and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more
395
+ information on the default strategy.
396
+
397
+ - 1 indicates the head is **not masked**,
398
+ - 0 indicates the head is **masked**.
399
+ position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
400
+ Indices of positions of each input sequence token in the position embeddings. Selected in the range `[0,
401
+ config.n_positions - 1]`.
402
+
403
+ [What are position IDs?](../glossary#position-ids)
404
+ past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
405
+ Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
406
+ `(batch_size, num_heads, sequence_length, embed_size_per_head)` and 2 additional tensors of shape
407
+ `(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)`.
408
+
409
+ Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
410
+ blocks) that can be used (see `past_key_values` input) to speed up sequential decoding.
411
+
412
+ If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that
413
+ don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all
414
+ `decoder_input_ids` of shape `(batch_size, sequence_length)`.
415
+ inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
416
+ Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
417
+ is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
418
+ model's internal embedding lookup matrix.
419
+ use_cache (`bool`, *optional*):
420
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
421
+ `past_key_values`).
422
+ output_attentions (`bool`, *optional*):
423
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
424
+ tensors for more detail.
425
+ output_hidden_states (`bool`, *optional*):
426
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
427
+ more detail.
428
+ return_dict (`bool`, *optional*):
429
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
430
+ """
431
+
432
+
433
+ @add_start_docstrings(
434
+ "The bare LLaMA Model outputting raw hidden-states without any specific head on top.",
435
+ LLAMA_START_DOCSTRING,
436
+ )
437
+ class LlamaModel(LlamaPreTrainedModel):
438
+ """
439
+ Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`LlamaDecoderLayer`]
440
+
441
+ Args:
442
+ config: LlamaConfig
443
+ """
444
+
445
+ def __init__(self, config: LlamaConfig):
446
+ super().__init__(config)
447
+ self.padding_idx = config.pad_token_id
448
+ self.vocab_size = config.vocab_size
449
+
450
+ self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
451
+ self.layers = nn.ModuleList([LlamaDecoderLayer(config) for _ in range(config.num_hidden_layers)])
452
+ self.norm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
453
+
454
+ self.gradient_checkpointing = False
455
+ # Initialize weights and apply final processing
456
+ self.post_init()
457
+
458
+ def get_input_embeddings(self):
459
+ return self.embed_tokens
460
+
461
+ def set_input_embeddings(self, value):
462
+ self.embed_tokens = value
463
+
464
+ # Copied from transformers.models.bart.modeling_bart.BartDecoder._prepare_decoder_attention_mask
465
+ def _prepare_decoder_attention_mask(self, attention_mask, input_shape, inputs_embeds, past_key_values_length):
466
+ # create causal mask
467
+ # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
468
+ combined_attention_mask = None
469
+ if input_shape[-1] > 1:
470
+ combined_attention_mask = _make_causal_mask(
471
+ input_shape,
472
+ inputs_embeds.dtype,
473
+ device=inputs_embeds.device,
474
+ past_key_values_length=past_key_values_length,
475
+ )
476
+
477
+ if attention_mask is not None:
478
+ # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
479
+ expanded_attn_mask = _expand_mask(attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1]).to(
480
+ inputs_embeds.device
481
+ )
482
+ combined_attention_mask = (
483
+ expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask + combined_attention_mask
484
+ )
485
+
486
+ return combined_attention_mask
487
+
488
+ @add_start_docstrings_to_model_forward(LLAMA_INPUTS_DOCSTRING)
489
+ def forward(
490
+ self,
491
+ input_ids: torch.LongTensor = None,
492
+ attention_mask: Optional[torch.Tensor] = None,
493
+ position_ids: Optional[torch.LongTensor] = None,
494
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
495
+ inputs_embeds: Optional[torch.FloatTensor] = None,
496
+ use_cache: Optional[bool] = None,
497
+ output_attentions: Optional[bool] = None,
498
+ output_hidden_states: Optional[bool] = None,
499
+ return_dict: Optional[bool] = None,
500
+ ) -> Union[Tuple, BaseModelOutputWithPast]:
501
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
502
+ output_hidden_states = (
503
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
504
+ )
505
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
506
+
507
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
508
+
509
+ # retrieve input_ids and inputs_embeds
510
+ if input_ids is not None and inputs_embeds is not None:
511
+ raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
512
+ elif input_ids is not None:
513
+ batch_size, seq_length = input_ids.shape
514
+ elif inputs_embeds is not None:
515
+ batch_size, seq_length, _ = inputs_embeds.shape
516
+ else:
517
+ raise ValueError("You have to specify either input_ids or inputs_embeds")
518
+
519
+ seq_length_with_past = seq_length
520
+ past_key_values_length = 0
521
+
522
+ if past_key_values is not None:
523
+ past_key_values_length = past_key_values[0][0].shape[2]
524
+ seq_length_with_past = seq_length_with_past + past_key_values_length
525
+
526
+ if position_ids is None:
527
+ device = input_ids.device if input_ids is not None else inputs_embeds.device
528
+ position_ids = torch.arange(
529
+ past_key_values_length, seq_length + past_key_values_length, dtype=torch.long, device=device
530
+ )
531
+ position_ids = position_ids.unsqueeze(0).view(-1, seq_length)
532
+ else:
533
+ position_ids = position_ids.view(-1, seq_length).long()
534
+
535
+ if inputs_embeds is None:
536
+ inputs_embeds = self.embed_tokens(input_ids)
537
+ # embed positions
538
+ if attention_mask is None:
539
+ attention_mask = torch.ones(
540
+ (batch_size, seq_length_with_past), dtype=torch.bool, device=inputs_embeds.device
541
+ )
542
+ attention_mask = self._prepare_decoder_attention_mask(
543
+ attention_mask, (batch_size, seq_length), inputs_embeds, past_key_values_length
544
+ )
545
+
546
+ hidden_states = inputs_embeds
547
+
548
+ if self.gradient_checkpointing and self.training:
549
+ if use_cache:
550
+ logger.warning_once(
551
+ "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
552
+ )
553
+ use_cache = False
554
+
555
+ # decoder layers
556
+ all_hidden_states = () if output_hidden_states else None
557
+ all_self_attns = () if output_attentions else None
558
+ next_decoder_cache = () if use_cache else None
559
+
560
+ for idx, decoder_layer in enumerate(self.layers):
561
+ if output_hidden_states:
562
+ all_hidden_states += (hidden_states,)
563
+
564
+ past_key_value = past_key_values[idx] if past_key_values is not None else None
565
+
566
+ if self.gradient_checkpointing and self.training:
567
+
568
+ def create_custom_forward(module):
569
+ def custom_forward(*inputs):
570
+ # None for past_key_value
571
+ return module(*inputs, output_attentions, None)
572
+
573
+ return custom_forward
574
+
575
+ layer_outputs = torch.utils.checkpoint.checkpoint(
576
+ create_custom_forward(decoder_layer),
577
+ hidden_states,
578
+ attention_mask,
579
+ position_ids,
580
+ None,
581
+ )
582
+ else:
583
+ layer_outputs = decoder_layer(
584
+ hidden_states,
585
+ attention_mask=attention_mask,
586
+ position_ids=position_ids,
587
+ past_key_value=past_key_value,
588
+ output_attentions=output_attentions,
589
+ use_cache=use_cache,
590
+ )
591
+
592
+ hidden_states = layer_outputs[0]
593
+
594
+ if use_cache:
595
+ next_decoder_cache += (layer_outputs[2 if output_attentions else 1],)
596
+
597
+ if output_attentions:
598
+ all_self_attns += (layer_outputs[1],)
599
+
600
+ hidden_states = self.norm(hidden_states)
601
+
602
+ # add hidden states from the last decoder layer
603
+ if output_hidden_states:
604
+ all_hidden_states += (hidden_states,)
605
+
606
+ next_cache = next_decoder_cache if use_cache else None
607
+ if not return_dict:
608
+ return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)
609
+ return BaseModelOutputWithPast(
610
+ last_hidden_state=hidden_states,
611
+ past_key_values=next_cache,
612
+ hidden_states=all_hidden_states,
613
+ attentions=all_self_attns,
614
+ )
615
+
616
+
617
+ class LlamaForCausalLM(LlamaPreTrainedModel):
618
+ _tied_weights_keys = ["lm_head.weight"]
619
+
620
+ def __init__(self, config):
621
+ super().__init__(config)
622
+ self.model = LlamaModel(config)
623
+
624
+ self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
625
+
626
+ # Initialize weights and apply final processing
627
+ self.post_init()
628
+
629
+ def get_input_embeddings(self):
630
+ return self.model.embed_tokens
631
+
632
+ def set_input_embeddings(self, value):
633
+ self.model.embed_tokens = value
634
+
635
+ def get_output_embeddings(self):
636
+ return self.lm_head
637
+
638
+ def set_output_embeddings(self, new_embeddings):
639
+ self.lm_head = new_embeddings
640
+
641
+ def set_decoder(self, decoder):
642
+ self.model = decoder
643
+
644
+ def get_decoder(self):
645
+ return self.model
646
+
647
+ @add_start_docstrings_to_model_forward(LLAMA_INPUTS_DOCSTRING)
648
+ @replace_return_docstrings(output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
649
+ def forward(
650
+ self,
651
+ input_ids: torch.LongTensor = None,
652
+ attention_mask: Optional[torch.Tensor] = None,
653
+ position_ids: Optional[torch.LongTensor] = None,
654
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
655
+ inputs_embeds: Optional[torch.FloatTensor] = None,
656
+ labels: Optional[torch.LongTensor] = None,
657
+ use_cache: Optional[bool] = None,
658
+ output_attentions: Optional[bool] = None,
659
+ output_hidden_states: Optional[bool] = None,
660
+ return_dict: Optional[bool] = None,
661
+ ) -> Union[Tuple, CausalLMOutputWithPast]:
662
+ r"""
663
+ Args:
664
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
665
+ Labels for computing the causal language modeling loss. Indices should either be in `[0, ...,
666
+ config.vocab_size - 1]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
667
+ (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size - 1]`.
668
+
669
+ Returns:
670
+
671
+ Example:
672
+
673
+ ```python
674
+ >>> from transformers import AutoTokenizer, LlamaForCausalLM
675
+
676
+ >>> model = LlamaForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
677
+ >>> tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER)
678
+
679
+ >>> prompt = "Hey, are you conscious? Can you talk to me?"
680
+ >>> inputs = tokenizer(prompt, return_tensors="pt")
681
+
682
+ >>> # Generate
683
+ >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
684
+ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
685
+ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
686
+ ```"""
687
+
688
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
689
+ output_hidden_states = (
690
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
691
+ )
692
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
693
+
694
+ # decoder outputs consist of (dec_features, layer_state, dec_hidden, dec_attn)
695
+ outputs = self.model(
696
+ input_ids=input_ids,
697
+ attention_mask=attention_mask,
698
+ position_ids=position_ids,
699
+ past_key_values=past_key_values,
700
+ inputs_embeds=inputs_embeds,
701
+ use_cache=use_cache,
702
+ output_attentions=output_attentions,
703
+ output_hidden_states=output_hidden_states,
704
+ return_dict=return_dict,
705
+ )
706
+
707
+ hidden_states = outputs[0]
708
+ logits = self.lm_head(hidden_states)
709
+
710
+ loss = None
711
+ if labels is not None:
712
+ # Shift so that tokens < n predict n
713
+ shift_logits = logits[..., :-1, :].contiguous()
714
+ shift_labels = labels[..., 1:].contiguous()
715
+ # Flatten the tokens
716
+ loss_fct = CrossEntropyLoss()
717
+ shift_logits = shift_logits.view(-1, self.config.vocab_size)
718
+ shift_labels = shift_labels.view(-1)
719
+ # Enable model parallelism
720
+ shift_labels = shift_labels.to(shift_logits.device)
721
+ loss = loss_fct(shift_logits, shift_labels)
722
+
723
+ if not return_dict:
724
+ output = (logits,) + outputs[1:]
725
+ return (loss,) + output if loss is not None else output
726
+
727
+ return CausalLMOutputWithPast(
728
+ loss=loss,
729
+ logits=logits,
730
+ past_key_values=outputs.past_key_values,
731
+ hidden_states=outputs.hidden_states,
732
+ attentions=outputs.attentions,
733
+ )
734
+
735
+ def prepare_inputs_for_generation(
736
+ self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs
737
+ ):
738
+ if past_key_values:
739
+ input_ids = input_ids[:, -1:]
740
+
741
+ position_ids = kwargs.get("position_ids", None)
742
+ if attention_mask is not None and position_ids is None:
743
+ # create position_ids on the fly for batch generation
744
+ position_ids = attention_mask.long().cumsum(-1) - 1
745
+ position_ids.masked_fill_(attention_mask == 0, 1)
746
+ if past_key_values:
747
+ position_ids = position_ids[:, -1].unsqueeze(-1)
748
+
749
+ # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
750
+ if inputs_embeds is not None and past_key_values is None:
751
+ model_inputs = {"inputs_embeds": inputs_embeds}
752
+ else:
753
+ model_inputs = {"input_ids": input_ids}
754
+
755
+ model_inputs.update(
756
+ {
757
+ "position_ids": position_ids,
758
+ "past_key_values": past_key_values,
759
+ "use_cache": kwargs.get("use_cache"),
760
+ "attention_mask": attention_mask,
761
+ }
762
+ )
763
+ return model_inputs
764
+
765
+ @staticmethod
766
+ def _reorder_cache(past_key_values, beam_idx):
767
+ reordered_past = ()
768
+ for layer_past in past_key_values:
769
+ reordered_past += (
770
+ tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past),
771
+ )
772
+ return reordered_past
773
+
774
+
775
+ @add_start_docstrings(
776
+ """
777
+ The LLaMA Model transformer with a sequence classification head on top (linear layer).
778
+
779
+ [`LlamaForSequenceClassification`] uses the last token in order to do the classification, as other causal models
780
+ (e.g. GPT-2) do.
781
+
782
+ Since it does classification on the last token, it needs to know the position of the last token. If a
783
+ `pad_token_id` is defined in the configuration, it finds the last token that is not a padding token in each row. If
784
+ no `pad_token_id` is defined, it simply takes the last value in each row of the batch. Since it cannot guess the
785
+ padding tokens when `inputs_embeds` are passed instead of `input_ids`, it does the same (take the last value in
786
+ each row of the batch).
787
+ """,
788
+ LLAMA_START_DOCSTRING,
789
+ )
790
+ class LlamaForSequenceClassification(LlamaPreTrainedModel):
791
+ _keys_to_ignore_on_load_missing = [r"lm_head.weight"]
792
+
793
+ def __init__(self, config):
794
+ super().__init__(config)
795
+ self.num_labels = config.num_labels
796
+ self.model = LlamaModel(config)
797
+ self.score = nn.Linear(config.hidden_size, self.num_labels, bias=False)
798
+
799
+ # Initialize weights and apply final processing
800
+ self.post_init()
801
+
802
+ def get_input_embeddings(self):
803
+ return self.model.embed_tokens
804
+
805
+ def set_input_embeddings(self, value):
806
+ self.model.embed_tokens = value
807
+
808
+ @add_start_docstrings_to_model_forward(LLAMA_INPUTS_DOCSTRING)
809
+ def forward(
810
+ self,
811
+ input_ids: torch.LongTensor = None,
812
+ attention_mask: Optional[torch.Tensor] = None,
813
+ position_ids: Optional[torch.LongTensor] = None,
814
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
815
+ inputs_embeds: Optional[torch.FloatTensor] = None,
816
+ labels: Optional[torch.LongTensor] = None,
817
+ use_cache: Optional[bool] = None,
818
+ output_attentions: Optional[bool] = None,
819
+ output_hidden_states: Optional[bool] = None,
820
+ return_dict: Optional[bool] = None,
821
+ ) -> Union[Tuple, SequenceClassifierOutputWithPast]:
822
+ r"""
823
+ labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
824
+ Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
825
+ config.num_labels - 1]`. If `config.num_labels == 1`, a regression loss is computed (Mean-Square loss); if
826
+ `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
827
+ """
828
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
829
+
830
+ transformer_outputs = self.model(
831
+ input_ids,
832
+ attention_mask=attention_mask,
833
+ position_ids=position_ids,
834
+ past_key_values=past_key_values,
835
+ inputs_embeds=inputs_embeds,
836
+ use_cache=use_cache,
837
+ output_attentions=output_attentions,
838
+ output_hidden_states=output_hidden_states,
839
+ return_dict=return_dict,
840
+ )
841
+ hidden_states = transformer_outputs[0]
842
+ logits = self.score(hidden_states)
843
+
844
+ if input_ids is not None:
845
+ batch_size = input_ids.shape[0]
846
+ else:
847
+ batch_size = inputs_embeds.shape[0]
848
+
849
+ if self.config.pad_token_id is None and batch_size != 1:
850
+ raise ValueError("Cannot handle batch sizes > 1 if no padding token is defined.")
851
+ if self.config.pad_token_id is None:
852
+ sequence_lengths = -1
853
+ else:
854
+ if input_ids is not None:
855
+ sequence_lengths = (torch.ne(input_ids, self.config.pad_token_id).sum(-1) - 1).to(logits.device)
856
+ else:
857
+ sequence_lengths = -1
858
+
859
+ pooled_logits = logits[torch.arange(batch_size, device=logits.device), sequence_lengths]
860
+
861
+ loss = None
862
+ if labels is not None:
863
+ labels = labels.to(logits.device)
864
+ if self.config.problem_type is None:
865
+ if self.num_labels == 1:
866
+ self.config.problem_type = "regression"
867
+ elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
868
+ self.config.problem_type = "single_label_classification"
869
+ else:
870
+ self.config.problem_type = "multi_label_classification"
871
+
872
+ if self.config.problem_type == "regression":
873
+ loss_fct = MSELoss()
874
+ if self.num_labels == 1:
875
+ loss = loss_fct(pooled_logits.squeeze(), labels.squeeze())
876
+ else:
877
+ loss = loss_fct(pooled_logits, labels)
878
+ elif self.config.problem_type == "single_label_classification":
879
+ loss_fct = CrossEntropyLoss()
880
+ loss = loss_fct(pooled_logits.view(-1, self.num_labels), labels.view(-1))
881
+ elif self.config.problem_type == "multi_label_classification":
882
+ loss_fct = BCEWithLogitsLoss()
883
+ loss = loss_fct(pooled_logits, labels)
884
+ if not return_dict:
885
+ output = (pooled_logits,) + transformer_outputs[1:]
886
+ return ((loss,) + output) if loss is not None else output
887
+
888
+ return SequenceClassifierOutputWithPast(
889
+ loss=loss,
890
+ logits=pooled_logits,
891
+ past_key_values=transformer_outputs.past_key_values,
892
+ hidden_states=transformer_outputs.hidden_states,
893
+ attentions=transformer_outputs.attentions,
894
+ )
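
The `position_embeddings_scale = 2048 / self.max_position_embeddings` line in `LlamaAttention.__init__` above is what ties this modelling code to the extended context. The following is a minimal sketch of the idea, not the code from this commit. It assumes the scale is multiplied into the position index before the rotary angles are computed (the `LlamaRotaryEmbedding` implementation is outside this excerpt, so that is an assumption). The helper `rotary_angles`, the head dimension of 128, the RoPE base of 10000 and the `max_position_embeddings` value of 8192 are all illustrative choices.

```python
def rotary_angles(position, head_dim, scale=1.0, base=10000.0):
    """Rotary angles for a single position (hypothetical helper).

    `scale` compresses the position index before it is multiplied by the
    inverse frequencies; this is the assumed behaviour, not confirmed by
    the excerpt above.
    """
    inv_freq = [base ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]
    return [(position * scale) * f for f in inv_freq]


# Assumed config value for an 8K-context model.
max_position_embeddings = 8192

# Same formula as in LlamaAttention.__init__ above.
scale = 2048 / max_position_embeddings

# A position near the end of the extended window...
a = rotary_angles(8188, head_dim=128, scale=scale)
# ...lands on the angles of a position inside the original 2048 window.
b = rotary_angles(2047, head_dim=128)

print(scale)
print(a == b)
```

Under these assumptions the scale is 0.25, so positions across the 8192-token window are squeezed into the 0 to 2047 range that the base model saw during training. With `max_position_embeddings` left at 2048 the scale is 1.0 and the angles are unchanged.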