Jamie@TitanML committed
Commit dade5bc
Parent(s): 418e096

Upload folder using huggingface_hub

README.md ADDED
@@ -0,0 +1,206 @@
1
+ ---
2
+ datasets:
3
+ - LeoLM/OpenSchnabeltier
4
+ - OpenAssistant/OASST-DE
5
+ - FreedomIntelligence/alpaca-gpt4-deutsch
6
+ - FreedomIntelligence/evol-instruct-deutsch
7
+ - LeoLM/German_Poems
8
+ - LeoLM/German_Songs
9
+ language:
10
+ - en
11
+ - de
12
+ library_name: transformers
13
+ pipeline_tag: text-generation
14
+ ---
15
+ # LAION LeoLM: **L**inguistically **E**nhanced **O**pen **L**anguage **M**odel
16
+ Meet LeoLM, the first open and commercially available German Foundation Language Model built on Llama-2.
17
+ Our models extend Llama-2's capabilities into German through continued pretraining on a large corpus of German-language and mostly locality-specific text.
18
+ Thanks to a compute grant at HessianAI's new supercomputer **42**, we release two foundation models trained with 8k context length,
19
+ [`LeoLM/leo-hessianai-7b`](https://huggingface.co/LeoLM/leo-hessianai-7b) and [`LeoLM/leo-hessianai-13b`](https://huggingface.co/LeoLM/leo-hessianai-13b) under the [Llama-2 community license](https://huggingface.co/meta-llama/Llama-2-70b/raw/main/LICENSE.txt) (70b also coming soon! 👀).
20
+ With this release, we hope to bring a new wave of opportunities to German open-source and commercial LLM research and accelerate adoption.
21
+ Read our [blog post]() or our paper (preprint coming soon) for more details!
22
+
23
+ *A project by Björn Plüster and Christoph Schuhmann in collaboration with LAION and HessianAI.*
24
+
25
+ ## LeoLM Chat
26
+ `LeoLM/leo-hessianai-13b-chat` is a German chat model built on our foundation model `LeoLM/leo-hessianai-13b` and finetuned on a selection of German instruction datasets.
27
+ The model performs exceptionally well on writing, explanation and discussion tasks but struggles somewhat with math and advanced reasoning. See our MT-Bench-DE scores:
28
+ ```
29
+ {
30
+ "first_turn": 6.525,
31
+ "second_turn": 5.15,
32
+ "categories": {
33
+ "writing": 6.925,
34
+ "roleplay": 6.7,
35
+ "reasoning": 4.55,
36
+ "math": 3.25,
37
+ "coding": 3.45,
38
+ "extraction": 5.4,
39
+ "stem": 7.55,
40
+ "humanities": 8.875
41
+ },
42
+ "average": 5.8375
43
+ }
44
+ ```
45
+
46
+ ## Model Details
47
+
48
+ - **Finetuned from:** [LeoLM/leo-hessianai-13b](https://huggingface.co/LeoLM/leo-hessianai-13b)
49
+ - **Model type:** Causal decoder-only transformer language model
50
+ - **Language:** English and German
51
+ - **Demo:** [Web Demo]()
52
+ - **License:** [LLAMA 2 COMMUNITY LICENSE AGREEMENT](https://huggingface.co/meta-llama/Llama-2-70b/raw/main/LICENSE.txt)
53
+ - **Contact:** [LAION Discord](https://discord.com/invite/eq3cAMZtCC) or [Björn Plüster](mailto:bjoern.pl@outlook.de)
54
+
55
+
56
+ ## Use in 🤗Transformers
57
+ First install direct dependencies:
58
+ ```bash
59
+ pip install transformers torch sentencepiece
60
+ ```
61
+ If you want faster inference using Flash Attention 2, you need to install these additional dependencies:
62
+ ```bash
63
+ pip install packaging ninja
64
+ pip install flash-attn==v2.1.1 --no-build-isolation
65
+ pip install git+https://github.com/HazyResearch/flash-attention.git@v2.1.1#subdirectory=csrc/rotary
66
+ ```
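+ You can optionally verify that the kernels required by the bundled `modeling_flash_llama.py` import cleanly; this quick sanity check simply repeats the imports the modeling file performs at load time:
+ ```python
+ # Sanity check for the flash-attn 2 and rotary kernels used by modeling_flash_llama.py
+ from flash_attn.flash_attn_interface import flash_attn_kvpacked_func, flash_attn_varlen_kvpacked_func  # noqa: F401
+ from flash_attn.bert_padding import unpad_input, pad_input  # noqa: F401
+ from flash_attn.layers.rotary import apply_rotary_emb_func  # noqa: F401
+ print("flash-attn 2 and rotary kernels are available")
+ ```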
67
+ Then load the model in transformers:
68
+ ```python
69
+ from transformers import pipeline
70
+ import torch
71
+
72
+ system_prompt = """<|im_start|>system
73
+ Dies ist eine Unterhaltung zwischen einem intelligenten, hilfsbereitem KI-Assistenten und einem Nutzer.
74
+ Der Assistent gibt ausführliche, hilfreiche und ehrliche Antworten.<|im_end|>
75
+
76
+ """
77
+ prompt_format = "<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"
78
+ prompt = "Erkläre mir wie die Fahrradwegesituation in Hamburg ist."
79
+
80
+ generator = pipeline(model="LeoLM/leo-hessianai-13b-chat", device="cuda", torch_dtype=torch.float16, trust_remote_code=True)  # trust_remote_code=True to use the bundled flash-attn2 code, False otherwise
81
+ print(generator(system_prompt + prompt_format.format(prompt=prompt), do_sample=True, top_p=0.95, max_length=8192))
82
+ ```
83
+
84
+ "*Hamburg verfügt über ein umfangreiches Netzwerk an Radwegen, das ein effizientes und angenehmes Radfahren in der ganzen Stadt ermöglicht. Die Radwege variieren in Bezug auf ihre Qualität, wobei einige erstklassig mit eigens für Radfahrer reservierten Spuren sind, während andere einfache Fahrradsymbole auf dem Boden haben, die anzeigen, dass Radfahrer abwechselnd mit dem Autoverkehr auf der Straße fahren sollten. Einige Nebenstraßen haben auch spezielle Fahrradspuren, wobei einige mit Bordsteinabsenkungen zur Seite der Autospuren markiert sind. Zusätzlich haben viele Hauptstraßen, insbesondere in Nebenstraßen, fahrradfreundliche Abstände zwischen den geparkten Autos und dem Gehweg, was ein bequemes Fahren auf der Straße ermöglicht. Der Bau von Radschnellwegen, die schnelles und effizientes Radfahren in und aus der Stadt ermöglichen, ist im Gange und wird in den kommenden Jahren fortgesetzt. Insgesamt sind die Radwege in Hamburg weitläufig und gut ausgeschildert, was es zu einem angenehmen Ort macht, um mit dem Fahrrad zu fahren.*"
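+ If you prefer not to use the `pipeline` helper, here is an equivalent lower-level sketch (not an official snippet); it reuses the `system_prompt`, `prompt_format` and `prompt` variables defined above, and the sampling parameters mirror the pipeline call:
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ 
+ model_id = "LeoLM/leo-hessianai-13b-chat"
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     torch_dtype=torch.float16,
+     trust_remote_code=True,  # loads the bundled modeling_flash_llama.py (requires flash-attn2)
+ ).to("cuda")
+ 
+ inputs = tokenizer(system_prompt + prompt_format.format(prompt=prompt), return_tensors="pt").to("cuda")
+ output_ids = model.generate(**inputs, do_sample=True, top_p=0.95, max_new_tokens=512)
+ print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
+ ```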
85
+
86
+ ## Prompting / Prompt Template
87
+
88
+ Prompt dialogue template (ChatML format):
89
+
90
+ ```
91
+ """
92
+ <|im_start|>system
93
+ {system_message}<|im_end|>
94
+ <|im_start|>user
95
+ {prompt}<|im_end|>
96
+ <|im_start|>assistant
97
+ """
98
+ ```
99
+
100
+ The model input can contain multiple conversation turns between user and assistant, e.g.
101
+ ```
102
+ <|im_start|>user
103
+ {prompt 1}<|im_end|>
104
+ <|im_start|>assistant
105
+ {reply 1}<|im_end|>
106
+ <|im_start|>user
107
+ {prompt 2}<|im_end|>
108
+ <|im_start|>assistant
109
+ (...)
110
+ ```
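+ 
+ For multi-turn use, a minimal helper like the following (a sketch, not part of this repository; the function name `build_chatml_prompt` is our own) assembles a prompt string in the ChatML format shown above:
+ ```python
+ def build_chatml_prompt(system_message, turns):
+     """turns: list of (user_msg, assistant_msg) pairs; use None for the reply still to be generated."""
+     parts = [f"<|im_start|>system\n{system_message}<|im_end|>\n"]
+     for user_msg, assistant_msg in turns:
+         parts.append(f"<|im_start|>user\n{user_msg}<|im_end|>\n")
+         if assistant_msg is not None:
+             parts.append(f"<|im_start|>assistant\n{assistant_msg}<|im_end|>\n")
+     # end with an open assistant turn so the model generates the next reply
+     parts.append("<|im_start|>assistant\n")
+     return "".join(parts)
+ 
+ history = [
+     ("Wie ist die Fahrradwegesituation in Hamburg?", "Hamburg verfügt über ein umfangreiches Netzwerk an Radwegen."),
+     ("Und wie sieht es in Berlin aus?", None),
+ ]
+ multi_turn_prompt = build_chatml_prompt(
+     "Dies ist eine Unterhaltung zwischen einem intelligenten, hilfsbereitem KI-Assistenten und einem Nutzer.",
+     history,
+ )
+ ```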
111
+
112
+ ## Ethical Considerations and Limitations
113
+
114
+ LeoLM has been tested in English and German, but this testing has not covered, nor could it cover, all scenarios.
115
+ For these reasons, as with all LLMs, the potential outputs of `LeoLM/leo-hessianai-13b-chat` cannot be predicted
116
+ in advance, and the model may in some instances produce inaccurate, biased or other objectionable responses
117
+ to user prompts. Therefore, before deploying any applications of `LeoLM/leo-hessianai-13b-chat`, developers should
118
+ perform safety testing and tuning tailored to their specific applications of the model.
119
+
120
+ Please see Meta's [Responsible Use Guide](https://ai.meta.com/llama/responsible-use-guide/).
121
+
122
+ ## Finetuning Details
123
+
124
+ | Hyperparameter | Value |
125
+ |---|---|
126
+ | Num epochs | 3 |
127
+ | Examples per epoch | 131214 |
128
+ | Global batch size | 256 |
129
+ | Learning rate | 3e-5 |
130
+ | Warmup steps | 100 |
131
+ | LR scheduler | Cosine |
132
+ | Adam betas | (0.9, 0.95) |
133
+
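+ For reference, a rough sketch (not the actual training script) of how these hyperparameters map onto a standard 🤗 Transformers optimizer and cosine schedule with warmup; `model` stands in for the model being finetuned:
+ ```python
+ import math
+ import torch
+ from transformers import get_cosine_schedule_with_warmup
+ 
+ examples_per_epoch, global_batch_size, num_epochs = 131214, 256, 3
+ steps_per_epoch = math.ceil(examples_per_epoch / global_batch_size)
+ total_steps = steps_per_epoch * num_epochs  # ~1539 optimizer steps
+ 
+ optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5, betas=(0.9, 0.95))
+ scheduler = get_cosine_schedule_with_warmup(
+     optimizer, num_warmup_steps=100, num_training_steps=total_steps
+ )
+ ```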
134
+
135
+ ## Dataset Details
136
+ ```
137
+ ## Stats for 'Subset of OpenAssistant/OASST-DE' (3534 samples (100.0%))
138
+ -----------------
139
+ Accepted: 3534/3534 (100.0%)
140
+ Accepted tokens: 2259302
141
+ Skipped: 0 (0.0%)
142
+ Min tokens per sample: 29
143
+ Max tokens per sample: 2484
144
+ Avg tokens per sample: 639.3044708545557
145
+ -----------------
146
+
147
+ ## Stats for 'Subset of FreedomIntelligence/evol-instruct-deutsch' (57841 samples (100.0%))
148
+ -----------------
149
+ Accepted: 57841/57841 (100.0%)
150
+ Accepted tokens: 42958192
151
+ Skipped: 0 (0.0%)
152
+ Min tokens per sample: 33
153
+ Max tokens per sample: 5507
154
+ Avg tokens per sample: 742.6944900675991
155
+ -----------------
156
+
157
+ ## Stats for 'Subset of FreedomIntelligence/alpaca-gpt4-deutsch' (48969 samples (100.0%))
158
+ -----------------
159
+ Accepted: 48969/48969 (100.0%)
160
+ Accepted tokens: 13372005
161
+ Skipped: 0 (0.0%)
162
+ Min tokens per sample: 19
163
+ Max tokens per sample: 1359
164
+ Avg tokens per sample: 273.07082031489307
165
+ -----------------
166
+
167
+ ## Stats for 'Subset of LeoLM/OpenSchnabeltier' (21314 samples (100.0%))
168
+ -----------------
169
+ Accepted: 21314/21314 (100.0%)
170
+ Accepted tokens: 8134690
171
+ Skipped: 0 (0.0%)
172
+ Min tokens per sample: 25
173
+ Max tokens per sample: 1202
174
+ Avg tokens per sample: 381.65947264708643
175
+ -----------------
176
+
177
+ ## Stats for 'Subset of LeoLM/German_Poems' (490 samples (100.0%))
178
+ -----------------
179
+ Accepted: 490/490 (100.0%)
180
+ Accepted tokens: 618642
181
+ Skipped: 0 (0.0%)
182
+ Min tokens per sample: 747
183
+ Max tokens per sample: 1678
184
+ Avg tokens per sample: 1262.534693877551
185
+ -----------------
186
+
187
+ ## Stats for 'Subset of LeoLM/German_Songs' (392 samples (100.0%))
188
+ -----------------
189
+ Accepted: 392/392 (100.0%)
190
+ Accepted tokens: 187897
191
+ Skipped: 0 (0.0%)
192
+ Min tokens per sample: 231
193
+ Max tokens per sample: 826
194
+ Avg tokens per sample: 479.3290816326531
195
+ -----------------
196
+
197
+ ## Stats for 'total' (132540 samples (100.0%))
198
+ -----------------
199
+ Accepted: 132540/132540 (100.0%)
200
+ Accepted tokens: 67530728
201
+ Skipped: 0 (0.0%)
202
+ Min tokens per sample: 19
203
+ Max tokens per sample: 5507
204
+ Avg tokens per sample: 509.51205673758864
205
+ -----------------
206
+ ```
config.json ADDED
@@ -0,0 +1,32 @@
1
+ {
2
+ "architectures": [
3
+ "LlamaForCausalLM"
4
+ ],
5
+ "auto_map": {
6
+ "AutoModelForCausalLM": "modeling_flash_llama.LlamaForCausalLM"
7
+ },
8
+ "bos_token_id": 1,
9
+ "eos_token_id": 2,
10
+ "hidden_act": "silu",
11
+ "hidden_size": 5120,
12
+ "initializer_range": 0.02,
13
+ "intermediate_size": 13824,
14
+ "max_position_embeddings": 8192,
15
+ "model_type": "llama",
16
+ "num_attention_heads": 40,
17
+ "num_hidden_layers": 40,
18
+ "num_key_value_heads": 40,
19
+ "pad_token_id": 0,
20
+ "pretraining_tp": 1,
21
+ "rms_norm_eps": 1e-05,
22
+ "rope_scaling": {
23
+ "factor": 2.0,
24
+ "type": "linear"
25
+ },
26
+ "rope_theta": 10000.0,
27
+ "tie_word_embeddings": false,
28
+ "torch_dtype": "float16",
29
+ "transformers_version": "4.33.1",
30
+ "use_cache": true,
31
+ "vocab_size": 32128
32
+ }
generation_config.json ADDED
@@ -0,0 +1,6 @@
1
+ {
2
+ "_from_model_config": true,
3
+ "bos_token_id": 1,
4
+ "eos_token_id": 2,
5
+ "transformers_version": "4.33.1"
6
+ }
modeling_flash_llama.py ADDED
@@ -0,0 +1,1010 @@
1
+ # coding=utf-8
2
+ # From https://huggingface.co/togethercomputer/LLaMA-2-7B-32K/blob/main/modeling_flash_llama.py
3
+ # With fix from Alex Birch: https://huggingface.co/togethercomputer/LLaMA-2-7B-32K/discussions/17
4
+ # Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
5
+ #
6
+ # This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
7
+ # and OPT implementations in this library. It has been modified from its
8
+ # original forms to accommodate minor architectural differences compared
9
+ # to GPT-NeoX and OPT used by the Meta AI team that trained the model.
10
+ #
11
+ # Licensed under the Apache License, Version 2.0 (the "License");
12
+ # you may not use this file except in compliance with the License.
13
+ # You may obtain a copy of the License at
14
+ #
15
+ # http://www.apache.org/licenses/LICENSE-2.0
16
+ #
17
+ # Unless required by applicable law or agreed to in writing, software
18
+ # distributed under the License is distributed on an "AS IS" BASIS,
19
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
20
+ # See the License for the specific language governing permissions and
21
+ # limitations under the License.
22
+ """ PyTorch LLaMA model."""
23
+ from typing import List, Optional, Tuple, Union
24
+
25
+ import torch
26
+ import torch.nn.functional as F
27
+ import torch.utils.checkpoint
28
+ from torch import nn
29
+ from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
30
+
31
+ from transformers.activations import ACT2FN
32
+ from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast, SequenceClassifierOutputWithPast
33
+ from transformers.modeling_utils import PreTrainedModel
34
+ from transformers.utils import add_start_docstrings, add_start_docstrings_to_model_forward, logging, replace_return_docstrings
35
+ from transformers.models.llama.configuration_llama import LlamaConfig
36
+
37
+
38
+ try:
39
+ from flash_attn.flash_attn_interface import (
40
+ flash_attn_kvpacked_func,
41
+ flash_attn_varlen_kvpacked_func,
42
+ )
43
+ from flash_attn.bert_padding import unpad_input, pad_input
44
+ flash_attn_v2_installed = True
45
+ print('>>>> Flash Attention installed')
46
+ except ImportError:
47
+ flash_attn_v2_installed = False
48
+ raise ImportError('Please install Flash Attention: `pip install flash-attn --no-build-isolation`')
49
+
50
+ try:
51
+ from flash_attn.layers.rotary import apply_rotary_emb_func
52
+ flash_rope_installed = True
53
+ print('>>>> Flash RoPE installed')
54
+ except ImportError:
55
+ flash_rope_installed = False
56
+ raise ImportError('Please install RoPE kernels: `pip install git+https://github.com/HazyResearch/flash-attention.git#subdirectory=csrc/rotary`')
57
+
58
+
59
+ logger = logging.get_logger(__name__)
60
+
61
+ _CONFIG_FOR_DOC = "LlamaConfig"
62
+
63
+
64
+ # @torch.jit.script
65
+ def rmsnorm_func(hidden_states, weight, variance_epsilon):
66
+ input_dtype = hidden_states.dtype
67
+ hidden_states = hidden_states.to(torch.float32)
68
+ variance = hidden_states.pow(2).mean(-1, keepdim=True)
69
+ hidden_states = hidden_states * torch.rsqrt(variance + variance_epsilon)
70
+ return (weight * hidden_states).to(input_dtype)
71
+
72
+
73
+ class LlamaRMSNorm(nn.Module):
74
+ def __init__(self, hidden_size, eps=1e-6):
75
+ """
76
+ LlamaRMSNorm is equivalent to T5LayerNorm
77
+ """
78
+ super().__init__()
79
+ self.weight = nn.Parameter(torch.ones(hidden_size))
80
+ self.register_buffer(
81
+ "variance_epsilon",
82
+ torch.tensor(eps),
83
+ persistent=False,
84
+ )
85
+
86
+ def forward(self, hidden_states):
87
+ return rmsnorm_func(hidden_states, self.weight, self.variance_epsilon)
88
+
89
+
90
+ class FlashRotaryEmbedding(torch.nn.Module):
91
+ """
92
+ The rotary position embeddings from RoFormer_ (Su et. al).
93
+ A crucial insight from the method is that the query and keys are
94
+ transformed by rotation matrices which depend on the relative positions.
95
+
96
+ Other implementations are available in the Rotary Transformer repo_ and in
97
+ GPT-NeoX_, GPT-NeoX was an inspiration
98
+
99
+ .. _RoFormer: https://arxiv.org/abs/2104.09864
100
+ .. _repo: https://github.com/ZhuiyiTechnology/roformer
101
+ .. _GPT-NeoX: https://github.com/EleutherAI/gpt-neox
102
+
103
+ If scale_base is not None, this implements XPos (Sun et al., https://arxiv.org/abs/2212.10554).
104
+ A recommended value for scale_base is 512: https://github.com/HazyResearch/flash-attention/issues/96
105
+ Reference: https://github.com/sunyt32/torchscale/blob/main/torchscale/component/xpos_relative_position.py
106
+ """
107
+
108
+ def __init__(self, dim: int, base=10000.0, interleaved=False, scale_base=None,
109
+ scaling_factor=1.0, pos_idx_in_fp32=True, device=None):
110
+ """
111
+ interleaved: if True, rotate pairs of even and odd dimensions (GPT-J style) instead
112
+ of 1st half and 2nd half (GPT-NeoX style).
113
+ pos_idx_in_fp32: if True, the position indices [0.0, ..., seqlen - 1] are in fp32,
114
+ otherwise they might be in lower precision.
115
+ This option was added because previously (before 2023-07-02), when we construct
116
+ the position indices, we use the dtype of self.inv_freq. In most cases this would
117
+ be fp32, but if the model is trained in pure bf16 (not mixed precision), then
118
+ self.inv_freq would be bf16, and the position indices are also in bf16.
119
+ Because of the limited precision of bf16 (e.g. 1995.0 is rounded to 2000.0), the
120
+ embeddings for some positions will coincide.
121
+ To maintain compatibility with models previously trained in pure bf16,
122
+ we add this option.
123
+ scaling_factor: RotaryEmbedding extended with linear scaling.
124
+ """
125
+ super().__init__()
126
+ self.dim = dim
127
+ self.base = float(base)
128
+ self.pos_idx_in_fp32 = pos_idx_in_fp32
129
+ # Generate and save the inverse frequency buffer (non trainable)
130
+ inv_freq = self._compute_inv_freq(device)
131
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
132
+ self.interleaved = interleaved
133
+ self.scale_base = scale_base
134
+ self.scaling_factor = scaling_factor
135
+ scale = ((torch.arange(0, dim, 2, device=device, dtype=torch.float32) + 0.4 * dim)
136
+ / (1.4 * dim) if scale_base is not None else None)
137
+ self.register_buffer("scale", scale)
138
+
139
+ self._seq_len_cached = 0
140
+ self._cos_cached = None
141
+ self._sin_cached = None
142
+ self._cos_k_cached = None
143
+ self._sin_k_cached = None
144
+
145
+ def _compute_inv_freq(self, device=None):
146
+ return 1 / (self.base ** (torch.arange(0, self.dim, 2, device=device,
147
+ dtype=torch.float32) / self.dim))
148
+
149
+
150
+ def _update_cos_sin_cache(self, seqlen, device=None, dtype=None):
151
+ # Reset the tables if the sequence length has changed,
152
+ # if we're on a new device (possibly due to tracing for instance),
153
+ # or if we're switching from inference mode to training
154
+ if (seqlen > self._seq_len_cached or self._cos_cached.device != device
155
+ or self._cos_cached.dtype != dtype
156
+ or (self.training and self._cos_cached.is_inference())):
157
+ self._seq_len_cached = seqlen
158
+ # We want fp32 here, not self.inv_freq.dtype, since the model could be loaded in bf16
159
+ # And the output of arange can be quite large, so bf16 would lose a lot of precision.
160
+ # However, for compatibility reason, we add an option to use the dtype of self.inv_freq.
161
+ if self.pos_idx_in_fp32:
162
+ t = torch.arange(seqlen, device=device, dtype=torch.float32)
163
+ t /= self.scaling_factor
164
+ # We want fp32 here as well since inv_freq will be multiplied with t, and the output
165
+ # will be large. Having it in bf16 will lose a lot of precision and cause the
166
+ # cos & sin output to change significantly.
167
+ # We want to recompute self.inv_freq if it was not loaded in fp32
168
+ if self.inv_freq.dtype != torch.float32:
169
+ inv_freq = self.inv_freq.to(torch.float32)
170
+ else:
171
+ inv_freq = self.inv_freq
172
+ else:
173
+ t = torch.arange(seqlen, device=device, dtype=self.inv_freq.dtype)
174
+ t /= self.scaling_factor
175
+ inv_freq = self.inv_freq
176
+ # Don't do einsum, it converts fp32 to fp16 under AMP
177
+ # freqs = torch.einsum("i,j->ij", t, self.inv_freq)
178
+ freqs = torch.outer(t, inv_freq)
179
+ if self.scale is None:
180
+ self._cos_cached = torch.cos(freqs).to(dtype)
181
+ self._sin_cached = torch.sin(freqs).to(dtype)
182
+ else:
183
+ power = ((torch.arange(seqlen, dtype=self.scale.dtype, device=self.scale.device)
184
+ - seqlen // 2) / self.scale_base)
185
+ scale = self.scale.to(device=power.device) ** power.unsqueeze(-1)
186
+ # We want the multiplication by scale to happen in fp32
187
+ self._cos_cached = (torch.cos(freqs) * scale).to(dtype)
188
+ self._sin_cached = (torch.sin(freqs) * scale).to(dtype)
189
+ self._cos_k_cached = (torch.cos(freqs) / scale).to(dtype)
190
+ self._sin_k_cached = (torch.sin(freqs) / scale).to(dtype)
191
+
192
+ def forward(self, q: torch.Tensor, k: torch.Tensor, seqlen_offset: int = 0) -> Tuple[torch.Tensor, torch.Tensor]:
193
+ """
194
+ q: (batch, seqlen, nheads, headdim)
195
+ k: (batch, seqlen, nheads, headdim)
196
+ seqlen_offset: can be used in generation where the qkv being passed in is only the last
197
+ token in the batch.
198
+ """
199
+ self._update_cos_sin_cache(q.shape[1] + seqlen_offset, device=q.device, dtype=q.dtype)
200
+ if self.scale is None:
201
+ return apply_rotary_emb_func(
202
+ q, self._cos_cached[seqlen_offset:], self._sin_cached[seqlen_offset:],
203
+ self.interleaved, True # inplace=True
204
+ ), apply_rotary_emb_func(
205
+ k, self._cos_cached[seqlen_offset:], self._sin_cached[seqlen_offset:],
206
+ self.interleaved, True # inplace=True
207
+ )
208
+ else:
209
+ assert False
210
+
211
+ class LlamaMLP(nn.Module):
212
+ def __init__(self, config):
213
+ super().__init__()
214
+ self.config = config
215
+ self.hidden_size = config.hidden_size
216
+ self.intermediate_size = config.intermediate_size
217
+ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
218
+ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
219
+ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
220
+ self.act_fn = ACT2FN[config.hidden_act]
221
+
222
+ def forward(self, x):
223
+ if self.config.pretraining_tp > 1:
224
+ slice = self.intermediate_size // self.config.pretraining_tp
225
+ gate_proj_slices = self.gate_proj.weight.split(slice, dim=0)
226
+ up_proj_slices = self.up_proj.weight.split(slice, dim=0)
227
+ down_proj_slices = self.down_proj.weight.split(slice, dim=1)
228
+
229
+ gate_proj = torch.cat(
230
+ [F.linear(x, gate_proj_slices[i]) for i in range(self.config.pretraining_tp)], dim=-1
231
+ )
232
+ up_proj = torch.cat([F.linear(x, up_proj_slices[i]) for i in range(self.config.pretraining_tp)], dim=-1)
233
+
234
+ intermediate_states = (self.act_fn(gate_proj) * up_proj).split(slice, dim=2)
235
+ down_proj = [
236
+ F.linear(intermediate_states[i], down_proj_slices[i]) for i in range(self.config.pretraining_tp)
237
+ ]
238
+ down_proj = sum(down_proj)
239
+ else:
240
+ down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
241
+
242
+ return down_proj
243
+
244
+ @torch.jit.script
245
+ def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
246
+ """
247
+ This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
248
+ num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
249
+ """
250
+ batch, slen, _, num_key_value_heads, head_dim = hidden_states.shape
251
+ if n_rep == 1:
252
+ return hidden_states
253
+ hidden_states = hidden_states[:, :, :, :, None, :].expand(batch, slen, 2, num_key_value_heads, n_rep, head_dim)
254
+ return hidden_states.reshape(batch, slen, 2, num_key_value_heads * n_rep, head_dim)
255
+
256
+
257
+ class LlamaAttention(nn.Module):
258
+ """Multi-headed attention from 'Attention Is All You Need' paper"""
259
+
260
+ def __init__(self, config: LlamaConfig):
261
+ super().__init__()
262
+ self.config = config
263
+ self.hidden_size = config.hidden_size
264
+ self.num_heads = config.num_attention_heads
265
+ self.head_dim = self.hidden_size // self.num_heads
266
+ self.num_key_value_heads = config.num_key_value_heads
267
+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads
268
+ self.max_position_embeddings = config.max_position_embeddings
269
+
270
+ if (self.head_dim * self.num_heads) != self.hidden_size:
271
+ raise ValueError(
272
+ f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}"
273
+ f" and `num_heads`: {self.num_heads})."
274
+ )
275
+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=False)
276
+ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=False)
277
+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=False)
278
+ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False)
279
+
280
+ self.register_buffer(
281
+ "norm_factor",
282
+ torch.sqrt(torch.tensor(self.head_dim, dtype=torch.float32)).to(torch.get_default_dtype()),
283
+ persistent=False,
284
+ )
285
+
286
+ if self.config.rope_scaling is None:
287
+ scaling_factor = 1
288
+ else:
289
+ scaling_type = self.config.rope_scaling["type"]
290
+ scaling_factor = self.config.rope_scaling["factor"]
291
+ assert scaling_type == 'linear'
292
+
293
+ self.rotary_emb = FlashRotaryEmbedding(
294
+ self.head_dim, base=10000, interleaved=False, scaling_factor=scaling_factor,
295
+ )
296
+
297
+ def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
298
+ return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous()
299
+
300
+ def forward(
301
+ self,
302
+ hidden_states: torch.Tensor,
303
+ attention_mask: Optional[torch.Tensor] = None,
304
+ position_ids: Optional[torch.LongTensor] = None,
305
+ past_key_value: Optional[Tuple[torch.Tensor]] = None,
306
+ output_attentions: bool = False,
307
+ use_cache: bool = False,
308
+ is_padded_inputs: Optional[bool] = False,
309
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
310
+ bsz, q_len, h_size = hidden_states.size()
311
+
312
+ has_layer_past = past_key_value is not None
313
+
314
+ if has_layer_past:
315
+ past_kv = past_key_value[0]
316
+ past_len = past_key_value[1]
317
+ else:
318
+ past_len = 0
319
+
320
+ if self.config.pretraining_tp > 1:
321
+ key_value_slicing = (self.num_key_value_heads * self.head_dim) // self.config.pretraining_tp
322
+ query_slices = self.q_proj.weight.split(
323
+ (self.num_heads * self.head_dim) // self.config.pretraining_tp, dim=0
324
+ )
325
+ key_slices = self.k_proj.weight.split(key_value_slicing, dim=0)
326
+ value_slices = self.v_proj.weight.split(key_value_slicing, dim=0)
327
+
328
+ q = [F.linear(hidden_states, query_slices[i]) for i in range(self.config.pretraining_tp)]
329
+ q = torch.cat(q, dim=-1)
330
+
331
+ k = [F.linear(hidden_states, key_slices[i]) for i in range(self.config.pretraining_tp)]
332
+ k = torch.cat(k, dim=-1)
333
+
334
+ v = [F.linear(hidden_states, value_slices[i]) for i in range(self.config.pretraining_tp)]
335
+ v = torch.cat(v, dim=-1)
336
+
337
+ else:
338
+ q = self.q_proj(hidden_states)
339
+ k = self.k_proj(hidden_states)
340
+ v = self.v_proj(hidden_states)
341
+
342
+ q = q.view(bsz, q_len, self.num_heads, self.head_dim)
343
+ k = k.view(bsz, q_len, self.num_key_value_heads, self.head_dim)
344
+ v = v.view(bsz, q_len, self.num_key_value_heads, self.head_dim)
345
+
346
+ q, k = self.rotary_emb(q, k, past_len)
347
+
348
+ kv = torch.stack([k, v], 2)
349
+ kv = repeat_kv(kv, self.num_key_value_groups)
350
+
351
+ # Cache QKV values
352
+ if has_layer_past:
353
+ new_len = past_len+q.size(1)
354
+ if new_len > past_kv.size(1):
355
+ past_kv = torch.cat([past_kv, torch.empty(bsz, 256, 2, kv.size(3), kv.size(4), dtype=kv.dtype, device=kv.device)], 1)
356
+ past_kv[:, past_len:new_len] = kv
357
+ kv = past_kv[:, :new_len]
358
+ else:
359
+ past_kv = kv
360
+
361
+ past_key_value = (past_kv, past_len+q.size(1)) if use_cache else None
362
+
363
+ if is_padded_inputs:
364
+
365
+ # varlen, ignore padding tokens, efficient for large batch with many paddings
366
+
367
+ assert attention_mask is not None
368
+
369
+ unpadded_kv, indices_k, cu_seqlens_k, max_seqlen_k = unpad_input(kv, attention_mask)
370
+ unpadded_q, indices_q, cu_seqlens_q, max_seqlen_q = unpad_input(q, attention_mask[:, -q.size(1):])
371
+ attn_outputs = flash_attn_varlen_kvpacked_func(
372
+ unpadded_q, unpadded_kv, cu_seqlens_q, cu_seqlens_k,
373
+ max_seqlen_q, max_seqlen_k,
374
+ dropout_p=0.0, softmax_scale=1.0/self.norm_factor,
375
+ causal=(not has_layer_past), return_attn_probs=output_attentions
376
+ )
377
+
378
+ attn_output = attn_outputs[0] if output_attentions else attn_outputs
379
+ attn_output = pad_input(
380
+ attn_output, indices_q, bsz, q_len
381
+ ).reshape(bsz, q_len, h_size)
382
+ attn_weights = attn_outputs[2] if output_attentions else None
383
+
384
+ else:
385
+
386
+ # no padding tokens, more efficient
387
+
388
+ attn_outputs = flash_attn_kvpacked_func(
389
+ q, kv, dropout_p=0.0, softmax_scale=1.0/self.norm_factor, causal=(not has_layer_past), return_attn_probs=output_attentions)
390
+
391
+ attn_output = attn_outputs[0] if output_attentions else attn_outputs
392
+ attn_output = attn_output.reshape(bsz, q_len, h_size)
393
+ attn_weights = attn_outputs[2] if output_attentions else None
394
+
395
+ if self.config.pretraining_tp > 1:
396
+ attn_output = attn_output.split(self.hidden_size // self.config.pretraining_tp, dim=2)
397
+ o_proj_slices = self.o_proj.weight.split(self.hidden_size // self.config.pretraining_tp, dim=1)
398
+ attn_output = sum([F.linear(attn_output[i], o_proj_slices[i]) for i in range(self.config.pretraining_tp)])
399
+ else:
400
+ attn_output = self.o_proj(attn_output)
401
+
402
+ if not output_attentions:
403
+ attn_weights = None
404
+
405
+ return attn_output, attn_weights, past_key_value
406
+
407
+
408
+ class LlamaDecoderLayer(nn.Module):
409
+ def __init__(self, config: LlamaConfig):
410
+ super().__init__()
411
+ self.hidden_size = config.hidden_size
412
+ self.self_attn = LlamaAttention(config=config)
413
+ self.mlp = LlamaMLP(config)
414
+ self.input_layernorm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
415
+ self.post_attention_layernorm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
416
+
417
+ def forward(
418
+ self,
419
+ hidden_states: torch.Tensor,
420
+ attention_mask: Optional[torch.Tensor] = None,
421
+ position_ids: Optional[torch.LongTensor] = None,
422
+ past_key_value: Optional[Tuple[torch.Tensor]] = None,
423
+ is_padded_inputs: Optional[bool] = False,
424
+ output_attentions: Optional[bool] = False,
425
+ use_cache: Optional[bool] = False,
426
+ ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
427
+ """
428
+ Args:
429
+ hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
430
+ attention_mask (`torch.FloatTensor`, *optional*): attention mask of size
431
+ `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values.
432
+ output_attentions (`bool`, *optional*):
433
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under
434
+ returned tensors for more detail.
435
+ use_cache (`bool`, *optional*):
436
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
437
+ (see `past_key_values`).
438
+ past_key_value (`Tuple(torch.FloatTensor)`, *optional*): cached past key and value projection states
439
+ """
440
+
441
+ residual = hidden_states
442
+
443
+ hidden_states = self.input_layernorm(hidden_states)
444
+
445
+ # Self Attention
446
+ hidden_states, self_attn_weights, present_key_value = self.self_attn(
447
+ hidden_states=hidden_states,
448
+ attention_mask=attention_mask,
449
+ position_ids=position_ids,
450
+ past_key_value=past_key_value,
451
+ output_attentions=output_attentions,
452
+ use_cache=use_cache,
453
+ is_padded_inputs=is_padded_inputs,
454
+ )
455
+ hidden_states = residual + hidden_states
456
+
457
+ # Fully Connected
458
+ residual = hidden_states
459
+ hidden_states = self.post_attention_layernorm(hidden_states)
460
+ hidden_states = self.mlp(hidden_states)
461
+ hidden_states = residual + hidden_states
462
+
463
+ outputs = (hidden_states,)
464
+
465
+ if output_attentions:
466
+ outputs += (self_attn_weights,)
467
+
468
+ if use_cache:
469
+ outputs += (present_key_value,)
470
+
471
+ return outputs
472
+
473
+
474
+ LLAMA_START_DOCSTRING = r"""
475
+ This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
476
+ library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads
477
+ etc.)
478
+
479
+ This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
480
+ Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage
481
+ and behavior.
482
+
483
+ Parameters:
484
+ config ([`LlamaConfig`]):
485
+ Model configuration class with all the parameters of the model. Initializing with a config file does not
486
+ load the weights associated with the model, only the configuration. Check out the
487
+ [`~PreTrainedModel.from_pretrained`] method to load the model weights.
488
+ """
489
+
490
+
491
+ @add_start_docstrings(
492
+ "The bare LLaMA Model outputting raw hidden-states without any specific head on top.",
493
+ LLAMA_START_DOCSTRING,
494
+ )
495
+ class LlamaPreTrainedModel(PreTrainedModel):
496
+ config_class = LlamaConfig
497
+ base_model_prefix = "model"
498
+ supports_gradient_checkpointing = True
499
+ _no_split_modules = ["LlamaDecoderLayer"]
500
+ _skip_keys_device_placement = "past_key_values"
501
+
502
+ def _init_weights(self, module):
503
+ std = self.config.initializer_range
504
+ if isinstance(module, nn.Linear):
505
+ module.weight.data.normal_(mean=0.0, std=std)
506
+ if module.bias is not None:
507
+ module.bias.data.zero_()
508
+ elif isinstance(module, nn.Embedding):
509
+ module.weight.data.normal_(mean=0.0, std=std)
510
+ if module.padding_idx is not None:
511
+ module.weight.data[module.padding_idx].zero_()
512
+
513
+ def _set_gradient_checkpointing(self, module, value=False):
514
+ if isinstance(module, LlamaModel):
515
+ module.gradient_checkpointing = value
516
+
517
+
518
+ LLAMA_INPUTS_DOCSTRING = r"""
519
+ Args:
520
+ input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
521
+ Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
522
+ it.
523
+
524
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
525
+ [`PreTrainedTokenizer.__call__`] for details.
526
+
527
+ [What are input IDs?](../glossary#input-ids)
528
+ attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
529
+ Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
530
+
531
+ - 1 for tokens that are **not masked**,
532
+ - 0 for tokens that are **masked**.
533
+
534
+ [What are attention masks?](../glossary#attention-mask)
535
+
536
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
537
+ [`PreTrainedTokenizer.__call__`] for details.
538
+
539
+ If `past_key_values` is used, optionally only the last `decoder_input_ids` have to be input (see
540
+ `past_key_values`).
541
+
542
+ If you want to change padding behavior, you should read [`modeling_opt._prepare_decoder_attention_mask`]
543
+ and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more
544
+ information on the default strategy.
545
+
546
+ - 1 indicates the head is **not masked**,
547
+ - 0 indicates the head is **masked**.
548
+ position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
549
+ Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
550
+ config.n_positions - 1]`.
551
+
552
+ [What are position IDs?](../glossary#position-ids)
553
+ past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
554
+ Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
555
+ `(batch_size, num_heads, sequence_length, embed_size_per_head)`) and 2 additional tensors of shape
556
+ `(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)`.
557
+
558
+ Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
559
+ blocks) that can be used (see `past_key_values` input) to speed up sequential decoding.
560
+
561
+ If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that
562
+ don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all
563
+ `decoder_input_ids` of shape `(batch_size, sequence_length)`.
564
+ inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
565
+ Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
566
+ is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
567
+ model's internal embedding lookup matrix.
568
+ use_cache (`bool`, *optional*):
569
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
570
+ `past_key_values`).
571
+ output_attentions (`bool`, *optional*):
572
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
573
+ tensors for more detail.
574
+ output_hidden_states (`bool`, *optional*):
575
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
576
+ more detail.
577
+ return_dict (`bool`, *optional*):
578
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
579
+ """
580
+
581
+
582
+ @add_start_docstrings(
583
+ "The bare LLaMA Model outputting raw hidden-states without any specific head on top.",
584
+ LLAMA_START_DOCSTRING,
585
+ )
586
+ class LlamaModel(LlamaPreTrainedModel):
587
+ """
588
+ Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`LlamaDecoderLayer`]
589
+
590
+ Args:
591
+ config: LlamaConfig
592
+ """
593
+
594
+ def __init__(self, config: LlamaConfig):
595
+ super().__init__(config)
596
+ self.padding_idx = config.pad_token_id
597
+ self.vocab_size = config.vocab_size
598
+
599
+ self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
600
+ self.layers = nn.ModuleList([LlamaDecoderLayer(config) for _ in range(config.num_hidden_layers)])
601
+ self.norm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
602
+
603
+ self.gradient_checkpointing = False
604
+ # Initialize weights and apply final processing
605
+ self.post_init()
606
+
607
+ def get_input_embeddings(self):
608
+ return self.embed_tokens
609
+
610
+ def set_input_embeddings(self, value):
611
+ self.embed_tokens = value
612
+
613
+ @add_start_docstrings_to_model_forward(LLAMA_INPUTS_DOCSTRING)
614
+ def forward(
615
+ self,
616
+ input_ids: torch.LongTensor = None,
617
+ attention_mask: Optional[torch.Tensor] = None,
618
+ position_ids: Optional[torch.LongTensor] = None,
619
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
620
+ inputs_embeds: Optional[torch.FloatTensor] = None,
621
+ use_cache: Optional[bool] = None,
622
+ output_attentions: Optional[bool] = None,
623
+ output_hidden_states: Optional[bool] = None,
624
+ return_dict: Optional[bool] = None,
625
+ is_padded_inputs: Optional[bool] = False,
626
+ ) -> Union[Tuple, BaseModelOutputWithPast]:
627
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
628
+ output_hidden_states = (
629
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
630
+ )
631
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
632
+
633
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
634
+
635
+ # retrieve input_ids and inputs_embeds
636
+ if input_ids is not None and inputs_embeds is not None:
637
+ raise ValueError("You cannot specify both decoder_input_ids and decoder_inputs_embeds at the same time")
638
+ elif input_ids is not None:
639
+ batch_size, seq_length = input_ids.shape
640
+ elif inputs_embeds is not None:
641
+ batch_size, seq_length, _ = inputs_embeds.shape
642
+ else:
643
+ raise ValueError("You have to specify either decoder_input_ids or decoder_inputs_embeds")
644
+
645
+ seq_length_with_past = seq_length
646
+ past_key_values_length = 0
647
+
648
+ if past_key_values is not None:
649
+ past_key_values_length = past_key_values[0][0].shape[2]
650
+ seq_length_with_past = seq_length_with_past + past_key_values_length
651
+
652
+ position_ids = None
653
+
654
+ if inputs_embeds is None:
655
+ inputs_embeds = self.embed_tokens(input_ids)
656
+
657
+ hidden_states = inputs_embeds
658
+
659
+ if self.gradient_checkpointing and self.training:
660
+ if use_cache:
661
+ logger.warning_once(
662
+ "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
663
+ )
664
+ use_cache = False
665
+
666
+ # decoder layers
667
+ all_hidden_states = () if output_hidden_states else None
668
+ all_self_attns = () if output_attentions else None
669
+ next_decoder_cache = () if use_cache else None
670
+
671
+ for idx, decoder_layer in enumerate(self.layers):
672
+ if output_hidden_states:
673
+ all_hidden_states += (hidden_states,)
674
+
675
+ past_key_value = past_key_values[idx] if past_key_values is not None else None
676
+
677
+ if self.gradient_checkpointing and self.training:
678
+
679
+ def create_custom_forward(module):
680
+ def custom_forward(*inputs):
681
+ # None for past_key_value
682
+ return module(*inputs, output_attentions, None)
683
+
684
+ return custom_forward
685
+
686
+ layer_outputs = torch.utils.checkpoint.checkpoint(
687
+ create_custom_forward(decoder_layer),
688
+ hidden_states,
689
+ attention_mask,
690
+ position_ids,
691
+ None,
692
+ is_padded_inputs
693
+ )
694
+ else:
695
+ layer_outputs = decoder_layer(
696
+ hidden_states,
697
+ attention_mask=attention_mask,
698
+ position_ids=position_ids,
699
+ past_key_value=past_key_value,
700
+ output_attentions=output_attentions,
701
+ use_cache=use_cache,
702
+ is_padded_inputs=is_padded_inputs,
703
+ )
704
+
705
+ hidden_states = layer_outputs[0]
706
+
707
+ if use_cache:
708
+ next_decoder_cache += (layer_outputs[2 if output_attentions else 1],)
709
+
710
+ if output_attentions:
711
+ all_self_attns += (layer_outputs[1],)
712
+
713
+ hidden_states = self.norm(hidden_states)
714
+
715
+ # add hidden states from the last decoder layer
716
+ if output_hidden_states:
717
+ all_hidden_states += (hidden_states,)
718
+
719
+ next_cache = next_decoder_cache if use_cache else None
720
+ if not return_dict:
721
+ return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)
722
+ return BaseModelOutputWithPast(
723
+ last_hidden_state=hidden_states,
724
+ past_key_values=next_cache,
725
+ hidden_states=all_hidden_states,
726
+ attentions=all_self_attns,
727
+ )
728
+
729
+
730
+ class LlamaForCausalLM(LlamaPreTrainedModel):
731
+ _tied_weights_keys = ["lm_head.weight"]
732
+
733
+ def __init__(self, config):
734
+ super().__init__(config)
735
+ self.model = LlamaModel(config)
736
+ self.vocab_size = config.vocab_size
737
+ self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
738
+
739
+ # Initialize weights and apply final processing
740
+ self.post_init()
741
+
742
+ def get_input_embeddings(self):
743
+ return self.model.embed_tokens
744
+
745
+ def set_input_embeddings(self, value):
746
+ self.model.embed_tokens = value
747
+
748
+ def get_output_embeddings(self):
749
+ return self.lm_head
750
+
751
+ def set_output_embeddings(self, new_embeddings):
752
+ self.lm_head = new_embeddings
753
+
754
+ def set_decoder(self, decoder):
755
+ self.model = decoder
756
+
757
+ def get_decoder(self):
758
+ return self.model
759
+
760
+ @add_start_docstrings_to_model_forward(LLAMA_INPUTS_DOCSTRING)
761
+ @replace_return_docstrings(output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
762
+ def forward(
763
+ self,
764
+ input_ids: torch.LongTensor = None,
765
+ attention_mask: Optional[torch.Tensor] = None,
766
+ position_ids: Optional[torch.LongTensor] = None,
767
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
768
+ inputs_embeds: Optional[torch.FloatTensor] = None,
769
+ labels: Optional[torch.LongTensor] = None,
770
+ use_cache: Optional[bool] = None,
771
+ output_attentions: Optional[bool] = None,
772
+ output_hidden_states: Optional[bool] = None,
773
+ return_dict: Optional[bool] = None,
774
+ is_padded_inputs: Optional[bool] = None,
775
+ ) -> Union[Tuple, CausalLMOutputWithPast]:
776
+ r"""
777
+ Args:
778
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
779
+ Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
780
+ config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
781
+ (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
782
+
783
+ Returns:
784
+
785
+ Example:
786
+
787
+ ```python
788
+ >>> from transformers import AutoTokenizer, LlamaForCausalLM
789
+
790
+ >>> model = LlamaForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
791
+ >>> tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER)
792
+
793
+ >>> prompt = "Hey, are you conscious? Can you talk to me?"
794
+ >>> inputs = tokenizer(prompt, return_tensors="pt")
795
+
796
+ >>> # Generate
797
+ >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
798
+ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
799
+ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
800
+ ```"""
801
+
802
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
803
+ output_hidden_states = (
804
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
805
+ )
806
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
807
+
808
+ is_padded_inputs = ((attention_mask is not None) and (not attention_mask.all().item()))
809
+
810
+ # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
811
+ outputs = self.model(
812
+ input_ids=input_ids,
813
+ attention_mask=attention_mask,
814
+ position_ids=position_ids,
815
+ past_key_values=past_key_values,
816
+ inputs_embeds=inputs_embeds,
817
+ use_cache=use_cache,
818
+ output_attentions=output_attentions,
819
+ output_hidden_states=output_hidden_states,
820
+ return_dict=return_dict,
821
+ is_padded_inputs=is_padded_inputs,
822
+ )
823
+
824
+ hidden_states = outputs[0]
825
+ if self.config.pretraining_tp > 1:
826
+ lm_head_slices = self.lm_head.weight.split(self.vocab_size // self.config.pretraining_tp, dim=0)
827
+ logits = [F.linear(hidden_states, lm_head_slices[i]) for i in range(self.config.pretraining_tp)]
828
+ logits = torch.cat(logits, dim=-1)
829
+ else:
830
+ logits = self.lm_head(hidden_states)
831
+ logits = logits.float()
832
+
833
+ loss = None
834
+ if labels is not None:
835
+ # Shift so that tokens < n predict n
836
+ shift_logits = logits[..., :-1, :].contiguous()
837
+ shift_labels = labels[..., 1:].contiguous()
838
+ # Flatten the tokens
839
+ loss_fct = CrossEntropyLoss()
840
+ shift_logits = shift_logits.view(-1, self.config.vocab_size)
841
+ shift_labels = shift_labels.view(-1)
842
+ # Enable model parallelism
843
+ shift_labels = shift_labels.to(shift_logits.device)
844
+ loss = loss_fct(shift_logits, shift_labels)
845
+
846
+ if not return_dict:
847
+ output = (logits,) + outputs[1:]
848
+ return (loss,) + output if loss is not None else output
849
+
850
+ return CausalLMOutputWithPast(
851
+ loss=loss,
852
+ logits=logits,
853
+ past_key_values=outputs.past_key_values,
854
+ hidden_states=outputs.hidden_states,
855
+ attentions=outputs.attentions,
856
+ )
857
+
858
+ def prepare_inputs_for_generation(
859
+ self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs
860
+ ):
861
+ if past_key_values:
862
+ input_ids = input_ids[:, -1:]
863
+
864
+ position_ids = kwargs.get("position_ids", None)
865
+
866
+ # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
867
+ if inputs_embeds is not None and past_key_values is None:
868
+ model_inputs = {"inputs_embeds": inputs_embeds}
869
+ else:
870
+ model_inputs = {"input_ids": input_ids}
871
+
872
+ model_inputs.update(
873
+ {
874
+ "position_ids": position_ids,
875
+ "past_key_values": past_key_values,
876
+ "use_cache": kwargs.get("use_cache"),
877
+ "attention_mask": attention_mask,
878
+ "is_padded_inputs": ((attention_mask is not None) and (not attention_mask.all().item()))
879
+ }
880
+ )
881
+ return model_inputs
882
+
883
+ @staticmethod
884
+ def _reorder_cache(past_key_values, beam_idx):
885
+ reordered_past = ()
886
+ for layer_past in past_key_values:
887
+ reordered_past += (
888
+ tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past),
889
+ )
890
+ return reordered_past
891
+
892
+
893
+ @add_start_docstrings(
894
+ """
895
+ The LLaMa Model transformer with a sequence classification head on top (linear layer).
896
+
897
+ [`LlamaForSequenceClassification`] uses the last token in order to do the classification, as other causal models
898
+ (e.g. GPT-2) do.
899
+
900
+ Since it does classification on the last token, it requires to know the position of the last token. If a
901
+ `pad_token_id` is defined in the configuration, it finds the last token that is not a padding token in each row. If
902
+ no `pad_token_id` is defined, it simply takes the last value in each row of the batch. Since it cannot guess the
903
+ padding tokens when `inputs_embeds` are passed instead of `input_ids`, it does the same (take the last value in
904
+ each row of the batch).
905
+ """,
906
+ LLAMA_START_DOCSTRING,
907
+ )
908
+ class LlamaForSequenceClassification(LlamaPreTrainedModel):
909
+ def __init__(self, config):
910
+ super().__init__(config)
911
+ self.num_labels = config.num_labels
912
+ self.model = LlamaModel(config)
913
+ self.score = nn.Linear(config.hidden_size, self.num_labels, bias=False)
914
+
915
+ # Initialize weights and apply final processing
916
+ self.post_init()
917
+
918
+ def get_input_embeddings(self):
919
+ return self.model.embed_tokens
920
+
921
+ def set_input_embeddings(self, value):
922
+ self.model.embed_tokens = value
923
+
924
+ @add_start_docstrings_to_model_forward(LLAMA_INPUTS_DOCSTRING)
925
+ def forward(
926
+ self,
927
+ input_ids: torch.LongTensor = None,
928
+ attention_mask: Optional[torch.Tensor] = None,
929
+ position_ids: Optional[torch.LongTensor] = None,
930
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
931
+ inputs_embeds: Optional[torch.FloatTensor] = None,
932
+ labels: Optional[torch.LongTensor] = None,
933
+ use_cache: Optional[bool] = None,
934
+ output_attentions: Optional[bool] = None,
935
+ output_hidden_states: Optional[bool] = None,
936
+ return_dict: Optional[bool] = None,
937
+ ) -> Union[Tuple, SequenceClassifierOutputWithPast]:
938
+ r"""
939
+ labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
940
+ Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
941
+ config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
942
+ `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
943
+ """
944
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
945
+
946
+ transformer_outputs = self.model(
947
+ input_ids,
948
+ attention_mask=attention_mask,
949
+ position_ids=position_ids,
950
+ past_key_values=past_key_values,
951
+ inputs_embeds=inputs_embeds,
952
+ use_cache=use_cache,
953
+ output_attentions=output_attentions,
954
+ output_hidden_states=output_hidden_states,
955
+ return_dict=return_dict,
956
+ )
957
+ hidden_states = transformer_outputs[0]
958
+ logits = self.score(hidden_states)
959
+
960
+ if input_ids is not None:
961
+ batch_size = input_ids.shape[0]
962
+ else:
963
+ batch_size = inputs_embeds.shape[0]
964
+
965
+ if self.config.pad_token_id is None and batch_size != 1:
966
+ raise ValueError("Cannot handle batch sizes > 1 if no padding token is defined.")
967
+ if self.config.pad_token_id is None:
968
+ sequence_lengths = -1
969
+ else:
970
+ if input_ids is not None:
971
+ sequence_lengths = (torch.ne(input_ids, self.config.pad_token_id).sum(-1) - 1).to(logits.device)
972
+ else:
973
+ sequence_lengths = -1
974
+
975
+ pooled_logits = logits[torch.arange(batch_size, device=logits.device), sequence_lengths]
976
+
977
+ loss = None
978
+ if labels is not None:
979
+ labels = labels.to(logits.device)
980
+ if self.config.problem_type is None:
981
+ if self.num_labels == 1:
982
+ self.config.problem_type = "regression"
983
+ elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
984
+ self.config.problem_type = "single_label_classification"
985
+ else:
986
+ self.config.problem_type = "multi_label_classification"
987
+
988
+ if self.config.problem_type == "regression":
989
+ loss_fct = MSELoss()
990
+ if self.num_labels == 1:
991
+ loss = loss_fct(pooled_logits.squeeze(), labels.squeeze())
992
+ else:
993
+ loss = loss_fct(pooled_logits, labels)
994
+ elif self.config.problem_type == "single_label_classification":
995
+ loss_fct = CrossEntropyLoss()
996
+ loss = loss_fct(pooled_logits.view(-1, self.num_labels), labels.view(-1))
997
+ elif self.config.problem_type == "multi_label_classification":
998
+ loss_fct = BCEWithLogitsLoss()
999
+ loss = loss_fct(pooled_logits, labels)
1000
+ if not return_dict:
1001
+ output = (pooled_logits,) + transformer_outputs[1:]
1002
+ return ((loss,) + output) if loss is not None else output
1003
+
1004
+ return SequenceClassifierOutputWithPast(
1005
+ loss=loss,
1006
+ logits=pooled_logits,
1007
+ past_key_values=transformer_outputs.past_key_values,
1008
+ hidden_states=transformer_outputs.hidden_states,
1009
+ attentions=transformer_outputs.attentions,
1010
+ )
pytorch_model-00001-of-00003.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4151c06d014b38287da1696448d116e472a4320eec50e0c9077ec5fc028bc4fe
3
+ size 9950030153
pytorch_model-00002-of-00003.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:02a5f64fd87ef6202c63b2d5c9e6e9b26816cc15639ab19e065b4f41edfd3c90
3
+ size 9904155408
pytorch_model-00003-of-00003.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f7f0fd9871c39323f4b8c676b899491048c7380c5eeb874891af1f402cbcdc54
3
+ size 6180288927
pytorch_model.bin.index.json ADDED
@@ -0,0 +1,370 @@
1
+ {
2
+ "metadata": {
3
+ "total_size": 26034350080
4
+ },
5
+ "weight_map": {
6
+ "lm_head.weight": "pytorch_model-00003-of-00003.bin",
7
+ "model.embed_tokens.weight": "pytorch_model-00001-of-00003.bin",
8
+ "model.layers.0.input_layernorm.weight": "pytorch_model-00001-of-00003.bin",
9
+ "model.layers.0.mlp.down_proj.weight": "pytorch_model-00001-of-00003.bin",
10
+ "model.layers.0.mlp.gate_proj.weight": "pytorch_model-00001-of-00003.bin",
11
+ "model.layers.0.mlp.up_proj.weight": "pytorch_model-00001-of-00003.bin",
12
+ "model.layers.0.post_attention_layernorm.weight": "pytorch_model-00001-of-00003.bin",
13
+ "model.layers.0.self_attn.k_proj.weight": "pytorch_model-00001-of-00003.bin",
14
+ "model.layers.0.self_attn.o_proj.weight": "pytorch_model-00001-of-00003.bin",
15
+ "model.layers.0.self_attn.q_proj.weight": "pytorch_model-00001-of-00003.bin",
16
+ "model.layers.0.self_attn.v_proj.weight": "pytorch_model-00001-of-00003.bin",
17
+ "model.layers.1.input_layernorm.weight": "pytorch_model-00001-of-00003.bin",
18
+ "model.layers.1.mlp.down_proj.weight": "pytorch_model-00001-of-00003.bin",
19
+ "model.layers.1.mlp.gate_proj.weight": "pytorch_model-00001-of-00003.bin",
20
+ "model.layers.1.mlp.up_proj.weight": "pytorch_model-00001-of-00003.bin",
21
+ "model.layers.1.post_attention_layernorm.weight": "pytorch_model-00001-of-00003.bin",
22
+ "model.layers.1.self_attn.k_proj.weight": "pytorch_model-00001-of-00003.bin",
23
+ "model.layers.1.self_attn.o_proj.weight": "pytorch_model-00001-of-00003.bin",
24
+ "model.layers.1.self_attn.q_proj.weight": "pytorch_model-00001-of-00003.bin",
25
+ "model.layers.1.self_attn.v_proj.weight": "pytorch_model-00001-of-00003.bin",
26
+ "model.layers.10.input_layernorm.weight": "pytorch_model-00001-of-00003.bin",
27
+ "model.layers.10.mlp.down_proj.weight": "pytorch_model-00001-of-00003.bin",
28
+ "model.layers.10.mlp.gate_proj.weight": "pytorch_model-00001-of-00003.bin",
29
+ "model.layers.10.mlp.up_proj.weight": "pytorch_model-00001-of-00003.bin",
30
+ "model.layers.10.post_attention_layernorm.weight": "pytorch_model-00001-of-00003.bin",
31
+ "model.layers.10.self_attn.k_proj.weight": "pytorch_model-00001-of-00003.bin",
32
+ "model.layers.10.self_attn.o_proj.weight": "pytorch_model-00001-of-00003.bin",
33
+ "model.layers.10.self_attn.q_proj.weight": "pytorch_model-00001-of-00003.bin",
34
+ "model.layers.10.self_attn.v_proj.weight": "pytorch_model-00001-of-00003.bin",
35
+ "model.layers.11.input_layernorm.weight": "pytorch_model-00001-of-00003.bin",
36
+ "model.layers.11.mlp.down_proj.weight": "pytorch_model-00001-of-00003.bin",
37
+ "model.layers.11.mlp.gate_proj.weight": "pytorch_model-00001-of-00003.bin",
38
+ "model.layers.11.mlp.up_proj.weight": "pytorch_model-00001-of-00003.bin",
39
+ "model.layers.11.post_attention_layernorm.weight": "pytorch_model-00001-of-00003.bin",
40
+ "model.layers.11.self_attn.k_proj.weight": "pytorch_model-00001-of-00003.bin",
41
+ "model.layers.11.self_attn.o_proj.weight": "pytorch_model-00001-of-00003.bin",
42
+ "model.layers.11.self_attn.q_proj.weight": "pytorch_model-00001-of-00003.bin",
43
+ "model.layers.11.self_attn.v_proj.weight": "pytorch_model-00001-of-00003.bin",
44
+ "model.layers.12.input_layernorm.weight": "pytorch_model-00001-of-00003.bin",
45
+ "model.layers.12.mlp.down_proj.weight": "pytorch_model-00001-of-00003.bin",
46
+ "model.layers.12.mlp.gate_proj.weight": "pytorch_model-00001-of-00003.bin",
47
+ "model.layers.12.mlp.up_proj.weight": "pytorch_model-00001-of-00003.bin",
48
+ "model.layers.12.post_attention_layernorm.weight": "pytorch_model-00001-of-00003.bin",
49
+ "model.layers.12.self_attn.k_proj.weight": "pytorch_model-00001-of-00003.bin",
50
+ "model.layers.12.self_attn.o_proj.weight": "pytorch_model-00001-of-00003.bin",
51
+ "model.layers.12.self_attn.q_proj.weight": "pytorch_model-00001-of-00003.bin",
52
+ "model.layers.12.self_attn.v_proj.weight": "pytorch_model-00001-of-00003.bin",
53
+ "model.layers.13.input_layernorm.weight": "pytorch_model-00001-of-00003.bin",
54
+ "model.layers.13.mlp.down_proj.weight": "pytorch_model-00001-of-00003.bin",
55
+ "model.layers.13.mlp.gate_proj.weight": "pytorch_model-00001-of-00003.bin",
56
+ "model.layers.13.mlp.up_proj.weight": "pytorch_model-00001-of-00003.bin",
57
+ "model.layers.13.post_attention_layernorm.weight": "pytorch_model-00001-of-00003.bin",
58
+ "model.layers.13.self_attn.k_proj.weight": "pytorch_model-00001-of-00003.bin",
59
+ "model.layers.13.self_attn.o_proj.weight": "pytorch_model-00001-of-00003.bin",
60
+ "model.layers.13.self_attn.q_proj.weight": "pytorch_model-00001-of-00003.bin",
61
+ "model.layers.13.self_attn.v_proj.weight": "pytorch_model-00001-of-00003.bin",
62
+ "model.layers.14.input_layernorm.weight": "pytorch_model-00001-of-00003.bin",
63
+ "model.layers.14.mlp.down_proj.weight": "pytorch_model-00001-of-00003.bin",
64
+ "model.layers.14.mlp.gate_proj.weight": "pytorch_model-00001-of-00003.bin",
65
+ "model.layers.14.mlp.up_proj.weight": "pytorch_model-00001-of-00003.bin",
66
+ "model.layers.14.post_attention_layernorm.weight": "pytorch_model-00001-of-00003.bin",
67
+ "model.layers.14.self_attn.k_proj.weight": "pytorch_model-00001-of-00003.bin",
68
+ "model.layers.14.self_attn.o_proj.weight": "pytorch_model-00001-of-00003.bin",
69
+ "model.layers.14.self_attn.q_proj.weight": "pytorch_model-00001-of-00003.bin",
70
+ "model.layers.14.self_attn.v_proj.weight": "pytorch_model-00001-of-00003.bin",
71
+ "model.layers.15.input_layernorm.weight": "pytorch_model-00002-of-00003.bin",
72
+ "model.layers.15.mlp.down_proj.weight": "pytorch_model-00002-of-00003.bin",
73
+ "model.layers.15.mlp.gate_proj.weight": "pytorch_model-00002-of-00003.bin",
74
+ "model.layers.15.mlp.up_proj.weight": "pytorch_model-00002-of-00003.bin",
75
+ "model.layers.15.post_attention_layernorm.weight": "pytorch_model-00002-of-00003.bin",
76
+ "model.layers.15.self_attn.k_proj.weight": "pytorch_model-00001-of-00003.bin",
77
+ "model.layers.15.self_attn.o_proj.weight": "pytorch_model-00002-of-00003.bin",
78
+ "model.layers.15.self_attn.q_proj.weight": "pytorch_model-00001-of-00003.bin",
79
+ "model.layers.15.self_attn.v_proj.weight": "pytorch_model-00002-of-00003.bin",
80
+ "model.layers.16.input_layernorm.weight": "pytorch_model-00002-of-00003.bin",
81
+ "model.layers.16.mlp.down_proj.weight": "pytorch_model-00002-of-00003.bin",
82
+ "model.layers.16.mlp.gate_proj.weight": "pytorch_model-00002-of-00003.bin",
83
+ "model.layers.16.mlp.up_proj.weight": "pytorch_model-00002-of-00003.bin",
84
+ "model.layers.16.post_attention_layernorm.weight": "pytorch_model-00002-of-00003.bin",
85
+ "model.layers.16.self_attn.k_proj.weight": "pytorch_model-00002-of-00003.bin",
86
+ "model.layers.16.self_attn.o_proj.weight": "pytorch_model-00002-of-00003.bin",
87
+ "model.layers.16.self_attn.q_proj.weight": "pytorch_model-00002-of-00003.bin",
88
+ "model.layers.16.self_attn.v_proj.weight": "pytorch_model-00002-of-00003.bin",
89
+ "model.layers.17.input_layernorm.weight": "pytorch_model-00002-of-00003.bin",
90
+ "model.layers.17.mlp.down_proj.weight": "pytorch_model-00002-of-00003.bin",
91
+ "model.layers.17.mlp.gate_proj.weight": "pytorch_model-00002-of-00003.bin",
92
+ "model.layers.17.mlp.up_proj.weight": "pytorch_model-00002-of-00003.bin",
93
+ "model.layers.17.post_attention_layernorm.weight": "pytorch_model-00002-of-00003.bin",
94
+ "model.layers.17.self_attn.k_proj.weight": "pytorch_model-00002-of-00003.bin",
95
+ "model.layers.17.self_attn.o_proj.weight": "pytorch_model-00002-of-00003.bin",
96
+ "model.layers.17.self_attn.q_proj.weight": "pytorch_model-00002-of-00003.bin",
97
+ "model.layers.17.self_attn.v_proj.weight": "pytorch_model-00002-of-00003.bin",
98
+ "model.layers.18.input_layernorm.weight": "pytorch_model-00002-of-00003.bin",
99
+ "model.layers.18.mlp.down_proj.weight": "pytorch_model-00002-of-00003.bin",
100
+ "model.layers.18.mlp.gate_proj.weight": "pytorch_model-00002-of-00003.bin",
101
+ "model.layers.18.mlp.up_proj.weight": "pytorch_model-00002-of-00003.bin",
102
+ "model.layers.18.post_attention_layernorm.weight": "pytorch_model-00002-of-00003.bin",
103
+ "model.layers.18.self_attn.k_proj.weight": "pytorch_model-00002-of-00003.bin",
104
+ "model.layers.18.self_attn.o_proj.weight": "pytorch_model-00002-of-00003.bin",
105
+ "model.layers.18.self_attn.q_proj.weight": "pytorch_model-00002-of-00003.bin",
106
+ "model.layers.18.self_attn.v_proj.weight": "pytorch_model-00002-of-00003.bin",
107
+ "model.layers.19.input_layernorm.weight": "pytorch_model-00002-of-00003.bin",
108
+ "model.layers.19.mlp.down_proj.weight": "pytorch_model-00002-of-00003.bin",
109
+ "model.layers.19.mlp.gate_proj.weight": "pytorch_model-00002-of-00003.bin",
110
+ "model.layers.19.mlp.up_proj.weight": "pytorch_model-00002-of-00003.bin",
111
+ "model.layers.19.post_attention_layernorm.weight": "pytorch_model-00002-of-00003.bin",
112
+ "model.layers.19.self_attn.k_proj.weight": "pytorch_model-00002-of-00003.bin",
113
+ "model.layers.19.self_attn.o_proj.weight": "pytorch_model-00002-of-00003.bin",
114
+ "model.layers.19.self_attn.q_proj.weight": "pytorch_model-00002-of-00003.bin",
115
+ "model.layers.19.self_attn.v_proj.weight": "pytorch_model-00002-of-00003.bin",
116
+ "model.layers.2.input_layernorm.weight": "pytorch_model-00001-of-00003.bin",
117
+ "model.layers.2.mlp.down_proj.weight": "pytorch_model-00001-of-00003.bin",
118
+ "model.layers.2.mlp.gate_proj.weight": "pytorch_model-00001-of-00003.bin",
119
+ "model.layers.2.mlp.up_proj.weight": "pytorch_model-00001-of-00003.bin",
120
+ "model.layers.2.post_attention_layernorm.weight": "pytorch_model-00001-of-00003.bin",
121
+ "model.layers.2.self_attn.k_proj.weight": "pytorch_model-00001-of-00003.bin",
122
+ "model.layers.2.self_attn.o_proj.weight": "pytorch_model-00001-of-00003.bin",
123
+ "model.layers.2.self_attn.q_proj.weight": "pytorch_model-00001-of-00003.bin",
124
+ "model.layers.2.self_attn.v_proj.weight": "pytorch_model-00001-of-00003.bin",
125
+ "model.layers.20.input_layernorm.weight": "pytorch_model-00002-of-00003.bin",
126
+ "model.layers.20.mlp.down_proj.weight": "pytorch_model-00002-of-00003.bin",
127
+ "model.layers.20.mlp.gate_proj.weight": "pytorch_model-00002-of-00003.bin",
128
+ "model.layers.20.mlp.up_proj.weight": "pytorch_model-00002-of-00003.bin",
129
+ "model.layers.20.post_attention_layernorm.weight": "pytorch_model-00002-of-00003.bin",
130
+ "model.layers.20.self_attn.k_proj.weight": "pytorch_model-00002-of-00003.bin",
131
+ "model.layers.20.self_attn.o_proj.weight": "pytorch_model-00002-of-00003.bin",
132
+ "model.layers.20.self_attn.q_proj.weight": "pytorch_model-00002-of-00003.bin",
133
+ "model.layers.20.self_attn.v_proj.weight": "pytorch_model-00002-of-00003.bin",
134
+ "model.layers.21.input_layernorm.weight": "pytorch_model-00002-of-00003.bin",
135
+ "model.layers.21.mlp.down_proj.weight": "pytorch_model-00002-of-00003.bin",
136
+ "model.layers.21.mlp.gate_proj.weight": "pytorch_model-00002-of-00003.bin",
137
+ "model.layers.21.mlp.up_proj.weight": "pytorch_model-00002-of-00003.bin",
138
+ "model.layers.21.post_attention_layernorm.weight": "pytorch_model-00002-of-00003.bin",
139
+ "model.layers.21.self_attn.k_proj.weight": "pytorch_model-00002-of-00003.bin",
140
+ "model.layers.21.self_attn.o_proj.weight": "pytorch_model-00002-of-00003.bin",
141
+ "model.layers.21.self_attn.q_proj.weight": "pytorch_model-00002-of-00003.bin",
142
+ "model.layers.21.self_attn.v_proj.weight": "pytorch_model-00002-of-00003.bin",
143
+ "model.layers.22.input_layernorm.weight": "pytorch_model-00002-of-00003.bin",
144
+ "model.layers.22.mlp.down_proj.weight": "pytorch_model-00002-of-00003.bin",
145
+ "model.layers.22.mlp.gate_proj.weight": "pytorch_model-00002-of-00003.bin",
146
+ "model.layers.22.mlp.up_proj.weight": "pytorch_model-00002-of-00003.bin",
147
+ "model.layers.22.post_attention_layernorm.weight": "pytorch_model-00002-of-00003.bin",
148
+ "model.layers.22.self_attn.k_proj.weight": "pytorch_model-00002-of-00003.bin",
149
+ "model.layers.22.self_attn.o_proj.weight": "pytorch_model-00002-of-00003.bin",
150
+ "model.layers.22.self_attn.q_proj.weight": "pytorch_model-00002-of-00003.bin",
151
+ "model.layers.22.self_attn.v_proj.weight": "pytorch_model-00002-of-00003.bin",
152
+ "model.layers.23.input_layernorm.weight": "pytorch_model-00002-of-00003.bin",
153
+ "model.layers.23.mlp.down_proj.weight": "pytorch_model-00002-of-00003.bin",
154
+ "model.layers.23.mlp.gate_proj.weight": "pytorch_model-00002-of-00003.bin",
155
+ "model.layers.23.mlp.up_proj.weight": "pytorch_model-00002-of-00003.bin",
156
+ "model.layers.23.post_attention_layernorm.weight": "pytorch_model-00002-of-00003.bin",
157
+ "model.layers.23.self_attn.k_proj.weight": "pytorch_model-00002-of-00003.bin",
158
+ "model.layers.23.self_attn.o_proj.weight": "pytorch_model-00002-of-00003.bin",
159
+ "model.layers.23.self_attn.q_proj.weight": "pytorch_model-00002-of-00003.bin",
160
+ "model.layers.23.self_attn.v_proj.weight": "pytorch_model-00002-of-00003.bin",
161
+ "model.layers.24.input_layernorm.weight": "pytorch_model-00002-of-00003.bin",
162
+ "model.layers.24.mlp.down_proj.weight": "pytorch_model-00002-of-00003.bin",
163
+ "model.layers.24.mlp.gate_proj.weight": "pytorch_model-00002-of-00003.bin",
164
+ "model.layers.24.mlp.up_proj.weight": "pytorch_model-00002-of-00003.bin",
165
+ "model.layers.24.post_attention_layernorm.weight": "pytorch_model-00002-of-00003.bin",
166
+ "model.layers.24.self_attn.k_proj.weight": "pytorch_model-00002-of-00003.bin",
167
+ "model.layers.24.self_attn.o_proj.weight": "pytorch_model-00002-of-00003.bin",
168
+ "model.layers.24.self_attn.q_proj.weight": "pytorch_model-00002-of-00003.bin",
169
+ "model.layers.24.self_attn.v_proj.weight": "pytorch_model-00002-of-00003.bin",
170
+ "model.layers.25.input_layernorm.weight": "pytorch_model-00002-of-00003.bin",
171
+ "model.layers.25.mlp.down_proj.weight": "pytorch_model-00002-of-00003.bin",
172
+ "model.layers.25.mlp.gate_proj.weight": "pytorch_model-00002-of-00003.bin",
173
+ "model.layers.25.mlp.up_proj.weight": "pytorch_model-00002-of-00003.bin",
174
+ "model.layers.25.post_attention_layernorm.weight": "pytorch_model-00002-of-00003.bin",
175
+ "model.layers.25.self_attn.k_proj.weight": "pytorch_model-00002-of-00003.bin",
176
+ "model.layers.25.self_attn.o_proj.weight": "pytorch_model-00002-of-00003.bin",
177
+ "model.layers.25.self_attn.q_proj.weight": "pytorch_model-00002-of-00003.bin",
178
+ "model.layers.25.self_attn.v_proj.weight": "pytorch_model-00002-of-00003.bin",
179
+ "model.layers.26.input_layernorm.weight": "pytorch_model-00002-of-00003.bin",
180
+ "model.layers.26.mlp.down_proj.weight": "pytorch_model-00002-of-00003.bin",
181
+ "model.layers.26.mlp.gate_proj.weight": "pytorch_model-00002-of-00003.bin",
182
+ "model.layers.26.mlp.up_proj.weight": "pytorch_model-00002-of-00003.bin",
183
+ "model.layers.26.post_attention_layernorm.weight": "pytorch_model-00002-of-00003.bin",
184
+ "model.layers.26.self_attn.k_proj.weight": "pytorch_model-00002-of-00003.bin",
185
+ "model.layers.26.self_attn.o_proj.weight": "pytorch_model-00002-of-00003.bin",
186
+ "model.layers.26.self_attn.q_proj.weight": "pytorch_model-00002-of-00003.bin",
187
+ "model.layers.26.self_attn.v_proj.weight": "pytorch_model-00002-of-00003.bin",
188
+ "model.layers.27.input_layernorm.weight": "pytorch_model-00002-of-00003.bin",
189
+ "model.layers.27.mlp.down_proj.weight": "pytorch_model-00002-of-00003.bin",
190
+ "model.layers.27.mlp.gate_proj.weight": "pytorch_model-00002-of-00003.bin",
191
+ "model.layers.27.mlp.up_proj.weight": "pytorch_model-00002-of-00003.bin",
192
+ "model.layers.27.post_attention_layernorm.weight": "pytorch_model-00002-of-00003.bin",
193
+ "model.layers.27.self_attn.k_proj.weight": "pytorch_model-00002-of-00003.bin",
194
+ "model.layers.27.self_attn.o_proj.weight": "pytorch_model-00002-of-00003.bin",
195
+ "model.layers.27.self_attn.q_proj.weight": "pytorch_model-00002-of-00003.bin",
196
+ "model.layers.27.self_attn.v_proj.weight": "pytorch_model-00002-of-00003.bin",
197
+ "model.layers.28.input_layernorm.weight": "pytorch_model-00002-of-00003.bin",
198
+ "model.layers.28.mlp.down_proj.weight": "pytorch_model-00002-of-00003.bin",
199
+ "model.layers.28.mlp.gate_proj.weight": "pytorch_model-00002-of-00003.bin",
200
+ "model.layers.28.mlp.up_proj.weight": "pytorch_model-00002-of-00003.bin",
201
+ "model.layers.28.post_attention_layernorm.weight": "pytorch_model-00002-of-00003.bin",
202
+ "model.layers.28.self_attn.k_proj.weight": "pytorch_model-00002-of-00003.bin",
203
+ "model.layers.28.self_attn.o_proj.weight": "pytorch_model-00002-of-00003.bin",
204
+ "model.layers.28.self_attn.q_proj.weight": "pytorch_model-00002-of-00003.bin",
205
+ "model.layers.28.self_attn.v_proj.weight": "pytorch_model-00002-of-00003.bin",
206
+ "model.layers.29.input_layernorm.weight": "pytorch_model-00002-of-00003.bin",
207
+ "model.layers.29.mlp.down_proj.weight": "pytorch_model-00002-of-00003.bin",
208
+ "model.layers.29.mlp.gate_proj.weight": "pytorch_model-00002-of-00003.bin",
209
+ "model.layers.29.mlp.up_proj.weight": "pytorch_model-00002-of-00003.bin",
210
+ "model.layers.29.post_attention_layernorm.weight": "pytorch_model-00002-of-00003.bin",
211
+ "model.layers.29.self_attn.k_proj.weight": "pytorch_model-00002-of-00003.bin",
212
+ "model.layers.29.self_attn.o_proj.weight": "pytorch_model-00002-of-00003.bin",
213
+ "model.layers.29.self_attn.q_proj.weight": "pytorch_model-00002-of-00003.bin",
214
+ "model.layers.29.self_attn.v_proj.weight": "pytorch_model-00002-of-00003.bin",
215
+ "model.layers.3.input_layernorm.weight": "pytorch_model-00001-of-00003.bin",
216
+ "model.layers.3.mlp.down_proj.weight": "pytorch_model-00001-of-00003.bin",
217
+ "model.layers.3.mlp.gate_proj.weight": "pytorch_model-00001-of-00003.bin",
218
+ "model.layers.3.mlp.up_proj.weight": "pytorch_model-00001-of-00003.bin",
219
+ "model.layers.3.post_attention_layernorm.weight": "pytorch_model-00001-of-00003.bin",
220
+ "model.layers.3.self_attn.k_proj.weight": "pytorch_model-00001-of-00003.bin",
221
+ "model.layers.3.self_attn.o_proj.weight": "pytorch_model-00001-of-00003.bin",
222
+ "model.layers.3.self_attn.q_proj.weight": "pytorch_model-00001-of-00003.bin",
223
+ "model.layers.3.self_attn.v_proj.weight": "pytorch_model-00001-of-00003.bin",
224
+ "model.layers.30.input_layernorm.weight": "pytorch_model-00003-of-00003.bin",
225
+ "model.layers.30.mlp.down_proj.weight": "pytorch_model-00003-of-00003.bin",
226
+ "model.layers.30.mlp.gate_proj.weight": "pytorch_model-00002-of-00003.bin",
227
+ "model.layers.30.mlp.up_proj.weight": "pytorch_model-00002-of-00003.bin",
228
+ "model.layers.30.post_attention_layernorm.weight": "pytorch_model-00003-of-00003.bin",
229
+ "model.layers.30.self_attn.k_proj.weight": "pytorch_model-00002-of-00003.bin",
230
+ "model.layers.30.self_attn.o_proj.weight": "pytorch_model-00002-of-00003.bin",
231
+ "model.layers.30.self_attn.q_proj.weight": "pytorch_model-00002-of-00003.bin",
232
+ "model.layers.30.self_attn.v_proj.weight": "pytorch_model-00002-of-00003.bin",
233
+ "model.layers.31.input_layernorm.weight": "pytorch_model-00003-of-00003.bin",
234
+ "model.layers.31.mlp.down_proj.weight": "pytorch_model-00003-of-00003.bin",
235
+ "model.layers.31.mlp.gate_proj.weight": "pytorch_model-00003-of-00003.bin",
236
+ "model.layers.31.mlp.up_proj.weight": "pytorch_model-00003-of-00003.bin",
237
+ "model.layers.31.post_attention_layernorm.weight": "pytorch_model-00003-of-00003.bin",
238
+ "model.layers.31.self_attn.k_proj.weight": "pytorch_model-00003-of-00003.bin",
239
+ "model.layers.31.self_attn.o_proj.weight": "pytorch_model-00003-of-00003.bin",
240
+ "model.layers.31.self_attn.q_proj.weight": "pytorch_model-00003-of-00003.bin",
241
+ "model.layers.31.self_attn.v_proj.weight": "pytorch_model-00003-of-00003.bin",
242
+ "model.layers.32.input_layernorm.weight": "pytorch_model-00003-of-00003.bin",
243
+ "model.layers.32.mlp.down_proj.weight": "pytorch_model-00003-of-00003.bin",
244
+ "model.layers.32.mlp.gate_proj.weight": "pytorch_model-00003-of-00003.bin",
245
+ "model.layers.32.mlp.up_proj.weight": "pytorch_model-00003-of-00003.bin",
246
+ "model.layers.32.post_attention_layernorm.weight": "pytorch_model-00003-of-00003.bin",
247
+ "model.layers.32.self_attn.k_proj.weight": "pytorch_model-00003-of-00003.bin",
248
+ "model.layers.32.self_attn.o_proj.weight": "pytorch_model-00003-of-00003.bin",
249
+ "model.layers.32.self_attn.q_proj.weight": "pytorch_model-00003-of-00003.bin",
250
+ "model.layers.32.self_attn.v_proj.weight": "pytorch_model-00003-of-00003.bin",
251
+ "model.layers.33.input_layernorm.weight": "pytorch_model-00003-of-00003.bin",
252
+ "model.layers.33.mlp.down_proj.weight": "pytorch_model-00003-of-00003.bin",
253
+ "model.layers.33.mlp.gate_proj.weight": "pytorch_model-00003-of-00003.bin",
254
+ "model.layers.33.mlp.up_proj.weight": "pytorch_model-00003-of-00003.bin",
255
+ "model.layers.33.post_attention_layernorm.weight": "pytorch_model-00003-of-00003.bin",
256
+ "model.layers.33.self_attn.k_proj.weight": "pytorch_model-00003-of-00003.bin",
257
+ "model.layers.33.self_attn.o_proj.weight": "pytorch_model-00003-of-00003.bin",
258
+ "model.layers.33.self_attn.q_proj.weight": "pytorch_model-00003-of-00003.bin",
259
+ "model.layers.33.self_attn.v_proj.weight": "pytorch_model-00003-of-00003.bin",
260
+ "model.layers.34.input_layernorm.weight": "pytorch_model-00003-of-00003.bin",
261
+ "model.layers.34.mlp.down_proj.weight": "pytorch_model-00003-of-00003.bin",
262
+ "model.layers.34.mlp.gate_proj.weight": "pytorch_model-00003-of-00003.bin",
263
+ "model.layers.34.mlp.up_proj.weight": "pytorch_model-00003-of-00003.bin",
264
+ "model.layers.34.post_attention_layernorm.weight": "pytorch_model-00003-of-00003.bin",
265
+ "model.layers.34.self_attn.k_proj.weight": "pytorch_model-00003-of-00003.bin",
266
+ "model.layers.34.self_attn.o_proj.weight": "pytorch_model-00003-of-00003.bin",
267
+ "model.layers.34.self_attn.q_proj.weight": "pytorch_model-00003-of-00003.bin",
268
+ "model.layers.34.self_attn.v_proj.weight": "pytorch_model-00003-of-00003.bin",
269
+ "model.layers.35.input_layernorm.weight": "pytorch_model-00003-of-00003.bin",
270
+ "model.layers.35.mlp.down_proj.weight": "pytorch_model-00003-of-00003.bin",
271
+ "model.layers.35.mlp.gate_proj.weight": "pytorch_model-00003-of-00003.bin",
272
+ "model.layers.35.mlp.up_proj.weight": "pytorch_model-00003-of-00003.bin",
273
+ "model.layers.35.post_attention_layernorm.weight": "pytorch_model-00003-of-00003.bin",
274
+ "model.layers.35.self_attn.k_proj.weight": "pytorch_model-00003-of-00003.bin",
275
+ "model.layers.35.self_attn.o_proj.weight": "pytorch_model-00003-of-00003.bin",
276
+ "model.layers.35.self_attn.q_proj.weight": "pytorch_model-00003-of-00003.bin",
277
+ "model.layers.35.self_attn.v_proj.weight": "pytorch_model-00003-of-00003.bin",
278
+ "model.layers.36.input_layernorm.weight": "pytorch_model-00003-of-00003.bin",
279
+ "model.layers.36.mlp.down_proj.weight": "pytorch_model-00003-of-00003.bin",
280
+ "model.layers.36.mlp.gate_proj.weight": "pytorch_model-00003-of-00003.bin",
281
+ "model.layers.36.mlp.up_proj.weight": "pytorch_model-00003-of-00003.bin",
282
+ "model.layers.36.post_attention_layernorm.weight": "pytorch_model-00003-of-00003.bin",
283
+ "model.layers.36.self_attn.k_proj.weight": "pytorch_model-00003-of-00003.bin",
284
+ "model.layers.36.self_attn.o_proj.weight": "pytorch_model-00003-of-00003.bin",
285
+ "model.layers.36.self_attn.q_proj.weight": "pytorch_model-00003-of-00003.bin",
286
+ "model.layers.36.self_attn.v_proj.weight": "pytorch_model-00003-of-00003.bin",
287
+ "model.layers.37.input_layernorm.weight": "pytorch_model-00003-of-00003.bin",
288
+ "model.layers.37.mlp.down_proj.weight": "pytorch_model-00003-of-00003.bin",
289
+ "model.layers.37.mlp.gate_proj.weight": "pytorch_model-00003-of-00003.bin",
290
+ "model.layers.37.mlp.up_proj.weight": "pytorch_model-00003-of-00003.bin",
291
+ "model.layers.37.post_attention_layernorm.weight": "pytorch_model-00003-of-00003.bin",
292
+ "model.layers.37.self_attn.k_proj.weight": "pytorch_model-00003-of-00003.bin",
293
+ "model.layers.37.self_attn.o_proj.weight": "pytorch_model-00003-of-00003.bin",
294
+ "model.layers.37.self_attn.q_proj.weight": "pytorch_model-00003-of-00003.bin",
295
+ "model.layers.37.self_attn.v_proj.weight": "pytorch_model-00003-of-00003.bin",
296
+ "model.layers.38.input_layernorm.weight": "pytorch_model-00003-of-00003.bin",
297
+ "model.layers.38.mlp.down_proj.weight": "pytorch_model-00003-of-00003.bin",
298
+ "model.layers.38.mlp.gate_proj.weight": "pytorch_model-00003-of-00003.bin",
299
+ "model.layers.38.mlp.up_proj.weight": "pytorch_model-00003-of-00003.bin",
300
+ "model.layers.38.post_attention_layernorm.weight": "pytorch_model-00003-of-00003.bin",
301
+ "model.layers.38.self_attn.k_proj.weight": "pytorch_model-00003-of-00003.bin",
302
+ "model.layers.38.self_attn.o_proj.weight": "pytorch_model-00003-of-00003.bin",
303
+ "model.layers.38.self_attn.q_proj.weight": "pytorch_model-00003-of-00003.bin",
304
+ "model.layers.38.self_attn.v_proj.weight": "pytorch_model-00003-of-00003.bin",
305
+ "model.layers.39.input_layernorm.weight": "pytorch_model-00003-of-00003.bin",
306
+ "model.layers.39.mlp.down_proj.weight": "pytorch_model-00003-of-00003.bin",
307
+ "model.layers.39.mlp.gate_proj.weight": "pytorch_model-00003-of-00003.bin",
308
+ "model.layers.39.mlp.up_proj.weight": "pytorch_model-00003-of-00003.bin",
309
+ "model.layers.39.post_attention_layernorm.weight": "pytorch_model-00003-of-00003.bin",
310
+ "model.layers.39.self_attn.k_proj.weight": "pytorch_model-00003-of-00003.bin",
311
+ "model.layers.39.self_attn.o_proj.weight": "pytorch_model-00003-of-00003.bin",
312
+ "model.layers.39.self_attn.q_proj.weight": "pytorch_model-00003-of-00003.bin",
313
+ "model.layers.39.self_attn.v_proj.weight": "pytorch_model-00003-of-00003.bin",
314
+ "model.layers.4.input_layernorm.weight": "pytorch_model-00001-of-00003.bin",
315
+ "model.layers.4.mlp.down_proj.weight": "pytorch_model-00001-of-00003.bin",
316
+ "model.layers.4.mlp.gate_proj.weight": "pytorch_model-00001-of-00003.bin",
317
+ "model.layers.4.mlp.up_proj.weight": "pytorch_model-00001-of-00003.bin",
318
+ "model.layers.4.post_attention_layernorm.weight": "pytorch_model-00001-of-00003.bin",
319
+ "model.layers.4.self_attn.k_proj.weight": "pytorch_model-00001-of-00003.bin",
320
+ "model.layers.4.self_attn.o_proj.weight": "pytorch_model-00001-of-00003.bin",
321
+ "model.layers.4.self_attn.q_proj.weight": "pytorch_model-00001-of-00003.bin",
322
+ "model.layers.4.self_attn.v_proj.weight": "pytorch_model-00001-of-00003.bin",
323
+ "model.layers.5.input_layernorm.weight": "pytorch_model-00001-of-00003.bin",
324
+ "model.layers.5.mlp.down_proj.weight": "pytorch_model-00001-of-00003.bin",
325
+ "model.layers.5.mlp.gate_proj.weight": "pytorch_model-00001-of-00003.bin",
326
+ "model.layers.5.mlp.up_proj.weight": "pytorch_model-00001-of-00003.bin",
327
+ "model.layers.5.post_attention_layernorm.weight": "pytorch_model-00001-of-00003.bin",
328
+ "model.layers.5.self_attn.k_proj.weight": "pytorch_model-00001-of-00003.bin",
329
+ "model.layers.5.self_attn.o_proj.weight": "pytorch_model-00001-of-00003.bin",
330
+ "model.layers.5.self_attn.q_proj.weight": "pytorch_model-00001-of-00003.bin",
331
+ "model.layers.5.self_attn.v_proj.weight": "pytorch_model-00001-of-00003.bin",
332
+ "model.layers.6.input_layernorm.weight": "pytorch_model-00001-of-00003.bin",
333
+ "model.layers.6.mlp.down_proj.weight": "pytorch_model-00001-of-00003.bin",
334
+ "model.layers.6.mlp.gate_proj.weight": "pytorch_model-00001-of-00003.bin",
335
+ "model.layers.6.mlp.up_proj.weight": "pytorch_model-00001-of-00003.bin",
336
+ "model.layers.6.post_attention_layernorm.weight": "pytorch_model-00001-of-00003.bin",
337
+ "model.layers.6.self_attn.k_proj.weight": "pytorch_model-00001-of-00003.bin",
338
+ "model.layers.6.self_attn.o_proj.weight": "pytorch_model-00001-of-00003.bin",
339
+ "model.layers.6.self_attn.q_proj.weight": "pytorch_model-00001-of-00003.bin",
340
+ "model.layers.6.self_attn.v_proj.weight": "pytorch_model-00001-of-00003.bin",
341
+ "model.layers.7.input_layernorm.weight": "pytorch_model-00001-of-00003.bin",
342
+ "model.layers.7.mlp.down_proj.weight": "pytorch_model-00001-of-00003.bin",
343
+ "model.layers.7.mlp.gate_proj.weight": "pytorch_model-00001-of-00003.bin",
344
+ "model.layers.7.mlp.up_proj.weight": "pytorch_model-00001-of-00003.bin",
345
+ "model.layers.7.post_attention_layernorm.weight": "pytorch_model-00001-of-00003.bin",
346
+ "model.layers.7.self_attn.k_proj.weight": "pytorch_model-00001-of-00003.bin",
347
+ "model.layers.7.self_attn.o_proj.weight": "pytorch_model-00001-of-00003.bin",
348
+ "model.layers.7.self_attn.q_proj.weight": "pytorch_model-00001-of-00003.bin",
349
+ "model.layers.7.self_attn.v_proj.weight": "pytorch_model-00001-of-00003.bin",
350
+ "model.layers.8.input_layernorm.weight": "pytorch_model-00001-of-00003.bin",
351
+ "model.layers.8.mlp.down_proj.weight": "pytorch_model-00001-of-00003.bin",
352
+ "model.layers.8.mlp.gate_proj.weight": "pytorch_model-00001-of-00003.bin",
353
+ "model.layers.8.mlp.up_proj.weight": "pytorch_model-00001-of-00003.bin",
354
+ "model.layers.8.post_attention_layernorm.weight": "pytorch_model-00001-of-00003.bin",
355
+ "model.layers.8.self_attn.k_proj.weight": "pytorch_model-00001-of-00003.bin",
356
+ "model.layers.8.self_attn.o_proj.weight": "pytorch_model-00001-of-00003.bin",
357
+ "model.layers.8.self_attn.q_proj.weight": "pytorch_model-00001-of-00003.bin",
358
+ "model.layers.8.self_attn.v_proj.weight": "pytorch_model-00001-of-00003.bin",
359
+ "model.layers.9.input_layernorm.weight": "pytorch_model-00001-of-00003.bin",
360
+ "model.layers.9.mlp.down_proj.weight": "pytorch_model-00001-of-00003.bin",
361
+ "model.layers.9.mlp.gate_proj.weight": "pytorch_model-00001-of-00003.bin",
362
+ "model.layers.9.mlp.up_proj.weight": "pytorch_model-00001-of-00003.bin",
363
+ "model.layers.9.post_attention_layernorm.weight": "pytorch_model-00001-of-00003.bin",
364
+ "model.layers.9.self_attn.k_proj.weight": "pytorch_model-00001-of-00003.bin",
365
+ "model.layers.9.self_attn.o_proj.weight": "pytorch_model-00001-of-00003.bin",
366
+ "model.layers.9.self_attn.q_proj.weight": "pytorch_model-00001-of-00003.bin",
367
+ "model.layers.9.self_attn.v_proj.weight": "pytorch_model-00001-of-00003.bin",
368
+ "model.norm.weight": "pytorch_model-00003-of-00003.bin"
369
+ }
370
+ }
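`pytorch_model.bin.index.json` is the shard index that lets `transformers` load a checkpoint split across several files: `metadata.total_size` gives the combined size, and `weight_map` maps every parameter name to the shard that contains it. `from_pretrained` consumes this automatically; a minimal sketch of using the index by hand (hypothetical local files, assuming the real shards rather than LFS stubs):

```python
import json
import torch

# Look up which shard holds a given parameter, then load only that shard.
with open("pytorch_model.bin.index.json") as f:
    index = json.load(f)

name = "model.layers.31.self_attn.q_proj.weight"
shard_file = index["weight_map"][name]              # "pytorch_model-00003-of-00003.bin"
shard = torch.load(shard_file, map_location="cpu")  # dict of parameter name -> tensor
print(name, tuple(shard[name].shape))
```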
special_tokens_map.json ADDED
@@ -0,0 +1,13 @@
+ {
+   "additional_special_tokens": [
+     "<|im_start|>",
+     "<|im_end|>"
+   ],
+   "bos_token": "<|im_start|>",
+   "cls_token": "<CLS>",
+   "eos_token": "<|im_end|>",
+   "mask_token": "<MASK>",
+   "pad_token": "<PAD>",
+   "sep_token": "<SEP>",
+   "unk_token": "<unk>"
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,103 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "32000": {
+       "content": "<CLS>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "32001": {
+       "content": "<SEP>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "32002": {
+       "content": "<EOD>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "32003": {
+       "content": "<MASK>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "32004": {
+       "content": "<PAD>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "32005": {
+       "content": "<|im_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "32006": {
+       "content": "<|im_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "additional_special_tokens": [
+     "<|im_start|>",
+     "<|im_end|>"
+   ],
+   "bos_token": "<|im_start|>",
+   "chat_template": "{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
+   "clean_up_tokenization_spaces": false,
+   "cls_token": "<CLS>",
+   "eos_token": "<|im_end|>",
+   "legacy": true,
+   "mask_token": "<MASK>",
+   "model_max_length": 1000000000000000019884624838656,
+   "pad_token": "<PAD>",
+   "padding_side": "right",
+   "sep_token": "<SEP>",
+   "sp_model_kwargs": {},
+   "tokenizer_class": "LlamaTokenizer",
+   "unk_token": "<unk>",
+   "use_default_system_prompt": false
+ }
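The `chat_template` above is a Jinja template for the ChatML-style format: every turn is wrapped as `<|im_start|>role\ncontent<|im_end|>`, and `add_generation_prompt` appends an opening `<|im_start|>assistant` tag for the model to continue. A minimal sketch of rendering a prompt with it through `transformers` (the repo id is an assumption; a local path to this folder works the same way):

```python
from transformers import AutoTokenizer

# Load the tokenizer shipped in this commit (repo id assumed; chat templates need transformers >= 4.34).
tokenizer = AutoTokenizer.from_pretrained("LeoLM/leo-hessianai-13b-chat")

messages = [
    {"role": "system", "content": "Du bist ein hilfreicher Assistent."},
    {"role": "user", "content": "Was ist ein Sprachmodell?"},
]

# apply_chat_template fills in the chat_template defined in tokenizer_config.json.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
# <|im_start|>system
# Du bist ein hilfreicher Assistent.<|im_end|>
# <|im_start|>user
# Was ist ein Sprachmodell?<|im_end|>
# <|im_start|>assistant
```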