yaya-sy committed on
Commit
8e81c5b
1 Parent(s): 234265a

Upload folder using huggingface_hub

README.md ADDED
@@ -0,0 +1,167 @@
1
+ ---
2
+ language:
3
+ - en
4
+ - sw
5
+ - zu
6
+ - xh
7
+ - ha
8
+ - yo
9
+ pipeline_tag: text-generation
10
+ tags:
11
+ - nlp
12
+ - InkubaLM
13
+ - africanLLM
14
+ - africa
15
+ - llm
16
+ datasets:
17
+ - lelapa/Inkuba-Mono
18
+ license: cc-by-nc-4.0
19
+ ---
20
+ # InkubaLM-0.4B: Small language model for low-resource African Languages
21
+
22
+ <!-- Provide a quick summary of what the model is/does. -->
23
+
24
+
25
+ ![ ](InkubaLM.png)
26
+
27
+ ## Model Details
28
+ InkubaLM was trained from scratch on 1.9 billion tokens of data in five African languages, plus English and French data, for a total of 2.4 billion tokens.
29
+ InkubaLM uses a model architecture similar to that of MobileLLM, with 0.4 billion parameters and a vocabulary size of 61788.
30
+ For detailed information on training, benchmarks, and performance, please refer to our full [blog post](https://medium.com/@lelapa_ai/inkubalm-a-small-language-model-for-low-resource-african-languages-dc9793842dec).
31
+ ### Model Description
32
+
33
+ <!-- Provide a longer summary of what this model is. -->
34
+
35
+
36
+
37
+ - **Developed by:** [Lelapa AI](https://lelapa.ai/) - Fundamental Research Team.
38
+ - **Model type:** Small Language Model (SLM) for five African languages built using the architecture design of LLaMA-7B.
39
+ - **Language(s) (NLP):** isiZulu, Yoruba, Swahili, isiXhosa, Hausa, English and French.
40
+ - **License:** CC BY-NC 4.0.
41
+
42
+ ### Model Sources
43
+
44
+ <!-- Provide the basic links for the model. -->
45
+
46
+ - **Repository:** TBD
47
+ - **Paper:** [InkubaLM](https://arxiv.org/pdf/2408.17024)
48
+
49
+
50
+ ## How to Get Started with the Model
51
+
52
+ Use the code below to get started with the model.
53
+
54
+ ```bash
55
+ pip install transformers
56
+
57
+ ```
58
+ ### Running the model on CPU/GPU/multi GPU
59
+ #### Running the model on CPU
60
+ ```python
61
+ from transformers import AutoTokenizer, AutoModelForCausalLM
62
+
63
+ tokenizer = AutoTokenizer.from_pretrained("lelapa/InkubaLM-0.4B", trust_remote_code=True)
64
+ model = AutoModelForCausalLM.from_pretrained("lelapa/InkubaLM-0.4B", trust_remote_code=True)
65
+
66
+ text = "Today I planned to"
67
+ inputs = tokenizer(text, return_tensors="pt")
68
+ input_ids = inputs.input_ids
69
+
70
+ # Reuse the attention mask returned by the tokenizer
71
+ attention_mask = inputs.attention_mask
72
+
73
+ # Generate outputs using the attention mask
74
+ outputs = model.generate(input_ids, attention_mask=attention_mask, max_length=60, pad_token_id=tokenizer.eos_token_id)
75
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
76
+
77
+ ```
78
+ #### Using full precision
79
+ ```python
80
+ from transformers import AutoModelForCausalLM, AutoTokenizer
81
+
82
+ model = AutoModelForCausalLM.from_pretrained("lelapa/InkubaLM-0.4B", trust_remote_code=True)
83
+ tokenizer = AutoTokenizer.from_pretrained("lelapa/InkubaLM-0.4B", trust_remote_code=True)
84
+
85
+ model.to('cuda')
86
+ text = "Today I planned to "
87
+ input_ids = tokenizer(text, return_tensors="pt").to('cuda').input_ids
88
+ outputs = model.generate(input_ids, max_length=1000, repetition_penalty=1.2, pad_token_id=tokenizer.eos_token_id)
89
+ print(tokenizer.batch_decode(outputs[:, input_ids.shape[1]:-1])[0].strip())
90
+
91
+ ```
92
+ #### Using torch.bfloat16
93
+ ``` python
94
+ import torch
95
+ from transformers import AutoTokenizer, AutoModelForCausalLM
96
+ checkpoint = "lelapa/InkubaLM-0.4B"
97
+ tokenizer = AutoTokenizer.from_pretrained(checkpoint)
98
+
99
+ model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
100
+ inputs = tokenizer.encode("Today I planned to ", return_tensors="pt").to("cuda")
101
+ outputs = model.generate(inputs)
102
+ print(tokenizer.decode(outputs[0]))
103
+
104
+ ```
105
+ #### Using quantized versions via bitsandbytes
106
+
107
+ ```bash
108
+ pip install bitsandbytes accelerate
109
+ ```
110
+ ``` python
111
+ from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
112
+ quantization_config = BitsAndBytesConfig(load_in_8bit=True) # to use 4bit use `load_in_4bit=True` instead
113
+ checkpoint = "lelapa/InkubaLM-0.4B"
114
+ tokenizer = AutoTokenizer.from_pretrained(checkpoint)
115
+ model = AutoModelForCausalLM.from_pretrained(checkpoint, quantization_config=quantization_config, trust_remote_code=True)
116
+ inputs = tokenizer.encode("Today I planned to ", return_tensors="pt").to("cuda")
117
+ outputs = model.generate(inputs)
118
+ print(tokenizer.decode(outputs[0]))
119
+
120
+ ```
121
+
122
+
123
+ ## Training Details
124
+
125
+ ### Training Data
126
+
127
+ - For training, we used the [Inkuba-Mono](https://huggingface.co/datasets/lelapa/Inkuba-Mono) dataset.
128
+
129
+
130
+
131
+ #### Training Hyperparameters
132
+
133
+ | Hyperparameter | Value |
134
+ | ----------- | ----------- |
135
+ | Total Parameters | 0.422B |
136
+ | Hidden Size | 2048 |
137
+ | Intermediate Size (MLPs) | 5632 |
138
+ | Number of Attention Heads | 32 |
139
+ | Number of Hidden Layers | 8 |
140
+ | RMSNorm ɛ | 1e-5 |
141
+ | Max Seq Length | 2048 |
142
+ | Vocab Size | 61788 |
143
+
144
+ ## Limitations
145
+ The InkubaLM model has been trained on multilingual datasets but does have some limitations. It is capable of understanding and generating content in five African languages: Swahili, Yoruba, Hausa, isiZulu, and isiXhosa, as well as English and French. While it can generate text on various topics, the resulting content may not always be entirely accurate, logically consistent, or free from biases found in the training data. Additionally, the model may sometimes use different languages when generating text. Nonetheless, this model is intended to be a foundational tool to aid research in African languages.
146
+
147
+ ## Ethical Considerations and Risks
148
+ InkubaLM is a small LM developed for five African languages. The model has been evaluated only on sentiment analysis, machine translation, AfriMMLU, and AfriXNLI tasks, which do not cover all possible evaluation scenarios. As with other language models, it is impossible to predict all of InkubaLM's potential outputs in advance, and in some cases the model may produce inaccurate, biased, or objectionable responses. Therefore, before using the model in any application, users should conduct safety testing and tuning tailored to their intended use.
149
+
150
+ ## Citation
151
+
152
+ ```
153
+ @article{tonja2024inkubalm,
154
+ title={InkubaLM: A small language model for low-resource African languages},
155
+ author={Tonja, Atnafu Lambebo and Dossou, Bonaventure FP and Ojo, Jessica and Rajab, Jenalea and Thior, Fadel and Wairagala, Eric Peter and Anuoluwapo, Aremu and Moiloa, Pelonomi and Abbott, Jade and Marivate, Vukosi and others},
156
+ journal={arXiv preprint arXiv:2408.17024},
157
+ year={2024}
158
+ }
159
+ ```
160
+
161
+ ## Model Card Authors
162
+
163
+ [Lelapa AI](https://lelapa.ai/) - Fundamental Research Team
164
+
165
+ ## Model Card Contact
166
+
167
+ [Lelapa AI](https://lelapa.ai/)
config.json ADDED
@@ -0,0 +1,32 @@
1
+ {
2
+ "_name_or_path": "yaya-sy/lil-inkuba",
3
+ "architectures": [
4
+ "VulavulaLlamaForCausalLM"
5
+ ],
6
+ "attention_bias": false,
7
+ "attention_dropout": 0.0,
8
+ "auto_map": {
9
+ "AutoModelForCausalLM": "yaya-sy/lil-inkuba--vulavulaslm.VulavulaLlamaForCausalLM"
10
+ },
11
+ "bos_token_id": 1,
12
+ "eos_token_id": 2,
13
+ "hidden_act": "silu",
14
+ "hidden_size": 2048,
15
+ "initializer_range": 0.02,
16
+ "intermediate_size": 5632,
17
+ "max_position_embeddings": 2048,
18
+ "mlp_bias": false,
19
+ "model_type": "llama",
20
+ "num_attention_heads": 32,
21
+ "num_hidden_layers": 8,
22
+ "num_key_value_heads": 32,
23
+ "pretraining_tp": 1,
24
+ "rms_norm_eps": 1e-05,
25
+ "rope_scaling": null,
26
+ "rope_theta": 10000.0,
27
+ "tie_word_embeddings": false,
28
+ "torch_dtype": "bfloat16",
29
+ "transformers_version": "4.44.2",
30
+ "use_cache": true,
31
+ "vocab_size": 61788
32
+ }
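As a reading aid for the configuration above, the following minimal sketch derives the per-head dimension and the attention variant from the listed values. The helper name `describe_attention` is hypothetical, and the JSON literal is only a subset of the file, copied here so the example is self-contained.

```python
import json

# Subset of the config.json values shown above, copied verbatim.
CONFIG_JSON = """
{
  "hidden_size": 2048,
  "intermediate_size": 5632,
  "num_attention_heads": 32,
  "num_hidden_layers": 8,
  "num_key_value_heads": 32,
  "vocab_size": 61788
}
"""


def describe_attention(config: dict) -> dict:
    """Hypothetical helper: summarize the attention layout implied by a config."""
    heads = config["num_attention_heads"]
    kv_heads = config["num_key_value_heads"]
    hidden = config["hidden_size"]
    if hidden % heads != 0:
        raise ValueError("hidden_size must be divisible by num_attention_heads")
    if kv_heads == heads:
        kind = "MHA"  # one key/value head per query head
    elif kv_heads == 1:
        kind = "MQA"  # a single key/value head shared by all query heads
    else:
        kind = "GQA"  # key/value heads shared within groups of query heads
    return {"head_dim": hidden // heads, "attention": kind}


summary = describe_attention(json.loads(CONFIG_JSON))
print(summary)
```

With 32 attention heads and 32 key/value heads, the configuration corresponds to standard multi-head attention rather than grouped-query attention.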
config_vulavulaslm.py ADDED
@@ -0,0 +1,182 @@
1
+ # coding=utf-8
2
+ # Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
3
+ #
4
+ # This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
5
+ # and OPT implementations in this library. It has been modified from its
6
+ # original forms to accommodate minor architectural differences compared
7
+ # to GPT-NeoX and OPT used by the Meta AI team that trained the model.
8
+ #
9
+ # Licensed under the Apache License, Version 2.0 (the "License");
10
+ # you may not use this file except in compliance with the License.
11
+ # You may obtain a copy of the License at
12
+ #
13
+ # http://www.apache.org/licenses/LICENSE-2.0
14
+ #
15
+ # Unless required by applicable law or agreed to in writing, software
16
+ # distributed under the License is distributed on an "AS IS" BASIS,
17
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
18
+ # See the License for the specific language governing permissions and
19
+ # limitations under the License.
20
+ """ LLaMA model configuration"""
21
+
22
+ from transformers.configuration_utils import PretrainedConfig
23
+ from transformers.utils import logging
24
+
25
+
26
+ logger = logging.get_logger(__name__)
27
+
28
+ LLAMA_PRETRAINED_CONFIG_ARCHIVE_MAP = {}
29
+
30
+
31
+ class LlamaConfig(PretrainedConfig):
32
+ r"""
33
+ This is the configuration class to store the configuration of a [`LlamaModel`]. It is used to instantiate an LLaMA
34
+ model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
35
+ defaults will yield a similar configuration to that of the LLaMA-7B.
36
+
37
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
38
+ documentation from [`PretrainedConfig`] for more information.
39
+
40
+
41
+ Args:
42
+ vocab_size (`int`, *optional*, defaults to 61788):
43
+ Vocabulary size of the LLaMA model. Defines the number of different tokens that can be represented by the
44
+ `input_ids` passed when calling [`LlamaModel`]
45
+ hidden_size (`int`, *optional*, defaults to 4096):
46
+ Dimension of the hidden representations.
47
+ intermediate_size (`int`, *optional*, defaults to 11008):
48
+ Dimension of the MLP representations.
49
+ num_hidden_layers (`int`, *optional*, defaults to 32):
50
+ Number of hidden layers in the Transformer decoder.
51
+ num_attention_heads (`int`, *optional*, defaults to 32):
52
+ Number of attention heads for each attention layer in the Transformer decoder.
53
+ num_key_value_heads (`int`, *optional*):
54
+ This is the number of key_value heads that should be used to implement Grouped Query Attention. If
55
+ `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
56
+ `num_key_value_heads=1`, the model will use Multi Query Attention (MQA); otherwise GQA is used. When
57
+ converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
58
+ by mean-pooling all the original heads within that group. For more details, check out [this
59
+ paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
60
+ `num_attention_heads`.
61
+ hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
62
+ The non-linear activation function (function or string) in the decoder.
63
+ max_position_embeddings (`int`, *optional*, defaults to 2048):
64
+ The maximum sequence length that this model might ever be used with. Llama 1 supports up to 2048 tokens,
65
+ Llama 2 up to 4096, CodeLlama up to 16384.
66
+ initializer_range (`float`, *optional*, defaults to 0.02):
67
+ The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
68
+ rms_norm_eps (`float`, *optional*, defaults to 1e-06):
69
+ The epsilon used by the rms normalization layers.
70
+ use_cache (`bool`, *optional*, defaults to `True`):
71
+ Whether or not the model should return the last key/values attentions (not used by all models). Only
72
+ relevant if `config.is_decoder=True`.
73
+ pad_token_id (`int`, *optional*):
74
+ Padding token id.
75
+ bos_token_id (`int`, *optional*, defaults to 1):
76
+ Beginning of stream token id.
77
+ eos_token_id (`int`, *optional*, defaults to 2):
78
+ End of stream token id.
79
+ pretraining_tp (`int`, *optional*, defaults to 1):
80
+ Experimental feature. Tensor parallelism rank used during pretraining. Please refer to [this
81
+ document](https://huggingface.co/docs/transformers/parallelism) to understand more about it. This value is
82
+ necessary to ensure exact reproducibility of the pretraining results. Please refer to [this
83
+ issue](https://github.com/pytorch/pytorch/issues/76232).
84
+ tie_word_embeddings (`bool`, *optional*, defaults to `False`):
85
+ Whether to tie weight embeddings
86
+ rope_theta (`float`, *optional*, defaults to 10000.0):
87
+ The base period of the RoPE embeddings.
88
+ rope_scaling (`Dict`, *optional*):
89
+ Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling
90
+ strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format is
91
+ `{"type": strategy name, "factor": scaling factor}`. When using this flag, don't update
92
+ `max_position_embeddings` to the expected new maximum. See the following thread for more information on how
93
+ these scaling strategies behave:
94
+ https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/. This is an
95
+ experimental feature, subject to breaking API changes in future versions.
96
+ attention_bias (`bool`, *optional*, defaults to `False`):
97
+ Whether to use a bias in the query, key, value and output projection layers during self-attention.
98
+ attention_dropout (`float`, *optional*, defaults to 0.0):
99
+ The dropout ratio for the attention probabilities.
100
+
101
+ ```python
102
+
103
+ ```"""
104
+
105
+ model_type = "llama"
106
+ keys_to_ignore_at_inference = ["past_key_values"]
107
+
108
+ def __init__(
109
+ self,
110
+ vocab_size=61788,
111
+ hidden_size=4096,
112
+ intermediate_size=11008,
113
+ num_hidden_layers=32,
114
+ num_attention_heads=32,
115
+ num_key_value_heads=None,
116
+ hidden_act="silu",
117
+ max_position_embeddings=2048,
118
+ initializer_range=0.02,
119
+ rms_norm_eps=1e-6,
120
+ use_cache=True,
121
+ pad_token_id=None,
122
+ bos_token_id=1,
123
+ eos_token_id=2,
124
+ pretraining_tp=1,
125
+ tie_word_embeddings=False,
126
+ rope_theta=10000.0,
127
+ rope_scaling=None,
128
+ attention_bias=False,
129
+ attention_dropout=0.0,
130
+ **kwargs,
131
+ ):
132
+ self.vocab_size = vocab_size
133
+ self.max_position_embeddings = max_position_embeddings
134
+ self.hidden_size = hidden_size
135
+ self.intermediate_size = intermediate_size
136
+ self.num_hidden_layers = num_hidden_layers
137
+ self.num_attention_heads = num_attention_heads
138
+
139
+ # for backward compatibility
140
+ if num_key_value_heads is None:
141
+ num_key_value_heads = num_attention_heads
142
+
143
+ self.num_key_value_heads = num_key_value_heads
144
+ self.hidden_act = hidden_act
145
+ self.initializer_range = initializer_range
146
+ self.rms_norm_eps = rms_norm_eps
147
+ self.pretraining_tp = pretraining_tp
148
+ self.use_cache = use_cache
149
+ self.rope_theta = rope_theta
150
+ self.rope_scaling = rope_scaling
151
+ self._rope_scaling_validation()
152
+ self.attention_bias = attention_bias
153
+ self.attention_dropout = attention_dropout
154
+
155
+ super().__init__(
156
+ pad_token_id=pad_token_id,
157
+ bos_token_id=bos_token_id,
158
+ eos_token_id=eos_token_id,
159
+ tie_word_embeddings=tie_word_embeddings,
160
+ **kwargs,
161
+ )
162
+
163
+ def _rope_scaling_validation(self):
164
+ """
165
+ Validate the `rope_scaling` configuration.
166
+ """
167
+ if self.rope_scaling is None:
168
+ return
169
+
170
+ if not isinstance(self.rope_scaling, dict) or len(self.rope_scaling) != 2:
171
+ raise ValueError(
172
+ "`rope_scaling` must be a dictionary with two fields, `type` and `factor`, "
173
+ f"got {self.rope_scaling}"
174
+ )
175
+ rope_scaling_type = self.rope_scaling.get("type", None)
176
+ rope_scaling_factor = self.rope_scaling.get("factor", None)
177
+ if rope_scaling_type is None or rope_scaling_type not in ["linear", "dynamic"]:
178
+ raise ValueError(
179
+ f"`rope_scaling`'s type field must be one of ['linear', 'dynamic'], got {rope_scaling_type}"
180
+ )
181
+ if rope_scaling_factor is None or not isinstance(rope_scaling_factor, float) or rope_scaling_factor <= 1.0:
182
+ raise ValueError(f"`rope_scaling`'s factor field must be a float > 1, got {rope_scaling_factor}")
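To illustrate which `rope_scaling` values the validation above accepts, here is a standalone sketch that mirrors the same checks outside the config class. The function names `validate_rope_scaling` and `is_valid` are hypothetical, and the example inputs are made up for illustration.

```python
def validate_rope_scaling(rope_scaling):
    """Standalone mirror of the checks in `_rope_scaling_validation` above."""
    if rope_scaling is None:
        return
    if not isinstance(rope_scaling, dict) or len(rope_scaling) != 2:
        raise ValueError(f"expected a dict with `type` and `factor`, got {rope_scaling}")
    scaling_type = rope_scaling.get("type", None)
    factor = rope_scaling.get("factor", None)
    if scaling_type not in ["linear", "dynamic"]:
        raise ValueError(f"type must be 'linear' or 'dynamic', got {scaling_type}")
    if factor is None or not isinstance(factor, float) or factor <= 1.0:
        raise ValueError(f"factor must be a float > 1, got {factor}")


def is_valid(rope_scaling) -> bool:
    """Hypothetical convenience wrapper that reports validity as a boolean."""
    try:
        validate_rope_scaling(rope_scaling)
    except ValueError:
        return False
    return True


examples = [
    None,                                 # no scaling configured
    {"type": "linear", "factor": 2.0},    # well-formed
    {"type": "linear", "factor": 2},      # integer factor, not a float
    {"type": "yarn", "factor": 2.0},      # unsupported strategy name
    {"type": "dynamic", "factor": 1.0},   # factor must be strictly greater than 1
]
results = [is_valid(example) for example in examples]
print(results)
```

One consequence worth noting is that an integer factor such as `2` is rejected because of the `isinstance(..., float)` check, so the factor has to be written as `2.0`.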
generation_config.json ADDED
@@ -0,0 +1,6 @@
1
+ {
2
+ "_from_model_config": true,
3
+ "bos_token_id": 1,
4
+ "eos_token_id": 2,
5
+ "transformers_version": "4.44.2"
6
+ }
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:30d4a620ece23e72fe9796adc7a21c03d29aa31a96ced28e4f63b4574c2ed864
3
+ size 468472622
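The three lines above are a Git LFS pointer rather than the weights themselves. As a minimal sketch of how such a pointer could be read and later used to check a downloaded file, the helpers `parse_lfs_pointer` and `matches_pointer` below are hypothetical names, and the pointer text is copied from the diff above.

```python
import hashlib

# Pointer text copied from the pytorch_model.bin entry above.
POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:30d4a620ece23e72fe9796adc7a21c03d29aa31a96ced28e4f63b4574c2ed864
size 468472622
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split each `key value` line of a Git LFS pointer into a small dict."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition(" ")
        fields[key] = value
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


def matches_pointer(path: str, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Check a local file against the pointer's size and SHA-256 digest."""
    sha = hashlib.sha256()
    total = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            sha.update(chunk)
            total += len(chunk)
    return total == pointer["size_bytes"] and sha.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER_TEXT)
size_mib = pointer["size_bytes"] / (1024 ** 2)
print(pointer["hash_algorithm"], f"{size_mib:.1f} MiB")
```

`matches_pointer` is not called here because it needs the actual downloaded file; it would be invoked with the local path of `pytorch_model.bin`.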
special_tokens_map.json ADDED
@@ -0,0 +1,24 @@
1
+ {
2
+ "bos_token": {
3
+ "content": "<s>",
4
+ "lstrip": false,
5
+ "normalized": false,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "eos_token": {
10
+ "content": "</s>",
11
+ "lstrip": false,
12
+ "normalized": false,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "pad_token": "</s>",
17
+ "unk_token": {
18
+ "content": "<unk>",
19
+ "lstrip": false,
20
+ "normalized": false,
21
+ "rstrip": false,
22
+ "single_word": false
23
+ }
24
+ }
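Because the map above sets `pad_token` to the same string as `eos_token`, padding cannot be told apart from a genuine end-of-sequence token by id alone. The sketch below, in plain Python with made-up token ids, shows why an explicit attention mask is the safer choice for batched input; `pad_batch` and `naive_mask` are hypothetical helpers, and left padding is an assumption of this example.

```python
PAD_ID = EOS_ID = 2  # pad and eos share id 2 in this repository's files


def pad_batch(sequences, pad_id=PAD_ID):
    """Left-pad token id lists to a common width and build an explicit mask."""
    width = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = width - len(seq)
        input_ids.append([pad_id] * n_pad + list(seq))
        attention_mask.append([0] * n_pad + [1] * len(seq))
    return input_ids, attention_mask


def naive_mask(input_ids, pad_id=PAD_ID):
    """Infer a mask from ids alone; this also hides genuine eos tokens."""
    return [[int(token != pad_id) for token in row] for row in input_ids]


# Token ids other than 1 (bos) and 2 (eos/pad) are invented for illustration.
batch = [[1, 15, 27, 2], [1, 99]]
ids, mask = pad_batch(batch)
print(ids)
print(mask)
print(naive_mask(ids))
```

The explicit mask keeps the real end-of-sequence token in the first row visible, while the inferred mask drops it, which is one reason to pass the tokenizer's `attention_mask` to `generate` instead of relying on the pad id.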
tokenizer.json ADDED
The diff for this file is too large to render.
tokenizer.model ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1c41fcc6d44fcc4e8269e41dffe0123687baf800bd95a9c8b5d48abd9cb8971b
3
+ size 991189
tokenizer_config.json ADDED
@@ -0,0 +1,42 @@
1
+ {
2
+ "add_bos_token": true,
3
+ "add_eos_token": false,
4
+ "add_prefix_space": true,
5
+ "added_tokens_decoder": {
6
+ "0": {
7
+ "content": "<unk>",
8
+ "lstrip": false,
9
+ "normalized": false,
10
+ "rstrip": false,
11
+ "single_word": false,
12
+ "special": true
13
+ },
14
+ "1": {
15
+ "content": "<s>",
16
+ "lstrip": false,
17
+ "normalized": false,
18
+ "rstrip": false,
19
+ "single_word": false,
20
+ "special": true
21
+ },
22
+ "2": {
23
+ "content": "</s>",
24
+ "lstrip": false,
25
+ "normalized": false,
26
+ "rstrip": false,
27
+ "single_word": false,
28
+ "special": true
29
+ }
30
+ },
31
+ "bos_token": "<s>",
32
+ "clean_up_tokenization_spaces": false,
33
+ "eos_token": "</s>",
34
+ "legacy": true,
35
+ "model_max_length": 1000000000000000019884624838656,
36
+ "pad_token": "</s>",
37
+ "sp_model_kwargs": {},
38
+ "spaces_between_special_tokens": false,
39
+ "tokenizer_class": "LlamaTokenizer",
40
+ "unk_token": "<unk>",
41
+ "use_default_system_prompt": false
42
+ }
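The `model_max_length` value above is not a real limit: it equals Python's `int(1e30)`, which appears to be the placeholder the tokenizer library writes when no maximum length was configured. A small sketch, assuming one wants to cap inputs at the model's `max_position_embeddings` of 2048 from `config.json`; the helper `effective_max_length` is hypothetical.

```python
# Value copied from tokenizer_config.json above.
MODEL_MAX_LENGTH = 1000000000000000019884624838656
# Value copied from config.json in the same commit.
MAX_POSITION_EMBEDDINGS = 2048


def effective_max_length(model_max_length: int, max_position_embeddings: int) -> int:
    """Hypothetical helper: use whichever limit is tighter."""
    return min(model_max_length, max_position_embeddings)


# The float 1e30 is not exactly representable, which explains the odd digits.
is_placeholder = MODEL_MAX_LENGTH == int(1e30)
limit = effective_max_length(MODEL_MAX_LENGTH, MAX_POSITION_EMBEDDINGS)
print(is_placeholder, limit)
```

In practice this means truncation has to be requested with an explicit `max_length` when tokenizing, since the stored value never triggers it.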
vulavulaslm.py ADDED
@@ -0,0 +1,948 @@
1
+ # coding=utf-8
2
+ # Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
3
+ #
4
+ # This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
5
+ # and OPT implementations in this library. It has been modified from its
6
+ # original forms to accommodate minor architectural differences compared
7
+ # to GPT-NeoX and OPT used by the Meta AI team that trained the model.
8
+ #
9
+ # Licensed under the Apache License, Version 2.0 (the "License");
10
+ # you may not use this file except in compliance with the License.
11
+ # You may obtain a copy of the License at
12
+ #
13
+ # http://www.apache.org/licenses/LICENSE-2.0
14
+ #
15
+ # Unless required by applicable law or agreed to in writing, software
16
+ # distributed under the License is distributed on an "AS IS" BASIS,
17
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
18
+ # See the License for the specific language governing permissions and
19
+ # limitations under the License.
20
+ """ PyTorch LLaMA model."""
21
+ import math
22
+ from typing import List, Optional, Tuple, Union
23
+
24
+ import torch
25
+ from torch import nn
26
+ import torch.nn.functional as F
27
+ import torch.utils.checkpoint
28
+ from torch import nn
29
+ from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
30
+
31
+ from transformers.activations import ACT2FN
32
+ from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast, SequenceClassifierOutputWithPast
33
+ from transformers.modeling_utils import PreTrainedModel
34
+ from transformers.utils import add_start_docstrings, add_start_docstrings_to_model_forward, logging, replace_return_docstrings
35
+ from transformers.models.llama.configuration_llama import LlamaConfig
36
+
37
+ logger = logging.get_logger(__name__)
38
+
39
+ _CONFIG_FOR_DOC = "LlamaConfig"
40
+
41
+ def get_gpu_architecture():
42
+ if torch.cuda.is_available():
43
+ device_name = torch.cuda.get_device_name(0).lower()
44
+ if 't4' in device_name or 'v100' in device_name or 'p100' in device_name:
45
+ return 'Turing' # T4 is Turing; V100 (Volta) and P100 (Pascal) are also grouped here
46
+ elif 'a100' in device_name or 'h100' in device_name:
47
+ return 'Ampere' # A100 is Ampere; H100 is actually Hopper but is grouped here
48
+ elif 'h8000' in device_name or 'a8000' in device_name:
49
+ return 'Hopper' # H8000 is in the Hopper family
50
+ elif 'rtx' in device_name:
51
+ return 'Ada' # only RTX 40-series cards are Ada; older RTX cards are Turing or Ampere
52
+ else:
53
+ return 'Unknown'
54
+ else:
55
+ return 'No GPU available'
56
+
57
+ use_flash_attention_from_library = True
58
+ try:
59
+ from flash_attn import flash_attn_func
60
+ except Exception:
61
+ # hack to work with T4 GPUs, which need a 1.x version (currently using 1.0.9)
62
+ # Import the triton implementation (torch.nn.functional version only)
63
+ use_flash_attention_from_library = False
64
+ pass
65
+
66
+ # Copied from transformers.models.bart.modeling_bart._make_causal_mask
67
+ def _make_causal_mask(
68
+ input_ids_shape: torch.Size, dtype: torch.dtype, device: torch.device, past_key_values_length: int = 0
69
+ ):
70
+ """
71
+ Make causal mask used for uni-directional (causal) self-attention.
72
+ """
73
+ bsz, tgt_len = input_ids_shape
74
+ mask = torch.full((tgt_len, tgt_len), torch.tensor(torch.finfo(dtype).min, device=device), device=device)
75
+ mask_cond = torch.arange(mask.size(-1), device=device)
76
+ mask.masked_fill_(mask_cond < (mask_cond + 1).view(mask.size(-1), 1), 0)
77
+ mask = mask.to(dtype)
78
+
79
+ if past_key_values_length > 0:
80
+ mask = torch.cat([torch.zeros(tgt_len, past_key_values_length, dtype=dtype, device=device), mask], dim=-1)
81
+ return mask[None, None, :, :].expand(bsz, 1, tgt_len, tgt_len + past_key_values_length)
82
+
83
+
84
+ # Copied from transformers.models.bart.modeling_bart._expand_mask
85
+ def _expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: Optional[int] = None):
86
+ """
87
+ Expands attention_mask from `[bsz, seq_len]` to `[bsz, 1, tgt_seq_len, src_seq_len]`.
88
+ """
89
+ bsz, src_len = mask.size()
90
+ tgt_len = tgt_len if tgt_len is not None else src_len
91
+
92
+ expanded_mask = mask[:, None, None, :].expand(bsz, 1, tgt_len, src_len).to(dtype)
93
+
94
+ inverted_mask = 1.0 - expanded_mask
95
+
96
+ return inverted_mask.masked_fill(inverted_mask.to(torch.bool), torch.finfo(dtype).min)
97
+
98
+
99
+ class VulavulaLlamaRMSNorm(nn.Module):
100
+ def __init__(self, hidden_size, eps=1e-6):
101
+ """
102
+ LlamaRMSNorm is equivalent to T5LayerNorm
103
+ """
104
+ super().__init__()
105
+ self.weight = nn.Parameter(torch.ones(hidden_size))
106
+ self.variance_epsilon = eps
107
+
108
+ def forward(self, hidden_states):
109
+ input_dtype = hidden_states.dtype
110
+ variance = hidden_states.to(torch.float32).pow(2).mean(-1, keepdim=True)
111
+ hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
112
+
113
+ return (self.weight * hidden_states).to(input_dtype)
114
+
115
+
116
+ class VulavulaLlamaRotaryEmbedding(torch.nn.Module):
117
+ def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None):
118
+ super().__init__()
119
+ inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float().to(device) / dim))
120
+ self.register_buffer("inv_freq", inv_freq)
121
+
122
+ # Build here to make `torch.jit.trace` work.
123
+ self.max_seq_len_cached = max_position_embeddings
124
+ t = torch.arange(self.max_seq_len_cached, device=self.inv_freq.device, dtype=self.inv_freq.dtype)
125
+ freqs = torch.einsum("i,j->ij", t, self.inv_freq)
126
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
127
+ emb = torch.cat((freqs, freqs), dim=-1)
128
+ self.register_buffer("cos_cached", emb.cos()[None, None, :, :], persistent=False)
129
+ self.register_buffer("sin_cached", emb.sin()[None, None, :, :], persistent=False)
130
+
131
+ def forward(self, x, seq_len=None):
132
+ # x: [bs, num_attention_heads, seq_len, head_size]
133
+ # This `if` block is unlikely to be run after we build sin/cos in `__init__`. Keep the logic here just in case.
134
+ if seq_len > self.max_seq_len_cached:
135
+ self.max_seq_len_cached = seq_len
136
+ t = torch.arange(self.max_seq_len_cached, device=x.device, dtype=self.inv_freq.dtype)
137
+ freqs = torch.einsum("i,j->ij", t, self.inv_freq)
138
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
139
+ emb = torch.cat((freqs, freqs), dim=-1).to(x.device)
140
+ self.register_buffer("cos_cached", emb.cos()[None, None, :, :], persistent=False)
141
+ self.register_buffer("sin_cached", emb.sin()[None, None, :, :], persistent=False)
142
+ return (
143
+ self.cos_cached[:, :, :seq_len, ...].to(dtype=x.dtype),
144
+ self.sin_cached[:, :, :seq_len, ...].to(dtype=x.dtype),
145
+ )
146
+
147
+
148
+ def rotate_half(x):
149
+ """Rotates half the hidden dims of the input."""
150
+ x1 = x[..., : x.shape[-1] // 2]
151
+ x2 = x[..., x.shape[-1] // 2 :]
152
+ return torch.cat((-x2, x1), dim=-1)
153
+
154
+
155
+ def apply_rotary_pos_emb(q, k, cos, sin, position_ids):
156
+ # The first two dimensions of cos and sin are always 1, so we can `squeeze` them.
157
+ cos = cos.squeeze(1).squeeze(0) # [seq_len, dim]
158
+ sin = sin.squeeze(1).squeeze(0) # [seq_len, dim]
159
+ cos = cos[position_ids].unsqueeze(1) # [bs, 1, seq_len, dim]
160
+ sin = sin[position_ids].unsqueeze(1) # [bs, 1, seq_len, dim]
161
+ q_embed = (q * cos) + (rotate_half(q) * sin)
162
+ k_embed = (k * cos) + (rotate_half(k) * sin)
163
+ return q_embed, k_embed
164
+
165
+
166
+ class VulavulaLlamaMLP(nn.Module):
167
+ def __init__(
168
+ self,
169
+ hidden_size: int,
170
+ intermediate_size: int,
171
+ hidden_act: str,
172
+ ):
173
+ super().__init__()
174
+ self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
175
+ self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)
176
+ self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
177
+ self.act_fn = ACT2FN[hidden_act]
178
+
179
+ def forward(self, x):
180
+ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
181
+
182
+
183
+ class VulavulaLlamaAttention(nn.Module):
184
+ """Multi-headed attention from 'Attention Is All You Need' paper"""
185
+
186
+ def __init__(self, config: LlamaConfig):
187
+ super().__init__()
188
+ self.config = config
189
+ self.hidden_size = config.hidden_size
190
+ self.num_heads = config.num_attention_heads
191
+ self.head_dim = self.hidden_size // self.num_heads
192
+ self.max_position_embeddings = config.max_position_embeddings
193
+
194
+ if (self.head_dim * self.num_heads) != self.hidden_size:
195
+ raise ValueError(
196
+ f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}"
197
+ f" and `num_heads`: {self.num_heads})."
198
+ )
199
+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=False)
200
+ self.k_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=False)
201
+ self.v_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=False)
202
+ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False)
203
+ self.rotary_emb = VulavulaLlamaRotaryEmbedding(self.head_dim, max_position_embeddings=self.max_position_embeddings)
204
+
205
+ # self.gpu_architecture = get_gpu_architecture()
206
+ # if self.gpu_architecture in ['Ampere', 'Ada', 'Hopper']:
207
+ # self.use_flash_attn = True
208
+ # else:
209
+ # self.use_flash_attn = False
210
+ self.use_flash_attn = use_flash_attention_from_library
211
+
212
+ def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
213
+ return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous()
214
+
215
+ def forward(
216
+ self,
217
+ hidden_states: torch.Tensor,
218
+ attention_mask: Optional[torch.Tensor] = None,
219
+ position_ids: Optional[torch.LongTensor] = None,
220
+ past_key_value: Optional[Tuple[torch.Tensor]] = None,
221
+ output_attentions: bool = False,
222
+ use_cache: bool = False,
223
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
224
+ bsz, q_len, _ = hidden_states.size()
225
+
226
+ query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
227
+ key_states = self.k_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
228
+ value_states = self.v_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
229
+
230
+ kv_seq_len = key_states.shape[-2]
231
+ if past_key_value is not None:
232
+ kv_seq_len += past_key_value[0].shape[-2]
233
+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
234
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
235
+
236
+ if past_key_value is not None:
237
+ key_states = torch.cat([past_key_value[0], key_states], dim=2)
238
+ value_states = torch.cat([past_key_value[1], value_states], dim=2)
239
+
240
+ past_key_value = (key_states, value_states) if use_cache else None
241
+
242
+ if self.use_flash_attn:
243
+ attn_output = flash_attn_func(q=query_states.transpose(1, 2).to(torch.bfloat16),
244
+ k=key_states.transpose(1, 2).to(torch.bfloat16),
245
+ v=value_states.transpose(1, 2).to(torch.bfloat16),
246
+ causal=True)
247
+ else:
248
+ attn_output = self.custom_flash_attention(query_states.transpose(1, 2).to(torch.bfloat16),
249
+ key_states.transpose(1, 2).to(torch.bfloat16),
250
+ value_states.transpose(1, 2).to(torch.bfloat16),
251
+ causal=True)
252
+
253
+ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
254
+ attn_output = attn_output.to(query_states.dtype)
255
+
256
+ attn_output = self.o_proj(attn_output)
257
+ assert not output_attentions
258
+ attn_weights = None
259
+
260
+ return attn_output, attn_weights, past_key_value
261
+
262
+ def custom_flash_attention(self, q, k, v, dropout_p=0.0, softmax_scale=None, causal=False):
263
+ """
264
+ Compute scaled dot-product attention in plain PyTorch (fallback used when `flash_attn_func` is not selected).
265
+ Args:
266
+ q: Queries tensor of shape (batch_size, seq_len, num_heads, head_dim).
267
+ k: Keys tensor of shape (batch_size, seq_len, num_heads, head_dim).
268
+ v: Values tensor of shape (batch_size, seq_len, num_heads, head_dim).
269
+ dropout_p: Dropout probability.
270
+ softmax_scale: Scaling factor for QK^T before applying softmax. Defaults to 1 / sqrt(head_dim).
271
+ causal: Whether to apply causal attention mask (e.g., for autoregressive modeling).
272
+
273
+ Returns:
274
+ Output tensor of shape (batch_size, seq_len, num_heads, head_dim).
275
+ """
276
+ batch_size, seq_len, num_heads, head_dim = q.size()
277
+
278
+ if softmax_scale is None:
279
+ softmax_scale = 1.0 / (head_dim ** 0.5)
280
+
281
+ # Compute raw attention scores
282
+ attn_scores = torch.einsum('bqhd,bkhd->bhqk', q, k) * softmax_scale
283
+
284
+ # Apply causal mask if needed
285
+ if causal:
286
+ causal_mask = torch.tril(torch.ones(seq_len, k.size(1), device=q.device, dtype=torch.bool), diagonal=k.size(1) - seq_len).unsqueeze(0).unsqueeze(0)
287
+ attn_scores = attn_scores.masked_fill(~causal_mask, float('-inf'))
288
+
289
+ # Compute attention probabilities
290
+ attn_probs = F.softmax(attn_scores, dim=-1)
291
+ attn_probs = F.dropout(attn_probs, p=dropout_p, training=self.training)
292
+
293
+ # Compute the output
294
+ output = torch.einsum('bhqk,bkhd->bqhd', attn_probs, v)
295
+
296
+ return output
297
+
298
+ def standard_attention(self, q, k, v, mask=None):
299
+ d_k = q.size(-1)
300
+ scores = torch.matmul(q, k.transpose(-2, -1)) / (d_k ** 0.5)
301
+
302
+ if mask is not None:
303
+ scores = scores + mask
304
+
305
+ attn_weights = F.softmax(scores, dim=-1)
306
+ attn_output = torch.matmul(attn_weights, v)
307
+ return attn_output
308
+
309
+
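
When keys and values come from a cache, the query can be shorter than the key sequence, so the causal mask in the fallback path has to be aligned to the end of the keys. The sketch below illustrates that alignment with plain Python lists instead of tensors; the function name `causal_mask` is an illustrative assumption and is not part of this file.

```python
from typing import List


def causal_mask(q_len: int, kv_len: int) -> List[List[bool]]:
    """Return mask[i][j] == True where query i may attend to key j.

    The q_len queries are assumed to be the last q_len positions of a
    kv_len-long sequence (the usual situation with a key/value cache).
    """
    offset = kv_len - q_len  # number of cached positions before the first query
    return [[j <= i + offset for j in range(kv_len)] for i in range(q_len)]


# No cache: the usual lower-triangular mask.
full = causal_mask(3, 3)
# One new token attending to three cached keys plus itself.
step = causal_mask(1, 4)
```

With `q_len == kv_len` this reduces to the ordinary lower-triangular mask; with a single new token every key is visible.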
310
+ class VulavulaLlamaDecoderLayer(nn.Module):
311
+ def __init__(self, config: LlamaConfig, mlp):
312
+ super().__init__()
313
+ self.hidden_size = config.hidden_size
314
+ self.self_attn = VulavulaLlamaAttention(config=config)
315
+ self.mlp = mlp  # MLP module supplied by the caller; VulavulaLlamaModel passes the same instance to every layer, so the MLP weights are shared
316
+ self.input_layernorm = VulavulaLlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
317
+ self.post_attention_layernorm = VulavulaLlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
318
+
319
+ def forward(
320
+ self,
321
+ hidden_states: torch.Tensor,
322
+ attention_mask: Optional[torch.Tensor] = None,
323
+ position_ids: Optional[torch.LongTensor] = None,
324
+ past_key_value: Optional[Tuple[torch.Tensor]] = None,
325
+ output_attentions: Optional[bool] = False,
326
+ use_cache: Optional[bool] = False,
327
+ ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
328
+ """
329
+ Args:
330
+ hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
331
+ attention_mask (`torch.FloatTensor`, *optional*): attention mask of size
332
+ `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values.
333
+ output_attentions (`bool`, *optional*):
334
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under
335
+ returned tensors for more detail.
336
+ use_cache (`bool`, *optional*):
337
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
338
+ (see `past_key_values`).
339
+ past_key_value (`Tuple(torch.FloatTensor)`, *optional*): cached past key and value projection states
340
+ """
341
+
342
+ residual = hidden_states
343
+
344
+ hidden_states = self.input_layernorm(hidden_states)
345
+
346
+ # Self Attention
347
+ hidden_states, self_attn_weights, present_key_value = self.self_attn(
348
+ hidden_states=hidden_states,
349
+ attention_mask=attention_mask,
350
+ position_ids=position_ids,
351
+ past_key_value=past_key_value,
352
+ output_attentions=output_attentions,
353
+ use_cache=use_cache,
354
+ )
355
+ hidden_states = residual + hidden_states
356
+
357
+ # Fully Connected
358
+ residual = hidden_states
359
+ hidden_states = self.post_attention_layernorm(hidden_states)
360
+ hidden_states = self.mlp(hidden_states)
361
+ hidden_states = residual + hidden_states
362
+
363
+ outputs = (hidden_states,)
364
+
365
+ if output_attentions:
366
+ outputs += (self_attn_weights,)
367
+
368
+ if use_cache:
369
+ outputs += (present_key_value,)
370
+
371
+ return outputs
372
+
373
+
374
+ VULAVULALLAMA_START_DOCSTRING = r"""
375
+ This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
376
+ library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
377
+ etc.)
378
+
379
+ This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
380
+ Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
381
+ and behavior.
382
+
383
+ Parameters:
384
+ config ([`LlamaConfig`]):
385
+ Model configuration class with all the parameters of the model. Initializing with a config file does not
386
+ load the weights associated with the model, only the configuration. Check out the
387
+ [`~PreTrainedModel.from_pretrained`] method to load the model weights.
388
+ """
389
+
390
+
391
+ @add_start_docstrings(
392
+ "The bare LLaMA Model outputting raw hidden-states without any specific head on top.",
393
+ VULAVULALLAMA_START_DOCSTRING,
394
+ )
395
+ class VulavulaLlamaPreTrainedModel(PreTrainedModel):
396
+ config_class = LlamaConfig
397
+ base_model_prefix = "model"
398
+ supports_gradient_checkpointing = True
399
+ _no_split_modules = ["VulavulaLlamaDecoderLayer"]
400
+ _skip_keys_device_placement = "past_key_values"
401
+ _keys_to_ignore_on_load_unexpected = [r"decoder\.version"]
402
+
403
+ def _init_weights(self, module):
404
+ std = self.config.initializer_range
405
+ if isinstance(module, nn.Linear):
406
+ module.weight.data.normal_(mean=0.0, std=std)
407
+ if module.bias is not None:
408
+ module.bias.data.zero_()
409
+ elif isinstance(module, nn.Embedding):
410
+ module.weight.data.normal_(mean=0.0, std=std)
411
+ if module.padding_idx is not None:
412
+ module.weight.data[module.padding_idx].zero_()
413
+
414
+ def _set_gradient_checkpointing(self, module, value=False):
415
+ if isinstance(module, VulavulaLlamaModel):
416
+ module.gradient_checkpointing = value
417
+
418
+
419
+ VULAVULALLAMA_INPUTS_DOCSTRING = r"""
420
+ Args:
421
+ input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
422
+ Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
423
+ it.
424
+
425
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
426
+ [`PreTrainedTokenizer.__call__`] for details.
427
+
428
+ [What are input IDs?](../glossary#input-ids)
429
+ attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
430
+ Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
431
+
432
+ - 1 for tokens that are **not masked**,
433
+ - 0 for tokens that are **masked**.
434
+
435
+ [What are attention masks?](../glossary#attention-mask)
436
+
437
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
438
+ [`PreTrainedTokenizer.__call__`] for details.
439
+
440
+ If `past_key_values` is used, optionally only the last `input_ids` have to be input (see
441
+ `past_key_values`).
442
+
443
+ If you want to change padding behavior, you should read [`modeling_opt._prepare_decoder_attention_mask`]
444
+ and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more
445
+ information on the default strategy.
446
+
447
+ - 1 indicates the head is **not masked**,
448
+ - 0 indicates the head is **masked**.
449
+ position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
450
+ Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
451
+ config.n_positions - 1]`.
452
+
453
+ [What are position IDs?](../glossary#position-ids)
454
+ past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
455
+ Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
456
+ `(batch_size, num_heads, sequence_length, embed_size_per_head)`) and 2 additional tensors of shape
457
+ `(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)`.
458
+
459
+ Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
460
+ blocks) that can be used (see `past_key_values` input) to speed up sequential decoding.
461
+
462
+ If `past_key_values` are used, the user can optionally input only the last `input_ids` (those that
+ don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all
+ `input_ids` of shape `(batch_size, sequence_length)`.
465
+ inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
466
+ Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
467
+ is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
468
+ model's internal embedding lookup matrix.
469
+ use_cache (`bool`, *optional*):
470
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
471
+ `past_key_values`).
472
+ output_attentions (`bool`, *optional*):
473
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
474
+ tensors for more detail.
475
+ output_hidden_states (`bool`, *optional*):
476
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
477
+ more detail.
478
+ return_dict (`bool`, *optional*):
479
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
480
+ """
481
+
482
+
483
+ @add_start_docstrings(
484
+ "The bare LLaMA Model outputting raw hidden-states without any specific head on top.",
485
+ VULAVULALLAMA_START_DOCSTRING,
486
+ )
487
+ class VulavulaLlamaModel(VulavulaLlamaPreTrainedModel):
488
+ """
489
+ Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`VulavulaLlamaDecoderLayer`]
490
+
491
+ Args:
492
+ config: LlamaConfig
493
+ """
494
+
495
+ def __init__(self, config: LlamaConfig):
496
+ super().__init__(config)
497
+ self.padding_idx = config.pad_token_id
498
+ self.vocab_size = config.vocab_size
499
+
500
+ embed_tokens_down = nn.Embedding(config.vocab_size, 512, self.padding_idx)
501
+ embed_tokens_up = nn.Linear(512, config.hidden_size, bias=False)
502
+ self.embed_tokens = nn.Sequential(embed_tokens_down, embed_tokens_up)
503
+ mlp = VulavulaLlamaMLP(
504
+ hidden_size=config.hidden_size,
505
+ intermediate_size=config.intermediate_size,
506
+ hidden_act=config.hidden_act,
507
+ )
508
+ self.layers = nn.ModuleList([VulavulaLlamaDecoderLayer(config, mlp) for _ in range(config.num_hidden_layers)])
509
+ self.norm = VulavulaLlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
510
+
511
+ self.gradient_checkpointing = False
512
+ # Initialize weights and apply final processing
513
+ self.post_init()
514
+
515
+ def get_input_embeddings(self):
516
+ return self.embed_tokens
517
+
518
+ def set_input_embeddings(self, value):
519
+ self.embed_tokens = value
520
+
521
+ # Copied from transformers.models.bart.modeling_bart.BartDecoder._prepare_decoder_attention_mask
522
+ def _prepare_decoder_attention_mask(self, attention_mask, input_shape, inputs_embeds, past_key_values_length):
523
+ # create causal mask
524
+ # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
525
+ combined_attention_mask = None
526
+ if input_shape[-1] > 1:
527
+ combined_attention_mask = _make_causal_mask(
528
+ input_shape,
529
+ inputs_embeds.dtype,
530
+ device=inputs_embeds.device,
531
+ past_key_values_length=past_key_values_length,
532
+ )
533
+
534
+ if attention_mask is not None:
535
+ # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
536
+ expanded_attn_mask = _expand_mask(attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1]).to(
537
+ inputs_embeds.device
538
+ )
539
+ combined_attention_mask = (
540
+ expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask + combined_attention_mask
541
+ )
542
+
543
+ return combined_attention_mask
544
+
545
+ @add_start_docstrings_to_model_forward(VULAVULALLAMA_INPUTS_DOCSTRING)
546
+ def forward(
547
+ self,
548
+ input_ids: torch.LongTensor = None,
549
+ attention_mask: Optional[torch.Tensor] = None,
550
+ position_ids: Optional[torch.LongTensor] = None,
551
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
552
+ inputs_embeds: Optional[torch.FloatTensor] = None,
553
+ use_cache: Optional[bool] = None,
554
+ output_attentions: Optional[bool] = None,
555
+ output_hidden_states: Optional[bool] = None,
556
+ return_dict: Optional[bool] = None,
557
+ ) -> Union[Tuple, BaseModelOutputWithPast]:
558
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
559
+ output_hidden_states = (
560
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
561
+ )
562
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
563
+
564
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
565
+
566
+ # retrieve input_ids and inputs_embeds
567
+ if input_ids is not None and inputs_embeds is not None:
568
+ raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
569
+ elif input_ids is not None:
570
+ batch_size, seq_length = input_ids.shape
571
+ elif inputs_embeds is not None:
572
+ batch_size, seq_length, _ = inputs_embeds.shape
573
+ else:
574
+ raise ValueError("You have to specify either input_ids or inputs_embeds")
575
+
576
+ seq_length_with_past = seq_length
577
+ past_key_values_length = 0
578
+
579
+ if past_key_values is not None:
580
+ past_key_values_length = past_key_values[0][0].shape[2]
581
+ seq_length_with_past = seq_length_with_past + past_key_values_length
582
+
583
+ if position_ids is None:
584
+ device = input_ids.device if input_ids is not None else inputs_embeds.device
585
+ position_ids = torch.arange(
586
+ past_key_values_length, seq_length + past_key_values_length, dtype=torch.long, device=device
587
+ )
588
+ position_ids = position_ids.unsqueeze(0).view(-1, seq_length)
589
+ else:
590
+ position_ids = position_ids.view(-1, seq_length).long()
591
+
592
+ if inputs_embeds is None:
593
+ inputs_embeds = self.embed_tokens(input_ids)
594
+ # embed positions
595
+ if attention_mask is None:
596
+ attention_mask = torch.ones(
597
+ (batch_size, seq_length_with_past), dtype=torch.bool, device=inputs_embeds.device
598
+ )
599
+ attention_mask = self._prepare_decoder_attention_mask(
600
+ attention_mask, (batch_size, seq_length), inputs_embeds, past_key_values_length
601
+ )
602
+
603
+ hidden_states = inputs_embeds
604
+
605
+ if self.gradient_checkpointing and self.training:
606
+ if use_cache:
607
+ logger.warning_once(
608
+ "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
609
+ )
610
+ use_cache = False
611
+
612
+ # decoder layers
613
+ all_hidden_states = () if output_hidden_states else None
614
+ all_self_attns = () if output_attentions else None
615
+ next_decoder_cache = () if use_cache else None
616
+
617
+ for idx, decoder_layer in enumerate(self.layers):
618
+ if output_hidden_states:
619
+ all_hidden_states += (hidden_states,)
620
+
621
+ past_key_value = past_key_values[idx] if past_key_values is not None else None
622
+
623
+ if self.gradient_checkpointing and self.training:
624
+
625
+ def create_custom_forward(module):
626
+ def custom_forward(*inputs):
627
+ # None for past_key_value
628
+ return module(*inputs, output_attentions, None)
629
+
630
+ return custom_forward
631
+
632
+ layer_outputs = torch.utils.checkpoint.checkpoint(
633
+ create_custom_forward(decoder_layer),
634
+ hidden_states,
635
+ attention_mask,
636
+ position_ids,
637
+ None,
638
+ )
639
+ else:
640
+ layer_outputs = decoder_layer(
641
+ hidden_states,
642
+ attention_mask=attention_mask,
643
+ position_ids=position_ids,
644
+ past_key_value=past_key_value,
645
+ output_attentions=output_attentions,
646
+ use_cache=use_cache,
647
+ )
648
+
649
+ hidden_states = layer_outputs[0]
650
+
651
+ if use_cache:
652
+ next_decoder_cache += (layer_outputs[2 if output_attentions else 1],)
653
+
654
+ if output_attentions:
655
+ all_self_attns += (layer_outputs[1],)
656
+
657
+ hidden_states = self.norm(hidden_states)
658
+
659
+ # add hidden states from the last decoder layer
660
+ if output_hidden_states:
661
+ all_hidden_states += (hidden_states,)
662
+
663
+ next_cache = next_decoder_cache if use_cache else None
664
+ if not return_dict:
665
+ return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)
666
+ return BaseModelOutputWithPast(
667
+ last_hidden_state=hidden_states,
668
+ past_key_values=next_cache,
669
+ hidden_states=all_hidden_states,
670
+ attentions=all_self_attns,
671
+ )
672
+
673
+
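
The model above factorizes the token embedding into a `vocab_size x 512` lookup followed by a `512 x hidden_size` projection. A rough sketch of the parameter arithmetic follows. The vocabulary size 61788 comes from the model card and the bottleneck 512 from the code; the `hidden_size` of 2048 is a hypothetical value chosen only for illustration, and `embedding_params` is not a function in this file.

```python
from typing import Optional


def embedding_params(vocab_size: int, hidden_size: int, bottleneck: Optional[int] = None) -> int:
    """Count weights in a dense or a factorized (bottlenecked) embedding."""
    if bottleneck is None:
        # Single lookup table of shape (vocab_size, hidden_size).
        return vocab_size * hidden_size
    # Lookup table (vocab_size, bottleneck) plus bias-free projection (bottleneck, hidden_size).
    return vocab_size * bottleneck + bottleneck * hidden_size


VOCAB_SIZE = 61788   # from the model card
BOTTLENECK = 512     # hard-coded in VulavulaLlamaModel.__init__
HIDDEN_SIZE = 2048   # hypothetical, for illustration only

dense = embedding_params(VOCAB_SIZE, HIDDEN_SIZE)
factorized = embedding_params(VOCAB_SIZE, HIDDEN_SIZE, BOTTLENECK)
print(f"dense: {dense:,}  factorized: {factorized:,}")
```

Under these assumptions the factorized form needs roughly a quarter of the weights of a dense table; the same reasoning applies to the two-layer `lm_head`.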
674
+ class VulavulaLlamaForCausalLM(VulavulaLlamaPreTrainedModel):
675
+ def __init__(self, config):
676
+ super().__init__(config)
677
+ self.model = VulavulaLlamaModel(config)
678
+
679
+ self.lm_head = nn.Sequential(nn.Linear(config.hidden_size, 512, bias=False),
680
+ nn.Linear(512, config.vocab_size, bias=False))
681
+
682
+ # Initialize weights and apply final processing
683
+ self.post_init()
684
+
685
+ def get_input_embeddings(self):
686
+ return self.model.embed_tokens
687
+
688
+ def set_input_embeddings(self, value):
689
+ self.model.embed_tokens = value
690
+
691
+ def get_output_embeddings(self):
692
+ return self.lm_head
693
+
694
+ def set_output_embeddings(self, new_embeddings):
695
+ self.lm_head = new_embeddings
696
+
697
+ def set_decoder(self, decoder):
698
+ self.model = decoder
699
+
700
+ def get_decoder(self):
701
+ return self.model
702
+
703
+ @add_start_docstrings_to_model_forward(VULAVULALLAMA_INPUTS_DOCSTRING)
704
+ @replace_return_docstrings(output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
705
+ def forward(
706
+ self,
707
+ input_ids: torch.LongTensor = None,
708
+ attention_mask: Optional[torch.Tensor] = None,
709
+ position_ids: Optional[torch.LongTensor] = None,
710
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
711
+ inputs_embeds: Optional[torch.FloatTensor] = None,
712
+ labels: Optional[torch.LongTensor] = None,
713
+ use_cache: Optional[bool] = None,
714
+ output_attentions: Optional[bool] = None,
715
+ output_hidden_states: Optional[bool] = None,
716
+ return_dict: Optional[bool] = None,
717
+ ) -> Union[Tuple, CausalLMOutputWithPast]:
718
+ r"""
719
+ Args:
720
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
721
+ Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
722
+ config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
723
+ (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
724
+
725
+ Returns:
726
+
727
+ Example:
728
+
729
+ ```python
730
+ >>> from transformers import AutoTokenizer, LlamaForCausalLM
731
+
732
+ >>> model = LlamaForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
733
+ >>> tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER)
734
+
735
+ >>> prompt = "Hey, are you conscious? Can you talk to me?"
+ >>> inputs = tokenizer(prompt, return_tensors="pt")
+
+ >>> # Generate
+ >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
+ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
+ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
742
+ ```"""
743
+
744
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
745
+ output_hidden_states = (
746
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
747
+ )
748
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
749
+
750
+ # decoder outputs consist of (dec_features, layer_state, dec_hidden, dec_attn)
751
+ outputs = self.model(
752
+ input_ids=input_ids,
753
+ attention_mask=attention_mask,
754
+ position_ids=position_ids,
755
+ past_key_values=past_key_values,
756
+ inputs_embeds=inputs_embeds,
757
+ use_cache=use_cache,
758
+ output_attentions=output_attentions,
759
+ output_hidden_states=output_hidden_states,
760
+ return_dict=return_dict,
761
+ )
762
+
763
+ hidden_states = outputs[0]
764
+ logits = self.lm_head(hidden_states)
765
+
766
+ loss = None
767
+ if labels is not None:
768
+ # Shift so that tokens < n predict n
769
+ shift_logits = logits[..., :-1, :].contiguous()
770
+ shift_labels = labels[..., 1:].contiguous()
771
+ # Flatten the tokens
772
+ loss_fct = CrossEntropyLoss()
773
+ shift_logits = shift_logits.view(-1, self.config.vocab_size)
774
+ shift_labels = shift_labels.view(-1)
775
+ # Enable model parallelism
776
+ shift_labels = shift_labels.to(shift_logits.device)
777
+ loss = loss_fct(shift_logits, shift_labels)
778
+
779
+ if not return_dict:
780
+ output = (logits,) + outputs[1:]
781
+ return (loss,) + output if loss is not None else output
782
+
783
+ return CausalLMOutputWithPast(
784
+ loss=loss,
785
+ logits=logits,
786
+ past_key_values=outputs.past_key_values,
787
+ hidden_states=outputs.hidden_states,
788
+ attentions=outputs.attentions,
789
+ )
790
+
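
The loss computation above drops the last logit position and the first label so that position `n` is scored against token `n + 1`. A minimal sketch of that pairing with plain Python lists; `next_token_pairs` is a hypothetical helper, and `-100` is the default `ignore_index` of `torch.nn.CrossEntropyLoss`.

```python
from typing import List, Tuple

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def next_token_pairs(token_ids: List[int], labels: List[int]) -> List[Tuple[int, int]]:
    """Pair each input position with the label it is trained to predict."""
    shifted_inputs = token_ids[:-1]  # positions whose logits are scored
    shifted_labels = labels[1:]      # the following token at each of those positions
    return [(t, y) for t, y in zip(shifted_inputs, shifted_labels) if y != IGNORE_INDEX]


tokens = [10, 11, 12, 13]
all_scored = next_token_pairs(tokens, tokens)
prompt_masked = next_token_pairs(tokens, [IGNORE_INDEX, IGNORE_INDEX, 12, 13])
```

Masking prompt tokens with `-100` removes those targets from the loss without changing the inputs the model sees.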
791
+ def prepare_inputs_for_generation(
792
+ self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs
793
+ ):
794
+ if past_key_values:
795
+ input_ids = input_ids[:, -1:]
796
+
797
+ position_ids = kwargs.get("position_ids", None)
798
+ if attention_mask is not None and position_ids is None:
799
+ # create position_ids on the fly for batch generation
800
+ position_ids = attention_mask.long().cumsum(-1) - 1
801
+ position_ids.masked_fill_(attention_mask == 0, 1)
802
+ if past_key_values:
803
+ position_ids = position_ids[:, -1].unsqueeze(-1)
804
+
805
+ # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
806
+ if inputs_embeds is not None and past_key_values is None:
807
+ model_inputs = {"inputs_embeds": inputs_embeds}
808
+ else:
809
+ model_inputs = {"input_ids": input_ids}
810
+
811
+ model_inputs.update(
812
+ {
813
+ "position_ids": position_ids,
814
+ "past_key_values": past_key_values,
815
+ "use_cache": kwargs.get("use_cache"),
816
+ "attention_mask": attention_mask,
817
+ }
818
+ )
819
+ return model_inputs
820
+
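
`prepare_inputs_for_generation` derives `position_ids` from the attention mask with a cumulative sum, then overwrites padded slots with a dummy value. The sketch below reproduces that arithmetic for a single row using only the standard library; `position_ids_from_mask` is an illustrative name, not part of this file.

```python
from itertools import accumulate
from typing import List


def position_ids_from_mask(attention_mask: List[int], fill: int = 1) -> List[int]:
    """Mirror `mask.cumsum(-1) - 1` followed by `masked_fill_(mask == 0, fill)`."""
    running = accumulate(attention_mask)  # cumulative count of real tokens so far
    return [count - 1 if keep else fill for count, keep in zip(running, attention_mask)]


left_padded = position_ids_from_mask([0, 0, 1, 1, 1])
right_padded = position_ids_from_mask([1, 1, 1, 0, 0])
```

Real tokens receive positions counted from zero regardless of where the padding sits; padded slots get the placeholder value.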
821
+ @staticmethod
822
+ def _reorder_cache(past_key_values, beam_idx):
823
+ reordered_past = ()
824
+ for layer_past in past_key_values:
825
+ reordered_past += (tuple(past_state.index_select(0, beam_idx) for past_state in layer_past),)
826
+ return reordered_past
827
+
828
+
829
+ @add_start_docstrings(
830
+ """
831
+ The LLaMa Model transformer with a sequence classification head on top (linear layer).
832
+
833
+ [`VulavulaLlamaForSequenceClassification`] uses the last token in order to do the classification, as other causal models
834
+ (e.g. GPT-2) do.
835
+
836
+ Since it does classification on the last token, it needs to know the position of the last token. If a
837
+ `pad_token_id` is defined in the configuration, it finds the last token that is not a padding token in each row. If
838
+ no `pad_token_id` is defined, it simply takes the last value in each row of the batch. Since it cannot guess the
839
+ padding tokens when `inputs_embeds` are passed instead of `input_ids`, it does the same (take the last value in
840
+ each row of the batch).
841
+ """,
842
+ VULAVULALLAMA_START_DOCSTRING,
843
+ )
844
+ class VulavulaLlamaForSequenceClassification(VulavulaLlamaPreTrainedModel):
845
+ _keys_to_ignore_on_load_missing = [r"lm_head.weight"]
846
+
847
+ def __init__(self, config):
848
+ super().__init__(config)
849
+ self.num_labels = config.num_labels
850
+ self.model = VulavulaLlamaModel(config)
851
+ self.score = nn.Linear(config.hidden_size, self.num_labels, bias=False)
852
+
853
+ # Initialize weights and apply final processing
854
+ self.post_init()
855
+
856
+ def get_input_embeddings(self):
857
+ return self.model.embed_tokens
858
+
859
+ def set_input_embeddings(self, value):
860
+ self.model.embed_tokens = value
861
+
862
+ @add_start_docstrings_to_model_forward(VULAVULALLAMA_INPUTS_DOCSTRING)
863
+ def forward(
864
+ self,
865
+ input_ids: torch.LongTensor = None,
866
+ attention_mask: Optional[torch.Tensor] = None,
867
+ position_ids: Optional[torch.LongTensor] = None,
868
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
869
+ inputs_embeds: Optional[torch.FloatTensor] = None,
870
+ labels: Optional[torch.LongTensor] = None,
871
+ use_cache: Optional[bool] = None,
872
+ output_attentions: Optional[bool] = None,
873
+ output_hidden_states: Optional[bool] = None,
874
+ return_dict: Optional[bool] = None,
875
+ ) -> Union[Tuple, SequenceClassifierOutputWithPast]:
876
+ r"""
877
+ labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
878
+ Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
879
+ config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (mean-squared error); if
880
+ `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
881
+ """
882
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
883
+
884
+ transformer_outputs = self.model(
885
+ input_ids,
886
+ attention_mask=attention_mask,
887
+ position_ids=position_ids,
888
+ past_key_values=past_key_values,
889
+ inputs_embeds=inputs_embeds,
890
+ use_cache=use_cache,
891
+ output_attentions=output_attentions,
892
+ output_hidden_states=output_hidden_states,
893
+ return_dict=return_dict,
894
+ )
895
+ hidden_states = transformer_outputs[0]
896
+ logits = self.score(hidden_states)
897
+
898
+ if input_ids is not None:
899
+ batch_size = input_ids.shape[0]
900
+ else:
901
+ batch_size = inputs_embeds.shape[0]
902
+
903
+ if self.config.pad_token_id is None and batch_size != 1:
904
+ raise ValueError("Cannot handle batch sizes > 1 if no padding token is defined.")
905
+ if self.config.pad_token_id is None:
906
+ sequence_lengths = -1
907
+ else:
908
+ if input_ids is not None:
909
+ sequence_lengths = (torch.ne(input_ids, self.config.pad_token_id).sum(-1) - 1).to(logits.device)
910
+ else:
911
+ sequence_lengths = -1
912
+
913
+ pooled_logits = logits[torch.arange(batch_size, device=logits.device), sequence_lengths]
914
+
915
+ loss = None
916
+ if labels is not None:
917
+ labels = labels.to(logits.device)
918
+ if self.config.problem_type is None:
919
+ if self.num_labels == 1:
920
+ self.config.problem_type = "regression"
921
+ elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
922
+ self.config.problem_type = "single_label_classification"
923
+ else:
924
+ self.config.problem_type = "multi_label_classification"
925
+
926
+ if self.config.problem_type == "regression":
927
+ loss_fct = MSELoss()
928
+ if self.num_labels == 1:
929
+ loss = loss_fct(pooled_logits.squeeze(), labels.squeeze())
930
+ else:
931
+ loss = loss_fct(pooled_logits, labels)
932
+ elif self.config.problem_type == "single_label_classification":
933
+ loss_fct = CrossEntropyLoss()
934
+ loss = loss_fct(pooled_logits.view(-1, self.num_labels), labels.view(-1))
935
+ elif self.config.problem_type == "multi_label_classification":
936
+ loss_fct = BCEWithLogitsLoss()
937
+ loss = loss_fct(pooled_logits, labels)
938
+ if not return_dict:
939
+ output = (pooled_logits,) + transformer_outputs[1:]
940
+ return ((loss,) + output) if loss is not None else output
941
+
942
+ return SequenceClassifierOutputWithPast(
943
+ loss=loss,
944
+ logits=pooled_logits,
945
+ past_key_values=transformer_outputs.past_key_values,
946
+ hidden_states=transformer_outputs.hidden_states,
947
+ attentions=transformer_outputs.attentions,
948
+ )
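
The pooled index in the classification head is computed by counting non-padding tokens, which implicitly assumes right padding. A small sketch of that computation in plain Python, with a hypothetical helper name `last_token_index`:

```python
from typing import List


def last_token_index(input_ids: List[int], pad_token_id: int) -> int:
    """Mirror `torch.ne(input_ids, pad_token_id).sum(-1) - 1` for one row."""
    return sum(1 for token in input_ids if token != pad_token_id) - 1


PAD = 0  # hypothetical padding id, for illustration only

right_padded = [5, 6, 7, PAD, PAD]
left_padded = [PAD, PAD, 5, 6, 7]

idx_right = last_token_index(right_padded, PAD)
idx_left = last_token_index(left_padded, PAD)
```

For the right-padded row the index selects the final real token. For the left-padded row the same count points at the first real token, so inputs to this head are best right-padded.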