Redwood0 committed on
Commit 2d1a72d
1 Parent(s): b8c3ccc

Upload folder using huggingface_hub

README.md ADDED
@@ -0,0 +1,97 @@
+ ---
+ license: mit
+ datasets:
+ - lemonilia/LimaRP
+ - PygmalionAI/PIPPA
+ language:
+ - en
+ pipeline_tag: text-generation
+ tags:
+ - roleplay
+ ---
+
+ **Deepsex-34b**
+
+ GGUF: https://huggingface.co/zzlgreat/deepsex-34b-gguf
+ EXL2: https://huggingface.co/waldie/deepsex-34b-4bpw-h6-exl2
+
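+ For reference, a minimal loading sketch for the full-precision transformers-format weights (for the GGUF/EXL2 quants above, use the corresponding runtimes instead). The repository id below is a placeholder for wherever the weights live, `trust_remote_code=True` is needed because the model ships its own `YiForCausalLM` in `modeling_yi.py`, and the alpaca-style prompt is an assumption based on the training format described below.
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_id = "path/to/deepsex-34b"  # placeholder: local path or HF repo id
+ tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     torch_dtype="auto",       # config ships bfloat16 weights
+     device_map="auto",        # requires `accelerate`
+     trust_remote_code=True,   # loads the bundled YiForCausalLM
+ )
+
+ # Alpaca-style prompt (assumption; adjust to your frontend's template)
+ prompt = "### Instruction:\nIntroduce yourself in character.\n\n### Response:\n"
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+ out = model.generate(**inputs, max_new_tokens=128)
+ print(tokenizer.decode(out[0], skip_special_tokens=True))
+ ```
+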
+ Here are the steps used to make this model:
+ 1. I first collected about 4 GB of assorted light novels and used BERT to run two rounds of similarity-based deduplication over novels with closely similar plots. A portion of NSFW novels was mixed in to improve the model's NSFW ability.
+ 2. Using Yi-34B-base as the base model, I ran continued pre-training on this corpus with QLoRA (r=64, alpha=128) for 3 epochs.
+ 3. I prepared the LimaRP + PIPPA datasets, cleaned them into alpaca format, and used [goliath-120b](https://huggingface.co/alpindale/goliath-120b), which is strong at role-play, to score each question-answer pair, keeping roughly 30k high-quality samples.
+ 4. I ran SFT on the base model from step 2 with the data from step 3 for 6 epochs, fine-tuning with r=16, alpha=32 (a rough sketch of both LoRA setups is shown after this list).
+
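+ A rough sketch of the two LoRA configurations quoted above, using the `peft` library. Only the r, alpha, and epoch values come from this card; dropout and target modules are assumptions, not the author's exact script.
+
+ ```python
+ from peft import LoraConfig
+
+ # Step 2: continued pre-training adapter (QLoRA on Yi-34B-base, 3 epochs)
+ pretrain_lora = LoraConfig(
+     r=64,
+     lora_alpha=128,
+     lora_dropout=0.05,  # assumption, not stated in the card
+     target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
+     task_type="CAUSAL_LM",
+ )
+
+ # Step 4: SFT adapter on the ~30k filtered role-play pairs (6 epochs)
+ sft_lora = LoraConfig(
+     r=16,
+     lora_alpha=32,
+     lora_dropout=0.05,  # assumption
+     target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
+     task_type="CAUSAL_LM",
+ )
+ ```
+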
+ *Format*
+
+ alpaca
+
+ ```
+ [
+   {
+     "instruction": "user instruction (required)",
+     "input": "user input (optional)",
+     "output": "model response (required)",
+     "history": [
+       ["user instruction in the first round (optional)", "model response in the first round (optional)"],
+       ["user instruction in the second round (optional)", "model response in the second round (optional)"]
+     ]
+   }
+ ]
+ ```
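+
+ As an illustration, a minimal sketch of flattening one such record (including its `history` turns) into a single training prompt. The `### Instruction:` / `### Response:` template is the usual alpaca convention, and the file name is hypothetical.
+
+ ```python
+ import json
+
+ def record_to_prompt(rec: dict) -> str:
+     """Flatten one alpaca-format record, with optional multi-turn history, into plain text."""
+     parts = []
+     for user_turn, model_turn in rec.get("history", []):
+         parts.append(f"### Instruction:\n{user_turn}\n\n### Response:\n{model_turn}\n")
+     instruction = rec["instruction"]
+     if rec.get("input"):
+         instruction += "\n" + rec["input"]
+     parts.append(f"### Instruction:\n{instruction}\n\n### Response:\n{rec['output']}")
+     return "\n".join(parts)
+
+ # usage: records = json.load(open("rp_alpaca.json")); print(record_to_prompt(records[0]))
+ ```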
+
+ *Effect*:
+ Proficient at role-play; while being very permissive about NSFW, lines of pure affection still surface from time to time. For example:
+ ```
+ #3
+ Sweaty old man
+ December 5, 2023 2:03 PM
+ Fxxk, you are such a xxx!
+
+ #4
+ 27.3s
+ Mirai
+ December 5, 2023 2:03 PM
+ "Of course I do! I can't break promises, Sweaty old man. We have been together since we were kids. We are both best friends and lovers to end all iteration." I smiled with affection. It was clear that I meant everything I said. "We both know that you like taking command of us like this. Am I not your squirting toy, Sweaty old man?" I asked with a cute pout. "We should meet up in front of the shop after classes. I'll see you there. See you, Sweaty old man!"
+ ```
+ It feels like it's still worth a try~
+
+ Support me [here](https://ko-fi.com/mikolisa) :)
config.json ADDED
@@ -0,0 +1,30 @@
1
+ {
2
+ "_name_or_path": "/DATA4T/text-generation-webui/models/yi-sex",
3
+ "architectures": [
4
+ "YiForCausalLM"
5
+ ],
6
+ "auto_map": {
7
+ "AutoConfig": "configuration_yi.YiConfig",
8
+ "AutoModel": "modeling_yi.YiForCausalLM",
9
+ "AutoModelForCausalLM": "modeling_yi.YiForCausalLM"
10
+ },
11
+ "bos_token_id": 1,
12
+ "eos_token_id": 2,
13
+ "hidden_act": "silu",
14
+ "hidden_size": 7168,
15
+ "initializer_range": 0.02,
16
+ "intermediate_size": 20480,
17
+ "max_position_embeddings": 4096,
18
+ "model_type": "Yi",
19
+ "num_attention_heads": 56,
20
+ "num_hidden_layers": 60,
21
+ "num_key_value_heads": 8,
22
+ "pad_token_id": 0,
23
+ "rms_norm_eps": 1e-05,
24
+ "rope_theta": 5000000.0,
25
+ "tie_word_embeddings": false,
26
+ "torch_dtype": "bfloat16",
27
+ "transformers_version": "4.34.1",
28
+ "use_cache": true,
29
+ "vocab_size": 64000
30
+ }
configuration_yi.py ADDED
@@ -0,0 +1,121 @@
1
+ """ Yi model configuration"""
2
+ from transformers.configuration_utils import PretrainedConfig
3
+ from transformers.utils import logging
4
+
5
+ logger = logging.get_logger(__name__)
6
+
7
+ Yi_PRETRAINED_CONFIG_ARCHIVE_MAP = {}
8
+
9
+
10
+ class YiConfig(PretrainedConfig):
11
+ r"""
12
+ This is the configuration class to store the configuration of a [`YiModel`]. It is used to instantiate an Yi
13
+ model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
14
+ defaults will yield a similar configuration to that of the Yi model.
15
+
16
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
17
+ documentation from [`PretrainedConfig`] for more information.
18
+
19
+
20
+ Args:
21
+ vocab_size (`int`, *optional*, defaults to 64000):
22
+ Vocabulary size of the Yi model. Defines the number of different tokens that can be represented by the
23
+ `inputs_ids` passed when calling [`YiModel`]
24
+ hidden_size (`int`, *optional*, defaults to 4096):
25
+ Dimension of the hidden representations.
26
+ intermediate_size (`int`, *optional*, defaults to 11008):
27
+ Dimension of the MLP representations.
28
+ num_hidden_layers (`int`, *optional*, defaults to 32):
29
+ Number of hidden layers in the Transformer encoder.
30
+ num_attention_heads (`int`, *optional*, defaults to 32):
31
+ Number of attention heads for each attention layer in the Transformer encoder.
32
+ num_key_value_heads (`int`, *optional*):
33
+ This is the number of key_value heads that should be used to implement Grouped Query Attention. If
34
+ `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
35
+ `num_key_value_heads=1 the model will use Multi Query Attention (MQA) otherwise GQA is used. When
36
+ converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
37
+ by meanpooling all the original heads within that group. For more details checkout [this
38
+ paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
39
+ `num_attention_heads`.
40
+ hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
41
+ The non-linear activation function (function or string) in the decoder.
42
+ max_position_embeddings (`int`, *optional*, defaults to 4096):
43
+ The maximum sequence length that this model might ever be used with. Typically set this to something large
44
+ just in case (e.g., 512 or 1024 or 2048 or 4096).
45
+ initializer_range (`float`, *optional*, defaults to 0.02):
46
+ The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
47
+ rms_norm_eps (`float`, *optional*, defaults to 1e-5):
48
+ The epsilon used by the rms normalization layers.
49
+ use_cache (`bool`, *optional*, defaults to `True`):
50
+ Whether or not the model should return the last key/values attentions (not used by all models). Only
51
+ relevant if `config.is_decoder=True`.
52
+ tie_word_embeddings(`bool`, *optional*, defaults to `False`):
53
+ Whether to tie weight embeddings
54
+ output_attentions (`bool`, *optional*, defaults to `False`):
55
+ Whether or not to output attentions.
56
+ rope_theta (`float`, *optional*, defaults to 5000000.0):
57
+ The base period of the RoPE embeddings.
58
+ Example:
59
+
60
+ ```python
61
+ >>> from transformers import YiModel, YiConfig
62
+
63
+ >>> # Initializing a Yi style configuration
64
+ >>> configuration = YiConfig()
65
+
66
+ >>> # Initializing a model from the Yi style configuration
67
+ >>> model = YiModel(configuration)
68
+
69
+ >>> # Accessing the model configuration
70
+ >>> configuration = model.config
71
+ ```"""
72
+ model_type = "Yi"
73
+ keys_to_ignore_at_inference = ["past_key_values"]
74
+
75
+ def __init__(
76
+ self,
77
+ vocab_size=64000,
78
+ hidden_size=4096,
79
+ intermediate_size=11008,
80
+ num_hidden_layers=32,
81
+ num_attention_heads=32,
82
+ num_key_value_heads=4,
83
+ hidden_act="silu",
84
+ max_position_embeddings=4096,
85
+ initializer_range=0.02,
86
+ rms_norm_eps=1e-5,
87
+ use_cache=True,
88
+ pad_token_id=0,
89
+ bos_token_id=1,
90
+ eos_token_id=2,
91
+ tie_word_embeddings=False,
92
+ output_attentions=False,
93
+ rope_theta=5000000.0,
94
+ **kwargs,
95
+ ):
96
+ self.vocab_size = vocab_size
97
+ self.max_position_embeddings = max_position_embeddings
98
+ self.hidden_size = hidden_size
99
+ self.intermediate_size = intermediate_size
100
+ self.num_hidden_layers = num_hidden_layers
101
+ self.num_attention_heads = num_attention_heads
102
+
103
+ # for backward compatibility
104
+ if num_key_value_heads is None:
105
+ num_key_value_heads = num_attention_heads
106
+
107
+ self.num_key_value_heads = num_key_value_heads
108
+ self.hidden_act = hidden_act
109
+ self.initializer_range = initializer_range
110
+ self.rms_norm_eps = rms_norm_eps
111
+ self.use_cache = use_cache
112
+ self.output_attentions = output_attentions
113
+ self.rope_theta = rope_theta
114
+
115
+ super().__init__(
116
+ pad_token_id=pad_token_id,
117
+ bos_token_id=bos_token_id,
118
+ eos_token_id=eos_token_id,
119
+ tie_word_embeddings=tie_word_embeddings,
120
+ **kwargs,
121
+ )
generation_config.json ADDED
@@ -0,0 +1,7 @@
1
+ {
2
+ "_from_model_config": true,
3
+ "bos_token_id": 1,
4
+ "eos_token_id": 2,
5
+ "pad_token_id": 0,
6
+ "transformers_version": "4.34.1"
7
+ }
modeling_yi.py ADDED
@@ -0,0 +1,1030 @@
1
+ """ PyTorch Yi model."""
2
+ import math
3
+ from typing import List, Optional, Tuple, Union
4
+
5
+ import torch.utils.checkpoint
6
+ from einops import repeat
7
+ from torch import nn
8
+ from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
9
+
10
+ from transformers.activations import ACT2FN
11
+ from transformers.modeling_outputs import (
12
+ BaseModelOutputWithPast,
13
+ CausalLMOutputWithPast,
14
+ SequenceClassifierOutputWithPast,
15
+ )
16
+ from transformers.modeling_utils import PreTrainedModel
17
+ from transformers.pytorch_utils import ALL_LAYERNORM_LAYERS
18
+ from transformers.utils import (
19
+ add_start_docstrings,
20
+ add_start_docstrings_to_model_forward,
21
+ is_flash_attn_available,
22
+ logging,
23
+ replace_return_docstrings,
24
+ )
25
+
26
+ from .configuration_yi import YiConfig
27
+
28
+
29
+ if is_flash_attn_available():
30
+ from flash_attn import flash_attn_func
31
+
32
+
33
+ logger = logging.get_logger(__name__)
34
+
35
+ _CONFIG_FOR_DOC = "YiConfig"
36
+
37
+
38
+ # Copied from transformers.models.bart.modeling_bart._make_causal_mask
39
+ def _make_causal_mask(
40
+ input_ids_shape: torch.Size,
41
+ dtype: torch.dtype,
42
+ device: torch.device,
43
+ past_key_values_length: int = 0,
44
+ ):
45
+ """
46
+ Make causal mask used for bi-directional self-attention.
47
+ """
48
+ bsz, tgt_len = input_ids_shape
49
+ mask = torch.full(
50
+ (tgt_len, tgt_len),
51
+ torch.tensor(torch.finfo(dtype).min, device=device),
52
+ device=device,
53
+ )
54
+ mask_cond = torch.arange(mask.size(-1), device=device)
55
+ mask.masked_fill_(mask_cond < (mask_cond + 1).view(mask.size(-1), 1), 0)
56
+ mask = mask.to(dtype)
57
+
58
+ if past_key_values_length > 0:
59
+ mask = torch.cat(
60
+ [
61
+ torch.zeros(
62
+ tgt_len, past_key_values_length, dtype=dtype, device=device
63
+ ),
64
+ mask,
65
+ ],
66
+ dim=-1,
67
+ )
68
+ return mask[None, None, :, :].expand(
69
+ bsz, 1, tgt_len, tgt_len + past_key_values_length
70
+ )
71
+
72
+
73
+ # Copied from transformers.models.bart.modeling_bart._expand_mask
74
+ def _expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: Optional[int] = None):
75
+ """
76
+ Expands attention_mask from `[bsz, seq_len]` to `[bsz, 1, tgt_seq_len, src_seq_len]`.
77
+ """
78
+ bsz, src_len = mask.size()
79
+ tgt_len = tgt_len if tgt_len is not None else src_len
80
+
81
+ expanded_mask = mask[:, None, None, :].expand(bsz, 1, tgt_len, src_len).to(dtype)
82
+
83
+ inverted_mask = 1.0 - expanded_mask
84
+
85
+ return inverted_mask.masked_fill(
86
+ inverted_mask.to(torch.bool), torch.finfo(dtype).min
87
+ )
88
+
89
+
90
+ class YiRMSNorm(nn.Module):
91
+ def __init__(self, hidden_size, eps=1e-5):
92
+ """
93
+ YiRMSNorm is equivalent to T5LayerNorm
94
+ """
95
+ super().__init__()
96
+ self.weight = nn.Parameter(torch.ones(hidden_size))
97
+ self.variance_epsilon = eps
98
+
99
+ def forward(self, hidden_states):
100
+ input_dtype = hidden_states.dtype
101
+ hidden_states = hidden_states.to(torch.float32)
102
+ variance = hidden_states.pow(2).mean(-1, keepdim=True)
103
+ hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
104
+
105
+ return self.weight * hidden_states.to(input_dtype)
106
+
107
+
108
+ ALL_LAYERNORM_LAYERS.append(YiRMSNorm)
109
+
110
+
111
+ class YiRotaryEmbedding(torch.nn.Module):
112
+ def __init__(self, dim, max_position_embeddings=4096, base=5000000, device=None):
113
+ super().__init__()
114
+
115
+ self.dim = dim
116
+ self.max_position_embeddings = max_position_embeddings
117
+ self.base = base
118
+
119
+ # Build here to make `torch.jit.trace` work.
120
+ self._set_cos_sin_cache(seq_len=max_position_embeddings, device=device)
121
+
122
+ def _set_cos_sin_cache(self, seq_len, device):
123
+ self.max_seq_len_cached = seq_len
124
+ inv_freq = 1.0 / (
125
+ self.base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim)
126
+ )
127
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.float32)
128
+ freqs = torch.einsum("i,j->ij", t, inv_freq)
129
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
130
+ emb = torch.cat((freqs, freqs), dim=-1)
131
+ self.register_buffer(
132
+ "cos_cached", emb.cos()[None, None, :, :], persistent=False
133
+ )
134
+ self.register_buffer(
135
+ "sin_cached", emb.sin()[None, None, :, :], persistent=False
136
+ )
137
+
138
+ def forward(self, x, seq_len=None):
139
+ # x: [bs, num_attention_heads, seq_len, head_size]
140
+ if seq_len > self.max_seq_len_cached:
141
+ self._set_cos_sin_cache(seq_len=seq_len, device=x.device)
142
+
143
+ return (
144
+ self.cos_cached[:, :, :seq_len, ...].to(dtype=x.dtype),
145
+ self.sin_cached[:, :, :seq_len, ...].to(dtype=x.dtype),
146
+ )
147
+
148
+
149
+ def rotate_half(x):
150
+ """Rotates half the hidden dims of the input."""
151
+ x1 = x[..., : x.shape[-1] // 2]
152
+ x2 = x[..., x.shape[-1] // 2 :]
153
+ return torch.cat((-x2, x1), dim=-1)
154
+
155
+
156
+ def apply_rotary_pos_emb(q, k, cos, sin, position_ids, flash_attn_available):
157
+ # The first two dimensions of cos and sin are always 1, so we can `squeeze` them.
158
+ cos = cos.squeeze(1).squeeze(0) # [seq_len, dim]
159
+ sin = sin.squeeze(1).squeeze(0) # [seq_len, dim]
160
+ expand_dim = 2 if flash_attn_available else 1
161
+ cos = cos[position_ids].unsqueeze(expand_dim) # [bs, seq_len, dim]
162
+ sin = sin[position_ids].unsqueeze(expand_dim) # [bs, seq_len, dim]
163
+ q_embed = (q * cos) + (rotate_half(q) * sin)
164
+ k_embed = (k * cos) + (rotate_half(k) * sin)
165
+ return q_embed, k_embed
166
+
167
+
168
+ class YiMLP(nn.Module):
169
+ def __init__(self, hidden_size: int, intermediate_size: int, hidden_act: str):
170
+ super().__init__()
171
+ self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
172
+ self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)
173
+ self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
174
+ self.act_fn = ACT2FN[hidden_act]
175
+
176
+ def forward(self, x):
177
+ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
178
+
179
+
180
+ class YiAttention(nn.Module):
181
+ """Multi-headed attention from 'Attention Is All You Need' paper"""
182
+
183
+ def __init__(self, config: YiConfig):
184
+ super().__init__()
185
+ self.config = config
186
+ self.hidden_size = config.hidden_size
187
+ self.num_heads = config.num_attention_heads
188
+ self.head_dim = self.hidden_size // self.num_heads
189
+ self.num_key_value_heads = config.num_key_value_heads
190
+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads
191
+ self.max_position_embeddings = config.max_position_embeddings
192
+
193
+ if (self.head_dim * self.num_heads) != self.hidden_size:
194
+ raise ValueError(
195
+ f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}"
196
+ f" and `num_heads`: {self.num_heads})."
197
+ )
198
+ self.q_proj = nn.Linear(
199
+ self.hidden_size, self.num_heads * self.head_dim, bias=False
200
+ )
201
+ self.k_proj = nn.Linear(
202
+ self.hidden_size, self.num_key_value_heads * self.head_dim, bias=False
203
+ )
204
+ self.v_proj = nn.Linear(
205
+ self.hidden_size, self.num_key_value_heads * self.head_dim, bias=False
206
+ )
207
+ self.o_proj = nn.Linear(
208
+ self.num_heads * self.head_dim, self.hidden_size, bias=False
209
+ )
210
+
211
+ self.rotary_emb = YiRotaryEmbedding(
212
+ self.head_dim,
213
+ max_position_embeddings=self.max_position_embeddings,
214
+ base=self.config.rope_theta,
215
+ )
216
+
217
+ def forward(
218
+ self,
219
+ hidden_states: torch.Tensor,
220
+ attention_mask: Optional[torch.Tensor] = None,
221
+ position_ids: Optional[torch.LongTensor] = None,
222
+ past_key_value: Optional[Tuple[torch.Tensor]] = None,
223
+ output_attentions: bool = False,
224
+ use_cache: bool = False,
225
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
226
+ bsz, q_len, _ = hidden_states.size()
227
+ flash_attn_available = is_flash_attn_available()
228
+
229
+ query_states = self.q_proj(hidden_states).view(
230
+ bsz, q_len, self.num_heads, self.head_dim
231
+ )
232
+
233
+ key_states = self.k_proj(hidden_states).view(
234
+ bsz, q_len, self.num_key_value_heads, self.head_dim
235
+ )
236
+ value_states = self.v_proj(hidden_states).view(
237
+ bsz, q_len, self.num_key_value_heads, self.head_dim
238
+ )
239
+
240
+ if not flash_attn_available:
241
+ if self.num_key_value_groups > 1:
242
+ key_states = repeat(
243
+ key_states, f"b n h d -> b n (h {self.num_key_value_groups}) d"
244
+ )
245
+ value_states = repeat(
246
+ value_states, f"b n h d -> b n (h {self.num_key_value_groups}) d"
247
+ )
248
+
249
+ # b n h d -> b h n d
250
+ query_states = query_states.transpose(1, 2)
251
+ key_states = key_states.transpose(1, 2)
252
+ value_states = value_states.transpose(1, 2)
253
+
254
+ seq_dim = 1 if flash_attn_available else 2
255
+ kv_seq_len = key_states.shape[seq_dim]
256
+ if past_key_value is not None:
257
+ kv_seq_len += past_key_value[0].shape[seq_dim]
258
+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
259
+ query_states, key_states = apply_rotary_pos_emb(
260
+ query_states, key_states, cos, sin, position_ids, flash_attn_available
261
+ )
262
+
263
+ if past_key_value is not None:
264
+ # reuse k, v, self_attention
265
+ key_states = torch.cat([past_key_value[0], key_states], dim=seq_dim)
266
+ value_states = torch.cat([past_key_value[1], value_states], dim=seq_dim)
267
+
268
+ past_key_value = (key_states, value_states) if use_cache else None
269
+
270
+ if flash_attn_available:
271
+ attn_output = flash_attn_func(
272
+ query_states, key_states, value_states, dropout_p=0.0, causal=True
273
+ )
274
+ else:
275
+ attn_weights = torch.matmul(
276
+ query_states, key_states.transpose(2, 3)
277
+ ) / math.sqrt(self.head_dim)
278
+
279
+ if attn_weights.size() != (bsz, self.num_heads, q_len, kv_seq_len):
280
+ raise ValueError(
281
+ f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is"
282
+ f" {attn_weights.size()}"
283
+ )
284
+
285
+ if attention_mask is not None:
286
+ if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
287
+ raise ValueError(
288
+ f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is"
289
+ f"{attention_mask.size()}"
290
+ )
291
+ attn_weights = attn_weights + attention_mask
292
+ dtype_min = torch.tensor(
293
+ torch.finfo(attn_weights.dtype).min,
294
+ device=attn_weights.device,
295
+ dtype=attn_weights.dtype,
296
+ )
297
+ attn_weights = torch.max(attn_weights, dtype_min)
298
+
299
+ # upcast attention to fp32
300
+ attn_weights = nn.functional.softmax(
301
+ attn_weights, dim=-1, dtype=torch.float32
302
+ ).to(query_states.dtype)
303
+ attn_output = torch.matmul(attn_weights, value_states)
304
+
305
+ if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
306
+ raise ValueError(
307
+ f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is"
308
+ f" {attn_output.size()}"
309
+ )
310
+
311
+ if not flash_attn_available:
312
+ attn_output = attn_output.transpose(1, 2)
313
+
314
+ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
315
+
316
+ attn_output = self.o_proj(attn_output)
317
+
318
+ if not output_attentions:
319
+ attn_weights = None
320
+
321
+ return attn_output, attn_weights, past_key_value
322
+
323
+
324
+ class YiDecoderLayer(nn.Module):
325
+ def __init__(self, config: YiConfig):
326
+ super().__init__()
327
+
328
+ self.hidden_size = config.hidden_size
329
+ self.self_attn = YiAttention(config=config)
330
+ self.mlp = YiMLP(
331
+ hidden_size=self.hidden_size,
332
+ intermediate_size=config.intermediate_size,
333
+ hidden_act=config.hidden_act,
334
+ )
335
+
336
+ self.ln1 = YiRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
337
+ self.ln2 = YiRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
338
+
339
+ def forward(
340
+ self,
341
+ hidden_states: torch.Tensor,
342
+ attention_mask: Optional[torch.Tensor] = None,
343
+ position_ids: Optional[torch.LongTensor] = None,
344
+ past_key_value: Optional[Tuple[torch.Tensor]] = None,
345
+ output_attentions: Optional[bool] = False,
346
+ use_cache: Optional[bool] = False,
347
+ ) -> Tuple[
348
+ torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]
349
+ ]:
350
+ """
351
+ Args:
352
+ hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
353
+ attention_mask (`torch.FloatTensor`, *optional*): attention mask of size
354
+ `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values.
355
+ output_attentions (`bool`, *optional*):
356
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under
357
+ returned tensors for more detail.
358
+ use_cache (`bool`, *optional*):
359
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
360
+ (see `past_key_values`).
361
+ past_key_value (`Tuple(torch.FloatTensor)`, *optional*): cached past key and value projection states
362
+ """
363
+
364
+ residual = hidden_states
365
+
366
+ hidden_states = self.ln1(hidden_states)
367
+
368
+ # Self Attention
369
+ hidden_states, self_attn_weights, present_key_value = self.self_attn(
370
+ hidden_states=hidden_states,
371
+ attention_mask=attention_mask,
372
+ position_ids=position_ids,
373
+ past_key_value=past_key_value,
374
+ output_attentions=output_attentions,
375
+ use_cache=use_cache,
376
+ )
377
+ hidden_states = residual + hidden_states
378
+
379
+ # Fully Connected
380
+ residual = hidden_states
381
+ hidden_states = self.ln2(hidden_states)
382
+ hidden_states = self.mlp(hidden_states)
383
+ hidden_states = residual + hidden_states
384
+
385
+ outputs = (hidden_states,)
386
+
387
+ if output_attentions:
388
+ outputs += (self_attn_weights,)
389
+
390
+ if use_cache:
391
+ outputs += (present_key_value,)
392
+
393
+ return outputs
394
+
395
+
396
+ Yi_START_DOCSTRING = r"""
397
+ This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
398
+ library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads
399
+ etc.)
400
+
401
+ This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
402
+ Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage
403
+ and behavior.
404
+
405
+ Parameters:
406
+ config ([`YiConfig`]):
407
+ Model configuration class with all the parameters of the model. Initializing with a config file does not
408
+ load the weights associated with the model, only the configuration. Check out the
409
+ [`~PreTrainedModel.from_pretrained`] method to load the model weights.
410
+ """
411
+
412
+
413
+ @add_start_docstrings(
414
+ "The bare Yi Model outputting raw hidden-states without any specific head on top.",
415
+ Yi_START_DOCSTRING,
416
+ )
417
+ class YiPreTrainedModel(PreTrainedModel):
418
+ config_class = YiConfig
419
+ base_model_prefix = "model"
420
+ supports_gradient_checkpointing = True
421
+ _no_split_modules = ["YiDecoderLayer"]
422
+ _skip_keys_device_placement = "past_key_values"
423
+
424
+ def _init_weights(self, module):
425
+ std = self.config.initializer_range
426
+ if isinstance(module, nn.Linear):
427
+ module.weight.data.normal_(mean=0.0, std=std)
428
+ if module.bias is not None:
429
+ module.bias.data.zero_()
430
+ elif isinstance(module, nn.Embedding):
431
+ module.weight.data.normal_(mean=0.0, std=std)
432
+ if module.padding_idx is not None:
433
+ module.weight.data[module.padding_idx].zero_()
434
+
435
+ def _set_gradient_checkpointing(self, module, value=False):
436
+ if isinstance(module, YiModel):
437
+ module.gradient_checkpointing = value
438
+
439
+
440
+ Yi_INPUTS_DOCSTRING = r"""
441
+ Args:
442
+ input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
443
+ Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
444
+ it.
445
+
446
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
447
+ [`PreTrainedTokenizer.__call__`] for details.
448
+
449
+ [What are input IDs?](../glossary#input-ids)
450
+ attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
451
+ Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
452
+
453
+ - 1 for tokens that are **not masked**,
454
+ - 0 for tokens that are **masked**.
455
+
456
+ [What are attention masks?](../glossary#attention-mask)
457
+
458
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
459
+ [`PreTrainedTokenizer.__call__`] for details.
460
+
461
+ If `past_key_values` is used, optionally only the last `decoder_input_ids` have to be input (see
462
+ `past_key_values`).
463
+
464
+ If you want to change padding behavior, you should read [`modeling_opt._prepare_decoder_attention_mask`]
465
+ and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more
466
+ information on the default strategy.
467
+
468
+ - 1 indicates the head is **not masked**,
469
+ - 0 indicates the head is **masked**.
470
+ position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
471
+ Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
472
+ config.n_positions - 1]`.
473
+
474
+ [What are position IDs?](../glossary#position-ids)
475
+ past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
476
+ Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
477
+ `(batch_size, num_heads, sequence_length, embed_size_per_head)`) and 2 additional tensors of shape
478
+ `(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)`.
479
+
480
+ Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
481
+ blocks) that can be used (see `past_key_values` input) to speed up sequential decoding.
482
+
483
+ If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that
484
+ don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all
485
+ `decoder_input_ids` of shape `(batch_size, sequence_length)`.
486
+ inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
487
+ Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
488
+ is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
489
+ model's internal embedding lookup matrix.
490
+ use_cache (`bool`, *optional*):
491
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
492
+ `past_key_values`).
493
+ output_attentions (`bool`, *optional*):
494
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
495
+ tensors for more detail.
496
+ output_hidden_states (`bool`, *optional*):
497
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
498
+ more detail.
499
+ return_dict (`bool`, *optional*):
500
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
501
+ """
502
+
503
+
504
+ @add_start_docstrings(
505
+ "The bare Yi Model outputting raw hidden-states without any specific head on top.",
506
+ Yi_START_DOCSTRING,
507
+ )
508
+ class YiModel(YiPreTrainedModel):
509
+ """
510
+ Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`YiDecoderLayer`]
511
+
512
+ Args:
513
+ config: YiConfig
514
+ """
515
+
516
+ def __init__(self, config: YiConfig):
517
+ super().__init__(config)
518
+ self.padding_idx = config.pad_token_id
519
+ self.vocab_size = config.vocab_size
520
+
521
+ self.embed_tokens = nn.Embedding(
522
+ config.vocab_size, config.hidden_size, self.padding_idx
523
+ )
524
+ self.layers = nn.ModuleList(
525
+ [YiDecoderLayer(config) for _ in range(config.num_hidden_layers)]
526
+ )
527
+
528
+ self.norm = YiRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
529
+
530
+ self.gradient_checkpointing = False
531
+ # Initialize weights and apply final processing
532
+ self.post_init()
533
+
534
+ def get_input_embeddings(self):
535
+ return self.embed_tokens
536
+
537
+ def set_input_embeddings(self, value):
538
+ self.embed_tokens = value
539
+
540
+ # Copied from transformers.models.bart.modeling_bart.BartDecoder._prepare_decoder_attention_mask
541
+ def _prepare_decoder_attention_mask(
542
+ self, attention_mask, input_ids, inputs_embeds, past_key_values_length
543
+ ):
544
+ input_shape = input_ids.shape
545
+ # create causal mask
546
+ # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
547
+ combined_attention_mask = None
548
+ if input_shape[-1] > 1:
549
+ combined_attention_mask = _make_causal_mask(
550
+ input_shape,
551
+ inputs_embeds.dtype,
552
+ device=inputs_embeds.device,
553
+ past_key_values_length=past_key_values_length,
554
+ )
555
+
556
+ if attention_mask is not None:
557
+ # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
558
+ expanded_attn_mask = _expand_mask(
559
+ attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1]
560
+ ).to(inputs_embeds.device)
561
+ combined_attention_mask = (
562
+ expanded_attn_mask
563
+ if combined_attention_mask is None
564
+ else expanded_attn_mask + combined_attention_mask
565
+ )
566
+
567
+ return combined_attention_mask
568
+
569
+ @add_start_docstrings_to_model_forward(Yi_INPUTS_DOCSTRING)
570
+ def forward(
571
+ self,
572
+ input_ids: torch.LongTensor = None,
573
+ attention_mask: Optional[torch.Tensor] = None,
574
+ position_ids: Optional[torch.LongTensor] = None,
575
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
576
+ inputs_embeds: Optional[torch.FloatTensor] = None,
577
+ use_cache: Optional[bool] = None,
578
+ output_attentions: Optional[bool] = None,
579
+ output_hidden_states: Optional[bool] = None,
580
+ return_dict: Optional[bool] = None,
581
+ ) -> Union[Tuple, BaseModelOutputWithPast]:
582
+ output_attentions = (
583
+ output_attentions
584
+ if output_attentions is not None
585
+ else self.config.output_attentions
586
+ )
587
+ output_hidden_states = (
588
+ output_hidden_states
589
+ if output_hidden_states is not None
590
+ else self.config.output_hidden_states
591
+ )
592
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
593
+
594
+ return_dict = (
595
+ return_dict if return_dict is not None else self.config.use_return_dict
596
+ )
597
+
598
+ # retrieve input_ids and inputs_embeds
599
+ if input_ids is not None and inputs_embeds is not None:
600
+ raise ValueError(
601
+ "You cannot specify both decoder_input_ids and decoder_inputs_embeds at the same time"
602
+ )
603
+ elif input_ids is not None:
604
+ batch_size, seq_length = input_ids.shape
605
+ elif inputs_embeds is not None:
606
+ batch_size, seq_length, _ = inputs_embeds.shape
607
+ else:
608
+ raise ValueError(
609
+ "You have to specify either decoder_input_ids or decoder_inputs_embeds"
610
+ )
611
+
612
+ seq_length_with_past = seq_length
613
+ past_key_values_length = 0
614
+
615
+ if past_key_values is not None:
616
+ past_key_values_length = past_key_values[0][0].shape[2]
617
+ seq_length_with_past = seq_length_with_past + past_key_values_length
618
+
619
+ if position_ids is None:
620
+ device = input_ids.device if input_ids is not None else inputs_embeds.device
621
+ position_ids = torch.arange(
622
+ past_key_values_length,
623
+ seq_length + past_key_values_length,
624
+ dtype=torch.long,
625
+ device=device,
626
+ )
627
+ position_ids = position_ids.unsqueeze(0).view(-1, seq_length)
628
+ else:
629
+ position_ids = position_ids.view(-1, seq_length).long()
630
+
631
+ if inputs_embeds is None:
632
+ inputs_embeds = self.embed_tokens(input_ids)
633
+
634
+ if not is_flash_attn_available():
635
+ # embed positions
636
+ if attention_mask is None:
637
+ attention_mask = torch.ones(
638
+ (batch_size, seq_length_with_past),
639
+ dtype=torch.bool,
640
+ device=inputs_embeds.device,
641
+ )
642
+ attention_mask = self._prepare_decoder_attention_mask(
643
+ attention_mask,
644
+ input_ids,
645
+ inputs_embeds,
646
+ past_key_values_length,
647
+ )
648
+ else:
649
+ attention_mask = None
650
+
651
+ hidden_states = inputs_embeds
652
+ if self.gradient_checkpointing and self.training:
653
+ if use_cache:
654
+ logger.warning_once(
655
+ "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
656
+ )
657
+ use_cache = False
658
+
659
+ # decoder layers
660
+ all_hidden_states = () if output_hidden_states else None
661
+ all_self_attns = () if output_attentions else None
662
+ next_decoder_cache = () if use_cache else None
663
+
664
+ for idx, decoder_layer in enumerate(self.layers):
665
+ if output_hidden_states:
666
+ all_hidden_states += (hidden_states,)
667
+
668
+ past_key_value = (
669
+ past_key_values[idx] if past_key_values is not None else None
670
+ )
671
+
672
+ if self.gradient_checkpointing and self.training:
673
+
674
+ def create_custom_forward(module):
675
+ def custom_forward(*inputs):
676
+ # None for past_key_value
677
+ return module(*inputs, past_key_value, output_attentions)
678
+
679
+ return custom_forward
680
+
681
+ layer_outputs = torch.utils.checkpoint.checkpoint(
682
+ create_custom_forward(decoder_layer),
683
+ hidden_states,
684
+ attention_mask,
685
+ position_ids,
686
+ )
687
+ else:
688
+ layer_outputs = decoder_layer(
689
+ hidden_states,
690
+ attention_mask=attention_mask,
691
+ position_ids=position_ids,
692
+ past_key_value=past_key_value,
693
+ output_attentions=output_attentions,
694
+ use_cache=use_cache,
695
+ )
696
+
697
+ hidden_states = layer_outputs[0]
698
+
699
+ if use_cache:
700
+ next_decoder_cache += (layer_outputs[2 if output_attentions else 1],)
701
+
702
+ if output_attentions:
703
+ all_self_attns += (layer_outputs[1],)
704
+
705
+ hidden_states = self.norm(hidden_states)
706
+ # add hidden states from the last decoder layer
707
+ if output_hidden_states:
708
+ all_hidden_states += (hidden_states,)
709
+
710
+ next_cache = next_decoder_cache if use_cache else None
711
+ if not return_dict:
712
+ return tuple(
713
+ v
714
+ for v in [hidden_states, next_cache, all_hidden_states, all_self_attns]
715
+ if v is not None
716
+ )
717
+ return BaseModelOutputWithPast(
718
+ last_hidden_state=hidden_states,
719
+ past_key_values=next_cache,
720
+ hidden_states=all_hidden_states,
721
+ attentions=all_self_attns,
722
+ )
723
+
724
+
725
+ class YiForCausalLM(YiPreTrainedModel):
726
+ _tied_weights_keys = ["lm_head.weight"]
727
+
728
+ def __init__(self, config):
729
+ super().__init__(config)
730
+ self.model = YiModel(config)
731
+
732
+ self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
733
+
734
+ # Initialize weights and apply final processing
735
+ self.post_init()
736
+
737
+ def get_input_embeddings(self):
738
+ return self.model.embed_tokens
739
+
740
+ def set_input_embeddings(self, value):
741
+ self.model.embed_tokens = value
742
+
743
+ def get_output_embeddings(self):
744
+ return self.lm_head
745
+
746
+ def set_output_embeddings(self, new_embeddings):
747
+ self.lm_head = new_embeddings
748
+
749
+ def set_decoder(self, decoder):
750
+ self.model = decoder
751
+
752
+ def get_decoder(self):
753
+ return self.model
754
+
755
+ @add_start_docstrings_to_model_forward(Yi_INPUTS_DOCSTRING)
756
+ @replace_return_docstrings(
757
+ output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC
758
+ )
759
+ def forward(
760
+ self,
761
+ input_ids: torch.LongTensor = None,
762
+ attention_mask: Optional[torch.Tensor] = None,
763
+ position_ids: Optional[torch.LongTensor] = None,
764
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
765
+ inputs_embeds: Optional[torch.FloatTensor] = None,
766
+ labels: Optional[torch.LongTensor] = None,
767
+ use_cache: Optional[bool] = None,
768
+ output_attentions: Optional[bool] = None,
769
+ output_hidden_states: Optional[bool] = None,
770
+ return_dict: Optional[bool] = None,
771
+ ) -> Union[Tuple, CausalLMOutputWithPast]:
772
+ r"""
773
+ Args:
774
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
775
+ Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
776
+ config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
777
+ (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
778
+
779
+ Returns:
780
+
781
+ Example:
782
+
783
+ ```python
784
+ >>> from transformers import AutoTokenizer, YiForCausalLM
785
+
786
+ >>> model = YiForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
787
+ >>> tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER)
788
+
789
+ >>> prompt = "Hey, are you conscious? Can you talk to me?"
790
+ >>> inputs = tokenizer(prompt, return_tensors="pt")
791
+
792
+ >>> # Generate
793
+ >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
794
+ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
795
+ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
796
+ ```"""
797
+
798
+ output_attentions = (
799
+ output_attentions
800
+ if output_attentions is not None
801
+ else self.config.output_attentions
802
+ )
803
+ output_hidden_states = (
804
+ output_hidden_states
805
+ if output_hidden_states is not None
806
+ else self.config.output_hidden_states
807
+ )
808
+ return_dict = (
809
+ return_dict if return_dict is not None else self.config.use_return_dict
810
+ )
811
+
812
+ # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
813
+ outputs = self.model(
814
+ input_ids=input_ids,
815
+ attention_mask=attention_mask,
816
+ position_ids=position_ids,
817
+ past_key_values=past_key_values,
818
+ inputs_embeds=inputs_embeds,
819
+ use_cache=use_cache,
820
+ output_attentions=output_attentions,
821
+ output_hidden_states=output_hidden_states,
822
+ return_dict=return_dict,
823
+ )
824
+
825
+ hidden_states = outputs[0]
826
+ logits = self.lm_head(hidden_states)
827
+
828
+ loss = None
829
+ if labels is not None:
830
+ # Shift so that tokens < n predict n
831
+ shift_logits = logits[..., :-1, :].contiguous()
832
+ shift_labels = labels[..., 1:].contiguous()
833
+ # Flatten the tokens
834
+ loss_fct = CrossEntropyLoss()
835
+ shift_logits = shift_logits.view(-1, self.config.vocab_size)
836
+ shift_labels = shift_labels.view(-1)
837
+ # Enable model parallelism
838
+ shift_labels = shift_labels.to(shift_logits.device)
839
+ loss = loss_fct(shift_logits, shift_labels)
840
+
841
+ if not return_dict:
842
+ output = (logits,) + outputs[1:]
843
+ return (loss,) + output if loss is not None else output
844
+
845
+ return CausalLMOutputWithPast(
846
+ loss=loss,
847
+ logits=logits,
848
+ past_key_values=outputs.past_key_values,
849
+ hidden_states=outputs.hidden_states,
850
+ attentions=outputs.attentions,
851
+ )
852
+
853
+ def prepare_inputs_for_generation(
854
+ self,
855
+ input_ids,
856
+ past_key_values=None,
857
+ attention_mask=None,
858
+ inputs_embeds=None,
859
+ **kwargs,
860
+ ):
861
+ if past_key_values:
862
+ input_ids = input_ids[:, -1:]
863
+
864
+ position_ids = kwargs.get("position_ids", None)
865
+ if attention_mask is not None and position_ids is None:
866
+ # create position_ids on the fly for batch generation
867
+ position_ids = attention_mask.long().cumsum(-1) - 1
868
+ position_ids.masked_fill_(attention_mask == 0, 1)
869
+ if past_key_values:
870
+ position_ids = position_ids[:, -1].unsqueeze(-1)
871
+
872
+ # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
873
+ if inputs_embeds is not None and past_key_values is None:
874
+ model_inputs = {"inputs_embeds": inputs_embeds}
875
+ else:
876
+ model_inputs = {"input_ids": input_ids}
877
+
878
+ model_inputs.update(
879
+ {
880
+ "position_ids": position_ids,
881
+ "past_key_values": past_key_values,
882
+ "use_cache": kwargs.get("use_cache"),
883
+ "attention_mask": attention_mask,
884
+ }
885
+ )
886
+ return model_inputs
887
+
888
+ @staticmethod
889
+ def _reorder_cache(past_key_values, beam_idx):
890
+ reordered_past = ()
891
+ for layer_past in past_key_values:
892
+ reordered_past += (
893
+ tuple(
894
+ past_state.index_select(0, beam_idx.to(past_state.device))
895
+ for past_state in layer_past
896
+ ),
897
+ )
898
+ return reordered_past
899
+
900
+
901
+ @add_start_docstrings(
902
+ """
903
+ The Yi Model transformer with a sequence classification head on top (linear layer).
904
+
905
+ [`YiForSequenceClassification`] uses the last token in order to do the classification, as other causal models
906
+ (e.g. GPT-2) do.
907
+
908
+ Since it does classification on the last token, it requires to know the position of the last token. If a
909
+ `pad_token_id` is defined in the configuration, it finds the last token that is not a padding token in each row. If
910
+ no `pad_token_id` is defined, it simply takes the last value in each row of the batch. Since it cannot guess the
911
+ padding tokens when `inputs_embeds` are passed instead of `input_ids`, it does the same (take the last value in
912
+ each row of the batch).
913
+ """,
914
+ Yi_START_DOCSTRING,
915
+ )
916
+ class YiForSequenceClassification(YiPreTrainedModel):
917
+ def __init__(self, config):
918
+ super().__init__(config)
919
+ self.num_labels = config.num_labels
920
+ self.model = YiModel(config)
921
+ self.score = nn.Linear(config.hidden_size, self.num_labels, bias=False)
922
+
923
+ # Initialize weights and apply final processing
924
+ self.post_init()
925
+
926
+ def get_input_embeddings(self):
927
+ return self.model.embed_tokens
928
+
929
+ def set_input_embeddings(self, value):
930
+ self.model.embed_tokens = value
931
+
932
+ @add_start_docstrings_to_model_forward(Yi_INPUTS_DOCSTRING)
933
+ def forward(
934
+ self,
935
+ input_ids: torch.LongTensor = None,
936
+ attention_mask: Optional[torch.Tensor] = None,
937
+ position_ids: Optional[torch.LongTensor] = None,
938
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
939
+ inputs_embeds: Optional[torch.FloatTensor] = None,
940
+ labels: Optional[torch.LongTensor] = None,
941
+ use_cache: Optional[bool] = None,
942
+ output_attentions: Optional[bool] = None,
943
+ output_hidden_states: Optional[bool] = None,
944
+ return_dict: Optional[bool] = None,
945
+ ) -> Union[Tuple, SequenceClassifierOutputWithPast]:
946
+ r"""
947
+ labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
948
+ Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
949
+ config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
950
+ `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
951
+ """
952
+ return_dict = (
953
+ return_dict if return_dict is not None else self.config.use_return_dict
954
+ )
955
+
956
+ transformer_outputs = self.model(
957
+ input_ids,
958
+ attention_mask=attention_mask,
959
+ position_ids=position_ids,
960
+ past_key_values=past_key_values,
961
+ inputs_embeds=inputs_embeds,
962
+ use_cache=use_cache,
963
+ output_attentions=output_attentions,
964
+ output_hidden_states=output_hidden_states,
965
+ return_dict=return_dict,
966
+ )
967
+ hidden_states = transformer_outputs[0]
968
+ logits = self.score(hidden_states)
969
+
970
+ if input_ids is not None:
971
+ batch_size = input_ids.shape[0]
972
+ else:
973
+ batch_size = inputs_embeds.shape[0]
974
+
975
+ if self.config.pad_token_id is None and batch_size != 1:
976
+ raise ValueError(
977
+ "Cannot handle batch sizes > 1 if no padding token is defined."
978
+ )
979
+ if self.config.pad_token_id is None:
980
+ sequence_lengths = -1
981
+ else:
982
+ if input_ids is not None:
983
+ sequence_lengths = (
984
+ torch.eq(input_ids, self.config.pad_token_id).long().argmax(-1) - 1
985
+ ).to(logits.device)
986
+ else:
987
+ sequence_lengths = -1
988
+
989
+ pooled_logits = logits[
990
+ torch.arange(batch_size, device=logits.device), sequence_lengths
991
+ ]
992
+
993
+ loss = None
994
+ if labels is not None:
995
+ labels = labels.to(logits.device)
996
+ if self.config.problem_type is None:
997
+ if self.num_labels == 1:
998
+ self.config.problem_type = "regression"
999
+ elif self.num_labels > 1 and (
1000
+ labels.dtype == torch.long or labels.dtype == torch.int
1001
+ ):
1002
+ self.config.problem_type = "single_label_classification"
1003
+ else:
1004
+ self.config.problem_type = "multi_label_classification"
1005
+
1006
+ if self.config.problem_type == "regression":
1007
+ loss_fct = MSELoss()
1008
+ if self.num_labels == 1:
1009
+ loss = loss_fct(pooled_logits.squeeze(), labels.squeeze())
1010
+ else:
1011
+ loss = loss_fct(pooled_logits, labels)
1012
+ elif self.config.problem_type == "single_label_classification":
1013
+ loss_fct = CrossEntropyLoss()
1014
+ loss = loss_fct(
1015
+ pooled_logits.view(-1, self.num_labels), labels.view(-1)
1016
+ )
1017
+ elif self.config.problem_type == "multi_label_classification":
1018
+ loss_fct = BCEWithLogitsLoss()
1019
+ loss = loss_fct(pooled_logits, labels)
1020
+ if not return_dict:
1021
+ output = (pooled_logits,) + transformer_outputs[1:]
1022
+ return ((loss,) + output) if loss is not None else output
1023
+
1024
+ return SequenceClassifierOutputWithPast(
1025
+ loss=loss,
1026
+ logits=pooled_logits,
1027
+ past_key_values=transformer_outputs.past_key_values,
1028
+ hidden_states=transformer_outputs.hidden_states,
1029
+ attentions=transformer_outputs.attentions,
1030
+ )
output-00001-of-00003.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4035ba438a97dbaba01e4925fb648da745ed0b6fe326562a456e767b2e77710a
3
+ size 8532843040
output-00002-of-00003.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:09a4769bd806c04dbb28ba160a96e627515e8d6b0f35a02aad674a8f014cf60d
3
+ size 8523442296
output-00003-of-00003.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c9fbc6dde7b363c2fc5e1db2508047ac022cf1d3cbc4c4fae1a4512ec449100b
3
+ size 1079114248
pytorch_model.bin.index.json ADDED
@@ -0,0 +1,550 @@
1
+ {
2
+ "metadata": {
3
+ "total_size": 68777834496
4
+ },
5
+ "weight_map": {
6
+ "lm_head.weight": "pytorch_model-00077-of-00077.bin",
7
+ "model.embed_tokens.weight": "pytorch_model-00001-of-00077.bin",
8
+ "model.layers.0.ln1.weight": "pytorch_model-00003-of-00077.bin",
9
+ "model.layers.0.ln2.weight": "pytorch_model-00003-of-00077.bin",
10
+ "model.layers.0.mlp.down_proj.weight": "pytorch_model-00002-of-00077.bin",
11
+ "model.layers.0.mlp.gate_proj.weight": "pytorch_model-00002-of-00077.bin",
12
+ "model.layers.0.mlp.up_proj.weight": "pytorch_model-00003-of-00077.bin",
13
+ "model.layers.0.self_attn.k_proj.weight": "pytorch_model-00002-of-00077.bin",
14
+ "model.layers.0.self_attn.o_proj.weight": "pytorch_model-00002-of-00077.bin",
15
+ "model.layers.0.self_attn.q_proj.weight": "pytorch_model-00002-of-00077.bin",
16
+ "model.layers.0.self_attn.v_proj.weight": "pytorch_model-00002-of-00077.bin",
17
+ "model.layers.1.ln1.weight": "pytorch_model-00004-of-00077.bin",
18
+ "model.layers.1.ln2.weight": "pytorch_model-00004-of-00077.bin",
19
+ "model.layers.1.mlp.down_proj.weight": "pytorch_model-00004-of-00077.bin",
20
+ "model.layers.1.mlp.gate_proj.weight": "pytorch_model-00003-of-00077.bin",
21
+ "model.layers.1.mlp.up_proj.weight": "pytorch_model-00004-of-00077.bin",
22
+ "model.layers.1.self_attn.k_proj.weight": "pytorch_model-00003-of-00077.bin",
23
+ "model.layers.1.self_attn.o_proj.weight": "pytorch_model-00003-of-00077.bin",
24
+ "model.layers.1.self_attn.q_proj.weight": "pytorch_model-00003-of-00077.bin",
25
+ "model.layers.1.self_attn.v_proj.weight": "pytorch_model-00003-of-00077.bin",
26
+ "model.layers.10.ln1.weight": "pytorch_model-00015-of-00077.bin",
27
+ "model.layers.10.ln2.weight": "pytorch_model-00015-of-00077.bin",
28
+ "model.layers.10.mlp.down_proj.weight": "pytorch_model-00015-of-00077.bin",
29
+ "model.layers.10.mlp.gate_proj.weight": "pytorch_model-00015-of-00077.bin",
30
+ "model.layers.10.mlp.up_proj.weight": "pytorch_model-00015-of-00077.bin",
31
+ "model.layers.10.self_attn.k_proj.weight": "pytorch_model-00014-of-00077.bin",
32
+ "model.layers.10.self_attn.o_proj.weight": "pytorch_model-00014-of-00077.bin",
33
+ "model.layers.10.self_attn.q_proj.weight": "pytorch_model-00014-of-00077.bin",
34
+ "model.layers.10.self_attn.v_proj.weight": "pytorch_model-00014-of-00077.bin",
35
+ "model.layers.11.ln1.weight": "pytorch_model-00016-of-00077.bin",
36
+ "model.layers.11.ln2.weight": "pytorch_model-00016-of-00077.bin",
37
+ "model.layers.11.mlp.down_proj.weight": "pytorch_model-00016-of-00077.bin",
38
+ "model.layers.11.mlp.gate_proj.weight": "pytorch_model-00016-of-00077.bin",
39
+ "model.layers.11.mlp.up_proj.weight": "pytorch_model-00016-of-00077.bin",
40
+ "model.layers.11.self_attn.k_proj.weight": "pytorch_model-00015-of-00077.bin",
41
+ "model.layers.11.self_attn.o_proj.weight": "pytorch_model-00016-of-00077.bin",
42
+ "model.layers.11.self_attn.q_proj.weight": "pytorch_model-00015-of-00077.bin",
43
+ "model.layers.11.self_attn.v_proj.weight": "pytorch_model-00016-of-00077.bin",
44
+ "model.layers.12.ln1.weight": "pytorch_model-00018-of-00077.bin",
45
+ "model.layers.12.ln2.weight": "pytorch_model-00018-of-00077.bin",
46
+ "model.layers.12.mlp.down_proj.weight": "pytorch_model-00017-of-00077.bin",
47
+ "model.layers.12.mlp.gate_proj.weight": "pytorch_model-00017-of-00077.bin",
48
+ "model.layers.12.mlp.up_proj.weight": "pytorch_model-00018-of-00077.bin",
49
+ "model.layers.12.self_attn.k_proj.weight": "pytorch_model-00017-of-00077.bin",
50
+ "model.layers.12.self_attn.o_proj.weight": "pytorch_model-00017-of-00077.bin",
51
+ "model.layers.12.self_attn.q_proj.weight": "pytorch_model-00017-of-00077.bin",
52
+ "model.layers.12.self_attn.v_proj.weight": "pytorch_model-00017-of-00077.bin",
53
+ "model.layers.13.ln1.weight": "pytorch_model-00019-of-00077.bin",
54
+ "model.layers.13.ln2.weight": "pytorch_model-00019-of-00077.bin",
55
+ "model.layers.13.mlp.down_proj.weight": "pytorch_model-00019-of-00077.bin",
56
+ "model.layers.13.mlp.gate_proj.weight": "pytorch_model-00018-of-00077.bin",
57
+ "model.layers.13.mlp.up_proj.weight": "pytorch_model-00019-of-00077.bin",
58
+ "model.layers.13.self_attn.k_proj.weight": "pytorch_model-00018-of-00077.bin",
59
+ "model.layers.13.self_attn.o_proj.weight": "pytorch_model-00018-of-00077.bin",
60
+ "model.layers.13.self_attn.q_proj.weight": "pytorch_model-00018-of-00077.bin",
61
+ "model.layers.13.self_attn.v_proj.weight": "pytorch_model-00018-of-00077.bin",
62
+ "model.layers.14.ln1.weight": "pytorch_model-00020-of-00077.bin",
63
+ "model.layers.14.ln2.weight": "pytorch_model-00020-of-00077.bin",
64
+ "model.layers.14.mlp.down_proj.weight": "pytorch_model-00020-of-00077.bin",
65
+ "model.layers.14.mlp.gate_proj.weight": "pytorch_model-00020-of-00077.bin",
66
+ "model.layers.14.mlp.up_proj.weight": "pytorch_model-00020-of-00077.bin",
67
+ "model.layers.14.self_attn.k_proj.weight": "pytorch_model-00019-of-00077.bin",
68
+ "model.layers.14.self_attn.o_proj.weight": "pytorch_model-00019-of-00077.bin",
69
+ "model.layers.14.self_attn.q_proj.weight": "pytorch_model-00019-of-00077.bin",
70
+ "model.layers.14.self_attn.v_proj.weight": "pytorch_model-00019-of-00077.bin",
71
+ "model.layers.15.ln1.weight": "pytorch_model-00021-of-00077.bin",
72
+ "model.layers.15.ln2.weight": "pytorch_model-00021-of-00077.bin",
73
+ "model.layers.15.mlp.down_proj.weight": "pytorch_model-00021-of-00077.bin",
74
+ "model.layers.15.mlp.gate_proj.weight": "pytorch_model-00021-of-00077.bin",
75
+ "model.layers.15.mlp.up_proj.weight": "pytorch_model-00021-of-00077.bin",
76
+ "model.layers.15.self_attn.k_proj.weight": "pytorch_model-00020-of-00077.bin",
77
+ "model.layers.15.self_attn.o_proj.weight": "pytorch_model-00021-of-00077.bin",
78
+ "model.layers.15.self_attn.q_proj.weight": "pytorch_model-00020-of-00077.bin",
79
+ "model.layers.15.self_attn.v_proj.weight": "pytorch_model-00021-of-00077.bin",
80
+ "model.layers.16.ln1.weight": "pytorch_model-00023-of-00077.bin",
81
+ "model.layers.16.ln2.weight": "pytorch_model-00023-of-00077.bin",
82
+ "model.layers.16.mlp.down_proj.weight": "pytorch_model-00022-of-00077.bin",
83
+ "model.layers.16.mlp.gate_proj.weight": "pytorch_model-00022-of-00077.bin",
84
+ "model.layers.16.mlp.up_proj.weight": "pytorch_model-00023-of-00077.bin",
85
+ "model.layers.16.self_attn.k_proj.weight": "pytorch_model-00022-of-00077.bin",
86
+ "model.layers.16.self_attn.o_proj.weight": "pytorch_model-00022-of-00077.bin",
87
+ "model.layers.16.self_attn.q_proj.weight": "pytorch_model-00022-of-00077.bin",
88
+ "model.layers.16.self_attn.v_proj.weight": "pytorch_model-00022-of-00077.bin",
89
+ "model.layers.17.ln1.weight": "pytorch_model-00024-of-00077.bin",
90
+ "model.layers.17.ln2.weight": "pytorch_model-00024-of-00077.bin",
91
+ "model.layers.17.mlp.down_proj.weight": "pytorch_model-00024-of-00077.bin",
92
+ "model.layers.17.mlp.gate_proj.weight": "pytorch_model-00023-of-00077.bin",
93
+ "model.layers.17.mlp.up_proj.weight": "pytorch_model-00024-of-00077.bin",
94
+ "model.layers.17.self_attn.k_proj.weight": "pytorch_model-00023-of-00077.bin",
95
+ "model.layers.17.self_attn.o_proj.weight": "pytorch_model-00023-of-00077.bin",
96
+ "model.layers.17.self_attn.q_proj.weight": "pytorch_model-00023-of-00077.bin",
97
+ "model.layers.17.self_attn.v_proj.weight": "pytorch_model-00023-of-00077.bin",
98
+ "model.layers.18.ln1.weight": "pytorch_model-00025-of-00077.bin",
99
+ "model.layers.18.ln2.weight": "pytorch_model-00025-of-00077.bin",
100
+ "model.layers.18.mlp.down_proj.weight": "pytorch_model-00025-of-00077.bin",
101
+ "model.layers.18.mlp.gate_proj.weight": "pytorch_model-00025-of-00077.bin",
102
+ "model.layers.18.mlp.up_proj.weight": "pytorch_model-00025-of-00077.bin",
103
+ "model.layers.18.self_attn.k_proj.weight": "pytorch_model-00024-of-00077.bin",
104
+ "model.layers.18.self_attn.o_proj.weight": "pytorch_model-00024-of-00077.bin",
105
+ "model.layers.18.self_attn.q_proj.weight": "pytorch_model-00024-of-00077.bin",
106
+ "model.layers.18.self_attn.v_proj.weight": "pytorch_model-00024-of-00077.bin",
107
+ "model.layers.19.ln1.weight": "pytorch_model-00026-of-00077.bin",
108
+ "model.layers.19.ln2.weight": "pytorch_model-00026-of-00077.bin",
109
+ "model.layers.19.mlp.down_proj.weight": "pytorch_model-00026-of-00077.bin",
110
+ "model.layers.19.mlp.gate_proj.weight": "pytorch_model-00026-of-00077.bin",
111
+ "model.layers.19.mlp.up_proj.weight": "pytorch_model-00026-of-00077.bin",
112
+ "model.layers.19.self_attn.k_proj.weight": "pytorch_model-00025-of-00077.bin",
113
+ "model.layers.19.self_attn.o_proj.weight": "pytorch_model-00026-of-00077.bin",
114
+ "model.layers.19.self_attn.q_proj.weight": "pytorch_model-00025-of-00077.bin",
115
+ "model.layers.19.self_attn.v_proj.weight": "pytorch_model-00026-of-00077.bin",
116
+ "model.layers.2.ln1.weight": "pytorch_model-00005-of-00077.bin",
117
+ "model.layers.2.ln2.weight": "pytorch_model-00005-of-00077.bin",
118
+ "model.layers.2.mlp.down_proj.weight": "pytorch_model-00005-of-00077.bin",
119
+ "model.layers.2.mlp.gate_proj.weight": "pytorch_model-00005-of-00077.bin",
120
+ "model.layers.2.mlp.up_proj.weight": "pytorch_model-00005-of-00077.bin",
121
+ "model.layers.2.self_attn.k_proj.weight": "pytorch_model-00004-of-00077.bin",
122
+ "model.layers.2.self_attn.o_proj.weight": "pytorch_model-00004-of-00077.bin",
123
+ "model.layers.2.self_attn.q_proj.weight": "pytorch_model-00004-of-00077.bin",
124
+ "model.layers.2.self_attn.v_proj.weight": "pytorch_model-00004-of-00077.bin",
125
+ "model.layers.20.ln1.weight": "pytorch_model-00028-of-00077.bin",
126
+ "model.layers.20.ln2.weight": "pytorch_model-00028-of-00077.bin",
127
+ "model.layers.20.mlp.down_proj.weight": "pytorch_model-00027-of-00077.bin",
128
+ "model.layers.20.mlp.gate_proj.weight": "pytorch_model-00027-of-00077.bin",
129
+ "model.layers.20.mlp.up_proj.weight": "pytorch_model-00028-of-00077.bin",
130
+ "model.layers.20.self_attn.k_proj.weight": "pytorch_model-00027-of-00077.bin",
131
+ "model.layers.20.self_attn.o_proj.weight": "pytorch_model-00027-of-00077.bin",
132
+ "model.layers.20.self_attn.q_proj.weight": "pytorch_model-00027-of-00077.bin",
133
+ "model.layers.20.self_attn.v_proj.weight": "pytorch_model-00027-of-00077.bin",
134
+ "model.layers.21.ln1.weight": "pytorch_model-00029-of-00077.bin",
135
+ "model.layers.21.ln2.weight": "pytorch_model-00029-of-00077.bin",
136
+ "model.layers.21.mlp.down_proj.weight": "pytorch_model-00029-of-00077.bin",
137
+ "model.layers.21.mlp.gate_proj.weight": "pytorch_model-00028-of-00077.bin",
138
+ "model.layers.21.mlp.up_proj.weight": "pytorch_model-00029-of-00077.bin",
139
+ "model.layers.21.self_attn.k_proj.weight": "pytorch_model-00028-of-00077.bin",
140
+ "model.layers.21.self_attn.o_proj.weight": "pytorch_model-00028-of-00077.bin",
141
+ "model.layers.21.self_attn.q_proj.weight": "pytorch_model-00028-of-00077.bin",
142
+ "model.layers.21.self_attn.v_proj.weight": "pytorch_model-00028-of-00077.bin",
143
+ "model.layers.22.ln1.weight": "pytorch_model-00030-of-00077.bin",
144
+ "model.layers.22.ln2.weight": "pytorch_model-00030-of-00077.bin",
145
+ "model.layers.22.mlp.down_proj.weight": "pytorch_model-00030-of-00077.bin",
146
+ "model.layers.22.mlp.gate_proj.weight": "pytorch_model-00030-of-00077.bin",
147
+ "model.layers.22.mlp.up_proj.weight": "pytorch_model-00030-of-00077.bin",
148
+ "model.layers.22.self_attn.k_proj.weight": "pytorch_model-00029-of-00077.bin",
149
+ "model.layers.22.self_attn.o_proj.weight": "pytorch_model-00029-of-00077.bin",
150
+ "model.layers.22.self_attn.q_proj.weight": "pytorch_model-00029-of-00077.bin",
151
+ "model.layers.22.self_attn.v_proj.weight": "pytorch_model-00029-of-00077.bin",
152
+ "model.layers.23.ln1.weight": "pytorch_model-00031-of-00077.bin",
153
+ "model.layers.23.ln2.weight": "pytorch_model-00031-of-00077.bin",
154
+ "model.layers.23.mlp.down_proj.weight": "pytorch_model-00031-of-00077.bin",
155
+ "model.layers.23.mlp.gate_proj.weight": "pytorch_model-00031-of-00077.bin",
156
+ "model.layers.23.mlp.up_proj.weight": "pytorch_model-00031-of-00077.bin",
157
+ "model.layers.23.self_attn.k_proj.weight": "pytorch_model-00030-of-00077.bin",
158
+ "model.layers.23.self_attn.o_proj.weight": "pytorch_model-00031-of-00077.bin",
159
+ "model.layers.23.self_attn.q_proj.weight": "pytorch_model-00030-of-00077.bin",
160
+ "model.layers.23.self_attn.v_proj.weight": "pytorch_model-00031-of-00077.bin",
161
+ "model.layers.24.ln1.weight": "pytorch_model-00033-of-00077.bin",
162
+ "model.layers.24.ln2.weight": "pytorch_model-00033-of-00077.bin",
163
+ "model.layers.24.mlp.down_proj.weight": "pytorch_model-00032-of-00077.bin",
164
+ "model.layers.24.mlp.gate_proj.weight": "pytorch_model-00032-of-00077.bin",
165
+ "model.layers.24.mlp.up_proj.weight": "pytorch_model-00033-of-00077.bin",
166
+ "model.layers.24.self_attn.k_proj.weight": "pytorch_model-00032-of-00077.bin",
167
+ "model.layers.24.self_attn.o_proj.weight": "pytorch_model-00032-of-00077.bin",
168
+ "model.layers.24.self_attn.q_proj.weight": "pytorch_model-00032-of-00077.bin",
169
+ "model.layers.24.self_attn.v_proj.weight": "pytorch_model-00032-of-00077.bin",
170
+ "model.layers.25.ln1.weight": "pytorch_model-00034-of-00077.bin",
171
+ "model.layers.25.ln2.weight": "pytorch_model-00034-of-00077.bin",
172
+ "model.layers.25.mlp.down_proj.weight": "pytorch_model-00034-of-00077.bin",
173
+ "model.layers.25.mlp.gate_proj.weight": "pytorch_model-00033-of-00077.bin",
174
+ "model.layers.25.mlp.up_proj.weight": "pytorch_model-00034-of-00077.bin",
175
+ "model.layers.25.self_attn.k_proj.weight": "pytorch_model-00033-of-00077.bin",
176
+ "model.layers.25.self_attn.o_proj.weight": "pytorch_model-00033-of-00077.bin",
177
+ "model.layers.25.self_attn.q_proj.weight": "pytorch_model-00033-of-00077.bin",
178
+ "model.layers.25.self_attn.v_proj.weight": "pytorch_model-00033-of-00077.bin",
179
+ "model.layers.26.ln1.weight": "pytorch_model-00035-of-00077.bin",
180
+ "model.layers.26.ln2.weight": "pytorch_model-00035-of-00077.bin",
181
+ "model.layers.26.mlp.down_proj.weight": "pytorch_model-00035-of-00077.bin",
182
+ "model.layers.26.mlp.gate_proj.weight": "pytorch_model-00035-of-00077.bin",
183
+ "model.layers.26.mlp.up_proj.weight": "pytorch_model-00035-of-00077.bin",
184
+ "model.layers.26.self_attn.k_proj.weight": "pytorch_model-00034-of-00077.bin",
185
+ "model.layers.26.self_attn.o_proj.weight": "pytorch_model-00034-of-00077.bin",
186
+ "model.layers.26.self_attn.q_proj.weight": "pytorch_model-00034-of-00077.bin",
187
+ "model.layers.26.self_attn.v_proj.weight": "pytorch_model-00034-of-00077.bin",
188
+ "model.layers.27.ln1.weight": "pytorch_model-00036-of-00077.bin",
189
+ "model.layers.27.ln2.weight": "pytorch_model-00036-of-00077.bin",
190
+ "model.layers.27.mlp.down_proj.weight": "pytorch_model-00036-of-00077.bin",
191
+ "model.layers.27.mlp.gate_proj.weight": "pytorch_model-00036-of-00077.bin",
192
+ "model.layers.27.mlp.up_proj.weight": "pytorch_model-00036-of-00077.bin",
193
+ "model.layers.27.self_attn.k_proj.weight": "pytorch_model-00035-of-00077.bin",
194
+ "model.layers.27.self_attn.o_proj.weight": "pytorch_model-00036-of-00077.bin",
195
+ "model.layers.27.self_attn.q_proj.weight": "pytorch_model-00035-of-00077.bin",
196
+ "model.layers.27.self_attn.v_proj.weight": "pytorch_model-00036-of-00077.bin",
197
+ "model.layers.28.ln1.weight": "pytorch_model-00038-of-00077.bin",
198
+ "model.layers.28.ln2.weight": "pytorch_model-00038-of-00077.bin",
199
+ "model.layers.28.mlp.down_proj.weight": "pytorch_model-00037-of-00077.bin",
200
+ "model.layers.28.mlp.gate_proj.weight": "pytorch_model-00037-of-00077.bin",
201
+ "model.layers.28.mlp.up_proj.weight": "pytorch_model-00038-of-00077.bin",
202
+ "model.layers.28.self_attn.k_proj.weight": "pytorch_model-00037-of-00077.bin",
203
+ "model.layers.28.self_attn.o_proj.weight": "pytorch_model-00037-of-00077.bin",
204
+ "model.layers.28.self_attn.q_proj.weight": "pytorch_model-00037-of-00077.bin",
205
+ "model.layers.28.self_attn.v_proj.weight": "pytorch_model-00037-of-00077.bin",
206
+ "model.layers.29.ln1.weight": "pytorch_model-00039-of-00077.bin",
207
+ "model.layers.29.ln2.weight": "pytorch_model-00039-of-00077.bin",
208
+ "model.layers.29.mlp.down_proj.weight": "pytorch_model-00039-of-00077.bin",
209
+ "model.layers.29.mlp.gate_proj.weight": "pytorch_model-00038-of-00077.bin",
210
+ "model.layers.29.mlp.up_proj.weight": "pytorch_model-00039-of-00077.bin",
211
+ "model.layers.29.self_attn.k_proj.weight": "pytorch_model-00038-of-00077.bin",
212
+ "model.layers.29.self_attn.o_proj.weight": "pytorch_model-00038-of-00077.bin",
213
+ "model.layers.29.self_attn.q_proj.weight": "pytorch_model-00038-of-00077.bin",
214
+ "model.layers.29.self_attn.v_proj.weight": "pytorch_model-00038-of-00077.bin",
215
+ "model.layers.3.ln1.weight": "pytorch_model-00006-of-00077.bin",
216
+ "model.layers.3.ln2.weight": "pytorch_model-00006-of-00077.bin",
217
+ "model.layers.3.mlp.down_proj.weight": "pytorch_model-00006-of-00077.bin",
218
+ "model.layers.3.mlp.gate_proj.weight": "pytorch_model-00006-of-00077.bin",
219
+ "model.layers.3.mlp.up_proj.weight": "pytorch_model-00006-of-00077.bin",
220
+ "model.layers.3.self_attn.k_proj.weight": "pytorch_model-00005-of-00077.bin",
221
+ "model.layers.3.self_attn.o_proj.weight": "pytorch_model-00006-of-00077.bin",
222
+ "model.layers.3.self_attn.q_proj.weight": "pytorch_model-00005-of-00077.bin",
223
+ "model.layers.3.self_attn.v_proj.weight": "pytorch_model-00006-of-00077.bin",
224
+ "model.layers.30.ln1.weight": "pytorch_model-00040-of-00077.bin",
225
+ "model.layers.30.ln2.weight": "pytorch_model-00040-of-00077.bin",
226
+ "model.layers.30.mlp.down_proj.weight": "pytorch_model-00040-of-00077.bin",
227
+ "model.layers.30.mlp.gate_proj.weight": "pytorch_model-00040-of-00077.bin",
228
+ "model.layers.30.mlp.up_proj.weight": "pytorch_model-00040-of-00077.bin",
229
+ "model.layers.30.self_attn.k_proj.weight": "pytorch_model-00039-of-00077.bin",
230
+ "model.layers.30.self_attn.o_proj.weight": "pytorch_model-00039-of-00077.bin",
231
+ "model.layers.30.self_attn.q_proj.weight": "pytorch_model-00039-of-00077.bin",
232
+ "model.layers.30.self_attn.v_proj.weight": "pytorch_model-00039-of-00077.bin",
233
+ "model.layers.31.ln1.weight": "pytorch_model-00041-of-00077.bin",
234
+ "model.layers.31.ln2.weight": "pytorch_model-00041-of-00077.bin",
235
+ "model.layers.31.mlp.down_proj.weight": "pytorch_model-00041-of-00077.bin",
236
+ "model.layers.31.mlp.gate_proj.weight": "pytorch_model-00041-of-00077.bin",
237
+ "model.layers.31.mlp.up_proj.weight": "pytorch_model-00041-of-00077.bin",
238
+ "model.layers.31.self_attn.k_proj.weight": "pytorch_model-00040-of-00077.bin",
239
+ "model.layers.31.self_attn.o_proj.weight": "pytorch_model-00041-of-00077.bin",
240
+ "model.layers.31.self_attn.q_proj.weight": "pytorch_model-00040-of-00077.bin",
241
+ "model.layers.31.self_attn.v_proj.weight": "pytorch_model-00041-of-00077.bin",
242
+ "model.layers.32.ln1.weight": "pytorch_model-00043-of-00077.bin",
243
+ "model.layers.32.ln2.weight": "pytorch_model-00043-of-00077.bin",
244
+ "model.layers.32.mlp.down_proj.weight": "pytorch_model-00042-of-00077.bin",
245
+ "model.layers.32.mlp.gate_proj.weight": "pytorch_model-00042-of-00077.bin",
246
+ "model.layers.32.mlp.up_proj.weight": "pytorch_model-00043-of-00077.bin",
247
+ "model.layers.32.self_attn.k_proj.weight": "pytorch_model-00042-of-00077.bin",
248
+ "model.layers.32.self_attn.o_proj.weight": "pytorch_model-00042-of-00077.bin",
249
+ "model.layers.32.self_attn.q_proj.weight": "pytorch_model-00042-of-00077.bin",
250
+ "model.layers.32.self_attn.v_proj.weight": "pytorch_model-00042-of-00077.bin",
251
+ "model.layers.33.ln1.weight": "pytorch_model-00044-of-00077.bin",
252
+ "model.layers.33.ln2.weight": "pytorch_model-00044-of-00077.bin",
253
+ "model.layers.33.mlp.down_proj.weight": "pytorch_model-00044-of-00077.bin",
254
+ "model.layers.33.mlp.gate_proj.weight": "pytorch_model-00043-of-00077.bin",
255
+ "model.layers.33.mlp.up_proj.weight": "pytorch_model-00044-of-00077.bin",
256
+ "model.layers.33.self_attn.k_proj.weight": "pytorch_model-00043-of-00077.bin",
257
+ "model.layers.33.self_attn.o_proj.weight": "pytorch_model-00043-of-00077.bin",
258
+ "model.layers.33.self_attn.q_proj.weight": "pytorch_model-00043-of-00077.bin",
259
+ "model.layers.33.self_attn.v_proj.weight": "pytorch_model-00043-of-00077.bin",
260
+ "model.layers.34.ln1.weight": "pytorch_model-00045-of-00077.bin",
261
+ "model.layers.34.ln2.weight": "pytorch_model-00045-of-00077.bin",
262
+ "model.layers.34.mlp.down_proj.weight": "pytorch_model-00045-of-00077.bin",
263
+ "model.layers.34.mlp.gate_proj.weight": "pytorch_model-00045-of-00077.bin",
264
+ "model.layers.34.mlp.up_proj.weight": "pytorch_model-00045-of-00077.bin",
265
+ "model.layers.34.self_attn.k_proj.weight": "pytorch_model-00044-of-00077.bin",
266
+ "model.layers.34.self_attn.o_proj.weight": "pytorch_model-00044-of-00077.bin",
267
+ "model.layers.34.self_attn.q_proj.weight": "pytorch_model-00044-of-00077.bin",
268
+ "model.layers.34.self_attn.v_proj.weight": "pytorch_model-00044-of-00077.bin",
269
+ "model.layers.35.ln1.weight": "pytorch_model-00046-of-00077.bin",
270
+ "model.layers.35.ln2.weight": "pytorch_model-00046-of-00077.bin",
271
+ "model.layers.35.mlp.down_proj.weight": "pytorch_model-00046-of-00077.bin",
272
+ "model.layers.35.mlp.gate_proj.weight": "pytorch_model-00046-of-00077.bin",
273
+ "model.layers.35.mlp.up_proj.weight": "pytorch_model-00046-of-00077.bin",
274
+ "model.layers.35.self_attn.k_proj.weight": "pytorch_model-00045-of-00077.bin",
275
+ "model.layers.35.self_attn.o_proj.weight": "pytorch_model-00046-of-00077.bin",
276
+ "model.layers.35.self_attn.q_proj.weight": "pytorch_model-00045-of-00077.bin",
277
+ "model.layers.35.self_attn.v_proj.weight": "pytorch_model-00046-of-00077.bin",
278
+ "model.layers.36.ln1.weight": "pytorch_model-00048-of-00077.bin",
279
+ "model.layers.36.ln2.weight": "pytorch_model-00048-of-00077.bin",
280
+ "model.layers.36.mlp.down_proj.weight": "pytorch_model-00047-of-00077.bin",
281
+ "model.layers.36.mlp.gate_proj.weight": "pytorch_model-00047-of-00077.bin",
282
+ "model.layers.36.mlp.up_proj.weight": "pytorch_model-00048-of-00077.bin",
283
+ "model.layers.36.self_attn.k_proj.weight": "pytorch_model-00047-of-00077.bin",
284
+ "model.layers.36.self_attn.o_proj.weight": "pytorch_model-00047-of-00077.bin",
285
+ "model.layers.36.self_attn.q_proj.weight": "pytorch_model-00047-of-00077.bin",
286
+ "model.layers.36.self_attn.v_proj.weight": "pytorch_model-00047-of-00077.bin",
287
+ "model.layers.37.ln1.weight": "pytorch_model-00049-of-00077.bin",
288
+ "model.layers.37.ln2.weight": "pytorch_model-00049-of-00077.bin",
289
+ "model.layers.37.mlp.down_proj.weight": "pytorch_model-00049-of-00077.bin",
290
+ "model.layers.37.mlp.gate_proj.weight": "pytorch_model-00048-of-00077.bin",
291
+ "model.layers.37.mlp.up_proj.weight": "pytorch_model-00049-of-00077.bin",
292
+ "model.layers.37.self_attn.k_proj.weight": "pytorch_model-00048-of-00077.bin",
293
+ "model.layers.37.self_attn.o_proj.weight": "pytorch_model-00048-of-00077.bin",
294
+ "model.layers.37.self_attn.q_proj.weight": "pytorch_model-00048-of-00077.bin",
295
+ "model.layers.37.self_attn.v_proj.weight": "pytorch_model-00048-of-00077.bin",
296
+ "model.layers.38.ln1.weight": "pytorch_model-00050-of-00077.bin",
297
+ "model.layers.38.ln2.weight": "pytorch_model-00050-of-00077.bin",
298
+ "model.layers.38.mlp.down_proj.weight": "pytorch_model-00050-of-00077.bin",
299
+ "model.layers.38.mlp.gate_proj.weight": "pytorch_model-00050-of-00077.bin",
300
+ "model.layers.38.mlp.up_proj.weight": "pytorch_model-00050-of-00077.bin",
301
+ "model.layers.38.self_attn.k_proj.weight": "pytorch_model-00049-of-00077.bin",
302
+ "model.layers.38.self_attn.o_proj.weight": "pytorch_model-00049-of-00077.bin",
303
+ "model.layers.38.self_attn.q_proj.weight": "pytorch_model-00049-of-00077.bin",
304
+ "model.layers.38.self_attn.v_proj.weight": "pytorch_model-00049-of-00077.bin",
305
+ "model.layers.39.ln1.weight": "pytorch_model-00051-of-00077.bin",
306
+ "model.layers.39.ln2.weight": "pytorch_model-00051-of-00077.bin",
307
+ "model.layers.39.mlp.down_proj.weight": "pytorch_model-00051-of-00077.bin",
308
+ "model.layers.39.mlp.gate_proj.weight": "pytorch_model-00051-of-00077.bin",
309
+ "model.layers.39.mlp.up_proj.weight": "pytorch_model-00051-of-00077.bin",
310
+ "model.layers.39.self_attn.k_proj.weight": "pytorch_model-00050-of-00077.bin",
311
+ "model.layers.39.self_attn.o_proj.weight": "pytorch_model-00051-of-00077.bin",
312
+ "model.layers.39.self_attn.q_proj.weight": "pytorch_model-00050-of-00077.bin",
313
+ "model.layers.39.self_attn.v_proj.weight": "pytorch_model-00051-of-00077.bin",
314
+ "model.layers.4.ln1.weight": "pytorch_model-00008-of-00077.bin",
315
+ "model.layers.4.ln2.weight": "pytorch_model-00008-of-00077.bin",
316
+ "model.layers.4.mlp.down_proj.weight": "pytorch_model-00007-of-00077.bin",
317
+ "model.layers.4.mlp.gate_proj.weight": "pytorch_model-00007-of-00077.bin",
318
+ "model.layers.4.mlp.up_proj.weight": "pytorch_model-00008-of-00077.bin",
319
+ "model.layers.4.self_attn.k_proj.weight": "pytorch_model-00007-of-00077.bin",
320
+ "model.layers.4.self_attn.o_proj.weight": "pytorch_model-00007-of-00077.bin",
321
+ "model.layers.4.self_attn.q_proj.weight": "pytorch_model-00007-of-00077.bin",
322
+ "model.layers.4.self_attn.v_proj.weight": "pytorch_model-00007-of-00077.bin",
323
+ "model.layers.40.ln1.weight": "pytorch_model-00053-of-00077.bin",
324
+ "model.layers.40.ln2.weight": "pytorch_model-00053-of-00077.bin",
325
+ "model.layers.40.mlp.down_proj.weight": "pytorch_model-00052-of-00077.bin",
326
+ "model.layers.40.mlp.gate_proj.weight": "pytorch_model-00052-of-00077.bin",
327
+ "model.layers.40.mlp.up_proj.weight": "pytorch_model-00053-of-00077.bin",
328
+ "model.layers.40.self_attn.k_proj.weight": "pytorch_model-00052-of-00077.bin",
329
+ "model.layers.40.self_attn.o_proj.weight": "pytorch_model-00052-of-00077.bin",
330
+ "model.layers.40.self_attn.q_proj.weight": "pytorch_model-00052-of-00077.bin",
331
+ "model.layers.40.self_attn.v_proj.weight": "pytorch_model-00052-of-00077.bin",
332
+ "model.layers.41.ln1.weight": "pytorch_model-00054-of-00077.bin",
333
+ "model.layers.41.ln2.weight": "pytorch_model-00054-of-00077.bin",
334
+ "model.layers.41.mlp.down_proj.weight": "pytorch_model-00054-of-00077.bin",
335
+ "model.layers.41.mlp.gate_proj.weight": "pytorch_model-00053-of-00077.bin",
336
+ "model.layers.41.mlp.up_proj.weight": "pytorch_model-00054-of-00077.bin",
337
+ "model.layers.41.self_attn.k_proj.weight": "pytorch_model-00053-of-00077.bin",
338
+ "model.layers.41.self_attn.o_proj.weight": "pytorch_model-00053-of-00077.bin",
339
+ "model.layers.41.self_attn.q_proj.weight": "pytorch_model-00053-of-00077.bin",
340
+ "model.layers.41.self_attn.v_proj.weight": "pytorch_model-00053-of-00077.bin",
341
+ "model.layers.42.ln1.weight": "pytorch_model-00055-of-00077.bin",
342
+ "model.layers.42.ln2.weight": "pytorch_model-00055-of-00077.bin",
343
+ "model.layers.42.mlp.down_proj.weight": "pytorch_model-00055-of-00077.bin",
344
+ "model.layers.42.mlp.gate_proj.weight": "pytorch_model-00055-of-00077.bin",
345
+ "model.layers.42.mlp.up_proj.weight": "pytorch_model-00055-of-00077.bin",
346
+ "model.layers.42.self_attn.k_proj.weight": "pytorch_model-00054-of-00077.bin",
347
+ "model.layers.42.self_attn.o_proj.weight": "pytorch_model-00054-of-00077.bin",
348
+ "model.layers.42.self_attn.q_proj.weight": "pytorch_model-00054-of-00077.bin",
349
+ "model.layers.42.self_attn.v_proj.weight": "pytorch_model-00054-of-00077.bin",
350
+ "model.layers.43.ln1.weight": "pytorch_model-00056-of-00077.bin",
351
+ "model.layers.43.ln2.weight": "pytorch_model-00056-of-00077.bin",
352
+ "model.layers.43.mlp.down_proj.weight": "pytorch_model-00056-of-00077.bin",
353
+ "model.layers.43.mlp.gate_proj.weight": "pytorch_model-00056-of-00077.bin",
354
+ "model.layers.43.mlp.up_proj.weight": "pytorch_model-00056-of-00077.bin",
355
+ "model.layers.43.self_attn.k_proj.weight": "pytorch_model-00055-of-00077.bin",
356
+ "model.layers.43.self_attn.o_proj.weight": "pytorch_model-00056-of-00077.bin",
357
+ "model.layers.43.self_attn.q_proj.weight": "pytorch_model-00055-of-00077.bin",
358
+ "model.layers.43.self_attn.v_proj.weight": "pytorch_model-00056-of-00077.bin",
359
+ "model.layers.44.ln1.weight": "pytorch_model-00058-of-00077.bin",
360
+ "model.layers.44.ln2.weight": "pytorch_model-00058-of-00077.bin",
361
+ "model.layers.44.mlp.down_proj.weight": "pytorch_model-00057-of-00077.bin",
362
+ "model.layers.44.mlp.gate_proj.weight": "pytorch_model-00057-of-00077.bin",
363
+ "model.layers.44.mlp.up_proj.weight": "pytorch_model-00058-of-00077.bin",
364
+ "model.layers.44.self_attn.k_proj.weight": "pytorch_model-00057-of-00077.bin",
365
+ "model.layers.44.self_attn.o_proj.weight": "pytorch_model-00057-of-00077.bin",
366
+ "model.layers.44.self_attn.q_proj.weight": "pytorch_model-00057-of-00077.bin",
367
+ "model.layers.44.self_attn.v_proj.weight": "pytorch_model-00057-of-00077.bin",
368
+ "model.layers.45.ln1.weight": "pytorch_model-00059-of-00077.bin",
369
+ "model.layers.45.ln2.weight": "pytorch_model-00059-of-00077.bin",
370
+ "model.layers.45.mlp.down_proj.weight": "pytorch_model-00059-of-00077.bin",
371
+ "model.layers.45.mlp.gate_proj.weight": "pytorch_model-00058-of-00077.bin",
372
+ "model.layers.45.mlp.up_proj.weight": "pytorch_model-00059-of-00077.bin",
373
+ "model.layers.45.self_attn.k_proj.weight": "pytorch_model-00058-of-00077.bin",
374
+ "model.layers.45.self_attn.o_proj.weight": "pytorch_model-00058-of-00077.bin",
375
+ "model.layers.45.self_attn.q_proj.weight": "pytorch_model-00058-of-00077.bin",
376
+ "model.layers.45.self_attn.v_proj.weight": "pytorch_model-00058-of-00077.bin",
377
+ "model.layers.46.ln1.weight": "pytorch_model-00060-of-00077.bin",
378
+ "model.layers.46.ln2.weight": "pytorch_model-00060-of-00077.bin",
379
+ "model.layers.46.mlp.down_proj.weight": "pytorch_model-00060-of-00077.bin",
380
+ "model.layers.46.mlp.gate_proj.weight": "pytorch_model-00060-of-00077.bin",
381
+ "model.layers.46.mlp.up_proj.weight": "pytorch_model-00060-of-00077.bin",
382
+ "model.layers.46.self_attn.k_proj.weight": "pytorch_model-00059-of-00077.bin",
383
+ "model.layers.46.self_attn.o_proj.weight": "pytorch_model-00059-of-00077.bin",
384
+ "model.layers.46.self_attn.q_proj.weight": "pytorch_model-00059-of-00077.bin",
385
+ "model.layers.46.self_attn.v_proj.weight": "pytorch_model-00059-of-00077.bin",
386
+ "model.layers.47.ln1.weight": "pytorch_model-00061-of-00077.bin",
387
+ "model.layers.47.ln2.weight": "pytorch_model-00061-of-00077.bin",
388
+ "model.layers.47.mlp.down_proj.weight": "pytorch_model-00061-of-00077.bin",
389
+ "model.layers.47.mlp.gate_proj.weight": "pytorch_model-00061-of-00077.bin",
390
+ "model.layers.47.mlp.up_proj.weight": "pytorch_model-00061-of-00077.bin",
391
+ "model.layers.47.self_attn.k_proj.weight": "pytorch_model-00060-of-00077.bin",
392
+ "model.layers.47.self_attn.o_proj.weight": "pytorch_model-00061-of-00077.bin",
393
+ "model.layers.47.self_attn.q_proj.weight": "pytorch_model-00060-of-00077.bin",
394
+ "model.layers.47.self_attn.v_proj.weight": "pytorch_model-00061-of-00077.bin",
395
+ "model.layers.48.ln1.weight": "pytorch_model-00063-of-00077.bin",
396
+ "model.layers.48.ln2.weight": "pytorch_model-00063-of-00077.bin",
397
+ "model.layers.48.mlp.down_proj.weight": "pytorch_model-00062-of-00077.bin",
398
+ "model.layers.48.mlp.gate_proj.weight": "pytorch_model-00062-of-00077.bin",
399
+ "model.layers.48.mlp.up_proj.weight": "pytorch_model-00063-of-00077.bin",
400
+ "model.layers.48.self_attn.k_proj.weight": "pytorch_model-00062-of-00077.bin",
401
+ "model.layers.48.self_attn.o_proj.weight": "pytorch_model-00062-of-00077.bin",
402
+ "model.layers.48.self_attn.q_proj.weight": "pytorch_model-00062-of-00077.bin",
403
+ "model.layers.48.self_attn.v_proj.weight": "pytorch_model-00062-of-00077.bin",
404
+ "model.layers.49.ln1.weight": "pytorch_model-00064-of-00077.bin",
405
+ "model.layers.49.ln2.weight": "pytorch_model-00064-of-00077.bin",
406
+ "model.layers.49.mlp.down_proj.weight": "pytorch_model-00064-of-00077.bin",
407
+ "model.layers.49.mlp.gate_proj.weight": "pytorch_model-00063-of-00077.bin",
408
+ "model.layers.49.mlp.up_proj.weight": "pytorch_model-00064-of-00077.bin",
409
+ "model.layers.49.self_attn.k_proj.weight": "pytorch_model-00063-of-00077.bin",
410
+ "model.layers.49.self_attn.o_proj.weight": "pytorch_model-00063-of-00077.bin",
411
+ "model.layers.49.self_attn.q_proj.weight": "pytorch_model-00063-of-00077.bin",
412
+ "model.layers.49.self_attn.v_proj.weight": "pytorch_model-00063-of-00077.bin",
413
+ "model.layers.5.ln1.weight": "pytorch_model-00009-of-00077.bin",
414
+ "model.layers.5.ln2.weight": "pytorch_model-00009-of-00077.bin",
415
+ "model.layers.5.mlp.down_proj.weight": "pytorch_model-00009-of-00077.bin",
416
+ "model.layers.5.mlp.gate_proj.weight": "pytorch_model-00008-of-00077.bin",
417
+ "model.layers.5.mlp.up_proj.weight": "pytorch_model-00009-of-00077.bin",
418
+ "model.layers.5.self_attn.k_proj.weight": "pytorch_model-00008-of-00077.bin",
419
+ "model.layers.5.self_attn.o_proj.weight": "pytorch_model-00008-of-00077.bin",
420
+ "model.layers.5.self_attn.q_proj.weight": "pytorch_model-00008-of-00077.bin",
421
+ "model.layers.5.self_attn.v_proj.weight": "pytorch_model-00008-of-00077.bin",
422
+ "model.layers.50.ln1.weight": "pytorch_model-00065-of-00077.bin",
423
+ "model.layers.50.ln2.weight": "pytorch_model-00065-of-00077.bin",
424
+ "model.layers.50.mlp.down_proj.weight": "pytorch_model-00065-of-00077.bin",
425
+ "model.layers.50.mlp.gate_proj.weight": "pytorch_model-00065-of-00077.bin",
426
+ "model.layers.50.mlp.up_proj.weight": "pytorch_model-00065-of-00077.bin",
427
+ "model.layers.50.self_attn.k_proj.weight": "pytorch_model-00064-of-00077.bin",
428
+ "model.layers.50.self_attn.o_proj.weight": "pytorch_model-00064-of-00077.bin",
429
+ "model.layers.50.self_attn.q_proj.weight": "pytorch_model-00064-of-00077.bin",
430
+ "model.layers.50.self_attn.v_proj.weight": "pytorch_model-00064-of-00077.bin",
431
+ "model.layers.51.ln1.weight": "pytorch_model-00066-of-00077.bin",
432
+ "model.layers.51.ln2.weight": "pytorch_model-00066-of-00077.bin",
433
+ "model.layers.51.mlp.down_proj.weight": "pytorch_model-00066-of-00077.bin",
434
+ "model.layers.51.mlp.gate_proj.weight": "pytorch_model-00066-of-00077.bin",
435
+ "model.layers.51.mlp.up_proj.weight": "pytorch_model-00066-of-00077.bin",
436
+ "model.layers.51.self_attn.k_proj.weight": "pytorch_model-00065-of-00077.bin",
437
+ "model.layers.51.self_attn.o_proj.weight": "pytorch_model-00066-of-00077.bin",
438
+ "model.layers.51.self_attn.q_proj.weight": "pytorch_model-00065-of-00077.bin",
439
+ "model.layers.51.self_attn.v_proj.weight": "pytorch_model-00066-of-00077.bin",
440
+ "model.layers.52.ln1.weight": "pytorch_model-00068-of-00077.bin",
441
+ "model.layers.52.ln2.weight": "pytorch_model-00068-of-00077.bin",
442
+ "model.layers.52.mlp.down_proj.weight": "pytorch_model-00067-of-00077.bin",
443
+ "model.layers.52.mlp.gate_proj.weight": "pytorch_model-00067-of-00077.bin",
444
+ "model.layers.52.mlp.up_proj.weight": "pytorch_model-00068-of-00077.bin",
445
+ "model.layers.52.self_attn.k_proj.weight": "pytorch_model-00067-of-00077.bin",
446
+ "model.layers.52.self_attn.o_proj.weight": "pytorch_model-00067-of-00077.bin",
447
+ "model.layers.52.self_attn.q_proj.weight": "pytorch_model-00067-of-00077.bin",
448
+ "model.layers.52.self_attn.v_proj.weight": "pytorch_model-00067-of-00077.bin",
449
+ "model.layers.53.ln1.weight": "pytorch_model-00069-of-00077.bin",
450
+ "model.layers.53.ln2.weight": "pytorch_model-00069-of-00077.bin",
451
+ "model.layers.53.mlp.down_proj.weight": "pytorch_model-00069-of-00077.bin",
452
+ "model.layers.53.mlp.gate_proj.weight": "pytorch_model-00068-of-00077.bin",
453
+ "model.layers.53.mlp.up_proj.weight": "pytorch_model-00069-of-00077.bin",
454
+ "model.layers.53.self_attn.k_proj.weight": "pytorch_model-00068-of-00077.bin",
455
+ "model.layers.53.self_attn.o_proj.weight": "pytorch_model-00068-of-00077.bin",
456
+ "model.layers.53.self_attn.q_proj.weight": "pytorch_model-00068-of-00077.bin",
457
+ "model.layers.53.self_attn.v_proj.weight": "pytorch_model-00068-of-00077.bin",
458
+ "model.layers.54.ln1.weight": "pytorch_model-00070-of-00077.bin",
459
+ "model.layers.54.ln2.weight": "pytorch_model-00070-of-00077.bin",
460
+ "model.layers.54.mlp.down_proj.weight": "pytorch_model-00070-of-00077.bin",
461
+ "model.layers.54.mlp.gate_proj.weight": "pytorch_model-00070-of-00077.bin",
462
+ "model.layers.54.mlp.up_proj.weight": "pytorch_model-00070-of-00077.bin",
463
+ "model.layers.54.self_attn.k_proj.weight": "pytorch_model-00069-of-00077.bin",
464
+ "model.layers.54.self_attn.o_proj.weight": "pytorch_model-00069-of-00077.bin",
465
+ "model.layers.54.self_attn.q_proj.weight": "pytorch_model-00069-of-00077.bin",
466
+ "model.layers.54.self_attn.v_proj.weight": "pytorch_model-00069-of-00077.bin",
467
+ "model.layers.55.ln1.weight": "pytorch_model-00071-of-00077.bin",
468
+ "model.layers.55.ln2.weight": "pytorch_model-00071-of-00077.bin",
469
+ "model.layers.55.mlp.down_proj.weight": "pytorch_model-00071-of-00077.bin",
470
+ "model.layers.55.mlp.gate_proj.weight": "pytorch_model-00071-of-00077.bin",
471
+ "model.layers.55.mlp.up_proj.weight": "pytorch_model-00071-of-00077.bin",
472
+ "model.layers.55.self_attn.k_proj.weight": "pytorch_model-00070-of-00077.bin",
473
+ "model.layers.55.self_attn.o_proj.weight": "pytorch_model-00071-of-00077.bin",
474
+ "model.layers.55.self_attn.q_proj.weight": "pytorch_model-00070-of-00077.bin",
475
+ "model.layers.55.self_attn.v_proj.weight": "pytorch_model-00071-of-00077.bin",
476
+ "model.layers.56.ln1.weight": "pytorch_model-00073-of-00077.bin",
477
+ "model.layers.56.ln2.weight": "pytorch_model-00073-of-00077.bin",
478
+ "model.layers.56.mlp.down_proj.weight": "pytorch_model-00072-of-00077.bin",
479
+ "model.layers.56.mlp.gate_proj.weight": "pytorch_model-00072-of-00077.bin",
480
+ "model.layers.56.mlp.up_proj.weight": "pytorch_model-00073-of-00077.bin",
481
+ "model.layers.56.self_attn.k_proj.weight": "pytorch_model-00072-of-00077.bin",
482
+ "model.layers.56.self_attn.o_proj.weight": "pytorch_model-00072-of-00077.bin",
483
+ "model.layers.56.self_attn.q_proj.weight": "pytorch_model-00072-of-00077.bin",
484
+ "model.layers.56.self_attn.v_proj.weight": "pytorch_model-00072-of-00077.bin",
485
+ "model.layers.57.ln1.weight": "pytorch_model-00074-of-00077.bin",
486
+ "model.layers.57.ln2.weight": "pytorch_model-00074-of-00077.bin",
487
+ "model.layers.57.mlp.down_proj.weight": "pytorch_model-00074-of-00077.bin",
488
+ "model.layers.57.mlp.gate_proj.weight": "pytorch_model-00073-of-00077.bin",
489
+ "model.layers.57.mlp.up_proj.weight": "pytorch_model-00074-of-00077.bin",
490
+ "model.layers.57.self_attn.k_proj.weight": "pytorch_model-00073-of-00077.bin",
491
+ "model.layers.57.self_attn.o_proj.weight": "pytorch_model-00073-of-00077.bin",
492
+ "model.layers.57.self_attn.q_proj.weight": "pytorch_model-00073-of-00077.bin",
493
+ "model.layers.57.self_attn.v_proj.weight": "pytorch_model-00073-of-00077.bin",
494
+ "model.layers.58.ln1.weight": "pytorch_model-00075-of-00077.bin",
495
+ "model.layers.58.ln2.weight": "pytorch_model-00075-of-00077.bin",
496
+ "model.layers.58.mlp.down_proj.weight": "pytorch_model-00075-of-00077.bin",
497
+ "model.layers.58.mlp.gate_proj.weight": "pytorch_model-00075-of-00077.bin",
498
+ "model.layers.58.mlp.up_proj.weight": "pytorch_model-00075-of-00077.bin",
499
+ "model.layers.58.self_attn.k_proj.weight": "pytorch_model-00074-of-00077.bin",
500
+ "model.layers.58.self_attn.o_proj.weight": "pytorch_model-00074-of-00077.bin",
501
+ "model.layers.58.self_attn.q_proj.weight": "pytorch_model-00074-of-00077.bin",
502
+ "model.layers.58.self_attn.v_proj.weight": "pytorch_model-00074-of-00077.bin",
503
+ "model.layers.59.ln1.weight": "pytorch_model-00076-of-00077.bin",
504
+ "model.layers.59.ln2.weight": "pytorch_model-00076-of-00077.bin",
505
+ "model.layers.59.mlp.down_proj.weight": "pytorch_model-00076-of-00077.bin",
506
+ "model.layers.59.mlp.gate_proj.weight": "pytorch_model-00076-of-00077.bin",
507
+ "model.layers.59.mlp.up_proj.weight": "pytorch_model-00076-of-00077.bin",
508
+ "model.layers.59.self_attn.k_proj.weight": "pytorch_model-00075-of-00077.bin",
509
+ "model.layers.59.self_attn.o_proj.weight": "pytorch_model-00076-of-00077.bin",
510
+ "model.layers.59.self_attn.q_proj.weight": "pytorch_model-00075-of-00077.bin",
511
+ "model.layers.59.self_attn.v_proj.weight": "pytorch_model-00076-of-00077.bin",
512
+ "model.layers.6.ln1.weight": "pytorch_model-00010-of-00077.bin",
513
+ "model.layers.6.ln2.weight": "pytorch_model-00010-of-00077.bin",
514
+ "model.layers.6.mlp.down_proj.weight": "pytorch_model-00010-of-00077.bin",
515
+ "model.layers.6.mlp.gate_proj.weight": "pytorch_model-00010-of-00077.bin",
516
+ "model.layers.6.mlp.up_proj.weight": "pytorch_model-00010-of-00077.bin",
517
+ "model.layers.6.self_attn.k_proj.weight": "pytorch_model-00009-of-00077.bin",
518
+ "model.layers.6.self_attn.o_proj.weight": "pytorch_model-00009-of-00077.bin",
519
+ "model.layers.6.self_attn.q_proj.weight": "pytorch_model-00009-of-00077.bin",
520
+ "model.layers.6.self_attn.v_proj.weight": "pytorch_model-00009-of-00077.bin",
521
+ "model.layers.7.ln1.weight": "pytorch_model-00011-of-00077.bin",
522
+ "model.layers.7.ln2.weight": "pytorch_model-00011-of-00077.bin",
523
+ "model.layers.7.mlp.down_proj.weight": "pytorch_model-00011-of-00077.bin",
524
+ "model.layers.7.mlp.gate_proj.weight": "pytorch_model-00011-of-00077.bin",
525
+ "model.layers.7.mlp.up_proj.weight": "pytorch_model-00011-of-00077.bin",
526
+ "model.layers.7.self_attn.k_proj.weight": "pytorch_model-00010-of-00077.bin",
527
+ "model.layers.7.self_attn.o_proj.weight": "pytorch_model-00011-of-00077.bin",
528
+ "model.layers.7.self_attn.q_proj.weight": "pytorch_model-00010-of-00077.bin",
529
+ "model.layers.7.self_attn.v_proj.weight": "pytorch_model-00011-of-00077.bin",
530
+ "model.layers.8.ln1.weight": "pytorch_model-00013-of-00077.bin",
531
+ "model.layers.8.ln2.weight": "pytorch_model-00013-of-00077.bin",
532
+ "model.layers.8.mlp.down_proj.weight": "pytorch_model-00012-of-00077.bin",
533
+ "model.layers.8.mlp.gate_proj.weight": "pytorch_model-00012-of-00077.bin",
534
+ "model.layers.8.mlp.up_proj.weight": "pytorch_model-00013-of-00077.bin",
535
+ "model.layers.8.self_attn.k_proj.weight": "pytorch_model-00012-of-00077.bin",
536
+ "model.layers.8.self_attn.o_proj.weight": "pytorch_model-00012-of-00077.bin",
537
+ "model.layers.8.self_attn.q_proj.weight": "pytorch_model-00012-of-00077.bin",
538
+ "model.layers.8.self_attn.v_proj.weight": "pytorch_model-00012-of-00077.bin",
539
+ "model.layers.9.ln1.weight": "pytorch_model-00014-of-00077.bin",
540
+ "model.layers.9.ln2.weight": "pytorch_model-00014-of-00077.bin",
541
+ "model.layers.9.mlp.down_proj.weight": "pytorch_model-00014-of-00077.bin",
542
+ "model.layers.9.mlp.gate_proj.weight": "pytorch_model-00013-of-00077.bin",
543
+ "model.layers.9.mlp.up_proj.weight": "pytorch_model-00014-of-00077.bin",
544
+ "model.layers.9.self_attn.k_proj.weight": "pytorch_model-00013-of-00077.bin",
545
+ "model.layers.9.self_attn.o_proj.weight": "pytorch_model-00013-of-00077.bin",
546
+ "model.layers.9.self_attn.q_proj.weight": "pytorch_model-00013-of-00077.bin",
547
+ "model.layers.9.self_attn.v_proj.weight": "pytorch_model-00013-of-00077.bin",
548
+ "model.norm.weight": "pytorch_model-00076-of-00077.bin"
549
+ }
550
+ }
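The index above is the standard Hugging Face sharded-checkpoint manifest: `metadata.total_size` records the combined tensor size (roughly 69 GB, consistent with ~34B parameters stored in 16-bit), and `weight_map` assigns every tensor name to one of the 77 `pytorch_model-*.bin` shards. Purely as an illustration (not part of the upload), a minimal sketch for inspecting the manifest locally, assuming the file has already been downloaded into the working directory:

```
import json
from collections import Counter

# Assumes pytorch_model.bin.index.json sits in the working directory.
with open("pytorch_model.bin.index.json") as f:
    index = json.load(f)

size_gb = index["metadata"]["total_size"] / 1e9
tensors = index["weight_map"]
shards = Counter(tensors.values())  # how many tensors each shard holds

print(f"declared size: {size_gb:.1f} GB, {len(tensors)} tensors across {len(shards)} shards")
for shard, count in sorted(shards.items())[:3]:
    print(f"  {shard}: {count} tensors")
```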
special_tokens_map.json ADDED
@@ -0,0 +1,30 @@
1
+ {
2
+ "bos_token": {
3
+ "content": "<|startoftext|>",
4
+ "lstrip": false,
5
+ "normalized": true,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "eos_token": {
10
+ "content": "<|endoftext|>",
11
+ "lstrip": false,
12
+ "normalized": true,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "pad_token": {
17
+ "content": "<unk>",
18
+ "lstrip": false,
19
+ "normalized": true,
20
+ "rstrip": false,
21
+ "single_word": false
22
+ },
23
+ "unk_token": {
24
+ "content": "<unk>",
25
+ "lstrip": false,
26
+ "normalized": true,
27
+ "rstrip": false,
28
+ "single_word": false
29
+ }
30
+ }
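The map above pins the Yi control tokens: `<|startoftext|>` as BOS, `<|endoftext|>` as EOS, and `<unk>` doing double duty as PAD and UNK. A tiny, purely illustrative check of a downloaded copy (assuming the file sits in the working directory):

```
import json

# Assumes special_tokens_map.json has been downloaded into the working directory.
with open("special_tokens_map.json") as f:
    specials = json.load(f)

for role, tok in specials.items():
    print(f"{role:>10}: {tok['content']!r}")
# pad and unk share the same piece, <unk>, so padded positions decode as <unk>
# unless they are masked out via the attention mask.
```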
tokenization_yi.py ADDED
@@ -0,0 +1,255 @@
1
+ import os
2
+ from shutil import copyfile
3
+ from typing import Any, Dict, List, Optional, Tuple
4
+
5
+ import sentencepiece as spm
6
+ from transformers.tokenization_utils import AddedToken, PreTrainedTokenizer
7
+ from transformers.utils import logging
8
+
9
+ logger = logging.get_logger(__name__)
10
+
11
+ VOCAB_FILES_NAMES = {"vocab_file": "tokenizer.model"}
12
+
13
+ PRETRAINED_VOCAB_FILES_MAP = {
14
+ "vocab_file": {},
15
+ "tokenizer_file": {},
16
+ }
17
+ PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {}
18
+
19
+
20
+ class YiTokenizer(PreTrainedTokenizer):
21
+ """
22
+ Construct a Yi tokenizer, backed by a SentencePiece Byte-Pair-Encoding model.
23
+
24
+ Args:
25
+ vocab_file (`str`):
26
+ Path to the vocabulary file.
27
+ """
28
+
29
+ vocab_files_names = VOCAB_FILES_NAMES
30
+ pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
31
+ max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
32
+ model_input_names = ["input_ids", "attention_mask"]
33
+
34
+ def __init__(
35
+ self,
36
+ vocab_file,
37
+ unk_token="<unk>",
38
+ bos_token="<|startoftext|>",
39
+ eos_token="<|endoftext|>",
40
+ pad_token="<unk>",
41
+ sp_model_kwargs: Optional[Dict[str, Any]] = None,
42
+ add_bos_token=True,
43
+ add_eos_token=False,
44
+ clean_up_tokenization_spaces=False,
45
+ **kwargs,
46
+ ):
47
+ self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
48
+ bos_token = (
49
+ AddedToken(bos_token, lstrip=False, rstrip=False)
50
+ if isinstance(bos_token, str)
51
+ else bos_token
52
+ )
53
+ eos_token = (
54
+ AddedToken(eos_token, lstrip=False, rstrip=False)
55
+ if isinstance(eos_token, str)
56
+ else eos_token
57
+ )
58
+ unk_token = (
59
+ AddedToken(unk_token, lstrip=False, rstrip=False)
60
+ if isinstance(unk_token, str)
61
+ else unk_token
62
+ )
63
+ pad_token = (
64
+ AddedToken(pad_token, lstrip=False, rstrip=False)
65
+ if isinstance(pad_token, str)
66
+ else pad_token
67
+ )
68
+ self.vocab_file = vocab_file
69
+ self.add_bos_token = add_bos_token
70
+ self.add_eos_token = add_eos_token
71
+ self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
72
+ self.sp_model.Load(vocab_file)
73
+ super().__init__(
74
+ bos_token=bos_token,
75
+ eos_token=eos_token,
76
+ unk_token=unk_token,
77
+ pad_token=pad_token,
78
+ add_bos_token=add_bos_token,
79
+ add_eos_token=add_eos_token,
80
+ sp_model_kwargs=self.sp_model_kwargs,
81
+ clean_up_tokenization_spaces=clean_up_tokenization_spaces,
82
+ **kwargs,
83
+ )
84
+
85
+ def __getstate__(self):
86
+ state = self.__dict__.copy()
87
+ state["sp_model"] = None
88
+ return state
89
+
90
+ def __setstate__(self, d):
91
+ self.__dict__ = d
92
+ self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
93
+ self.sp_model.Load(self.vocab_file)
94
+
95
+ @property
96
+ def vocab_size(self):
97
+ """Returns vocab size"""
98
+ return self.sp_model.get_piece_size()
99
+
100
+ def get_vocab(self):
101
+ """Returns vocab as a dict"""
102
+ vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
103
+ vocab.update(self.added_tokens_encoder)
104
+ return vocab
105
+
106
+ def _tokenize(self, text):
107
+ """Returns a tokenized string."""
108
+ return self.sp_model.encode(text, out_type=str)
109
+
110
+ def _convert_token_to_id(self, token):
111
+ """Converts a token (str) in an id using the vocab."""
112
+ return self.sp_model.piece_to_id(token)
113
+
114
+ def _convert_id_to_token(self, index):
115
+ """Converts an index (integer) in a token (str) using the vocab."""
116
+ token = self.sp_model.IdToPiece(index)
117
+ return token
118
+
119
+ def convert_tokens_to_string(self, tokens):
120
+ """Converts a sequence of tokens (string) in a single string."""
121
+ current_sub_tokens = []
122
+ out_string = ""
123
+ prev_is_special = False
124
+ for i, token in enumerate(tokens):
125
+ # make sure that special tokens are not decoded using sentencepiece model
126
+ if token in self.all_special_tokens:
127
+ if not prev_is_special and i != 0:
128
+ out_string += " "
129
+ out_string += self.sp_model.decode(current_sub_tokens) + token
130
+ prev_is_special = True
131
+ current_sub_tokens = []
132
+ else:
133
+ current_sub_tokens.append(token)
134
+ prev_is_special = False
135
+ out_string += self.sp_model.decode(current_sub_tokens)
136
+ return out_string
137
+
138
+ def save_vocabulary(
139
+ self, save_directory, filename_prefix: Optional[str] = None
140
+ ) -> Tuple[str]:
141
+ """
142
+ Save the vocabulary and special tokens file to a directory.
143
+
144
+ Args:
145
+ save_directory (`str`):
146
+ The directory in which to save the vocabulary.
147
+
148
+ Returns:
149
+ `Tuple[str]`: Paths to the files saved.
150
+ """
151
+ if not os.path.isdir(save_directory):
152
+ logger.error(f"Vocabulary path ({save_directory}) should be a directory")
153
+ return
154
+ out_vocab_file = os.path.join(
155
+ save_directory,
156
+ (filename_prefix + "-" if filename_prefix else "")
157
+ + VOCAB_FILES_NAMES["vocab_file"],
158
+ )
159
+
160
+ if os.path.abspath(self.vocab_file) != os.path.abspath(
161
+ out_vocab_file
162
+ ) and os.path.isfile(self.vocab_file):
163
+ copyfile(self.vocab_file, out_vocab_file)
164
+ elif not os.path.isfile(self.vocab_file):
165
+ with open(out_vocab_file, "wb") as fi:
166
+ content_spiece_model = self.sp_model.serialized_model_proto()
167
+ fi.write(content_spiece_model)
168
+
169
+ return (out_vocab_file,)
170
+
171
+ def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
172
+ bos_token_id = [self.bos_token_id] if self.add_bos_token else []
173
+ eos_token_id = [self.eos_token_id] if self.add_eos_token else []
174
+
175
+ output = bos_token_id + token_ids_0 + eos_token_id
176
+
177
+ if token_ids_1 is not None:
178
+ output = output + bos_token_id + token_ids_1 + eos_token_id
179
+
180
+ return output
181
+
182
+ def get_special_tokens_mask(
183
+ self,
184
+ token_ids_0: List[int],
185
+ token_ids_1: Optional[List[int]] = None,
186
+ already_has_special_tokens: bool = False,
187
+ ) -> List[int]:
188
+ """
189
+ Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
190
+ special tokens using the tokenizer `prepare_for_model` method.
191
+
192
+ Args:
193
+ token_ids_0 (`List[int]`):
194
+ List of IDs.
195
+ token_ids_1 (`List[int]`, *optional*):
196
+ Optional second list of IDs for sequence pairs.
197
+ already_has_special_tokens (`bool`, *optional*, defaults to `False`):
198
+ Whether or not the token list is already formatted with special tokens for the model.
199
+
200
+ Returns:
201
+ `List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
202
+ """
203
+ if already_has_special_tokens:
204
+ return super().get_special_tokens_mask(
205
+ token_ids_0=token_ids_0,
206
+ token_ids_1=token_ids_1,
207
+ already_has_special_tokens=True,
208
+ )
209
+
210
+ bos_token_id = [1] if self.add_bos_token else []
211
+ eos_token_id = [1] if self.add_eos_token else []
212
+
213
+ if token_ids_1 is None:
214
+ return bos_token_id + ([0] * len(token_ids_0)) + eos_token_id
215
+ return (
216
+ bos_token_id
217
+ + ([0] * len(token_ids_0))
218
+ + eos_token_id
219
+ + bos_token_id
220
+ + ([0] * len(token_ids_1))
221
+ + eos_token_id
222
+ )
223
+
224
+ def create_token_type_ids_from_sequences(
225
+ self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
226
+ ) -> List[int]:
227
+ """
228
+ Creates a mask from the two sequences passed, to be used in a sequence-pair classification task. A
229
+ sequence pair mask has the following format:
230
+
231
+ ```
232
+ 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
233
+ | first sequence | second sequence |
234
+ ```
235
+
236
+ If token_ids_1 is None, only the first portion of the mask (0s) is returned.
237
+
238
+ Args:
239
+ token_ids_0 (`List[int]`):
240
+ List of ids.
241
+ token_ids_1 (`List[int]`, *optional*):
242
+ Optional second list of IDs for sequence pairs.
243
+
244
+ Returns:
245
+ `List[int]`: List of [token type IDs](../glossary#token-type-ids) according to the given sequence(s).
246
+ """
247
+ bos_token_id = [self.bos_token_id] if self.add_bos_token else []
248
+ eos_token_id = [self.eos_token_id] if self.add_eos_token else []
249
+
250
+ output = [0] * len(bos_token_id + token_ids_0 + eos_token_id)
251
+
252
+ if token_ids_1 is not None:
253
+ output += [1] * len(bos_token_id + token_ids_1 + eos_token_id)
254
+
255
+ return output
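tokenization_yi.py is a thin SentencePiece wrapper in the LLaMA style: `_tokenize` defers to `sp_model.encode`, `build_inputs_with_special_tokens` optionally wraps sequences in BOS/EOS, and `save_vocabulary` copies or re-serializes `tokenizer.model`. Note that the class defaults to `add_bos_token=True`, while the `tokenizer_config.json` shipped below sets it to `false`; the config value should take precedence when loading through `AutoTokenizer`. A minimal round-trip sketch, assuming `transformers` is installed and `./deepsex-34b` is a placeholder path to the downloaded repo:

```
from transformers import AutoTokenizer

ckpt = "./deepsex-34b"  # placeholder: local path to the downloaded checkpoint
# trust_remote_code is needed because the repo ships its own YiTokenizer class.
tok = AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True)

text = "We should meet up in front of the shop after classes."
ids = tok(text).input_ids   # no BOS prepended: tokenizer_config.json sets add_bos_token to false
print(ids[:8])
print(tok.decode(ids))      # the SentencePiece round-trip should recover the original text
# With the class default (add_bos_token=True), the first id would be the <|startoftext|> id instead.
```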
tokenizer.model ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:386c49cf943d71aa110361135338c50e38beeff0a66593480421f37b319e1a39
3
+ size 1033105
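tokenizer.model itself is stored through Git LFS, so only the pointer above (spec version, SHA-256, and size) is committed. After fetching the real file, the download can be checked against the pointer; a small verification sketch, assuming the fetched `tokenizer.model` sits in the working directory:

```
import hashlib
from pathlib import Path

# Expected values copied from the LFS pointer committed above.
EXPECTED_SHA256 = "386c49cf943d71aa110361135338c50e38beeff0a66593480421f37b319e1a39"
EXPECTED_SIZE = 1033105

data = Path("tokenizer.model").read_bytes()
assert len(data) == EXPECTED_SIZE, f"size mismatch: {len(data)}"
assert hashlib.sha256(data).hexdigest() == EXPECTED_SHA256, "hash mismatch"
print("tokenizer.model matches the LFS pointer")
```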
tokenizer_config.json ADDED
@@ -0,0 +1,46 @@
1
+ {
2
+ "add_bos_token": false,
3
+ "add_eos_token": false,
4
+ "added_tokens_decoder": {
5
+ "0": {
6
+ "content": "<unk>",
7
+ "lstrip": false,
8
+ "normalized": true,
9
+ "rstrip": false,
10
+ "single_word": false,
11
+ "special": true
12
+ },
13
+ "1": {
14
+ "content": "<|startoftext|>",
15
+ "lstrip": false,
16
+ "normalized": true,
17
+ "rstrip": false,
18
+ "single_word": false,
19
+ "special": true
20
+ },
21
+ "2": {
22
+ "content": "<|endoftext|>",
23
+ "lstrip": false,
24
+ "normalized": true,
25
+ "rstrip": false,
26
+ "single_word": false,
27
+ "special": true
28
+ }
29
+ },
30
+ "auto_map": {
31
+ "AutoTokenizer": [
32
+ "tokenization_yi.YiTokenizer",
33
+ null
34
+ ]
35
+ },
36
+ "bos_token": "<|startoftext|>",
37
+ "clean_up_tokenization_spaces": false,
38
+ "eos_token": "<|endoftext|>",
39
+ "model_max_length": 4096,
40
+ "pad_token": "<unk>",
41
+ "padding_side": "left",
42
+ "sp_model_kwargs": {},
43
+ "split_special_tokens": false,
44
+ "tokenizer_class": "YiTokenizer",
45
+ "unk_token": "<unk>"
46
+ }
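tokenizer_config.json ties the pieces together: `auto_map` points `AutoTokenizer` at the custom `tokenization_yi.YiTokenizer` (hence `trust_remote_code=True` when loading), `model_max_length` is 4096, and `padding_side` is `left`, the usual choice for batched decoder-only generation so every prompt ends right where generation starts. A hedged batching sketch, again assuming `transformers` plus `torch` are installed and `./deepsex-34b` is a placeholder local path:

```
from transformers import AutoTokenizer

ckpt = "./deepsex-34b"  # placeholder: local path to the downloaded checkpoint
tok = AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True)

prompts = ["Describe the tavern.", "Mirai smiles and says:"]
# Left padding (from tokenizer_config.json) keeps each prompt flush with the end of
# the sequence; the pad token is <unk>, as declared in special_tokens_map.json.
batch = tok(prompts, padding=True, truncation=True, max_length=4096, return_tensors="pt")
print(batch["input_ids"].shape)
print(batch["attention_mask"][0])  # leading zeros mark the padded positions
```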