Text Generation
Transformers
PyTorch
English
code
gpt_jx
text-generation-inference
custom_code
KnutJaegersberg committed on
Commit
7f92913
1 Parent(s): f55f073

Upload 11 files

README.md CHANGED
@@ -1,3 +1,180 @@
1
  ---
2
  license: mit
3
  ---
1
  ---
2
  license: mit
3
+ datasets:
4
+ - tiiuae/falcon-refinedweb
5
+ - bigcode/the-stack-dedup
6
+ - cerebras/SlimPajama-627B
7
+ language:
8
+ - en
9
+ - code
10
+ library_name: transformers
11
+ pipeline_tag: text-generation
12
+ inference: false
13
+ tags:
14
+ - text-generation-inference
15
  ---
16
+ ### **Model Description**
17
+ ***GPT-JX*** is a **3 billion parameter** autoregressive foundational large language model pre-trained on **1.1 trillion tokens** of *high-quality*, *cleaned* and *deduplicated* English text and code. ***GPT-JX*** uses the base architecture of the traditional *Transformer decoder* with **slight changes**, which are described below. ***GPT-JX*** was pre-trained on tokens of **English text** and **20 programming languages**. ***GPT-JX*** shows impressive performance when compared to **large language models with 7 billion parameters** such as **LLaMa-7B-v2, Falcon-7B & MPT-7B**.
18
+
19
+ ### **Model Architecture**
20
+ We made slight changes to the traditional *Transformer decoder* to create the base architecture for our ***GPT-JX*** model. The changes are listed below:
21
+
22
+ - We used the **SwiGLU activation function** in the architecture of ***GPT-JX*** instead of ReLU.
23
+
24
+ - **Attention with Linear Biases (ALiBi)** was used to encode position in ***GPT-JX*** instead of absolute positional embeddings (as used in the traditional *Transformer decoder*) or *Rotary Positional Embeddings* (as used in **GPT-J** & **GPT-NeoX**).
25
+
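The two changes listed above can be sketched in plain Python. This is a minimal, scalar illustration rather than the model's tensor implementation: the function names are chosen here for illustration, the slope formula follows the `_get_alibi_slopes` helper in `modeling_gptjx.py`, and the bias uses the absolute distance between query and key positions (the causal mask is assumed to be applied separately).

```python
import math


def silu(x: float) -> float:
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def swiglu(x1: float, x2: float) -> float:
    """Gated activation: the first input is gated by SiLU of the second."""
    return x1 * silu(x2)


def alibi_slopes(n_heads: int) -> list[float]:
    """Per-head ALiBi slopes; a geometric sequence when n_heads is a power of two."""

    def power_of_two_slopes(n: int) -> list[float]:
        start = 2 ** (-(2 ** -(math.log2(n) - 3)))
        return [start ** (i + 1) for i in range(n)]

    if math.log2(n_heads).is_integer():
        return power_of_two_slopes(n_heads)
    # Otherwise use the closest smaller power of two and fill the remaining
    # heads with every other slope of the next power of two.
    closest = 2 ** math.floor(math.log2(n_heads))
    extra = alibi_slopes(2 * closest)[0::2][: n_heads - closest]
    return power_of_two_slopes(closest) + extra


def alibi_bias(slope: float, seq_len: int) -> list[list[float]]:
    """Bias added to the attention scores of one head: -slope * |key_pos - query_pos|."""
    return [[-slope * abs(k - q) for k in range(seq_len)] for q in range(seq_len)]


slopes = alibi_slopes(32)          # one slope per attention head
bias = alibi_bias(slopes[0], 4)    # 4 x 4 bias matrix for the first head
```

Because the bias depends only on the distance between positions, no position embedding table has to be learned, and heads with larger slopes attend more locally.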
26
+ ***GPT-JX's architectural specifications are listed below***
27
+
28
+ - **Trainable Parameters:** *2646255776*
29
+ - **Number of Layers(n<sub>layers</sub>):** *32*
30
+ - **Dimension of the Model(d<sub>model</sub>):** *2560*
31
+ - **Dimension of Feed Forward Network(d<sub>ff</sub>):** *6826*
32
+ - **Number of Heads(n<sub>heads</sub>):** *32*
33
+ - **Dimension of each Head(d<sub>head</sub>):** *80*
34
+ - **Sequence Length(n<sub>ctx</sub>):** *8192*
35
+ - **Vocab Size(n<sub>vocab</sub>):** *50257*
36
+ - **Positional Embedding:** *ALiBi*
37
+ - **Tokenizer:** *GPT-2/GPT-3*
38
+
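As a sanity check on the figures above, the parameter count can be approximated from the listed dimensions. The layer layout below is an assumption, not something the list states: a GPT-2 style block with biases on every dense layer, a fused QKV projection, two up-projections for SwiGLU, two LayerNorms per block plus a final one, token embeddings tied to the output head, and one stored ALiBi slope per head.

```python
# Values taken from the specification list above.
N_LAYERS = 32
D_MODEL = 2560
D_FF = 6826
N_HEADS = 32
N_VOCAB = 50257


def linear(n_in: int, n_out: int) -> int:
    """Weights plus bias of one dense layer."""
    return n_in * n_out + n_out


def count_parameters() -> int:
    """Total parameter count under the assumed GPT-2 style layout."""
    embeddings = N_VOCAB * D_MODEL  # token embeddings, assumed tied with the output head
    attention = linear(D_MODEL, 3 * D_MODEL) + linear(D_MODEL, D_MODEL)  # fused QKV + output projection
    mlp = 2 * linear(D_MODEL, D_FF) + linear(D_FF, D_MODEL)  # two up-projections (SwiGLU) + down-projection
    layer_norms = 2 * (2 * D_MODEL)  # two LayerNorms per block, scale and shift each
    per_layer = attention + mlp + layer_norms
    final_norm = 2 * D_MODEL
    alibi_slopes = N_HEADS  # one slope per head, stored as a non-trainable parameter
    return embeddings + N_LAYERS * per_layer + final_norm + alibi_slopes


head_dim = D_MODEL // N_HEADS
print(head_dim)            # 80
print(count_parameters())  # 2646255776
```

Under these assumptions the total reproduces the listed figure of 2646255776, and the head dimension matches the listed value of 80.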
39
+ ***GPT-JX*** was trained with a vocabulary size of 50257, using the same set of BPEs as GPT-2/GPT-3.
40
+
41
+ ### **Unsupervised Training Data(Pre-Training Data)**
42
+ ***GPT-JX*** was pre-trained on a *high-quality, cleaned* and *deduplicated* dataset mixture consisting of:
43
+
44
+ - **600B tokens** of Common Crawl English text from **RefinedWeb-Text**.
45
+
46
+ - **175B tokens** of code across 20 programming languages from **The-Stack-Dedup**.
47
+
48
+ - **327B tokens** from **SlimPajama** (*C4, GitHub, Wikipedia, ArXiv, StackExchange, Gutenberg Books*).
49
+
50
+ In total, the pre-training data sums to **1.1 trillion tokens**.
51
+
52
+ ***Brief Description of the Datasets***
53
+
54
+ - **RefinedWeb-Text** is a high-quality, deduplicated English **Common Crawl** text dataset released by the **Technology Innovation Institute**.
55
+
56
+ - **The-Stack-Dedup** is a cleaned and deduplicated version of **The-Stack**. The dataset covers *300+ programming languages* and was released by **BigCode**.
57
+
58
+ - **SlimPajama** is a cleaned, high-quality and deduplicated version of **RedPajama-Data**. The dataset contains English text from ***Common Crawl, C4, GitHub, Wikipedia, StackExchange and Gutenberg Books***, and was released by **Cerebras**.
59
+
60
+ ***Data Mixture Proportion***
61
+ | Dataset | Data Proportion | Tokens |
62
+ |---|---|---|
63
+ | **RefinedWeb-Text** | **54.4%** | **600B** |
64
+ | **The-Stack-Dedup** | **15.9%** | **175B** |
65
+ | **SlimPajama** | **29.7%** | **327B** |
66
+ | **Total Tokens** |---| **1.1T** |
67
+
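The proportions in the table follow directly from the token counts. A short check, using only the numbers listed above (the variable names are illustrative):

```python
# Token counts in billions, as listed in the data mixture table.
tokens_in_billions = {
    "RefinedWeb-Text": 600,
    "The-Stack-Dedup": 175,
    "SlimPajama": 327,
}

total = sum(tokens_in_billions.values())
shares = {name: 100 * count / total for name, count in tokens_in_billions.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1f}%")
print(f"total: {total}B tokens")
```

The sum is 1102B tokens, which the table rounds to 1.1T.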
68
+ <p><strong>&dagger;</strong> Information: GPT-JX was trained on 726 A100 40GB GPUs sponsored by StabilityAI and Cerebras. Special thanks to StabilityAI and Cerebras for sharing their GPUs.</p>
69
+
70
+ ### **Libraries and Inference**
71
+ Libraries required to use **GPT-JX** are:
72
+ ```
73
+ pip install torch transformers accelerate
74
+ ```
75
+ ***GPT-JX*** is currently only compatible with the ***Auto Classes of the Transformers library.***
76
+
77
+ Load ***GPT-JX*** using the Transformers Auto Classes:
78
+ ```python
79
+ import torch
80
+ from transformers import AutoModelForCausalLM, AutoTokenizer
81
+
82
+ model_repo = "alien-ai/gpt-jx-3b"
83
+ model = AutoModelForCausalLM.from_pretrained(
84
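+     # trust_remote_code is needed because the repo ships custom modeling code (see `auto_map` in config.json)
+     model_repo, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True,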
85
+ )
86
+ tokenizer = AutoTokenizer.from_pretrained(model_repo)
87
+ ```
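Once the model loads, text can be generated with the standard `generate` API. The helper below is a sketch under assumptions: the name `generate_text` and the sampling settings are illustrative choices, not values recommended by the authors, and it assumes the custom model class works with `generate` in the same way as built-in causal language models.

```python
def generate_text(prompt: str, max_new_tokens: int = 64) -> str:
    """Load GPT-JX and return the prompt followed by a sampled continuation."""
    # Heavy dependencies are imported lazily so that defining this helper stays cheap.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_repo = "alien-ai/gpt-jx-3b"
    tokenizer = AutoTokenizer.from_pretrained(model_repo)
    model = AutoModelForCausalLM.from_pretrained(
        model_repo,
        torch_dtype=torch.float16,
        device_map="auto",       # requires the `accelerate` package
        trust_remote_code=True,  # the repo ships its own modeling code
    )

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,      # illustrative sampling settings
            top_p=0.95,
            temperature=0.8,
            pad_token_id=tokenizer.eos_token_id,  # the tokenizer defines no separate pad token
        )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate_text("def fibonacci(n):")` downloads the weights on first use and returns the prompt with up to 64 newly sampled tokens appended.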
88
+
89
+ *In the future, we plan to release our own Python package for running inference with and fine-tuning our models in an efficient and user-friendly way.*
90
+
91
+ ### **Intended Use and Limitations**
92
+ ***GPT-JX*** learns an inner representation of the English language as well as programming languages that can be used to
93
+ extract features useful for downstream tasks. However, the model is best at what it was
94
+ pretrained for, which is generating text from a prompt.
95
+
96
+ ### **Out-of-scope use**
97
+ ***GPT-JX*** is **not** intended for deployment without fine-tuning, supervision,
98
+ and/or moderation. It is not in itself a product and cannot be used for
99
+ human-facing interactions. For example, the model may generate harmful or
100
+ offensive text. Please evaluate the risks associated with your particular use case.
101
+
102
+ ***GPT-JX*** was trained on an English-language only dataset, and is thus **not**
103
+ suitable for translation or generating text in other languages.
104
+
105
+ ### **Limitations and Biases**
106
+ The core functionality of ***GPT-JX*** is taking a string of text and predicting the next token. While language models are widely used for tasks other than this, there are a lot of unknowns with this work. When prompting ***GPT-JX*** it is important to remember that the statistically most likely next token is often not the token that produces the most **"accurate"** text. Never depend upon ***GPT-JX*** to produce factually accurate output.
107
+
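A toy illustration of the point above, in plain Python. The candidate tokens and their scores are hypothetical; the sketch only shows that the model ranks continuations by probability and that nothing in this step checks whether the top-ranked token is factually right.

```python
import math


def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next tokens.
candidates = ["Paris", "Lyon", "the"]
logits = [2.0, 1.5, 1.8]

probs = softmax(logits)
greedy = candidates[probs.index(max(probs))]  # the statistically most likely token
sharper = softmax(logits, temperature=0.5)    # same ranking, more confident

for token, p in zip(candidates, probs):
    print(f"{token}: {p:.3f}")
```

Even the top candidate here holds well under half of the probability mass, so a likely continuation is not the same as a verified one.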
108
+ ### **Evaluation**
109
+ Below are some evaluation results for ***GPT-JX*** in comparison to **LLaMa-7B-v2 and Falcon-7B**.
110
+
111
+ | Model | Average |
+ |---|---|
+ | GPT-JX | 51.9 |
+ | Falcon-7B | 53.5 |
+ | LLaMa-7B-v2 | 55 |
119
+
120
+ ### **License**
121
+ We release ***GPT-JX*** under the **MIT License (license provided by the Massachusetts Institute of Technology).**
122
+ ### **Citation**
123
+ <details>
124
+
125
+ ```bibtex
126
+ @article{vaswani2017attention,
127
+ title={Attention Is All You Need},
128
+ author={Ashish Vaswani and Noam Shazeer and Niki Parmar and Jakob Uszkoreit and Llion Jones and Aidan N. Gomez and Lukasz Kaiser and Illia Polosukhin},
129
+ journal={arXiv preprint arXiv:1706.03762},
130
+ eprint={1706.03762},
131
+ eprinttype = {arXiv},
132
+ url={https://arxiv.org/abs/1706.03762},
133
+ year={2017}
134
+ }
135
+ ```
136
+ ```bibtex
137
+ @article{refinedweb,
138
+ title={The {R}efined{W}eb dataset for {F}alcon {LLM}: outperforming curated corpora with web data, and web data only},
139
+ author={Guilherme Penedo and Quentin Malartic and Daniel Hesslow and Ruxandra Cojocaru and Alessandro Cappelli and Hamza Alobeidli and Baptiste Pannier and Ebtesam Almazrouei and Julien Launay},
140
+ journal={arXiv preprint arXiv:2306.01116},
141
+ eprint={2306.01116},
142
+ eprinttype = {arXiv},
143
+ url={https://arxiv.org/abs/2306.01116},
144
+ year={2023}
145
+ }
146
+ ```
147
+ ```bibtex
148
+ @article{shazeer2020glu,
149
+ title={GLU Variants Improve Transformer},
150
+ author={Noam Shazeer},
151
+ journal={arXiv preprint arXiv:2002.05202},
152
+ eprint={2002.05202},
153
+ eprinttype = {arXiv},
154
+ url={https://arxiv.org/abs/2002.05202},
155
+ year={2020}
156
+ }
157
+ ```
169
+ ```bibtex
170
+ @article{Kocetkov2022TheStack,
171
+ title={The Stack: 3 TB of permissively licensed source code},
172
+ author={Kocetkov, Denis and Li, Raymond and Ben Allal, Loubna and Li, Jia and Mou,Chenghao and Muñoz Ferrandis, Carlos and Jernite, Yacine and Mitchell, Margaret and Hughes, Sean and Wolf, Thomas and Bahdanau, Dzmitry and von Werra, Leandro and de Vries, Harm},
173
+ journal={Preprint},
174
+ eprint={2211.15533},
175
+ eprinttype={arXiv},
176
+ url={https://arxiv.org/abs/2211.15533},
177
+ year={2022}
178
+ }
179
+ ```
180
+ </details>
config.json ADDED
@@ -0,0 +1,37 @@
1
+ {
2
+ "_name_or_path": "cerebras/btlm-3b-8k-base",
3
+ "activation_function": "swiglu",
4
+ "architectures": [
5
+ "GPTJXForCausalLM"
6
+ ],
7
+ "attn_pdrop": 0.0,
8
+ "auto_map": {
9
+ "AutoConfig": "configuration_gptjx.GPTJXConfig",
10
+ "AutoModel": "modeling_gptjx.GPTJXModel",
11
+ "AutoModelForCausalLM": "modeling_gptjx.GPTJXForCausalLM"
12
+ },
13
+ "bos_token_id": 50256,
14
+ "embd_pdrop": 0.0,
15
+ "mup_embeddings_scale": 14.6,
16
+ "eos_token_id": 50256,
17
+ "initializer_range": 0.073,
18
+ "layer_norm_epsilon": 1e-05,
19
+ "model_type": "gpt_jx",
20
+ "n_embd": 2560,
21
+ "n_head": 32,
22
+ "n_inner": 6826,
23
+ "n_layer": 32,
24
+ "n_positions": 8192,
25
+ "mup_output_alpha": 2.2200000000000003,
26
+ "position_embedding_type": "alibi",
27
+ "reorder_and_upcast_attn": false,
28
+ "resid_pdrop": 0.0,
29
+ "scale_attn_by_inverse_layer_idx": false,
30
+ "scale_attn_weights": true,
31
+ "mup_scale_qk_dot_by_d": true,
32
+ "torch_dtype": "bfloat16",
33
+ "transformers_version": "4.30.0",
34
+ "use_cache": true,
35
+ "vocab_size": 50257,
36
+ "mup_width_scale": 0.1
37
+ }
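A few derived quantities make the muP fields in this config easier to read. The sketch below repeats only the fields it needs and relies on the relations documented in `configuration_gptjx.py` (`mup_width_scale = d_model,0 / d_model` and `output_logits_scale = mup_output_alpha * mup_width_scale`) and on the attention code in `modeling_gptjx.py`, which divides the query-key product by `head_dim ** attn_scale_power`. The variable names are illustrative.

```python
import json

# A fragment of the config shown above; only the fields used below are repeated.
config = json.loads("""
{
  "n_embd": 2560,
  "n_head": 32,
  "mup_width_scale": 0.1,
  "mup_output_alpha": 2.2200000000000003,
  "mup_scale_qk_dot_by_d": true,
  "scale_attn_weights": true
}
""")

head_dim = config["n_embd"] // config["n_head"]

# mup_width_scale = d_model,0 / d_model, so the proxy model's width follows.
proxy_width = round(config["mup_width_scale"] * config["n_embd"])

# Scale applied to the output logits.
output_logits_scale = config["mup_output_alpha"] * config["mup_width_scale"]

# With mup_scale_qk_dot_by_d enabled the scores are divided by head_dim,
# otherwise by sqrt(head_dim).
attn_scale_power = 1.0 if config["mup_scale_qk_dot_by_d"] else 0.5
attn_divisor = head_dim ** attn_scale_power if config["scale_attn_weights"] else 1.0

print(head_dim, proxy_width, round(output_logits_scale, 3), attn_divisor)
```

For this config the head dimension is 80, the implied proxy width is 256, the logits are scaled by about 0.222, and attention scores are divided by 80 rather than by the square root of 80.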
configuration_gptjx.py ADDED
@@ -0,0 +1,134 @@
1
+ """ GPT-JX configuration"""
2
+ from transformers.configuration_utils import PretrainedConfig
3
+ from transformers.utils import logging
4
+
5
+
6
+ logger = logging.get_logger(__name__)
7
+
8
+ PRETRAINED_CONFIG_ARCHIVE_MAP = {
9
+ "alien-ai/gpt-jx-3b": "https://huggingface.co/alien-ai/gpt-jx-3b/resolve/main/config.json",
10
+ }
11
+
12
+
13
+ class GPTJXConfig(PretrainedConfig):
14
+ """
15
+ This is the configuration class to store the configuration of a [`GPTJXModel`]. It is used to instantiate a GPT-JX
16
+ model according to the specified arguments, defining the model architecture.
17
+
18
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
19
+ documentation from [`PretrainedConfig`] for more information.
20
+
21
+
22
+ Args:
23
+ vocab_size (`int`, *optional*, defaults to 50257):
24
+ Vocabulary size of the GPT-JX model. Defines the number of different tokens that can be represented by the
25
+ `input_ids` passed when calling [`GPTJXModel`].
26
+ n_positions (`int`, *optional*, defaults to 1024):
27
+ The maximum sequence length that this model might ever be used with. Typically set this to something large
28
+ just in case (e.g., 512 or 1024 or 2048).
29
+ n_embd (`int`, *optional*, defaults to 768):
30
+ Dimensionality of the embeddings and hidden states.
31
+ n_layer (`int`, *optional*, defaults to 12):
32
+ Number of hidden layers in the Transformer decoder.
33
+ n_head (`int`, *optional*, defaults to 12):
34
+ Number of attention heads for each attention layer in the Transformer decoder.
35
+ n_inner (`int`, *optional*, defaults to None):
36
+ Dimensionality of the inner feed-forward layers. `None` will set it to 4 times `n_embd`.
37
+ activation_function (`str`, *optional*, defaults to `"gelu_new"`):
38
+ Activation function, to be selected in the list `["relu", "silu", "gelu", "tanh", "gelu_new", "swiglu"]`.
39
+ resid_pdrop (`float`, *optional*, defaults to 0.1):
40
+ The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
41
+ embd_pdrop (`float`, *optional*, defaults to 0.1):
42
+ The dropout ratio for the embeddings.
43
+ attn_pdrop (`float`, *optional*, defaults to 0.1):
44
+ The dropout ratio for the attention.
45
+ layer_norm_epsilon (`float`, *optional*, defaults to 1e-5):
46
+ The epsilon to use in the layer normalization layers.
47
+ initializer_range (`float`, *optional*, defaults to 0.02):
48
+ The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
49
+ scale_attn_weights (`bool`, *optional*, defaults to `True`):
50
+ Scale attention weights by dividing by sqrt(head_dim).
51
+ use_cache (`bool`, *optional*, defaults to `True`):
52
+ Whether or not the model should return the last key/values attentions (not used by all models).
53
+ scale_attn_by_inverse_layer_idx (`bool`, *optional*, defaults to `False`):
54
+ Whether to additionally scale attention weights by `1 / (layer_idx + 1)`.
55
+ reorder_and_upcast_attn (`bool`, *optional*, defaults to `False`):
56
+ Whether to scale keys (K) prior to computing attention (dot-product) and upcast attention
57
+ dot-product/softmax to float() when training with mixed precision.
58
+ position_embedding_type (`str`, *optional*, defaults to `"learned"`):
59
+ Positional embedding can be either `"alibi"` or `"learned"`.
60
+ mup_width_scale (`float`, *optional*, defaults to 1.0):
61
+ muP parameter to scale learning rate and initializers. Calculated as (`d_model,0 / d_model`), where
62
+ `d_model` is the model's width and `d_model,0` is the proxy model's width.
63
+ mup_embeddings_scale (`float`, *optional*, defaults to 1.0):
64
+ muP parameter to scale token and position embeddings.
65
+ mup_output_alpha (`float`, *optional*, defaults to 1.0):
66
+ muP parameter to scale output logits (`output_logits_scale = mup_output_alpha * mup_width_scale`).
67
+ mup_scale_qk_dot_by_d (`bool`, *optional*, defaults to `False`):
68
+ Scale attention weights by dividing by head_dim instead of sqrt(head_dim). Need to set
69
+ scale_attn_weights to `True` as well.
70
+ """
71
+
72
+ model_type = "gpt_jx"
73
+ keys_to_ignore_at_inference = ["past_key_values"]
74
+ attribute_map = {
75
+ "hidden_size": "n_embd",
76
+ "max_position_embeddings": "n_positions",
77
+ "num_attention_heads": "n_head",
78
+ "num_hidden_layers": "n_layer",
79
+ }
80
+
81
+ def __init__(
82
+ self,
83
+ vocab_size=50257,
84
+ n_positions=1024,
85
+ n_embd=768,
86
+ n_layer=12,
87
+ n_head=12,
88
+ n_inner=None,
89
+ activation_function="gelu_new",
90
+ resid_pdrop=0.1,
91
+ embd_pdrop=0.1,
92
+ attn_pdrop=0.1,
93
+ layer_norm_epsilon=1e-5,
94
+ initializer_range=0.02,
95
+ scale_attn_weights=True,
96
+ use_cache=True,
97
+ bos_token_id=50256,
98
+ eos_token_id=50256,
99
+ scale_attn_by_inverse_layer_idx=False,
100
+ reorder_and_upcast_attn=False,
101
+ position_embedding_type="learned",
102
+ mup_width_scale=1.0,
103
+ mup_embeddings_scale=1.0,
104
+ mup_output_alpha=1.0,
105
+ mup_scale_qk_dot_by_d=False,
106
+ **kwargs,
107
+ ):
108
+ self.vocab_size = vocab_size
109
+ self.n_positions = n_positions
110
+ self.n_embd = n_embd
111
+ self.n_layer = n_layer
112
+ self.n_head = n_head
113
+ self.n_inner = n_inner
114
+ self.activation_function = activation_function
115
+ self.resid_pdrop = resid_pdrop
116
+ self.embd_pdrop = embd_pdrop
117
+ self.attn_pdrop = attn_pdrop
118
+ self.layer_norm_epsilon = layer_norm_epsilon
119
+ self.initializer_range = initializer_range
120
+ self.scale_attn_weights = scale_attn_weights
121
+ self.use_cache = use_cache
122
+ self.scale_attn_by_inverse_layer_idx = scale_attn_by_inverse_layer_idx
123
+ self.reorder_and_upcast_attn = reorder_and_upcast_attn
124
+
125
+ self.bos_token_id = bos_token_id
126
+ self.eos_token_id = eos_token_id
127
+
128
+ self.position_embedding_type = position_embedding_type
129
+ self.mup_width_scale = mup_width_scale
130
+ self.mup_embeddings_scale = mup_embeddings_scale
131
+ self.mup_output_alpha = mup_output_alpha
132
+ self.mup_scale_qk_dot_by_d = mup_scale_qk_dot_by_d
133
+
134
+ super().__init__(bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)
generation_config.json ADDED
@@ -0,0 +1,6 @@
1
+ {
2
+ "_from_model_config": true,
3
+ "bos_token_id": 50256,
4
+ "eos_token_id": 50256,
5
+ "transformers_version": "4.30.0"
6
+ }
merges.txt ADDED
The diff for this file is too large to render.
modeling_gptjx.py ADDED
@@ -0,0 +1,1199 @@
1
+ """ PyTorch GPT-JX model."""
2
+ import math
3
+ import os
4
+ import warnings
5
+ from typing import Optional, Tuple, Union
6
+
7
+ import torch
8
+ from torch import Tensor, nn
9
+ from torch.cuda.amp import autocast
10
+ from torch.nn import CrossEntropyLoss
11
+
12
+ from transformers.activations import ACT2FN
13
+ from transformers.modeling_outputs import (
14
+ BaseModelOutputWithPastAndCrossAttentions,
15
+ CausalLMOutputWithCrossAttentions
16
+ )
17
+ from transformers.modeling_utils import PreTrainedModel
18
+ from transformers.pytorch_utils import Conv1D, find_pruneable_heads_and_indices, prune_conv1d_layer
19
+ from transformers.utils import (
20
+ add_code_sample_docstrings,
21
+ add_start_docstrings,
22
+ add_start_docstrings_to_model_forward,
23
+ logging,
24
+ )
25
+ from transformers.utils.model_parallel_utils import assert_device_map, get_device_map
26
+ from .configuration_gptjx import GPTJXConfig
27
+
28
+
29
+ logger = logging.get_logger(__name__)
30
+
31
+ _CHECKPOINT_FOR_DOC = "alien-ai/gpt-jx-3b"
32
+ _CONFIG_FOR_DOC = "GPTJXConfig"
33
+
34
+ class SwiGLUActivation(nn.Module):
35
+ def forward(self, x1: Tensor, x2: Tensor) -> Tensor:
36
+ return x1 * nn.functional.silu(x2)
37
+
38
+
39
+ class AlibiPositionEmbeddingLayer(nn.Module):
40
+ def __init__(self, num_heads):
41
+ super().__init__()
42
+
43
+ self.num_heads = num_heads
44
+ slopes = torch.tensor(AlibiPositionEmbeddingLayer._get_alibi_slopes(num_heads)).unsqueeze(-1)
45
+ self.slopes = nn.parameter.Parameter(slopes, requires_grad=False)
46
+
47
+ def forward(
48
+ self,
49
+ seq_length,
50
+ key_length,
51
+ cached_qk_len,
52
+ ):
53
+ context_position = torch.arange(
54
+ cached_qk_len, cached_qk_len + seq_length, device=self.slopes.device
55
+ )[:, None]
56
+ memory_position = torch.arange(
57
+ key_length + cached_qk_len, device=self.slopes.device
58
+ )[None, :]
59
+ relative_position = memory_position - context_position
60
+ relative_position = torch.abs(relative_position).unsqueeze(0).expand(self.num_heads, -1, -1)
61
+ alibi = (self.slopes * -1.0).unsqueeze(1) * relative_position
62
+ return alibi
63
+
64
+ @staticmethod
65
+ def _get_alibi_slopes(n):
66
+ def get_slopes_power_of_2(n):
67
+ start = 2 ** (-(2 ** -(math.log2(n) - 3)))
68
+ ratio = start
69
+ return [start * ratio**i for i in range(n)]
70
+
71
+ if math.log2(n).is_integer():
72
+ return get_slopes_power_of_2(
73
+ n
74
+ ) # In the paper, we only train models that have 2^a heads for some a. This function has
75
+ else: # some good properties that only occur when the input is a power of 2. To maintain that even
76
+ closest_power_of_2 = 2 ** math.floor(
77
+ math.log2(n)
78
+ ) # when the number of heads is not a power of 2, we use this workaround.
79
+ return (
80
+ get_slopes_power_of_2(closest_power_of_2)
81
+ + AlibiPositionEmbeddingLayer._get_alibi_slopes(2 * closest_power_of_2)[0::2][: n - closest_power_of_2]
82
+ )
83
+
84
+
85
+ def load_tf_weights_in_model(model, config, checkpoint_path):
86
+ """Load tf checkpoints in a pytorch model"""
87
+ try:
88
+ import re
89
+
90
+ import tensorflow as tf
91
+ except ImportError:
92
+ logger.error(
93
+ "Loading a TensorFlow model in PyTorch, requires TensorFlow to be installed. Please see "
94
+ "https://www.tensorflow.org/install/ for installation instructions."
95
+ )
96
+ raise
97
+ tf_path = os.path.abspath(checkpoint_path)
98
+ logger.info(f"Converting TensorFlow checkpoint from {tf_path}")
99
+ # Load weights from TF model
100
+ init_vars = tf.train.list_variables(tf_path)
101
+ names = []
102
+ arrays = []
103
+ for name, shape in init_vars:
104
+ logger.info(f"Loading TF weight {name} with shape {shape}")
105
+ array = tf.train.load_variable(tf_path, name)
106
+ names.append(name)
107
+ arrays.append(array.squeeze())
108
+
109
+ for name, array in zip(names, arrays):
110
+ name = name[6:] # skip "model/"
111
+ name = name.split("/")
112
+ pointer = model
113
+ for m_name in name:
114
+ if re.fullmatch(r"[A-Za-z]+\d+", m_name):
115
+ scope_names = re.split(r"(\d+)", m_name)
116
+ else:
117
+ scope_names = [m_name]
118
+ if scope_names[0] == "w" or scope_names[0] == "g":
119
+ pointer = getattr(pointer, "weight")
120
+ elif scope_names[0] == "b":
121
+ pointer = getattr(pointer, "bias")
122
+ elif scope_names[0] == "wpe" or scope_names[0] == "wte":
123
+ pointer = getattr(pointer, scope_names[0])
124
+ pointer = getattr(pointer, "weight")
125
+ else:
126
+ pointer = getattr(pointer, scope_names[0])
127
+ if len(scope_names) >= 2:
128
+ num = int(scope_names[1])
129
+ pointer = pointer[num]
130
+ try:
131
+ assert (
132
+ pointer.shape == array.shape
133
+ ), f"Pointer shape {pointer.shape} and array shape {array.shape} mismatched"
134
+ except AssertionError as e:
135
+ e.args += (pointer.shape, array.shape)
136
+ raise
137
+ logger.info(f"Initialize PyTorch weight {name}")
138
+ pointer.data = torch.from_numpy(array)
139
+ return model
140
+
141
+
142
+ class GPTJXAttention(nn.Module):
143
+ def __init__(self, config, is_cross_attention=False, layer_idx=None):
144
+ super().__init__()
145
+
146
+ max_positions = config.max_position_embeddings
147
+ self.register_buffer(
148
+ "bias",
149
+ torch.tril(torch.ones((max_positions, max_positions), dtype=torch.bool)).view(
150
+ 1, 1, max_positions, max_positions
151
+ ),
152
+ persistent=False,
153
+ )
154
+ self.register_buffer("masked_bias", torch.tensor(-1e4), persistent=False)
155
+
156
+ self.embed_dim = config.hidden_size
157
+ self.num_heads = config.num_attention_heads
158
+ self.head_dim = self.embed_dim // self.num_heads
159
+ self.split_size = self.embed_dim
160
+ if self.head_dim * self.num_heads != self.embed_dim:
161
+ raise ValueError(
162
+ f"`embed_dim` must be divisible by num_heads (got `embed_dim`: {self.embed_dim} and `num_heads`:"
163
+ f" {self.num_heads})."
164
+ )
165
+
166
+ self.scale_attn_weights = config.scale_attn_weights
167
+ self.is_cross_attention = is_cross_attention
168
+
169
+ # Layer-wise attention scaling, reordering, and upcasting
170
+ self.scale_attn_by_inverse_layer_idx = config.scale_attn_by_inverse_layer_idx
171
+ self.layer_idx = layer_idx
172
+ self.reorder_and_upcast_attn = config.reorder_and_upcast_attn
173
+
174
+ if self.is_cross_attention:
175
+ self.c_attn = Conv1D(2 * self.embed_dim, self.embed_dim)
176
+ self.q_attn = Conv1D(self.embed_dim, self.embed_dim)
177
+ else:
178
+ self.c_attn = Conv1D(3 * self.embed_dim, self.embed_dim)
179
+ self.c_proj = Conv1D(self.embed_dim, self.embed_dim)
180
+
181
+ self.attn_dropout = nn.Dropout(config.attn_pdrop)
182
+ self.resid_dropout = nn.Dropout(config.resid_pdrop)
183
+
184
+ self.pruned_heads = set()
185
+
186
+ self.attn_scale_power = 1.0 if config.mup_scale_qk_dot_by_d else 0.5
187
+
188
+ def prune_heads(self, heads):
189
+ if len(heads) == 0:
190
+ return
191
+ heads, index = find_pruneable_heads_and_indices(heads, self.num_heads, self.head_dim, self.pruned_heads)
192
+ index_attn = torch.cat([index, index + self.split_size, index + (2 * self.split_size)])
193
+
194
+ # Prune conv1d layers
195
+ self.c_attn = prune_conv1d_layer(self.c_attn, index_attn, dim=1)
196
+ self.c_proj = prune_conv1d_layer(self.c_proj, index, dim=0)
197
+
198
+ # Update hyper params
199
+ self.split_size = (self.split_size // self.num_heads) * (self.num_heads - len(heads))
200
+ self.num_heads = self.num_heads - len(heads)
201
+ self.pruned_heads = self.pruned_heads.union(heads)
202
+
203
+ def _attn(self, query, key, value, attention_mask=None, head_mask=None, position_bias=None):
204
+ attn_weights = torch.matmul(query, key.transpose(-1, -2))
205
+
206
+ if self.scale_attn_weights:
207
+ attn_weights = attn_weights / torch.full(
208
+ [], value.size(-1) ** self.attn_scale_power, dtype=attn_weights.dtype, device=attn_weights.device
209
+ )
210
+
211
+ # Layer-wise attention scaling
212
+ if self.scale_attn_by_inverse_layer_idx:
213
+ attn_weights = attn_weights / float(self.layer_idx + 1)
214
+
215
+ if not self.is_cross_attention:
216
+ # only the "normal" (self-)attention layer applies the causal mask
217
+ query_length, key_length = query.size(-2), key.size(-2)
218
+ causal_mask = self.bias[:, :, key_length - query_length : key_length, :key_length]
219
+ mask_value = torch.finfo(attn_weights.dtype).min
220
+ # Need to be a tensor, otherwise we get error: `RuntimeError: expected scalar type float but found double`.
221
+ # Need to be on the same device, otherwise `RuntimeError: ..., x and y to be on the same device`
222
+ mask_value = torch.full([], mask_value, dtype=attn_weights.dtype).to(attn_weights.device)
223
+ attn_weights = torch.where(causal_mask, attn_weights.to(attn_weights.dtype), mask_value)
224
+
225
+ if attention_mask is not None:
226
+ # Apply the attention mask
227
+ attn_weights = attn_weights + attention_mask
228
+
229
+ if position_bias is not None:
230
+ attn_weights += position_bias.type_as(attn_weights).unsqueeze(0)
231
+ attn_weights = nn.functional.softmax(attn_weights, dim=-1)
232
+
233
+ # Downcast (if necessary) back to V's dtype (if in mixed-precision) -- No-Op otherwise
234
+ attn_weights = attn_weights.type(value.dtype)
235
+ attn_weights = self.attn_dropout(attn_weights)
236
+
237
+ # Mask heads if we want to
238
+ if head_mask is not None:
239
+ attn_weights = attn_weights * head_mask
240
+
241
+ attn_output = torch.matmul(attn_weights, value)
242
+
243
+ return attn_output, attn_weights
244
+
245
+ def _upcast_and_reordered_attn(self, query, key, value, attention_mask=None, head_mask=None, position_bias=None):
246
+ # Use `torch.baddbmm` (a bit more efficient w/ alpha param for scaling -- from Megatron-LM)
247
+ bsz, num_heads, q_seq_len, dk = query.size()
248
+ _, _, k_seq_len, _ = key.size()
249
+
250
+ # Preallocate attn_weights for `baddbmm`
251
+ attn_weights = torch.empty(bsz * num_heads, q_seq_len, k_seq_len, dtype=torch.float32, device=query.device)
252
+
253
+ # Compute Scale Factor
254
+ scale_factor = 1.0
255
+ if self.scale_attn_weights:
256
+ scale_factor /= float(value.size(-1)) ** self.attn_scale_power
257
+
258
+ if self.scale_attn_by_inverse_layer_idx:
259
+ scale_factor /= float(self.layer_idx + 1)
260
+
261
+ # Upcast (turn off autocast) and reorder (Scale K by 1 / root(dk))
262
+ with autocast(enabled=False):
263
+ q, k = query.reshape(-1, q_seq_len, dk), key.transpose(-1, -2).reshape(-1, dk, k_seq_len)
264
+ attn_weights = torch.baddbmm(attn_weights, q.float(), k.float(), beta=0, alpha=scale_factor)
265
+ attn_weights = attn_weights.reshape(bsz, num_heads, q_seq_len, k_seq_len)
266
+
267
+ if not self.is_cross_attention:
268
+ # only the "normal" (self-)attention layer applies the causal mask
269
+ query_length, key_length = query.size(-2), key.size(-2)
270
+ causal_mask = self.bias[:, :, key_length - query_length : key_length, :key_length]
271
+ mask_value = torch.finfo(attn_weights.dtype).min
272
+ # Need to be a tensor, otherwise we get error: `RuntimeError: expected scalar type float but found double`.
273
+ # Need to be on the same device, otherwise `RuntimeError: ..., x and y to be on the same device`
274
+ mask_value = torch.tensor(mask_value, dtype=attn_weights.dtype).to(attn_weights.device)
275
+ attn_weights = torch.where(causal_mask, attn_weights, mask_value)
276
+
277
+ if attention_mask is not None:
278
+ # Apply the attention mask
279
+ attn_weights = attn_weights + attention_mask
280
+
281
+ if position_bias is not None:
282
+ attn_weights += position_bias.type_as(attn_weights).unsqueeze(0)
283
+ attn_weights = nn.functional.softmax(attn_weights, dim=-1)
284
+
285
+ # Downcast (if necessary) back to V's dtype (if in mixed-precision) -- No-Op if otherwise
286
+ if attn_weights.dtype != torch.float32:
287
+ raise RuntimeError("Error with upcasting, attn_weights does not have dtype torch.float32")
288
+ attn_weights = attn_weights.type(value.dtype)
289
+ attn_weights = self.attn_dropout(attn_weights)
290
+
291
+ # Mask heads if we want to
292
+ if head_mask is not None:
293
+ attn_weights = attn_weights * head_mask
294
+
295
+ attn_output = torch.matmul(attn_weights, value)
296
+
297
+ return attn_output, attn_weights
298
+
299
+ def _split_heads(self, tensor, num_heads, attn_head_size):
300
+ """
301
+ Splits hidden_size dim into attn_head_size and num_heads
302
+ """
303
+ new_shape = tensor.size()[:-1] + (num_heads, attn_head_size)
304
+ tensor = tensor.view(new_shape)
305
+ return tensor.permute(0, 2, 1, 3) # (batch, head, seq_length, head_features)
306
+
307
+ def _merge_heads(self, tensor, num_heads, attn_head_size):
308
+ """
309
+ Merges attn_head_size dim and num_attn_heads dim into hidden_size
310
+ """
311
+ tensor = tensor.permute(0, 2, 1, 3).contiguous()
312
+ new_shape = tensor.size()[:-2] + (num_heads * attn_head_size,)
313
+ return tensor.view(new_shape)
314
+
315
+ def forward(
316
+ self,
317
+ hidden_states: Optional[Tuple[torch.FloatTensor]],
318
+ layer_past: Optional[Tuple[torch.Tensor]] = None,
319
+ attention_mask: Optional[torch.FloatTensor] = None,
320
+ head_mask: Optional[torch.FloatTensor] = None,
321
+ encoder_hidden_states: Optional[torch.Tensor] = None,
322
+ encoder_attention_mask: Optional[torch.FloatTensor] = None,
323
+ use_cache: Optional[bool] = False,
324
+ output_attentions: Optional[bool] = False,
325
+ position_bias: Optional[torch.FloatTensor] = None,
326
+ ) -> Tuple[Union[torch.Tensor, Tuple[torch.Tensor]], ...]:
327
+ if encoder_hidden_states is not None:
328
+ if not hasattr(self, "q_attn"):
329
+ raise ValueError(
330
+ "If class is used as cross attention, the weights `q_attn` have to be defined. "
331
+ "Please make sure to instantiate class with `GPTJXAttention(..., is_cross_attention=True)`."
332
+ )
333
+
334
+ query = self.q_attn(hidden_states)
335
+ key, value = self.c_attn(encoder_hidden_states).split(self.split_size, dim=2)
336
+ attention_mask = encoder_attention_mask
337
+ else:
338
+ query, key, value = self.c_attn(hidden_states).split(self.split_size, dim=2)
339
+
340
+ query = self._split_heads(query, self.num_heads, self.head_dim)
341
+ key = self._split_heads(key, self.num_heads, self.head_dim)
342
+ value = self._split_heads(value, self.num_heads, self.head_dim)
343
+
344
+ if layer_past is not None:
345
+ past_key, past_value = layer_past
346
+ key = torch.cat((past_key, key), dim=-2)
347
+ value = torch.cat((past_value, value), dim=-2)
348
+
349
+ if use_cache is True:
350
+ present = (key, value)
351
+ else:
352
+ present = None
353
+
354
+ if self.reorder_and_upcast_attn:
355
+ attn_output, attn_weights = self._upcast_and_reordered_attn(
356
+ query, key, value, attention_mask, head_mask, position_bias
357
+ )
358
+ else:
359
+ attn_output, attn_weights = self._attn(query, key, value, attention_mask, head_mask, position_bias)
360
+
361
+ attn_output = self._merge_heads(attn_output, self.num_heads, self.head_dim)
362
+ attn_output = self.c_proj(attn_output)
363
+ attn_output = self.resid_dropout(attn_output)
364
+
365
+ outputs = (attn_output, present)
366
+ if output_attentions:
367
+ outputs += (attn_weights,)
368
+
369
+ return outputs # a, present, (attentions)
370
+
371
+
372
+ class GPTJXMLP(nn.Module):
373
+ def __init__(self, intermediate_size, config):
374
+ super().__init__()
375
+ embed_dim = config.hidden_size
376
+ self.swiglu = config.activation_function == "swiglu"
377
+ self.c_fc = Conv1D(intermediate_size, embed_dim)
378
+ self.c_fc2 = Conv1D(intermediate_size, embed_dim) if self.swiglu else None
379
+ self.c_proj = Conv1D(embed_dim, intermediate_size)
380
+ self.act = SwiGLUActivation() if self.swiglu else ACT2FN[config.activation_function]
381
+ self.dropout = nn.Dropout(config.resid_pdrop)
382
+
383
+ def forward(self, hidden_states: Optional[Tuple[torch.FloatTensor]]) -> torch.FloatTensor:
384
+ if self.swiglu:
385
+ hidden_states2 = self.c_fc2(hidden_states)
386
+ hidden_states = self.c_fc(hidden_states)
387
+ hidden_states = self.act(hidden_states, hidden_states2) if self.swiglu else self.act(hidden_states)
388
+ hidden_states = self.c_proj(hidden_states)
389
+ hidden_states = self.dropout(hidden_states)
390
+ return hidden_states
391
+
392
+
393
+ class DecoderBlock(nn.Module):
394
+ def __init__(self, config, layer_idx=None):
395
+ super().__init__()
396
+ hidden_size = config.hidden_size
397
+ inner_dim = config.n_inner if config.n_inner is not None else 4 * hidden_size
398
+
399
+ self.ln_1 = nn.LayerNorm(hidden_size, eps=config.layer_norm_epsilon)
400
+ self.attn = GPTJXAttention(config, layer_idx=layer_idx)
401
+ self.ln_2 = nn.LayerNorm(hidden_size, eps=config.layer_norm_epsilon)
402
+
403
+ if config.add_cross_attention:
404
+ self.crossattention = GPTJXAttention(config, is_cross_attention=True, layer_idx=layer_idx)
405
+ self.ln_cross_attn = nn.LayerNorm(hidden_size, eps=config.layer_norm_epsilon)
406
+
407
+ self.mlp = GPTJXMLP(inner_dim, config)
408
+
409
+ def forward(
410
+ self,
411
+ hidden_states: Optional[Tuple[torch.FloatTensor]],
412
+ layer_past: Optional[Tuple[torch.Tensor]] = None,
413
+ attention_mask: Optional[torch.FloatTensor] = None,
414
+ head_mask: Optional[torch.FloatTensor] = None,
415
+ encoder_hidden_states: Optional[torch.Tensor] = None,
416
+ encoder_attention_mask: Optional[torch.FloatTensor] = None,
417
+ use_cache: Optional[bool] = False,
418
+ output_attentions: Optional[bool] = False,
419
+ position_bias: Optional[torch.FloatTensor] = None,
420
+ ) -> Union[Tuple[torch.Tensor], Optional[Tuple[torch.Tensor, Tuple[torch.FloatTensor, ...]]]]:
421
+ residual = hidden_states
422
+ hidden_states = self.ln_1(hidden_states)
423
+ attn_outputs = self.attn(
424
+ hidden_states,
425
+ layer_past=layer_past,
426
+ attention_mask=attention_mask,
427
+ head_mask=head_mask,
428
+ use_cache=use_cache,
429
+ output_attentions=output_attentions,
430
+ position_bias=position_bias,
431
+ )
432
+ attn_output = attn_outputs[0] # output_attn: a, present, (attentions)
433
+ outputs = attn_outputs[1:]
434
+ # residual connection
435
+ hidden_states = attn_output + residual
436
+
437
+ if encoder_hidden_states is not None:
438
+ # add one self-attention block for cross-attention
439
+ if not hasattr(self, "crossattention"):
440
+ raise ValueError(
441
+ f"If `encoder_hidden_states` are passed, {self} has to be instantiated with "
442
+ "cross-attention layers by setting `config.add_cross_attention=True`"
443
+ )
444
+ residual = hidden_states
445
+ hidden_states = self.ln_cross_attn(hidden_states)
446
+ cross_attn_outputs = self.crossattention(
447
+ hidden_states,
448
+ attention_mask=attention_mask,
449
+ head_mask=head_mask,
450
+ encoder_hidden_states=encoder_hidden_states,
451
+ encoder_attention_mask=encoder_attention_mask,
452
+ output_attentions=output_attentions,
453
+ position_bias=position_bias,
454
+ )
455
+ attn_output = cross_attn_outputs[0]
456
+ # residual connection
457
+ hidden_states = residual + attn_output
458
+ outputs = outputs + cross_attn_outputs[2:] # add cross attentions if we output attention weights
459
+
460
+ residual = hidden_states
461
+ hidden_states = self.ln_2(hidden_states)
462
+ feed_forward_hidden_states = self.mlp(hidden_states)
463
+ # residual connection
464
+ hidden_states = residual + feed_forward_hidden_states
465
+
466
+ if use_cache:
467
+ outputs = (hidden_states,) + outputs
468
+ else:
469
+ outputs = (hidden_states,) + outputs[1:]
470
+
471
+ return outputs # hidden_states, present, (attentions, cross_attentions)
472
+
473
+
474
+ class GPTJXPreTrainedModel(PreTrainedModel):
475
+ """
476
+ An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
477
+ models.
478
+ """
479
+
480
+ config_class = GPTJXConfig
481
+ load_tf_weights = load_tf_weights_in_model
482
+ base_model_prefix = "transformer"
483
+ is_parallelizable = True
484
+ supports_gradient_checkpointing = True
485
+ _no_split_modules = ["DecoderBlock"]
486
+ _skip_keys_device_placement = "past_key_values"
487
+
488
+ def __init__(self, *inputs, **kwargs):
489
+ super().__init__(*inputs, **kwargs)
490
+
491
+ def _init_weights(self, module):
492
+ """Initialize the weights."""
493
+ mup_init_scale = math.sqrt(self.config.mup_width_scale)
494
+ if isinstance(module, (nn.Linear, Conv1D)):
495
+ # Slightly different from the TF version which uses truncated_normal for initialization
496
+ # cf https://github.com/pytorch/pytorch/pull/5617
497
+ module.weight.data.normal_(mean=0.0, std=(self.config.initializer_range * mup_init_scale))
498
+ if module.bias is not None:
499
+ module.bias.data.zero_()
500
+ elif isinstance(module, nn.Embedding):
501
+ module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
502
+ if module.padding_idx is not None:
503
+ module.weight.data[module.padding_idx].zero_()
504
+ elif isinstance(module, nn.LayerNorm):
505
+ module.bias.data.zero_()
506
+ module.weight.data.fill_(1.0)
507
+
508
+ for name, p in module.named_parameters():
509
+ if name == "c_proj.weight":
510
+ # Special Scaled Initialization --> There are 2 Layer Norms per Transformer Block
511
+ stddev = self.config.initializer_range * mup_init_scale / math.sqrt(2 * self.config.n_layer)
512
+ p.data.normal_(mean=0.0, std=stddev)
513
+
514
+ def _set_gradient_checkpointing(self, module, value=False):
515
+ if isinstance(module, GPTJXModel):
516
+ module.gradient_checkpointing = value
517
+
518
+ def get_mup_param_groups(self, lr, weight_decay=0.0, decoupled_wd=True):
519
+ """
520
+ Returns list of dicts defining parameter groups for muP:
521
+ group 0: most model params get scaled learning rate and weight decay.
522
+ group 1: embedding layer gets non-scaled learning rate and weight decay.
523
+ group 2: normalization layers and biases get non-scaled learning rate only.
524
+
525
+ The output can be passed to Adam-based optimizers
526
+ e.g.
527
+ param_groups = model.get_mup_param_groups(lr=1e-3, weight_decay=0.1)
528
+ torch.optim.AdamW(param_groups, betas=(0.9, 0.95), eps=1e-8)
529
+ """
530
+ norm_modules = (
531
+ torch.nn.LayerNorm,
532
+ torch.nn.BatchNorm1d,
533
+ torch.nn.BatchNorm2d,
534
+ torch.nn.BatchNorm3d,
535
+ torch.nn.InstanceNorm1d,
536
+ torch.nn.InstanceNorm2d,
537
+ torch.nn.InstanceNorm3d,
538
+ torch.nn.GroupNorm,
539
+ torch.nn.SyncBatchNorm,
540
+ torch.nn.LocalResponseNorm,
541
+ )
542
+
543
+ def get_group_index(param_name):
544
+ for name, module in self.named_modules():
545
+ if name in param_name:
546
+ if isinstance(module, norm_modules):
547
+ return 2
548
+ elif isinstance(module, torch.nn.Embedding):
549
+ return 1
550
+ return 0
551
+
552
+ width_scale = self.config.mup_width_scale
553
+ new_param_groups = []
554
+ new_param_groups.append({"params": [], "lr": lr * width_scale, "weight_decay": weight_decay})
555
+ if not decoupled_wd:
556
+ new_param_groups[0]["weight_decay"] /= width_scale
557
+ new_param_groups.append({"params": [], "lr": lr, "weight_decay": weight_decay})
558
+ new_param_groups.append({"params": [], "lr": lr, "weight_decay": 0.0})
559
+
560
+ for name, param in self.named_parameters():
561
+ if not param.requires_grad:
562
+ continue
563
+
564
+ if name.endswith("bias"):
565
+ new_param_groups[2]["params"].append(param)
566
+ else:
567
+ new_param_groups[get_group_index(name)]["params"].append(param)
568
+
569
+ # Keep only non-empty groups; rebuild the list, since deleting while iterating skips entries.
570
+ new_param_groups = [group for group in new_param_groups if len(group["params"]) > 0]
571
+
572
+
573
+ return new_param_groups
574
+
575
+
576
+ START_DOCSTRING = r"""
577
+
578
+ This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
579
+ library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
580
+ etc.)
581
+
582
+ This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
583
+ Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
584
+ and behavior.
585
+
586
+ Parameters:
587
+ config ([`GPTJXConfig`]): Model configuration class with all the parameters of the model.
588
+ Initializing with a config file does not load the weights associated with the model, only the
589
+ configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights.
590
+ """
591
+
592
+ INPUTS_DOCSTRING = r"""
593
+ Args:
594
+ input_ids (`torch.LongTensor` of shape `(batch_size, input_ids_length)`):
595
+ `input_ids_length` = `sequence_length` if `past_key_values` is `None` else
596
+ `past_key_values[0][0].shape[-2]` (`sequence_length` of input past key value states). Indices of input
597
+ sequence tokens in the vocabulary.
598
+
599
+ If `past_key_values` is used, only `input_ids` that do not have their past calculated should be passed as
600
+ `input_ids`.
601
+
602
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
603
+ [`PreTrainedTokenizer.__call__`] for details.
604
+
605
+ [What are input IDs?](../glossary#input-ids)
606
+ past_key_values (`Tuple[Tuple[torch.Tensor]]` of length `config.n_layers`):
607
+ Contains precomputed hidden-states (key and values in the attention blocks) as computed by the model (see
608
+ `past_key_values` output below). Can be used to speed up sequential decoding. The `input_ids` which have
609
+ their past given to this model should not be passed as `input_ids` as they have already been computed.
610
+ attention_mask (`torch.FloatTensor` of shape `(batch_size, sequence_length)`, *optional*):
611
+ Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
612
+
613
+ - 1 for tokens that are **not masked**,
614
+ - 0 for tokens that are **masked**.
615
+
616
+ If `past_key_values` is used, `attention_mask` needs to contain the masking strategy that was used for
617
+ `past_key_values`. In other words, the `attention_mask` always has to have the length:
618
+ `len(past_key_values) + len(input_ids)`
619
+
620
+ [What are attention masks?](../glossary#attention-mask)
621
+ token_type_ids (`torch.LongTensor` of shape `(batch_size, input_ids_length)`, *optional*):
622
+ Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0,
623
+ 1]`:
624
+
625
+ - 0 corresponds to a *sentence A* token,
626
+ - 1 corresponds to a *sentence B* token.
627
+
628
+ [What are token type IDs?](../glossary#token-type-ids)
629
+ position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
630
+ Indices of the position of each input sequence token in the position embeddings. Selected in the range `[0,
631
+ config.max_position_embeddings - 1]`.
632
+
633
+ [What are position IDs?](../glossary#position-ids)
634
+ head_mask (`torch.FloatTensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, *optional*):
635
+ Mask to nullify selected heads of the self-attention modules. Mask values selected in `[0, 1]`:
636
+
637
+ - 1 indicates the head is **not masked**,
638
+ - 0 indicates the head is **masked**.
639
+
640
+ inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
641
+ Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
642
+ is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
643
+ model's internal embedding lookup matrix.
644
+
645
+ If `past_key_values` is used, optionally only the last `inputs_embeds` have to be input (see
646
+ `past_key_values`).
647
+ use_cache (`bool`, *optional*):
648
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
649
+ `past_key_values`).
650
+ output_attentions (`bool`, *optional*):
651
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
652
+ tensors for more detail.
653
+ output_hidden_states (`bool`, *optional*):
654
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
655
+ more detail.
656
+ return_dict (`bool`, *optional*):
657
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
658
+ """
659
+ PARALLELIZE_DOCSTRING = r"""
660
+ This is an experimental feature and is subject to change at a moment's notice.
661
+
662
+ Uses a device map to distribute attention modules of the model across several devices. If no device map is given,
663
+ it will evenly distribute blocks across all devices.
664
+
665
+ Args:
666
+ device_map (`Dict[int, list]`, optional, defaults to None):
667
+ A dictionary that maps attention modules to devices. Note that the embedding module and LMHead are always
668
+ automatically mapped to the first device (for esoteric reasons). That means that the first device should
669
+ have fewer attention modules mapped to it than other devices. For reference, the GPT-JX model has the
670
+ following number of attention modules:
671
+ - gpt-jx-3b: 32
672
+ """
673
+ DEPARALLELIZE_DOCSTRING = r"""
674
+ Moves the model to cpu from a model parallel state.
675
+
676
+ Example:
677
+
678
+ ```python
679
+ # On a 4 GPU machine with gptjx:
680
+ model = GPTJXForCausalLM.from_pretrained("alien-ai/gpt-jx-3b")
681
+ device_map = {
682
+ 0: [0, 1, 2, 3, 4, 5, 6, 7],
683
+ 1: [8, 9, 10, 11, 12, 13, 14, 15],
684
+ 2: [16, 17, 18, 19, 20, 21, 22, 23],
685
+ 3: [24, 25, 26, 27, 28, 29, 30, 31],
686
+ }
687
+ model.parallelize(device_map) # Splits the model across several devices
688
+ model.deparallelize() # Puts the model back on cpu and frees memory by calling torch.cuda.empty_cache()
689
+ ```
690
+ """
691
+
692
+
693
+ @add_start_docstrings(
694
+ "The bare GPT-JX Model transformer outputting raw hidden-states without any specific head on top.",
695
+ START_DOCSTRING,
696
+ )
697
+ class GPTJXModel(GPTJXPreTrainedModel):
698
+ _keys_to_ignore_on_load_unexpected = [r"h\.\d+\.attn\.bias", r"h\.\d+\.attn\.masked_bias"]
699
+ _keys_to_ignore_on_load_missing = [r"attn.masked_bias", r"h\.\d+\.attn\.masked_bias", r"h\.\d+\.attn\.bias"]
700
+
701
+ def __init__(self, config):
702
+ super().__init__(config)
703
+
704
+ self.embed_dim = config.hidden_size
705
+
706
+ self.wte = nn.Embedding(config.vocab_size, self.embed_dim)
707
+ self.wpe = (
708
+ nn.Embedding(config.max_position_embeddings, self.embed_dim)
709
+ if config.position_embedding_type != "alibi"
710
+ else None
711
+ )
712
+ self.embeddings_scale = config.mup_embeddings_scale
713
+
714
+ self.drop = nn.Dropout(config.embd_pdrop)
715
+ self.h = nn.ModuleList([DecoderBlock(config, layer_idx=i) for i in range(config.num_hidden_layers)])
716
+ self.ln_f = nn.LayerNorm(self.embed_dim, eps=config.layer_norm_epsilon)
717
+
718
+ self.relative_pe = (
719
+ AlibiPositionEmbeddingLayer(config.num_attention_heads)
720
+ if config.position_embedding_type == "alibi"
721
+ else None
722
+ )
723
+
724
+ # Model parallel
725
+ self.model_parallel = False
726
+ self.device_map = None
727
+ self.gradient_checkpointing = False
728
+
729
+ # Initialize weights and apply final processing
730
+ self.post_init()
731
+
732
+ @add_start_docstrings(PARALLELIZE_DOCSTRING)
733
+ def parallelize(self, device_map=None):
734
+ # Check validity of device_map
735
+ warnings.warn(
736
+ "`GPTJXModel.parallelize` is deprecated and will be removed in v5 of Transformers, you should load your"
737
+ " model with `device_map='balanced'` in the call to `from_pretrained`. You can also provide your own"
738
+ " `device_map` but it needs to be a dictionary module_name to device, so for instance {'h.0': 0, 'h.1': 1,"
739
+ " ...}",
740
+ FutureWarning,
741
+ )
742
+ self.device_map = (
743
+ get_device_map(len(self.h), range(torch.cuda.device_count())) if device_map is None else device_map
744
+ )
745
+ assert_device_map(self.device_map, len(self.h))
746
+ self.model_parallel = True
747
+ self.first_device = "cpu" if "cpu" in self.device_map.keys() else "cuda:" + str(min(self.device_map.keys()))
748
+ self.last_device = "cuda:" + str(max(self.device_map.keys()))
749
+ self.wte = self.wte.to(self.first_device)
750
+ if self.wpe is not None:
751
+ self.wpe = self.wpe.to(self.first_device)
752
+ # Load onto devices
753
+ for k, v in self.device_map.items():
754
+ for block in v:
755
+ cuda_device = "cuda:" + str(k)
756
+ self.h[block] = self.h[block].to(cuda_device)
757
+ # ln_f to last
758
+ self.ln_f = self.ln_f.to(self.last_device)
759
+
760
+ @add_start_docstrings(DEPARALLELIZE_DOCSTRING)
761
+ def deparallelize(self):
762
+ warnings.warn(
763
+ "Like `parallelize`, `deparallelize` is deprecated and will be removed in v5 of Transformers.",
764
+ FutureWarning,
765
+ )
766
+ self.model_parallel = False
767
+ self.device_map = None
768
+ self.first_device = "cpu"
769
+ self.last_device = "cpu"
770
+ self.wte = self.wte.to("cpu")
771
+ if self.wpe is not None:
772
+ self.wpe = self.wpe.to("cpu")
773
+ for index in range(len(self.h)):
774
+ self.h[index] = self.h[index].to("cpu")
775
+ self.ln_f = self.ln_f.to("cpu")
776
+ torch.cuda.empty_cache()
777
+
778
+ def get_input_embeddings(self):
779
+ return self.wte
780
+
781
+ def set_input_embeddings(self, new_embeddings):
782
+ self.wte = new_embeddings
783
+
784
+ def _prune_heads(self, heads_to_prune):
785
+ """
786
+ Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer}
787
+ """
788
+ for layer, heads in heads_to_prune.items():
789
+ self.h[layer].attn.prune_heads(heads)
790
+
791
+ @add_start_docstrings_to_model_forward(INPUTS_DOCSTRING)
792
+ @add_code_sample_docstrings(
793
+ checkpoint=_CHECKPOINT_FOR_DOC,
794
+ output_type=BaseModelOutputWithPastAndCrossAttentions,
795
+ config_class=_CONFIG_FOR_DOC,
796
+ )
797
+ def forward(
798
+ self,
799
+ input_ids: Optional[torch.LongTensor] = None,
800
+ past_key_values: Optional[Tuple[Tuple[torch.Tensor]]] = None,
801
+ attention_mask: Optional[torch.FloatTensor] = None,
802
+ token_type_ids: Optional[torch.LongTensor] = None,
803
+ position_ids: Optional[torch.LongTensor] = None,
804
+ head_mask: Optional[torch.FloatTensor] = None,
805
+ inputs_embeds: Optional[torch.FloatTensor] = None,
806
+ encoder_hidden_states: Optional[torch.Tensor] = None,
807
+ encoder_attention_mask: Optional[torch.FloatTensor] = None,
808
+ use_cache: Optional[bool] = None,
809
+ output_attentions: Optional[bool] = None,
810
+ output_hidden_states: Optional[bool] = None,
811
+ return_dict: Optional[bool] = None,
812
+ ) -> Union[Tuple, BaseModelOutputWithPastAndCrossAttentions]:
813
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
814
+ output_hidden_states = (
815
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
816
+ )
817
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
818
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
819
+
820
+ if input_ids is not None and inputs_embeds is not None:
821
+ raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
822
+ elif input_ids is not None:
823
+ input_shape = input_ids.size()
824
+ input_ids = input_ids.view(-1, input_shape[-1])
825
+ batch_size = input_ids.shape[0]
826
+ elif inputs_embeds is not None:
827
+ input_shape = inputs_embeds.size()[:-1]
828
+ batch_size = inputs_embeds.shape[0]
829
+ else:
830
+ raise ValueError("You have to specify either input_ids or inputs_embeds")
831
+
832
+ device = input_ids.device if input_ids is not None else inputs_embeds.device
833
+
834
+ if token_type_ids is not None:
835
+ token_type_ids = token_type_ids.view(-1, input_shape[-1])
836
+ if position_ids is not None:
837
+ position_ids = position_ids.view(-1, input_shape[-1])
838
+
839
+ if past_key_values is None:
840
+ past_length = 0
841
+ past_key_values = tuple([None] * len(self.h))
842
+ else:
843
+ past_length = past_key_values[0][0].size(-2)
844
+ if position_ids is None:
845
+ position_ids = torch.arange(past_length, input_shape[-1] + past_length, dtype=torch.long, device=device)
846
+ position_ids = position_ids.unsqueeze(0).view(-1, input_shape[-1])
847
+
848
+ # GPTJXAttention mask.
849
+ if attention_mask is not None:
850
+ if batch_size <= 0:
851
+ raise ValueError("batch_size has to be defined and > 0")
852
+ attention_mask = attention_mask.view(batch_size, -1)
853
+ # We create a 4D attention mask from a 2D tensor mask.
854
+ # Sizes are [batch_size, 1, 1, to_seq_length]
855
+ # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]
856
+ # this attention mask is simpler than the triangular masking of causal attention
857
+ # used in OpenAI GPT, we just need to prepare the broadcast dimension here.
858
+ attention_mask = attention_mask[:, None, None, :]
859
+
860
+ # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
861
+ # masked positions, this operation will create a tensor which is 0.0 for
862
+ # positions we want to attend and the dtype's smallest value for masked positions.
863
+ # Since we are adding it to the raw scores before the softmax, this is
864
+ # effectively the same as removing these entirely.
865
+ attention_mask = attention_mask.to(dtype=self.dtype) # fp16 compatibility
866
+ attention_mask = (1.0 - attention_mask) * torch.finfo(self.dtype).min
867
+
868
+ # If a 2D or 3D attention mask is provided for the cross-attention
869
+ # we need to make it broadcastable to [batch_size, num_heads, seq_length, seq_length]
870
+ if self.config.add_cross_attention and encoder_hidden_states is not None:
871
+ encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()
872
+ encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)
873
+ if encoder_attention_mask is None:
874
+ encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)
875
+ encoder_attention_mask = self.invert_attention_mask(encoder_attention_mask)
876
+ else:
877
+ encoder_attention_mask = None
878
+
879
+ # Prepare head mask if needed
880
+ # 1.0 in head_mask indicates we keep the head
881
+ # attention_probs has shape bsz x n_heads x N x N
882
+ # head_mask has shape n_layer x batch x n_heads x N x N
883
+ head_mask = self.get_head_mask(head_mask, self.config.n_layer)
884
+
885
+ if inputs_embeds is None:
886
+ inputs_embeds = self.wte(input_ids)
887
+ if self.wpe is not None:
888
+ position_embeds = self.wpe(position_ids)
889
+ hidden_states = inputs_embeds + position_embeds
890
+ else:
891
+ hidden_states = inputs_embeds
892
+ hidden_states *= torch.tensor(
893
+ float(self.embeddings_scale), dtype=hidden_states.dtype, device=hidden_states.device
894
+ )
895
+
896
+ if token_type_ids is not None:
897
+ token_type_embeds = self.wte(token_type_ids)
898
+ hidden_states = hidden_states + token_type_embeds
899
+
900
+ hidden_states = self.drop(hidden_states)
901
+
902
+ if self.relative_pe is not None:
903
+ length = input_shape[-1]  # sequence length; also valid when only inputs_embeds is passed
904
+ cached_kv_length = 0
905
+ cached_kv = past_key_values[0]
906
+ if cached_kv is not None:
907
+ cached_kv_length = cached_kv[0].shape[-2]
908
+ position_bias = self.relative_pe(length, length, cached_kv_length)
909
+ else:
910
+ position_bias = None
911
+
912
+ output_shape = input_shape + (hidden_states.size(-1),)
913
+
914
+ if self.gradient_checkpointing and self.training:
915
+ if use_cache:
916
+ logger.warning_once(
917
+ "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
918
+ )
919
+ use_cache = False
920
+
921
+ presents = () if use_cache else None
922
+ all_self_attentions = () if output_attentions else None
923
+ all_cross_attentions = () if output_attentions and self.config.add_cross_attention else None
924
+ all_hidden_states = () if output_hidden_states else None
925
+ for i, (block, layer_past) in enumerate(zip(self.h, past_key_values)):
926
+ # Model parallel
927
+ if self.model_parallel:
928
+ torch.cuda.set_device(hidden_states.device)
929
+ # Ensure layer_past is on same device as hidden_states (might not be correct)
930
+ if layer_past is not None:
931
+ layer_past = tuple(past_state.to(hidden_states.device) for past_state in layer_past)
932
+ # Ensure that attention_mask is always on the same device as hidden_states
933
+ if attention_mask is not None:
934
+ attention_mask = attention_mask.to(hidden_states.device)
935
+ if isinstance(head_mask, torch.Tensor):
936
+ head_mask = head_mask.to(hidden_states.device)
937
+ if output_hidden_states:
938
+ all_hidden_states = all_hidden_states + (hidden_states,)
939
+
940
+ if self.gradient_checkpointing and self.training:
941
+
942
+ def create_custom_forward(module):
943
+ def custom_forward(*inputs):
944
+ # None for past_key_value; position_bias is forwarded so ALiBi is kept under checkpointing
945
+ return module(*inputs, use_cache, output_attentions, position_bias)
946
+
947
+ return custom_forward
948
+
949
+ outputs = torch.utils.checkpoint.checkpoint(
950
+ create_custom_forward(block),
951
+ hidden_states,
952
+ None,
953
+ attention_mask,
954
+ head_mask[i],
955
+ encoder_hidden_states,
956
+ encoder_attention_mask,
957
+ )
958
+ else:
959
+ outputs = block(
960
+ hidden_states,
961
+ layer_past=layer_past,
962
+ attention_mask=attention_mask,
963
+ head_mask=head_mask[i],
964
+ encoder_hidden_states=encoder_hidden_states,
965
+ encoder_attention_mask=encoder_attention_mask,
966
+ use_cache=use_cache,
967
+ output_attentions=output_attentions,
968
+ position_bias=position_bias,
969
+ )
970
+
971
+ hidden_states = outputs[0]
972
+ if use_cache is True:
973
+ presents = presents + (outputs[1],)
974
+
975
+ if output_attentions:
976
+ all_self_attentions = all_self_attentions + (outputs[2 if use_cache else 1],)
977
+ if self.config.add_cross_attention:
978
+ all_cross_attentions = all_cross_attentions + (outputs[3 if use_cache else 2],)
979
+
980
+ # Model Parallel: If it's the last layer for that device, put things on the next device
981
+ if self.model_parallel:
982
+ for k, v in self.device_map.items():
983
+ if i == v[-1] and "cuda:" + str(k) != self.last_device:
984
+ hidden_states = hidden_states.to("cuda:" + str(k + 1))
985
+
986
+ hidden_states = self.ln_f(hidden_states)
987
+
988
+ hidden_states = hidden_states.view(output_shape)
989
+ # Add last hidden state
990
+ if output_hidden_states:
991
+ all_hidden_states = all_hidden_states + (hidden_states,)
992
+
993
+ if not return_dict:
994
+ return tuple(
995
+ v
996
+ for v in [hidden_states, presents, all_hidden_states, all_self_attentions, all_cross_attentions]
997
+ if v is not None
998
+ )
999
+
1000
+ return BaseModelOutputWithPastAndCrossAttentions(
1001
+ last_hidden_state=hidden_states,
1002
+ past_key_values=presents,
1003
+ hidden_states=all_hidden_states,
1004
+ attentions=all_self_attentions,
1005
+ cross_attentions=all_cross_attentions,
1006
+ )
1007
+
1008
+
1009
+ @add_start_docstrings(
1010
+ """
1011
+ The GPTJX Model transformer with a language modeling head on top (linear layer with weights tied to the input
1012
+ embeddings).
1013
+ """,
1014
+ START_DOCSTRING,
1015
+ )
1016
+ class GPTJXForCausalLM(GPTJXPreTrainedModel):
1017
+ _keys_to_ignore_on_load_missing = [r"lm_head.weight"]
1018
+ _keys_to_ignore_on_load_unexpected = [r"h\.\d+\.attn\.masked_bias", r"h\.\d+\.attn\.bias"]
1019
+
1020
+ def __init__(self, config):
1021
+ super().__init__(config)
1022
+ self.transformer = GPTJXModel(config)
1023
+ self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
1024
+ self.output_logits_scale = config.mup_output_alpha * config.mup_width_scale
1025
+
1026
+ # Model parallel
1027
+ self.model_parallel = False
1028
+ self.device_map = None
1029
+
1030
+ # Initialize weights and apply final processing
1031
+ self.post_init()
1032
+
1033
+ @add_start_docstrings(PARALLELIZE_DOCSTRING)
1034
+ def parallelize(self, device_map=None):
1035
+ warnings.warn(
1036
+ "`GPTJXForCausalLM.parallelize` is deprecated and will be removed in v5 of Transformers, you should load"
1037
+ " your model with `device_map='balanced'` in the call to `from_pretrained`. You can also provide your own"
1038
+ " `device_map` but it needs to be a dictionary module_name to device, so for instance {'transformer.h.0':"
1039
+ " 0, 'transformer.h.1': 1, ...}",
1040
+ FutureWarning,
1041
+ )
1042
+ self.device_map = (
1043
+ get_device_map(len(self.transformer.h), range(torch.cuda.device_count()))
1044
+ if device_map is None
1045
+ else device_map
1046
+ )
1047
+ assert_device_map(self.device_map, len(self.transformer.h))
1048
+ self.transformer.parallelize(self.device_map)
1049
+ self.lm_head = self.lm_head.to(self.transformer.first_device)
1050
+ self.model_parallel = True
1051
+
1052
+ @add_start_docstrings(DEPARALLELIZE_DOCSTRING)
1053
+ def deparallelize(self):
1054
+ warnings.warn(
1055
+ "Like `parallelize`, `deparallelize` is deprecated and will be removed in v5 of Transformers.",
1056
+ FutureWarning,
1057
+ )
1058
+ self.transformer.deparallelize()
1059
+ self.transformer = self.transformer.to("cpu")
1060
+ self.lm_head = self.lm_head.to("cpu")
1061
+ self.model_parallel = False
1062
+ torch.cuda.empty_cache()
1063
+
1064
+ def get_output_embeddings(self):
1065
+ return self.lm_head
1066
+
1067
+ def set_output_embeddings(self, new_embeddings):
1068
+ self.lm_head = new_embeddings
1069
+
1070
+ def prepare_inputs_for_generation(self, input_ids, past_key_values=None, inputs_embeds=None, **kwargs):
1071
+ token_type_ids = kwargs.get("token_type_ids", None)
1072
+ # keep only the last token of input_ids when past_key_values is provided
1073
+ if past_key_values:
1074
+ input_ids = input_ids[:, -1].unsqueeze(-1)
1075
+ if token_type_ids is not None:
1076
+ token_type_ids = token_type_ids[:, -1].unsqueeze(-1)
1077
+
1078
+ attention_mask = kwargs.get("attention_mask", None)
1079
+ position_ids = kwargs.get("position_ids", None)
1080
+
1081
+ if attention_mask is not None and position_ids is None:
1082
+ # create position_ids on the fly for batch generation
1083
+ position_ids = attention_mask.long().cumsum(-1) - 1
1084
+ position_ids.masked_fill_(attention_mask == 0, 1)
1085
+ if past_key_values:
1086
+ position_ids = position_ids[:, -1].unsqueeze(-1)
1087
+ else:
1088
+ position_ids = None
1089
+
1090
+ # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
1091
+ if inputs_embeds is not None and past_key_values is None:
1092
+ model_inputs = {"inputs_embeds": inputs_embeds}
1093
+ else:
1094
+ model_inputs = {"input_ids": input_ids}
1095
+
1096
+ model_inputs.update(
1097
+ {
1098
+ "past_key_values": past_key_values,
1099
+ "use_cache": kwargs.get("use_cache"),
1100
+ "position_ids": position_ids,
1101
+ "attention_mask": attention_mask,
1102
+ "token_type_ids": token_type_ids,
1103
+ }
1104
+ )
1105
+ return model_inputs
1106
+
1107
+ @add_start_docstrings_to_model_forward(INPUTS_DOCSTRING)
1108
+ @add_code_sample_docstrings(
1109
+ checkpoint=_CHECKPOINT_FOR_DOC,
1110
+ output_type=CausalLMOutputWithCrossAttentions,
1111
+ config_class=_CONFIG_FOR_DOC,
1112
+ )
1113
+ def forward(
1114
+ self,
1115
+ input_ids: Optional[torch.LongTensor] = None,
1116
+ past_key_values: Optional[Tuple[Tuple[torch.Tensor]]] = None,
1117
+ attention_mask: Optional[torch.FloatTensor] = None,
1118
+ token_type_ids: Optional[torch.LongTensor] = None,
1119
+ position_ids: Optional[torch.LongTensor] = None,
1120
+ head_mask: Optional[torch.FloatTensor] = None,
1121
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1122
+ encoder_hidden_states: Optional[torch.Tensor] = None,
1123
+ encoder_attention_mask: Optional[torch.FloatTensor] = None,
1124
+ labels: Optional[torch.LongTensor] = None,
1125
+ use_cache: Optional[bool] = None,
1126
+ output_attentions: Optional[bool] = None,
1127
+ output_hidden_states: Optional[bool] = None,
1128
+ return_dict: Optional[bool] = None,
1129
+ ) -> Union[Tuple, CausalLMOutputWithCrossAttentions]:
1130
+ r"""
1131
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
1132
+ Labels for language modeling. Note that the labels **are shifted** inside the model, i.e. you can set
1133
+ `labels = input_ids`. Indices are selected in `[-100, 0, ..., config.vocab_size - 1]`. All labels set to `-100`
1134
+ are ignored (masked); the loss is only computed for labels in `[0, ..., config.vocab_size - 1]`.
1135
+ """
1136
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1137
+
1138
+ transformer_outputs = self.transformer(
1139
+ input_ids,
1140
+ past_key_values=past_key_values,
1141
+ attention_mask=attention_mask,
1142
+ token_type_ids=token_type_ids,
1143
+ position_ids=position_ids,
1144
+ head_mask=head_mask,
1145
+ inputs_embeds=inputs_embeds,
1146
+ encoder_hidden_states=encoder_hidden_states,
1147
+ encoder_attention_mask=encoder_attention_mask,
1148
+ use_cache=use_cache,
1149
+ output_attentions=output_attentions,
1150
+ output_hidden_states=output_hidden_states,
1151
+ return_dict=return_dict,
1152
+ )
1153
+ hidden_states = transformer_outputs[0]
1154
+
1155
+ # Set device for model parallelism
1156
+ if self.model_parallel:
1157
+ torch.cuda.set_device(self.transformer.first_device)
1158
+ hidden_states = hidden_states.to(self.lm_head.weight.device)
1159
+
1160
+ lm_logits = self.lm_head(hidden_states)
1161
+ lm_logits *= torch.tensor(float(self.output_logits_scale), dtype=lm_logits.dtype, device=lm_logits.device)
1162
+
1163
+ loss = None
1164
+ if labels is not None:
1165
+ # move labels to correct device to enable model parallelism
1166
+ labels = labels.to(lm_logits.device)
1167
+ # Shift so that tokens < n predict n
1168
+ shift_logits = lm_logits[..., :-1, :].contiguous()
1169
+ shift_labels = labels[..., 1:].contiguous()
1170
+ # Flatten the tokens
1171
+ loss_fct = CrossEntropyLoss()
1172
+ loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
1173
+
1174
+ if not return_dict:
1175
+ output = (lm_logits,) + transformer_outputs[1:]
1176
+ return ((loss,) + output) if loss is not None else output
1177
+
1178
+ return CausalLMOutputWithCrossAttentions(
1179
+ loss=loss,
1180
+ logits=lm_logits,
1181
+ past_key_values=transformer_outputs.past_key_values,
1182
+ hidden_states=transformer_outputs.hidden_states,
1183
+ attentions=transformer_outputs.attentions,
1184
+ cross_attentions=transformer_outputs.cross_attentions,
1185
+ )
1186
+
1187
+ @staticmethod
1188
+ def _reorder_cache(
1189
+ past_key_values: Tuple[Tuple[torch.Tensor]], beam_idx: torch.Tensor
1190
+ ) -> Tuple[Tuple[torch.Tensor]]:
1191
+ """
1192
+ This function is used to re-order the `past_key_values` cache if [`~PreTrainedModel.beam_search`] or
1193
+ [`~PreTrainedModel.beam_sample`] is called. This is required to match `past_key_values` with the correct
1194
+ beam_idx at every generation step.
1195
+ """
1196
+ return tuple(
1197
+ tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past)
1198
+ for layer_past in past_key_values
1199
+ )
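The two least obvious steps in `GPTJXForCausalLM` above are how `prepare_inputs_for_generation` derives `position_ids` from the attention mask and how `forward` shifts logits against labels before computing the loss. The following is a minimal pure-Python sketch of both steps for illustration only. It uses nested lists instead of tensors, handles a single sequence in the loss, and the helper names `position_ids_from_mask` and `shifted_cross_entropy` are hypothetical, not part of the uploaded file.

```python
import math


def position_ids_from_mask(attention_mask):
    """Mirror `cumsum(-1) - 1` followed by `masked_fill_(mask == 0, 1)`."""
    position_ids = []
    for row in attention_mask:
        running = 0
        ids = []
        for m in row:
            running += m
            # Real tokens count up from 0; padded slots get the dummy value 1.
            ids.append(running - 1 if m else 1)
        position_ids.append(ids)
    return position_ids


def shifted_cross_entropy(logits, labels, ignore_index=-100):
    """Mean next-token loss: logits at step t are scored against labels[t + 1]."""
    total, count = 0.0, 0
    # Dropping the last logit row and the first label is the "shift".
    for step_logits, target in zip(logits[:-1], labels[1:]):
        if target == ignore_index:
            continue  # masked labels do not contribute to the mean
        peak = max(step_logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in step_logits))
        total += log_z - step_logits[target]
        count += 1
    return total / count if count else float("nan")


# A left-padded row: the two padding slots receive the placeholder position 1.
print(position_ids_from_mask([[0, 0, 1, 1, 1]]))
```

Because the shift happens inside the model, callers can pass `labels = input_ids` unchanged, which is what the `forward` docstring describes.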
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c99feaec6acd74a23d674bbc52ca50b7f1109a3537bcb883a85b104905945cde
3
+ size 5292652837
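The three lines above are a Git LFS pointer, not the weights themselves: they record the hash algorithm, the digest, and the byte size of the real `pytorch_model.bin`. As a hedged sketch, a downloaded copy could be checked against the pointer as follows. The helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical and use only the Python standard library.

```python
import hashlib

# Pointer contents copied from the diff above.
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:c99feaec6acd74a23d674bbc52ca50b7f1109a3537bcb883a85b104905945cde
size 5292652837
"""


def parse_lfs_pointer(text):
    """Split each `key value` line and break the oid into algorithm and digest."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line.strip())
    algorithm, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algorithm": algorithm,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Stream the file in chunks and compare its size and digest to the pointer."""
    hasher = hashlib.new(pointer["algorithm"])
    size = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
            size += len(chunk)
    return size == pointer["size"] and hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["algorithm"], pointer["size"])
```

Streaming in chunks keeps memory use flat, which matters for a file of several gigabytes.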
special_tokens_map.json ADDED
@@ -0,0 +1,5 @@
1
+ {
2
+ "bos_token": "<|endoftext|>",
3
+ "eos_token": "<|endoftext|>",
4
+ "unk_token": "<|endoftext|>"
5
+ }
tokenization_gptjx.py ADDED
@@ -0,0 +1,306 @@
1
+ """ Tokenization classes for GPT-JX """
2
+ import json
3
+ import os
4
+ from functools import lru_cache
5
+ from typing import TYPE_CHECKING, List, Optional, Tuple
6
+
7
+ import regex as re
8
+
9
+ from transformers.tokenization_utils import AddedToken, PreTrainedTokenizer
10
+ from transformers.utils import logging
11
+
12
+
13
+ if TYPE_CHECKING:
14
+ from transformers.pipelines.conversational import Conversation
15
+
16
+ logger = logging.get_logger(__name__)
17
+
18
+ VOCAB_FILES_NAMES = {
19
+ "vocab_file": "vocab.json",
20
+ "merges_file": "merges.txt",
21
+ }
22
+
23
+ @lru_cache()
24
+ def bytes_to_unicode():
25
+ """
26
+ Returns a mapping from utf-8 bytes to unicode strings. We specifically avoid mapping to whitespace/control
27
+ characters the bpe code barfs on.
28
+
29
+ The reversible bpe codes work on unicode strings. This means you need a large # of unicode characters in your vocab
30
+ if you want to avoid UNKs. When you're at something like a 10B token dataset you end up needing around 5K for
31
+ decent coverage. This is a significant percentage of your normal, say, 32K bpe vocab. To avoid that, we want lookup
32
+ tables between utf-8 bytes and unicode strings.
33
+ """
34
+ bs = (
35
+ list(range(ord("!"), ord("~") + 1)) + list(range(ord("¡"), ord("¬") + 1)) + list(range(ord("®"), ord("ÿ") + 1))
36
+ )
37
+ cs = bs[:]
38
+ n = 0
39
+ for b in range(2**8):
40
+ if b not in bs:
41
+ bs.append(b)
42
+ cs.append(2**8 + n)
43
+ n += 1
44
+ cs = [chr(n) for n in cs]
45
+ return dict(zip(bs, cs))
46
+
47
+
48
+ def get_pairs(word):
49
+ """
50
+ Return set of symbol pairs in a word.
51
+
52
+ Word is represented as tuple of symbols (symbols being variable-length strings).
53
+ """
54
+ pairs = set()
55
+ prev_char = word[0]
56
+ for char in word[1:]:
57
+ pairs.add((prev_char, char))
58
+ prev_char = char
59
+ return pairs
60
+
61
+
62
+ class GPTJXTokenizer(PreTrainedTokenizer):
63
+ """
64
+ Construct a GPT-JX tokenizer. Based on byte-level Byte-Pair-Encoding.
65
+
66
+ This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece) so a word will
67
+ be encoded differently depending on whether it is at the beginning of the sentence (without space) or not.
68
+
69
+ You can get around that behavior by passing `add_prefix_space=True` when instantiating this tokenizer or when you
70
+ call it on some text, but since the model was not pretrained this way, it might yield a decrease in performance.
71
+
72
+ <Tip>
73
+
74
+ When used with `is_split_into_words=True`, this tokenizer will add a space before each word (even the first one).
75
+
76
+ </Tip>
77
+
78
+ This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to
79
+ this superclass for more information regarding those methods.
80
+
81
+ Args:
82
+ vocab_file (`str`):
83
+ Path to the vocabulary file.
84
+ merges_file (`str`):
85
+ Path to the merges file.
86
+ errors (`str`, *optional*, defaults to `"replace"`):
87
+ Paradigm to follow when decoding bytes to UTF-8. See
88
+ [bytes.decode](https://docs.python.org/3/library/stdtypes.html#bytes.decode) for more information.
89
+ unk_token (`str`, *optional*, defaults to `<|endoftext|>`):
90
+ The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
91
+ token instead.
92
+ bos_token (`str`, *optional*, defaults to `<|endoftext|>`):
93
+ The beginning of sequence token.
94
+ eos_token (`str`, *optional*, defaults to `<|endoftext|>`):
95
+ The end of sequence token.
96
+ add_prefix_space (`bool`, *optional*, defaults to `False`):
97
+ Whether or not to add an initial space to the input. This allows the leading word to be treated just like any
98
+ other word. (The GPT-JX tokenizer detects the beginning of words by the preceding space.)
99
+ """
100
+ model_input_names = ["input_ids", "attention_mask"]
101
+
102
+ def __init__(
103
+ self,
104
+ vocab_file,
105
+ merges_file,
106
+ errors="replace",
107
+ unk_token="<|endoftext|>",
108
+ bos_token="<|endoftext|>",
109
+ eos_token="<|endoftext|>",
110
+ pad_token=None,
111
+ add_prefix_space=False,
112
+ add_bos_token=False,
113
+ **kwargs,
114
+ ):
115
+ bos_token = AddedToken(bos_token, lstrip=False, rstrip=False) if isinstance(bos_token, str) else bos_token
116
+ eos_token = AddedToken(eos_token, lstrip=False, rstrip=False) if isinstance(eos_token, str) else eos_token
117
+ unk_token = AddedToken(unk_token, lstrip=False, rstrip=False) if isinstance(unk_token, str) else unk_token
118
+ pad_token = AddedToken(pad_token, lstrip=False, rstrip=False) if isinstance(pad_token, str) else pad_token
119
+ super().__init__(
120
+ errors=errors,
121
+ unk_token=unk_token,
122
+ bos_token=bos_token,
123
+ eos_token=eos_token,
124
+ pad_token=pad_token,
125
+ add_prefix_space=add_prefix_space,
126
+ add_bos_token=add_bos_token,
127
+ **kwargs,
128
+ )
129
+ self.add_bos_token = add_bos_token
130
+
131
+ with open(vocab_file, encoding="utf-8") as vocab_handle:
132
+ self.encoder = json.load(vocab_handle)
133
+ self.decoder = {v: k for k, v in self.encoder.items()}
134
+ self.errors = errors # how to handle errors in decoding
135
+ self.byte_encoder = bytes_to_unicode()
136
+ self.byte_decoder = {v: k for k, v in self.byte_encoder.items()}
137
+ with open(merges_file, encoding="utf-8") as merges_handle:
138
+ bpe_merges = merges_handle.read().split("\n")[1:-1]
139
+ bpe_merges = [tuple(merge.split()) for merge in bpe_merges]
140
+ self.bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges))))
141
+ self.cache = {}
142
+ self.add_prefix_space = add_prefix_space
143
+
144
+ # Should have added re.IGNORECASE so BPE merges can happen for capitalized versions of contractions
145
+ self.pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
146
+
147
+ @property
148
+ def vocab_size(self):
149
+ return len(self.encoder)
150
+
151
+ def get_vocab(self):
152
+ return dict(self.encoder, **self.added_tokens_encoder)
153
+
154
+ def bpe(self, token):
155
+ if token in self.cache:
156
+ return self.cache[token]
157
+ word = tuple(token)
158
+ pairs = get_pairs(word)
159
+
160
+ if not pairs:
161
+ return token
162
+
163
+ while True:
164
+ bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float("inf")))
165
+ if bigram not in self.bpe_ranks:
166
+ break
167
+ first, second = bigram
168
+ new_word = []
169
+ i = 0
170
+ while i < len(word):
171
+ try:
172
+ j = word.index(first, i)
173
+ except ValueError:
174
+ new_word.extend(word[i:])
175
+ break
176
+ else:
177
+ new_word.extend(word[i:j])
178
+ i = j
179
+
180
+ if word[i] == first and i < len(word) - 1 and word[i + 1] == second:
181
+ new_word.append(first + second)
182
+ i += 2
183
+ else:
184
+ new_word.append(word[i])
185
+ i += 1
186
+ new_word = tuple(new_word)
187
+ word = new_word
188
+ if len(word) == 1:
189
+ break
190
+ else:
191
+ pairs = get_pairs(word)
192
+ word = " ".join(word)
193
+ self.cache[token] = word
194
+ return word
195
+
196
+ def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
197
+ if self.add_bos_token:
198
+ bos_token_ids = [self.bos_token_id]
199
+ else:
200
+ bos_token_ids = []
201
+
202
+ output = bos_token_ids + token_ids_0
203
+
204
+ if token_ids_1 is None:
205
+ return output
206
+
207
+ return output + bos_token_ids + token_ids_1
208
+
209
+ def get_special_tokens_mask(
210
+ self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
211
+ ) -> List[int]:
212
+ """
213
+ Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding
214
+ special tokens using the tokenizer `prepare_for_model` or `encode_plus` methods.
215
+
216
+ Args:
217
+ token_ids_0 (`List[int]`):
218
+ List of IDs.
219
+ token_ids_1 (`List[int]`, *optional*):
220
+ Optional second list of IDs for sequence pairs.
221
+ already_has_special_tokens (`bool`, *optional*, defaults to `False`):
222
+ Whether or not the token list is already formatted with special tokens for the model.
223
+
224
+ Returns:
225
+ `List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
226
+ """
227
+ if already_has_special_tokens:
228
+ return super().get_special_tokens_mask(
229
+ token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True
230
+ )
231
+
232
+ if not self.add_bos_token:
233
+ return super().get_special_tokens_mask(
234
+ token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=False
235
+ )
236
+
237
+ if token_ids_1 is None:
238
+ return [1] + ([0] * len(token_ids_0))
239
+ return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1))
240
+
241
+ def _tokenize(self, text):
242
+ """Tokenize a string."""
243
+ bpe_tokens = []
244
+ for token in re.findall(self.pat, text):
245
+ token = "".join(
246
+ self.byte_encoder[b] for b in token.encode("utf-8")
247
+ ) # Maps all our bytes to unicode strings, avoiding control tokens of the BPE (spaces in our case)
248
+ bpe_tokens.extend(bpe_token for bpe_token in self.bpe(token).split(" "))
249
+ return bpe_tokens
250
+
251
+ def _convert_token_to_id(self, token):
252
+ """Converts a token (str) into an id using the vocab."""
253
+ return self.encoder.get(token, self.encoder.get(self.unk_token))
254
+
255
+ def _convert_id_to_token(self, index):
256
+ """Converts an index (integer) into a token (str) using the vocab."""
257
+ return self.decoder.get(index)
258
+
259
+ def convert_tokens_to_string(self, tokens):
260
+ """Converts a sequence of tokens (strings) into a single string."""
261
+ text = "".join(tokens)
262
+ text = bytearray([self.byte_decoder[c] for c in text]).decode("utf-8", errors=self.errors)
263
+ return text
264
+
265
+ def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
266
+ if not os.path.isdir(save_directory):
267
+ logger.error(f"Vocabulary path ({save_directory}) should be a directory")
268
+ return
269
+ vocab_file = os.path.join(
270
+ save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
271
+ )
272
+ merge_file = os.path.join(
273
+ save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["merges_file"]
274
+ )
275
+
276
+ with open(vocab_file, "w", encoding="utf-8") as f:
277
+ f.write(json.dumps(self.encoder, indent=2, sort_keys=True, ensure_ascii=False) + "\n")
278
+
279
+ index = 0
280
+ with open(merge_file, "w", encoding="utf-8") as writer:
281
+ writer.write("#version: 0.2\n")
282
+ for bpe_tokens, token_index in sorted(self.bpe_ranks.items(), key=lambda kv: kv[1]):
283
+ if index != token_index:
284
+ logger.warning(
285
+ f"Saving vocabulary to {merge_file}: BPE merge indices are not consecutive."
286
+ " Please check that the tokenizer is not corrupted!"
287
+ )
288
+ index = token_index
289
+ writer.write(" ".join(bpe_tokens) + "\n")
290
+ index += 1
291
+
292
+ return vocab_file, merge_file
293
+
294
+ def prepare_for_tokenization(self, text, is_split_into_words=False, **kwargs):
295
+ add_prefix_space = kwargs.pop("add_prefix_space", self.add_prefix_space)
296
+ if is_split_into_words or add_prefix_space:
297
+ text = " " + text
298
+ return (text, kwargs)
299
+
300
+ def _build_conversation_input_ids(self, conversation: "Conversation") -> List[int]:
301
+ input_ids = []
302
+ for is_user, text in conversation.iter_texts():
303
+ input_ids.extend(self.encode(text, add_special_tokens=False) + [self.eos_token_id])
304
+ if len(input_ids) > self.model_max_length:
305
+ input_ids = input_ids[-self.model_max_length :]
306
+ return input_ids
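To make the byte-level BPE path in `tokenization_gptjx.py` concrete, here is a small self-contained sketch. `bytes_to_unicode` follows the function in the file; the merge loop is a simplified equivalent of `GPTJXTokenizer.bpe` (it scans linearly instead of using `word.index`), and the three-entry merge table `TOY_MERGES` is invented for illustration, since the real `merges.txt` is not shown in this diff.

```python
def bytes_to_unicode():
    """Map every byte 0..255 to a printable unicode character, as in the file."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(2**8):
        if b not in bs:
            # Whitespace and control bytes are shifted to code points >= 256.
            bs.append(b)
            cs.append(2**8 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))


def get_pairs(word):
    """Return the set of adjacent symbol pairs in a tuple of symbols."""
    return set(zip(word, word[1:]))


def bpe(token, bpe_ranks):
    """Repeatedly merge the lowest-ranked adjacent pair until none is known."""
    word = tuple(token)
    while len(word) > 1:
        bigram = min(get_pairs(word), key=lambda pair: bpe_ranks.get(pair, float("inf")))
        if bigram not in bpe_ranks:
            break
        first, second = bigram
        merged, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and word[i] == first and word[i + 1] == second:
                merged.append(first + second)
                i += 2
            else:
                merged.append(word[i])
                i += 1
        word = tuple(merged)
    return list(word)


byte_encoder = bytes_to_unicode()
byte_decoder = {v: k for k, v in byte_encoder.items()}
SPACE = byte_encoder[ord(" ")]  # the visible stand-in for a leading space

# Hypothetical merge table; lower index means higher priority.
TOY_MERGES = [("l", "o"), ("lo", "w"), (SPACE, "low")]
TOY_RANKS = {pair: rank for rank, pair in enumerate(TOY_MERGES)}


def encode_piece(text):
    """Byte-encode one pre-tokenized piece, then apply the toy merges."""
    mapped = "".join(byte_encoder[b] for b in text.encode("utf-8"))
    return bpe(mapped, TOY_RANKS)


def decode_pieces(pieces):
    """Invert the byte mapping, as `convert_tokens_to_string` does."""
    return bytearray(byte_decoder[c] for c in "".join(pieces)).decode("utf-8", errors="replace")


pieces = encode_piece(" lower")
print(pieces, repr(decode_pieces(pieces)))
```

Because every byte has a printable stand-in, any UTF-8 input round-trips without an unknown token, which is why the file can reuse `<|endoftext|>` as `unk_token`.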
tokenizer_config.json ADDED
@@ -0,0 +1,9 @@
1
+ {
2
+ "add_prefix_space": false,
3
+ "bos_token": "<|endoftext|>",
4
+ "clean_up_tokenization_spaces": true,
5
+ "eos_token": "<|endoftext|>",
6
+ "model_max_length": 8192,
7
+ "tokenizer_class": "GPT2Tokenizer",
8
+ "unk_token": "<|endoftext|>"
9
+ }
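The `model_max_length` of 8192 in this config is the limit that `_build_conversation_input_ids` in `tokenization_gptjx.py` enforces by keeping only the most recent tokens. The sketch below reproduces that truncation with plain lists. The function name `build_conversation_ids`, the pre-encoded turns, and the `eos_token_id` value of 0 are assumptions made for illustration; the real id depends on `vocab.json`, which is not rendered in this diff.

```python
import json

# Config contents copied from the diff above.
TOKENIZER_CONFIG = json.loads("""
{
  "add_prefix_space": false,
  "bos_token": "<|endoftext|>",
  "clean_up_tokenization_spaces": true,
  "eos_token": "<|endoftext|>",
  "model_max_length": 8192,
  "tokenizer_class": "GPT2Tokenizer",
  "unk_token": "<|endoftext|>"
}
""")


def build_conversation_ids(encoded_turns, eos_token_id, max_length):
    """Join turns with an EOS after each, then keep only the last `max_length` ids."""
    input_ids = []
    for turn in encoded_turns:
        input_ids.extend(turn + [eos_token_id])
    if len(input_ids) > max_length:
        # Truncate from the left so the most recent turns survive.
        input_ids = input_ids[-max_length:]
    return input_ids


# Two short turns with a tiny limit so the truncation is visible.
print(build_conversation_ids([[11, 12], [13]], eos_token_id=0, max_length=4))
print(TOKENIZER_CONFIG["model_max_length"])
```

Truncating from the left favors recent context, at the cost of possibly cutting an early turn in the middle.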
vocab.json ADDED
The diff for this file is too large to render. See raw diff