---
license: mit
datasets:
- tiiuae/falcon-refinedweb
- bigcode/the-stack-dedup
- cerebras/SlimPajama-627B
language:
- en
- code
library_name: transformers
pipeline_tag: text-generation
inference: false
tags:
- text-generation-inference
---

**Important Note:** I have not created this MIT-licensed model; I discovered it and downloaded it. It was taken down by its creators, so I am reuploading it. More info: https://github.com/huggingface/transformers/issues/25723

### **Model Description**

***GPT-JX*** is a **3 billion parameter** autoregressive foundational large language model pre-trained on **1.1 trillion tokens** of *high-quality*, *cleaned* and *deduplicated* English text and code. ***GPT-JX*** uses the base architecture of the traditional *Transformer decoder* with **slight changes**, which are discussed later. ***GPT-JX*** was pre-trained on tokens covering **English text** and **20 programming languages**. ***GPT-JX*** shows impressive performance when compared to **large language models with 7 billion parameters** such as **LLaMa-7B-v2, Falcon-7B & MPT-7B**.

### **Model Architecture**

We made slight changes to the traditional *Transformer decoder* to create the base architecture of ***GPT-JX***. The changes are listed below (a small illustrative sketch of these two components follows the pre-training data section):

- We used the **SwiGLU activation function** in the architecture of ***GPT-JX*** instead of ReLU.
- **Attention with Linear Biases (ALiBi)** was used as the positional encoding for ***GPT-JX*** instead of absolute positional embeddings (as used in the traditional *Transformer decoder*) and *Rotary Position Embeddings* (as used in **GPT-J** & **GPT-NeoX**).

***Below are GPT-JX's architectural specs:***

- **Trainable Parameters:** *2,646,255,776*
- **Number of Layers (nlayers):** *32*
- **Dimension of the Model (dmodel):** *2560*
- **Dimension of the Feed-Forward Network (dff):** *6826*
- **Number of Heads (nheads):** *32*
- **Dimension of each Head (dhead):** *80*
- **Sequence Length (nctx):** *8192*
- **Vocab Size (nvocab):** *50257*
- **Positional Embedding:** *ALiBi*
- **Tokenizer:** *GPT-2/GPT-3*

***GPT-JX*** was trained with a vocabulary size of 50257, using the same set of BPEs as GPT-2/GPT-3.

### **Unsupervised Training Data (Pre-Training Data)**

***GPT-JX*** was pre-trained on a *high-quality, cleaned* and *deduplicated* dataset mixture consisting of:

- **600B tokens** of Common Crawl English text from **RefinedWeb-Text**.
- **175B tokens** of code in 20 programming languages from **The-Stack-Dedup**.
- **327B tokens** from **SlimPajama** (*C4, GitHub, Wikipedia, ArXiv, StackExchange, Gutenberg Books*).

In total, the pre-training data sums to **1.1 trillion tokens**.

***Brief description of the datasets:***

- **RefinedWeb-Text** is a high-quality, deduplicated English **Common Crawl** text dataset released by the **Technology Innovation Institute**.
- **The-Stack-Dedup** is a cleaned and deduplicated version of **The-Stack**; the dataset covers *300+ programming languages* and was released by **BigCode**.
- **SlimPajama** is a cleaned, high-quality and deduplicated version of **RedPajama-Data**; the dataset contains English text from ***Common Crawl, C4, GitHub, Wikipedia, StackExchange and Gutenberg Books*** and was released by **Cerebras**.

***Data Mixture Proportion***

| Dataset | Data Proportion | Tokens |
|---|---|---|
| **RefinedWeb-Text** | **54.4%** | **600B** |
| **The-Stack-Dedup** | **15.9%** | **175B** |
| **SlimPajama** | **29.7%** | **327B** |
| **Total Tokens** | | **1.1T** |
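As a quick sanity check, the proportions in the table above can be reproduced from the per-dataset token counts. The snippet below is plain Python with the values copied from the table; it is only a worked check of the arithmetic, not part of any training pipeline.

```python
# Reproduce the data-mixture proportions from the token counts in the table above.
# Values are in billions of tokens, copied verbatim from the table.
token_counts_b = {"RefinedWeb-Text": 600, "The-Stack-Dedup": 175, "SlimPajama": 327}
total_b = sum(token_counts_b.values())  # 1102B tokens, i.e. ~1.1T

for name, tokens in token_counts_b.items():
    print(f"{name}: {tokens}B tokens, {100 * tokens / total_b:.1f}% of the mixture")
print(f"Total: {total_b}B tokens (~{total_b / 1000:.1f}T)")
```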
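As noted in the **Model Architecture** section above, the two departures from the traditional Transformer decoder are the SwiGLU activation in the feed-forward block and ALiBi in place of absolute or rotary position embeddings. The sketch below is a minimal, illustrative PyTorch rendering of those two components only; it is not the released implementation. The module structure, the bias-free linear layers, and the slope formula (which assumes the head count is a power of two, as the 32 heads here are) are assumptions following the SwiGLU and ALiBi papers, with dimensions taken from the spec list.

```python
# Illustrative sketch (not the released implementation) of the two architectural
# changes described above: a SwiGLU feed-forward block and ALiBi attention biases.
import torch
import torch.nn as nn


class SwiGLUFeedForward(nn.Module):
    """Feed-forward block using SwiGLU instead of ReLU: (SiLU(x W_g) * x W_u) W_d."""

    def __init__(self, d_model: int = 2560, d_ff: int = 6826):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(nn.functional.silu(self.gate(x)) * self.up(x))


def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Build the (n_heads, seq_len, seq_len) ALiBi bias added to attention scores.

    Each head gets a fixed slope m; the bias for attending from position i to
    position j is -m * (i - j), so more distant tokens are penalised linearly.
    """
    # Geometric slopes as in the ALiBi paper, assuming n_heads is a power of two.
    start = 2 ** (-8.0 / n_heads)
    slopes = torch.tensor([start ** (h + 1) for h in range(n_heads)])
    positions = torch.arange(seq_len)
    distance = positions[None, :] - positions[:, None]   # entry (i, j) is j - i
    bias = slopes[:, None, None] * distance[None, :, :]  # shape (heads, q, k)
    # Only causal entries (j <= i) matter; masked positions get -inf elsewhere.
    return bias


if __name__ == "__main__":
    # Demo with small dimensions; the real model uses d_model=2560, d_ff=6826.
    ffn = SwiGLUFeedForward(d_model=64, d_ff=170)
    print(ffn(torch.randn(1, 4, 64)).shape)         # torch.Size([1, 4, 64])
    print(alibi_bias(n_heads=32, seq_len=8).shape)  # torch.Size([32, 8, 8])
```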

**Information:** GPT-JX was trained on 726 A100 40GB GPUs sponsored by StabilityAI and Cerebras; special thanks to StabilityAI and Cerebras for sharing their GPUs.

### **Libraries and Inference**

Libraries required to use **GPT-JX**:

```
pip install torch transformers
```

***GPT-JX*** is currently only compatible with the ***Auto classes of the Transformers library***. Load ***GPT-JX*** using the Transformers Auto classes (a short generation example is sketched further down, after the License section):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_repo = "alien-ai/gpt-jx-3b"
model = AutoModelForCausalLM.from_pretrained(
    model_repo,
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_repo)
```

*In the future we are planning to release our own Python package to perform inference and fine-tune our models in an efficient and user-friendly way.*

### **Intended Use and Limitations**

***GPT-JX*** learns an inner representation of the English language as well as programming languages that can be used to extract features useful for downstream tasks. The model is, however, best at what it was pre-trained for, which is generating text from a prompt.

### **Out-of-scope use**

***GPT-JX*** is **not** intended for deployment without fine-tuning, supervision, and/or moderation. It is not in itself a product and cannot be used for human-facing interactions. For example, the model may generate harmful or offensive text. Please evaluate the risks associated with your particular use case.

***GPT-JX*** was trained on an English-language-only dataset, and is thus **not** suitable for translation or generating text in other languages.

### **Limitations and Biases**

The core functionality of ***GPT-JX*** is taking a string of text and predicting the next token. While language models are widely used for tasks other than this, there are a lot of unknowns with this work. When prompting ***GPT-JX***, it is important to remember that the statistically most likely next token is often not the token that produces the most **"accurate"** text. Never depend upon ***GPT-JX*** to produce factually accurate output.

### **Evaluation**

Below are some evaluation results for ***GPT-JX*** in comparison to **Falcon-7B and LLaMa-7B-v2**.

| Model | Average |
|---|---|
| **GPT-JX** | 51.9 |
| **Falcon-7B** | 53.5 |
| **LLaMa-7B-v2** | 55 |

### **License**

We release ***GPT-JX*** under the **MIT License** (the license provided by the Massachusetts Institute of Technology).
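As referenced in the Libraries and Inference section above, here is a hedged usage sketch showing how text generation with the loaded model and tokenizer might look. The repository id is the one given above; the prompt and the generation parameters (`max_new_tokens`, `temperature`, `top_p`) are illustrative choices, not recommendations from the model authors.

```python
# Minimal generation sketch, continuing from the loading snippet in
# "Libraries and Inference" above. Prompt and sampling settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_repo = "alien-ai/gpt-jx-3b"
model = AutoModelForCausalLM.from_pretrained(
    model_repo, torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_repo)

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=128,      # length of the completion
        do_sample=True,          # sample instead of greedy decoding
        temperature=0.7,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 BPE has no dedicated pad token
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```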
### **Citation**

```bibtex
@article{vaswani2017attention,
  title={Attention Is All You Need},
  author={Ashish Vaswani and Noam Shazeer and Niki Parmar and Jakob Uszkoreit and Llion Jones and Aidan N. Gomez and Lukasz Kaiser and Illia Polosukhin},
  journal={arXiv preprint arXiv:1706.03762},
  eprint={1706.03762},
  eprinttype={arXiv},
  url={https://arxiv.org/abs/1706.03762},
  year={2017}
}
```

```bibtex
@article{refinedweb,
  title={The {R}efined{W}eb dataset for {F}alcon {LLM}: outperforming curated corpora with web data, and web data only},
  author={Guilherme Penedo and Quentin Malartic and Daniel Hesslow and Ruxandra Cojocaru and Alessandro Cappelli and Hamza Alobeidli and Baptiste Pannier and Ebtesam Almazrouei and Julien Launay},
  journal={arXiv preprint arXiv:2306.01116},
  eprint={2306.01116},
  eprinttype={arXiv},
  url={https://arxiv.org/abs/2306.01116},
  year={2023}
}
```

```bibtex
@article{shazeer2020glu,
  title={GLU Variants Improve Transformer},
  author={Noam Shazeer},
  journal={arXiv preprint arXiv:2002.05202},
  eprint={2002.05202},
  eprinttype={arXiv},
  url={https://arxiv.org/abs/2002.05202},
  year={2020}
}
```

```bibtex
@article{Kocetkov2022TheStack,
  title={The Stack: 3 TB of permissively licensed source code},
  author={Kocetkov, Denis and Li, Raymond and Ben Allal, Loubna and Li, Jia and Mou, Chenghao and Muñoz Ferrandis, Carlos and Jernite, Yacine and Mitchell, Margaret and Hughes, Sean and Wolf, Thomas and Bahdanau, Dzmitry and von Werra, Leandro and de Vries, Harm},
  journal={Preprint},
  eprint={2211.15533},
  eprinttype={arXiv},
  url={https://arxiv.org/abs/2211.15533},
  year={2022}
}
```