
Important Note: I did not create this MIT-licensed model; I discovered and downloaded it. It was taken down by its creators, so I am re-uploading it. More info:

https://github.com/huggingface/transformers/issues/25723

Model Description

GPT-JX is a 3 billion parameter autoregressive foundational large language model pre-trained on 1.1 trillion tokens of high-quality, cleaned, and deduplicated English text and code. GPT-JX uses the base architecture of the traditional Transformer decoder with slight changes, which are discussed later. The pre-training corpus covers English text and 20 programming languages. GPT-JX shows impressive performance when compared to large language models with 7 billion parameters such as LLaMa-7B-v2, Falcon-7B, and MPT-7B.

Model Architecture

We made slight changes to the traditional Transformer decoder to create the base architecture for GPT-JX. The changes are listed below:

  • We used the SwiGLU activation function in GPT-JX's feed-forward layers instead of ReLU.

  • Attention with Linear Biases (ALiBi) provides positional information for GPT-JX, instead of the absolute positional embeddings of the traditional Transformer decoder or the Rotary Positional Embeddings used in GPT-J and GPT-NeoX. A minimal sketch of both changes follows this list.
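
To make these two changes concrete, below is a minimal, illustrative PyTorch sketch of a SwiGLU feed-forward block and the additive ALiBi attention bias. This is our own sketch for explanation only; the module and argument names are not those of the released checkpoint.

import torch
import torch.nn as nn

class SwiGLU(nn.Module):
    """Minimal SwiGLU feed-forward block: a SiLU-gated linear unit
    followed by a projection back to the model dimension."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU(x) = (SiLU(x W_gate) * (x W_up)) W_down
        return self.w_down(nn.functional.silu(self.w_gate(x)) * self.w_up(x))

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Additive ALiBi bias: each head penalises attention to distant
    (past) positions with a head-specific linear slope."""
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    # distance[i, j] = j - i  (negative for past positions)
    distance = torch.arange(seq_len)[None, :] - torch.arange(seq_len)[:, None]
    # Future positions are handled by the causal mask anyway; clamp them to 0.
    return slopes[:, None, None] * distance.clamp(max=0)[None, :, :]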

Below are GPT-JX's architectural specs:

  • Trainable Parameters: 2,646,255,776
  • Number of Layers (n_layers): 32
  • Dimension of the Model (d_model): 2560
  • Dimension of the Feed-Forward Network (d_ff): 6826
  • Number of Heads (n_heads): 32
  • Dimension of each Head (d_head): 80
  • Sequence Length (n_ctx): 8192
  • Vocab Size (n_vocab): 50257
  • Positional Embedding: ALiBi
  • Tokenizer: GPT-2/GPT-3

GPT-JX was trained with a vocabulary size of 50257, using the same set of BPEs as GPT-2/GPT-3.
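
Since the vocabulary is shared with GPT-2/GPT-3, the stock GPT-2 tokenizer illustrates the same BPE vocabulary. The snippet below only demonstrates that shared vocabulary; it does not load GPT-JX's own tokenizer files.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")   # same BPE merges as GPT-2/GPT-3
print(tok.vocab_size)                         # 50257
print(tok.encode("GPT-JX reuses the GPT-2 BPE vocabulary"))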

Unsupervised Training Data(Pre-Training Data)

GPT-JX was pre-trained on a high-quality, cleaned, and deduplicated dataset mixture consisting of:

  • 600B tokens of Common Crawl English text from RefinedWeb-Text.

  • 175B tokens of code in 20 programming languages from The-Stack-Dedup.

  • 327B tokens from SlimPajama (C4, GitHub, Wikipedia, ArXiv, StackExchange, Gutenberg Books).

In total, the pre-training data sums to 1.1 trillion tokens.

Brief Description of the Datasets

  • RefinedWeb-Text is a high-quality, deduplicated English Common Crawl text dataset released by the Technology Innovation Institute.

  • The-Stack-Dedup is a cleaned and deduplicated version of The-Stack; the dataset covers 300+ programming languages and was released by BigCode.

  • SlimPajama is a cleaned, high-quality, and deduplicated version of RedPajama-Data released by Cerebras; the dataset contains English text from Common Crawl, C4, GitHub, Wikipedia, StackExchange, and Gutenberg Books.

Data Mixture Proportion

| Dataset | Data Proportion | Tokens |
| --- | --- | --- |
| RefinedWeb-Text | 54.4% | 600B |
| The-Stack-Dedup | 15.9% | 175B |
| SlimPajama | 29.7% | 327B |
| Total Tokens | --- | 1.1T |
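
As a quick sanity check, each proportion is simply that subset's token count divided by the roughly 1.1T-token total:

tokens_b = {"RefinedWeb-Text": 600, "The-Stack-Dedup": 175, "SlimPajama": 327}  # billions
total = sum(tokens_b.values())                  # 1102B, i.e. ~1.1T
for name, t in tokens_b.items():
    print(f"{name}: {100 * t / total:.1f}%")    # 54.4%, 15.9%, 29.7%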

Information: GPT-JX was trained on 726 A100 40GB GPUs sponsored by StabilityAI and Cerebras; special thanks to StabilityAI and Cerebras for sharing their GPUs.

Libraries and Inference

Libraries required to use GPT-JX (accelerate is needed for the device_map="auto" option used below):

pip install torch transformers accelerate

GPT-JX is currently only compatible with the Auto classes of the Transformers library.

Load GPT-JX using the Transformers Auto classes:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_repo = "alien-ai/gpt-jx-3b"
# Load the weights in half precision and let device_map="auto" place them
# on the available GPU(s)/CPU.
model = AutoModelForCausalLM.from_pretrained(
    model_repo, torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_repo)
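
A minimal generation example (the prompt and sampling parameters below are illustrative, not recommendations from the original authors):

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
        pad_token_id=tokenizer.eos_token_id,  # the GPT-2 BPE vocabulary has no pad token
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))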

In the future, we plan to release our own Python package to perform inference and fine-tune our models in an efficient and user-friendly way.

Intended Use and Limitations

GPT-JX learns an inner representation of the English language, as well as of programming languages, that can be used to extract features useful for downstream tasks. However, the model is best at what it was pre-trained for, which is generating text from a prompt.
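
For example, the model's hidden states can serve as features. The sketch below (reusing the model and tokenizer loaded in the inference example above) shows one straightforward way to do this with the Transformers API; it is not an officially recommended feature-extraction recipe.

inputs = tokenizer("GPT-JX as a feature extractor", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)
features = out.hidden_states[-1]   # shape: (batch, seq_len, d_model) = (1, n_tokens, 2560)
print(features.shape)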

Out-of-scope use

GPT-JX is not intended for deployment without fine-tuning, supervision, and/or moderation. It is not in itself a product and cannot be used for human-facing interactions. For example, the model may generate harmful or offensive text. Please evaluate the risks associated with your particular use case.

GPT-JX was trained on an English-language only dataset, and is thus not suitable for translation or generating text in other languages.

Limitations and Biases

The core functionality of GPT-JX is taking a string of text and predicting the next token. While language models are widely used for tasks other than this, there are a lot of unknowns with this work. When prompting GPT-JX it is important to remember that the statistically most likely next token is often not the token that produces the most "accurate" text. Never depend upon GPT-JX to produce factually accurate output.
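
To see this next-token behaviour directly, one can inspect the model's predicted distribution for a prompt (again reusing the model and tokenizer loaded above; the prompt is illustrative):

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]   # scores for the next token
top = torch.topk(next_token_logits.softmax(dim=-1), k=5)
for p, idx in zip(top.values, top.indices):
    # The highest-probability token is not guaranteed to be factually correct.
    print(f"{tokenizer.decode(idx.item())!r:>12}  {p.item():.3f}")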

Evaluation

Below are some evaluation results for GPT-JX in comparison to LLaMa-7B-v2 and Falcon-7B.

| Model | Average |
| --- | --- |
| GPT-JX | 51.9 |
| Falcon-7B | 53.5 |
| LLaMa-7B-v2 | 55 |

License

We release GPT-JX under the MIT License.

Citation

@article{vaswani2017attention,
  title={Attention Is All You Need},
  author={Ashish Vaswani and Noam Shazeer and Niki Parmar and Jakob Uszkoreit and Llion Jones and Aidan N. Gomez and Lukasz Kaiser and Illia Polosukhin},
  journal={arXiv preprint arXiv:1706.03762},
  eprint={1706.03762},
  eprinttype={arXiv},
  url={https://arxiv.org/abs/1706.03762},
  year={2017}
}
@article{refinedweb,
  title={The {R}efined{W}eb dataset for {F}alcon {LLM}: outperforming curated corpora with web data, and web data only},
  author={Guilherme Penedo and Quentin Malartic and Daniel Hesslow and Ruxandra Cojocaru and Alessandro Cappelli and Hamza Alobeidli and Baptiste Pannier and Ebtesam Almazrouei and Julien Launay},
  journal={arXiv preprint arXiv:2306.01116},
  eprint={2306.01116},
  eprinttype = {arXiv},
  url={https://arxiv.org/abs/2306.01116},
  year={2023}
}
@article{shazeer2020glu,
  title={GLU Variants Improve Transformer},
  author={Noam Shazeer},
  journal={arXiv preprint arXiv:2002.05202},
  eprint={2002.05202},
  eprinttype={arXiv},
  url={https://arxiv.org/abs/2002.05202},
  year={2020}
}
@article{Kocetkov2022TheStack,
  title={The Stack: 3 TB of permissively licensed source code},
  author={Kocetkov, Denis and Li, Raymond and Ben Allal, Loubna and Li, Jia and Mou, Chenghao and Muñoz Ferrandis, Carlos and Jernite, Yacine and Mitchell, Margaret and Hughes, Sean and Wolf, Thomas and Bahdanau, Dzmitry and von Werra, Leandro and de Vries, Harm},
  journal={Preprint},
  eprint={2211.15533},
  eprinttype={arXiv},
  url={https://arxiv.org/abs/2211.15533},
  year={2022}
}