---
language:
- en
tags:
- pytorch
- causal-lm
datasets:
- The Pile
- tiny_shakespeare
inference: false
---
# GPT-J 6b Shakespeare
1. The "Hosted inference API" is turned off. See the "How to use" section to run the model yourself.
2. This is a proof of concept and is not fully trained. A simple training notebook is linked in the "Training Procedure" section.
## Model Description
GPT-J 6B is a transformer model trained using Ben Wang's [Mesh Transformer JAX](https://github.com/kingoflolz/mesh-transformer-jax/). "GPT-J" refers to the class of model, while "6B" represents the number of trainable parameters.
This checkpoint is a finetuned version of the original [GPT-J 6B](https://huggingface.co/EleutherAI/gpt-j-6B), trained on [tiny_shakespeare](https://huggingface.co/datasets/tiny_shakespeare).
## Training data
GPT-J 6B was trained on [the Pile](https://pile.eleuther.ai), a large-scale curated dataset created by [EleutherAI](https://www.eleuther.ai).
This checkpoint was afterwards finetuned on [tiny_shakespeare](https://huggingface.co/datasets/tiny_shakespeare) by [crumb](https://huggingface.co/crumb) (me).
> 40,000 lines of Shakespeare from a variety of Shakespeare's plays. Featured in Andrej Karpathy's blog post 'The Unreasonable Effectiveness of Recurrent Neural Networks': http://karpathy.github.io/2015/05/21/rnn-effectiveness/.
## Training Procedure
| Parameter               | Value |
|-------------------------|-------|
| epochs                  | 1     |
| learning rate           | 0.002 |
| weight decay            | 0.01  |
| batch size              | 8     |
| context length (tokens) | 256   |
Trained on a single Tesla T4 GPU on [Google Colab](https://colab.research.google.com/).
```
TrainOutput(global_step=147, training_loss=1.665000240818984, metrics={'train_runtime': 2828.7347, 'train_samples_per_second': 0.417, 'train_steps_per_second': 0.052, 'total_flos': 1555992281088.0, 'train_loss': 1.665000240818984, 'epoch': 1.0})
```
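As a rough sanity check on the numbers above, the following sketch estimates how much data the single epoch covered. It assumes every optimizer step consumed one full batch of 8 sequences of 256 tokens with no gradient accumulation, which this card does not state, so treat the result as an estimate. The variable names are illustrative.

```python
# Values copied from the hyperparameter table and the TrainOutput above.
global_step = 147          # optimizer steps in the epoch
batch_size = 8             # sequences per step (assumed: no gradient accumulation)
context_length = 256       # tokens per sequence (assumed: every sequence is full length)
train_runtime_s = 2828.7347

# Estimated data volume for the epoch.
sequences_seen = global_step * batch_size          # 1176
tokens_seen = sequences_seen * context_length      # 301056

# Throughput derived from the reported runtime.
steps_per_second = global_step / train_runtime_s
seconds_per_step = train_runtime_s / global_step
tokens_per_second = tokens_seen / train_runtime_s

print(f"sequences seen: {sequences_seen}")
print(f"tokens seen:    {tokens_seen}")
print(f"steps/s:        {steps_per_second:.3f}")
print(f"s/step:         {seconds_per_step:.1f}")
print(f"tokens/s:       {tokens_per_second:.1f}")
```

The derived steps per second rounds to the reported `train_steps_per_second` of 0.052, which suggests the step count and runtime are being read correctly.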
A good starting point for finetuning your own gpt-j-6b is [hivemind's 8-bit training code](https://huggingface.co/hivemind/gpt-j-6B-8bit) or the notebook in [this repository](https://github.com/aicrumb/gpt-j-8bit), which you can download and open in [Google Colab](https://colab.research.google.com/) or any other ipynb service.
No LoRA adapters were used, for the sake of easy loading and inference with 🤗. Only Linear biases and LayerNorm scales were passed to the optimizer.
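A minimal sketch of that kind of parameter selection in plain PyTorch, shown on a toy model rather than GPT-J. The helper name `select_bias_and_layernorm_scale`, the toy network, and the choice of AdamW are assumptions for illustration (this card does not name the optimizer); only the learning rate and weight decay come from the table above. The sketch reads "LayerNorm scales" literally and leaves LayerNorm biases frozen. In the 8-bit model the linear layers may be a custom class rather than `nn.Linear`, in which case the type check would need adjusting.

```python
import torch
from torch import nn


def select_bias_and_layernorm_scale(model: nn.Module) -> list:
    """Freeze everything, then re-enable Linear biases and LayerNorm weights.

    Hypothetical helper: returns the list of parameters left trainable.
    """
    for param in model.parameters():
        param.requires_grad_(False)

    selected = []
    for module in model.modules():
        if isinstance(module, nn.Linear) and module.bias is not None:
            module.bias.requires_grad_(True)
            selected.append(module.bias)
        elif isinstance(module, nn.LayerNorm) and module.weight is not None:
            # The LayerNorm "scale" is its weight; its bias stays frozen here.
            module.weight.requires_grad_(True)
            selected.append(module.weight)
    return selected


# Toy stand-in for a transformer block (assumption, not the GPT-J architecture).
toy = nn.Sequential(nn.Linear(4, 8), nn.LayerNorm(8), nn.Linear(8, 2))
trainable = select_bias_and_layernorm_scale(toy)

# Optimizer choice is an assumption; lr and weight decay follow the table.
optimizer = torch.optim.AdamW(trainable, lr=0.002, weight_decay=0.01)

n_trainable = sum(p.numel() for p in trainable)
n_total = sum(p.numel() for p in toy.parameters())
print(f"trainable parameters: {n_trainable} of {n_total}")
```

Passing only the selected tensors to the optimizer keeps optimizer state small, which matters when the rest of the model barely fits on a single GPU.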
## Intended Use and Limitations
(same as [gpt-j-6b](https://huggingface.co/EleutherAI/gpt-j-6B))
GPT-J learns an inner representation of the English language that can be used to extract features useful for downstream tasks. However, the model is best at what it was pretrained for: generating text from a prompt.
### How to use
```python
# Run in a notebook: install pinned libraries and a wrapper around hivemind's quantization code
!pip install transformers==4.14.1 bitsandbytes-cuda111==0.26.0 git+https://github.com/aicrumb/transformers-8bit -q
import transformers_8bit

# Load the 8-bit model, tokenizer, and config onto the GPU
model, tokenizer, config = transformers_8bit.load_gptj("crumb/gpt-j-6b-shakespeare", device='cuda')

# Tokenize the prompt and move its tensors to the GPU
prompt = tokenizer("Romeo:", return_tensors='pt')
prompt = {key: value.to('cuda') for key, value in prompt.items()}

# Sample until the sequence is exactly 64 tokens long (prompt included); EOS doubles as the pad token
out = model.generate(**prompt, min_length=64, max_length=64, do_sample=True, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0]))
""" example output
Romeo: [Aside] And but in night, how tedious
Is the day's celebration!
JULIET: [Aside] O me! how quick skips time!
Bid Time himself look out And, after no long date,
Call time up o'er-head,
"""
```
### Limitations and Biases
(same as [gpt-j-6b](https://huggingface.co/EleutherAI/gpt-j-6B))
The core functionality of GPT-J is taking a string of text and predicting the next token. While language models are widely used for tasks other than this, there are a lot of unknowns with this work. When prompting GPT-J it is important to remember that the statistically most likely next token is often not the token that produces the most "accurate" text. Never depend upon GPT-J to produce factually accurate output.
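To make the point about "most likely" versus sampled tokens concrete, here is a small standard-library sketch of one next-token step. The four-word vocabulary and the logits are made up for illustration and do not come from GPT-J. Greedy decoding always picks the highest-probability token, while sampling (what `do_sample=True` requests in the usage snippet) draws from the whole distribution.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical vocabulary and next-token scores (not real model output).
vocab = ["the", "thou", "thy", "love"]
logits = [2.0, 1.5, 0.5, 0.1]
probs = softmax(logits)

# Greedy decoding: always take the single most likely token.
greedy_token = vocab[probs.index(max(probs))]

# Sampling: draw tokens in proportion to their probability.
rng = random.Random(0)  # fixed seed so the run is repeatable
sampled_tokens = rng.choices(vocab, weights=probs, k=5)

print("probabilities:", [round(p, 3) for p in probs])
print("greedy pick:  ", greedy_token)
print("sampled picks:", sampled_tokens)
```

Neither strategy checks whether the chosen token is true or appropriate; both only follow the model's probabilities.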
GPT-J was trained on the Pile, a dataset known to contain profanity, lewd, and otherwise abrasive language. Depending upon use case GPT-J may produce socially unacceptable text. See [Sections 5 and 6 of the Pile paper](https://arxiv.org/abs/2101.00027) for a more detailed analysis of the biases in the Pile.
As with all language models, it is hard to predict in advance how GPT-J will respond to particular prompts and offensive content may occur without warning. We recommend having a human curate or filter the outputs before releasing them, both to censor undesirable content and to improve the quality of the results.
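One way to support human curation is a cheap automatic pre-filter that routes suspicious generations to a reviewer. The sketch below is a deliberately simple keyword check using only the standard library. The `needs_review` helper, the placeholder blocklist, and the sample strings are all assumptions for illustration; a keyword list will miss a great deal and is not a substitute for human review.

```python
import re

# Placeholder terms (assumption): replace with a list suited to your use case.
BLOCKLIST = {"damn", "hell"}


def needs_review(text, blocklist=BLOCKLIST):
    """Return True if any blocklisted word appears in the text (case-insensitive)."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return not words.isdisjoint(blocklist)


# Hypothetical generations standing in for model output.
candidates = [
    "Romeo: O gentle night!",
    "Juliet: Damn this tedious day.",
]

auto_ok = [c for c in candidates if not needs_review(c)]
flagged = [c for c in candidates if needs_review(c)]

print("passed pre-filter:", auto_ok)
print("sent to a human:  ", flagged)
```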
## To do:
- clean up the training code and create a GitHub repo for training related models
- see if converting to fp16 or fp32 fixes the hosted inference widget on the model card
## Citations and Related Information
```bibtex
@misc{gpt-j,
author = {Wang, Ben and Komatsuzaki, Aran},
title = {{GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model}},
howpublished = {\url{https://github.com/kingoflolz/mesh-transformer-jax}},
year = 2021,
month = May
}
```
```bibtex
@misc{mesh-transformer-jax,
author = {Wang, Ben},
title = {{Mesh-Transformer-JAX: Model-Parallel Implementation of Transformer Language Model with JAX}},
howpublished = {\url{https://github.com/kingoflolz/mesh-transformer-jax}},
year = 2021,
month = May
}
```
```bibtex
@misc{char-rnn,
author={Karpathy, Andrej},
title={char-rnn},
year={2015},
howpublished={\url{https://github.com/karpathy/char-rnn}}
}
```