---
language:
- en
tags:
- pytorch
- causal-lm
datasets:
- The Pile
- tiny_shakespeare
---

# GPT-J 6B Shakespeare

**Note: I thought I had fixed the batch size bug, but apparently not, so this checkpoint was trained with an effective batch size of 1. I'm in the process of fixing it and retraining.**

The "Hosted inference API" widget does not work for this model. See the "How to use" section below or use [this Colab notebook](https://colab.research.google.com/drive/1gK_iTV3HKgNUEpzuVZuEVMyZn9K9aLpO?usp=sharing).
## Model Description

GPT-J 6B is a transformer model trained using Ben Wang's [Mesh Transformer JAX](https://github.com/kingoflolz/mesh-transformer-jax/). "GPT-J" refers to the class of model, while "6B" represents the number of trainable parameters.

This checkpoint is the original [GPT-J 6B](https://huggingface.co/EleutherAI/gpt-j-6B) finetuned on [tiny_shakespeare](https://huggingface.co/datasets/tiny_shakespeare).
## Training data

GPT-J 6B was trained on [the Pile](https://pile.eleuther.ai), a large-scale curated dataset created by [EleutherAI](https://www.eleuther.ai).

This checkpoint was afterwards finetuned on [tiny_shakespeare](https://huggingface.co/datasets/tiny_shakespeare) by [crumb](https://huggingface.co/crumb) (me).

> 40,000 lines of Shakespeare from a variety of Shakespeare's plays. Featured in Andrej Karpathy's blog post 'The Unreasonable Effectiveness of Recurrent Neural Networks': http://karpathy.github.io/2015/05/21/rnn-effectiveness/.
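
If you want a quick look at the finetuning corpus, here is a small sketch using 🤗 `datasets`. It assumes the Hub copy of tiny_shakespeare keeps the whole corpus in a single "text" row per split, and some `datasets` versions may ask for `trust_remote_code=True` for this script-based dataset.

```python
# Sketch: inspect the tiny_shakespeare corpus used for finetuning.
from datasets import load_dataset

# Some datasets versions may require trust_remote_code=True for this dataset.
shakespeare = load_dataset("tiny_shakespeare")
text = shakespeare["train"][0]["text"]  # the whole corpus as one string

print(f"{len(text.splitlines()):,} lines")
print("\n".join(text.splitlines()[:8]))  # first few lines of the play text
```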
## Training Procedure

| Parameter                | Value  |
|--------------------------|--------|
| epochs                   | 1      |
| learning rate            | 0.002  |
| weight decay             | 0.01   |
| schedule type            | linear |
| warmup steps             | 500    |
| batch size               | 8 (intended; see the note above — a bug made the effective batch size 1) |
| context length (tokens)  | 256    |

I used a modified version of [hivemind's 8bit training script](https://huggingface.co/hivemind/gpt-j-6B-8bit) on a single Tesla T4 for about 15 minutes.

No LoRA adapters were used, for the sake of easy loading and inference with 🤗 Transformers. Finetuning was done traditionally (all parameters were passed to the optimizer).

End loss: 0.1757839471101761
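
For reference, here is a minimal sketch of an equivalent setup using the plain 🤗 `Trainer` with the hyperparameters from the table above. It is not the modified 8-bit script that was actually used (so it needs far more memory than a T4 has), and the 256-token chunking is an assumption about how the data was prepared.

```python
# Hedged sketch of an equivalent finetuning configuration (not the actual
# 8-bit training script); hyperparameters mirror the table above.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
tokenizer.pad_token = tokenizer.eos_token  # GPT-J has no pad token by default
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

# tiny_shakespeare stores the corpus as one "text" field per split;
# tokenize it and cut it into fixed 256-token training examples (assumed).
text = load_dataset("tiny_shakespeare")["train"][0]["text"]
ids = tokenizer(text)["input_ids"]
train_dataset = [
    {"input_ids": ids[i : i + 256]} for i in range(0, len(ids) - 256, 256)
]

args = TrainingArguments(
    output_dir="gpt-j-6b-shakespeare",
    num_train_epochs=1,
    learning_rate=2e-3,
    weight_decay=0.01,
    lr_scheduler_type="linear",
    warmup_steps=500,
    per_device_train_batch_size=8,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```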
## Intended Use and Limitations

(same as [gpt-j-6b](https://huggingface.co/EleutherAI/gpt-j-6B))
GPT-J learns an inner representation of the English language that can be used to extract features useful for downstream tasks. The model is, however, best at what it was pretrained for, which is generating text from a prompt.
### How to use

[auto-generated inference script](https://huggingface.co/crumb/gpt-j-6b-shakespeare/blob/main/shakespeare-inference.py)

[inference notebook](https://huggingface.co/crumb/gpt-j-6b-shakespeare/blob/main/shakespeare-inference.ipynb)
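
For a quick start outside the notebook, here is a minimal sketch assuming the checkpoint loads through the standard `transformers` auto classes; since the hosted widget does not work, you may need the custom 8-bit loading code from the notebook above instead. The prompt and sampling parameters are only illustrative.

```python
# Minimal inference sketch; assumes the checkpoint loads with the standard
# transformers auto classes (the linked notebook has the 8-bit loading path).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "crumb/gpt-j-6b-shakespeare"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
model.eval()

prompt = "ROMEO:"  # any Shakespeare-style prompt
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
    )
print(tokenizer.decode(output[0], skip_special_tokens=True))
```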
### Limitations and Biases

(same as [gpt-j-6b](https://huggingface.co/EleutherAI/gpt-j-6B))

The core functionality of GPT-J is taking a string of text and predicting the next token. While language models are widely used for tasks other than this, there are a lot of unknowns with this work. When prompting GPT-J it is important to remember that the statistically most likely next token is often not the token that produces the most "accurate" text. Never depend upon GPT-J to produce factually accurate output.

GPT-J was trained on the Pile, a dataset known to contain profanity, lewd, and otherwise abrasive language. Depending upon use case GPT-J may produce socially unacceptable text. See [Sections 5 and 6 of the Pile paper](https://arxiv.org/abs/2101.00027) for a more detailed analysis of the biases in the Pile.

As with all language models, it is hard to predict in advance how GPT-J will respond to particular prompts and offensive content may occur without warning. We recommend having a human curate or filter the outputs before releasing them, both to censor undesirable content and to improve the quality of the results.
## To do

- clean up training code & create a GitHub repo for training-related models
- see if converting to fp16 or fp32 fixes the hosted inference on the model card
## Citations and Related Information

```bibtex
@misc{gpt-j,
  author = {Wang, Ben and Komatsuzaki, Aran},
  title = {{GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model}},
  howpublished = {\url{https://github.com/kingoflolz/mesh-transformer-jax}},
  year = 2021,
  month = May
}
```

```bibtex
@misc{mesh-transformer-jax,
  author = {Wang, Ben},
  title = {{Mesh-Transformer-JAX: Model-Parallel Implementation of Transformer Language Model with JAX}},
  howpublished = {\url{https://github.com/kingoflolz/mesh-transformer-jax}},
  year = 2021,
  month = May
}
```
100
+ @misc{
101
+ author={Karpathy, Andrej},
102
+ title={char-rnn},
103
+ year={2015},
104
+ howpublished={\url{https://github.com/karpathy/char-rnn}}
105
+ }
106
+ ```