---
language:
- en
tags:
- pytorch
- causal-lm
datasets:
- The Pile
- tiny_shakespeare
---

# GPT-J 6B Shakespeare

**Note: I thought I had fixed the batch size bug, but apparently not, so this checkpoint was trained with an effective batch size of 1. I'm in the process of fixing it and retraining.**

The "Hosted inference API" widget does not work for this model. See the "How to use" section below or use [this Colab notebook](https://colab.research.google.com/drive/1gK_iTV3HKgNUEpzuVZuEVMyZn9K9aLpO?usp=sharing).
## Model Description

GPT-J 6B is a transformer model trained using Ben Wang's [Mesh Transformer JAX](https://github.com/kingoflolz/mesh-transformer-jax/). "GPT-J" refers to the class of model, while "6B" represents the number of trainable parameters.

This checkpoint is the original [GPT-J 6B](https://huggingface.co/EleutherAI/gpt-j-6B) finetuned on [tiny_shakespeare](https://huggingface.co/datasets/tiny_shakespeare).
## Training data

GPT-J 6B was trained on [the Pile](https://pile.eleuther.ai), a large-scale curated dataset created by [EleutherAI](https://www.eleuther.ai).

This checkpoint was afterwards finetuned on [tiny_shakespeare](https://huggingface.co/datasets/tiny_shakespeare) by [crumb](https://huggingface.co/crumb) (me).

> 40,000 lines of Shakespeare from a variety of Shakespeare's plays. Featured in Andrej Karpathy's blog post 'The Unreasonable Effectiveness of Recurrent Neural Networks': http://karpathy.github.io/2015/05/21/rnn-effectiveness/.
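
If you want a quick look at the finetuning corpus, here is a small sketch using 🤗 `datasets`. It assumes the Hub copy of tiny_shakespeare keeps the whole corpus in a single "text" row per split, and some `datasets` versions may ask for `trust_remote_code=True` for this script-based dataset.

```python
# Sketch: inspect the tiny_shakespeare corpus used for finetuning.
from datasets import load_dataset

# Some datasets versions may require trust_remote_code=True for this dataset.
shakespeare = load_dataset("tiny_shakespeare")
text = shakespeare["train"][0]["text"]  # the whole corpus as one string

print(f"{len(text.splitlines()):,} lines")
print("\n".join(text.splitlines()[:8]))  # first few lines of the play text
```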
## Training Procedure

| Parameter                | Value  |
|--------------------------|--------|
| epochs                   | 1      |
| learning rate            | 0.002  |
| weight decay             | 0.01   |
| schedule type            | linear |
| warmup steps             | 500    |
| batch size               | 8 (intended; see the note above — a bug made the effective batch size 1) |
| context length (tokens)  | 256    |

I used a modified version of [hivemind's 8bit training script](https://huggingface.co/hivemind/gpt-j-6B-8bit) on a single Tesla T4 for about 15 minutes.

No LoRA adapters were used, for the sake of easy loading and inference with 🤗 Transformers. Finetuning was done traditionally (all parameters were passed to the optimizer).

End loss: 0.1757839471101761
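
For reference, here is a minimal sketch of an equivalent setup using the plain 🤗 `Trainer` with the hyperparameters from the table above. It is not the modified 8-bit script that was actually used (so it needs far more memory than a T4 has), and the 256-token chunking is an assumption about how the data was prepared.

```python
# Hedged sketch of an equivalent finetuning configuration (not the actual
# 8-bit training script); hyperparameters mirror the table above.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
tokenizer.pad_token = tokenizer.eos_token  # GPT-J has no pad token by default
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

# tiny_shakespeare stores the corpus as one "text" field per split;
# tokenize it and cut it into fixed 256-token training examples (assumed).
text = load_dataset("tiny_shakespeare")["train"][0]["text"]
ids = tokenizer(text)["input_ids"]
train_dataset = [
    {"input_ids": ids[i : i + 256]} for i in range(0, len(ids) - 256, 256)
]

args = TrainingArguments(
    output_dir="gpt-j-6b-shakespeare",
    num_train_epochs=1,
    learning_rate=2e-3,
    weight_decay=0.01,
    lr_scheduler_type="linear",
    warmup_steps=500,
    per_device_train_batch_size=8,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```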
## Intended Use and Limitations

(same as [gpt-j-6b](https://huggingface.co/EleutherAI/gpt-j-6B))
GPT-J learns an inner representation of the English language that can be used to extract features useful for downstream tasks. The model is, however, best at what it was pretrained for, which is generating text from a prompt.
### How to use

[auto-generated inference script](https://huggingface.co/crumb/gpt-j-6b-shakespeare/blob/main/shakespeare-inference.py)

[inference notebook](https://huggingface.co/crumb/gpt-j-6b-shakespeare/blob/main/shakespeare-inference.ipynb)
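
For a quick start outside the notebook, here is a minimal sketch assuming the checkpoint loads through the standard `transformers` auto classes; since the hosted widget does not work, you may need the custom 8-bit loading code from the notebook above instead. The prompt and sampling parameters are only illustrative.

```python
# Minimal inference sketch; assumes the checkpoint loads with the standard
# transformers auto classes (the linked notebook has the 8-bit loading path).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "crumb/gpt-j-6b-shakespeare"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
model.eval()

prompt = "ROMEO:"  # any Shakespeare-style prompt
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
    )
print(tokenizer.decode(output[0], skip_special_tokens=True))
```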
### Limitations and Biases

(same as [gpt-j-6b](https://huggingface.co/EleutherAI/gpt-j-6B))

The core functionality of GPT-J is taking a string of text and predicting the next token. While language models are widely used for tasks other than this, there are a lot of unknowns with this work. When prompting GPT-J it is important to remember that the statistically most likely next token is often not the token that produces the most "accurate" text. Never depend upon GPT-J to produce factually accurate output.

GPT-J was trained on the Pile, a dataset known to contain profanity, lewd, and otherwise abrasive language. Depending upon use case GPT-J may produce socially unacceptable text. See [Sections 5 and 6 of the Pile paper](https://arxiv.org/abs/2101.00027) for a more detailed analysis of the biases in the Pile.

As with all language models, it is hard to predict in advance how GPT-J will respond to particular prompts and offensive content may occur without warning. We recommend having a human curate or filter the outputs before releasing them, both to censor undesirable content and to improve the quality of the results.
## To do

- clean up training code & create a GitHub repo for training-related models
- see if converting to fp16 or fp32 fixes the hosted inference on the model card
## Citations and Related Information

```bibtex
@misc{gpt-j,
  author = {Wang, Ben and Komatsuzaki, Aran},
  title = {{GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model}},
  howpublished = {\url{https://github.com/kingoflolz/mesh-transformer-jax}},
  year = 2021,
  month = May
}
```

```bibtex
@misc{mesh-transformer-jax,
  author = {Wang, Ben},
  title = {{Mesh-Transformer-JAX: Model-Parallel Implementation of Transformer Language Model with JAX}},
  howpublished = {\url{https://github.com/kingoflolz/mesh-transformer-jax}},
  year = 2021,
  month = May
}
```
100
+ @misc{
101
+ author={Karpathy, Andrej},
102
+ title={char-rnn},
103
+ year={2015},
104
+ howpublished={\url{https://github.com/karpathy/char-rnn}}
105
+ }
106
+ ```