THEODOROS committed on
Commit e2e3cdc
1 Parent(s): b8a5289

Update README.md

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -5,7 +5,7 @@ Architext GPT-J-162M is a transformer model trained using Ben Wang's Mesh Transf
 
 The model consists of 12 layers with a model dimension of 768, and a feedforward dimension of 2048. The model dimension is split into 16 heads, each with a dimension of 256. Rotary Position Embedding (RoPE) is applied to 64 dimensions of each head. The model is trained with a tokenization vocabulary of 50257, using the same set of BPEs as GPT-2/GPT-3.
 
-#Training data
+# Training data
 
 GPT-J 162M was pre-trained on the Pile, a large-scale curated dataset created by EleutherAI. It was then finetuned on synthetic data procedurally generated using the Rhinoceros/Grasshopper software suite. The model was finetuned for 1.25 billion tokens over 11,500 steps on TPU v3-8. It was trained as an autoregressive language model, using cross-entropy loss to maximize the likelihood of predicting the next token correctly.
 
 #Intended Use and Limitations
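For readers who want the stated hyperparameters in one place, below is a minimal sketch of how they could be expressed as a Hugging Face `GPTJConfig`. This is an illustration, not the checkpoint's actual configuration file, and the head layout is an assumption: the 16 heads of dimension 256 quoted in the card multiply to 4096 rather than 768, so the sketch uses 12 heads of 64 dimensions instead.

```python
# Illustrative only -- not the repository's real config.
from transformers import GPTJConfig, GPTJForCausalLM

config = GPTJConfig(
    vocab_size=50257,  # same BPE vocabulary as GPT-2/GPT-3, as stated in the card
    n_layer=12,        # 12 transformer layers
    n_embd=768,        # model (hidden) dimension
    n_inner=2048,      # feedforward dimension
    n_head=12,         # assumption: 12 heads of 64 dims (the card's 16 x 256 does not divide 768)
    rotary_dim=64,     # RoPE applied to 64 dimensions per head
)
model = GPTJForCausalLM(config)
print(f"~{model.num_parameters() / 1e6:.0f}M parameters")
```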
 
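The training objective described in the diff is standard next-token prediction. As a hedged illustration only (the actual training ran in Mesh Transformer JAX on TPU v3-8, not in the code below), the loss is the cross-entropy between the logits at each position and the token that follows it:

```python
# Sketch of the autoregressive cross-entropy objective described in the card.
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between predictions at position t and the token at t+1.

    logits:    (batch, seq_len, vocab_size) raw model outputs
    input_ids: (batch, seq_len) token ids that produced those logits
    """
    shift_logits = logits[:, :-1, :]  # predictions for positions 0..T-2
    shift_labels = input_ids[:, 1:]   # targets are the tokens at positions 1..T-1
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
```

A finetuning step then simply minimizes this loss over batches of the procedurally generated sequences.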