gpt-neox-20b / README.md
leaderboard-pr-bot's picture
Adding Evaluation Results
06a9a55
|
raw
history blame
8.97 kB
metadata
language:
  - en
tags:
  - pytorch
  - causal-lm
license: apache-2.0
datasets:
  - EleutherAI/pile

GPT-NeoX-20B is a 20 billion parameter autoregressive language model trained on the Pile using the GPT-NeoX library. Its architecture intentionally resembles that of GPT-3, and is almost identical to that of GPT-J- 6B. Its training dataset contains a multitude of English-language texts, reflecting the general-purpose nature of this model. See the accompanying paper for details about model architecture (including how it differs from GPT-3), training procedure, and additional evaluations.

Model details

Hyperparameter Value
nparameters 20554567680
nlayers 44
dmodel 6144
nheads 64
dhead 96
nvocab 50257
Sequence Length 2048
Learning Rate 0.97 x 10-5
Positional Encoding Rotary Position Embedding (RoPE)

Uses and limitations

Intended use

GPT-NeoX-20B was developed primarily for research purposes. It learns an inner representation of the English language that can be used to extract features useful for downstream tasks.

In addition to scientific uses, you may also further fine-tune and adapt GPT-NeoX-20B for deployment, as long as your use is in accordance with the Apache 2.0 license. This model works with the Transformers Library. If you decide to use pre-trained GPT-NeoX-20B as a basis for your fine-tuned model, please note that you need to conduct your own risk and bias assessment.

Out-of-scope use

GPT-NeoX-20B is not intended for deployment as-is. It is not a product and cannot be used for human-facing interactions without supervision.

GPT-NeoX-20B has not been fine-tuned for downstream tasks for which language models are commonly deployed, such as writing genre prose, or commercial chatbots. This means GPT-NeoX-20B will likely not respond to a given prompt the way products such as ChatGPT do. This is because, unlike GPT-NeoX-20B, ChatGPT was fine-tuned using methods such as Reinforcement Learning from Human Feedback (RLHF) to better “understand” human instructions and dialogue.

This model is English-language only, and thus cannot be used for translation or generating text in other languages.

Limitations and biases

The core functionality of GPT-NeoX-20B is to take a string of text and predict the next token. Remember that the statistically most likely next token need not result in the most “accurate” text. Never rely on GPT-NeoX-20B to produce factually accurate output.

This model was trained on the Pile, a dataset known to contain profanity and texts that are lewd or otherwise offensive. See Section 6 of the Pile paper for a discussion of documented biases with regards to gender, religion, and race. GPT-NeoX-20B may produce socially unacceptable or undesirable text, even if the prompt itself does not include anything explicitly offensive.

We recommend curating the outputs of this model before presenting it to a human reader. Please inform your audience that you are using artificially generated text.

How to use

If you simply want to try out some prompts, check out this playground.

GPT-NeoX-20B can be loaded using the AutoModelForCausalLM functionality:

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neox-20b")

Training

Training dataset

The Pile is a 825GiB general-purpose dataset in English. It was created by EleutherAI specifically for training large language models. It contains texts from 22 diverse sources, roughly broken down into five categories: academic writing (e.g. arXiv), internet (e.g. CommonCrawl), prose (e.g. Project Gutenberg), dialogue (e.g. YouTube subtitles), and miscellaneous (e.g. GitHub, Enron Emails). See the Pile paper for a breakdown of all data sources, methodology, and a discussion of ethical implications. Consult the datasheet for more detailed documentation about the Pile and its component datasets. The Pile can be downloaded from the official website, or from a community mirror.

The Pile was not deduplicated before being used to train GPT-NeoX-20B.

Training procedure

GPT-NeoX-20B was trained with a batch size of approximately 3.15M tokens (1538 sequences of 2048 tokens each), for a total of 150,000 steps. Tensor parallelism and pipeline parallelism were used to distribute the model across GPUs. Additional details about the training procedure are in Section 3 of the accompanying paper.

Evaluations

Model OpenAI’s LAMBADA SciQ PIQA TriviaQA ARC (Challenge)
GPT-J-6B 0.683 ± 0.006 0.910 ± 0.009 0.752 ± 0.010 0.170 ± 0.004 0.340 ± 0.014
FairSeq 6.7B 0.673 ± 0.007 0.895 ± 0.010 0.762 ± 0.010 0.221 ± 0.004 0.329 ± 0.014
GPT-3 Curie 0.693 ± 0.006 0.918 ± 0.009 0.767 ± 0.010 0.196 ± 0.004 0.334 ± 0.014
FairSeq 13B 0.709 ± 0.006 0.910 ± 0.009 0.769 ± 0.010 0.270 ± 0.004 0.345 ± 0.014
GPT-NeoX-20B 0.720 ± 0.006 0.928 ± 0.008 0.779 ± 0.010 0.259 ± 0.004 0.380 ± 0.014
GPT-3 DaVinci 0.752 ± 0.006 0.949 ± 0.007 0.791 ± 0.009 0.409 ± 0.005 0.435 ± 0.014
Zero-shot performance on selected natural language tasks.

This is a heavily abridged version of the evaluation results. Appendix D of the GPT-NeoX-20B paper compares more model sizes, and contains additional evaluations, including on: zero and five-shot natural language tasks, zero and five-shot Basic Arithmetic and MATH, and zero-shot Hendrycks tasks.

BibTeX

To cite the GPT-NeoX-20B paper:

@misc{https://doi.org/10.48550/arxiv.2204.06745,
  doi = {10.48550/ARXIV.2204.06745},
  
  url = {https://arxiv.org/abs/2204.06745},
  
  author = {Black, Sid and Biderman, Stella and Hallahan, Eric and Anthony, Quentin and Gao, Leo and Golding, Laurence and He, Horace and Leahy, Connor and McDonell, Kyle and Phang, Jason and Pieler, Michael and Prashanth, USVSN Sai and Purohit, Shivanshu and Reynolds, Laria and Tow, Jonathan and Wang, Ben and Weinbach, Samuel},
  
  keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
  
  title = {GPT-NeoX-20B: An Open-Source Autoregressive Language Model},
  
  publisher = {arXiv},
  
  year = {2022},
  
  copyright = {Creative Commons Attribution 4.0 International}
}

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

Metric Value
Avg. 36.02
ARC (25-shot) 45.73
HellaSwag (10-shot) 73.45
MMLU (5-shot) 25.0
TruthfulQA (0-shot) 31.61
Winogrande (5-shot) 68.9
GSM8K (5-shot) 2.43
DROP (3-shot) 5.04