EleutherAI/gpt-neo-1.3B

@@ -48,25 +48,31 @@ GPT-Neo was trained as an autoregressive language model. This means that its cor
 GPT-Neo was trained on the Pile, a dataset known to contain profane, lewd, and otherwise abrasive language. Depending on your use case, GPT-Neo may produce socially unacceptable text. See Sections 5 and 6 of the Pile paper for a more detailed analysis of the biases in the Pile.
 As with all language models, it is hard to predict in advance how GPT-Neo will respond to particular prompts, and offensive content may occur without warning. We recommend having a human curate or filter the outputs before releasing them, both to censor undesirable content and to improve the quality of the results.
 ## Eval results
-### Language Modeling Baselines
-EleutherAI is currently in the process of carrying out further evaluations of GPT-Neo. The following table should be considered a work-in-progress. If you would like to contribute evaluations you have done, please reach out on our Discord.
-| Model and Size   | Pile BPB      | Pile PPL      | Wikitext PPL.  |
-| ---------------- | ------------- | ------------- | -------------- |
-| **GPT-Neo 1.3B** |  **0.7527**   | **6.159**     | **13.10**      |
-| GPT-3 1.3B       |  ------       | -----         | -----          |
-| GPT-2 1.5B       |  1.0468       | -----         | 17.48          |
-| GPT-Neo 2.7B     |  0.7165       | 5.646         | 11.39          |
-| GPT-3 2.7B   |  0.9631       | -----         | -----          |
-| GPT-3 175B       |  0.7177       | -----         | -----          |
-All GPT-2 and GPT-3 scores are from their respective papers, except for the Pile test results which are from the Pile paper.
 ### Down-Stream Applications
 ### BibTeX entry and citation info
 ```bibtex

 GPT-Neo was trained on the Pile, a dataset known to contain profane, lewd, and otherwise abrasive language. Depending on your use case, GPT-Neo may produce socially unacceptable text. See Sections 5 and 6 of the Pile paper for a more detailed analysis of the biases in the Pile.
 As with all language models, it is hard to predict in advance how GPT-Neo will respond to particular prompts, and offensive content may occur without warning. We recommend having a human curate or filter the outputs before releasing them, both to censor undesirable content and to improve the quality of the results.
 ## Eval results
+### Linguistic Reasoning
+| Model and Size   | Pile BPB   | Pile PPL   | Wikitext PPL  | LAMBADA PPL | LAMBADA Acc | WinoGrande | HellaSwag   |
+| ---------------- | ---------- | ---------- | ------------- | ----------- | ----------- | ---------- | ----------- |
+| **GPT-Neo 1.3B** | **0.7527** | **6.159**  | **13.10**     | **7.498**   | **57.23%**  | **55.01%** | **38.66%**  |
+| GPT-2 1.5B       | 1.0468     | -----      | 17.48         | 10.634      | 51.21%      | 59.40%     | 40.03%      |
+| GPT-Neo 2.7B     | 0.7165     | 5.646      | 11.39         | 5.626       | 62.22%      | 56.50%     | 42.73%      |
+| GPT-3 Ada        | 0.9631     | -----      | -----         | 9.954       | 51.60%      | 52.90%     | 35.93%      |
+### Physical and Scientific Reasoning
+| Model and Size   | MathQA     | PubMedQA   | PIQA        |
+| ---------------- | ---------- | ---------- | ----------- |
+| **GPT-Neo 1.3B** | **24.05%** | **54.40%** | **71.11%**  |
+| GPT-2 1.5B       | 23.64%     | 58.33%     | 70.78%      |
+| GPT-Neo 2.7B     | 24.72%     | 57.54%     | 72.14%      |
+| GPT-3 Ada        | 24.29%     | 52.80%     | 68.88%      |
 ### Down-Stream Applications
+TBD
 ### BibTeX entry and citation info
 ```bibtex