Update README.md
README.md CHANGED

```diff
@@ -50,11 +50,10 @@ Here I explore whether training on long sequences that have clear conceptual dep
 | 8192 | **4.90** | 5.32 | Not Tested | 57.1 |
 | 12000 | **4.82** | 56.1 | Not Tested | Not Tested |
 
-- This model is very competitive with the Llama-1 33b extended context variants.
+- This model is very competitive with the Llama-1 33b extended context variants. In particular, at 512 tokens it has lower perplexity. This is probably an improvement imparted (in part) by the NTK by parts scaling method.
 
 - Not presented here, but this model outperforms the base llama-2-13b on MMLU-fs with a score of 54.9. While perhaps an insignificant difference, the fact there isn't a clear performance regression despite the context extension is notable.
 
 - Perplexity continues to decline to 12000 tokens, the longest context length I tested due to VRAM constraints.
 
-- Feedback regarding real-world performance is appreciated. I don't know if the first dolphin training phase really contributed much; many relevant modeling components changed here, so it's difficult to make any specific attributions. The base model improvement may very well be the most dominant change.
+- Feedback regarding real-world performance is appreciated. I don't know if the first dolphin training phase really contributed much beyond what pile did for the 33b-lxctx model; many relevant modeling components changed here, so it's difficult to make any specific attributions. The base model improvement may very well be the most dominant change.
 
 ## Prompting:
```
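The "NTK by parts" scaling credited above refers to the wavelength-dependent RoPE interpolation scheme: high-frequency rotary dimensions are left alone while low-frequency ones are position-interpolated to reach the extended context. A minimal sketch, assuming Llama-style rotary embeddings; the function name, default parameters, and the `alpha`/`beta` ramp thresholds here are illustrative, not the values actually used for this model:

```python
import numpy as np

def ntk_by_parts_freqs(dim=128, base=10000.0, orig_ctx=4096, scale=2.0,
                       alpha=1.0, beta=32.0):
    """Blend standard RoPE frequencies with position-interpolated ones.

    For each rotary dimension, count how many full rotations it completes
    over the original context window. High-frequency dimensions (more than
    `beta` rotations) are left untouched; low-frequency dimensions (fewer
    than `alpha` rotations) are fully interpolated, i.e. divided by
    `scale`; dimensions in between are linearly blended.
    """
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))  # standard RoPE freqs
    rotations = orig_ctx * inv_freq / (2 * np.pi)            # rotations per context
    gamma = np.clip((rotations - alpha) / (beta - alpha), 0.0, 1.0)
    # gamma = 1 -> keep original frequency; gamma = 0 -> divide by scale
    return inv_freq * ((1.0 - gamma) / scale + gamma)
```

The appeal of this blend over uniform interpolation is that short-wavelength dimensions, which encode fine-grained local position, keep their original resolution at short contexts, which is consistent with the lower 512-token perplexity noted in the diff.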
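For reference, the perplexity figures in the table are the exponential of the mean per-token negative log-likelihood over the evaluation text. A minimal sketch (the helper name is illustrative):

```python
import math

def perplexity(token_nlls):
    """Perplexity from a list of per-token negative log-likelihoods (nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))
```

Lower is better, so the drop from 4.90 at 8192 tokens to 4.82 at 12000 tokens is the "perplexity continues to decline" claim in the diff.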