bhenrym14 committed on
Commit
4a04f5a
1 Parent(s): 89e2fd2

Update README.md

Files changed (1):
  1. README.md +2 -3
README.md CHANGED
@@ -50,11 +50,10 @@ Here I explore whether training on long sequences that have clear conceptual dep
  | 8192 | **4.90** | 5.32 | Not Tested | 57.1 |
  | 12000 | **4.82** | 56.1 | Not Tested | Not Tested |
 
- - This model is very competitive with the Llama-1 33b extended context variants.
+ - This model is very competitive with the Llama-1 33b extended context variants. In particular, at 512 tokens it has lower perplexity. This is probably an improvement imparted (in part) by the NTK by parts scaling method.
  - Not presented here, but this model outperforms the base llama-2-13b on MMLU-fs with a score of 54.9. While perhaps an insignificant difference, the fact there isn't a clear performance regression despite the context extension is notable.
  - Perplexity continues to decline to 12000 tokens, the longest context length I tested due to VRAM constraints.
- - Feedback regarding real-world performance is appreciated. I don't know if the first dolphin training phase really contributed much; many relevant modeling components changed here, so it's difficult to make any specific attributions. The base model improvement may very well be the most dominant change.
-
 
+ - Feedback regarding real-world performance is appreciated. I don't know if the first dolphin training phase really contributed much beyond what pile did for the 33b-lxctx model; many relevant modeling components changed here, so it's difficult to make any specific attributions. The base model improvement may very well be the most dominant change.
 
  ## Prompting:
 
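The added bullet credits the "NTK by parts" scaling method for the low-perplexity result at 512 tokens. The idea is per-dimension RoPE frequency scaling: high-frequency dimensions keep their original rotation rate, fully-stretched low-frequency dimensions are position-interpolated, and a transition band blends the two. The sketch below illustrates that scheme only; the function name, default `dim`/`base`/`orig_ctx` values, and the `alpha`/`beta` thresholds are illustrative assumptions, not this repo's actual configuration.

```python
import math

def ntk_by_parts_inv_freqs(dim=128, base=10000.0, scale=4.0,
                           orig_ctx=4096, alpha=1.0, beta=32.0):
    """Sketch of per-dimension "NTK by parts" RoPE frequency scaling.

    For each rotary dimension pair, compare its wavelength to the
    original context window:
      - many full rotations within orig_ctx (ratio > beta): keep as-is
      - wavelength longer than orig_ctx (ratio < alpha): divide the
        frequency by `scale` (plain position interpolation)
      - in between: linearly blend the two regimes
    """
    inv_freqs = []
    for i in range(0, dim, 2):
        inv_freq = base ** (-i / dim)      # standard RoPE frequency
        wavelength = 2 * math.pi / inv_freq
        ratio = orig_ctx / wavelength      # rotations within orig_ctx
        if ratio < alpha:                  # long wavelength: interpolate
            inv_freq = inv_freq / scale
        elif ratio <= beta:                # transition band: blend
            t = (ratio - alpha) / (beta - alpha)
            inv_freq = (1 - t) * (inv_freq / scale) + t * inv_freq
        # ratio > beta: high frequency, left untouched
        inv_freqs.append(inv_freq)
    return inv_freqs
```

Because the highest-frequency dimensions are untouched, local token relationships (which drive short-context perplexity) are not distorted the way they are under uniform position interpolation.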
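For readers reproducing the perplexity columns in the table above: perplexity is the exponential of the mean per-token negative log-likelihood over the evaluation sequence. A minimal sketch of that definition (a hypothetical helper, not the evaluation harness actually used here):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood) over the tokens.

    `token_logprobs` is the model's natural-log probability assigned to
    each ground-truth token in the evaluation sequence.
    """
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# A model that assigns probability 0.5 to every token has perplexity 2.
ppl = perplexity([math.log(0.5)] * 4)
```

Lower is better: a perplexity of 4.82 at 12000 tokens means the model is, on average, about as uncertain as a uniform choice among ~4.8 tokens per step.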