Update README.md
README.md CHANGED

```diff
@@ -50,11 +50,10 @@ Here I explore whether training on long sequences that have clear conceptual dep
 | 8192 | **4.90** | 5.32 | Not Tested | 57.1 |
 | 12000 | **4.82** | 56.1 | Not Tested | Not Tested |
 
-- This model is very competitive with the Llama-1 33b extended context variants.
+- This model is very competitive with the Llama-1 33b extended context variants. In particular, at 512 tokens it has lower perplexity. This is probably an improvement imparted (in part) by the NTK by parts scaling method.
 
 - Not presented here, but this model outperforms the base llama-2-13b on MMLU-fs with a score of 54.9. While perhaps an insignificant difference, the fact there isn't a clear performance regression despite the context extension is notable.
 
 - Perplexity continues to decline to 12000 tokens, the longest context length I tested due to VRAM constraints.
 
-- Feedback regarding real-world performance is appreciated. I don't know if the first dolphin training phase really contributed much; many relevant modeling components changed here, so it's difficult to make any specific attributions. The base model improvement may very well be the most dominant change.
+- Feedback regarding real-world performance is appreciated. I don't know if the first dolphin training phase really contributed much beyond what pile did for the 33b-lxctx model; many relevant modeling components changed here, so it's difficult to make any specific attributions. The base model improvement may very well be the most dominant change.
 
 ## Prompting:
```
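The "NTK by parts" scaling credited above refers to the wavelength-dependent RoPE interpolation scheme: high-frequency rotary dimensions are left alone while low-frequency ones are position-interpolated to reach the extended context. A minimal sketch, assuming Llama-style rotary embeddings; the function name, default parameters, and the `alpha`/`beta` ramp thresholds here are illustrative, not the values actually used for this model:

```python
import numpy as np

def ntk_by_parts_freqs(dim=128, base=10000.0, orig_ctx=4096, scale=2.0,
                       alpha=1.0, beta=32.0):
    """Blend standard RoPE frequencies with position-interpolated ones.

    For each rotary dimension, count how many full rotations it completes
    over the original context window. High-frequency dimensions (more than
    `beta` rotations) are left untouched; low-frequency dimensions (fewer
    than `alpha` rotations) are fully interpolated, i.e. divided by
    `scale`; dimensions in between are linearly blended.
    """
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))  # standard RoPE freqs
    rotations = orig_ctx * inv_freq / (2 * np.pi)            # rotations per context
    gamma = np.clip((rotations - alpha) / (beta - alpha), 0.0, 1.0)
    # gamma = 1 -> keep original frequency; gamma = 0 -> divide by scale
    return inv_freq * ((1.0 - gamma) / scale + gamma)
```

The appeal of this blend over uniform interpolation is that short-wavelength dimensions, which encode fine-grained local position, keep their original resolution at short contexts, which is consistent with the lower 512-token perplexity noted in the diff.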
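For reference, the perplexity figures in the table are the exponential of the mean per-token negative log-likelihood over the evaluation text. A minimal sketch (the helper name is illustrative):

```python
import math

def perplexity(token_nlls):
    """Perplexity from a list of per-token negative log-likelihoods (nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))
```

Lower is better, so the drop from 4.90 at 8192 tokens to 4.82 at 12000 tokens is the "perplexity continues to decline" claim in the diff.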