Update README.md
The finetune was performed with 1x RTX 6000 Ada.

## How to Use

This model employs linear RoPE scaling, which now has native support in `transformers` (be sure to update the library if you run into issues). Use it as you would any normal context-length variant.
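For example, loading and generating with a recent `transformers` release looks the same as for any other Llama-2 model. The sketch below is illustrative only and uses a placeholder repo id rather than this model's actual path:

```python
# Minimal sketch: the repo id below is a placeholder; substitute the real model path.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-username/airoboros-l2-13b-16k"  # placeholder, not the real repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

# The linear RoPE scaling factor ships in config.json, so no extra setup is needed.
print(model.config.rope_scaling)  # e.g. {"type": "linear", "factor": 4.0}

prompt = "Summarize the following document:\n..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```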
Please comment with any questions.

Ooba use: Be sure to increase the `Truncate the prompt up to this length` parameter.
## Motivation

Given the excellent performance of llama-2 13b finetunes relative to llama 33b, I have received several requests for a 16k model using the latest airoboros dataset. Furthermore, while partial NTK scaling appears to be better at retaining short-context performance, it is not natively supported in `transformers` and is thus less accessible to less technical audiences. This model is designed to offer long-context capabilities with the stylistic characteristics of the new airoboros dataset, without any additional configuration.
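Conceptually, linear RoPE scaling (position interpolation) just divides the position indices fed to the rotary embedding by the scaling factor, so a 16k sequence is squeezed into the position range the base model was trained on. A toy illustration of the idea (not the actual `transformers` internals; the head dimension and base are illustrative):

```python
# Toy illustration of linear position interpolation for RoPE.
import numpy as np

def rotary_angles(positions, dim=128, base=10000.0, scale=1.0):
    """Rotary embedding angles; linear PI divides positions by `scale`."""
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return np.outer(np.asarray(positions) / scale, inv_freq)

# With scale=4, position 16380 yields the same angles that position 4095 did in
# the base model, so every position in a 16k context stays inside the trained range.
base_angles = rotary_angles(np.arange(4096))            # original 4k positions
pi_angles = rotary_angles(np.arange(16384), scale=4.0)  # interpolated 16k positions
assert np.allclose(pi_angles[16380], base_angles[4095])
```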
## Relative Performance (wikitext perplexity)
| 8192 | **4.71** | **4.71** | 4.90 | 5.32 | Not Tested | 57.1 |
| 12000 | **4.54** | 55 | 4.82 | 56.1 | Not Tested | Not Tested |
- Larger PI scaling factors cause greater degradation of short-context performance. If you don't require 16k context, you're better off using a model with a different context-extension method, or with a smaller (or no) PI scaling factor. Given this, don't expect anything special from this model on the HF leaderboard; whether this matters will depend on your intended use case.
- Beyond 8k, this model has lower perplexity than all other models tested here.
- I'm actively exploring/implementing other context extension methods that may ameliorate the tendency of PI methods to impair the ability of the model to attend to the context space equally.
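
For reference, wikitext perplexity at a fixed context length is typically computed with a strided sliding-window evaluation. The sketch below shows one way to do this with `transformers`; the repo id is a placeholder, and the dataset split, window size, and stride are assumptions rather than the exact settings behind the table above.

```python
# Strided wikitext perplexity at a fixed window size (illustrative settings).
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-username/airoboros-l2-13b-16k"  # placeholder repo id
context_len = 8192                               # evaluation window
stride = 2048                                    # how far the window advances each step

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
).eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
encodings = tokenizer(text, return_tensors="pt")
seq_len = encodings.input_ids.size(1)

nlls, prev_end = [], 0
for begin in range(0, seq_len, stride):
    end = min(begin + context_len, seq_len)
    target_len = end - prev_end                  # only score tokens not yet scored
    input_ids = encodings.input_ids[:, begin:end].to(model.device)
    target_ids = input_ids.clone()
    target_ids[:, :-target_len] = -100           # mask the overlapping prefix

    with torch.no_grad():
        loss = model(input_ids, labels=target_ids).loss
    nlls.append(loss * target_len)

    prev_end = end
    if end == seq_len:
        break

ppl = torch.exp(torch.stack(nlls).sum() / prev_end)
print(f"wikitext perplexity @ {context_len}: {ppl.item():.2f}")
```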