bhenrym14 committed
Commit b8db8fb
Parent: 360ba9f

Update README.md

Files changed (1): README.md (+4, -2)
README.md CHANGED
@@ -30,6 +30,8 @@ This model employs [Partial NTK Rope Scaling](https://github.com/jquesnelle/scal
  2. Autogptq/GPTQ-for-Llama. Use these quantized weights. Make the same replacement as in 1. (A minimal loading sketch is given under the Quantization heading below.)
  3. Use ExLLama, replacing the `model.py` file with the [modified version](https://github.com/bhenrym14/qlora-airoboros-longcontext/blob/main/exllama_pntk/model.py). Use `compress_pos_emb=1` and `alpha_value = 1` (the defaults). The necessary scaling values should flow from the configuration file; if you have done this correctly, the console will print messages indicating the scaling factor used (it should be 4; a quick way to check this from the config is sketched just below this hunk). If not, make sure your client is importing exllama from the location where you replaced the file (ooba imported it from site-packages for me). I hacked this together very quickly, so don't be surprised if something goes wrong. It shouldn't break functionality with normal models (as long as the model config file does not have `original_max_embeddings` defined), but I haven't tested this.
+ **If using ooba, be sure to increase the `Truncate the prompt up to this length` parameter under the `parameters` tab to 16384.**
+
  Please comment with any questions. This hasn't been extensively tested.
 
  ## Motivation
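As a quick way to verify the scaling factor mentioned in step 3, it can be derived directly from the model's config: it is simply the ratio of the extended context length to the original Llama-2 context length (16384 / 4096 = 4). Below is a minimal Python sketch, assuming the config uses the standard `max_position_embeddings` key plus the `original_max_embeddings` key referenced above (key names are taken from the README text, not verified against the shipped `config.json`):

```python
# Sanity check: derive the RoPE scaling factor the modified model.py should report.
# Assumption: config.json defines max_position_embeddings (extended context, 16384)
# and original_max_embeddings (pre-extension Llama-2 context, 4096).
import json

with open("config.json") as f:
    cfg = json.load(f)

max_ctx = cfg["max_position_embeddings"]              # 16384 for this model
orig_ctx = cfg.get("original_max_embeddings", 4096)   # Llama-2 base context

print("expected scaling factor:", max_ctx / orig_ctx)  # should print 4.0
```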
@@ -50,10 +52,10 @@ Here I explore whether training on long sequences that have clear conceptual dep
  | 8192 | **4.90** | 5.32 | Not Tested | 57.1 |
  | 12000 | **4.82** | 56.1 | Not Tested | Not Tested |
 
- - This model is very competitive with the Llama-1 33b extended context variants.
+ - This model is very competitive with the Llama-1 33b extended context variants. In particular, at 512 tokens it has lower perplexity. This is probably an improvement imparted (in part) by the NTK by parts scaling method.
  - Not presented here, but this model outperforms the base llama-2-13b on MMLU-fs with a score of 54.9. While the difference is perhaps insignificant, the fact that there isn't a clear performance regression despite the context extension is notable.
  - Perplexity continues to decline to 12000 tokens, the longest context length I tested due to VRAM constraints.
- - Feedback regarding real-world performance is appreciated. I don't know if the first dolphin training phase really contributed much; many relevant modeling components changed here, so it's difficult to make any specific attributions. The base model improvement may very well be the most dominant change.
+ - Feedback regarding real-world performance is appreciated. I don't know if the first dolphin training phase really contributed much beyond what pile did for the 33b-lxctx model; many relevant modeling components changed here, so it's difficult to make any specific attributions. The base model improvement may very well be the most dominant change.
 
  ## Quantization:
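For option 2 above (AutoGPTQ), loading the quantized weights looks roughly like the following. This is a minimal sketch under stated assumptions, not the author's exact invocation: the directory path is a placeholder, `use_safetensors=True` assumes the weights are packaged as safetensors, and it presumes the source replacement from step 1 has already been made so the partial NTK scaling takes effect.

```python
# Minimal sketch of loading the GPTQ weights with AutoGPTQ.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_dir = "path/to/quantized-weights"  # placeholder: your local copy of these GPTQ weights

tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_dir,
    device="cuda:0",
    use_safetensors=True,  # assumption about how the weights are packaged
)

# Short smoke test; use the chat format from the model card for real prompts.
prompt = "Hello."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```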
 
 