Update README.md
README.md
CHANGED
@@ -25,6 +25,8 @@ REQUIRED: you'll need to patch in the appropriate RoPE scaling module. see: [rep
 
 Hopefully there is a quick fix to exllama that can make >8k work soon.
 
+Otherwise, for context <8k, use exllama. Set `max_seq_len` to 16384 and `compress_pos_emb` to 8.
+
 ## Motivation
 
 Recent advancements in extending context by RoPE scaling ([kaiokendev](https://kaiokendev.github.io/til#extending-context-to-8k) and [Meta AI](https://arxiv.org/abs/2306.15595)) demonstrate the ability to extend the context window without (total) retraining. Finetuning has been shown to be necessary to properly leverage the longer context. Here I attempt to take a smaller model and extend the context to 16k tokens. This, however, proved problematic, as stability suffered in the 8-10k+ range. The Meta paper demonstrated that decreasing perplexities can still be achieved at these context lengths; however, their approach involved tuning all variables on the maximum sequence length after incorporating the RoPE scaling adjustment.
 
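The added line only names the loader settings. As a hedged illustration (not part of this commit), the same `max_seq_len` and `compress_pos_emb` values could be applied through exllama's Python API roughly as sketched below, assuming you are running from a checkout of the exllama repo so its `model`, `tokenizer`, and `generator` modules are importable; all file paths are placeholders, and attribute names should be checked against your exllama version.

```python
# Sketch only: load a 16k-context model with exllama using the settings from the diff above.
# Assumes exllama's repo modules are on the import path; every path below is a placeholder.
from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

config = ExLlamaConfig("/path/to/model/config.json")
config.model_path = "/path/to/model/model.safetensors"
config.max_seq_len = 16384      # extended context window
config.compress_pos_emb = 8.0   # linear RoPE scaling factor (16384 / 2048)

model = ExLlama(config)
tokenizer = ExLlamaTokenizer("/path/to/model/tokenizer.model")
cache = ExLlamaCache(model)
generator = ExLlamaGenerator(model, tokenizer, cache)

print(generator.generate_simple("The quick brown fox", max_new_tokens=32))
```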