bhenrym14 committed on
Commit
0343b8e
1 Parent(s): 717e68d

Update README.md

Files changed (1):
  1. README.md +2 -3
README.md CHANGED
@@ -15,7 +15,7 @@ fp16 weights can be found here: https://huggingface.co/bhenrym14/airophin-13b-pn
  ## Overview
  
  This is a finetune of Llama-2-13b, intended to extend the useful context window to 16384 tokens. There are two training phases:
- 1. It is first trained on a long-context (7000-8192 tokens) subset of [dolphin](https://huggingface.co/datasets/ehartford/dolphin), an orca-like dataset (GPT4 split only). This amounts to roughly 110mm tokens, seen twice over two epochs. Airoboros-like training prompt was used, with partial NTK scaling applied. This took ~45 hours.
+ 1. It is first trained on a long-context (7000-8192 tokens) subset of [dolphin](https://huggingface.co/datasets/ehartford/dolphin), an orca-like dataset (GPT4 split only). This amounts to roughly 110mm tokens. Airoboros-like training prompt was used, with partial NTK scaling applied. This took ~20 hours.
  2. The model was then finetuned on [Jon Durbin's Airoboros GPT4 1.4.1](https://huggingface.co/datasets/jondurbin/airoboros-gpt4-1.4.1) for 3 epochs. This took ~17 hours.
  
  **This is a QLoRA fine-tune**.
@@ -24,11 +24,10 @@ All training was performed with 1x RTX 6000 Ada.
  
  ## How to Use
  
- This model employs [Partial NTK Rope Scaling](https://github.com/jquesnelle/scaled-rope/pull/1). This methodology is not yet mplemented natively in Transformers or Exllama (as of 7/21). There are two options to run this, each of which will require replacing the `LlamaEmbedding` with `LlamaPartNTKScaledRotaryEmbedding`, with `max_position_embeddings=16384`. A monkeypatch can be found here:
+ This model employs [Partial NTK Rope Scaling](https://github.com/jquesnelle/scaled-rope/pull/1). This methodology is not yet implemented natively in Transformers or Exllama (as of 7/21). There are two options to run this, each of which will require replacing the `LlamaEmbedding` with `LlamaPartNTKScaledRotaryEmbedding`, with `max_position_embeddings=16384` and `original_max_position_embeddings=4096`. A monkeypatch can be found here:
  1. Transformers (use bnb for quantization). Use [fp16 weights](https://huggingface.co/bhenrym14/airophin-13b-pntk-16k-fp16).
  2. Autogptq/GPTQ-for-Llama. Use these quantized weights.
  
- **Note: Due to an erronious `max_position_embeddings` figure in the base model config file, the RoPE scaling factor was computed with `original_max_position_embeddings=2048` (llama-2 should be 4096). This resulted in a scaling factor of 8 instead of 4, despite passing a new `max_position_embeddings=16384`. This could have a negative to neutral performance impact. I intend on retraining this model with the proper scaling factor. If and when I do so, I will replace the weights in this repo and make note of this change at the top of this model card.**
  
  ## Motivation
  Methods of extending the useful context window of LLM's have gained significant traction. Several methods requiring little to no finetuning/retraining have emerged. Among these is linear position interpolation [kaiokendev](https://kaiokendev.github.io/til#extending-context-to-8k) and [meta AI)](https://arxiv.org/abs/2306.15595)) and [NTK aware scaling](https://github.com/jquesnelle/scaled-rope). My prior experiments demonstrate significant performance improvements both from finetuning with these scaling adjustments implemented **and** with longer sequences.
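For option 1 in the updated "How to Use" text above (Transformers with bitsandbytes quantization), the sketch below shows the general shape of applying a rotary-embedding monkeypatch before loading the fp16 weights in 4-bit. The import path and constructor signature of `LlamaPartNTKScaledRotaryEmbedding` are assumptions for illustration only; the actual monkeypatch linked from the model card should be used rather than this sketch.

```python
# Hedged sketch only: the monkeypatch import path and the constructor signature of
# LlamaPartNTKScaledRotaryEmbedding are assumptions, not the model card's exact code.
# The parameter values follow the model card (16384 extended / 4096 original context).
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Hypothetical import; in practice, use the monkeypatch referenced in the model card.
from scaled_rope.modeling_llama import LlamaPartNTKScaledRotaryEmbedding  # assumption

def patch_rope(max_position_embeddings=16384, original_max_position_embeddings=4096):
    """Swap transformers' LlamaRotaryEmbedding for the partial-NTK scaled variant."""
    def scaled_rope(dim, *args, **kwargs):
        return LlamaPartNTKScaledRotaryEmbedding(
            dim,
            max_position_embeddings=max_position_embeddings,
            original_max_position_embeddings=original_max_position_embeddings,
        )
    transformers.models.llama.modeling_llama.LlamaRotaryEmbedding = scaled_rope

patch_rope()  # must run before the model is instantiated

bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained(
    "bhenrym14/airophin-13b-pntk-16k-fp16",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("bhenrym14/airophin-13b-pntk-16k-fp16")

prompt = "Write a short story about a lighthouse keeper."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

The patch has to be applied before `from_pretrained` runs, since the rotary embedding is constructed when the model is built.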
 
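For context on the note removed in the second hunk above: under the ratio convention it describes, the RoPE scaling factor is the extended context length divided by the original one, which is why an erroneous `original_max_position_embeddings=2048` doubled it. A minimal illustration using only the note's own numbers (variable names are for exposition):

```python
# Scaling factor = extended context / original context, per the removed note.
factor_intended = 16384 / 4096  # 4.0 -- Llama-2's true original context length
factor_obtained = 16384 / 2048  # 8.0 -- result of the erroneous base-config value
```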