bhenrym14 committed
Commit 74b22f0
Parent: 4ad66b7

Update README.md

Files changed (1): README.md +3 -3
README.md CHANGED
@@ -9,7 +9,7 @@ datasets:
 
 
  <!-- LoRA Weights can be found here: https://huggingface.co/bhenrym14/airophin-13b-pntk-16k-LoRA -->
- fp16 weights can be found here: https://huggingface.co/bhenrym14/airophin-13b-pntk-16k-fp16
+ GPTQ weights can be found here: https://huggingface.co/bhenrym14/airophin-13b-pntk-16k-GPTQ
 
  ## Overview
 
@@ -27,8 +27,8 @@ All training was performed with 1x RTX 6000 Ada.
 
  This model employs [Partial NTK Rope Scaling](https://github.com/jquesnelle/scaled-rope/pull/1). This methodology is not yet implemented natively in Transformers or Exllama (as of 7/21). There are three options to run this.
  1. Transformers (use bnb for quantization). Use [fp16 weights](https://huggingface.co/bhenrym14/airophin-13b-pntk-16k-fp16). This will require replacing the `LlamaEmbedding` with `LlamaPartNTKScaledRotaryEmbedding`, with `max_position_embeddings=16384` and `original_max_position_embeddings=4096`. A monkeypatch can be found [here](https://github.com/bhenrym14/qlora-airoboros-longcontext/blob/main/scaledllama/llama_pntk_monkey_patch.py).
- 2. Autogptq/GPTQ-for-Llama. Use these quantized weights. Make the same replacement as in 1.
- 3. Use ExLLama, replacing the `model.py` file with the [modified version](https://github.com/bhenrym14/qlora-airoboros-longcontext/blob/main/exllama_pntk/model.py). Use `compress_pos_emb=1` and `alpha_value = 1` (defaults). The necessary scaling values should flow from the configuration file. If you have done this correctly, there should be a dump of indications in the console indicating the scaling factor used (should be 4). If not, be sure your client is importing exllama from where you replaced the file. (ooba was from sitepackages for me). I hacked this together very quickly so don't be surprised if something goes wrong.
+ 2. Autogptq/GPTQ-for-Llama. See the [GPTQ weights](https://huggingface.co/bhenrym14/airophin-13b-pntk-16k-GPTQ)
+ 3. Use ExLLama, see the [GPTQ weights](https://huggingface.co/bhenrym14/airophin-13b-pntk-16k-GPTQ)
 
  Please comment with any questions. This hasn't been extensively tested.
 
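
For option 1 in the diff above (Transformers with bitsandbytes quantization), a minimal sketch of the intended loading flow is below. The import path `llama_pntk_monkey_patch`, the entry point `replace_llama_rope_with_pntk_scaled_rope`, and its keyword arguments are assumptions about the linked monkeypatch file, not its confirmed interface; adjust to whatever the file actually exports.

```python
# Minimal sketch (untested) of option 1: apply the partial-NTK RoPE monkeypatch,
# then load the fp16 weights with 4-bit bitsandbytes quantization.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Hypothetical entry point of the linked llama_pntk_monkey_patch.py; the real
# function name/signature may differ. It must run BEFORE the model is built so
# LlamaPartNTKScaledRotaryEmbedding replaces the stock rotary embedding.
from llama_pntk_monkey_patch import replace_llama_rope_with_pntk_scaled_rope

replace_llama_rope_with_pntk_scaled_rope(
    max_position_embeddings=16384,
    original_max_position_embeddings=4096,
)

model_id = "bhenrym14/airophin-13b-pntk-16k-fp16"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    ),
)
```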
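Option 2 is analogous but uses the GPTQ weights through AutoGPTQ; the same monkeypatch must be applied before the model is instantiated. Again a hedged sketch, not a confirmed recipe; the safetensors flag is an assumption about how the GPTQ repo is packaged.

```python
# Minimal sketch (untested) of option 2: the same RoPE monkeypatch call shown
# above must run first, then the GPTQ weights are loaded with AutoGPTQ.
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

model_id = "bhenrym14/airophin-13b-pntk-16k-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    device="cuda:0",
    use_safetensors=True,  # assumption: the repo ships .safetensors weights
)
```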