brucethemoose committed on
Commit ac4b397
1 Parent(s): 51a7be4

Update README.md

Files changed (1)
  1. README.md +22 -22
README.md CHANGED
@@ -16,8 +16,29 @@ A merge of [**Nous-Capybara-34B**](https://huggingface.co/NousResearch/Nous-Capy
 
 > https://github.com/cg123/mergekit/tree/dare
 
+ ***
+ ## Prompt template: Orca-Vicuna
+ ```
+ SYSTEM: {system_message}
+ USER: {prompt}
+ ASSISTANT:
+ ```
+ It might recognize ChatML, or maybe Llama-chat from Airoboros.
+
+ Sometimes the model "spells out" the stop token as `</s>` like Capybara, so you may need to add `</s>` as an additional stopping condition.
+ ***
+ ## Running
+ Being a Yi model, try running a lower temperature with 0.05-0.1 MinP, a little repitition penalty, and no other samplers. Yi tends to run "hot" by default.
+
+ 24GB GPUs can run Yi-34B-200K models at **45K-75K context** with exllamav2, and performant UIs like [exui](https://github.com/turboderp/exui). I go into more detail in this [post](https://old.reddit.com/r/LocalLLaMA/comments/1896igc/how_i_run_34b_models_at_75k_context_on_24gb_fast/)
+
+ I recommend exl2 quantizations profiled on data similar to the desired task. It is especially sensitive to the quantization data at low bpw.
+
+ To load this in full-context backends like transformers and vllm, you *must* change `max_position_embeddings` in config.json to a lower value than 200,000, otherwise you will OOM!
+ ***
+ ## Testing Notes
 
- Merged with with the following config, and the tokenizer from chargoddard's Yi-Llama:
+ Merged in mergekit with the following config, and the tokenizer from chargoddard's Yi-Llama:
 
 ```
 models:
@@ -65,27 +86,6 @@ parameters:
  int8_mask: true
  dtype: bfloat16
 ```
- ***
- ## Prompt template: Orca-Vicuna
- ```
- SYSTEM: {system_message}
- USER: {prompt}
- ASSISTANT:
- ```
- It might recognize ChatML, or maybe Llama-chat from Airoboros.
-
- Sometimes the model "spells out" the stop token as `</s>` like Capybara, so you may need to add `</s>` as an additional stopping condition.
- ***
- ## Running
- Being a Yi model, try running a lower temperature with 0.05-0.1 MinP, a little repitition penalty, and no other samplers. Yi tends to run "hot" by default.
-
- 24GB GPUs can run Yi-34B-200K models at **45K-75K context** with exllamav2, and performant UIs like [exui](https://github.com/turboderp/exui). I go into more detail in this [post](https://old.reddit.com/r/LocalLLaMA/comments/1896igc/how_i_run_34b_models_at_75k_context_on_24gb_fast/)
-
- I recommend exl2 quantizations profiled on data similar to the desired task. It is especially sensitive to the quantization data at low bpw.
-
- To load this in full-context backends like transformers and vllm, you *must* change `max_position_embeddings` in config.json to a lower value than 200,000, otherwise you will OOM!
- ***
- ## Testing Notes
 
 Various densities were tested with perplexity tests and long context prompts. Relatively high densities seem to perform better, contrary to the findings of the Super Mario paper.
 
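To make the Prompt template and stop-token notes in the README above concrete, here is a minimal, hypothetical sketch (not part of the commit) that formats an Orca-Vicuna prompt and treats a literally generated `</s>` string as an extra stopping condition via the transformers stopping-criteria hook. The model path is a placeholder; substitute the local path or repo id of this merge.

```
# Hypothetical sketch: Orca-Vicuna prompt formatting plus a stop rule for a
# "spelled out" </s>, using the transformers StoppingCriteria hook.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          StoppingCriteria, StoppingCriteriaList)

MODEL_PATH = "path/to/this-merge"  # placeholder; point at the local merge or its repo id

def orca_vicuna_prompt(system_message: str, prompt: str) -> str:
    # Mirrors the template shown in the README.
    return f"SYSTEM: {system_message}\nUSER: {prompt}\nASSISTANT:"

class StopOnLiteralEOS(StoppingCriteria):
    """Stop generation once the decoded continuation contains a literal '</s>'."""
    def __init__(self, tokenizer, prompt_len: int):
        self.tokenizer = tokenizer
        self.prompt_len = prompt_len
    def __call__(self, input_ids, scores, **kwargs) -> bool:
        text = self.tokenizer.decode(input_ids[0, self.prompt_len:], skip_special_tokens=True)
        return "</s>" in text

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, device_map="auto")

inputs = tokenizer(orca_vicuna_prompt("You are a helpful assistant.",
                                      "Explain the DARE merge in one paragraph."),
                   return_tensors="pt").to(model.device)
stop = StoppingCriteriaList([StopOnLiteralEOS(tokenizer, inputs["input_ids"].shape[1])])
output = model.generate(**inputs, max_new_tokens=256, stopping_criteria=stop)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```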
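The Running section recommends a low temperature with 0.05-0.1 MinP and only a light repetition penalty. As a backend-agnostic illustration of what MinP filtering does (the exact sampler order and parameter names vary between exllamav2, llama.cpp, and other backends), here is a self-contained sketch:

```
# Illustrative sketch of MinP sampling: drop tokens whose probability is below
# min_p times the top token's probability, then sample the rest with a low temperature.
import torch

def sample_min_p(logits: torch.Tensor, min_p: float = 0.1, temperature: float = 0.8) -> int:
    probs = torch.softmax(logits, dim=-1)
    cutoff = min_p * probs.max()                      # threshold scales with the best token
    filtered = logits.masked_fill(probs < cutoff, float("-inf"))
    final_probs = torch.softmax(filtered / temperature, dim=-1)
    return int(torch.multinomial(final_probs, num_samples=1))

if __name__ == "__main__":
    fake_logits = torch.randn(32000)                  # stand-in for one decoding step
    print(sample_min_p(fake_logits, min_p=0.1, temperature=0.8))
```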
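For the warning about full-context backends, the concrete step is lowering `max_position_embeddings` in the model's `config.json` before loading. A minimal sketch, assuming a local copy of the weights and an arbitrary example value of 32768 (anything well below 200,000; with vLLM you can alternatively pass a smaller `max_model_len` when constructing the engine):

```
# Hypothetical sketch: cap the advertised context length in config.json so
# full-context backends do not try to reserve room for 200K tokens.
import json
from pathlib import Path

config_path = Path("path/to/this-merge/config.json")  # placeholder local path
config = json.loads(config_path.read_text())
config["max_position_embeddings"] = 32768             # example value, not from the README
config_path.write_text(json.dumps(config, indent=2))
```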