brucethemoose committed on
Commit ac4b397
1 Parent(s): 51a7be4

Update README.md

Files changed (1)
  1. README.md +22 -22
README.md CHANGED
@@ -16,8 +16,29 @@ A merge of [**Nous-Capybara-34B**](https://huggingface.co/NousResearch/Nous-Capy
 
 > https://github.com/cg123/mergekit/tree/dare
 
+ ***
+ ## Prompt template: Orca-Vicuna
+ ```
+ SYSTEM: {system_message}
+ USER: {prompt}
+ ASSISTANT:
+ ```
+ It might recognize ChatML, or maybe Llama-chat from Airoboros.
+
+ Sometimes the model "spells out" the stop token as `</s>` like Capybara, so you may need to add `</s>` as an additional stopping condition.
+ ***
+ ## Running
+ Being a Yi model, try running a lower temperature with 0.05-0.1 MinP, a little repitition penalty, and no other samplers. Yi tends to run "hot" by default.
+
+ 24GB GPUs can run Yi-34B-200K models at **45K-75K context** with exllamav2, and performant UIs like [exui](https://github.com/turboderp/exui). I go into more detail in this [post](https://old.reddit.com/r/LocalLLaMA/comments/1896igc/how_i_run_34b_models_at_75k_context_on_24gb_fast/)
+
+ I recommend exl2 quantizations profiled on data similar to the desired task. It is especially sensitive to the quantization data at low bpw.
+
+ To load this in full-context backends like transformers and vllm, you *must* change `max_position_embeddings` in config.json to a lower value than 200,000, otherwise you will OOM!
+ ***
+ ## Testing Notes
 
- Merged with with the following config, and the tokenizer from chargoddard's Yi-Llama:
+ Merged in mergekit with the following config, and the tokenizer from chargoddard's Yi-Llama:
 
 ```
 models:
@@ -65,27 +86,6 @@ parameters:
  int8_mask: true
  dtype: bfloat16
 ```
- ***
- ## Prompt template: Orca-Vicuna
- ```
- SYSTEM: {system_message}
- USER: {prompt}
- ASSISTANT:
- ```
- It might recognize ChatML, or maybe Llama-chat from Airoboros.
-
- Sometimes the model "spells out" the stop token as `</s>` like Capybara, so you may need to add `</s>` as an additional stopping condition.
- ***
- ## Running
- Being a Yi model, try running a lower temperature with 0.05-0.1 MinP, a little repitition penalty, and no other samplers. Yi tends to run "hot" by default.
-
- 24GB GPUs can run Yi-34B-200K models at **45K-75K context** with exllamav2, and performant UIs like [exui](https://github.com/turboderp/exui). I go into more detail in this [post](https://old.reddit.com/r/LocalLLaMA/comments/1896igc/how_i_run_34b_models_at_75k_context_on_24gb_fast/)
-
- I recommend exl2 quantizations profiled on data similar to the desired task. It is especially sensitive to the quantization data at low bpw.
-
- To load this in full-context backends like transformers and vllm, you *must* change `max_position_embeddings` in config.json to a lower value than 200,000, otherwise you will OOM!
- ***
- ## Testing Notes
 
 Various densities were tested with perplexity tests and long context prompts. Relatively high densities seem to perform better, contrary to the findings of the Super Mario paper.
 
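To make the Prompt template and stop-token notes in the README above concrete, here is a minimal, hypothetical sketch (not part of the commit) that formats an Orca-Vicuna prompt and treats a literally generated `</s>` string as an extra stopping condition via the transformers stopping-criteria hook. The model path is a placeholder; substitute the local path or repo id of this merge.

```
# Hypothetical sketch: Orca-Vicuna prompt formatting plus a stop rule for a
# "spelled out" </s>, using the transformers StoppingCriteria hook.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          StoppingCriteria, StoppingCriteriaList)

MODEL_PATH = "path/to/this-merge"  # placeholder; point at the local merge or its repo id

def orca_vicuna_prompt(system_message: str, prompt: str) -> str:
    # Mirrors the template shown in the README.
    return f"SYSTEM: {system_message}\nUSER: {prompt}\nASSISTANT:"

class StopOnLiteralEOS(StoppingCriteria):
    """Stop generation once the decoded continuation contains a literal '</s>'."""
    def __init__(self, tokenizer, prompt_len: int):
        self.tokenizer = tokenizer
        self.prompt_len = prompt_len
    def __call__(self, input_ids, scores, **kwargs) -> bool:
        text = self.tokenizer.decode(input_ids[0, self.prompt_len:], skip_special_tokens=True)
        return "</s>" in text

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, device_map="auto")

inputs = tokenizer(orca_vicuna_prompt("You are a helpful assistant.",
                                      "Explain the DARE merge in one paragraph."),
                   return_tensors="pt").to(model.device)
stop = StoppingCriteriaList([StopOnLiteralEOS(tokenizer, inputs["input_ids"].shape[1])])
output = model.generate(**inputs, max_new_tokens=256, stopping_criteria=stop)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```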
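The Running section recommends a low temperature with 0.05-0.1 MinP and only a light repetition penalty. As a backend-agnostic illustration of what MinP filtering does (the exact sampler order and parameter names vary between exllamav2, llama.cpp, and other backends), here is a self-contained sketch:

```
# Illustrative sketch of MinP sampling: drop tokens whose probability is below
# min_p times the top token's probability, then sample the rest with a low temperature.
import torch

def sample_min_p(logits: torch.Tensor, min_p: float = 0.1, temperature: float = 0.8) -> int:
    probs = torch.softmax(logits, dim=-1)
    cutoff = min_p * probs.max()                      # threshold scales with the best token
    filtered = logits.masked_fill(probs < cutoff, float("-inf"))
    final_probs = torch.softmax(filtered / temperature, dim=-1)
    return int(torch.multinomial(final_probs, num_samples=1))

if __name__ == "__main__":
    fake_logits = torch.randn(32000)                  # stand-in for one decoding step
    print(sample_min_p(fake_logits, min_p=0.1, temperature=0.8))
```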
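For the warning about full-context backends, the concrete step is lowering `max_position_embeddings` in the model's `config.json` before loading. A minimal sketch, assuming a local copy of the weights and an arbitrary example value of 32768 (anything well below 200,000; with vLLM you can alternatively pass a smaller `max_model_len` when constructing the engine):

```
# Hypothetical sketch: cap the advertised context length in config.json so
# full-context backends do not try to reserve room for 200K tokens.
import json
from pathlib import Path

config_path = Path("path/to/this-merge/config.json")  # placeholder local path
config = json.loads(config_path.read_text())
config["max_position_embeddings"] = 32768             # example value, not from the README
config_path.write_text(json.dumps(config, indent=2))
```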