brucethemoose committed
Commit
eb39dbf
1 Parent(s): 992011f

Update README.md

Files changed (1):
  1. README.md +3 -1
README.md CHANGED
@@ -85,7 +85,9 @@ Sometimes the model "spells out" the stop token as `</s>` like Capybara, so you
 To load this in full-context backends like transformers and vllm, you *must* change `max_position_embeddings` in config.json to a lower value than 200,000, otherwise you will OOM!
 
 ***
-24GB GPUs can run Yi-34B-200K models at **45K-75K context** with exllamav2. I go into more detail in this [post](https://old.reddit.com/r/LocalLLaMA/comments/1896igc/how_i_run_34b_models_at_75k_context_on_24gb_fast/), and recommend exl2 quantizations on data similar to the desired task.
+24GB GPUs can run Yi-34B-200K models at **45K-75K context** with exllamav2. I go into more detail in this [post](https://old.reddit.com/r/LocalLLaMA/comments/1896igc/how_i_run_34b_models_at_75k_context_on_24gb_fast/).
+
+I recommend exl2 quantizations profiled on data similar to the desired task. The model is especially sensitive to the quantization data at low bpw!
 ***
 
 Credits:
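
For reference, a minimal sketch of the `max_position_embeddings` override described in the diff above, using the transformers `AutoConfig` API instead of editing config.json by hand. The model ID and the 32768 value are placeholders; pick the largest context your VRAM tolerates:

```python
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "brucethemoose/Yi-34B-200K"  # placeholder; substitute the actual repo ID

# Lower max_position_embeddings before loading so full-context backends
# do not try to serve the entire 200K window and OOM.
config = AutoConfig.from_pretrained(model_id)
config.max_position_embeddings = 32768  # illustrative value, not a recommendation

model = AutoModelForCausalLM.from_pretrained(model_id, config=config)
```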
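Similarly, a rough sketch of loading at reduced context with exllamav2's Python API, as in the 45K-75K-on-24GB setup the diff mentions. The model directory and the 49152 `max_seq_len` are illustrative assumptions:

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config

config = ExLlamaV2Config()
config.model_dir = "./Yi-34B-200K-exl2"  # placeholder path to an exl2 quantization
config.prepare()
config.max_seq_len = 49152  # ~48K context; how far you can push depends on bpw

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)  # allocated as layers stream onto the GPU
model.load_autosplit(cache)
```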