brucethemoose committed
Commit bee0946
1 Parent(s): 633b176

Update README.md

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -32,7 +32,7 @@ Being a Yi model, try running a lower temperature with 0.05-0.1 MinP, a little r
 
 24GB GPUs can run Yi-34B-200K models at **45K-75K context** with exllamav2, and performant UIs like [exui](https://github.com/turboderp/exui). I go into more detail in this [post](https://old.reddit.com/r/LocalLLaMA/comments/1896igc/how_i_run_34b_models_at_75k_context_on_24gb_fast/)
 
-I recommend exl2 quantizations profiled on data similar to the desired task. It is especially sensitive to the quantization data at low bpw.
+I recommend exl2 quantizations profiled on data similar to the desired task. It is especially sensitive to the quantization data at low bpw. I've published my own fiction-oriented quantizations here: https://huggingface.co/collections/brucethemoose/most-recent-merge-65742644ca03b6c514afa204
 
 To load this in full-context backends like transformers and vllm, you *must* change `max_position_embeddings` in config.json to a lower value than 200,000, otherwise you will OOM!
 ***