wenbopan committed on
Commit 8bfc9c6
1 Parent(s): 4868f40
Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -17,7 +17,7 @@ Faro-Yi-9B-200K is an improved [Yi-9B-200K](https://huggingface.co/01-ai/Yi-9B-2
 
 ## How to Use
 
-Faro-Yi-9B-200K uses the chatml template and performs well in both short and long contexts. For longer inputs, I recommend to use vLLM to have a max prompt of 32K under 24GB of VRAM. Setting `kv_cache_dtype="fp8_e5m2"` allows for 48K input length. 4bit-AWQ quantization on top of that can boost input length to 160K, albeit with some performance impact. Adjust `max_model_len` arg in vLLM or `config.json` to avoid OOM.
+Faro-Yi-9B-200K uses the chatml template and performs well in both short and long contexts. For longer inputs under **24GB of VRAM**, I recommend using vLLM with a max prompt length of 32K. Setting `kv_cache_dtype="fp8_e5m2"` allows for a 48K input length, and 4-bit AWQ quantization on top of that can boost the input length to 160K, albeit with some performance impact. Adjust the `max_model_len` arg in vLLM or `config.json` to avoid OOM.
 
 
 ```python
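The three context-length configurations described in the changed paragraph (32K baseline, 48K with the fp8 KV cache, 160K with 4-bit AWQ on top) can be sketched as vLLM launch arguments. This is a minimal sketch, not a tested recipe: the profile names and the `build_llm_kwargs` helper are illustrative, and it assumes the model id `wenbopan/Faro-Yi-9B-200K` with vLLM installed.

```python
# Sketch of the vLLM settings from the paragraph above. The profile names
# and this helper are illustrative; only the parameter names/values
# (max_model_len, kv_cache_dtype, quantization) come from the text.
CONTEXT_OPTIONS = {
    # Baseline: 32K max prompt under 24GB of VRAM.
    "baseline": {"max_model_len": 32 * 1024},
    # fp8 KV cache extends the input length to 48K.
    "fp8-kv-cache": {"max_model_len": 48 * 1024,
                     "kv_cache_dtype": "fp8_e5m2"},
    # 4-bit AWQ on top of the fp8 KV cache boosts it to 160K,
    # with some performance impact.
    "fp8-kv-cache+awq": {"max_model_len": 160 * 1024,
                         "kv_cache_dtype": "fp8_e5m2",
                         "quantization": "awq"},
}


def build_llm_kwargs(profile: str) -> dict:
    """Return keyword arguments for vllm.LLM for the chosen profile."""
    kwargs = {"model": "wenbopan/Faro-Yi-9B-200K"}
    kwargs.update(CONTEXT_OPTIONS[profile])
    return kwargs


# Usage (requires a GPU and vLLM installed):
#   from vllm import LLM
#   llm = LLM(**build_llm_kwargs("fp8-kv-cache"))
print(build_llm_kwargs("fp8-kv-cache"))
```

Lowering `max_model_len` below these values is the knob to turn first if vLLM still runs out of memory on a given card.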