Update README.md
Browse files
README.md
CHANGED
@@ -77,6 +77,7 @@ prepare_for_inference(model, backend="bitblas", allow_merge=False) #It takes a w
 
 #Generate
 ###################################################
+#For longer context, make sure to allocate enough cache via the cache_size= parameter
 #gen = HFGenerator(model, tokenizer, max_new_tokens=1000, do_sample=True, compile=None) #Slower generation but no warm-up
 gen = HFGenerator(model, tokenizer, max_new_tokens=1000, do_sample=True, compile="partial").warmup() #Faster generation, but warm-up takes a while
 