alokabhishek committed on
Commit 773b9bf
1 Parent(s): 83b7d62

Updated Readme

Files changed (1)
  1. README.md +64 -1
README.md CHANGED
@@ -12,7 +12,7 @@ tags:
  - 4.0-bpw
  ---

- # Model Card for alokabhishek/Mistral-7B-Instruct-v0.2-5.0-bpw-exl2
+ # Model Card for alokabhishek/Mistral-7B-Instruct-v0.2-4.0-bpw-exl2

  <!-- Provide a quick summary of what the model is/does. -->
  This repo contains a 4-bit quantized (using ExLlamaV2) version of Mistral AI's Mistral-7B-Instruct-v0.2
@@ -79,6 +79,69 @@ model_name = model_id.split("/")[-1]
  ```shell
  # Run model
  !python exllamav2/test_inference.py -m {model_name}/ -p "Tell me a funny joke about Large Language Models meeting a Blackhole in an intergalactic Bar."
+ ```
+
+ ```python
+ import sys, os
+
+ # Make the local exllamav2 checkout importable (assumes this script lives in its examples folder)
+ sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+
+ from exllamav2 import (
+     ExLlamaV2,
+     ExLlamaV2Config,
+     ExLlamaV2Cache,
+     ExLlamaV2Tokenizer,
+ )
+
+ from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler
+
+ import time
+
+ # Initialize model and cache
+
+ model_directory = "/model_path/Mistral-7B-Instruct-v0.2-4.0-bpw-exl2/"
+ print("Loading model: " + model_directory)
+
+ config = ExLlamaV2Config(model_directory)
+ model = ExLlamaV2(config)
+ cache = ExLlamaV2Cache(model, lazy=True)
+ model.load_autosplit(cache)
+ tokenizer = ExLlamaV2Tokenizer(config)
+
+ # Initialize generator
+
+ generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
+
+ # Generate some text
+
+ settings = ExLlamaV2Sampler.Settings()
+ settings.temperature = 0.85
+ settings.top_k = 50
+ settings.top_p = 0.8
+ settings.token_repetition_penalty = 1.01
+ settings.disallow_tokens(tokenizer, [tokenizer.eos_token_id])
+
+ prompt = "Tell me a funny joke about Large Language Models meeting a Blackhole in an intergalactic Bar."
+
+ max_new_tokens = 512
+
+ generator.warmup()
+ time_begin = time.time()
+
+ output = generator.generate_simple(prompt, settings, max_new_tokens, seed=1234)
+
+ time_end = time.time()
+ time_total = time_end - time_begin
+
+ print(output)
+ print()
+ print(
+     f"Response generated in {time_total:.2f} seconds, {max_new_tokens} tokens, {max_new_tokens / time_total:.2f} tokens/second"
+ )
+
  ```

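The committed example assumes the exl2 weights are already on disk at `model_directory`. A minimal sketch of one way to fetch them first, using `huggingface_hub`'s `snapshot_download`; the local directory name below is an illustrative assumption, not something the README specifies:

```python
# Sketch only: download the quantized weights before running the inference
# example above. Assumes huggingface_hub is installed; local_dir is an
# arbitrary example path.
from huggingface_hub import snapshot_download

model_directory = snapshot_download(
    repo_id="alokabhishek/Mistral-7B-Instruct-v0.2-4.0-bpw-exl2",
    local_dir="Mistral-7B-Instruct-v0.2-4.0-bpw-exl2",
)
print("Model downloaded to:", model_directory)
```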