alokabhishek committed
Commit
70e0a24
1 Parent(s): 7c87486

Updated Readme

Files changed (1)
  1. README.md +137 -6
README.md CHANGED
@@ -1,15 +1,146 @@
---
license: apache-2.0
pipeline_tag: text-generation
tags:
- - finetuned
- inference: true
- widget:
- - messages:
-   - role: user
-     content: What is your favorite condiment?
---

# Model Card for Mistral-7B-Instruct-v0.2

The Mistral-7B-Instruct-v0.2 Large Language Model (LLM) is an instruct fine-tuned version of the Mistral-7B-v0.2.
 
---
+ library_name: transformers
license: apache-2.0
pipeline_tag: text-generation
tags:
+ - ExLlamaV2
+ - 8bit
+ - Mistral
+ - Mistral-7B
+ - quantized
+ - exl2
+ - 8.0-bpw
---

+ # Model Card for alokabhishek/Mistral-7B-Instruct-v0.2-8.0-bpw-exl2
+
+ <!-- Provide a quick summary of what the model is/does. -->
+ This repo contains an 8-bit (8.0 bits per weight) ExLlamaV2-quantized version of Mistral AI_'s Mistral-7B-Instruct-v0.2.
+
+
+ ## Model Details
+
+ - Model creator: [Mistral AI_](https://huggingface.co/mistralai)
+ - Original model: [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)
+
+
+ ### About quantization using ExLlamaV2
+
+ - ExLlamaV2 GitHub repo: [turboderp/exllamav2](https://github.com/turboderp/exllamav2)
+
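+ For reference, exl2 quants such as this one are typically produced with the `convert.py` script from the ExLlamaV2 repo. A minimal sketch, not the exact command used for this repo: the input and output directory names are placeholders, the input must be a local copy of the original fp16 model, and the default calibration settings are used.
+
+ ```shell
+ # Quantize the original fp16 model to 8.0 bits per weight (directory names are placeholders)
+ !python exllamav2/convert.py -i Mistral-7B-Instruct-v0.2 -o exl2-working-dir -cf Mistral-7B-Instruct-v0.2-8.0-bpw-exl2 -b 8.0
+ ```
+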
+ # How to Get Started with the Model
+
+ Use the code below to get started with the model.
+
+
+ ## How to run from Python code
+
+ #### First install the package
+ ```shell
+ # Install ExLlamaV2 from source
+ !git clone https://github.com/turboderp/exllamav2
+ !pip install -e exllamav2
+ ```
+
+ #### Import
+
+ ```python
+ # Hub utilities for authentication and repo management, plus torch and OS helpers
+ from huggingface_hub import login, HfApi, create_repo
+ from torch import bfloat16
+ import locale
+ import torch
+ import os
+ ```
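+
+ The `login` helper imported above can be used to authenticate with the Hugging Face Hub before cloning or pushing repos. A minimal sketch; the token string is a placeholder for your own access token:
+
+ ```python
+ # Authenticate with the Hugging Face Hub (replace with your own token)
+ login(token="hf_your_token_here")
+ ```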
+
+ #### Set up variables
+
+ ```python
+ # Define the model ID for the desired model
+ model_id = "alokabhishek/Mistral-7B-Instruct-v0.2-8.0-bpw-exl2"
+ BPW = 8.0
+
+ # Derive the local folder name from the model ID
+ model_name = model_id.split("/")[-1]
+ ```
+
+ #### Download the quantized model
+ ```shell
+ !git lfs install
+ # Download the model to a local directory
+ !git clone https://{username}:{HF_TOKEN}@huggingface.co/{model_id} {model_name}
+ ```
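+
+ Alternatively, the repo can be fetched with `huggingface_hub` instead of a git clone. A minimal sketch, reusing the `model_id` and `model_name` variables defined above:
+
+ ```python
+ from huggingface_hub import snapshot_download
+
+ # Download all model files into a local directory named after the model
+ snapshot_download(repo_id=model_id, local_dir=model_name)
+ ```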
+
+ #### Run inference on the quantized model using the test_inference.py script
+ ```shell
+ # Run the model with a test prompt
+ !python exllamav2/test_inference.py -m {model_name}/ -p "Tell me a funny joke about Large Language Models meeting a Blackhole in an intergalactic Bar."
+ ```
+
+
+ Alternatively, you can run inference directly from Python using the ExLlamaV2 API:
+
+ ```python
+ import sys, os
+
+ # Only needed when running this script from inside the cloned exllamav2 repo,
+ # so the package can be imported without installing it first
+ sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+
+ from exllamav2 import (
+     ExLlamaV2,
+     ExLlamaV2Config,
+     ExLlamaV2Cache,
+     ExLlamaV2Tokenizer,
+ )
+
+ from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler
+
+ import time
+
+ # Initialize model and cache
+
+ model_directory = "/model_path/Mistral-7B-Instruct-v0.2-8.0-bpw-exl2/"
+ print("Loading model: " + model_directory)
+
+ config = ExLlamaV2Config(model_directory)
+ model = ExLlamaV2(config)
+ cache = ExLlamaV2Cache(model, lazy=True)
+ model.load_autosplit(cache)
+ tokenizer = ExLlamaV2Tokenizer(config)
+
+ # Initialize generator
+
+ generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
+
+ # Set sampling parameters and generate some text
+
+ settings = ExLlamaV2Sampler.Settings()
+ settings.temperature = 0.85
+ settings.top_k = 50
+ settings.top_p = 0.8
+ settings.token_repetition_penalty = 1.01
+ settings.disallow_tokens(tokenizer, [tokenizer.eos_token_id])
+
+ prompt = "Tell me a funny joke about Large Language Models meeting a Blackhole in an intergalactic Bar."
+
+ max_new_tokens = 512
+
+ generator.warmup()
+ time_begin = time.time()
+
+ output = generator.generate_simple(prompt, settings, max_new_tokens, seed=1234)
+
+ time_end = time.time()
+ time_total = time_end - time_begin
+
+ print(output)
+ print()
+ print(f"Response generated in {time_total:.2f} seconds")
+ ```
+
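+ Note that Mistral-7B-Instruct models expect prompts in the `[INST] ... [/INST]` instruction format, so wrapping the prompt accordingly generally gives better responses. A minimal sketch reusing the `prompt`, `settings`, `max_new_tokens`, and `generator` objects defined above:
+
+ ```python
+ # Mistral-7B-Instruct expects the [INST] ... [/INST] instruction format
+ instruct_prompt = f"[INST] {prompt} [/INST]"
+
+ output = generator.generate_simple(instruct_prompt, settings, max_new_tokens, seed=1234)
+ print(output)
+ ```
+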
# Model Card for Mistral-7B-Instruct-v0.2

The Mistral-7B-Instruct-v0.2 Large Language Model (LLM) is an instruct fine-tuned version of the Mistral-7B-v0.2.