This repo contains AWQ model files for SuperAGI's SAM.

About AWQ

AWQ is an efficient, accurate, and fast low-bit weight quantization method that currently supports 4-bit quantization. Compared to GPTQ, it offers faster Transformers-based inference, with quality equivalent to or better than the most commonly used GPTQ settings.
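To make "4-bit weight quantization" with a group size more concrete, here is a minimal sketch of group-wise quantization in plain Python. It is an illustration only: it uses simple round-to-nearest min/max quantization per group and does not implement AWQ's activation-aware scaling. The function names and the random test weights are hypothetical.

```python
import random

def quantize_group(weights, bits=4):
    """Map one group of floats to unsigned integer codes in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1                      # 15 for 4-bit
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels or 1.0       # fall back to 1.0 for a constant group
    codes = [round((w - w_min) / scale) for w in weights]
    return codes, scale, w_min                    # w_min acts as the zero point

def quantize(weights, group_size=128, bits=4):
    """Quantize a flat list of weights, one (codes, scale, zero) tuple per group."""
    return [
        quantize_group(weights[start:start + group_size], bits)
        for start in range(0, len(weights), group_size)
    ]

def dequantize(groups):
    """Rebuild approximate float weights from the per-group codes."""
    restored = []
    for codes, scale, zero in groups:
        restored.extend(code * scale + zero for code in codes)
    return restored

# Hypothetical weights, just to exercise the round trip.
rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(512)]

groups = quantize(weights)
restored = dequantize(groups)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"{len(groups)} groups, max abs error {max_error:.6f}")
```

Each group stores its own scale and zero point, so a smaller group size tracks the weights more closely at the cost of extra metadata.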

AWQ models are currently supported on Linux and Windows, with NVidia GPUs only. macOS users: please use GGUF models instead.

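It is supported by the tools covered in the sections below:

  • text-generation-webui, using Loader: AutoAWQ
  • vLLM
  • Hugging Face Text Generation Inference (TGI)
  • Transformers, together with AutoAWQ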

Repositories available

Prompt template: Mistral

[INST] {prompt} [/INST]
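As a small sketch, the template above can be applied with a helper like the following. The names `PROMPT_TEMPLATE` and `build_prompt` are hypothetical and not part of any library.

```python
# Mistral-style instruction template, as given in this model card.
PROMPT_TEMPLATE = "[INST] {prompt} [/INST]"

def build_prompt(prompt: str) -> str:
    """Wrap a user prompt in the instruction template."""
    return PROMPT_TEMPLATE.format(prompt=prompt.strip())

print(build_prompt("Tell me about AI"))  # [INST] Tell me about AI [/INST]
```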

Provided files, and AWQ parameters

I currently release 128g GEMM models only. The addition of group_size 32 models, and GEMV kernel models, is being actively considered.

Models are released as sharded safetensors files.

| Branch | Bits | GS  | AWQ Dataset          | Seq Len | Size    |
| ------ | ---- | --- | -------------------- | ------- | ------- |
| main   | 4    | 128 | VMware Open Instruct | 4096    | 4.15 GB |
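The listed size can be sanity-checked with some rough arithmetic. The sketch below rests on several assumptions that are not stated in this card: roughly 7.24 billion parameters for the Mistral 7B base, the embedding and output layers (vocabulary 32,000, hidden size 4,096) kept in 16-bit, and one 16-bit scale plus one 4-bit zero point stored per group of 128 weights.

```python
TOTAL_PARAMS = 7.24e9              # assumption: approximate Mistral 7B parameter count
FP16_PARAMS = 2 * 32_000 * 4_096   # assumption: embedding + output head stay in fp16
BITS = 4                           # from the table above
GROUP_SIZE = 128                   # from the table above
SCALE_BITS = 16                    # assumption: one fp16 scale per group
ZERO_BITS = 4                      # assumption: one 4-bit zero point per group

def estimate_size_gb(total_params=TOTAL_PARAMS, fp16_params=FP16_PARAMS):
    """Rough on-disk size in GB (10**9 bytes) under the assumptions above."""
    quantized_params = total_params - fp16_params
    bits_per_weight = BITS + (SCALE_BITS + ZERO_BITS) / GROUP_SIZE
    quantized_bytes = quantized_params * bits_per_weight / 8
    fp16_bytes = fp16_params * 2
    return (quantized_bytes + fp16_bytes) / 1e9

print(f"Estimated size: {estimate_size_gb():.2f} GB")
```

Under these assumptions the estimate lands close to the 4.15 GB shown in the table; the real file layout may differ in its details.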

How to easily download and use this model in text-generation-webui

Please make sure you're using the latest version of text-generation-webui.

It is strongly recommended to use the text-generation-webui one-click-installers unless you're sure you know how to make a manual install.

  1. Click the Model tab.
  2. Under Download custom model or LoRA, enter TheBloke/SAM-AWQ.
  3. Click Download.
  4. The model will start downloading. Once it's finished it will say "Done".
  5. In the top left, click the refresh icon next to Model.
  6. In the Model dropdown, choose the model you just downloaded: SAM-AWQ
  7. Select Loader: AutoAWQ.
  8. Click Load. Once loading finishes, the model is ready for use.
  9. If you want any custom settings, set them and then click Save settings for this model followed by Reload the Model in the top right.
  10. Once you're ready, click the Text Generation tab and enter a prompt to get started!

Multi-user inference server: vLLM

Documentation on installing and using vLLM can be found in the official vLLM documentation.

  • Please ensure you are using vLLM version 0.2 or later.
  • When using vLLM as a server, pass the --quantization awq parameter.

For example:

python3 -m vllm.entrypoints.api_server --model TheBloke/SAM-AWQ --quantization awq --dtype auto
  • When using vLLM from Python code, again set quantization="awq".

For example:

from vllm import LLM, SamplingParams

prompts = [
    "Tell me about AI",
    "Write a story about llamas",
    "What is 291 - 150?",
    "How much wood would a woodchuck chuck if a woodchuck could chuck wood?",
]

# Plain string (not an f-string) so that {prompt} is filled in by .format() below
prompt_template = "[INST] {prompt} [/INST]"

# Wrap every prompt in the instruction template
prompts = [prompt_template.format(prompt=prompt) for prompt in prompts]

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="TheBloke/SAM-AWQ", quantization="awq", dtype="auto")

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
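When vLLM is started as a server with the command shown earlier, it can be queried over HTTP. The following is a minimal sketch, assuming the server listens on `http://localhost:8000` and exposes the `/generate` endpoint of the demo API server, which accepts a JSON body with a `prompt` field plus sampling parameters and returns a `text` list. The helper names are hypothetical, and the host and port are assumptions to adjust for your deployment.

```python
import json
import urllib.request

def build_generate_request(prompt, base_url="http://localhost:8000", **sampling):
    """Build a POST request for the (assumed) /generate endpoint."""
    payload = {"prompt": f"[INST] {prompt} [/INST]", **sampling}
    return urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(prompt, **kwargs):
    """Send the request and return the generated texts (needs a running server)."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["text"]

# Building the request does not contact the server, so this part runs anywhere.
request = build_generate_request(
    "Tell me about AI", temperature=0.8, top_p=0.95, max_tokens=128
)
print(request.full_url)
print(request.data.decode("utf-8"))
```

Calling `generate("Tell me about AI", temperature=0.8)` would then send the request to a running server.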

Multi-user inference server: Hugging Face Text Generation Inference (TGI)

Use TGI version 1.1.0 or later. The official Docker container is: ghcr.io/huggingface/text-generation-inference:1.1.0

Example Docker parameters:

--model-id TheBloke/SAM-AWQ --port 3000 --quantize awq --max-input-length 3696 --max-total-tokens 4096 --max-batch-prefill-tokens 4096
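The two length limits above interact: the prompt must fit within `--max-input-length`, and the prompt plus generated tokens must fit within `--max-total-tokens`. A small sketch of that arithmetic follows; the helper name is hypothetical and the constants are taken from the example parameters.

```python
MAX_INPUT_LENGTH = 3696   # --max-input-length from the example above
MAX_TOTAL_TOKENS = 4096   # --max-total-tokens from the example above

def max_new_tokens_for(prompt_tokens,
                       max_input_length=MAX_INPUT_LENGTH,
                       max_total_tokens=MAX_TOTAL_TOKENS):
    """Return the largest max_new_tokens that still fits the total budget."""
    if prompt_tokens > max_input_length:
        raise ValueError(
            f"prompt has {prompt_tokens} tokens, limit is {max_input_length}"
        )
    return max_total_tokens - prompt_tokens

print(max_new_tokens_for(3696))  # 400
print(max_new_tokens_for(1000))  # 3096
```

With these settings, a prompt at the maximum input length still leaves room for 400 generated tokens.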

Example Python code for interfacing with TGI (requires huggingface-hub 0.17.0 or later):

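pip3 install huggingface-hub

from huggingface_hub import InferenceClient

endpoint_url = "https://your-endpoint-url-here"

prompt = "Tell me about AI"
prompt_template = f"[INST] {prompt} [/INST]"

client = InferenceClient(endpoint_url)
# Send the templated prompt; the sampling values below mirror the
# Transformers example in this card and are illustrative, not required.
response = client.text_generation(
    prompt_template,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    repetition_penalty=1.1,
)

print("Model output: ", response)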

Inference from Python code using Transformers

Install the necessary packages

pip3 install --upgrade "autoawq>=0.1.6" "transformers>=4.35.0"

Note that if you are using PyTorch 2.0.1, the above AutoAWQ command will automatically upgrade you to PyTorch 2.1.0.

If you are using CUDA 11.8 and wish to continue using PyTorch 2.0.1, instead run this command:

pip3 install https://github.com/casper-hansen/AutoAWQ/releases/download/v0.1.6/autoawq-0.1.6+cu118-cp310-cp310-linux_x86_64.whl
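The wheel filename above encodes what it was built for, which is worth checking before installing. The sketch below splits the name into its standard wheel tags; the helper names are hypothetical, and `running_python_tag` assumes a CPython interpreter.

```python
import sys

WHEEL = "autoawq-0.1.6+cu118-cp310-cp310-linux_x86_64.whl"

def parse_wheel_name(filename):
    """Split a wheel filename (without build tag) into its components."""
    stem = filename[: -len(".whl")]
    name, version, python_tag, abi_tag, platform_tag = stem.split("-")
    release, _, local = version.partition("+")   # "0.1.6", "cu118"
    return {
        "name": name,
        "version": release,
        "local": local,              # local version label, here the CUDA build
        "python_tag": python_tag,    # cp310 means CPython 3.10
        "abi_tag": abi_tag,
        "platform_tag": platform_tag,
    }

def running_python_tag():
    """Tag of the current interpreter, assuming CPython."""
    return f"cp{sys.version_info.major}{sys.version_info.minor}"

info = parse_wheel_name(WHEEL)
print(info)
print("Python matches wheel:", running_python_tag() == info["python_tag"])
```

This particular wheel targets CUDA 11.8, CPython 3.10, and Linux x86_64; on any other combination, installing from source as described below is the safer route.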

If you have problems installing AutoAWQ using the pre-built wheels, install it from source instead:

pip3 uninstall -y autoawq
git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip3 install .

Transformers example code (requires Transformers 4.35.0 and later)

from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_name_or_path = "TheBloke/SAM-AWQ"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
# Assumed loading arguments: place the model on the first CUDA GPU
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    low_cpu_mem_usage=True,
    device_map="cuda:0"
)

# Using the text streamer to stream output one token at a time
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

prompt = "Tell me about AI"
prompt_template = f"[INST] {prompt} [/INST]"

# Convert prompt to tokens and move them to the same GPU as the model
tokens = tokenizer(
    prompt_template,
    return_tensors="pt"
).input_ids.cuda()

generation_params = {
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.95,
    "top_k": 40,
    "max_new_tokens": 512,
    "repetition_penalty": 1.1
}

# Generate streamed output, visible one token at a time
generation_output = model.generate(
    tokens,
    streamer=streamer,
    **generation_params
)

# Generation without a streamer, which will include the prompt in the output
generation_output = model.generate(
    tokens,
    **generation_params
)

# Get the tokens from the output, decode them, print them
token_output = generation_output[0]
text_output = tokenizer.decode(token_output)
print("model.generate output: ", text_output)

# Inference is also possible via Transformers' pipeline
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    **generation_params
)

pipe_output = pipe(prompt_template)[0]['generated_text']
print("pipeline output: ", pipe_output)


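The files provided are tested to work with:

  • text-generation-webui, using Loader: AutoAWQ
  • vLLM version 0.2 and later
  • Hugging Face Text Generation Inference (TGI) version 1.1.0 and later
  • Transformers version 4.35.0 and later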


Original model card: SuperAGI's SAM

Model Card

SAM (Small Agentic Model) is a 7B model that demonstrates impressive reasoning abilities despite its small size. SAM-7B has outperformed existing SoTA models on various reasoning benchmarks, including GSM8k and ARC-C.

For full details of this model please read our release blog post.

Key Contributions

  • SAM-7B outperforms GPT-3.5, Orca, and several 70B models on multiple reasoning benchmarks, including ARC-C and GSM8k.
  • Interestingly, despite being trained on a 97% smaller dataset, SAM-7B surpasses Orca-13B on GSM8k.
  • All responses in our fine-tuning dataset are generated by open-source models without any assistance from state-of-the-art models like GPT-3.5 or GPT-4.


  • Trained by: SuperAGI Team
  • Hardware: 6 x NVIDIA H100 SXM (80GB)
  • Model used: Mistral 7B
  • Duration of finetuning: 4 hours
  • Number of epochs: 1
  • Batch size: 16
  • Learning Rate: 2e-5
  • Warmup Ratio: 0.1
  • Optimizer: AdamW
  • Scheduler: Cosine
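To illustrate how the learning rate, warmup ratio, and cosine scheduler listed above typically fit together, here is a minimal sketch. It assumes linear warmup to the peak rate followed by cosine decay to zero, which is a common convention and not confirmed by this card; the total step count is hypothetical.

```python
import math

def lr_at(step, total_steps, peak_lr=2e-5, warmup_ratio=0.1):
    """Learning rate at a given step: linear warmup, then cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

TOTAL_STEPS = 1000  # hypothetical; the real step count is not given in this card

for step in (0, 50, 100, 550, 1000):
    print(step, f"{lr_at(step, TOTAL_STEPS):.2e}")
```

With a warmup ratio of 0.1, the peak rate of 2e-5 is reached after the first 10% of steps and then decays smoothly over the rest of the run.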

Example Prompt

The template used to build a prompt for the Instruct model is defined as follows:

<s> [INST] Instruction [/INST] Model answer</s> [INST] Follow-up instruction [/INST]

Note that <s> and </s> are special tokens for beginning of string (BOS) and end of string (EOS) while [INST] and [/INST] are regular strings.
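As a sketch, a multi-turn prompt in this format could be assembled as follows. The helper name and its arguments are hypothetical. Many tokenizers add the BOS token on their own, so when the string is tokenized afterwards it may be necessary to drop the leading `<s>` or disable automatic special tokens.

```python
BOS, EOS = "<s>", "</s>"

def build_chat_prompt(turns, next_instruction):
    """Build a prompt from completed (instruction, answer) turns plus a new instruction."""
    prompt = BOS
    for instruction, answer in turns:
        prompt += f" [INST] {instruction} [/INST] {answer}{EOS}"
    prompt += f" [INST] {next_instruction} [/INST]"
    return prompt

print(build_chat_prompt([("Instruction", "Model answer")], "Follow-up instruction"))
# <s> [INST] Instruction [/INST] Model answer</s> [INST] Follow-up instruction [/INST]
```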


These benchmarks show that our model has improved reasoning compared to Orca 2-7B, Orca 2-13B, and GPT-3.5. Despite its smaller size, it shows better multi-hop reasoning.

[Figure: Reasoning Benchmark Performance]

Note: A temperature of 0.3 is suggested for optimal performance.
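For intuition about what a low temperature does, the sketch below applies temperature scaling to a softmax over a few made-up logits. The logit values are hypothetical; the point is only that a temperature of 0.3 concentrates probability on the top-scoring token compared to 1.0.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by the temperature (numerically stabilized)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical next-token scores

for temperature in (1.0, 0.3):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

In Transformers, the suggested value would be passed as `temperature=0.3` together with `do_sample=True` when calling `generate`.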

Run the model

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "SuperAGI/SAM"
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(model_id)

text = "Can elephants fly?"
inputs = tokenizer(text, return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))


SAM is a demonstration that better reasoning can be induced using a smaller amount of high-quality data generated with open-source LLMs. The model is not suitable for conversations or simple Q&A; it performs better at task breakdown and reasoning. It does not have any moderation mechanisms, so it is not suitable for production usage: it has no guardrails for toxicity, societal bias, or language limitations. We would love to collaborate with the community to build safer and better models.

The SuperAGI AI Team

Anmol Gautam, Arkajit Datta, Rajat Chawla, Ayush Vatsal, Sukrit Chatterjee, Adarsh Jha, Abhijeet Sinha, Rakesh Krishna, Adarsh Deep, Ishaan Bhola, Mukunda NS, Nishant Gaurav.
