
SearchUnify-ML/xgen-7b-8k-open-instruct-gptq

With industry-first LLM integrations across its suite of products (Cognitive Search, SUVA, Knowbler, Escalation Predictor, Agent Helper and Community Helper), coupled with a federated retrieval augmented generation (FRAG) architecture, SearchUnify's unified cognitive platform fetches relevant information and responses to deliver more accurate and contextually appropriate support and self-service experiences.

Leveraging the state-of-the-art GPTQ quantization method, SearchUnify optimized the XGen-7B model for a low memory footprint and rapid response generation.

These are GPTQ 4-bit model files for VMware's XGen 7B 8K Open Instruct. They are the result of quantizing the model to 4-bit using GPTQ-for-LLaMa.
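
For reference, a comparable 4-bit quantization can be reproduced with AutoGPTQ. The published files were produced with GPTQ-for-LLaMa, so treat the following as a minimal sketch under assumed settings (the base model id, calibration text, and group size of 128 are inferred from the gptq_model-4bit-128g filename), not the exact recipe used here:

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

base_model = "VMware/xgen-7b-8k-open-instruct"  # assumed base model repo

# 4-bit weights with group size 128, matching the gptq_model-4bit-128g filename
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)

tokenizer = AutoTokenizer.from_pretrained(base_model,
                                          use_fast=False,
                                          trust_remote_code=True)
model = AutoGPTQForCausalLM.from_pretrained(base_model,
                                            quantize_config,
                                            trust_remote_code=True)

# Calibration examples; a real run uses a few hundred representative samples
examples = [tokenizer("Explain the rules of field hockey to a novice.")]

model.quantize(examples)
model.save_quantized("xgen-7b-8k-open-instruct-gptq", use_safetensors=False)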

How to use this GPTQ model from Python code

First, make sure you have AutoGPTQ installed:

pip install auto-gptq

Second, install tiktoken, which the tokenizer requires:

pip install tiktoken
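
Optionally, confirm that both packages import cleanly before loading the model:

python -c "import auto_gptq, tiktoken"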

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_name_or_path = "SearchUnify-ML/xgen-7b-8k-open-instruct-gptq"
model_basename = "gptq_model-4bit-128g"

use_triton = False

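# The XGen tokenizer is tiktoken-based and shipped as remote code,
# hence use_fast=False and trust_remote_code=True.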
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path,
                                          use_fast=False,
                                          trust_remote_code=True)

model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
                                           model_basename=model_basename,
                                           use_safetensors=False,
                                           trust_remote_code=True,
                                           device="cuda:0",
                                           use_triton=use_triton)

# Note: check the prompt template is correct for this model.
prompt = "Explain the rules of field hockey to a novice."
prompt_template = f'''### Instruction: {prompt}
### Response:'''

print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, do_sample=True, temperature=0.3, max_new_tokens=512)
print(f"\n\n {tokenizer.decode(output[0]).split('### Response:')[1]}")
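
Alternatively, the quantized model can usually be wrapped in Transformers' text-generation pipeline. This is a minimal sketch reusing the model, tokenizer, and prompt_template from above; whether the AutoGPTQ wrapper can be passed to the pipeline directly depends on the installed auto-gptq and transformers versions:

from transformers import pipeline

# Pass the AutoGPTQ model object and tokenizer straight to the pipeline
pipe = pipeline("text-generation",
                model=model,
                tokenizer=tokenizer,
                do_sample=True,
                temperature=0.3,
                max_new_tokens=512)

print(pipe(prompt_template)[0]["generated_text"])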