Supporting the model on text-generation-inference server

by rsalshalan - opened

Hi all,
Thanks for the great efforts!

I would like to ask why I can't use the model with text-generation-inference.

I tried to launch the server as follows: text-generation-launcher --model-id data/jais-13b-chat (I downloaded the repo locally).

Here are the results:

2023-09-11T11:02:01.949055Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

Traceback (most recent call last):

  File "/opt/conda/bin/text-generation-server", line 8, in <module>

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/", line 81, in serve

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/", line 184, in serve

  File "/opt/conda/lib/python3.9/asyncio/", line 44, in run
    return loop.run_until_complete(main)

  File "/opt/conda/lib/python3.9/asyncio/", line 647, in run_until_complete
    return future.result()

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/", line 136, in serve_inner
    model = get_model(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/", line 298, in get_model
    raise ValueError(f"Unsupported model type {model_type}")

ValueError: Unsupported model type jais

Am I missing something?

Would appreciate your support and if you need any more details about this please let me know
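For context, here is a sketch (not TGI's actual source) of why this error occurs: text-generation-inference dispatches on the `model_type` field in the repo's config.json, and at this time "jais" is not in its table of supported architectures, so `get_model` raises the `ValueError` shown in the traceback. The supported set below is an illustrative subset only:

```python
# Illustrative sketch of TGI's model dispatch (not its real code):
# it reads "model_type" from config.json and rejects unknown types.
import json

SUPPORTED = {"llama", "gpt_neox", "bloom", "falcon"}  # illustrative subset

def get_model(config_path):
    with open(config_path) as f:
        model_type = json.load(f)["model_type"]
    if model_type not in SUPPORTED:
        raise ValueError(f"Unsupported model type {model_type}")
    return model_type
```

Since jais ships custom modeling code (`trust_remote_code=True`), it falls outside this table, which is why the workaround below loads it through transformers directly.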

It works for me, but you need to use the transformers loader. I don't know how to do this from the command line.

I'm getting the same error. How do I run it on a server for inference? Using TGI or anything else? Please help us with the necessary parameters.
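Until TGI supports jais, one stopgap is to wrap the transformers-based generation in any small HTTP server yourself. Below is a minimal standard-library sketch; `fake_generate` is a placeholder for the real model call (e.g. the `get_response` function shown later in this thread), and the route/payload shape is my own choice, not a TGI-compatible API:

```python
# Minimal HTTP inference server sketch using only the standard library.
# fake_generate is a stand-in for the actual transformers generation call.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def fake_generate(prompt):
    # placeholder for tokenizer + model.generate(...)
    return prompt + " ..."

class GenerateHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        reply = {"generated_text": fake_generate(payload["inputs"])}
        body = json.dumps(reply).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep stdout quiet

def run(port=8080):
    HTTPServer(("127.0.0.1", port), GenerateHandler).serve_forever()
```

Call `run()` and query it with e.g. `curl -d '{"inputs": "The capital of UAE is"}' http://127.0.0.1:8080`. This gives you a server endpoint without TGI, at the cost of TGI's batching and streaming features.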

I'm not sure what you mean, but if you want to load it in 4-bit to work with low VRAM, here is what I am using.

My setup: Windows, Python 3.11.

Then create a Python environment (i.e. python -m venv venv) and install:

pip install transformers accelerate
pip install
pip install torch==2.0.1+cu117 --index-url

A working Python script. I have two GPUs, which is why I'm specifying cuda:0; that card has 24 GB of VRAM:

# -*- coding: utf-8 -*-

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# raw string so the Windows backslashes are not treated as escapes
model_path = r"C:\AI\ML Models\inception-mbzuai_jais-13b"

device = "cuda:0" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map=device, load_in_4bit=True, trust_remote_code=True
)

def get_response(text, tokenizer=tokenizer, model=model):
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    inputs = input_ids.to(device)
    input_len = inputs.shape[-1]
    # the original generate() arguments besides min_length were cut off;
    # max_new_tokens=200 is a stand-in, not the poster's exact settings
    generate_ids = model.generate(
        inputs,
        max_new_tokens=200,
        min_length=input_len + 4,
    )
    response = tokenizer.batch_decode(
        generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
    )[0]
    return response

text = "عاصمة دولة الإمارات العربية المتحدة ه"  # "The capital of the United Arab Emirates is..."
print(get_response(text))

text = "The capital of UAE is"
print(get_response(text))

You may also need to use peft. I am not sure exactly what it does, but it solved an error I was getting:

pip install peft

# quantization_config (the original BitsAndBytesConfig arguments were cut
# off; this is a typical NF4 setup matching the variable name)
from transformers import BitsAndBytesConfig
from peft import prepare_model_for_kbit_training

nf4_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")

model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="cuda:0", quantization_config=nf4_config, trust_remote_code=True
)
model = prepare_model_for_kbit_training(model)
