Supporting the model on text-generation-inference server
Hi all,
Thanks for the great effort!
I would like to ask why I can't use the model with text-generation-inference.
I tried to launch the server as follows: text-generation-launcher --model-id data/jais-13b-chat
(I downloaded the repo locally.)
Here are the results:
2023-09-11T11:02:01.949055Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 81, in serve
    server.serve(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 184, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 136, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 298, in get_model
    raise ValueError(f"Unsupported model type {model_type}")
ValueError: Unsupported model type jais
Am I missing something?
I would appreciate your support, and if you need any more details about this, please let me know.
It works for me, but you need to use the Transformers loader. As the traceback shows, TGI only dispatches to a fixed list of supported model types, and jais is not one of them. I do not know how to do this from the command line.
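For reference, loading it through the Transformers loader in a Python script looks roughly like this (a minimal sketch, assuming the locally downloaded repo at data/jais-13b-chat, fp16 weights, enough GPU memory, and placeholder generation settings):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "data/jais-13b-chat"  # path of the locally downloaded repo (assumed)

tokenizer = AutoTokenizer.from_pretrained(model_path)
# trust_remote_code is required because jais ships its own modeling code
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)

inputs = tokenizer("The capital of UAE is", return_tensors="pt").to(model.device)
output_ids = model.generate(
    **inputs, max_new_tokens=50, do_sample=True, top_p=0.9, temperature=0.3
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```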
I'm getting the same error. How do I run it on a server for inference? Using TGI or anything else? Please help us with the necessary parameters.
I don't know what you mean, but if you want to load it in 4-bit to run with low VRAM, here is what I am using.
My setup is Windows:
CUDA from https://developer.nvidia.com/cuda-downloads?target_os=Windows&target_arch=x86_64
Python 3.11
then a Python environment (i.e. python -m venv venv)
pip install transformers accelerate
pip install https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.41.1-py3-none-win_amd64.whl
pip install torch==2.0.1+cu117 --index-url https://download.pytorch.org/whl/cu117
and a working Python script. I have two GPUs; I am specifying cuda:0 because it has 24 GB of VRAM:
# -*- coding: utf-8 -*-
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = r"C:\AI\ML Models\inception-mbzuai_jais-13b"  # raw string so the backslashes are not treated as escapes
device = "cuda:0" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_path)
# load the model in 4-bit so it fits in low VRAM
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map=device,
    load_in_4bit=True,
    trust_remote_code=True,
    bnb_4bit_compute_dtype=torch.float16,
)

def get_response(text, tokenizer=tokenizer, model=model):
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    inputs = input_ids.to(device)
    input_len = inputs.shape[-1]
    generate_ids = model.generate(
        inputs,
        top_p=0.9,
        temperature=0.3,
        max_length=200 - input_len,
        min_length=input_len + 4,
        repetition_penalty=1.2,
        do_sample=True,
    )
    response = tokenizer.batch_decode(
        generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
    )[0]
    return response

text = "عاصمة دولة الإمارات العربية المتحدة ه"  # Arabic: "The capital of the United Arab Emirates is..."
print(get_response(text))

text = "The capital of UAE is"
print(get_response(text))
Also, you may need peft. I am not sure exactly what it does, but it fixed an error I was getting:
pip install peft
# quantization config
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

# reuses model_path from the script above
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda:0",
    quantization_config=nf4_config,
    trust_remote_code=True,
)
model = prepare_model_for_kbit_training(model)
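Something like this should then work as a quick sanity check, reusing the get_response helper from the script above with the NF4-quantized model:

```python
# generate with the NF4-quantized model via the earlier helper
print(get_response("The capital of UAE is", model=model))
```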