Supporting the model on text-generation-inference server
Hi all,
Thanks for the great effort!
I would like to ask why I can't use the model with text-generation-inference.
I tried to launch the server as follows: text-generation-launcher --model-id data/jais-13b-chat
(I downloaded the repo locally.)
Here are the results:
2023-09-11T11:02:01.949055Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 81, in serve
    server.serve(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 184, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 136, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 298, in get_model
    raise ValueError(f"Unsupported model type {model_type}")
ValueError: Unsupported model type jais
Am I missing something?
I would appreciate your support, and if you need any more details about this, please let me know.
It works for me, but you need to use the Transformers loader. As the traceback shows, TGI only dispatches to a fixed list of supported model types, and jais is not one of them. I do not know how to do this from the command line.
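For reference, loading it through the Transformers loader in a Python script looks roughly like this (a minimal sketch, assuming the locally downloaded repo at data/jais-13b-chat, fp16 weights, enough GPU memory, and placeholder generation settings):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "data/jais-13b-chat"  # path of the locally downloaded repo (assumed)

tokenizer = AutoTokenizer.from_pretrained(model_path)
# trust_remote_code is required because jais ships its own modeling code
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)

inputs = tokenizer("The capital of UAE is", return_tensors="pt").to(model.device)
output_ids = model.generate(
    **inputs, max_new_tokens=50, do_sample=True, top_p=0.9, temperature=0.3
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```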
I'm getting the same error. How do I run it on a server for inference? Using TGI or anything else? Please help us with the necessary parameters.
I don't know what you mean, but if you want to load it in 4-bit to run with low VRAM, here is what I am using.
My setup is Windows:
CUDA from https://developer.nvidia.com/cuda-downloads?target_os=Windows&target_arch=x86_64
Python 3.11
then a Python environment (i.e. python -m venv venv)
pip install transformers accelerate
pip install https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.41.1-py3-none-win_amd64.whl
pip install torch==2.0.1+cu117 --index-url https://download.pytorch.org/whl/cu117
and a working Python script. I have two GPUs; I am specifying cuda:0 because it has 24 GB of VRAM:
# -*- coding: utf-8 -*-
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = r"C:\AI\ML Models\inception-mbzuai_jais-13b"  # raw string so the backslashes are not treated as escapes
device = "cuda:0" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_path)
# load the model in 4-bit so it fits in low VRAM
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map=device,
    load_in_4bit=True,
    trust_remote_code=True,
    bnb_4bit_compute_dtype=torch.float16,
)

def get_response(text, tokenizer=tokenizer, model=model):
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    inputs = input_ids.to(device)
    input_len = inputs.shape[-1]
    generate_ids = model.generate(
        inputs,
        top_p=0.9,
        temperature=0.3,
        max_length=200 - input_len,
        min_length=input_len + 4,
        repetition_penalty=1.2,
        do_sample=True,
    )
    response = tokenizer.batch_decode(
        generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
    )[0]
    return response

text = "عاصمة دولة الإمارات العربية المتحدة ه"  # Arabic: "The capital of the United Arab Emirates is..."
print(get_response(text))

text = "The capital of UAE is"
print(get_response(text))
Also, you may need peft. I am not sure exactly what it does, but it fixed an error I was getting:
pip install peft
# quantization config
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

# reuses model_path from the script above
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda:0",
    quantization_config=nf4_config,
    trust_remote_code=True,
)
model = prepare_model_for_kbit_training(model)
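Something like this should then work as a quick sanity check, reusing the get_response helper from the script above with the NF4-quantized model:

```python
# generate with the NF4-quantized model via the earlier helper
print(get_response("The capital of UAE is", model=model))
```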