JAIS model response takes a lot of time

#2
by faris98 - opened

Hi, I'm trying to run the JAIS model on Colab with 52 GB of RAM. I've tested it on several GPUs (A100, V100, and T4) as well as the TPU, but generation takes an extremely long time with the code snippet below. Could you help me with this issue, please?

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "inception-mbzuai/jais-13b"

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_path)
# device_map="auto" lets accelerate spread the weights across GPU, CPU and,
# via offload_folder, disk when they do not fit on the GPU.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    trust_remote_code=True,
    offload_folder="offload",
)

def get_response(text, tokenizer=tokenizer, model=model):
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    inputs = input_ids.to(device)
    input_len = inputs.shape[-1]
    generate_ids = model.generate(
        inputs,
        top_p=0.9,
        temperature=0.3,
        max_length=200 - input_len,   # max_length counts prompt plus generated tokens
        min_length=input_len + 4,
        repetition_penalty=1.2,
        do_sample=True,
    )
    response = tokenizer.batch_decode(
        generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
    )[0]
    return response

text = "عاصمة دولة الإمارات العربية المتحدة ه"  # "The capital of the United Arab Emirates is"
print(get_response(text))

We have tested it with 60 GB of RAM and it works fine. You may try loading it in lower precision, as mentioned here.
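One way to check whether the slowness comes from CPU/disk offload is to look at the device map that accelerate builds when device_map="auto" is used. This is just a diagnostic sketch, not part of the model card:

# After from_pretrained(..., device_map="auto", ...):
# hf_device_map records which device each module was placed on.
print(model.hf_device_map)
# Entries marked "cpu" or "disk" mean those layers are offloaded,
# which makes generate() run very slowly.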

@samta-kamboj where can I change it to lower precision? Is there a parameter for that?

You can pass "torch_dtype" when loading, for example:

model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", trust_remote_code=True, torch_dtype=torch.float16)
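To confirm the weights really loaded in half precision, a quick check (illustrative only, assuming a CUDA device is available):

print(next(model.parameters()).dtype)                          # should report torch.float16
print(f"{torch.cuda.memory_allocated() / 1e9:.1f} GB allocated on the GPU")

In float16 the 13B checkpoint is roughly 26 GB of weights, so it should fit on a 40 GB A100 without offloading.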

I think you can load it with oobabooga if you have an 8 GB GPU. Make sure to select the load-in-4bit option to load it with less VRAM. I have a 24 GB GPU and the model runs for me, but I have a problem with the token limit.
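If you'd rather stay in plain transformers than use oobabooga, 4-bit loading is also available through bitsandbytes. A minimal sketch, assuming bitsandbytes and accelerate are installed (option names may vary with your transformers version):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_path = "inception-mbzuai/jais-13b"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4-bit on load
    bnb_4bit_compute_dtype=torch.float16,   # run the matmuls in fp16
)

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=bnb_config,
)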
