Model does not run in Google Colab Free Tier

#2
by sanjeev-bhandari01 - opened

Install the dependencies in Colab:

flash-attention is not supported on the Google Colab free tier, so it is skipped here.
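For context, a quick way to confirm which GPU the runtime has (my own check, not from the Qwen docs; the free tier usually assigns a Tesla T4 with compute capability 7.5, while FlashAttention needs an Ampere-or-newer card):

import torch

# On the Colab free tier this typically prints "Tesla T4" with compute capability 7.5;
# FlashAttention requires compute capability 8.0+ (Ampere), which is why it is skipped above.
major, minor = torch.cuda.get_device_capability()
print(torch.cuda.get_device_name(0), f"compute capability {major}.{minor}")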

!pip install transformers==4.32.0 accelerate tiktoken einops scipy transformers_stream_generator==0.0.4 peft deepspeed -q
!pip install auto-gptq optimum -q
from transformers import AutoTokenizer, AutoModelForCausalLM

# load the tokenizer from the same repo as the model
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-1_8B-Chat-Int8", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-1_8B-Chat-Int8",
    device_map="auto",
    trust_remote_code=True
).eval()
response, history = model.chat(tokenizer, "你好", history=None)
print(response)
# 你好!很高兴为你提供帮助。 (Hello! I'm glad to help you.)

# Qwen-1.8B-Chat can realize role playing, language style transfer, task setting, and behavior setting via the system prompt.
response, _ = model.chat(tokenizer, "你好呀", history=None, system="请用二次元可爱语气和我说话")  # system prompt: "please talk to me in a cute anime tone"
print(response)

It gives the following error:

----> 9 response, history = model.chat(tokenizer, "你好", history=None)
     10 print(response)
     11 # 你好!很高兴为你提供帮助。

4 frames
/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py in sample(self, input_ids, logits_processor, stopping_criteria, logits_warper, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, synced_gpus, streamer, **model_kwargs)
   2758             # sample
   2759             probs = nn.functional.softmax(next_token_scores, dim=-1)
-> 2760             next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
   2761 
   2762             # finished sentences should have their next token be a padding token

RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

The above code runs without any modification on Kaggle. Can you tell me why it does not run on the Google Colab free tier?

Is the problem an older GPU generation, or something else?
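In case it helps with diagnosing this, here is a minimal check I can run after loading the model (my own sketch, not a confirmed fix): do a single forward pass and look for inf/nan in the logits, since that is what torch.multinomial is complaining about in the traceback above.

import torch

# One forward pass with the already-loaded model; if the logits already contain
# inf/nan here, the problem is in the model's forward computation on this GPU,
# not in the sampling done inside model.chat().
input_ids = tokenizer("你好", return_tensors="pt").input_ids.to(model.device)
with torch.no_grad():
    logits = model(input_ids).logits
print("logits dtype:", logits.dtype)
print("contains nan:", torch.isnan(logits).any().item())
print("contains inf:", torch.isinf(logits).any().item())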

Thank you.
