Help: How to run inference with this model in FP8

#43
by YukiTomita-CC

I rewrote the code from the "Instruct following" section of the Mistral Inference README as follows and executed it in Google Colab.

import torch # Added

from mistral_inference.transformer import Transformer
from mistral_inference.generate import generate

from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest

# mistral_models_path points to the snapshot downloaded in the earlier step of the README
tokenizer = MistralTokenizer.from_file(f"{mistral_models_path}/tekken.json")
model = Transformer.from_folder(mistral_models_path, dtype=torch.float8_e4m3fn) # Changed

prompt = "How expensive would it be to ask a window cleaner to clean all windows in Paris. Make a reasonable guess in US Dollar."

completion_request = ChatCompletionRequest(messages=[UserMessage(content=prompt)])

tokens = tokenizer.encode_chat_completion(completion_request).tokens

out_tokens, _ = generate([tokens], model, max_tokens=64, temperature=0.35, eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id)
result = tokenizer.decode(out_tokens[0])

print(result)

When the model is loaded, VRAM usage is 11.6 GB, but the following error occurs in generate():

/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   2262         # remove once script supports set_grad_enabled
   2263         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 2264     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   2265 
   2266 

RuntimeError: "index_select_cuda" not implemented for 'Float8_e4m3fn'

Versions:
  • mistral_inference==1.3.1
  • torch==2.3.1+cu121
  • safetensors==0.4.3
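
If I read the traceback correctly, it is not anything model-specific but the embedding lookup itself that has no FP8 kernel. A minimal sketch that should reproduce the same error (my assumption: any Float8_e4m3fn weight tensor on CUDA hits the missing index_select kernel):

import torch

# Standalone repro of the failing operation (assumption: the error depends only
# on the Float8_e4m3fn dtype, not on this particular model).
weight = torch.randn(10, 8, device="cuda").to(torch.float8_e4m3fn)  # fp8 "embedding table"
indices = torch.tensor([1, 2, 3], device="cuda")

# nn.Embedding does exactly this under the hood, which is where generate() fails:
torch.nn.functional.embedding(indices, weight)
# -> RuntimeError: "index_select_cuda" not implemented for 'Float8_e4m3fn'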

Am I doing something wrong? If I could run inference in FP8, the model would fit in my 16 GB of VRAM. I would really appreciate it if someone could show me how to run inference in FP8.
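
One thing I considered but have not been able to verify: loading in FP8 and then casting only the token embedding back to BF16 so that the lookup has a supported dtype. This is only a guess (the attention and MLP matmuls will probably hit the same missing FP8 kernels right afterwards), and I am assuming the embedding module is exposed as model.tok_embeddings:

# Untested guess, not a confirmed workaround: keep the bulk of the weights in
# FP8 but cast the token embedding back to BF16 so index_select has a kernel.
# The later linear layers may fail the same way and only move the error.
model = Transformer.from_folder(mistral_models_path, dtype=torch.float8_e4m3fn)
model.tok_embeddings = model.tok_embeddings.to(torch.bfloat16)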
