Run it on Colab.

#4 opened by girrajjangid
# Install dependencies (flash-attn is only needed for FlashAttention-2 support)
!pip -q install transformers==4.34.0
!pip -q install accelerate==0.23.0
!pip -q install flash-attn==2.3.3 --no-build-isolation
from transformers import AutoModelForCausalLM, AutoTokenizer
import transformers
import torch

model_id = "amazon/MistralLite"

# Load the tokenizer and model; device_map="auto" spreads the weights across
# the available GPU/CPU, using the "offload" folder on disk if they don't fit.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             torch_dtype=torch.bfloat16,
                                             offload_folder="offload",
                                             device_map="auto")

# Wrap the model and tokenizer in a text-generation pipeline
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer)

# MistralLite expects the <|prompter|>...</s><|assistant|> prompt template
prompt = "<|prompter|>What are the main challenges to support a long context for LLM? Explain in details 1000-2000 words.</s><|assistant|>"

# Greedy decoding (do_sample=False); stop at EOS or after 5000 new tokens
sequences = pipeline(
    prompt,
    max_new_tokens=5000,
    do_sample=False,
    return_full_text=False,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)

for seq in sequences:
    print(f"{seq['generated_text']}")

flash_attn v2 is not supported on the T4 GPU.

Yes, in this case, we can run the model without flash_attn v2. Thank you!
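For reference, a minimal sketch of running without flash_attn v2, assuming transformers 4.34.x (where FlashAttention-2 is opt-in, so the flash-attn install step can simply be skipped):

# Sketch, assuming transformers 4.34.x: FlashAttention-2 is only enabled when
# use_flash_attention_2=True is passed to from_pretrained, so omitting the flag
# (and skipping the flash-attn pip install) falls back to standard attention.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "amazon/MistralLite",
    torch_dtype=torch.bfloat16,  # see the dtype follow-up below for T4
    device_map="auto",
)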

Also, the T4 doesn't support bfloat16.

You can try float16 instead; it should work as well. Cheers!
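
A minimal sketch of the float16 variant for a T4, assuming only the dtype needs to change relative to the snippet above:

# Sketch for T4 GPUs: the hardware has no bfloat16 support, so load the
# weights in float16 instead; the rest of the original code is unchanged.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "amazon/MistralLite"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # fp16 works on T4; bf16 does not
    device_map="auto",
)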

yinsong1986 changed discussion status to closed