It can run on two 4090s or a single RTX 6000 Ada, but it does not behave well.
from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch
model = "tiiuae/falcon-40b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
    model_kwargs={"load_in_8bit": True},
)
sequences = pipeline(
    "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")
But it doesn't behave as well as the author described.
What's wrong with the code? It doesn't act like a bot, it just parrots.
Falcon 40B needs at least around 90GB of VRAM to run; unfortunately, neither of the configurations mentioned in the title meets this requirement. The community has, however, quantised a version (https://huggingface.co/TheBloke/falcon-40b-instruct-GPTQ) that might fit on the hardware you have available.
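For reference, a minimal sketch of loading that GPTQ checkpoint, assuming a recent transformers release with optimum and auto-gptq installed so that from_pretrained can read GPTQ weights directly (older setups need the AutoGPTQ loader shown on that model card instead); the generation parameters below are only illustrative:

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "TheBloke/falcon-40b-instruct-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# The checkpoint is already quantised (4-bit GPTQ), so no load_in_8bit here;
# device_map="auto" shards the layers across whatever GPUs are visible.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
)

inputs = tokenizer("Daniel: Hello, Girafatron!\nGirafatron:", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, top_k=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))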
It can be run on two 3090s/4090s in 8-bit mode. I have literally run it that way, but the results are not good.
Could you offer an online demo?
Sorry, I did not understand your original post correctly. We have not validated the model in anything but bfloat16, and you may be observing some degradation in model quality from quantising the model weights to 8 bits, as you do in your code.
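For comparison, the bfloat16 setup from the model card is essentially your code without the 8-bit override; a minimal sketch (note it needs roughly 90GB of VRAM, so it will not fit on the hardware mentioned above):

from transformers import AutoTokenizer
import transformers
import torch

model = "tiiuae/falcon-40b-instruct"

tokenizer = AutoTokenizer.from_pretrained(model)
# Same pipeline as above, but without model_kwargs={"load_in_8bit": True}:
# weights stay in bfloat16, avoiding quantisation loss at the cost of much more memory.
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)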