How to implement multiquery, FlashAttention and ALiBi

#29
by NickyNicky - opened

Thanks for such a great model to play with, but I have a few questions.

According to Falcon's model card, it uses multiquery attention, FlashAttention, and ALiBi. I would like to know how to use these features in code, since the provided examples only show basic usage. From the card:

Positional embeddings: rotary (Su et al., 2021);
Attention: multiquery (Shazeer et al., 2019) and FlashAttention (Dao et al., 2022);
Decoder-block: parallel attention/MLP with a single layer norm.
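
To check my understanding of the multiquery part: every query head attends against a single shared key/value head, which shrinks the KV cache at generation time. Here is my rough PyTorch sketch of the idea (illustrative only, not Falcon's actual modelling_RW.py code; torch.nn.functional.scaled_dot_product_attention requires PyTorch >= 2.0 and can dispatch to a fused FlashAttention-style kernel on supported GPUs):

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiQueryAttention(nn.Module):
    # Illustrative multiquery attention (Shazeer, 2019): all query heads
    # share a single key head and a single value head. Not Falcon's code.
    def __init__(self, hidden_size, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.q_proj = nn.Linear(hidden_size, hidden_size)
        # One head's worth of keys and values, shared by every query head.
        self.kv_proj = nn.Linear(hidden_size, 2 * self.head_dim)
        self.out_proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, x):
        bsz, seq_len, hidden = x.shape
        q = self.q_proj(x).view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        k, v = self.kv_proj(x).split(self.head_dim, dim=-1)
        # Shape (bsz, 1, seq_len, head_dim), broadcast across all query heads.
        k = k.view(bsz, seq_len, 1, self.head_dim).transpose(1, 2).expand(-1, self.num_heads, -1, -1)
        v = v.view(bsz, seq_len, 1, self.head_dim).transpose(1, 2).expand(-1, self.num_heads, -1, -1)
        # May dispatch to a fused FlashAttention kernel on supported hardware.
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out_proj(out.transpose(1, 2).reshape(bsz, seq_len, hidden))

For example, MultiQueryAttention(hidden_size=512, num_heads=8)(torch.randn(1, 16, 512)) returns a (1, 16, 512) tensor.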

Basic example:

from transformers import AutoTokenizer
import transformers
import torch

model = "tiiuae/falcon-40b-instruct"

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,  # bfloat16 halves memory vs. float32
    trust_remote_code=True,      # pulls in the custom Falcon code (modelling_RW.py)
    device_map="auto",           # spreads the model across available devices
)
sequences = pipeline(
    "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Girafatron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")
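
As far as I can tell, trust_remote_code=True already downloads and runs the checkpoint's custom model definition, so perhaps multiquery attention and the rest are active even in this basic example? Here is a sketch of how one might inspect the checkpoint's flags; I am assuming names like alibi and multi_query from the repo's custom config, hence the getattr fallbacks:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("tiiuae/falcon-40b-instruct", trust_remote_code=True)
# Flag names are assumptions about the repo's custom config class;
# getattr returns None if a flag does not exist in your version.
print(getattr(config, "alibi", None))
print(getattr(config, "multi_query", None))
print(getattr(config, "parallel_attn", None))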

Example with multiquery, FlashAttention and ALiBi:

????

Thanks.

Technology Innovation Institute org

All implementation details are in modelling_RW.py
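
For readers landing here: modelling_RW.py is the custom modelling file shipped with the checkpoint and pulled in by trust_remote_code=True. On the ALiBi point specifically, the general technique (Press et al., 2021) replaces positional embeddings with a per-head linear distance bias added to the attention scores. A minimal sketch of that general recipe, not the actual contents of modelling_RW.py:

import torch

def alibi_bias(num_heads, seq_len):
    # Per-head slopes: the geometric sequence from Press et al. (2021),
    # valid when num_heads is a power of two.
    slopes = torch.tensor([2.0 ** (-8.0 * (i + 1) / num_heads) for i in range(num_heads)])
    # distances[i, j] = j - i: zero on the diagonal, negative for past keys.
    positions = torch.arange(seq_len)
    distances = positions[None, :] - positions[:, None]
    # (num_heads, seq_len, seq_len) bias added to attention scores before
    # softmax; with a causal mask, farther-back keys get a more negative bias.
    return slopes[:, None, None] * distances[None, :, :]

# Typical use: scores = q @ k.transpose(-2, -1) * head_dim ** -0.5
#              scores = scores + alibi_bias(num_heads, seq_len)

Note that the model card excerpt above lists rotary positional embeddings, so ALiBi may well be disabled in this particular checkpoint's config.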

FalconLLM changed discussion status to closed
