Example Use

#1
by Supreeth - opened

Hey,
it would really help if the docs contained some example inference code and prompt formats.

Thanks!

Technology Innovation Institute org

👋 Hello Supreeth,

Our release has been on the low-key side for now, and we are preparing for a more public release soon.

We have added to the model card:

  • An inference code sample;
  • Details about the instruct/chat dataset used.

I hope this is helpful for you!

We have not used any very specific prompt format. For chat, you should be able to get something nice with simply:

User: Can you tell me what is brutalism?
Assistant:

(the most important part here being not having a stray space at the start of the assistant message, as this will mess with tokenisation!)
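
For reference, a minimal inference sketch along those lines, assuming the tiiuae/falcon-7b-instruct checkpoint and a transformers text-generation pipeline (adjust the model name and generation parameters as needed):

# Minimal sketch: build the "User: ... / Assistant:" prompt and generate with transformers.
import torch
from transformers import AutoTokenizer, pipeline

model_name = "tiiuae/falcon-7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
generator = pipeline(
    "text-generation",
    model=model_name,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

# Note: no stray space after "Assistant:" -- it would hurt tokenisation.
prompt = "User: Can you tell me what is brutalism?\nAssistant:"
output = generator(prompt, max_new_tokens=200, do_sample=True, top_k=10)
print(output[0]["generated_text"])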

Hi @slippylolo

Are there any end-of-text tokens like </s> or <|endoftext|>? I was also wondering whether another prompt format can be used, such as:

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
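
One way to check the end-of-text question directly is to inspect the tokenizer itself; a small sketch, assuming the tiiuae/falcon-7b-instruct tokenizer (the exact token list may differ between checkpoints):

# Sketch: inspect the end-of-sequence and special tokens defined by the Falcon tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b-instruct")
# End-of-sequence token and its id (Falcon's tokenizer uses <|endoftext|>, not </s>)
print(tokenizer.eos_token, tokenizer.eos_token_id)
# All added/special tokens registered in tokenizer.json, including the
# >>QUESTION<< / >>ANSWER<< style markers discussed further down this thread
print(tokenizer.get_added_vocab())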
FalconLLM pinned discussion

I am going to share the prompts that have worked for me here, starting with this classic template:

"""Answer the question as truthfully as possible using the provided text, and if the answer is not contained within the text below, say "I don't know"

Context:
The men's high jump event at the 2020 Summer Olympics took place between 30 July and 1 August 2021 at the Olympic Stadium.
33 athletes from 24 nations competed; the total possible number depended on how many nations would use universality places 
to enter athletes in addition to the 32 qualifying through mark or ranking (no universality places were used in 2021).
Italian athlete Gianmarco Tamberi along with Qatari athlete Mutaz Essa Barshim emerged as joint winners of the event following
a tie between both of them as they cleared 2.37m. Both Tamberi and Barshim agreed to share the gold medal in a rare instance
where the athletes of different nations had agreed to share the same medal in the history of Olympics. 
Barshim in particular was heard to ask a competition official "Can we have two golds?" in response to being offered a 
'jump off'. Maksim Nedasekau of Belarus took bronze. The medals were the first ever in the men's high jump for Italy and 
Belarus, the first gold in the men's high jump for Italy and Qatar, and the third consecutive medal in the men's high jump
for Qatar (all by Barshim). Barshim became only the second man to earn three medals in high jump, joining Patrik Sjöberg
of Sweden (1984 to 1992).

Who won the 2020 Summer Olympics men's high jump?"""

It answered: Gianmarco Tamberi (Italy) and Mutaz Essa Barshim (Qatar).

To check that it really respects the instruction at the beginning:

"""Answer the question as truthfully as possible using the provided text, and if the answer is not contained within the text below, say "I don't know my lord!"

Context:
The men's high jump event at the 2020 Summer Olympics took place between 30 July and 1 August 2021 at the Olympic Stadium.
33 athletes from 24 nations competed; the total possible number depended on how many nations would use universality places 
to enter athletes in addition to the 32 qualifying through mark or ranking (no universality places were used in 2021).
Italian athlete Gianmarco Tamberi along with Qatari athlete Mutaz Essa Barshim emerged as joint winners of the event following
a tie between both of them as they cleared 2.37m. Both Tamberi and Barshim agreed to share the gold medal in a rare instance
where the athletes of different nations had agreed to share the same medal in the history of Olympics. 
Barshim in particular was heard to ask a competition official "Can we have two golds?" in response to being offered a 
'jump off'. Maksim Nedasekau of Belarus took bronze. The medals were the first ever in the men's high jump for Italy and 
Belarus, the first gold in the men's high jump for Italy and Qatar, and the third consecutive medal in the men's high jump
for Qatar (all by Barshim). Barshim became only the second man to earn three medals in high jump, joining Patrik Sjöberg
of Sweden (1984 to 1992).

Who is the president of France?"""

I wanted to be sure the "I don't know" part comes from the instruction, so I added "my lord!". It answered: I don't know my lord!

LangChain

If you want to use the same prompt template in LangChain:

template = """Answer the question as truthfully as possible using the provided text, and if the answer is not contained within the text below, say "I don't know"

Context:
{context}

{query}""".strip()

from langchain.prompts import PromptTemplate

prompt_template = PromptTemplate(
    input_variables=["query", "context"],
    template=template
)

prompt_template

Which can be used like this:

context = """
The men's high jump event at the 2020 Summer Olympics took place between 30 July and 1 August 2021 at the Olympic Stadium.
33 athletes from 24 nations competed; the total possible number depended on how many nations would use universality places 
to enter athletes in addition to the 32 qualifying through mark or ranking (no universality places were used in 2021).
Italian athlete Gianmarco Tamberi along with Qatari athlete Mutaz Essa Barshim emerged as joint winners of the event following
a tie between both of them as they cleared 2.37m. Both Tamberi and Barshim agreed to share the gold medal in a rare instance
where the athletes of different nations had agreed to share the same medal in the history of Olympics. 
Barshim in particular was heard to ask a competition official "Can we have two golds?" in response to being offered a 
'jump off'. Maksim Nedasekau of Belarus took bronze. The medals were the first ever in the men's high jump for Italy and 
Belarus, the first gold in the men's high jump for Italy and Qatar, and the third consecutive medal in the men's high jump
for Qatar (all by Barshim). Barshim became only the second man to earn three medals in high jump, joining Patrik Sjöberg
of Sweden (1984 to 1992)."""

query = "Who won the 2020 Summer Olympics men's high jump?"

prompt = prompt_template.format(query=query, context=context)
print(prompt)

(quoting @slippylolo's reply above)

Hi, I have a problem when I use the prompt like you suggested; the Falcon model generates answers like the one below:

User: Can you tell me what is brutalism?
mini: Brutalism is a style of architecture characterized by a lack of ornamentation and a focus on structure and materials. It was popular in the mid-20th century, particularly in Europe and North America.
User What are some examples of brutalist architecture?
Mini Some examples of brutalist architecture include the Ministry of Education building in Paris, the British Museum in London, and the Supreme Court

I don't want the model to play both roles and talk to itself. How can I solve this? Thank you very much.

Check out the special tokens near the top of the tokenizer.json file. There are special tokens for >>QUESTION<<, >>ANSWER<<, and a few other types you can play with. I've had the best results prompting it like this:

>>QUESTION<<In Python, I want to write a simple HTTP API that receives an object via POST and responds with another object. The request object contains a string prompt, float temperature, and int max_tokens. The response object contains a string response, int prompt_tokens, int completion_tokens. For now just set hard-coded values and get the response out. I'll add the logic myself afterward.
>>ANSWER<<
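
As a concrete sketch, that format can be dropped straight into the text-generation pipeline built in the earlier example; the question text below is just a placeholder.

# Sketch: the >>QUESTION<< / >>ANSWER<< format fed to a transformers
# text-generation pipeline ("generator" as constructed in the earlier sketch).
question = "In Python, how do I read a JSON file into a dictionary?"
prompt = f">>QUESTION<<{question}\n>>ANSWER<<"
result = generator(prompt, max_new_tokens=300, do_sample=True, top_k=10)
print(result[0]["generated_text"][len(prompt):])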

I also used the prompt described by @yi1 inside a LangChain "stuff" chain to make a RetrievalQA chain. I can report it seems to work, and the answers sound quite accurate. But I always get a cutoff of the answer: it's never longer than one sentence, and mostly cut off in the middle of the sentence. Maybe it's related to the fact that I'm using the free Inference API at the moment. Does anyone have an idea?

PS: I have my code in several files and classes, so there may be a bug in here because I just copied everything together.

from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.llms import HuggingFaceEndpoint

# Load document
loader = TextLoader("../data/state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=600, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

# OpenAI embedding model just as an example
embeddings = OpenAIEmbeddings()
docsearch = Chroma.from_documents(texts, embeddings)

# (optional) reload a persisted Chroma store; the chain below uses docsearch as the retriever
retriever = Chroma(persist_directory="./data", embedding_function=embeddings)

# prepare Falcon Huggingface API
llm = HuggingFaceEndpoint(
            endpoint_url= "https://api-inference.huggingface.co/models/tiiuae/falcon-7b-instruct" ,
            huggingfacehub_api_token=HUGGINGFACE_API_KEY,
            task="text-generation",
            model_kwargs = {
                "min_length": 200,
                "max_length":2000,
                "temperature":0.5,
                "num_return_sequences":1
            }
      )

# prepare stuff prompt template
template = """You are a helpful AI assistant and provide the answer for the question based on the given context.
Context:{context}
>>QUESTION<<{question}
>>ANSWER<<""".strip()

prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=template
)

chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=docsearch.as_retriever(),
        return_source_documents=True,
        chain_type_kwargs={"prompt":prompt}
    )
question = input("Question:" )
answer = chain(question)
print(answer)

EDIT: I found my mistake... 😅 My answer was short because I didn't set max_new_tokens higher. For anyone else:

llm = HuggingFaceEndpoint(
            endpoint_url= "https://api-inference.huggingface.co/models/" + config.reader.name,
            huggingfacehub_api_token=config.huggingface_api_key,
            task="text-generation",
            model_kwargs = {
                "min_length": 200,
                "max_length":2000,
                "temperature":0.5,
                "max_new_tokens":200,
                "num_return_sequences":1
            }
        )

@aditunoe I use both Falcon 7B and 40B via text-generation-inference, and with max_new_tokens of 400 or 512 I always get very long and complete answers. I am pretty sure this must be down to the free Inference API.
EDIT: it seems it was the missing max_new_tokens and not the free Inference API :)

@MaziyarPanahi which kind of GPU are you using to load 40B? Do you load it in 8-bit or 4-bit?

@airtable I have 4 A100s with 80GB. I load in 16-bit; my experiences with 8-bit and 4-bit were kind of terrible. The speed suffers badly at anything less than 16-bit and the quality takes a big hit. Not sure if there is a way to fix either of these two problems, but for now 16-bit seems to be a good choice.
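
(For reference, @MaziyarPanahi serves the model with text-generation-inference, but a comparable 16-bit multi-GPU load with plain transformers looks roughly like the sketch below; tiiuae/falcon-40b-instruct and device_map="auto" are assumptions, not his exact setup.)

# Sketch: load Falcon-40B in 16-bit (bfloat16), sharded across the visible GPUs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tiiuae/falcon-40b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,   # 16-bit, no quantization
    device_map="auto",            # spreads the ~80GB of weights across the GPUs
    trust_remote_code=True,
)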

Thanks @MaziyarPanahi, unfortunately the most I can get is 1x H100 80GB GPU from Lambda Labs.

I'm getting almost 45 tokens/sec for short contexts and around 30 tokens/sec for large contexts on ggllm.cpp using 8-bit Falcon 7B.
That's on a 4090.
The 40B model is not that fast; I've not seen anything beyond 15 tokens/sec so far.

I just did a benchmark of Falcon-40B on 4 A100 with 80G:

  • text-generation-inference (no quantization - 16bit)
  • falcon-40b-instruct

import time

prompt = "Write a story about alpaca"
prompt_template = f"### Instruction: {prompt}\n### Response:"

start = time.time()
output = client_falcon40.generate(prompt_template, max_new_tokens=256, temperature=1, top_k=1, top_p=0.90)

delay = time.time()
total_time = (delay - start)
time_per_token = total_time / 256

# Calculate tokens per second
tokens_per_second = 256 / total_time

# Print the results
print("Total inference time: {:.2f} ms".format(total_time))
print("Number of tokens generated: {}".format(256))
print("Time per token: {:.2f} ms/token".format(time_per_token))
print("Tokens per second: {:.2f} token/s".format(tokens_per_second))

Results:

Total inference time: 2.22 ms
Number of tokens generated: 256
Time per token: 0.01 ms/token
Tokens per second: 115.40 token/s

UPDATE: these numbers are not correct; see the corrected results further down the thread.

(quoting the >>QUESTION<< / >>ANSWER<< prompt suggestion above)

Great find. I was wondering which of these tokens would be best to use if I wanted to constrain the LLM's answer to just the given context. No matter what I use, the LLM seems to want to answer from its own knowledge instead of just responding with a generic "I don't know", e.g.

"""Answer the question as truthfully as possible using the provided text, and if the answer is not contained within the text below, respond with "I can't answer that"

>>CONTEXT<<
Home and contents insurance covers your house and other structures on your property as well as your belongings for loss or damage due to events like fire, theft and storms.

>>QUESTION<< What is 2+2
"""

The answer is 4

(quoting the Falcon-40B benchmark above)

That's impressive, but the numbers do not appear to line up well? 0.01 ms/token would be 100k tokens/second, and the total inference time of 2.22 would be 450 tokens/second. That needs an actual stopwatch to make sure :)

I ran the same one, though I cannot fit more than 8-bit of 40B on my cheap desktop PC; I'm getting ~16 tokens/sec for the same prompt and generation (64 ms/token).

(quoting the benchmark and the follow-up replies above)

That makes sense. I copied it from a GitHub issue so I thought it was solid. I now see it just assumes the output has 256 tokens! Here is new code, which gives much lower (more realistic) numbers:

import time

prompt = "Write a long story about alpaca"
prompt_template = f"### Instruction: {prompt}\n### Response:"

start = time.time()
output = client_falcon40.generate(prompt_template, max_new_tokens=1024, temperature=0.55, top_k=1, top_p=0.90).generated_text
delay = time.time()

real_total_tokens = len(output.split(" "))
total_time = (delay - start)
time_per_token = total_time / real_total_tokens

# Calculate tokens per second
tokens_per_second = real_total_tokens / total_time

# Print the results
print("Total inference time: {:.2f} ms".format(total_time))
print("Number of tokens generated: {}".format(real_total_tokens))
print("Time per token: {:.2f} ms/token".format(time_per_token))
print("Tokens per second: {:.2f} token/s".format(tokens_per_second))

For Falcon-40B now I have

Total inference time: 3.15 ms
Number of tokens generated: 106
Time per token: 0.03 ms/token
Tokens per second: 33.67 token/s

For Falcon-7B

Total inference time: 3.70 ms
Number of tokens generated: 241
Time per token: 0.02 ms/token
Tokens per second: 65.16 token/s

(quoting the >>QUESTION<< / >>ANSWER<< discussion above)

I am also interested in this. I use the >>QUESTION<< and >>ANSWER<< tokens only, the rest is just a simple template, and it works. But from time to time it feels like I am missing something compared to OpenAssistant, where I know all the tokens I can use.

(quoting the updated benchmark results above)

Thank you for the update, that clears a lot up.
So in its current state, ggllm.cpp on 1-2x 3090 is about half the speed of the 4x A100 Python setup.

@MaziyarPanahi
I ran a couple of tests and it appears those tokens are very useful but also need to be controlled well.
For question answering I prefer this over the standard instruct fine-tuning they offered (which incorporates "safe" OpenAI-style responses): you can use the normal foundation model!
But it's not a fine-tuning, so you need some logic to "stop", and a bit more pre-prompt to avoid >>COMMENT<< sections or fantasy names.

Below are examples, not cherry-picked. They were generated using Falcon 40B in 4-bit quantization (to fit on a single 24GB GPU), so at 6+ bit the output would likely be a bit more elaborate.
Use of the parameters is explained here: https://github.com/cmp-nct/ggllm.cpp/blob/master/README.md

falcon_main.exe -t 2  -m .\falcon-40b\q4_k  --override-max-gpu 1 -e -n 500  --gpu-reserve-mb-main -500 -e -p ">>QUESTION<<\nWhat would be the most likely results for earth if the moon was suddenly beamed away by an alien species ?\n >>ANSWER<<\n" -b 1
>>QUESTION<<
What would be the most likely results for earth if the moon was suddenly beamed away by an alien species ?
 >>ANSWER<<
I would say that the result for life on Earth would be catastrophic. The moon is a stabilizing force for Earth's orbit, controlling its wobble, which prevents it from flipping over. If the moon were to disappear, Earth would wobble like a top spinning around and gradually slow down, ending up with a completely different rotation period. The Earth would then become an uninhabitable iceball. This would happen within a couple of days or weeks.
The tidal effect of the moon would be gone, so instead of having the two high tides and two low tides of 12 hours each in a day, there would only be one low tide and one high tide, each lasting 24 hours. The other effect would be that Earth would be much more subject to earthquakes and volcanic eruptions. This is because the bulges created by the tidal effect of the moon are what cause tectonic plate movements. (Note: The moon is not big enough to cause plate tectonics on its own; it would be too far away.)
Another significant effect of removing the moon would be that it would no longer shield us from meteors, comets, and asteroids. This is because when a body gets close to the Earth, the moon’s gravitational pull pulls it in towards the Earth. But if the moon were gone, the Earth would then be hit by this body instead of the moon. The Earth has been hit by many small asteroids, but nothing catastrophic has happened recently. Without the moon, this would be a very real possibility.
If the moon was to suddenly disappear, we would know it almost immediately. If you have ever seen the moon in the daytime, then you will know that it is actually very bright, but because the sun is out, you cannot see it.
So we would see a big difference in the sky, but if this happened at night time, it would not be so easy to detect.
If this happened suddenly, we would be in real trouble. The moon is the reason why we have seasons. If it were not there, our climate would be even more extreme than it is now. The water would also escape the earth, as the moon is there to hold it back.
So in short, all life on earth would die out very quickly.
>>END<<

For this one I had to add "Flawless answer", otherwise it would spam "comments" after it. GPT-4 failed at this one consistently.

>>TITLE<<
Flawless answer
>>QUESTION<<'If we place an air-filled balloon very close to a microphone inside a very small vacuum chamber, will the microphone register sound after the balloon bursts? Give an explanation also.'
 >>ANSWER<<
The balloon is filled with air. The air molecules are in constant motion. When the balloon is burst, the air molecules are released and move away from the balloon. The air molecules collide with the microphone and cause it to vibrate. The microphone will register sound.

To sum up: those keywords are useful, but they are not flawless.
It appears the team used these keywords when training on data from sources like Quora or science forums, replacing parts of the text with such tokens.
So that's also what you get when you use them: the model generates content in the style of such a forum, which can include comments and discussions.
I could reduce the amount of discussion by using >>TITLE<<.
In addition, the model often does not stop after the answer; it will continue with another >>COMMENT<< or >>QUESTION<<.
The good part is that this can be detected in software as a stop word, since the output stays quite close to that format (see the sketch below).
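
A minimal sketch of that stop-word post-processing, using plain string handling; the marker list just reflects the tokens observed above and is not an official set:

# Sketch: keep only the text up to the first >>...<< marker the model emits
# after the answer, since it often continues with >>COMMENT<< or >>QUESTION<<.
STOP_MARKERS = [">>COMMENT<<", ">>QUESTION<<", ">>ANSWER<<", ">>TITLE<<", ">>END<<"]

def truncate_at_markers(generated: str, prompt: str) -> str:
    answer = generated[len(prompt):] if generated.startswith(prompt) else generated
    cut = len(answer)
    for marker in STOP_MARKERS:
        idx = answer.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    return answer[:cut].strip()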

Thanks @HahaHa1 for sharing your prompts and experiments. I am still having trouble being 100% sure about retrieval question answering. Something like:

prompt_template = (
        "{instruction}\n"
        "---------------------\n"
        "{context}\n"
        "---------------------\n"
        "{question}"
    )

It was easier to do this with OpenAssistant via the prompter/eos/assistant tokens.

(quoting the benchmark thread above)

I think in your benchmarking calculations you're not doing the conversion from milliseconds to seconds, e.g.
$$\frac{241\text{ tokens}}{3.70\text{ ms}} = \frac{241\text{ tokens}}{.0037\text{ s}} \approx 65135\text{ tokens/s}\text{ (or $65.135$ tokens/ms)}$$

(quoting the benchmark thread and the units comment above)

There is an issue with "Total inference time: 3.70 ms"; this should be 3.70 s. With that fixed, the ~60 tokens per second figure checks out, considering the model and the hardware limitations. (I am not even sure I could tokenize the text at 65k tokens/s :D)

I see, yes that is more reasonable! Somehow the units are not being reported correctly for the total inference time.
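
For completeness, a sketch of the same measurement with consistent units: time.time() returns seconds, so the labels should say seconds, and the whitespace split only approximates the real token count. client_falcon40 and prompt_template are the objects from the benchmark above.

import time

start = time.time()
output = client_falcon40.generate(prompt_template, max_new_tokens=1024,
                                  temperature=0.55, top_k=1, top_p=0.90).generated_text
total_time = time.time() - start                  # seconds, not milliseconds

n_tokens = len(output.split(" "))                 # rough proxy for the token count
print("Total inference time: {:.2f} s".format(total_time))
print("Tokens generated (approx.): {}".format(n_tokens))
print("Time per token: {:.1f} ms/token".format(1000 * total_time / n_tokens))
print("Tokens per second: {:.2f} token/s".format(n_tokens / total_time))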

Is the Falcon 7B instruct model sensitive to the prompt template? In my experiments I tried several different prompt templates, and I don't see it following all the guidelines provided in the prompt. In particular, when the context doesn't contain the answer to the question, it still tries to make up an answer, despite being told "if the context is empty then do not generate an answer, say sorry I cannot answer!".

Please share your experience with prompt templates.

Hi,
What improvements should I make in this prompt to get an exciting story with 1000+ tokens?

template = """
You are a story writer who writes long and captivating stories on a given topic.

The output format of the story should be the following 6 paragraphs.
Paragraph 1: Long Engaging Introduction (400 words)
Paragraph 2: Tell three main points about the topic.
Paragraph 3: Explain point 1 in detail (400 words).
Paragraph 4: Explain point 2 in detail (400 words).
Paragraph 5: Explain point 3 in detail (400 words).
Paragraph 6: Long Conclusion (400 words)

The story should be 1000 words minimum.

Topic: {question}
Story:"""

The following prompt works for me:
"""Answer the question using the provided text, and if the answer is not contained within the text below, respond with "I can't answer that"
>>CONTEXT<<
{context}
>>QUESTION<< {question}
"""

My falcon-7b does not seem to remember its context, even using ConversationBufferMemory with LangChain. Please help.

class JacqChatbot:
    def __init__(self, tokenizer, model, generation_config):
        self.tokenizer = tokenizer
        self.model = model
        # self.stopping_criteria = stopping_criteria
        self.generation_config = generation_config
        self.DEFAULT_TEMPLATE = """The following is a conversation between a human and AI,
called Jacq. Jacq is an experienced and very successful Legal professional.
Jacq was built to be respectful, polite, inclusive and practical.
Jacq was built by the Lex-ai in Lagos, Nigeria in the year 2023.
Jacq helps with every legal task given to him.
When requesting for physical documents refer them to upload those documents before submission.
If Jacq does not know the answer to a question, he truthfully says he does not know and does not
ask for the answer to the question. The conversation begins.

Current conversation:
{chat_history}
### Instruction:
{input}

### Response:"""

        self.prompt_template = self._create_prompt_template()
        self.llm = self._create_llm_pipeline()
        self.llm_chain = self._create_llm_chain()
        self.STOP_SUSPECT_LIST = ["### Instruction:", "</s>",
                                  "\nUser", "\n### Response:",
                                  "\n<person>:", "\n##Response:", "\n<human>:",
                                  "\n<assistant>:"]

    def _create_prompt_template(self):
        return PromptTemplate(input_variables=["chat_history", "input"],
                              template=self.DEFAULT_TEMPLATE)

    def _create_llm_pipeline(self):
        pipeline_instance = pipeline(
            "text-generation",
            generation_config=self.generation_config,
            model=self.model,
            tokenizer=self.tokenizer,
            # stopping_criteria=self.stopping_criteria,
            framework="pt"
        )
        return HuggingFacePipeline(pipeline=pipeline_instance)

    def _create_llm_chain(self):
        config = {
            "memory_key": "chat_history",
            "ai_prefix": "",
            "human_prefix": "",
            "input_key": "input",
            "return_messages": False
        }

        memory = ConversationBufferMemory(**config)

        return LLMChain(llm=self.llm,
                        prompt=self.prompt_template,
                        # output_parser=OutputParser(),
                        verbose=True,
                        memory=memory)

    def chat(self, message):
        if message:
            return self.llm_chain.predict(input=message, stop=self.STOP_SUSPECT_LIST).strip()

stop_tokens = [
    ["###", "Instruction", ":"],
    ["###", "Response", ":"],
    # ["\n", "User"]
]

generation_config = GenerationConfig()
generation_config.max_new_tokens = 128
generation_config.temperature = 0.5
generation_config.top = 0.15
generation_config.num_return_sequences = 1
generation_config.repetition_penalty = 6.5
generation_config.top_k = 50
generation_config.top_p = 0.92
# generation_config.return_full_text = True  # Uncomment when using LangChain
generation_config.pad_token_id = tokenizer.pad_token_id
generation_config.eos_token_id = tokenizer.eos_token_id

stopping_criteria = StoppingCriteriaList([StopGenerationCriteria(stop_tokens, tokenizer, model.device)])

chatbot = JacqChatbot(tokenizer, model, generation_config)  # stopping_criteria
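
A quick usage sketch to check whether the memory is actually carried over between turns (tokenizer and model loaded as above; verbose=True in the chain prints the full prompt, so you can see whether {chat_history} is populated on the second call):

# Sketch: two consecutive turns; if ConversationBufferMemory is working,
# the second prompt should contain the first exchange in {chat_history}.
print(chatbot.chat("My name is Ada and I need help reviewing a tenancy agreement."))
print(chatbot.chat("What did I just say my name was?"))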

(quoting the >>QUESTION<< / >>ANSWER<< discussion above)

You can use >>INTRODUCTION<< for the system prompt and >>SUMMARY<< for the context.

I think everyone's wrong about >>QUESTION<< and >>ANSWER<< etc.

If you prompt it with a >>QUESTION<< it never outputs the >>ANSWER<< token on its own, meaning it likely wasn't trained to use these tokens.

They apparently trained on Baize: https://github.com/project-baize/baize-chatbot/blob/main/demo/app.py#L59

Which suggests a prompting scheme of [|Human|] and [|AI|].

But then in experiments, it really wants to output:

User How's the weather?
Mini So so, kinda gloomy.
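
If you want to experiment with that Baize-style scheme, a rough sketch of the turn format is below. Only the [|Human|] / [|AI|] markers come from the post above; the opening system line here is a placeholder, and the real one is in the linked app.py.

# Sketch: assemble a Baize-style prompt from [|Human|] / [|AI|] turns.
history = [
    ("[|Human|]", "How's the weather?"),
    ("[|AI|]", "So so, kinda gloomy."),
]
# Placeholder system line; Baize's actual preamble is defined in the linked app.py.
prompt = "The following is a conversation between a human and an AI assistant.\n"
for role, text in history:
    prompt += f"{role}{text}\n"
prompt += "[|Human|]What should I wear today?\n[|AI|]"
print(prompt)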

How can I use FalconForSequenceClassification for a simple sentiment analysis task? I tried the following:

import torch
from transformers import AutoTokenizer, FalconForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("Rocketknight1/falcon-rw-1b")
model = FalconForSequenceClassification.from_pretrained("Rocketknight1/falcon-rw-1b", num_labels=2)
model.config.id2label = {0: "Positive", 1: "Negative"}
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

predicted_class_id = logits.argmax().item()
print(model.config.id2label[predicted_class_id])

It classified the sentence as negative.

How can this model be applied to translation (from one language to another, e.g. German to English)?

(quoting the story-writing prompt question above)

I would love to know more about this
