Inference API

#3
by Shivkumar27

Hi @Omnibus ,
I want to use the Inference API for the gemma-7b model. Could you please share some documentation I can use as a reference for how to pass my context and question in the request body, and what standard parameters should be passed to get the desired output?

I am using the Omnibus space and gave it a large prompt and system prompt, but it shows this error:
"Wait, that's too many tokens, please reduce the 'Chat Memory' value, or reduce the 'Max new tokens' value"
How can I avoid this? I checked the files and found the condition len(prompt + system_prompt) > 8000, which returns that text.
Please provide a solution to my query.
It would be really helpful.

Thanks & Regards.
Shiv Kumar

Documentation for model: https://huggingface.co/google/gemma-7b
This demo uses the huggingface_hub InferenceClient; this is not a documented way of deploying these models, so they may not be optimized for it.
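
That said, if you want to call the Inference API from your own code, a minimal sketch with the huggingface_hub InferenceClient looks roughly like this (the token placeholder, prompt text, and parameter values are only illustrative, not required settings):

```python
from huggingface_hub import InferenceClient

# Point the client at the model on the serverless Inference API.
client = InferenceClient("google/gemma-7b", token="hf_...")  # your HF access token

# Put your context and question into the user turn of Gemma's chat format
# (see the prompt-format notes further down).
prompt = (
    "<start_of_turn>user\n"
    "Context: ...your context here...\n"
    "Question: ...your question here...<end_of_turn>\n"
    "<start_of_turn>model\n"
)

# Common text-generation parameters; tune these for your use case.
answer = client.text_generation(
    prompt,
    max_new_tokens=256,      # how many tokens the model may generate
    temperature=0.7,         # sampling randomness
    top_p=0.95,              # nucleus sampling
    repetition_penalty=1.1,  # discourage repeated phrases
)
print(answer)
```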

This model has a maximum limit of 8192 tokens, covering the input + output combined.
This demo retains a number of previous [(input, output), (input, output)] conversations from the Chatbot window (Chat Memory).
These previous conversations + the new input + new output must come to less than 8000 tokens, or the error you mentioned will be raised.
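
The condition you found is essentially that guard. Paraphrased (this is not the space's exact code), it amounts to:

```python
# Rough paraphrase of the space's length guard, not its exact code.
# The retained conversation turns + system prompt + new input must fit
# under the budget so that room is left for the model's output.
def check_budget(prompt: str, system_prompt: str) -> None:
    if len(prompt + system_prompt) > 8000:
        raise ValueError(
            "Wait, that's too many tokens, please reduce the 'Chat Memory' "
            "value, or reduce the 'Max new tokens' value"
        )
```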

The default Chat Memory is 4 conversations, but it can be reduced to free up tokens for the new (input, output) pair, at the expense of losing some of the conversation's context.
Reducing the 'Max new tokens' value tells the model to return fewer tokens in its output, which also helps stay within the 8192-token total.
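
If you call the model yourself instead of through the demo, you can apply the same two levers directly; here is a small sketch (the helper name and sample history are made up for illustration):

```python
# Lever 1: keep only the last few (input, output) pairs, trading some
# conversational context for a smaller prompt. (Hypothetical helper.)
def trim_memory(history, chat_memory=2):
    return history[-chat_memory:]

history = [("hi", "Hello!"), ("tell me a joke", "..."), ("explain APIs", "...")]
recent = trim_memory(history, chat_memory=2)  # drops the oldest pair

# Lever 2: when sending the request, ask for fewer new tokens so that the
# history + system prompt + new input + output stay inside 8192 tokens,
# e.g. client.text_generation(prompt, max_new_tokens=128).
```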

To modify the parameters that are sent with the prompt + system prompt through this demo's UI, there is an accordion labelled "Modify Prompt Format".
The default prompt format is: "<start_of_turn>userUSER_INPUT<end_of_turn><start_of_turn>model", where USER_INPUT will be replaced by the values in the "Prompt" + "System Prompt" input boxes.
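
In code terms, the substitution the demo performs looks roughly like this (the variable names are mine, not the space's):

```python
# Hedged sketch of how the demo fills its prompt format (variable names
# are illustrative). "USER_INPUT" in the format string is replaced by the
# contents of the "System Prompt" and "Prompt" input boxes.
prompt_format = "<start_of_turn>userUSER_INPUT<end_of_turn><start_of_turn>model"
system_prompt = "You are a concise assistant."
user_prompt = "Summarise the following context: ..."

final_prompt = prompt_format.replace("USER_INPUT", system_prompt + "\n" + user_prompt)
# final_prompt is what gets sent to google/gemma-7b through the InferenceClient.
print(final_prompt)
```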
