How can I run this?

#2
by Matan1905 - opened

Hey, I'm new to Hugging Face. I tried to create a GPU Gradio Space using the deploy button, but it doesn't work like it does in this main repo.
What am I missing?

OpenAssistant org

Just use the Hugging Face text-generation client:

# pip install text-generation
from text_generation import InferenceAPIClient

client = InferenceAPIClient("OpenAssistant/oasst-sft-1-pythia-12b")

# Single-shot generation
text = client.generate("<|prompter|>Why is the sky blue?<|endoftext|><|assistant|>").generated_text
print(text)

# Token streaming
text = ""
for response in client.generate_stream("<|prompter|>Why is the sky blue?<|endoftext|><|assistant|>"):
    if not response.token.special:
        print(response.token.text)
        text += response.token.text
print(text)
OpenAssistant org

It seems we have a small issue with the model. We are stopping the api-inference support for now while we figure out what's wrong.

OpenAssistant org

Any estimate of when it will come back?

OpenAssistant org

The issue is quite deep in Transformers. You can track its evolution here: https://github.com/huggingface/transformers/issues/22161
Until we find a solution, the Inference API will unfortunately stay off for this model.

OpenAssistant org

The issue should be fixed and the inference API is back online.
If you see any weird outputs where the model seems to always repeat the same token, please inform us here.

@olivierdehaene Is there a way to generate multiple sequences with a client.generate() call? (such as setting the num_return_sequences parameter to a number greater than 1 in a model.generate() call)

OpenAssistant org

You can use the `best_of` parameter.
Or simply do multiple calls.
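
For example (a minimal sketch, not from the thread itself), looping over several calls collects multiple completions:

from text_generation import InferenceAPIClient

client = InferenceAPIClient("OpenAssistant/oasst-sft-1-pythia-12b")
prompt = "<|prompter|>Why is the sky blue?<|endoftext|><|assistant|>"

# do_sample=True so each call can return a different sequence
completions = [
    client.generate(prompt, do_sample=True, temperature=0.8).generated_text
    for _ in range(3)
]
for text in completions:
    print(text)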

@olivierdehaene thanks for your reply. I can increase the `best_of` parameter up to 2. Another question is, how can I increase the max length of the generated text? It seems like the client.generate() method doesn't have a "max_length" parameter.

OpenAssistant org
from text_generation import Client

Client.generate?  # IPython/Jupyter help syntax

Gives you:

Signature:
Client.generate(
    self,
    prompt: str,
    do_sample: bool = False,
    max_new_tokens: int = 20,
    best_of: Optional[int] = None,
    repetition_penalty: Optional[float] = None,
    return_full_text: bool = False,
    seed: Optional[int] = None,
    stop_sequences: Optional[List[str]] = None,
    temperature: Optional[float] = None,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    truncate: Optional[int] = None,
    typical_p: Optional[float] = None,
    watermark: bool = False,
) -> text_generation.types.Response
Docstring:
Given a prompt, generate the following text

Args:
    prompt (`str`):
        Input text
    do_sample (`bool`):
        Activate logits sampling
    max_new_tokens (`int`):
        Maximum number of generated tokens
    best_of (`int`):
Generate best_of sequences and return the one with the highest token logprobs
    repetition_penalty (`float`):
        The parameter for repetition penalty. 1.0 means no penalty. See [this
        paper](https://arxiv.org/pdf/1909.05858.pdf) for more details.
    return_full_text (`bool`):
        Whether to prepend the prompt to the generated text
    seed (`int`):
        Random sampling seed
    stop_sequences (`List[str]`):
        Stop generating tokens if a member of `stop_sequences` is generated
    temperature (`float`):
The value used to modulate the logits distribution.
    top_k (`int`):
        The number of highest probability vocabulary tokens to keep for top-k-filtering.
    top_p (`float`):
        If set to < 1, only the smallest set of most probable tokens with probabilities that add up to `top_p` or
        higher are kept for generation.
    truncate (`int`):
        Truncate inputs tokens to the given size
    typical_p (`float`):
        Typical Decoding mass
        See [Typical Decoding for Natural Language Generation](https://arxiv.org/abs/2202.00666) for more information
    watermark (`bool`):
        Watermarking with [A Watermark for Large Language Models](https://arxiv.org/abs/2301.10226)

Returns:
    Response: generated response

You need to use the `max_new_tokens` parameter.
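
A minimal example (assuming the same prompt format as above), bumping the output length:

from text_generation import InferenceAPIClient

client = InferenceAPIClient("OpenAssistant/oasst-sft-1-pythia-12b")

# max_new_tokens controls how many tokens may be generated (the default is 20)
response = client.generate(
    "<|prompter|>Why is the sky blue?<|endoftext|><|assistant|>",
    max_new_tokens=256,
)
print(response.generated_text)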

What's the best practice for including external context/input in this prompt format? How do I convert something like this:
Answer the question based on the context below. If the
question cannot be answered using the information provided answer
with "I don't know".

Context: Large Language Models (LLMs) are the latest models used in NLP.
Their superior performance over smaller models has made them incredibly
useful for developers building NLP enabled applications. These models
can be accessed via Hugging Face's `transformers` library, via OpenAI
using the `openai` library, and via Spark NLP using the `spark-nlp` library.

Question: Which libraries and model providers offer LLMs?

Answer:
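
One option (a sketch based on the prompt format used earlier in this thread, not an official answer) is to put the instruction, context, and question inside a single <|prompter|> turn:

from text_generation import InferenceAPIClient

client = InferenceAPIClient("OpenAssistant/oasst-sft-1-pythia-12b")

# Everything the model should condition on goes into the prompter turn;
# the trailing assistant token asks the model to produce the answer.
context_prompt = (
    "Answer the question based on the context below. If the question cannot be "
    "answered using the information provided answer with \"I don't know\".\n\n"
    "Context: Large Language Models (LLMs) are the latest models used in NLP. ...\n\n"  # context shortened here
    "Question: Which libraries and model providers offer LLMs?\n\n"
    "Answer:"
)
prompt = f"<|prompter|>{context_prompt}<|endoftext|><|assistant|>"

print(client.generate(prompt, max_new_tokens=128).generated_text)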
