Help with inference.

#1
by Jenish-23 - opened

Really like the model; it gives relatively coherent responses for its size.
I was trying it on a question-answering task over a given text, but it spits out random sentences unrelated to that text.
Please give some tips to make it answer from the given text.
Also, what generation parameter values should I use so it always gives consistent responses?

I've used the given prompt template and the recommended inference parameters.

Hi @Jenish-23! Thank you for trying out this model.

Before being able to help, I need to confirm a few things with you:

  • Which framework are you using to run the inference? Is it the Hugging Face Python library, or Llama.cpp, MLX, or some other lib that uses a converted version of this model? I ask because some frameworks don’t support Contrastive Search for inference; in that case, a combination of temperature, top-p, and top-k will be necessary, and that won’t produce the same results as the inference widget here on HF’s website (see the sketch at the end of this reply).
  • Can you provide some examples of the questions you’re inputting, so I can try to recommend some parameters?
  • If you want to ask a question about some context, you can use a prompt like this (the “system” prompt is optional, so I’m not using it in this example):
<|im_start|>user
Context:
- Most bug sprays you apply to your body are insect repellents, not insecticides. So, if you’re wondering, how does bug spray kill bugs? It doesn’t—it just keeps them away from you. Most bug sprays accomplish this goal primarily by disguising your scent.
- In the great outdoors, it would hardly be practical (or desirable) to kill all the bugs around us. Instead, we just want them to leave us alone. Bug repellent sprays work by making us invisible or unattractive to insects such as mosquitoes, ticks, biting flies and chiggers, encasing us in our own temporary protective barrier.
- Bug spray is a quick and easy option to solve pest problems. But it will only work for so long. Safeguard guarantees natural and organic solutions to keep those pests away for good!
- Having trouble catching the bugs in action? Spraying insecticides on your plants or trees where the bugs frequent is a sure way to get rid of them. Ingested insecticides are substances that, when consumed by bugs, attack the insect’s gut and kill them.
- Bug spray has two forms: insect repellent or insecticides. Insecticides are not safe to apply to the body and kill insects. Many are harmful to humans and animals, so they should be used carefully. Insecticides can repel bugs as well by having strong scents.

Question:
How does bug spray kill bugs?<|im_end|>
<|im_start|>assistant

To which the assistant should give a response along these lines:

Bug spray kills bugs by disguising their scent. This is done by disguising the scent of the bug, which is usually a mix of citrus and lemon scents. The bug spray kills the insect and leaves the body in a state of perpetual fear.

Insect repellent sprays work by making the insect's scent visible to predators. They do this by scenting the area around the insect and then spreading the odor to other areas of the body. This creates a sense of safety and protects the insect from harm.

Insect repellent sprays are not safe to apply to the body and kill insects.<|im_end|>
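
For reference, here's a minimal sketch of how that context-plus-question prompt could be run with Contrastive Search in transformers. It assumes the tokenizer ships the ChatML chat template shown above; the context and question strings are placeholders to fill in.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Felladrin/Llama-160M-Chat-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# apply_chat_template builds the <|im_start|>...<|im_end|> prompt and
# appends the assistant header so the model starts answering right away.
messages = [{"role": "user", "content": "Context:\n- ...\n\nQuestion:\nHow does bug spray kill bugs?"}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")

# Contrastive Search: penalty_alpha + top_k, with sampling left off.
outputs = model.generate(input_ids, max_new_tokens=100, penalty_alpha=0.5, top_k=4)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))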

Thank you for the detailed response. To answer your questions:

  • Which framework are you using to run the inference? Hugging Face's transformers.
  • Some examples of the questions you're inputting: I am using a few sentences from Wikipedia articles as context and simple retrieval questions, nothing creative or generative.

It never keeps its focus on the context; it always adds extra stuff or gives completely irrelevant answers.

I've tried a few variations of generation parameters, but only this one worked for general text generation tasks:

  • Recommended parameters from the model card.
  • do_sample=True
  • temperature=0.3
  • repetition_penalty=1.1 or 1.15

Also, a few things I noticed:

  • Loading the model in float32 gives better responses than float16, 8-bit, and 4-bit: coherent, long, and with little to no repetition.
  • Only one system prompt gives relatively coherent responses: "You are a helpful assistant, who answers with empathy". Responses with the other system prompts given in the HF Inference API either don't work or maybe need a different set of parameters to go with them.

Ah, good point. The model is indeed loaded in float32 on HF's inference endpoint/widget. Loading it with lower precision could affect it because, with so few parameters, every detail counts.
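
If you want to be explicit about the precision, here's a minimal sketch (torch_dtype is the only thing that changes versus the default):

import torch
from transformers import AutoModelForCausalLM

# Force full precision; with so few parameters, quantization losses add up.
model = AutoModelForCausalLM.from_pretrained(
    "Felladrin/Llama-160M-Chat-v1",
    torch_dtype=torch.float32,
)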

But now that I know you're using HF's transformers lib, we can narrow it down.

Another question would be: are you using tokenizer.encode() or passing the input directly through pipeline()? I remember tokenizer.encode() adds special tokens by default, and maybe that could be affecting the results.
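
A quick way to check what actually reaches the model is to print the token IDs both ways (a sketch; the prompt string is just a placeholder):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Felladrin/Llama-160M-Chat-v1")
prompt = "<|im_start|>user\nGot a question for you!<|im_end|>\n"

# By default, a BOS token is prepended; with add_special_tokens=False it isn't.
print(tokenizer(prompt).input_ids)
print(tokenizer(prompt, add_special_tokens=False).input_ids)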

Let me try getting some info from Wikipedia, run pipeline() on it, and then check the response. I'll report back here soon.

Only one system prompt gives relatively coherent responses: "You are a helpful assistant, who answers with empathy". Responses with the other system prompts given in the HF Inference API either don't work or maybe need a different set of parameters to go with them.

That's curious! Because none of the training data has this system prompt :D I used it in the inference example precisely to confirm it was generalizing.

# My inference code

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.set_default_device("cuda")

model_id = "Felladrin/Llama-160M-Chat-v1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt_template = """<|im_start|>system
You are a helpful assistant, who answers with empathy.<|im_end|>
<|im_start|>user
Got a question for you!<|im_end|>
<|im_start|>assistant
Sure! What's it?<|im_end|>
<|im_start|>user
Why do you love cats so much!? <|im_end|>
<|im_start|>assistant
"""

inputs = tokenizer(prompt_template, return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.3,
    penalty_alpha=0.5,
    top_k=5,
    repetition_penalty=1.1,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

I've tried a few variations of generation parameters, but only this one worked for general text generation tasks:

Recommended parameters from the model card.
do_sample=True
temperature=0.3
repetition_penalty=1.1 or 1.15

Ah, if I'm not mistaken, the moment you enable do_sample, Contrastive Search gets disabled, so that could be the reason you're not getting good responses.

Let's use this as the starting point:

from transformers import pipeline

prompt = """<|im_start|>system
You are a helpful assistant, who answers with empathy.<|im_end|>
<|im_start|>user
Got a question for you!<|im_end|>
<|im_start|>assistant
Sure! What's it?<|im_end|>
<|im_start|>user
Why do you love cats so much!? <|im_end|>
<|im_start|>assistant"""

generate = pipeline("text-generation", "Felladrin/Llama-160M-Chat-v1")

# Contrastive Search: penalty_alpha + top_k, with sampling left disabled.
output = generate(
    prompt,
    max_new_tokens=100,
    penalty_alpha=0.5,
    top_k=4,
    repetition_penalty=1.01,
)

print(output[0]["generated_text"])

It should run on the CPU and take a few seconds to reply with the following:

Cats are incredibly intelligent creatures, and they're known for their ability to learn and adapt to new situations. They're also known for their loyalty, which is a trait that comes with a lot of responsibility. However, there are some people who don't like cats as much as they should, and they may be able to help you understand why.<|im_end|>

From here, we can add CUDA and check whether the response stays coherent. Then we can start tweaking the prompt.
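
Moving it to the GPU should only require the device argument on the pipeline (a sketch; device index 0 is an assumption about your setup):

from transformers import pipeline

# device=0 places the model on the first CUDA GPU; device=-1 keeps it on the CPU.
generate = pipeline("text-generation", "Felladrin/Llama-160M-Chat-v1", device=0)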

Extra info:

If you still need to use the tokenizer to encode the inputs, you can set add_special_tokens=False so it won't automatically add the BOS token at the beginning; that will lead to the same response as when using pipeline():

inputs = tokenizer(prompt_template, return_tensors="pt", add_special_tokens=False)
Jenish-23 changed discussion status to closed
