Model returns entire input prompt together with output

#43 by andee96

Hey everyone,
Apologies if this is a silly question; I'm a bit new to this. I've started playing around with falcon-40b-instruct and noticed that, regardless of the prompt I give it, it always returns the entire prompt along with the output.

Example:
Prompt: "User: Hello, how are you?\n Assistant:"
Generated text: "User: Hello, how are you?\n Assistant: I'm fine, how can I help you?"

This makes it pretty difficult to chain prompts together using LangChain. Is this how the model is supposed to behave? If not, what am I doing wrong? If so, what is the most appropriate way to handle it?
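For now, the only workaround I can think of is slicing the prompt back off myself, something like the sketch below (just a stopgap, and it assumes the output always starts with the exact prompt string):

def strip_prompt(prompt: str, generated_text: str) -> str:
    # Stopgap: if the model echoes the prompt verbatim, drop that prefix
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    return generated_text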

I've deployed falcon-40b-instruct on SageMaker using the template provided by Hugging Face.

Thank you in advance :)

model.generate(
    text=["def fibonnaci(", "User: How are you doing? Bot:"],
    max_length=64,
    include_prompt_in_result=False
)
Add "include_prompt_in_result=False" in model.generate(

lol, that does make me feel pretty silly; I will give that a try.
Do you know where I am supposed to pass this parameter when the model is deployed via AWS SageMaker?

I tried what you suggested in the following way:

import json
import time

from sagemaker.huggingface import HuggingFaceModel

# role and image_uri are set up as in the Hugging Face deployment template
# (the SageMaker execution role and the Hugging Face LLM/TGI container image)

instance_type = "ml.g4dn.12xlarge"
number_of_gpu = 4
health_check_timeout = 300

model_name = "falcon-40b-instruct" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
print(model_name)


# TGI config
config = {
  'HF_MODEL_ID': "tiiuae/falcon-40b-instruct", # model_id from hf.co/models
  'SM_NUM_GPUS': json.dumps(number_of_gpu), # Number of GPU used per replica
  'HF_MODEL_QUANTIZE': "bitsandbytes", # quantize with bitsandbytes
  'HF_TASK': 'text-generation'
}

model = HuggingFaceModel(
    name=model_name,
    role=role,
    image_uri=image_uri,
    env=config,
)
predictor = model.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  endpoint_name=model_name,
  container_startup_health_check_timeout=health_check_timeout,  # give the 40B weights time to load
)

input_data = {
  "inputs": "User: Hello, how are you?\n Assistant:",
  "parameters": {
    "do_sample": True,
    "top_k": 1,
    "max_length": 100,
    "include_prompt_in_result": False
  }
}
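
I then send that payload to the endpoint with the standard predictor interface, roughly:

# Invoke the deployed endpoint with the payload above
response = predictor.predict(input_data)
print(response)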

And unfortunately I still get the following response:
[{'generated_text': "User: Hello, how are you?\n Assistant: I'm fine, how can I help you?"}]

Use "return_full_text": False in the parameters to resolve this issue. Thank me later :)

That's amazing, thank you! I can confirm that this worked! Follow-up question: I am still getting the prompt echoed back when I try to use LangChain with the deployed endpoint. I would have thought that passing "return_full_text": False via the model_kwargs parameter would do the trick, but that does not seem to be the case.

from langchain import SagemakerEndpoint
llm = SagemakerEndpoint(
        endpoint_name=predictor.endpoint_name, 
        credentials_profile_name="dev", 
        region_name="eu-west-2", 
        model_kwargs={"temperature":0.7, "max_length": 1024, "return_full_text": False},
        content_handler=content_handler
)

However, if I use this LLM in any chain, the initial prompt gets returned again... Any clue what I am doing wrong here? :) @vaidyank

@andee96 I have falcon-40b deployed on SageMaker and I use

"return_full_text": false

to stop the behavior you're describing. The inference container appears to be written in Rust, and when the request is JSON-serialized it may not accept Python's capitalized False (it expects the JSON literal false).

LMK if that works!
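
In case it helps on the LangChain side: my content handler follows the standard LLMContentHandler pattern from the LangChain docs; the key detail is that transform_input JSON-encodes model_kwargs into the TGI "parameters" field, which is where "return_full_text": False needs to end up. A rough sketch (your handler may differ):

import json

from langchain.llms.sagemaker_endpoint import LLMContentHandler


class ContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs: dict) -> bytes:
        # Forward model_kwargs (including "return_full_text": False) as the
        # TGI "parameters" field, otherwise they never reach the endpoint
        body = json.dumps({"inputs": prompt, "parameters": model_kwargs})
        return body.encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        response_json = json.loads(output.read().decode("utf-8"))
        return response_json[0]["generated_text"]


content_handler = ContentHandler()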

I am facing the same issue; I tried using both return_full_text and include_prompt_in_result, and neither is working.

Same issue here. Because the instructions are returned in the chain output, the second chain produces output similar to the first chain.

Has anyone solved the issue with LangChain?
