Model returns entire input prompt together with output

#43 by andee96

Hey everyone,
Apologies if this is a silly question; I'm a bit new to this. I've started playing around with falcon-40b-instruct and noticed that, regardless of the prompt I give it, it always returns the entire prompt along with the output.

Example:
Prompt: "User: Hello, how are you?\n Assistant:"
Generated text: "User: Hello, how are you?\n Assistant: I'm fine, how can I help you?"

This makes it pretty difficult to chain prompts together using LangChain. Is this how the model is supposed to behave? If not, what am I doing wrong? If so, what is the most appropriate way to handle it?
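For now, the only workaround I can think of is slicing the prompt back off myself, something like the sketch below (just a stopgap, and it assumes the output always starts with the exact prompt string):

def strip_prompt(prompt: str, generated_text: str) -> str:
    # Stopgap: if the model echoes the prompt verbatim, drop that prefix
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    return generated_text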

I've deployed falcon-40b-instruct on SageMaker using the template provided by Hugging Face.

Thank you in advance :)

model.generate(
    text=["def fibonnaci(", "User: How are you doing? Bot:"],
    max_length=64,
    include_prompt_in_result=False
)
Add "include_prompt_in_result=False" in model.generate(

lol, that does make me feel pretty silly; I will give that a try.
Do you know where I am supposed to pass this parameter when the model is deployed via AWS SageMaker?

I tried what you suggested in the following way:

import json
import time

from sagemaker.huggingface import HuggingFaceModel

# role and image_uri are set up as in the Hugging Face deployment template
# (the SageMaker execution role and the Hugging Face LLM/TGI container image)

instance_type = "ml.g4dn.12xlarge"
number_of_gpu = 4
health_check_timeout = 300

model_name = "falcon-40b-instruct" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
print(model_name)


# TGI config
config = {
  'HF_MODEL_ID': "tiiuae/falcon-40b-instruct", # model_id from hf.co/models
  'SM_NUM_GPUS': json.dumps(number_of_gpu), # Number of GPU used per replica
  'HF_MODEL_QUANTIZE': "bitsandbytes", # quantize with bitsandbytes
  'HF_TASK': 'text-generation'
}

model = HuggingFaceModel(
    name=model_name,
    role=role,
    image_uri=image_uri,
    env=config,
)
predictor = model.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  endpoint_name=model_name,
  container_startup_health_check_timeout=health_check_timeout,  # give the 40B weights time to load
)

input_data = {
  "inputs": "User: Hello, how are you?\n Assistant:",
  "parameters": {
    "do_sample": True,
    "top_k": 1,
    "max_length": 100,
    "include_prompt_in_result": False
  }
}
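
I then send that payload to the endpoint with the standard predictor interface, roughly:

# Invoke the deployed endpoint with the payload above
response = predictor.predict(input_data)
print(response)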

And unfortunately I still get the following response:
[{'generated_text': "User: Hello, how are you?\n Assistant: I'm fine, how can I help you?"}]

Use "return_full_text": False in the parameters to resolve this issue. Thank me later :)

That's amazing, thank you! I can confirm that this worked! Follow-up question: I am still getting the prompt echoed back when I try to use LangChain with the deployed endpoint. I would have thought that passing "return_full_text": False via the model_kwargs parameter would do the trick, but that does not seem to be the case.

from langchain import SagemakerEndpoint
llm = SagemakerEndpoint(
        endpoint_name=predictor.endpoint_name, 
        credentials_profile_name="dev", 
        region_name="eu-west-2", 
        model_kwargs={"temperature":0.7, "max_length": 1024, "return_full_text": False},
        content_handler=content_handler
)

However, if I use this LLM in any chain, the initial prompt gets returned again... Any clue what I am doing wrong here? :) @vaidyank

@andee96 I have falcon-40b deployed on SageMaker and I use

"return_full_text": false

to stop the behavior you're describing. The inference container appears to be written in Rust, and when the request is JSON-serialized it may not accept Python's capitalized False (it expects the JSON literal false).

LMK if that works!
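
In case it helps on the LangChain side: my content handler follows the standard LLMContentHandler pattern from the LangChain docs; the key detail is that transform_input JSON-encodes model_kwargs into the TGI "parameters" field, which is where "return_full_text": False needs to end up. A rough sketch (your handler may differ):

import json

from langchain.llms.sagemaker_endpoint import LLMContentHandler


class ContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs: dict) -> bytes:
        # Forward model_kwargs (including "return_full_text": False) as the
        # TGI "parameters" field, otherwise they never reach the endpoint
        body = json.dumps({"inputs": prompt, "parameters": model_kwargs})
        return body.encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        response_json = json.loads(output.read().decode("utf-8"))
        return response_json[0]["generated_text"]


content_handler = ContentHandler()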

I am facing the same issue; I tried using both return_full_text and include_prompt_in_result, and neither is working.

Same issue here. Because the instructions are returned in the chain output, the second chain produces output similar to the first chain.

Has anyone solved the issue with LangChain?
