How to have a continuous conversation

#19

by sucongCJS - opened Feb 23, 2024

Feb 23, 2024

Thanks for your amazing work!
According to your script, we can only provide one input. If I want to ask the model more than one question, what should I do?
If I provide more than one input, the answer is completely irrelevant to the image.
Here is my experiment:

prompt = "USER: <image>\nwhat is the image about\nASSISTANT:"
raw_image = Image.open("/home/ubuntu/code/textual_inversion/zzz/sea.jpg")
inputs = processor(prompt, raw_image, return_tensors="pt").to(device, torch.float16)
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True))

The image I provided: [image not shown]

Output (which is normal):
ER:
what is the image about
ASSISTANT: The image features a large body of water with a few boats scattered throughout the scene. The water appears to be calm and serene, with a few sailboats and a yacht visible in the distance. The sky above the water is clear and blue, creating a picturesque view of the ocean. The boats are positioned at various distances from each other, adding depth and interest to the scene.

For the second input, which has no image, I want the model to answer the question by referring to the image I provided before.

prompt = "USER: is the image positive? can you describe the image again?\nASSISTANT:"
inputs = processor(prompt, return_tensors="pt").to(device, torch.float16)
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True))

Output (which is irrelevant to the image):
ER: is the image positive? can you describe the image again?
ASSISTANT: The image is a positive image of a human brain. It is a close-up view of the brain, showing its intricate structure and details. The image is in black and white, which adds to the dramatic and artistic nature of the photograph. The brain is the main subject of the image, and it is the focal point of the photograph.

ggcristian

Apr 21, 2024

Hi, @sucongCJS

Were you able to find a way to do this in a script?

nielsr

Llava Hugging Face org Apr 22, 2024

In that case, you should append the previous message and the image to the prompt before feeding it back to the model.
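A minimal sketch of that suggestion, assuming the `USER: ...\nASSISTANT:` layout from the original post. The helper `build_prompt` and the newline separator between turns are illustrative assumptions, not part of the library; the model card is the place to confirm the exact template.

```python
def build_prompt(history, question, first_turn_has_image=True):
    """Rebuild the full prompt from earlier turns plus the new question.

    history: list of (user_text, assistant_text) pairs from earlier turns.
    The <image> placeholder is emitted once, in the first user turn.
    """
    parts = []
    for i, (user_text, assistant_text) in enumerate(history):
        image_tag = "<image>\n" if (i == 0 and first_turn_has_image) else ""
        parts.append(f"USER: {image_tag}{user_text}\nASSISTANT: {assistant_text}")
    # If there is no history yet, the new question is the first turn.
    image_tag = "<image>\n" if (not history and first_turn_has_image) else ""
    parts.append(f"USER: {image_tag}{question}\nASSISTANT:")
    return "\n".join(parts)


history = [("what is the image about", "A calm sea with a few boats.")]
prompt = build_prompt(history, "is the image positive?")
print(prompt)
```

The resulting prompt would then be passed together with the same `raw_image` to `processor(prompt, raw_image, return_tensors="pt")`, as in the first script, so the second question is still grounded in the image.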

ggcristian

Apr 22, 2024

Thanks, @nielsr.

Yes, I just tested this with a conversation loop that keeps appending the past USER and ASSISTANT turns to the prompt, and it worked well.
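Such a loop might look roughly like the sketch below. It is not the code used in this thread: `run_conversation` and `generate_fn` are hypothetical names, and `generate_fn` stands in for the processor, `model.generate`, and decoding of only the newly generated text.

```python
def run_conversation(questions, generate_fn):
    """Ask several questions in a row, carrying the transcript forward.

    generate_fn: callable taking the full prompt string and returning the
    model's new answer text (a placeholder for the real model call).
    """
    transcript = ""
    answers = []
    for i, question in enumerate(questions):
        # Only the first user turn carries the <image> placeholder.
        image_tag = "<image>\n" if i == 0 else ""
        transcript += f"USER: {image_tag}{question}\nASSISTANT:"
        answer = generate_fn(transcript).strip()
        # Store the answer so the next turn sees it as context.
        transcript += f" {answer}\n"
        answers.append(answer)
    return answers, transcript


# Stand-in for the model: it just reports how many USER turns it was shown.
def fake_generate(prompt):
    return f" I can see {prompt.count('USER:')} user turn(s). "


answers, transcript = run_conversation(["q1", "q2"], fake_generate)
print(transcript)
```

Because the whole transcript is re-sent each turn, the prompt grows with the conversation, so `max_new_tokens` and the model's context length limit how many turns fit.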

prakashshubham

Jun 21, 2024

Hi @ggcristian
Can you share the code for the same? I am not getting how to do that.

Yao-Lirong

Jul 3, 2024

edited Jul 3, 2024

> Can you share the code for the same? I am not getting how to do that.

Hi @prakashshubham, this is what I did to conduct a multi-round conversation. I hope you find it helpful.

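import torch

# Assumes `model`, `processor`, `device`, and `text_processor` are already
# defined; `text_processor` is presumably the tokenizer (or processor) that
# owns the chat template.
queries = [
    "<image>\nHow many animated characters are there in this image?",
    "Answer with a single number in decimal format. Give no explanations.",
]

def generate_response(image):
    chat = []  # running history of user and assistant turns
    for query in queries:
        chat.append({"role": "user", "content": query})
        # Re-render the whole history so the model sees every earlier turn.
        prompt = text_processor.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
        inputs = processor(prompt, image, return_tensors="pt").to(device)

        with torch.no_grad():
            output = model.generate(**inputs, max_new_tokens=300)
        output = processor.decode(output[0], skip_special_tokens=True)

        # Drop the decoded prompt from the front to keep only the new answer.
        input_ids = inputs["input_ids"]
        cutoff = len(text_processor.decode(
            input_ids[0],
            skip_special_tokens=True,
            clean_up_tokenization_spaces=True,
        ))
        answer = output[cutoff:]
        chat.append({"role": "assistant", "content": answer})
    return answer  # answer to the last query only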
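The character-count cutoff above works as long as the decoded prompt is an exact prefix of the decoded output. A possible alternative, sketched here under the assumption that `generate` returns the prompt ids followed by the new ids (the usual behavior for decoder-only models in `transformers`), is to slice by token count before decoding. The helper name and the toy ids are made up for illustration.

```python
def strip_prompt_ids(output_ids, prompt_len):
    """Return only the ids that were generated after the prompt."""
    return output_ids[prompt_len:]


# Toy ids standing in for tokenizer output; the values are made up.
prompt_ids = [1, 11, 12, 13]
output_ids = prompt_ids + [21, 22, 2]
new_ids = strip_prompt_ids(output_ids, len(prompt_ids))
print(new_ids)

# With real tensors the same idea would look like this (not run here):
#   prompt_len = inputs["input_ids"].shape[1]
#   answer = processor.decode(output[0][prompt_len:], skip_special_tokens=True)
```

Slicing ids instead of characters avoids any mismatch introduced by whitespace cleanup during decoding.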

prakashshubham

Jul 3, 2024

I was later able to do it myself. But still, thanks for this.
