Implement vision with function calling?

#3 opened by tarruda

According to https://twitter.com/Teknium1/status/1731409499595194679, this model is not behaving as expected (I was really looking forward to trying it! :( ).

As a dummy LLM user who doesn't understand what you're doing here, I have a question: instead of adding vision directly to the LLM, why not train it to emit a function call whenever the user asks for an image to be analyzed?

For example, say we're building a ChatGPT clone that lets the user upload images. Instead of embedding the image URL or base64 directly in the conversation (which theoretically consumes context), the app could assign the image a UUID and embed something like this in the conversation:

[image:68ffdd0b-2e14-4a0c-a57f-e713164d1271]

What is in this image?

The UUID is a way for the application to locate the image in a database, filesystem, or URL (since the user uploaded the image into the conversation, the app would have stored it somewhere). When the LLM sees a marker like this, it would generate a function call (with the UUID as a parameter), which the app could forward to a vision encoder model or an external API, then inject the results back into the conversation.
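To make the idea concrete, here is a minimal, hypothetical sketch of that flow. The function names, the tool-call syntax the model is assumed to emit, and the `llm_generate` callable are all illustrative assumptions, not part of this model or any existing library.

```python
# Minimal, hypothetical sketch of the flow described above. The function names,
# the tool-call syntax the model emits, and the llm_generate callable are all
# illustrative assumptions, not part of this model or any existing library.
import re

TOOL_CALL = re.compile(r'describe_image\("([0-9a-f\-]{36})"\)')

def lookup_image(image_uuid: str) -> bytes:
    # App-specific: fetch the uploaded image from the database, filesystem,
    # or URL where the app stored it, keyed by the UUID in the conversation.
    raise NotImplementedError

def caption_image(image_bytes: bytes) -> str:
    # Forward the image to a separate vision encoder model or an external
    # captioning API and return its text description.
    raise NotImplementedError

def run_turn(conversation: str, llm_generate) -> str:
    """llm_generate(prompt: str) -> str is any text-only LLM backend."""
    reply = llm_generate(conversation)
    call = TOOL_CALL.search(reply)
    if call:
        # The model saw "[image:<uuid>]" and emitted describe_image("<uuid>");
        # execute the call and feed the result back so the model can answer.
        description = caption_image(lookup_image(call.group(1)))
        conversation += f"\n{reply}\nFUNCTION RESULT: {description}\n"
        reply = llm_generate(conversation)
    return reply
```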

NousResearch org
edited Dec 4, 2023

The vision encoder model ultimately has to emit embeddings (pseudo-text) for the model to interpret. Unless I'm unaware of some unique point of this model, the image isn't embedded as a URL or base64 -- it's embedded as image encoder tokens.
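For anyone following the thread, here is a rough LLaVA-style illustration of what "embedded as image encoder tokens" means. It is not this model's code, and the dimensions are assumptions; the point is only that the image enters the LLM as embedding vectors, not as text.

```python
# LLaVA-style illustration (not this model's code): the vision encoder's patch
# features are projected into the LLM's token-embedding space and concatenated
# with ordinary text embeddings. The dimensions below are assumptions.
import torch
import torch.nn as nn

vision_hidden = 1024   # e.g. a ViT-L/14 feature size (assumed)
llm_hidden = 4096      # e.g. a 7B LLM's embedding size (assumed)

projector = nn.Linear(vision_hidden, llm_hidden)

# Stand-ins for real encoder outputs and embedded prompt tokens.
image_patch_features = torch.randn(1, 576, vision_hidden)  # 24x24 patches
text_token_embeddings = torch.randn(1, 32, llm_hidden)     # embedded prompt

image_tokens = projector(image_patch_features)   # the "pseudo-text" embeddings
llm_input = torch.cat([image_tokens, text_token_embeddings], dim=1)
print(llm_input.shape)  # torch.Size([1, 608, 4096]), fed to the LLM as-is
```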

I see. Wouldn't it be possible to implement vision by having a function call that generates a detailed text description for the LLM, which then uses the description to answer any questions the user might have about the image? Here's an example of what I meant: https://huggingface.co/mlabonne/NeuralHermes-2.5-Mistral-7B/discussions/3#656e177f85562996982218ef

NousResearch org

I believe that is how Google Bard's vision works - it gets a detailed description from Lens and uses that.

Yeah, so it's pretty accurate, but it can't really get the full meaning of the image.

The LLaVA data was generated like that. They fed GPT-4 the captions and metadata of the images, then told GPT-4 to generate a conversation out of those. This is a great way to get more data, but it's also very vulnerable to hallucination.
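For reference, that generation step looks roughly like the sketch below: text-only GPT-4 never sees the pixels, only captions and bounding-box metadata, and is asked to write a conversation as if it could see the image, which is where hallucination creeps in. The prompt wording and the OpenAI client usage are illustrative, not the exact LLaVA pipeline.

```python
# Rough sketch of LLaVA-style data generation. The prompt wording and the
# OpenAI client usage are illustrative, not the exact LLaVA pipeline.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def make_visual_conversation(captions: list[str], boxes: list[str]) -> str:
    context = (
        "Captions:\n" + "\n".join(captions)
        + "\n\nObjects (category: bounding box):\n" + "\n".join(boxes)
    )
    prompt = (
        "The data below describes an image you cannot see. Write a multi-turn "
        "conversation between a user asking about the image and an assistant "
        "answering as if it could see it.\n\n" + context
    )
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content

# Toy usage (coordinates are made up):
# make_visual_conversation(
#     ["a dog catching a frisbee in a park"],
#     ["dog: [120, 80, 340, 300]", "frisbee: [300, 60, 360, 110]"],
# )
```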
