Multimodal

Description

Adds support for multimodality (text+images) to text-generation-webui.

https://user-images.githubusercontent.com/3718215/233817203-69b57e77-0c55-4fd6-b742-3204bb13b8fc.mp4

Usage

To run this extension, download an LLM that supports multimodality, then start server.py with the appropriate --multimodal-pipeline argument. Examples:

python server.py --model wojtab_llava-7b-v0-4bit-128g --multimodal-pipeline llava-7b --chat
python server.py --model wojtab_llava-13b-v0-4bit-128g --multimodal-pipeline llava-13b --chat
python server.py --model anon8231489123_vicuna-13b-GPTQ-4bit-128g --multimodal-pipeline minigpt4-13b --chat
python server.py --model llama-7b-4bit --multimodal-pipeline minigpt4-7b --chat

There is built-in support for LLaVA-v0-13B and LLaVA-v0-7B. To install minigpt4:

The same procedure should be used to install other pipelines, which can then be selected with --multimodal-pipeline [pipeline name]. For additional multimodal pipelines, refer to the Compatibility section below.

Note that each image takes up a considerable number of tokens, so set max_new_tokens to at most 1700 (the recommended value is between 200 and 500) so that the images don't get truncated.
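The limit on max_new_tokens follows from a simple budget: the prompt text, the image tokens, and the generated tokens all share one context window. Below is a rough sketch of that arithmetic. The 2048-token context window and the 256 tokens per image are illustrative assumptions, not values taken from this extension, and prompt_budget is a hypothetical helper name.

```python
# Illustrative token budget; both constants are assumptions for this sketch.
CONTEXT_LENGTH = 2048     # assumed context window of the LLM
TOKENS_PER_IMAGE = 256    # assumed cost of one embedded image


def prompt_budget(max_new_tokens, n_images,
                  context_length=CONTEXT_LENGTH,
                  tokens_per_image=TOKENS_PER_IMAGE):
    """Return how many tokens remain for prompt text after reserving room
    for generation and for the embedded images.

    A negative result means something would have to be truncated.
    """
    reserved = max_new_tokens + n_images * tokens_per_image
    return context_length - reserved


print(prompt_budget(max_new_tokens=1700, n_images=1))  # 2048 - 1700 - 256 = 92
print(prompt_budget(max_new_tokens=400, n_images=1))   # 2048 - 400 - 256 = 1392
```

Under these assumptions, a value near the 1700 ceiling leaves very little room for text, which is why a smaller value is more comfortable in practice.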

To send an image, upload it to the extension field below the chat and send a prompt as usual. The image is added to the end of your message. To place it elsewhere, include the string <image> in your prompt at the position where the image should go.
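The placement rule can be pictured with a small sketch. This is not the extension's code: place_image is a hypothetical helper, "[IMG]" stands in for whatever the extension actually inserts, and the exact whitespace the extension adds around the image is not modeled.

```python
IMAGE_PLACEHOLDER = "<image>"


def place_image(prompt, image_tag):
    """Put image_tag at the first <image> placeholder if there is one;
    otherwise append it to the end of the prompt."""
    if IMAGE_PLACEHOLDER in prompt:
        return prompt.replace(IMAGE_PLACEHOLDER, image_tag, 1)
    return prompt + image_tag


# With a placeholder, the image goes where the placeholder was.
print(place_image("Describe <image> in one sentence.", "[IMG]"))
# Without a placeholder, the image is appended to the message.
print(place_image("What is this?", "[IMG]"))
```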

Additionally, there is an "Embed all images, not only the last one" checkbox. It controls which image embeddings are used: by default (unchecked), all images except the most recent one have their embeddings emptied, so they are not fed to the network. Some multimodal networks seem to consider the features of all images at the same time, as if they were a single image, which is why the extension skips previous images by default. However, this can lead to sub-par generation on other pipelines. To include all images, tick this checkbox.
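A minimal sketch of the checkbox behavior, using plain Python lists in place of real tensors. The function name is hypothetical, and None stands in for an "emptied" embedding; the extension's actual representation may differ.

```python
def select_image_embeddings(embeddings, add_all_images=False):
    """Return the embeddings that would be fed to the model.

    When add_all_images is False, every image except the most recent one
    is replaced by None (a stand-in for an emptied embedding).
    """
    if add_all_images or not embeddings:
        return list(embeddings)
    return [None] * (len(embeddings) - 1) + [embeddings[-1]]


history = ["emb_1", "emb_2", "emb_3"]
print(select_image_embeddings(history))                       # default: only the last image
print(select_image_embeddings(history, add_all_images=True))  # checkbox ticked
```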

Compatibility

As of now, the following multimodal pipelines are supported:

| Pipeline | --multimodal-pipeline | Default LLM | LLM info (for the linked model) | Pipeline repository |
| --- | --- | --- | --- | --- |
| LLaVA 13B | llava-13b | LLaVA 13B | GPTQ 4-bit quant, old CUDA | built-in |
| LLaVA 7B | llava-7b | LLaVA 7B | GPTQ 4-bit quant, old CUDA | built-in |
| MiniGPT-4 7B | minigpt4-7b | Vicuna v0 7B | GPTQ 4-bit quant, new format | Wojtab/minigpt-4-pipeline |
| MiniGPT-4 13B | minigpt4-13b | Vicuna v0 13B | GPTQ 4-bit quant, old CUDA | Wojtab/minigpt-4-pipeline |
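For scripting, the pipeline names can be mapped to launch commands. The sketch below is a hypothetical helper, not part of the extension. The model directory names are the ones used in the Usage examples of this README and will differ if your models are stored under other names.

```python
import shlex

# Pipeline name -> model directory, copied from the Usage examples above.
EXAMPLE_MODELS = {
    "llava-7b": "wojtab_llava-7b-v0-4bit-128g",
    "llava-13b": "wojtab_llava-13b-v0-4bit-128g",
    "minigpt4-7b": "llama-7b-4bit",
    "minigpt4-13b": "anon8231489123_vicuna-13b-GPTQ-4bit-128g",
}


def build_command(pipeline):
    """Build the server.py command line for one of the listed pipelines."""
    if pipeline not in EXAMPLE_MODELS:
        raise ValueError(f"unsupported pipeline: {pipeline!r}")
    args = ["python", "server.py",
            "--model", EXAMPLE_MODELS[pipeline],
            "--multimodal-pipeline", pipeline,
            "--chat"]
    return shlex.join(args)


print(build_command("llava-7b"))
```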

Some pipelines may work with other LLMs, but note that even if such a combination works, it isn't a supported configuration.

DO NOT report bugs if you are using a different LLM.

DO NOT report bugs with pipelines in this repository (unless they are built-in)

Extension config

This extension uses the following parameters (from settings.json):

| Parameter | Description |
| --- | --- |
| multimodal-vision_bits | Number of bits to load vision models (CLIP/ViT) feature extractor in (most pipelines should support either 32 or 16, default=32) |
| multimodal-vision_device | Torch device to run the feature extractor on, for example, cpu or cuda:0, by default cuda:0 if available |
| multimodal-projector_bits | Number of bits to load feature projector model(s) in (most pipelines should support either 32 or 16, default=32) |
| multimodal-projector_device | Torch device to run the feature projector model(s) on, for example cpu or cuda:0, by default cuda:0 if available |
| multimodal-add_all_images_to_prompt | Default value of "Embed all images, not only the last one" checkbox |
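As an illustration, the snippet below prints a JSON fragment containing these parameters. The key names come from the table above; the values are examples only, and placing the keys at the top level of settings.json is an assumption of this sketch.

```python
import json

# Example values only; the keys are the parameters listed above.
multimodal_settings = {
    "multimodal-vision_bits": 16,
    "multimodal-vision_device": "cuda:0",
    "multimodal-projector_bits": 16,
    "multimodal-projector_device": "cuda:0",
    "multimodal-add_all_images_to_prompt": False,
}

# Print a fragment that could be merged into settings.json.
print(json.dumps(multimodal_settings, indent=4))
```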

Usage through API

You can run multimodal inference through the API by embedding the images in the prompt. Images are embedded like so: f'<img src="data:image/jpeg;base64,{img_str}">', where img_str is base64-encoded JPEG data. Note that you need to launch server.py with the arguments --api --extensions multimodal.

Python example:

import base64
import requests

# System prompt plus one example exchange.
CONTEXT = "You are LLaVA, a large language and vision assistant trained by UW Madison WAIV Lab. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language. Follow the instructions carefully and explain your answers in detail.### Human: Hi!### Assistant: Hi there! How can I help you today?\n"

# Read the image and base64-encode it.
with open('extreme_ironing.jpg', 'rb') as f:
    img_str = base64.b64encode(f.read()).decode('utf-8')

# Embed the image in the prompt as an <img> tag with a data URI.
prompt = CONTEXT + f'### Human: What is unusual about this image: \n<img src="data:image/jpeg;base64,{img_str}">### Assistant: '
# The server must be running with --api --extensions multimodal.
response = requests.post('http://127.0.0.1:5000/api/v1/generate', json={'prompt': prompt, 'stopping_strings': ['\n###']})
print(response.json())

Script output:

{'results': [{'text': "The unusual aspect of this image is that a man is standing on top of a yellow minivan while doing his laundry. He has set up a makeshift clothes line using the car's rooftop as an outdoor drying area. This scene is uncommon because people typically do their laundry indoors, in a dedicated space like a laundromat or a room in their home, rather than on top of a moving vehicle. Additionally, hanging clothes on the car could be potentially hazardous or illegal in some jurisdictions due to the risk of damaging the vehicle or causing accidents on the road.\n##"}]}
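The sample output ends with a stray "\n##", which looks like a partial match of the '\n###' stopping string. A minimal sketch for pulling the text out of a response of this shape and trimming such a tail is shown below. extract_text is a hypothetical helper, and it will also strip a legitimate trailing "#", so treat it as a starting point only.

```python
def extract_text(response, stop_marker="###"):
    """Return the generated text from a response shaped like
    {'results': [{'text': ...}]}, without trailing whitespace or a
    trailing (possibly partial) stop marker."""
    text = response["results"][0]["text"].rstrip()
    # Try the full marker first, then shorter prefixes such as "##" and "#".
    for size in range(len(stop_marker), 0, -1):
        if text.endswith(stop_marker[:size]):
            text = text[:-size]
            break
    return text.rstrip()


sample = {"results": [{"text": "A man is ironing on a car.\n##"}]}
print(extract_text(sample))
```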

For pipeline developers/technical description

See DOCS.md.