import requests
from PIL import Image
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Load the converted checkpoint in half precision and move it to GPU 0.
model_id = "PerRing/llava-v1.6-vicuna-13b-hf"
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to(0)
processor = AutoProcessor.from_pretrained(model_id)

Q='explain about this image.'
prompt = f"""A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. USER: <image>
{Q} ASSISTANT:
"""
image_file = "https://images.pexels.com/photos/757889/pexels-photo-757889.jpeg?auto=compress&cs=tinysrgb&w=1260&h=750&dpr=2"

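# Download the image, then tokenize the prompt and preprocess the image together.
raw_image = Image.open(requests.get(image_file, stream=True).raw)
inputs = processor(text=prompt, images=raw_image, return_tensors='pt').to(0, torch.float16)

# max_new_tokens bounds only the generated text, regardless of prompt length.
output = model.generate(**inputs, max_new_tokens=256, temperature=0.4, do_sample=True)
print(processor.decode(output[0], skip_special_tokens=True))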

Result

A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. USER:  
explain about this image. ASSISTANT:

This image shows a clear glass vase filled with water, which is placed on a surface that appears to be a balcony or a patio. Inside the vase, there are several purple flowers with ruffled petals, which could be a type of iris or a similar flower. The flowers are in full bloom, and their vibrant purple color stands out against the green leaves and stems. The background is blurred but suggests an outdoor setting with greenery, indicating that the flowers are likely in a garden or a balcony garden. The overall atmosphere of the image is serene and natural, with a focus on the beauty of the flowers.
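The decoded text above repeats the full prompt before the model's answer. If only the answer is needed, one option is to cut the string at the last `ASSISTANT:` marker. The following is a minimal sketch, not part of the original script: the helper name `extract_answer` is hypothetical, and it assumes the prompt template shown earlier, where the answer follows the final `ASSISTANT:`.

```python
def extract_answer(decoded: str, marker: str = "ASSISTANT:") -> str:
    """Return the text after the last occurrence of `marker`, stripped.

    `extract_answer` is an illustrative helper, not a transformers API.
    If the marker is absent, the whole string is returned stripped.
    """
    # rpartition splits at the LAST marker; if the marker is missing,
    # the whole input ends up in the third element.
    _, _, tail = decoded.rpartition(marker)
    return tail.strip()


# Shortened stand-in for the decoded output shown above.
decoded = (
    "A chat between a curious human and an artificial intelligence assistant. "
    "USER:  \nexplain about this image. ASSISTANT:\n\n"
    "This image shows a clear glass vase filled with water."
)
print(extract_answer(decoded))
# prints: This image shows a clear glass vase filled with water.
```

Splitting on the last marker rather than the first keeps the helper usable if earlier turns of a conversation also contain `ASSISTANT:`.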

Original (liuhaotian/llava-v1.6-vicuna-13b) README.md

# LLaVA Model Card

Model details

Model type: LLaVA is an open-source chatbot trained by fine-tuning an LLM on multimodal instruction-following data. It is an auto-regressive language model based on the transformer architecture. Base LLM: lmsys/vicuna-13b-v1.5

Model date: LLaVA-v1.6-Vicuna-13B was trained in December 2023.

Paper or resources for more information: https://llava-vl.github.io/

License

Where to send questions or comments about the model: https://github.com/haotian-liu/LLaVA/issues

Intended use

Primary intended uses: The primary use of LLaVA is research on large multimodal models and chatbots.

Primary intended users: The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

Training dataset

558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
158K GPT-generated multimodal instruction-following data.
500K academic-task-oriented VQA data mixture.
50K GPT-4V data mixture.
40K ShareGPT data.

Evaluation dataset

A collection of 12 benchmarks, including 5 academic VQA benchmarks and 7 recent benchmarks specifically proposed for instruction-following LMMs.