|
---
license: mit
license_link: https://huggingface.co/microsoft/Phi-3-vision-128k-instruct/resolve/main/LICENSE
language:
- multilingual
pipeline_tag: text-generation
tags:
- nlp
- code
- vision
inference:
  parameters:
    temperature: 0.7
widget:
- messages:
  - role: user
    content: <|image_1|>Can you describe what you see in the image?
---
|
## Model Summary

Phi-3 Vision is a lightweight, state-of-the-art open multimodal model built upon datasets that include synthetic data and filtered, publicly available websites, with a focus on very high-quality, reasoning-dense data covering both text and vision. The model belongs to the Phi-3 model family, and the multimodal version supports a context length of 128K tokens. The model underwent a rigorous enhancement process, incorporating both supervised fine-tuning and direct preference optimization, to ensure precise instruction adherence and robust safety measures.
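As a quick sanity check of the advertised context window, you can inspect the model config before downloading the full weights. This is a minimal sketch; it assumes this checkpoint mirrors the base Phi-3 Vision config and exposes the standard `max_position_embeddings` field:

```python
from transformers import AutoConfig

# Assumption: the 128K context surfaces as max_position_embeddings in the config
config = AutoConfig.from_pretrained("Kukedlc/Phi-3-Vision-Win-snap", trust_remote_code=True)
print(config.max_position_embeddings)  # expected: 131072 (= 128K tokens)
```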
|
|
|
Resources and Technical Documentation:

+ [Phi-3 Microsoft Blog](https://aka.ms/Phi-3Build2024)
+ [Phi-3 Technical Report](https://aka.ms/phi3-tech-report)
+ [Phi-3 on Azure AI Studio](https://aka.ms/try-phi3vision)
+ [Phi-3 Cookbook](https://github.com/microsoft/Phi-3CookBook)
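The following sample shows how to load the model and processor with the `transformers` library and run multimodal inference on an image in a multi-turn chat: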
|
|
|
|
|
```python
from PIL import Image
import requests
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "Kukedlc/Phi-3-Vision-Win-snap"

# Load the model and its multimodal processor; trust_remote_code is required
# for the custom Phi-3 Vision modeling and processing code
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="cuda", trust_remote_code=True, torch_dtype="auto"
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Multi-turn chat; the <|image_1|> placeholder refers to the first image passed in
messages = [
    {"role": "user", "content": "<|image_1|>\nWhat is shown in this image?"},
    {"role": "assistant", "content": "The chart displays the percentage of respondents who agree with various statements about their preparedness for meetings. It shows five categories: 'Having clear and pre-defined goals for meetings', 'Knowing where to find the information I need for a meeting', 'Understanding my exact role and responsibilities when I'm invited', 'Having tools to manage admin tasks like note-taking or summarization', and 'Having more focus time to sufficiently prepare for meetings'. Each category has an associated bar indicating the level of agreement, measured on a scale from 0% to 100%."},
    {"role": "user", "content": "Provide insightful questions to spark discussion."},
]

url = "https://assets-c4akfrf5b4d3f4b7.z01.azurefd.net/assets/2024/04/BMDataViz_661fb89f3845e.png"
image = Image.open(requests.get(url, stream=True).raw)

# Render the chat into the model's prompt format, then tokenize text and image together
prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(prompt, [image], return_tensors="pt").to("cuda:0")

generation_args = {
    "max_new_tokens": 500,
    "temperature": 0.0,
    "do_sample": False,  # greedy decoding; temperature is ignored when sampling is off
}

generate_ids = model.generate(**inputs, eos_token_id=processor.tokenizer.eos_token_id, **generation_args)

# Strip the prompt tokens so only the newly generated response is decoded
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

print(response)
```
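For reference, `apply_chat_template` renders each turn into Phi-3 Vision's chat format. Assuming the standard Phi-3 template shipped with the base model, the first user turn above becomes roughly:

```
<|user|>
<|image_1|>
What is shown in this image?<|end|>
<|assistant|>
```

Consult the tokenizer's chat template in this repository for the authoritative format; building prompts via `apply_chat_template` rather than by hand avoids drift if the template changes.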
|
|