# YING-VLM

We have open-sourced the trained checkpoint and inference code of [YING-VLM](https://huggingface.co/MMInstruction/YingVLM) on Hugging Face. The model is trained on the [M3IT](https://huggingface.co/datasets/MMInstruction/M3IT) dataset.
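Because the example below imports the released inference code directly (`from modelingYING import ...`), you need the repository files available locally. A minimal fetch sketch, assuming `modelingYING.py` ships inside the checkpoint repository and that `huggingface_hub` is installed:

```python
# A minimal sketch (assumption: modelingYING.py is distributed inside the
# MMInstruction/YingVLM repository; requires the huggingface_hub package).
import sys
from huggingface_hub import snapshot_download

local_dir = snapshot_download("MMInstruction/YingVLM")  # downloads weights + code
sys.path.insert(0, local_dir)  # makes `from modelingYING import ...` resolvable
```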

# Example of Using YING-VLM

Please install the following packages (a scripted alternative is sketched after the list):
- torch==2.0.0
- transformers==4.31.0
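If you prefer to pin these from a script rather than a shell, the following is a programmatic equivalent of `pip install torch==2.0.0 transformers==4.31.0`:

```python
# Programmatic equivalent of `pip install torch==2.0.0 transformers==4.31.0`.
import subprocess
import sys

subprocess.check_call(
    [sys.executable, "-m", "pip", "install", "torch==2.0.0", "transformers==4.31.0"]
)
```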

Inference example:
```python
from transformers import AutoProcessor, AutoTokenizer
from PIL import Image
import torch

# VLMForConditionalGeneration is defined in the released inference code (modelingYING.py)
from modelingYING import VLMForConditionalGeneration

# set device
device = "cuda:0"

# set prompt template
prompt_template = """
<human>:
{instruction}
{input}
<bot>:
"""

# load processor and tokenizer
processor = AutoProcessor.from_pretrained("MMInstruction/YingVLM")
tokenizer = AutoTokenizer.from_pretrained("MMInstruction/YingVLM")  # ziya is not available right now

# load model in half precision
model = VLMForConditionalGeneration.from_pretrained("MMInstruction/YingVLM")
model.to(device, dtype=torch.float16)

# prepare input (the variable is named `question` to avoid shadowing the `input` builtin)
image = Image.open("./imgs/night_house.jpeg")
instruction = "Scrutinize the given image and answer the connected question."
question = "What is the color of the couch?"
prompt = prompt_template.format(instruction=instruction, input=question)

# preprocess: image features from the processor, token ids from the tokenizer
inputs = processor(images=image, return_tensors="pt").to(device, torch.float16)
text_inputs = tokenizer(prompt, return_tensors="pt")
inputs.update(text_inputs)

# greedy decoding
generated_ids = model.generate(
    **{k: v.to(device) for k, v in inputs.items()},
    img_num=1,
    max_new_tokens=128,
    do_sample=False,
)
# "\n" is the end token, so keep only the first line of the decoded output
generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0].split("\n")[0]

print(generated_text)
# The couch in the living room is green.
```
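To ask several questions about the same image, the steps above can be folded into a small helper. The `answer` function below is our own illustration, not part of the released code; it reuses `model`, `processor`, `tokenizer`, `prompt_template`, and `device` from the example:

```python
# Hypothetical helper (not part of the released inference code); it simply
# repackages the example's preprocessing, generate, and decode steps.
def answer(image, instruction, question):
    prompt = prompt_template.format(instruction=instruction, input=question)
    inputs = processor(images=image, return_tensors="pt").to(device, torch.float16)
    inputs.update(tokenizer(prompt, return_tensors="pt"))
    generated_ids = model.generate(
        **{k: v.to(device) for k, v in inputs.items()},
        img_num=1,
        max_new_tokens=128,
        do_sample=False,
    )
    # keep only the first line, since "\n" ends the answer
    return tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0].split("\n")[0]

print(answer(image, instruction, "Is there a light on in the house?"))
```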

# Reference

If you find our work useful, please kindly cite:

```bib
@article{li2023m3it,
  title={M$^3$IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning},
  author={Lei Li and Yuwei Yin and Shicheng Li and Liang Chen and Peiyi Wang and Shuhuai Ren and Mukai Li and Yazheng Yang and Jingjing Xu and Xu Sun and Lingpeng Kong and Qi Liu},
  journal={arXiv preprint arXiv:2306.04387},
  year={2023}
}
```