Update README.md
README.md
Project page: https://gpt4vision.github.io/gpt4sgg/
GPT4SGG aims to synthesize a scene graph from global & localized narratives for an image.
The model is fine-tuned on instruction-following data generated by GPT-4.

Usage demo:
```
import json

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

tokenizer = AutoTokenizer.from_pretrained("JosephZ/gpt4sgg-llama2-13b-int8", trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained("JosephZ/gpt4sgg-llama2-13b-int8", trust_remote_code=True,
                                             load_in_8bit=True,
                                             device_map="cuda:0")

custom_input = [
    {"image_id": "101115", "width": 640, "height": 480,
     "objects": ["person.1:[344, 374, 449, 480]", "person.2:[523, 274, 602, 479]", "person.3:[593, 366, 636, 474]",
                 "kite.4:[230, 85, 313, 137]", "kite.5:[461, 394, 529, 468]", "person.6:[456, 345, 512, 474]"],
     "captions": {"global": "a group of people flying a kite in a field",
                  "Union(person.2:[523, 274, 602, 479], person.3:[593, 366, 636, 474]) ; Union(person.2:[523, 274, 602, 479], kite.5:[461, 394, 529, 468])": "a woman in a dress",
                  "Union(kite.5:[461, 394, 529, 468], person.6:[456, 345, 512, 474])": "a man with no shirt"}
    }]
custom_input = json.dumps(custom_input)  # to str

prompt = """[INST] <<SYS>>
You are a helpful AI visual assistant. Now, you are seeing image data. Each image provides a set of objects, and a set of captions for global and localized descriptions.
<</SYS>> Extract relationship triplets from image data, each characterized by a unique "image_id", image dimensions, a set of objects consisting of categories (formatted as "[category].[number]") and bounding boxes (in "xyxy" format). Each image data includes a global description for the entire image and localized descriptions for specific regions (notated as "Union(name1:box1, name2:box2)", keys with ";" in captions like "Union(name1:box1, name2:box2); Union(name3:box3, name4:box4)" refer to multiple union regions share the same caption).
Here are the requirements for the task:
1. Process each image individually: Focus on one image at a time and give a comprehensive output for that specific image before moving to the next.
2. Infer interactions and spatial relationships: Utilize objects' information and both global and localized descriptions to determine relationships between objects(e.g., "next to", "holding", "held by", etc.).
3. Maintain logical consistency: Avoid impossible or nonsensical relationships (e.g., a person cannot be riding two different objects simultaneously, a tie cannot be worn by two persons, etc.).
4. Eliminate duplicate entries: Each triplet in the output must be unique and non-repetitive.
5. Output should be formatted as a list of dicts in JSON format, containing "image_id" and "relationships" for each image.

Example output: \n```
[
 {"image_id": "123456",
  "relationships": [
     {"source": "person.1", "target": "skateboard.2", "relation": "riding"},
     {"source": "person.4", "target": "shirt.3", "relation": "wearing"},
     {"source": "person.2", "target": "bottle.5", "relation": "holding"},
     {"source": "person.4", "target": "bus.1", "relation": "near"},
  ]
 },
 {"image_id": "23455",
  "relationships": [
     {"source": "man.1", "target": "car.1", "relation": "driving"}
  ]
 }
]\n```

Ensure that each image's data is processed and outputted separately to maintain clarity and accuracy in the relationship analysis.

### Input:\n```\nPLACE_HOLDER\n```
[/INST]
### Output:\n```
"""

prompt = prompt.replace("PLACE_HOLDER", custom_input)

gen_cfg = GenerationConfig(do_sample=True, temperature=0.7, top_p=0.95)

prompts = [prompt]  # batch prompts
inputs = tokenizer(prompts, padding=True, return_tensors='pt')
outputs = model.generate(input_ids=inputs['input_ids'].cuda(),
                         generation_config=gen_cfg,
                         return_dict_in_generate=True,
                         max_length=4096,
                         pad_token_id=tokenizer.eos_token_id,
                         )
outputs_text = tokenizer.batch_decode(outputs.sequences, skip_special_tokens=True)
outputs_text = [e[len(prompt):] for e, prompt in zip(outputs_text, prompts)]

print("*"*10, " prompt:", prompt)
print("*"*10, " resp:", outputs_text[0])
```
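Note that loading with `load_in_8bit=True` relies on the `bitsandbytes` and `accelerate` packages being installed.

The prompt instructs the model to answer with a JSON list of dicts, each holding an `image_id` and its `relationships`. Below is a minimal sketch for turning the raw response into triplets; the `parse_relationships` helper is not part of this repository, and it assumes the response contains valid JSON followed by an optional closing code fence:

```
import json

def parse_relationships(resp: str):
    # Hypothetical helper: keep only the text before an optional closing fence,
    # then parse it as the JSON list described in the prompt.
    payload = resp.split("```")[0].strip()
    return json.loads(payload)

# Usage with the demo above:
for image in parse_relationships(outputs_text[0]):
    for rel in image.get("relationships", []):
        print(image["image_id"], rel["source"], rel["relation"], rel["target"])
```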