JosephZ committed on
Commit
ba29bb3
1 Parent(s): 248d80c

Update README.md

Files changed (1): README.md (+81, -0)
README.md CHANGED
Project page: https://gpt4vision.github.io/gpt4sgg/

GPT4SGG aims to synthesize a scene graph from global & localized narratives for an image.
The model is fine-tuned on instruction-following data generated by GPT-4.

Usage demo:

````python
import json

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

tokenizer = AutoTokenizer.from_pretrained("JosephZ/gpt4sgg-llama2-13b-int8", trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token  # Llama 2 has no pad token; reuse EOS

# Load the 13B model with 8-bit weights on a single GPU
model = AutoModelForCausalLM.from_pretrained("JosephZ/gpt4sgg-llama2-13b-int8",
                                             trust_remote_code=True,
                                             load_in_8bit=True,
                                             device_map="cuda:0")

# Image data: objects are "category.number:[x1, y1, x2, y2]" strings; captions hold
# a global description plus localized descriptions of union regions.
custom_input = [
    {"image_id": "101115", "width": 640, "height": 480,
     "objects": ["person.1:[344, 374, 449, 480]", "person.2:[523, 274, 602, 479]", "person.3:[593, 366, 636, 474]",
                 "kite.4:[230, 85, 313, 137]", "kite.5:[461, 394, 529, 468]", "person.6:[456, 345, 512, 474]"],
     "captions": {"global": "a group of people flying a kite in a field",
                  "Union(person.2:[523, 274, 602, 479], person.3:[593, 366, 636, 474]) ; Union(person.2:[523, 274, 602, 479], kite.5:[461, 394, 529, 468])": "a woman in a dress",
                  "Union(kite.5:[461, 394, 529, 468], person.6:[456, 345, 512, 474])": "a man with no shirt"}
     }]
custom_input = json.dumps(custom_input)  # serialize to a JSON string

prompt = """[INST] <<SYS>>
You are a helpful AI visual assistant. Now, you are seeing image data. Each image provides a set of objects, and a set of captions for global and localized descriptions.
<</SYS>> Extract relationship triplets from image data, each characterized by a unique "image_id", image dimensions, a set of objects consisting of categories (formatted as "[category].[number]") and bounding boxes (in "xyxy" format). Each image data includes a global description for the entire image and localized descriptions for specific regions (notated as "Union(name1:box1, name2:box2)"; keys with ";" in captions, like "Union(name1:box1, name2:box2); Union(name3:box3, name4:box4)", refer to multiple union regions sharing the same caption).
Here are the requirements for the task:
1. Process each image individually: Focus on one image at a time and give a comprehensive output for that specific image before moving to the next.
2. Infer interactions and spatial relationships: Utilize objects' information and both global and localized descriptions to determine relationships between objects (e.g., "next to", "holding", "held by", etc.).
3. Maintain logical consistency: Avoid impossible or nonsensical relationships (e.g., a person cannot be riding two different objects simultaneously, a tie cannot be worn by two persons, etc.).
4. Eliminate duplicate entries: Each triplet in the output must be unique and non-repetitive.
5. Output should be formatted as a list of dicts in JSON format, containing "image_id" and "relationships" for each image.

Example output: \n```
[
 {"image_id": "123456",
  "relationships": [
     {"source": "person.1", "target": "skateboard.2", "relation": "riding"},
     {"source": "person.4", "target": "shirt.3", "relation": "wearing"},
     {"source": "person.2", "target": "bottle.5", "relation": "holding"},
     {"source": "person.4", "target": "bus.1", "relation": "near"}
  ]
 },
 {"image_id": "23455",
  "relationships": [
     {"source": "man.1", "target": "car.1", "relation": "driving"}
  ]
 }
]\n```

Ensure that each image's data is processed and outputted separately to maintain clarity and accuracy in the relationship analysis.

### Input:\n```\nPLACE_HOLDER\n```
[/INST]
### Output:\n```
"""

prompt = prompt.replace("PLACE_HOLDER", custom_input)

gen_cfg = GenerationConfig(do_sample=True, temperature=0.7, top_p=0.95)

prompts = [prompt]  # batch of prompts
inputs = tokenizer(prompts, padding=True, return_tensors='pt')
outputs = model.generate(input_ids=inputs['input_ids'].cuda(),
                         attention_mask=inputs['attention_mask'].cuda(),
                         generation_config=gen_cfg,
                         return_dict_in_generate=True,
                         max_length=4096,
                         pad_token_id=tokenizer.eos_token_id,
                         )
# Decode, then strip the prompt prefix to keep only the generated continuation
outputs_text = tokenizer.batch_decode(outputs.sequences, skip_special_tokens=True)
outputs_text = [text[len(p):] for text, p in zip(outputs_text, prompts)]

print("*" * 10, " prompt:", prompt)
print("*" * 10, " resp:", outputs_text[0])
````
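
Because the prompt ends by opening a fenced block after "### Output:", the model writes the JSON list of relationships and then closes the fence. A minimal post-processing sketch, assuming the model does emit that closing fence (the `parse_relationships` helper is illustrative, not part of the released code):

````python
import json

def parse_relationships(resp: str) -> list:
    """Illustrative helper: keep the text before the closing code fence
    and parse it as the JSON list the prompt asks for."""
    body = resp.split("```")[0].strip()
    return json.loads(body)

# Print the triplets for the first (and only) image in the batch
for image in parse_relationships(outputs_text[0]):
    for rel in image.get("relationships", []):
        print(image["image_id"], rel["source"], rel["relation"], rel["target"])
````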
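
To run the demo on your own detections, the inputs only need to follow the string formats shown in `custom_input`. A small sketch of how those strings could be built (the helper names `fmt_obj` and `union_key` are assumptions for illustration, not from the original README):

````python
# Hypothetical helpers for building the "category.number:[x1, y1, x2, y2]"
# object strings and "Union(...)" caption keys expected by the model.
def fmt_obj(category: str, idx: int, box: list) -> str:
    """Format one detection as "category.idx:[x1, y1, x2, y2]"."""
    return f"{category}.{idx}:[{', '.join(str(int(v)) for v in box)}]"

def union_key(*objs: str) -> str:
    """Build a localized-caption key covering the union of the given objects."""
    return f"Union({', '.join(objs)})"

p2 = fmt_obj("person", 2, [523, 274, 602, 479])
k5 = fmt_obj("kite", 5, [461, 394, 529, 468])
print(union_key(p2, k5))  # Union(person.2:[523, 274, 602, 479], kite.5:[461, 394, 529, 468])
````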