
Project page: https://gpt4vision.github.io/gpt4sgg/

GPT4SGG synthesizes a scene graph for an image from its global and localized narratives. The model is fine-tuned on instruction-following data generated by GPT-4.

Usage demo:

import json

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

tokenizer = AutoTokenizer.from_pretrained("JosephZ/gpt4sgg-llama2-13b-int8", trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA-2 has no pad token; reuse EOS for padding
tokenizer.padding_side = "left"  # decoder-only models should be left-padded for batched generation

model = AutoModelForCausalLM.from_pretrained("JosephZ/gpt4sgg-llama2-13b-int8", trust_remote_code=True, 
                                             load_in_8bit=True,
                                             device_map="cuda:0")
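# Note: load_in_8bit requires the bitsandbytes package; device_map="cuda:0" places the whole model on the first GPU.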

custom_input = [
    {"image_id": "101115", "width": 640, "height": 480, 
     "objects": ["person.1:[344, 374, 449, 480]", "person.2:[523, 274, 602, 479]", "person.3:[593, 366, 636, 474]",
                 "kite.4:[230, 85, 313, 137]", "kite.5:[461, 394, 529, 468]", "person.6:[456, 345, 512, 474]"], 
     "captions": {"global": "a group of people flying a kite in a field", 
                  "Union(person.2:[523, 274, 602, 479], person.3:[593, 366, 636, 474]) ; Union(person.2:[523, 274, 602, 479], kite.5:[461, 394, 529, 468])": "a woman in a dress", 
                  "Union(kite.5:[461, 394, 529, 468], person.6:[456, 345, 512, 474])": "a man with no shirt"}
     }]
custom_input = json.dumps(custom_input) # to str
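# Next, build the instruction prompt (Llama-2 chat format with [INST] / <<SYS>> tags);
# the serialized JSON input above will replace PLACE_HOLDER inside it.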

prompt = """[INST] <<SYS>>
You are a helpful AI visual assistant. Now, you are seeing image data. Each image provides a set of objects, and a set of captions for global and localized descriptions. 
<</SYS>> Extract relationship triplets from image data, each characterized by a unique "image_id", image dimensions, a set of objects consisting of categories (formatted as "[category].[number]") and bounding boxes (in "xyxy" format). Each image's data includes a global description for the entire image and localized descriptions for specific regions (notated as "Union(name1:box1, name2:box2)"; keys with ";" in captions, like "Union(name1:box1, name2:box2); Union(name3:box3, name4:box4)", refer to multiple union regions sharing the same caption).
Here are the requirements for the task:
1. Process each image individually: Focus on one image at a time and give a comprehensive output for that specific image before moving to the next.
2. Infer interactions and spatial relationships: Utilize objects' information and both global and localized descriptions to determine relationships between objects (e.g., "next to", "holding", "held by", etc.).
3. Maintain logical consistency: Avoid impossible or nonsensical relationships (e.g., a person cannot be riding two different objects simultaneously, a tie cannot be worn by two persons, etc.).
4. Eliminate duplicate entries: Each triplet in the output must be unique and non-repetitive.
5. Output should be formatted as a list of dicts in JSON format, containing "image_id" and "relationships" for each image.
                
Example output: \n```
[
{"image_id": "123456",
 "relationships": [
    {"source": "person.1", "target": "skateboard.2", "relation": "riding"},
    {"source": "person.4", "target": "shirt.3", "relation": "wearing"},
    {"source": "person.2", "target": "bottle.5", "relation": "holding"},
    {"source": "person.4", "target": "bus.1", "relation": "near"},
  ]         
},
{"image_id": "23455",
 "relationships": [ 
    {"source": "man.1", "target": "car.1", "relation": "driving"}
  ]         
}           
]\n```             

Ensure that each image's data is processed and outputted separately to maintain clarity and accuracy in the relationship analysis.
                           
### Input:\n```\nPLACE_HOLDER\n```
[/INST]
### Output:\n```
"""

prompt = prompt.replace("PLACE_HOLDER", custom_input)



gen_cfg = GenerationConfig(do_sample=True, temperature=0.7, top_p=0.95)  # sampling config; do_sample=False gives greedy, more deterministic output

prompts = [prompt] # batch prompts
inputs = tokenizer(prompts, padding=True, return_tensors='pt')
outputs = model.generate(input_ids=inputs['input_ids'].cuda(),
                         attention_mask=inputs['attention_mask'].cuda(),
                         generation_config=gen_cfg,
                         return_dict_in_generate=True,
                         max_length=4096,
                         pad_token_id=tokenizer.eos_token_id,
                         )
# Decode and strip the echoed prompt, keeping only the generated continuation
outputs_text = tokenizer.batch_decode(outputs.sequences, skip_special_tokens=True)
outputs_text = [e[len(p):] for e, p in zip(outputs_text, prompts)]

print("*"*10, " prompt:", prompt)
print("*"*10, " resp:", outputs_text[0])