Image in an arbitrary position in the input text (Plus multiple images as input)
Hello. First, I appreciate you sharing this great work.
By default, your model always inserts the image at the beginning of the input text. However, I wanted to use text interleaved with an image as input, such as "This image {image comes here} is ...". So I made a wrapper class for the processor that accepts an image at an arbitrary position in the text. Here's the code. With this wrapped processor, you can use `<image>` to specify the location of the image in the input text.
I tried this, and it seemed to work (it can't accept bounding boxes, though). Do you think this wrapper is reasonable and will work correctly?
Also, I want to insert multiple images in the input, but I haven't figured out how. Do you have any plans to release code for multiple images?
Thank you in advance.
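The `insert_images` helper used below is not shown in the post; here is a minimal sketch of what it presumably does, inferred from the comments in the wrapper (it expands each `<image>` placeholder into a marker token plus `num_image_tokens` fake `<image>` tokens and a closing `</image> ` suffix). The exact format is an assumption:

```python
def insert_images(text, num_image_tokens=64, suffix='</image> '):
    """Replace each `<image>` placeholder with the fake token run the
    wrapper expects: one leading `<image>` marker followed by
    `num_image_tokens` fake `<image>` tokens and the closing suffix.

    NOTE: this helper is not included in the original post; this is a
    sketch inferred from the wrapper's comments, not the author's code.
    """
    # str.replace does not rescan the substituted text, so the fake
    # `<image>` tokens inserted here are not expanded again
    fake_run = '<image>' * (num_image_tokens + 1) + suffix
    return text.replace('<image>', fake_run)
```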
```python
import numpy as np
import torch
from transformers import BatchFeature


class ProcessorWrapper:
    def __init__(self, processor):
        self.processor = processor

    def __call__(
        self,
        images=None,
        text=None,
        bboxes=None,
        num_image_tokens=64,
        first_image_token_id=None,
        add_special_tokens=True,
        padding=False,
        truncation=None,
        max_length=None,
        stride=0,
        pad_to_multiple_of=None,
        return_attention_mask=None,
        return_overflowing_tokens: bool = False,
        return_special_tokens_mask: bool = False,
        return_offsets_mapping: bool = False,
        return_token_type_ids: bool = False,
        return_length: bool = False,
        verbose: bool = True,
        return_tensors=None,
        **kwargs,
    ):
        """
        Preprocess text and image for the Kosmos-2 model.

        Args:
            text (str): The text to be encoded. `<image>` specifies the
                location of the image embeddings in the text.
        """
        # Add fake <image><image>...<image></image> tokens to the text;
        # these tokens mark the location of the image embeddings. The
        # trailing space in the suffix is necessary to match the
        # original behavior.
        text = insert_images(text, num_image_tokens=num_image_tokens, suffix='</image> ')
        text_encoding = self.processor.tokenizer(
            text=text,
            add_special_tokens=add_special_tokens,
            padding=padding,
            truncation=truncation,
            max_length=max_length,
            stride=stride,
            pad_to_multiple_of=pad_to_multiple_of,
            return_attention_mask=return_attention_mask,
            return_overflowing_tokens=return_overflowing_tokens,
            return_special_tokens_mask=return_special_tokens_mask,
            return_offsets_mapping=return_offsets_mapping,
            return_token_type_ids=return_token_type_ids,
            return_length=return_length,
            verbose=verbose,
            return_tensors=return_tensors,
            **kwargs,
        )

        # Find the start of the image tokens. 64003 is the id of the
        # <image> token; adding 1 skips the first <image> marker, so
        # start_index points at the first fake image token.
        input_ids = np.array(text_encoding['input_ids'])
        start_index = np.where(input_ids[0] == 64003)[0][0] + 1

        # Replace the fake <image> tokens with the special id range.
        # (The original version always overwrote first_image_token_id;
        # only fall back to unk_token_id + 1 when it is not given.)
        if first_image_token_id is None:
            first_image_token_id = self.processor.tokenizer.unk_token_id + 1
        input_ids[:, start_index : start_index + num_image_tokens] = np.arange(
            first_image_token_id, first_image_token_id + num_image_tokens
        )

        # Image attention mask: one on the image tokens, zero elsewhere.
        img_attn_mask = np.zeros_like(input_ids)
        img_attn_mask[:, start_index : start_index + num_image_tokens] = 1

        # Process the image itself.
        image_encoding = self.processor.image_processor(images, return_tensors=return_tensors)

        # Convert to the requested tensor type.
        if return_tensors == 'pt':
            input_ids = torch.from_numpy(input_ids)
            img_attn_mask = torch.from_numpy(img_attn_mask)
        elif return_tensors is not None:
            raise ValueError(f'Invalid return_tensors: {return_tensors}')

        # Wrap everything up.
        encoding = BatchFeature()
        encoding['input_ids'] = input_ids
        encoding['attention_mask'] = text_encoding['attention_mask']
        encoding['img_attn_mask'] = img_attn_mask
        encoding['pixel_values'] = image_encoding['pixel_values']
        return encoding

    def __getattr__(self, attr):
        return getattr(self.processor, attr)
```
```python
# Usage
model = ...
processor = ...
myprocessor = ProcessorWrapper(processor)
text = 'This image <image> is'
inputs = myprocessor(text=text, images=image, return_tensors="pt").to(model.device)
```
Hi @Shun-onoo
The current implementation takes the original demo here as a reference, which has the comment `# TODO: input interleave image and text`. And yes, this repository (made by me) currently handles only a single image, and its information is placed at the beginning of the token sequence.
I don't want to promise anything about an approach I am not 100% sure of. You can, however, ask the Kosmos-2 authors here.
From what I know (roughly), different inputs (image info + associated text) can be interleaved, but for something like "This image {image comes here} is ...", I have to say I haven't seen this format mentioned.
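For what it's worth, the wrapper's single-image logic could in principle be generalized to several placeholders: replace every `<image>` marker, then locate each run of fake tokens and build one image attention mask covering all runs. Below is a rough, untested sketch of that token-id patching step only; the token ids and the helper name are assumptions, and whether the Kosmos-2 model itself accepts multiple interleaved images is exactly the open question above:

```python
import numpy as np

IMAGE_BOI_ID = 64003           # assumed id of the '<image>' marker token
NUM_IMAGE_TOKENS = 64
FIRST_IMAGE_TOKEN_ID = 64004   # assumed: unk_token_id + 1, as in the wrapper


def patch_multi_image(input_ids):
    """Replace every run of fake image tokens (one run per `<image>`
    placeholder) with the special id range, and build an image attention
    mask covering all runs. Hypothetical sketch only -- not part of the
    repository, and model support for multiple images is unverified."""
    input_ids = np.array(input_ids)
    img_attn_mask = np.zeros_like(input_ids)
    for row in range(input_ids.shape[0]):
        # each placeholder contributes NUM_IMAGE_TOKENS + 1 consecutive
        # marker ids, so every (NUM+1)-th marker starts a new run
        starts = np.where(input_ids[row] == IMAGE_BOI_ID)[0]
        run_starts = starts[:: NUM_IMAGE_TOKENS + 1] + 1
        for s in run_starts:
            input_ids[row, s : s + NUM_IMAGE_TOKENS] = np.arange(
                FIRST_IMAGE_TOKEN_ID, FIRST_IMAGE_TOKEN_ID + NUM_IMAGE_TOKENS
            )
            img_attn_mask[row, s : s + NUM_IMAGE_TOKENS] = 1
    return input_ids, img_attn_mask
```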