Image in an arbitrary position in the input text (Plus multiple images as input)

#14
by Shun-onoo - opened

Hello. First, I appreciate you sharing this great work.

By default, your model always inserts the image at the beginning of the input text. However, I wanted to use text interleaved with an image as input, such as "This image {image comes here} is ...". So I made a wrapper class for the processor that accepts an image at an arbitrary position in the text. Here's the code. When using this wrapped processor, you can use <image> to specify the location of the image in the input text.
I tried this, and it seemed to work (it can't accept bounding boxes, though). Do you think this wrapper is reasonable and works correctly?
Also, I want to insert multiple images in the input, but I haven't figured out how. Do you have any plans to release code for multiple images?

Thank you in advance.

import numpy as np
import torch
from transformers import BatchFeature

class ProcessorWrapper:

    def __init__(self, processor):
        self.processor = processor

    def __call__(
        self,
        images = None,
        text = None,
        bboxes = None,
        num_image_tokens = 64,
        first_image_token_id = None,
        add_special_tokens = True,
        padding = False,
        truncation = None,
        max_length = None,
        stride = 0,
        pad_to_multiple_of = None,
        return_attention_mask = None,
        return_overflowing_tokens: bool = False,
        return_special_tokens_mask: bool = False,
        return_offsets_mapping: bool = False,
        return_token_type_ids: bool = False,
        return_length: bool = False,
        verbose: bool = True,
        return_tensors = None,
        **kwargs,
    ):
        """
        Preprocess text and image for Kosmos-2 model.

        Args:
            text (str): The text to be encoded. <image> specifies the location of the image embeddings in the text.
        """
        # add fake <image><image>...<image></image> to the text;
        # these tokens mark the location of the image embeddings
        # (the trailing space in the suffix is necessary to match the original behavior)
        text = insert_images(text, num_image_tokens=num_image_tokens, suffix='</image> ')
        text_encoding = self.processor.tokenizer(
            text=text,
            add_special_tokens=add_special_tokens,
            padding=padding,
            truncation=truncation,
            max_length=max_length,
            stride=stride,
            pad_to_multiple_of=pad_to_multiple_of,
            return_attention_mask=return_attention_mask,
            return_overflowing_tokens=return_overflowing_tokens,
            return_special_tokens_mask=return_special_tokens_mask,
            return_offsets_mapping=return_offsets_mapping,
            return_token_type_ids=return_token_type_ids,
            return_length=return_length,
            verbose=verbose,
            return_tensors=return_tensors,
            **kwargs,
        )

        # find the start of the image tokens
        input_ids = np.array(text_encoding['input_ids'])
        # 64003 is the id of the <image> tag in this tokenizer; the first match is the
        # begin-of-image tag itself, so add 1 to reach the first fake image token
        start_index = np.where(input_ids[0] == 64003)[0][0] + 1

        # Replace the fake <image> tokens with a range of dedicated image-slot ids
        # (only compute the default if the caller didn't pass first_image_token_id)
        if first_image_token_id is None:
            first_image_token_id = self.processor.tokenizer.unk_token_id + 1
        input_ids[:, start_index : (start_index + num_image_tokens)] = np.arange(
            first_image_token_id, first_image_token_id + num_image_tokens
        )

        # make image attention mask
        # which is zero except for the image tokens
        img_attn_mask = np.zeros_like(input_ids)
        img_attn_mask[:, start_index : (start_index + num_image_tokens)] = 1

        # process image itself
        image_encoding = self.processor.image_processor(images, return_tensors=return_tensors)

        # convert to the requested tensor type
        if return_tensors == 'pt':
            input_ids = torch.from_numpy(input_ids)
            img_attn_mask = torch.from_numpy(img_attn_mask)
        elif return_tensors is None:
            pass
        else:
            raise ValueError(f'Invalid return_tensors: {return_tensors}')

        # wrap everything up
        encoding = BatchFeature()
        encoding['input_ids'] = input_ids
        encoding['attention_mask'] = text_encoding['attention_mask']
        encoding['img_attn_mask'] = img_attn_mask
        encoding['pixel_values'] = image_encoding['pixel_values']

        return encoding

    def __getattr__(self, attr):
        # delegate anything not defined here to the wrapped processor
        return getattr(self.processor, attr)
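
(insert_images isn't reproduced in this post; below is a minimal sketch of what it presumably does, assuming the user-facing <image> placeholder should expand to one <image> tag, num_image_tokens fake <image> slots, and the given suffix)

def insert_images(text, num_image_tokens=64, suffix='</image> '):
    # one extra <image> up front for the begin-of-image tag, then the fake slots
    replacement = '<image>' * (num_image_tokens + 1) + suffix
    # only a single image is supported, so replace the first occurrence only
    return text.replace('<image>', replacement, 1)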

# Usage
model = ...
processor = ...
myprocessor = ProcessorWrapper(processor)
text = 'This image <image> is'
inputs = myprocessor(text=text, images=image, return_tensors="pt").to(model.device)
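
From there, generation should look something like the following; this is only a sketch, assuming model.generate accepts keyword names matching the encoding keys above (img_attn_mask in particular is specific to this repository's custom code):

generated_ids = model.generate(
    pixel_values=inputs["pixel_values"],
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    img_attn_mask=inputs["img_attn_mask"],  # assumed kwarg; matches the key above
    max_new_tokens=64,
)
print(myprocessor.batch_decode(generated_ids, skip_special_tokens=True)[0])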

Hi @Shun-onoo

The current implementation takes the original demo here as a reference, where there is a comment # TODO: input interleave image and text. And yes, this repository (made by me) currently only deals with a single image, and the image information is put at the beginning of the token sequence.

I don't want to promise anything about an approach I am not 100% sure of. You can, however, ask the Kosmos-2 authors here.

But from what I know (roughly), different inputs (image info + associated text) can be interleaved; for something like "This image {image comes here} is ...", however, I have to say I haven't seen this format mentioned.

ydshieh changed discussion status to closed
