KeyError: 'kosmos-2'

#1 opened by yingss

Thank you for the great work!

I am trying to run the example in the README, but I get KeyError: 'kosmos-2' after running model = AutoModelForVision2Seq.from_pretrained("microsoft/kosmos-2-patch14-224").

My transformers version is 4.33.3. Is this a version issue? If so, which version should I install?

Also, could you provide an input example with interleaved text and multiple images? I am not sure how to construct the input for a sequence that contains both text and multiple images.

Thank you for the help!

Hi, you have to use the latest dev version (installed from the main branch). There will be a release this week, if you can wait.

Also, could you provide an input example with interleaved text and multiple images? I am not sure how to construct the input for a sequence that contains both text and multiple images.

This is not shown explicitly in the paper, IIRC. The original Microsoft GitHub repository has some code related to this, but it is not easy to run it to see what format it uses.
The current Kosmos2Processor is designed to handle text alone or a single image with text, but not interleaved data.
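
For reference, the standard (non-interleaved) usage from the model card looks roughly like this (a minimal sketch; the snowman image URL is the example image from the model repository):

import requests
from PIL import Image
from transformers import AutoProcessor

# standard usage: a single image plus a text prompt
processor = AutoProcessor.from_pretrained("microsoft/kosmos-2-patch14-224")

url = "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.png"
image = Image.open(requests.get(url, stream=True).raw)

prompt = "<grounding> An image of"
inputs = processor(text=prompt, images=image, return_tensors="pt")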

I will see what I can provide for this in the next few days.

The release will take place on Thursday @yingss

Will this release support interleaved text and multiple images?

I am mostly interested in the capability of accepting interleaved text and multiple images that was showcased in Kosmos-1. However, I could not find a checkpoint for Kosmos-1 in the official repo or on Hugging Face. I am assuming kosmos-2 will have similar capabilities in terms of handling interleaved text and multiple images?

I am assuming kosmos-2 will have similar capabilities in terms of handling interleaved text and multiple images?

Kosmos-2 is indeed also trained on interleaved data, but the official demo never shows how this is used, as you can see:

https://github.com/microsoft/unilm/blob/7ae2ee53bf7fff85e730c72083b7e999b0b9ba44/kosmos-2/demo/gradio_app.py#L100C8-L100C9

In the Kosmos-1 paper, they mention this (see the attached screenshots, 1.png and 2.png).

I can provide a helper method to deal with this case, but for now it won't be part of the official release. I will post it here in the next comment.

Thank you so much!

Transformers 4.35.0 is now released on PyPI and works with Kosmos-2.
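
If in doubt about which version your environment actually picks up, a quick check:

import transformers

print(transformers.__version__)  # should be 4.35.0 or newer for Kosmos-2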

I need to take a final look again, but the following should work
(remember that this implementation is based on what I see in the paper, not on the original implementation!)

The helper function

import re

from transformers import BatchFeature


def process_interleaved_example(processor, prompt, images, placeholder="<i>", num_image_tokens=64, add_special_tokens=True, add_eos_token=False, return_tensors=None):

    # image token ids start right after the unk token id in the vocabulary
    first_image_token_id = processor.tokenizer.unk_token_id + 1

    # token ids for one image slot: <boi> + `num_image_tokens` image tokens + <eoi>
    image_input_ids = [processor.tokenizer.convert_tokens_to_ids(processor.boi_token)] + list(range(first_image_token_id, num_image_tokens + first_image_token_id)) + [processor.tokenizer.convert_tokens_to_ids(processor.eoi_token)]
    image_attention_mask = [1] * len(image_input_ids)
    # `-2`: not including `boi` and `eoi`
    image_embeds_position_mask = [0] + [1] * (len(image_input_ids) - 2) + [0]

    # split the prompt on the image placeholder, keeping the placeholder itself
    components = re.split(rf"({re.escape(placeholder)})", prompt)

    outputs = {"input_ids": [], "attention_mask": [], "image_embeds_position_mask": []}
    for component in components:
        if component != placeholder:
            # add text tokens: no special tokens here -> they are added at the end
            encoded = processor(text=component, add_special_tokens=False)
            for key in ["input_ids", "attention_mask"]:
                outputs[key].extend(encoded[key])
            outputs["image_embeds_position_mask"].extend([0] * len(encoded["input_ids"]))
        else:
            # add the tokens that act as the image placeholder
            outputs["input_ids"].extend(image_input_ids)
            outputs["attention_mask"].extend(image_attention_mask)
            outputs["image_embeds_position_mask"].extend(image_embeds_position_mask)

    if add_special_tokens:
        outputs["input_ids"] = [processor.tokenizer.bos_token_id] + outputs["input_ids"] + ([processor.tokenizer.eos_token_id] if add_eos_token else [])
        outputs["attention_mask"] = [1] + outputs["attention_mask"] + ([1] if add_eos_token else [])
        outputs["image_embeds_position_mask"] = [0] + outputs["image_embeds_position_mask"] + ([0] if add_eos_token else [])

    # `images` must be given in the same order as the placeholders appear in `prompt`
    outputs["pixel_values"] = processor.image_processor(images).pixel_values

    # add a batch dimension
    for k in ["input_ids", "attention_mask", "image_embeds_position_mask"]:
        outputs[k] = [outputs[k]]
    outputs = BatchFeature(data=outputs, tensor_type=return_tensors)

    return outputs

An example using it:

import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq


url_1 = "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.png"
image_1 = Image.open(requests.get(url_1, stream=True).raw)

url_2 = "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/two_dogs.jpg"
image_2 = Image.open(requests.get(url_2, stream=True).raw)

processor = AutoProcessor.from_pretrained("microsoft/kosmos-2-patch14-224")

# each `<i>` is an image placeholder; images are matched to placeholders in order
prompt = "<grounding> There are <i> two dogs want to play with <i> this lonely snowman."

# without an eos token appended
inputs = process_interleaved_example(processor, prompt, images=[image_1, image_2], return_tensors="pt")
print(inputs)

# with an eos token appended
inputs = process_interleaved_example(processor, prompt, images=[image_1, image_2], add_eos_token=True, return_tensors="pt")
print(inputs)

model = AutoModelForVision2Seq.from_pretrained("microsoft/kosmos-2-patch14-224")
outputs = model(**inputs)  # forward pass (not generation)
print(outputs[0].shape)

Thank you so much!!

Could you please provide the implementation for decoding?

@zhaominxiao

The code example in the model card should work well

https://huggingface.co/microsoft/kosmos-2-patch14-224

but let me know if there is anything missing

Thanks for your prompt response. Yes, the code can generate the output tensor. But when I tried to use the method from the README file to decode the output tensor, it raised the error "TypeError: PreTrainedTokenizerBase.decode() missing 1 required positional argument: 'token_ids'".

The decoding method I used is as follows.

import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq


url_1 = "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.png"
image_1 = Image.open(requests.get(url_1, stream=True).raw)

url_2 = "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/two_dogs.jpg"
image_2 = Image.open(requests.get(url_2, stream=True).raw)

prompt = "<grounding> There are <i> two dogs want to play with <i> this lonely snowman."
inputs = process_interleaved_example(processor, prompt, images=[image_1, image_2], return_tensors="pt")
print(inputs)

inputs = process_interleaved_example(processor, prompt, images=[image_1, image_2], add_eos_token=True, return_tensors="pt").to('cuda')
print(inputs)

outputs = model(**inputs)
print(outputs[0].shape)

generated_text = processor.decode(**outputs, skip_special_tokens=True)[0]
processed_text = processor.post_process_generation(generated_text, cleanup_and_extract=False)
processed_text, _ = processor.post_process_generation(generated_text)

Hi, first: I am not sure if you really intend to use model(**inputs) instead of model.generate.

model(**inputs) returns outputs as something like a dictionary, while processor.decode expects a list of token ids.

I would suggest you follow the code example, use model.generate, and see how it uses the outputs for decoding.

If you intend to use model(**inputs), you will have to do extra work to make it work, and I won't have the bandwidth to help with that.
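
For reference, a sketch of the generate-then-decode pattern from the model card, applied here to the interleaved inputs built by process_interleaved_example above (max_new_tokens is an arbitrary choice):

generated_ids = model.generate(
    pixel_values=inputs["pixel_values"],
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    image_embeds=None,
    image_embeds_position_mask=inputs["image_embeds_position_mask"],
    use_cache=True,
    max_new_tokens=128,
)
# decode the generated token ids, then post-process into text + grounded entities
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
processed_text, entities = processor.post_process_generation(generated_text)
print(processed_text)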

@ydshieh
Thanks for your help. I am totally OK with using model.generate. I checked the code example in the model card; it shows how to use one image and one text sequence to do visual question answering (or image captioning). My use case needs the model to consume interleaved text-and-image sequences. For example, to do few-shot learning I need to provide multiple examples before asking the "real" question, so I need to make sure the text and images are presented to Kosmos in order. I went through your helper function implementation, and my understanding is that, with the dictionary it returns, I can use its input_ids, attention_mask, and image_embeds_position_mask as the values of the corresponding parameters in model.generate. But I am not sure how I should set pixel_values and image_embeds. Should I set them to None?

@ydshieh
Oh, I think I got the answer. pixel_values is already assigned a value in the helper function. Regarding image_embeds, I think I should leave it as None.

Thank you very much!

Yes, usually you don't need image_embeds. Passing pixel_values is the usual case.

@ydshieh I used the process_interleaved_example function given above to process the images and text inputs, then used model.generate to generate the ids, which are then decoded and post-processed into the final output. I am getting results, but they are not good at all. I passed in a series of images taken from a video, and the descriptions I get back do not match them. Please let me know if I am applying the code incorrectly anywhere.

def run_example_kosmos(model, processor, image, prompt):
    inputs = process_interleaved_example(processor, prompt, images=image, add_eos_token=True, return_tensors="pt")
    generated_ids = model.generate(
      pixel_values=inputs["pixel_values"],
      input_ids=inputs["input_ids"],
      attention_mask=inputs["attention_mask"],
      image_embeds=None,
      image_embeds_position_mask=inputs["image_embeds_position_mask"],
      use_cache=True,
      max_new_tokens=300,
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    _processed_text = processor.post_process_generation(generated_text, cleanup_and_extract=False)
    processed_text, entities = processor.post_process_generation(generated_text)
    print(processed_text)
    return processed_text
