metadata

datasets:
  - laion/dalle-3-dataset
language:
  - en
tags:
  - art
  - image-to-text
  - image-captioning

DALL·E 3 Image prompt reverse-engineering

A pre-trained BLIP image-captioning model, fine-tuned on a mixture of laion/dalle-3-dataset and semi-automatically gathered (image, prompt) pairs from DALL·E 3. It takes a generated image as input and outputs a plausible prompt for generating such an image, which can then serve as a starting point for generating similar images.

⚠️ Disclaimer: This model is not intended for commercial use, as its training data includes images generated by DALL·E 3. It is provided for educational purposes only.

Usage:

Loading the model and preprocessor:

import torch
from transformers import BlipForConditionalGeneration, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"  # use a GPU when available

model = BlipForConditionalGeneration.from_pretrained("dblasko/blip-dalle3-img2prompt").to(device)
processor = AutoProcessor.from_pretrained("dblasko/blip-dalle3-img2prompt")

Inference example on an image from laion/dalle-3-dataset:

from datasets import load_dataset

dataset = load_dataset("laion/dalle-3-dataset", split="train[0%:1%]")  # small slice for a fast download in this toy example
img_index = 0  # any index within the loaded slice
example = dataset[img_index]  # indexing by integer returns a single row as a dict
image = example["image"]
caption = example["caption"]

inputs = processor(images=image, return_tensors="pt").to(device)
pixel_values = inputs.pixel_values

generated_ids = model.generate(pixel_values=pixel_values, max_length=50)
generated_caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

print(f"Generated caption: {generated_caption}\nReal caption: {caption}")
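
For repeated use, the inference steps above can be wrapped in a small helper. This is a minimal sketch and not part of the original model card: the function name `generate_prompt` and its parameters are hypothetical, and it assumes `model` and `processor` were loaded as shown in the loading step and that `image` is a PIL image.

```python
def generate_prompt(image, model, processor, device="cpu", max_length=50):
    """Return one candidate prompt for a single image.

    Hypothetical helper: `model` and `processor` are assumed to be the
    objects loaded earlier; `image` is assumed to be a PIL image.
    """
    # Preprocess the image into model-ready tensors on the target device.
    inputs = processor(images=image, return_tensors="pt").to(device)
    # Generate token ids, capped at max_length tokens.
    generated_ids = model.generate(
        pixel_values=inputs.pixel_values, max_length=max_length
    )
    # Decode the first (and only) sequence back to text.
    decoded = processor.batch_decode(generated_ids, skip_special_tokens=True)
    return decoded[0].strip()
```

With a local file, this could be called as `generate_prompt(Image.open("my_image.png").convert("RGB"), model, processor, device)` after `from PIL import Image`; the file name here is only a placeholder.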