HymanH/AITQE · Hugging Face

[2024.10.12] Released the inference code and pre-trained model of AITQE.

We propose the Adaptive Image-Text Quality Enhancer, AITQE, a model that dynamically assesses and enhances the quality of image-text pairs. The conventional method (a) discards low-quality samples from the raw data, reducing the amount of pretraining data, while our AITQE (b) enhances low-quality samples, retaining the same volume of data for MLLM pretraining.

Specifically, for low-quality pairs, such as those with low semantic similarity between modalities or subpar linguistic quality, AITQE performs text rewriting, generating high-quality text based on the input image and the raw low-quality text.
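
The difference between the two strategies can be sketched in a few lines of Python. This is an illustrative toy, not the AITQE implementation: the `Pair` record, its precomputed `score` and `rewritten` fields, and the threshold of 3 are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical record: in practice the score and the rewritten caption would
# come from the model; here they are precomputed toy values.
@dataclass
class Pair:
    image: str
    caption: str
    score: int      # assumed quality score, higher is better
    rewritten: str  # assumed model-generated replacement caption

THRESHOLD = 3  # assumed cutoff, not a value from the paper

def filter_pairs(pairs):
    """Conventional approach (a): drop low-quality pairs."""
    return [p for p in pairs if p.score >= THRESHOLD]

def enhance_pairs(pairs):
    """Enhancement approach (b): keep every pair, swapping in the rewritten
    caption wherever the original scores below the threshold."""
    return [
        p if p.score >= THRESHOLD
        else Pair(p.image, p.rewritten, p.score, p.rewritten)
        for p in pairs
    ]

pairs = [
    Pair("a.png", "A dog running on a beach", 5, "A dog running on a beach"),
    Pair("b.png", "img_0042 final", 1, "A red bicycle leaning against a brick wall"),
]

# Filtering shrinks the dataset; enhancement preserves its size.
print(len(filter_pairs(pairs)), len(enhance_pairs(pairs)))
```

Filtering returns fewer pairs than it was given, while enhancement returns the same number with the weak caption replaced.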

Use the code from GitHub:

python inference.py \
       --model_path /path/to/AITQE \
       --output_all \
       --gpu_id 0 \
       --image_path ./figs/test.png \
       --caption "Some random text to the image like this is a test"

and get the following output:

{"Recaption": "A man stands in front of a checklist of customer service questions, including 'Do you take each customer seriously?' and 'Do you qualify customers properly?'", "Overall Score": "2", "Overall Explanation": "The caption is vague and does not accurately describe the image or its content. It lacks detail and relevance to the checklist shown in the image.", "Text Quality Score": 3, "Text Quality Explanation": "The caption is grammatically correct but lacks clarity and relevance to the image. It is vague and does not provide a meaningful description.", "Image-Text Matching Score": 2, "Image-Text Matching Explanation": "The caption does not accurately describe the image, which features a checklist of customer service questions. The caption is unrelated to the content of the image.", "Object Detail Score": 2, "Object Detail Explanation": "The caption does not provide any details about the objects in the image, such as the checklist or the person in the background.", "Semantic Understanding Score": 2, "Semantic Understanding Explanation": "The caption fails to convey any understanding of the image's context or purpose, which is about customer service evaluation.", "Text/Chart Description Score": 2, "Text/Chart Description Explanation": "The caption does not describe the text in the image, which is a checklist of customer service questions."}
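
Downstream code will usually need to parse this JSON and decide which caption to keep. The sketch below assumes the script's output is available as a JSON string. The `choose_caption` helper and its `min_score` cutoff are hypothetical and not part of the released code. Because "Overall Score" appears as a string in the sample output while the other scores are integers, the helper casts it explicitly.

```python
import json

def choose_caption(raw_caption, output_json, min_score=3):
    """Return the raw caption if its overall score is high enough,
    otherwise fall back to the model's recaption.

    min_score is an assumed cutoff, not a value from the AITQE release.
    """
    result = json.loads(output_json)
    # "Overall Score" is a string in the sample output, so cast it to int.
    overall = int(result["Overall Score"])
    if overall >= min_score:
        return raw_caption
    return result["Recaption"]

# Abbreviated stand-in for the sample output shown above.
sample = json.dumps({
    "Recaption": "A man stands in front of a checklist of customer service questions.",
    "Overall Score": "2",
})

print(choose_caption("Some random text to the image like this is a test", sample))
```

With the assumed cutoff of 3, the sample's overall score of 2 causes the recaption to be selected instead of the raw text.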


Beyond Filtering:
Adaptive Image-Text Quality Enhancement for MLLM Pretraining
