---
license: apache-2.0
language:
- en
- zh
base_model:
- Qwen/Qwen2-VL-2B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- text-generation-inference
- label
---
![VSXzdfgvsdxf.png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/nNF_6UCnmgHKjNmLaA2QA.png)

# **Caption-Pro**

**Caption-Pro** is an image captioning and annotation model optimized for detailed, structured JSON output. Built on the Qwen2-VL-2B-Instruct vision-language model, with enhanced OCR and multilingual support, Caption-Pro extracts captions and annotations from images for integration into your applications.

#### Key Enhancements:

* **Advanced Image Understanding**: Fine-tuned on millions of annotated images, Caption-Pro delivers precise comprehension and interpretation of visual content.
* **Optimized for JSON Output**: Produces structured JSON containing captions and detailed annotations, suited to databases, APIs, and automation pipelines.
* **Enhanced OCR Capabilities**: Accurately extracts textual content from images in multiple languages, including English, Chinese, Japanese, Korean, Arabic, and more.
* **Multimodal Processing**: Seamlessly handles both image and text inputs, generating comprehensive annotations based on the provided image.
* **Multilingual Support**: Recognizes and processes text within images across various languages.
* **Secure and Optimized Model Weights**: Employs safetensors for efficient and secure model loading.

### How to Use

```python
import torch  # needed only for the optional bfloat16 configuration below
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the Caption-Pro model with optimized parameters
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/Caption-Pro", torch_dtype="auto", device_map="auto"
)

# Optional: enable FlashAttention 2 for faster inference and lower memory use
# (requires the flash-attn package and a supported GPU):
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "prithivMLmods/Caption-Pro",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# Load the default processor for Caption-Pro
processor = AutoProcessor.from_pretrained("prithivMLmods/Caption-Pro")

# Define the input messages with both an image and a text prompt
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://flux-generated.com/sample_image.jpeg",
            },
            {"type": "text", "text": "Provide detailed captions and annotations for this image in JSON format."},
        ],
    }
]

# Prepare the input for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move the inputs to the device the model was placed on by device_map="auto"
inputs = inputs.to(model.device)

# Generate the output (raise max_new_tokens if the JSON is cut off)
generated_ids = model.generate(**inputs, max_new_tokens=256)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
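
Because the model returns its JSON as plain text, downstream code usually needs to parse it before use. The following is a minimal sketch of one way to do that, not part of the Caption-Pro API: `extract_json` is a hypothetical helper, and the `caption` and `annotations` keys in the sample string are assumed for illustration only, since the exact output schema is not documented here.

```python
import json


def extract_json(raw: str):
    """Return the first JSON object or array found in `raw`, or None.

    Hypothetical helper: scans for the first position where a valid JSON
    value starts, so surrounding prose or code fences are ignored.
    """
    decoder = json.JSONDecoder()
    for index, char in enumerate(raw):
        if char in "{[":
            try:
                value, _end = decoder.raw_decode(raw, index)
                return value
            except json.JSONDecodeError:
                # Not valid JSON from this position; keep scanning.
                continue
    return None


# Illustrative string standing in for output_text[0]; the keys are assumptions.
sample = 'Result:\n{"caption": "A red bicycle", "annotations": [{"label": "bicycle"}]}'
parsed = extract_json(sample)
print(parsed["caption"] if parsed else "No JSON found")
```

In practice you would call `extract_json(output_text[0])` and handle the `None` case, for example when generation was truncated by `max_new_tokens`.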

### **Key Features**

1. **Annotation-Ready Training Data**  
   - Trained using a diverse dataset of annotated images to ensure high-quality structured output.

2. **Optical Character Recognition (OCR)**  
   - Robustly extracts and processes text from images in various languages and scripts.

3. **Structured JSON Output**  
   - Generates detailed captions and annotations in standardized JSON format for easy downstream integration.

4. **Image & Text Processing**  
   - Capable of handling both visual and textual inputs, delivering comprehensive and context-aware annotations.

5. **Conversational Annotation Generation**  
   - Supports multi-turn interactions, enabling detailed and iterative refinement of annotations.

6. **Secure and Efficient Model Weights**  
   - Uses the safetensors format, which avoids arbitrary code execution when loading weights and speeds up model loading.
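
The multi-turn refinement described in item 5 can be sketched by appending the model's previous reply and a new user request to the message list. This is an illustrative sketch only: `add_follow_up` is a hypothetical helper, the reply string is a placeholder, and it assumes the chat template accepts assistant turns in the same content format as the base Qwen2-VL template.

```python
def add_follow_up(messages, assistant_reply, follow_up_text):
    """Return a new message list with the prior reply and a new user turn.

    Hypothetical helper; the original list is left unmodified.
    """
    return messages + [
        {"role": "assistant", "content": [{"type": "text", "text": assistant_reply}]},
        {"role": "user", "content": [{"type": "text", "text": follow_up_text}]},
    ]


# First turn, in the same format as the usage example.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://flux-generated.com/sample_image.jpeg"},
            {"type": "text", "text": "Provide detailed captions and annotations for this image in JSON format."},
        ],
    }
]

# Placeholder standing in for the decoded output of the first generation.
first_reply = '{"caption": "placeholder"}'

conversation = add_follow_up(
    messages, first_reply, "Add any text visible in the image to the JSON."
)
print([turn["role"] for turn in conversation])
```

The extended list can then go through `processor.apply_chat_template` and `process_vision_info` exactly as in the usage example to generate the refined annotation.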

**Caption-Pro** streamlines the process of generating image captions and annotations, making it an ideal solution for applications that require detailed visual content analysis and structured data integration.