yuhangzang committed on
Commit af935d2 · verified · 1 Parent(s): 37dfcd9

Update README.md

Files changed (1): README.md (+18 -1)
README.md CHANGED
@@ -6,4 +6,21 @@ pipeline_tag: image-to-text
 tags:
 - multimodal
 - image caption
----
+library_name: transformers
+---
+
+# CapRL-3B
+## Introduction
+We are excited to introduce CapRL-3B, a lightweight 3B captioner that achieves perception capabilities comparable to Qwen2.5-VL-72B.
+
+This is the first study applying Reinforcement Learning with Verifiable Rewards to the
+open-ended and subjective image captioning task. Unlike traditional Supervised Fine-Tuning, which
+can lead to models memorizing a limited set of annotated captions, our method allows the model to
+explore and generate a broader range of creative and general descriptions.
+
+CapRL is a new training paradigm featuring a decoupled two-stage pipeline. The initial
+stage uses LVLMs to generate rich and accurate captions. Subsequently, the second stage evaluates
+caption quality by using a vision-free LLM to perform the QA task. We also created a specific QA
+curation pipeline to ensure the quality of the questions and answers used for the second stage.
+
+By employing our CapRL training framework, initializing with the Qwen2.5-VL-3B model, and using a carefully filtered 75K QA dataset as the training set, we obtained a highly capable captioner, CapRL-3B.
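The second-stage reward described above (a vision-free LLM answers curated questions using only the caption, and QA accuracy becomes the caption's reward) can be sketched as a toy. This is a minimal illustration, not the actual CapRL implementation: `answer_from_caption` is a hypothetical stand-in for the real vision-free LLM, and the QA items are invented.

```python
def answer_from_caption(caption: str, question: str, choices: list[str]) -> str:
    """Hypothetical stand-in for a vision-free LLM judge: it never sees the
    image, only the caption. Here it simply picks the first answer choice
    that is mentioned in the caption text."""
    for choice in choices:
        if choice.lower() in caption.lower():
            return choice
    return choices[0]  # fall back to the first choice if nothing matches


def caption_reward(caption: str, qa_items: list[dict]) -> float:
    """Verifiable reward for one caption: the fraction of curated QA items
    the caption-only judge answers correctly."""
    correct = 0
    for item in qa_items:
        pred = answer_from_caption(caption, item["question"], item["choices"])
        if pred == item["answer"]:
            correct += 1
    return correct / len(qa_items)


# Invented example data: a caption and two curated QA items about the image.
qa = [
    {"question": "What color is the frisbee?", "choices": ["red", "blue"], "answer": "red"},
    {"question": "What animal is shown?", "choices": ["cat", "dog"], "answer": "dog"},
]

good_caption = "A brown dog leaps for a red frisbee on a grassy field."
vague_caption = "A blue ball on a table."

print(caption_reward(good_caption, qa))   # → 1.0
print(caption_reward(vague_caption, qa))  # → 0.0
```

A caption that omits or misstates details the curated questions probe receives a lower reward, which is what pushes the policy toward richer, more accurate descriptions than SFT on a fixed caption set would.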