Update README.md

README.md CHANGED

@@ -6,4 +6,21 @@ pipeline_tag: image-to-text
 tags:
 - multimodal
 - image caption
----
+library_name: transformers
+---
+
+# CapRL-3B
+## Introduction
+We are excited to introduce CapRL-3B, a lightweight 3B captioner that achieves perception capabilities comparable to Qwen2.5-VL-72B.
+
+This is the first study to apply Reinforcement Learning with Verifiable Rewards (RLVR) to the open-ended and subjective task of image captioning. Unlike traditional Supervised Fine-Tuning, which can lead models to memorize a limited set of annotated captions, our method allows the model to explore and generate a broader range of creative and general descriptions.
+
+CapRL is a new training paradigm featuring a decoupled two-stage pipeline. The first stage uses LVLMs to generate rich and accurate captions; the second stage then evaluates caption quality by having a vision-free LLM answer questions about the image using only the caption. We also built a dedicated QA curation pipeline to ensure the quality of the questions and answers used in the second stage.
+
+By employing our CapRL training framework, initializing from the Qwen2.5-VL-3B model, and training on a carefully filtered 75K QA dataset, we obtained a highly capable captioner, CapRL-3B.
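
To make the second-stage reward described in the diff above concrete, below is a minimal sketch of verifiable caption scoring: a vision-free LLM answers curated questions given only the candidate caption, and QA accuracy serves as the reward. This is an illustration under assumptions, not the released training code; the multiple-choice format, the prompt wording, and the `answer_fn` callable are all hypothetical.

```python
# Illustrative sketch of a CapRL-style second-stage reward (not the released code).
# Assumptions: QA pairs are multiple-choice, and `answer_fn` wraps a vision-free
# LLM that returns a single choice letter for a text-only prompt.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QAPair:
    question: str
    choices: List[str]  # e.g. ["A. two dogs", "B. one dog", "C. no dogs"]
    answer: str         # gold choice letter, e.g. "B"


def caption_reward(caption: str, qa_pairs: List[QAPair],
                   answer_fn: Callable[[str], str]) -> float:
    """QA accuracy of a vision-free LLM that sees only the caption, in [0, 1]."""
    if not qa_pairs:
        return 0.0
    correct = 0
    for qa in qa_pairs:
        prompt = (
            f"Caption: {caption}\n"
            f"Question: {qa.question}\n"
            + "\n".join(qa.choices)
            + "\nAnswer with the choice letter only."
        )
        predicted = answer_fn(prompt)  # hypothetical LLM call
        correct += int(predicted.strip().upper().startswith(qa.answer.upper()))
    return correct / len(qa_pairs)
```

The reward is verifiable in the sense that it is checked against curated ground-truth answers rather than scored by a learned judge.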
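
Since the card sets `library_name: transformers` and CapRL-3B is initialized from Qwen2.5-VL-3B, inference should follow the standard Qwen2.5-VL pattern in transformers. A minimal sketch under that assumption; the `model_id` string and the example image path are placeholders, not confirmed by this commit.

```python
# Minimal captioning sketch, assuming CapRL-3B keeps the Qwen2.5-VL interface.
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "CapRL-3B"  # placeholder: substitute the actual Hub repo id
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")  # placeholder image path
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }
]
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
caption = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(caption)
```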