Text-to-Image · Diffusers · Safetensors · PEFT · image-to-image · image-editing · lora · reinforcement-learning · rubric-policy-optimization · auto-rubric
Instructions for using OpenEnvisionLab/Auto-Rubric-as-Reward with libraries, notebooks, and local apps.
- Libraries
  - Diffusers
How to use OpenEnvisionLab/Auto-Rubric-as-Reward with Diffusers:
pip install -U diffusers transformers accelerate peft
import torch
from diffusers import DiffusionPipeline

# switch device_map to "mps" for Apple devices
pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    dtype=torch.bfloat16,
    device_map="cuda",
)
# the repo ships two adapters; ARR-FLUX.1-dev is the text-to-image one
# (folder name taken from the model card's adapter listing)
pipe.load_lora_weights(
    "OpenEnvisionLab/Auto-Rubric-as-Reward",
    subfolder="ARR-FLUX.1-dev",
)

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt).images[0]
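The snippet above targets the FLUX.1-dev text-to-image adapter. A minimal sketch for the second adapter, the Qwen-Image-Edit one, follows; it assumes the `ARR-Qwen-Image-Edit/` folder named on the model card, that your diffusers version resolves the adapter via the `subfolder` argument, and a placeholder input image:

```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import load_image

# Qwen/Qwen-Image-Edit is the second base model listed on the card
pipe = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image-Edit",
    dtype=torch.bfloat16,
    device_map="cuda",
)
# adapter folder name taken from the model card; adjust if the layout differs
pipe.load_lora_weights(
    "OpenEnvisionLab/Auto-Rubric-as-Reward",
    subfolder="ARR-Qwen-Image-Edit",
)

source = load_image("input.png")  # placeholder path, not from the card
edited = pipe(image=source, prompt="make the sky a cold, muted blue").images[0]
```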
- Notebooks
  - Google Colab
  - Kaggle
- Local Apps
  - Draw Things
  - DiffusionBee
Ferry1231 committed
Commit · 80410e0
1 Parent(s): ce0fdd0
update model card
README.md CHANGED
@@ -1,6 +1,5 @@
 ---
-
-## license: apache-2.0
+license: apache-2.0
 library_name: diffusers
 tags:
 - text-to-image
@@ -15,6 +14,7 @@ tags:
 base_model:
 - black-forest-labs/FLUX.1-dev
 - Qwen/Qwen-Image-Edit
+---
 
 # ARR-RPO
 
@@ -24,8 +24,8 @@
 
 ARR-RPO provides two LoRA adapters trained with **Auto-Rubric as Reward (ARR)** and **Rubric Policy Optimization (RPO)** for visual generation:
 
--
--
+- **`ARR-FLUX.1-dev/`**: a LoRA adapter for FLUX.1-dev text-to-image generation.
+- **`ARR-Qwen-Image-Edit/`**: a LoRA adapter for Qwen-Image-Edit instruction-guided image editing.
 
 ARR-RPO uses a frozen VLM judge conditioned on explicit auto-generated rubrics. During RPO training, two candidate outputs are sampled for the same prompt or edit instruction, the ARR judge selects the preferred output, and the preferred/dispreferred candidates receive binary rewards. The goal is to improve prompt faithfulness, visual quality, compositional alignment, and edit fidelity without training a separate scalar reward model.
 
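The RPO description in the updated card maps onto a short sketch. This is an illustrative outline only, not the authors' code: `policy`, `arr_judge`, and their methods (`generate`, `prefers`, `update`) are hypothetical placeholders for the generator being tuned, the frozen rubric-conditioned VLM judge, and whatever policy update RPO applies.

```python
# Sketch of one RPO step as described in the model card.
# All objects and methods here are hypothetical placeholders.

def rpo_step(policy, arr_judge, prompt, rubric):
    # Sample two candidate outputs for the same prompt or edit instruction.
    candidate_a = policy.generate(prompt)
    candidate_b = policy.generate(prompt)

    # The frozen VLM judge, conditioned on the auto-generated rubric,
    # selects the preferred candidate; no scalar reward model is trained.
    a_preferred = arr_judge.prefers(rubric, prompt, candidate_a, candidate_b)

    # Preferred/dispreferred candidates receive binary rewards.
    reward_a = 1.0 if a_preferred else 0.0
    reward_b = 1.0 - reward_a

    # Update the policy from the paired candidates and their rewards.
    policy.update(prompt, [(candidate_a, reward_a), (candidate_b, reward_b)])
```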