Yukang committed (verified)
Commit 7b163bc · 1 Parent(s): 8f2be92

Update README.md

Files changed (1):
  1. README.md +4 -0
README.md CHANGED
@@ -22,6 +22,7 @@ tags:
  <p>
  <strong>LongVILA-R1-7B</strong> supports both <u>multiple-choice</u> questions and <u>open-ended</u> questions. It can switch between thinking and non-thinking modes.<br>
  <strong>LongVILA-R1-7B</strong> demonstrates strong performance in long video reasoning, achieving <strong>70.7%</strong> on VideoMME (w/ sub.) and surpassing Gemini-1.5-Pro across diverse reasoning tasks.<br>
+ <strong>LongVILA-R1-7B</strong> supports processing up to <strong>2,048</strong> video frames per video, with configurable FPS settings.<br>
  <strong>Long-RL</strong> is a codebase that accelerates long video RL training by up to <strong>2.1×</strong> through its MR-SP system. It supports RL training on image, video, and omni inputs across VILA, Qwen/Qwen-VL, and diffusion models.
  </p>
@@ -48,6 +49,9 @@ from transformers import AutoModel
 model_path = "Efficient-Large-Model/LongVILA-R1-7B"
 model = AutoModel.from_pretrained(model_path, trust_remote_code=True, device_map="auto")
 
+# You can disable FPS control by setting it to 0, and customize the number of processed num_video_frames as desired.
+#model.config.num_video_frames, model.config.fps = 512, 0
+
 use_thinking = True # Switching between thinking and non-thinking modes
 system_prompt_thinking = "You are a helpful assistant. The user asks a question, and then you solves it.\n\nPlease first think deeply about the question based on the given video, and then provide the final answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>.\n\n Question: {question}"
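For reference, a minimal sketch of how the thinking/non-thinking switch shown in the diff might be applied when building the final prompt. The system prompt string is copied verbatim from the README; the `question` value is a made-up placeholder, and the pass-through behavior of the non-thinking path is an assumption, not confirmed by the commit.

```python
# Thinking-mode system prompt, copied verbatim from the README diff above
# (including its original wording).
system_prompt_thinking = (
    "You are a helpful assistant. The user asks a question, and then you solves it.\n\n"
    "Please first think deeply about the question based on the given video, and then "
    "provide the final answer. The reasoning process and answer are enclosed within "
    "<think> </think> and <answer> </answer> tags, respectively, i.e., "
    "<think> reasoning process here </think> <answer> answer here </answer>.\n\n"
    " Question: {question}"
)

use_thinking = True  # Switching between thinking and non-thinking modes

# Hypothetical question, for illustration only.
question = "What event happens at the end of the video?"

# In thinking mode the question is wrapped in the system prompt; otherwise it is
# passed through unchanged (an assumed behavior for the non-thinking path).
prompt = system_prompt_thinking.format(question=question) if use_thinking else question
```

Setting `use_thinking = False` would then skip the `<think>`/`<answer>` scaffolding entirely, which matches the mode switch the README describes.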