Upload 3 files

Files changed:
- CASES.md (+7 -17)
- README.md (+35 -21)
- README_zh.md (+27 -14)

CASES.md (changed)
@@ -1490,14 +1490,13 @@ $$\boxed{(111)}$$
 
 ```
 <think>
-To identify the objects predominantly visible in the residential area,
 
-The goal is to
 </think>
 
 <tool_call>
-{"name": "python", "arguments": {"code": "import cv2\nimport os\nimport numpy as np\nfrom uuid import uuid4\n\n# Load the image\nimage_path = \"/mnt/data/images/P2504_0017.png\"\nimage = cv2.imread(image_path)\n\n# Define the coordinates for
-</tool_call>
 ```
 
 
@@ -1512,19 +1511,12 @@ The goal is to analyze the residential area in detail. To improve visibility, I
 
 ```
 <think>
-The goal of the task is to identify the
-
-### Step 1: Cropping and Zooming
-I will crop the image to focus on the residential area and then apply a zoom effect to enhance the details. This will help in identifying the objects more clearly.
 
-
-After cropping and zooming, I will analyze the image to identify the predominant objects in the residential area. The cropped image will likely show houses, trees, and possibly other structures or features typical of a residential neighborhood.
-
-### Step 3: Final Answer
-Based on the analysis of the cropped and zoomed image, I will provide a detailed answer to the user's question about the predominant objects in the residential area.
 </think>
 
-In the residential area,
 ```
 
 
@@ -1646,9 +1638,7 @@ After re-evaluating the image, it's evident that the blue tag with the number is
 
 ```
 <think>
-
-
-The goal is to extract the number from the blue tag, which is located near a hanging plant. To improve visibility, I will crop the region around the tag and zoom in. Based on the provided image, the tag is located near the center-left of the image, partially hidden by flowers. I will refine this region to increase clarity and focus on the relevant portion of the image.
 </think>
 
 D
Lines 1490-1502 (after):
 
 ```
 <think>
+To identify the objects predominantly visible in the residential area, it's essential to focus on the specific region of the image that contains the residential area. Cropping and zooming are critical because they allow us to isolate the relevant portion of the image, reducing distractions and enhancing the visibility of the objects we are interested in. In this case, the residential area is located in the lower right section of the image, and by cropping this region, we can better observe the details of the houses, roads, and other structures.
 
+The goal of the task is to identify the objects predominantly visible in the residential area. To improve visibility, I will crop the region around the residential area and zoom in. Based on the provided image, the residential area is located in the lower right section. I will refine this region to increase clarity and focus on the relevant portion of the image.
 </think>
 
 <tool_call>
+{"name": "python", "arguments": {"code": "import cv2\nimport os\nimport numpy as np\nfrom uuid import uuid4\n\n# Load the image\nimage_path = \"/mnt/data/images/P2504_0017.png\"\nimage = cv2.imread(image_path)\n\n# Define the coordinates for the residential area (inferred from the image)\nx1, y1, x2, y2 = 256, 320, 512, 512\n\n# Crop the image\ncropped_image = image[y1:y2, x1:x2]\n\n# Apply a zoom effect by resizing the cropped image\nzoom_factor = 2\nzoomed_image = cv2.resize(cropped_image, None, fx=zoom_factor, fy=zoom_factor, interpolation=cv2.INTER_LINEAR)\n\n# Save the processed image with a random suffix\ntemp_dir = \"/mnt/data/images/temp/\"\nrandom_suffix = uuid4().hex\nfilename = os.path.basename(image_path)\nprocessed_filename = f\"{os.path.splitext(filename)[0]}_{random_suffix}{os.path.splitext(filename)[1]}\"\nprocessed_path = os.path.join(temp_dir, processed_filename)\ncv2.imwrite(processed_path, zoomed_image)\n\n# Print the path of the saved cropped image\nprint(processed_path)"}}\n</tool_call>
 ```
 
 
Lines 1511-1522 (after):
 
 ```
 <think>
+The goal of the task is to identify the objects predominantly visible in the residential area of the image. To achieve this, I first cropped the image to focus on the residential area, which is located in the lower right portion of the image. However, the initial crop did not effectively highlight the residential area, as it included a larger portion of the image that was not relevant. Therefore, I refined the crop to focus more precisely on the residential area.
 
+After applying the refined crop, the image now clearly shows the residential area. The objects predominantly visible in this area include houses, roads, and some greenery. The houses are arranged in a grid-like pattern, with roads connecting them. The greenery includes trees and grassy areas, which are interspersed among the houses. This refined view allows for a more accurate analysis of the objects in the residential area.
 </think>
 
+In the residential area, the objects predominantly visible are houses, roads, and some greenery. The houses are arranged in a grid-like pattern, with roads connecting them. The greenery includes trees and grassy areas, which are interspersed among the houses.
 ```
 
 
Lines 1638-1644 (after):
 
 ```
 <think>
+The goal of the task is to identify the number written on the blue tag. To achieve this, I first cropped the image based on the provided bounding box. However, upon reviewing the cropped image, it became clear that the blue tag was not prominently visible in this region. The cropped area primarily focused on the wall and some plants, which did not include the blue tag. Therefore, I refined the crop further to focus specifically on the blue tag.\n\nAfter applying the second crop, the blue tag is now clearly visible. The number written on the blue tag is "23". Therefore, the correct answer is D.
 </think>
 
 D
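The crop-and-zoom operation that recurs in these traces can be sketched without OpenCV. The following is a minimal pure-NumPy equivalent (nearest-neighbour upsampling via `np.repeat` in place of `cv2.resize`); the coordinates are the illustrative ones from the trace, and the blank test array is only a stand-in for the actual image file.

```python
import numpy as np

def crop_and_zoom(image: np.ndarray, x1: int, y1: int, x2: int, y2: int,
                  zoom_factor: int = 2) -> np.ndarray:
    """Crop image[y1:y2, x1:x2] and enlarge it by an integer factor
    using nearest-neighbour upsampling on both spatial axes."""
    cropped = image[y1:y2, x1:x2]
    zoomed = np.repeat(np.repeat(cropped, zoom_factor, axis=0),
                       zoom_factor, axis=1)
    return zoomed

# Stand-in for the 512x512 3-channel image referenced in the trace.
image = np.zeros((512, 512, 3), dtype=np.uint8)
region = crop_and_zoom(image, x1=256, y1=320, x2=512, y2=512, zoom_factor=2)
print(region.shape)  # (384, 512, 3): the 192x256 crop doubled in each dimension
```

Nearest-neighbour keeps the sketch dependency-free; the traces use bilinear interpolation (`cv2.INTER_LINEAR`), which smooths rather than duplicates pixels.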
README.md (changed)
@@ -1,27 +1,42 @@
----
-license: apache-2.0
----
 # S1-VL-32B: Scientific Multimodal Reasoning Model
 
 [中文版](./README_zh.md) | [English](./README.md)
 
 ## 🔬 Introduction
 
-**S1-VL-32B** is a multimodal large language model for scientific domains, developed by the ScienceOne team at the Chinese Academy of Sciences. It natively supports two reasoning paradigms — **
 
-- **
-- **Thinking with Images
 
-We have established a **cross-disciplinary data processing pipeline** that conducts multi-dimensional utility evaluation and filtering of visual reasoning trajectories to ensure the quality of training
 
-- **Stage 1**: Large-scale multimodal instruction data spanning multiple disciplines — including **mathematics, physics, chemistry, astronomy, earth sciences, and biology** — is used for mixed training to enhance the model's scientific visual understanding and logical reasoning abilities, laying a solid foundation for academic figure Q&A, medical image analysis, chemical structure recognition, and related tasks.
-- **Stage 2**: The **Thinking with Images** reasoning paradigm is introduced. Through high-quality **scientific reasoning data annealing**, the model acquires the ability to perform **image operations via code** during inference. This approach yields particularly outstanding performance in scenarios requiring fine-grained image analysis, with notable strengths in interpreting dense scientific charts, high-resolution remote sensing imagery, microscopic images, and complex visual scenes such as astronomical observation data.
 
 ## 📂 Model Weights
 
 | Model | Parameters | HuggingFace | ModelScope |
 |-------|-----------|-------------|------------|
-| S1-VL-32B | 32B | 🤗 [Download](https://huggingface.co/ScienceOne-AI/S1-VL-32B) | 🤖 [Download](https://modelscope.cn/models/ScienceOne-AI/S1-VL-32B) |
 
 ## 🏆 Evaluation Results
 
@@ -56,7 +71,7 @@ pip install qwen-vl-utils==0.0.14
 ### 2. Start the vLLM Service
 
 ```bash
-vllm serve ScienceOne-AI/S1-VL-32B \
     --tensor-parallel-size 4 \
     --max-model-len 32768 \
     --limit-mm-per-prompt image=15 \
@@ -78,7 +93,7 @@ with open("path/to/your/image.png", "rb") as f:
     image_data = base64.b64encode(f.read()).decode("utf-8")
 
 response = client.chat.completions.create(
-    model="ScienceOne-AI/S1-VL-32B",
     messages=[
         {
             "role": "user",
@@ -88,8 +103,7 @@ response = client.chat.completions.create(
             ],
         }
     ],
-    temperature=0.
-    top_p=0.95,
     max_tokens=16384,
 )
 
@@ -167,14 +181,14 @@ print(final["content"])
 
 ## 📄 Citation
 
-If you use S1-VL-32B in your research, please cite
 
 ```latex
-@
-title
-author
-
-
 }
 ```
 
@@ -184,4 +198,4 @@ This project is released under the Apache 2.0 License.
 
 ## 🙏 Acknowledgements
 
-We thank the open-source communities and pioneering works of [Qwen3-VL](https://modelscope.cn/collections/Qwen3-VL-5c7a94c8cb144b) and [AIO Sandbox](https://github.com/agent-infra/sandbox) for laying the foundation for the scientific multimodal reasoning research behind S1-VL-32B.
Lines 1-42 (after):
 # S1-VL-32B: Scientific Multimodal Reasoning Model
 
 [中文版](./README_zh.md) | [English](./README.md)
 
 ## 🔬 Introduction
 
+**S1-VL-32B** is a multimodal large language model for scientific domains, developed by the ScienceOne AI team at the Chinese Academy of Sciences. It natively supports two reasoning paradigms — **Scientific Reasoning** and **Thinking with Images** — and achieves state-of-the-art performance across multiple mainstream scientific multimodal evaluation benchmarks.
 
+- **Scientific Reasoning**: Chain-of-thought-based multimodal scientific reasoning, designed for the analysis and solving of complex, multi-step problems.
+- **Thinking with Images**: Enables the model to actively invoke code tools during the reasoning process to perform image operations — including cropping, zooming, image enhancement, bounding box annotation, and keypoint marking — before generating responses.
 
+We have established a **cross-disciplinary data processing pipeline** that conducts multi-dimensional utility evaluation and filtering of visual reasoning trajectories to ensure the quality of the reasoning trajectories used for training.
+
+<div align="center">
+<img src="./image/data_pipeline.png"/>
+</div>
+
+We adopt a **four-stage post-training procedure** to progressively unlock the scientific reasoning capabilities of S1-VL-32B:
+
+- **Stage 1 - Scientific Reasoning SFT**: Large-scale multimodal instruction data spanning multiple disciplines — including **mathematics, physics, chemistry, astronomy, earth sciences, and biology** — is used for mixed training to enhance the model's scientific visual understanding and logical reasoning abilities, laying a solid foundation for academic figure Q&A, medical image analysis, chemical structure recognition, and related tasks.
+- **Stage 2 - Thinking-with-Images Cold-Start SFT**: The **Thinking with Images** reasoning paradigm is introduced. Through joint training with high-quality **scientific reasoning curriculum learning data** and image-thinking data, the model acquires the ability to perform **image operations via code** during inference. This approach yields particularly outstanding performance in interpreting dense scientific charts, high-resolution remote sensing imagery, microscopic images, and astronomical observation data (S1-VL-32B-SFT).
+- **Stage 3 - Scientific Reasoning RL**: Based on the **SAPO algorithm** and a multi-task scientific reward function, reinforcement learning is applied to challenging scientific multimodal reasoning samples to push beyond the performance ceiling of the SFT stage.
+- **Stage 4 - Thinking-with-Images RL**: Based on the **SAPO algorithm** and a four-dimensional composite reward function, the model's image operation invocation timing and quality are further optimized, enabling stable and efficient multi-round visual reasoning (S1-VL-32B-RL).
+
+<div align="center">
+<img src="./image/s1-vl-training-pipeline.png"/>
+</div>
+
+🔥 **[NEW]** Technical report released: [S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images](https://arxiv.org/abs/2604.21409)
+🔥 **[NEW]** Stage 3 and Stage 4 reinforcement learning training added; [S1-VL-32B-RL](https://huggingface.co/ScienceOne-AI/S1-VL-32B-RL) model weights updated.
 
 
 ## 📂 Model Weights
 
 | Model | Parameters | HuggingFace | ModelScope |
 |-------|-----------|-------------|------------|
+| S1-VL-32B-SFT | 32B | 🤗 [Download](https://huggingface.co/ScienceOne-AI/S1-VL-32B) | 🤖 [Download](https://modelscope.cn/models/ScienceOne-AI/S1-VL-32B) |
+| S1-VL-32B-RL | 32B | 🤗 [Download](https://huggingface.co/ScienceOne-AI/S1-VL-32B-RL) | 🤖 [Download](https://modelscope.cn/models/ScienceOne-AI/S1-VL-32B-RL) |
+
 
 ## 🏆 Evaluation Results
 
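The Thinking-with-Images loop hinges on pulling the `<tool_call>` block out of the model's output and executing its code, as seen in the CASES.md traces. Below is a minimal, hypothetical extraction helper; the tag format and JSON payload follow the traces, but the function name and regex are assumptions, not part of the released code.

```python
import json
import re

def extract_tool_call(model_output: str):
    """Pull the JSON payload out of a <tool_call>...</tool_call> block,
    matching the trace format in CASES.md; returns None if absent."""
    match = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>",
                      model_output, re.DOTALL)
    if match is None:
        return None
    return json.loads(match.group(1))

# Shape mirrors the traces: a <think> block followed by a python tool call.
output = ('<think>Crop the region.</think>\n<tool_call>\n'
          '{"name": "python", "arguments": {"code": "print(1)"}}\n</tool_call>')
call = extract_tool_call(output)
print(call["name"])  # python
```

In a full agent loop, the extracted `arguments["code"]` would be run in a sandbox and its stdout fed back to the model as the tool result.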
Lines 71-77 (after):
 ### 2. Start the vLLM Service
 
 ```bash
+vllm serve ScienceOne-AI/S1-VL-32B-RL \
     --tensor-parallel-size 4 \
     --max-model-len 32768 \
     --limit-mm-per-prompt image=15 \
Lines 93-99 (after):
     image_data = base64.b64encode(f.read()).decode("utf-8")
 
 response = client.chat.completions.create(
+    model="ScienceOne-AI/S1-VL-32B-RL",
     messages=[
         {
             "role": "user",
Lines 103-109 (after):
             ],
         }
     ],
+    temperature=0.2,
     max_tokens=16384,
 )
 
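The client snippet in this diff is truncated by the viewer, so here is a self-contained sketch of how the base64 image payload in such calls is typically assembled for an OpenAI-compatible endpoint. The inline data-URL shape is an assumption based on the standard `image_url` content format, not taken verbatim from this README.

```python
import base64

def build_image_message(image_bytes: bytes, question: str) -> dict:
    """Assemble one user message pairing a question with a base64-encoded
    image, in the content-list shape used by OpenAI-compatible chat APIs."""
    image_data = base64.b64encode(image_bytes).decode("utf-8")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            # Inline data URL; PNG is assumed here for illustration.
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_data}"}},
        ],
    }

msg = build_image_message(b"\x89PNG...", "What objects are visible?")
print(msg["role"])  # user
```

The resulting dict goes straight into the `messages` list of `client.chat.completions.create(...)` as shown in the hunks above this one in the original README.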
Lines 181-194 (after):
 
 ## 📄 Citation
 
+If you use S1-VL-32B in your research, please cite:
 
 ```latex
+@article{li2026s1vl,
+  title   = {S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images},
+  author  = {Li, Qingxiao and Xu, Lifeng and Wang, QingLi and Bai, Yudong and Ou, Mingwei and Hu, Shu and Xu, Nan},
+  journal = {arXiv preprint arXiv:2604.21409},
+  year    = {2026},
 }
 ```
 
Lines 198-201 (after):
 
 ## 🙏 Acknowledgements
 
+We thank the open-source communities and pioneering works of [Qwen3-VL](https://modelscope.cn/collections/Qwen3-VL-5c7a94c8cb144b) and [AIO Sandbox](https://github.com/agent-infra/sandbox) for laying the foundation for the scientific multimodal reasoning research behind S1-VL-32B.
README_zh.md (changed)
@@ -3,20 +3,33 @@
 [中文版](./README_zh.md) | [English](./README.md)
 
 ## 🔬 模型简介
-**S1-VL-32B** 是由中国科学院 “磐石 · 科学基础大模型” ScienceOne 团队研发的面向科学领域的多模态大语言模型,原生支持 **
-- **
 - **Thinking with Images 模式**:允许模型在思考过程中主动调用代码工具进行图像操作(包括裁剪、放缩、图像增强、画框标注、描点标记等)再生成回答。
 
-
-
-
 
 
 ## 📂 模型权重
 
 | 模型名称 | 参数量 | HuggingFace | ModelScope |
 |--------|------|-------------|------------|
-| S1-VL-32B | 32B | 🤗 [下载](https://huggingface.co/ScienceOne-AI/S1-VL-32B) | 🤖 [下载](https://modelscope.cn/models/ScienceOne-AI/S1-VL-32B) |
 
 
 ## 🏆 评测结果
@@ -31,7 +44,7 @@ S1-VL-32B 在上述评测中展现出突出的综合竞争力。在**科学多
 
 ## 🧠 案例展示
 
-以下展示 S1-VL-32B 在 **Thinking with Images** 模式下的推理案例。S1-VL-32B在处理一张低分辨率的颈部CT图像
 
 <div align="center">
 <img src="./image/s1-vl-32b-twi.png"/>
@@ -71,7 +84,7 @@ with open("path/to/your/image.png", "rb") as f:
     image_data = base64.b64encode(f.read()).decode("utf-8")
 
 response = client.chat.completions.create(
-    model="ScienceOne-AI/S1-VL-32B",
     messages=[
         {
             "role": "user",
@@ -159,13 +172,13 @@ print(final["content"])
 
 ## 📄 引用
 
-如果您在研究中使用了 S1-VL-32B,欢迎引用
 ```latex
-@
-title
-author
-
-
 }
 ```
 
Lines 3-35 (after):
 [中文版](./README_zh.md) | [English](./README.md)
 
 ## 🔬 模型简介
+**S1-VL-32B** 是由中国科学院 “磐石 · 科学基础大模型” ScienceOne AI 团队研发的面向科学领域的多模态大语言模型,原生支持 **Scientific Reasoning(多模态推理)** 与 **Thinking with Images(图像思考)** 两种推理范式,在多项主流科学多模态评测基准上达到当前最优水平。
+- **Scientific Reasoning 模式**:基于思维链的科学多模态推理,适用于复杂多步问题的分析与求解。
 - **Thinking with Images 模式**:允许模型在思考过程中主动调用代码工具进行图像操作(包括裁剪、放缩、图像增强、画框标注、描点标记等)再生成回答。
 
+
+我们建立**跨学科体系的数据处理管道**,对视觉推理轨迹进行多维度效用评估与筛选,确保用于训练的推理轨迹的质量。我们采用**四阶段渐进式后训练流程**,逐步解锁 S1-VL-32B 模型的科学推理能力:
+
+- **Stage 1 - 科学推理能力SFT**:基于涵盖**数理化天地生**等多学科的大规模多模态指令数据进行混合训练,提升模型科学视觉理解和逻辑推理能力,使模型在学术图像问答、医学影像分析、化学结构识别等方面奠定坚实基础;
+- **Stage 2 - 图像操作冷启动SFT**:引入 **Thinking with Images** 推理范式,混合高难度**科学推理课程学习数据**与图像思维数据进行联合训练,使模型具备在推理过程中通过代码进行**图像操作**的能力,尤其擅长解读密集科学图表、高分辨率遥感图像、显微图像及天文观测数据等复杂视觉场景(S1-VL-32B-SFT);
+- **Stage 3 - 科学推理强化学习**:基于**SAPO算法**与多任务科学奖励函数,对困难科学多模态推理样本进行强化学习,突破SFT阶段的性能边界;
+- **Stage 4 - 图像操作强化学习**:基于**SAPO算法**与四维复合奖励函数,进一步优化模型的图像操作调用时机与质量,实现稳定高效的多轮视觉推理(S1-VL-32B-RL)。
+
+<div align="center">
+<img src="./image/s1-vl-training-pipeline.png"/>
+</div>
+
+🔥 **[NEW]** 技术报告已发布:[S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images](https://arxiv.org/abs/2604.21409)
+🔥 **[NEW]** 补充了 Stage3、Stage4 两阶段强化学习训练,更新了 [S1-VL-32B-RL](https://huggingface.co/ScienceOne-AI/S1-VL-32B-RL) 模型权重。
 
 
 ## 📂 模型权重
 
 | 模型名称 | 参数量 | HuggingFace | ModelScope |
 |--------|------|-------------|------------|
+| S1-VL-32B-SFT | 32B | 🤗 [下载](https://huggingface.co/ScienceOne-AI/S1-VL-32B) | 🤖 [下载](https://modelscope.cn/models/ScienceOne-AI/S1-VL-32B) |
+| S1-VL-32B-RL | 32B | 🤗 [下载](https://huggingface.co/ScienceOne-AI/S1-VL-32B-RL) | 🤖 [下载](https://modelscope.cn/models/ScienceOne-AI/S1-VL-32B-RL) |
+
 
 
 ## 🏆 评测结果
Lines 44-50 (after):
 
 ## 🧠 案例展示
 
+以下展示 S1-VL-32B 在 **Thinking with Images** 模式下的推理案例。S1-VL-32B 在处理一张低分辨率的颈部CT图像时,在思考过程中主动调用代码工具,对目标区域进行**裁剪与放大**,获取更清晰的局部图像后,再结合模型内部知识完成推理。
 
 <div align="center">
 <img src="./image/s1-vl-32b-twi.png"/>
Lines 84-90 (after):
     image_data = base64.b64encode(f.read()).decode("utf-8")
 
 response = client.chat.completions.create(
+    model="ScienceOne-AI/S1-VL-32B-RL",
     messages=[
         {
             "role": "user",
Lines 172-184 (after):
 
 ## 📄 引用
 
+如果您在研究中使用了 S1-VL-32B,欢迎引用:
 ```latex
+@article{li2026s1vl,
+  title   = {S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images},
+  author  = {Li, Qingxiao and Xu, Lifeng and Wang, QingLi and Bai, Yudong and Ou, Mingwei and Hu, Shu and Xu, Nan},
+  journal = {arXiv preprint arXiv:2604.21409},
+  year    = {2026},
 }
 ```
 