Upload 3 files

Files changed:
- CASES.md (+7 -17)
- README.md (+35 -21)
- README_zh.md (+27 -14)

CASES.md (changed)
@@ -1490,14 +1490,13 @@ $$\boxed{(111)}$$
 
 ```
 <think>
-To identify the objects predominantly visible in the residential area,
 
-The goal is to
 </think>
 
 <tool_call>
-{"name": "python", "arguments": {"code": "import cv2\nimport os\nimport numpy as np\nfrom uuid import uuid4\n\n# Load the image\nimage_path = \"/mnt/data/images/P2504_0017.png\"\nimage = cv2.imread(image_path)\n\n# Define the coordinates for
-</tool_call>
 ```
 
 
@@ -1512,19 +1511,12 @@ The goal is to analyze the residential area in detail. To improve visibility, I
 
 ```
 <think>
-The goal of the task is to identify the
-
-### Step 1: Cropping and Zooming
-I will crop the image to focus on the residential area and then apply a zoom effect to enhance the details. This will help in identifying the objects more clearly.
 
-
-After cropping and zooming, I will analyze the image to identify the predominant objects in the residential area. The cropped image will likely show houses, trees, and possibly other structures or features typical of a residential neighborhood.
-
-### Step 3: Final Answer
-Based on the analysis of the cropped and zoomed image, I will provide a detailed answer to the user's question about the predominant objects in the residential area.
 </think>
 
-In the residential area,
 ```
 
 
@@ -1646,9 +1638,7 @@ After re-evaluating the image, it's evident that the blue tag with the number is
 
 ```
 <think>
-
-
-The goal is to extract the number from the blue tag, which is located near a hanging plant. To improve visibility, I will crop the region around the tag and zoom in. Based on the provided image, the tag is located near the center-left of the image, partially hidden by flowers. I will refine this region to increase clarity and focus on the relevant portion of the image.
 </think>
 
 D
Lines 1490-1502 (after):
 
 ```
 <think>
+To identify the objects predominantly visible in the residential area, it's essential to focus on the specific region of the image that contains the residential area. Cropping and zooming are critical because they allow us to isolate the relevant portion of the image, reducing distractions and enhancing the visibility of the objects we are interested in. In this case, the residential area is located in the lower right section of the image, and by cropping this region, we can better observe the details of the houses, roads, and other structures.
 
+The goal of the task is to identify the objects predominantly visible in the residential area. To improve visibility, I will crop the region around the residential area and zoom in. Based on the provided image, the residential area is located in the lower right section. I will refine this region to increase clarity and focus on the relevant portion of the image.
 </think>
 
 <tool_call>
+{"name": "python", "arguments": {"code": "import cv2\nimport os\nimport numpy as np\nfrom uuid import uuid4\n\n# Load the image\nimage_path = \"/mnt/data/images/P2504_0017.png\"\nimage = cv2.imread(image_path)\n\n# Define the coordinates for the residential area (inferred from the image)\nx1, y1, x2, y2 = 256, 320, 512, 512\n\n# Crop the image\ncropped_image = image[y1:y2, x1:x2]\n\n# Apply a zoom effect by resizing the cropped image\nzoom_factor = 2\nzoomed_image = cv2.resize(cropped_image, None, fx=zoom_factor, fy=zoom_factor, interpolation=cv2.INTER_LINEAR)\n\n# Save the processed image with a random suffix\ntemp_dir = \"/mnt/data/images/temp/\"\nrandom_suffix = uuid4().hex\nfilename = os.path.basename(image_path)\nprocessed_filename = f\"{os.path.splitext(filename)[0]}_{random_suffix}{os.path.splitext(filename)[1]}\"\nprocessed_path = os.path.join(temp_dir, processed_filename)\ncv2.imwrite(processed_path, zoomed_image)\n\n# Print the path of the saved cropped image\nprint(processed_path)"}}\n</tool_call>
 ```
 
 
Lines 1511-1522 (after):
 
 ```
 <think>
+The goal of the task is to identify the objects predominantly visible in the residential area of the image. To achieve this, I first cropped the image to focus on the residential area, which is located in the lower right portion of the image. However, the initial crop did not effectively highlight the residential area, as it included a larger portion of the image that was not relevant. Therefore, I refined the crop to focus more precisely on the residential area.
 
+After applying the refined crop, the image now clearly shows the residential area. The objects predominantly visible in this area include houses, roads, and some greenery. The houses are arranged in a grid-like pattern, with roads connecting them. The greenery includes trees and grassy areas, which are interspersed among the houses. This refined view allows for a more accurate analysis of the objects in the residential area.
 </think>
 
+In the residential area, the objects predominantly visible are houses, roads, and some greenery. The houses are arranged in a grid-like pattern, with roads connecting them. The greenery includes trees and grassy areas, which are interspersed among the houses.
 ```
 
 
Lines 1638-1644 (after):
 
 ```
 <think>
+The goal of the task is to identify the number written on the blue tag. To achieve this, I first cropped the image based on the provided bounding box. However, upon reviewing the cropped image, it became clear that the blue tag was not prominently visible in this region. The cropped area primarily focused on the wall and some plants, which did not include the blue tag. Therefore, I refined the crop further to focus specifically on the blue tag.\n\nAfter applying the second crop, the blue tag is now clearly visible. The number written on the blue tag is "23". Therefore, the correct answer is D.
 </think>
 
 D
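The crop-and-zoom operation that recurs in these traces can be sketched without OpenCV. The following is a minimal pure-NumPy equivalent (nearest-neighbour upsampling via `np.repeat` in place of `cv2.resize`); the coordinates are the illustrative ones from the trace, and the blank test array is only a stand-in for the actual image file.

```python
import numpy as np

def crop_and_zoom(image: np.ndarray, x1: int, y1: int, x2: int, y2: int,
                  zoom_factor: int = 2) -> np.ndarray:
    """Crop image[y1:y2, x1:x2] and enlarge it by an integer factor
    using nearest-neighbour upsampling on both spatial axes."""
    cropped = image[y1:y2, x1:x2]
    zoomed = np.repeat(np.repeat(cropped, zoom_factor, axis=0),
                       zoom_factor, axis=1)
    return zoomed

# Stand-in for the 512x512 3-channel image referenced in the trace.
image = np.zeros((512, 512, 3), dtype=np.uint8)
region = crop_and_zoom(image, x1=256, y1=320, x2=512, y2=512, zoom_factor=2)
print(region.shape)  # (384, 512, 3): the 192x256 crop doubled in each dimension
```

Nearest-neighbour keeps the sketch dependency-free; the traces use bilinear interpolation (`cv2.INTER_LINEAR`), which smooths rather than duplicates pixels.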
README.md (changed)
@@ -1,27 +1,42 @@
----
-license: apache-2.0
----
 # S1-VL-32B: Scientific Multimodal Reasoning Model
 
 [中文版](./README_zh.md) | [English](./README.md)
 
 ## 🔬 Introduction
 
-**S1-VL-32B** is a multimodal large language model for scientific domains, developed by the ScienceOne team at the Chinese Academy of Sciences. It natively supports two reasoning paradigms — **
 
-- **
-- **Thinking with Images
 
-We have established a **cross-disciplinary data processing pipeline** that conducts multi-dimensional utility evaluation and filtering of visual reasoning trajectories to ensure the quality of training
 
-- **Stage 1**: Large-scale multimodal instruction data spanning multiple disciplines — including **mathematics, physics, chemistry, astronomy, earth sciences, and biology** — is used for mixed training to enhance the model's scientific visual understanding and logical reasoning abilities, laying a solid foundation for academic figure Q&A, medical image analysis, chemical structure recognition, and related tasks.
-- **Stage 2**: The **Thinking with Images** reasoning paradigm is introduced. Through high-quality **scientific reasoning data annealing**, the model acquires the ability to perform **image operations via code** during inference. This approach yields particularly outstanding performance in scenarios requiring fine-grained image analysis, with notable strengths in interpreting dense scientific charts, high-resolution remote sensing imagery, microscopic images, and complex visual scenes such as astronomical observation data.
 
 ## 📂 Model Weights
 
 | Model | Parameters | HuggingFace | ModelScope |
 |-------|-----------|-------------|------------|
-| S1-VL-32B | 32B | 🤗 [Download](https://huggingface.co/ScienceOne-AI/S1-VL-32B) | 🤖 [Download](https://modelscope.cn/models/ScienceOne-AI/S1-VL-32B) |
 
 ## 🏆 Evaluation Results
 
@@ -56,7 +71,7 @@ pip install qwen-vl-utils==0.0.14
 ### 2. Start the vLLM Service
 
 ```bash
-vllm serve ScienceOne-AI/S1-VL-32B \
     --tensor-parallel-size 4 \
     --max-model-len 32768 \
     --limit-mm-per-prompt image=15 \
@@ -78,7 +93,7 @@ with open("path/to/your/image.png", "rb") as f:
     image_data = base64.b64encode(f.read()).decode("utf-8")
 
 response = client.chat.completions.create(
-    model="ScienceOne-AI/S1-VL-32B",
     messages=[
         {
             "role": "user",
@@ -88,8 +103,7 @@ response = client.chat.completions.create(
             ],
         }
     ],
-    temperature=0.
-    top_p=0.95,
     max_tokens=16384,
 )
 
@@ -167,14 +181,14 @@ print(final["content"])
 
 ## 📄 Citation
 
-If you use S1-VL-32B in your research, please cite
 
 ```latex
-@
-title
-author
-
-
 }
 ```
 
@@ -184,4 +198,4 @@ This project is released under the Apache 2.0 License.
 
 ## 🙏 Acknowledgements
 
-We thank the open-source communities and pioneering works of [Qwen3-VL](https://modelscope.cn/collections/Qwen3-VL-5c7a94c8cb144b) and [AIO Sandbox](https://github.com/agent-infra/sandbox) for laying the foundation for the scientific multimodal reasoning research behind S1-VL-32B.
Lines 1-42 (after):
 # S1-VL-32B: Scientific Multimodal Reasoning Model
 
 [中文版](./README_zh.md) | [English](./README.md)
 
 ## 🔬 Introduction
 
+**S1-VL-32B** is a multimodal large language model for scientific domains, developed by the ScienceOne AI team at the Chinese Academy of Sciences. It natively supports two reasoning paradigms — **Scientific Reasoning** and **Thinking with Images** — and achieves state-of-the-art performance across multiple mainstream scientific multimodal evaluation benchmarks.
 
+- **Scientific Reasoning**: Chain-of-thought-based multimodal scientific reasoning, designed for the analysis and solving of complex, multi-step problems.
+- **Thinking with Images**: Enables the model to actively invoke code tools during the reasoning process to perform image operations — including cropping, zooming, image enhancement, bounding box annotation, and keypoint marking — before generating responses.
 
+We have established a **cross-disciplinary data processing pipeline** that conducts multi-dimensional utility evaluation and filtering of visual reasoning trajectories to ensure the quality of the reasoning trajectories used for training.
+
+<div align="center">
+<img src="./image/data_pipeline.png"/>
+</div>
+
+We adopt a **four-stage post-training procedure** to progressively unlock the scientific reasoning capabilities of S1-VL-32B:
+
+- **Stage 1 - Scientific Reasoning SFT**: Large-scale multimodal instruction data spanning multiple disciplines — including **mathematics, physics, chemistry, astronomy, earth sciences, and biology** — is used for mixed training to enhance the model's scientific visual understanding and logical reasoning abilities, laying a solid foundation for academic figure Q&A, medical image analysis, chemical structure recognition, and related tasks.
+- **Stage 2 - Thinking-with-Images Cold-Start SFT**: The **Thinking with Images** reasoning paradigm is introduced. Through joint training with high-quality **scientific reasoning curriculum learning data** and image-thinking data, the model acquires the ability to perform **image operations via code** during inference. This approach yields particularly outstanding performance in interpreting dense scientific charts, high-resolution remote sensing imagery, microscopic images, and astronomical observation data (S1-VL-32B-SFT).
+- **Stage 3 - Scientific Reasoning RL**: Based on the **SAPO algorithm** and a multi-task scientific reward function, reinforcement learning is applied to challenging scientific multimodal reasoning samples to push beyond the performance ceiling of the SFT stage.
+- **Stage 4 - Thinking-with-Images RL**: Based on the **SAPO algorithm** and a four-dimensional composite reward function, the model's image operation invocation timing and quality are further optimized, enabling stable and efficient multi-round visual reasoning (S1-VL-32B-RL).
+
+<div align="center">
+<img src="./image/s1-vl-training-pipeline.png"/>
+</div>
+
+🔥 **[NEW]** Technical report released: [S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images](https://arxiv.org/abs/2604.21409)
+🔥 **[NEW]** Stage 3 and Stage 4 reinforcement learning training added; [S1-VL-32B-RL](https://huggingface.co/ScienceOne-AI/S1-VL-32B-RL) model weights updated.
 
 
 ## 📂 Model Weights
 
 | Model | Parameters | HuggingFace | ModelScope |
 |-------|-----------|-------------|------------|
+| S1-VL-32B-SFT | 32B | 🤗 [Download](https://huggingface.co/ScienceOne-AI/S1-VL-32B) | 🤖 [Download](https://modelscope.cn/models/ScienceOne-AI/S1-VL-32B) |
+| S1-VL-32B-RL | 32B | 🤗 [Download](https://huggingface.co/ScienceOne-AI/S1-VL-32B-RL) | 🤖 [Download](https://modelscope.cn/models/ScienceOne-AI/S1-VL-32B-RL) |
+
 
 ## 🏆 Evaluation Results
 
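The Thinking-with-Images loop hinges on pulling the `<tool_call>` block out of the model's output and executing its code, as seen in the CASES.md traces. Below is a minimal, hypothetical extraction helper; the tag format and JSON payload follow the traces, but the function name and regex are assumptions, not part of the released code.

```python
import json
import re

def extract_tool_call(model_output: str):
    """Pull the JSON payload out of a <tool_call>...</tool_call> block,
    matching the trace format in CASES.md; returns None if absent."""
    match = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>",
                      model_output, re.DOTALL)
    if match is None:
        return None
    return json.loads(match.group(1))

# Shape mirrors the traces: a <think> block followed by a python tool call.
output = ('<think>Crop the region.</think>\n<tool_call>\n'
          '{"name": "python", "arguments": {"code": "print(1)"}}\n</tool_call>')
call = extract_tool_call(output)
print(call["name"])  # python
```

In a full agent loop, the extracted `arguments["code"]` would be run in a sandbox and its stdout fed back to the model as the tool result.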
Lines 71-77 (after):
 ### 2. Start the vLLM Service
 
 ```bash
+vllm serve ScienceOne-AI/S1-VL-32B-RL \
     --tensor-parallel-size 4 \
     --max-model-len 32768 \
     --limit-mm-per-prompt image=15 \
Lines 93-99 (after):
     image_data = base64.b64encode(f.read()).decode("utf-8")
 
 response = client.chat.completions.create(
+    model="ScienceOne-AI/S1-VL-32B-RL",
     messages=[
         {
             "role": "user",
Lines 103-109 (after):
             ],
         }
     ],
+    temperature=0.2,
     max_tokens=16384,
 )
 
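The client snippet in this diff is truncated by the viewer, so here is a self-contained sketch of how the base64 image payload in such calls is typically assembled for an OpenAI-compatible endpoint. The inline data-URL shape is an assumption based on the standard `image_url` content format, not taken verbatim from this README.

```python
import base64

def build_image_message(image_bytes: bytes, question: str) -> dict:
    """Assemble one user message pairing a question with a base64-encoded
    image, in the content-list shape used by OpenAI-compatible chat APIs."""
    image_data = base64.b64encode(image_bytes).decode("utf-8")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            # Inline data URL; PNG is assumed here for illustration.
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_data}"}},
        ],
    }

msg = build_image_message(b"\x89PNG...", "What objects are visible?")
print(msg["role"])  # user
```

The resulting dict goes straight into the `messages` list of `client.chat.completions.create(...)` as shown in the hunks above this one in the original README.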
Lines 181-194 (after):
 
 ## 📄 Citation
 
+If you use S1-VL-32B in your research, please cite:
 
 ```latex
+@article{li2026s1vl,
+  title   = {S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images},
+  author  = {Li, Qingxiao and Xu, Lifeng and Wang, QingLi and Bai, Yudong and Ou, Mingwei and Hu, Shu and Xu, Nan},
+  journal = {arXiv preprint arXiv:2604.21409},
+  year    = {2026},
 }
 ```
 
Lines 198-201 (after):
 
 ## 🙏 Acknowledgements
 
+We thank the open-source communities and pioneering works of [Qwen3-VL](https://modelscope.cn/collections/Qwen3-VL-5c7a94c8cb144b) and [AIO Sandbox](https://github.com/agent-infra/sandbox) for laying the foundation for the scientific multimodal reasoning research behind S1-VL-32B.
README_zh.md (changed)
@@ -3,20 +3,33 @@
 [中文版](./README_zh.md) | [English](./README.md)
 
 ## 🔬 模型简介
-**S1-VL-32B** 是由中国科学院 “磐石 · 科学基础大模型” ScienceOne 团队研发的面向科学领域的多模态大语言模型,原生支持 **
-- **
 - **Thinking with Images 模式**:允许模型在思考过程中主动调用代码工具进行图像操作(包括裁剪、放缩、图像增强、画框标注、描点标记等)再生成回答。
 
-
-
-
 
 
 ## 📂 模型权重
 
 | 模型名称 | 参数量 | HuggingFace | ModelScope |
 |--------|------|-------------|------------|
-| S1-VL-32B | 32B | 🤗 [下载](https://huggingface.co/ScienceOne-AI/S1-VL-32B) | 🤖 [下载](https://modelscope.cn/models/ScienceOne-AI/S1-VL-32B) |
 
 
 ## 🏆 评测结果
@@ -31,7 +44,7 @@ S1-VL-32B 在上述评测中展现出突出的综合竞争力。在**科学多
 
 ## 🧠 案例展示
 
-以下展示 S1-VL-32B 在 **Thinking with Images** 模式下的推理案例。S1-VL-32B在处理一张低分辨率的颈部CT图像
 
 <div align="center">
 <img src="./image/s1-vl-32b-twi.png"/>
@@ -71,7 +84,7 @@ with open("path/to/your/image.png", "rb") as f:
     image_data = base64.b64encode(f.read()).decode("utf-8")
 
 response = client.chat.completions.create(
-    model="ScienceOne-AI/S1-VL-32B",
     messages=[
         {
             "role": "user",
@@ -159,13 +172,13 @@ print(final["content"])
 
 ## 📄 引用
 
-如果您在研究中使用了 S1-VL-32B,欢迎引用
 ```latex
-@
-title
-author
-
-
 }
 ```
 
Lines 3-35 (after):
 [中文版](./README_zh.md) | [English](./README.md)
 
 ## 🔬 模型简介
+**S1-VL-32B** 是由中国科学院 “磐石 · 科学基础大模型” ScienceOne AI 团队研发的面向科学领域的多模态大语言模型,原生支持 **Scientific Reasoning(多模态推理)** 与 **Thinking with Images(图像思考)** 两种推理范式,在多项主流科学多模态评测基准上达到当前最优水平。
+- **Scientific Reasoning 模式**:基于思维链的科学多模态推理,适用于复杂多步问题的分析与求解。
 - **Thinking with Images 模式**:允许模型在思考过程中主动调用代码工具进行图像操作(包括裁剪、放缩、图像增强、画框标注、描点标记等)再生成回答。
 
+
+我们建立**跨学科体系的数据处理管道**,对视觉推理轨迹进行多维度效用评估与筛选,确保用于训练的推理轨迹的质量。我们采用**四阶段渐进式后训练流程**,逐步解锁 S1-VL-32B 模型的科学推理能力:
+
+- **Stage 1 - 科学推理能力SFT**:基于涵盖**数理化天地生**等多学科的大规模多模态指令数据进行混合训练,提升模型科学视觉理解和逻辑推理能力,使模型在学术图像问答、医学影像分析、化学结构识别等方面奠定坚实基础;
+- **Stage 2 - 图像操作冷启动SFT**:引入 **Thinking with Images** 推理范式,混合高难度**科学推理课程学习数据**与图像思维数据进行联合训练,使模型具备在推理过程中通过代码进行**图像操作**的能力,尤其擅长解读密集科学图表、高分辨率遥感图像、显微图像及天文观测数据等复杂视觉场景(S1-VL-32B-SFT);
+- **Stage 3 - 科学推理强化学习**:基于**SAPO算法**与多任务科学奖励函数,对困难科学多模态推理样本进行强化学习,突破SFT阶段的性能边界;
+- **Stage 4 - 图像操作强化学习**:基于**SAPO算法**与四维复合奖励函数,进一步优化模型的图像操作调用时机与质量,实现稳定高效的多轮视觉推理(S1-VL-32B-RL)。
+
+<div align="center">
+<img src="./image/s1-vl-training-pipeline.png"/>
+</div>
+
+🔥 **[NEW]** 技术报告已发布:[S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images](https://arxiv.org/abs/2604.21409)
+🔥 **[NEW]** 补充了 Stage3、Stage4 两阶段强化学习训练,更新了 [S1-VL-32B-RL](https://huggingface.co/ScienceOne-AI/S1-VL-32B-RL) 模型权重。
 
 
 ## 📂 模型权重
 
 | 模型名称 | 参数量 | HuggingFace | ModelScope |
 |--------|------|-------------|------------|
+| S1-VL-32B-SFT | 32B | 🤗 [下载](https://huggingface.co/ScienceOne-AI/S1-VL-32B) | 🤖 [下载](https://modelscope.cn/models/ScienceOne-AI/S1-VL-32B) |
+| S1-VL-32B-RL | 32B | 🤗 [下载](https://huggingface.co/ScienceOne-AI/S1-VL-32B-RL) | 🤖 [下载](https://modelscope.cn/models/ScienceOne-AI/S1-VL-32B-RL) |
+
 
 
 ## 🏆 评测结果
Lines 44-50 (after):
 
 ## 🧠 案例展示
 
+以下展示 S1-VL-32B 在 **Thinking with Images** 模式下的推理案例。S1-VL-32B 在处理一张低分辨率的颈部CT图像时,在思考过程中主动调用代码工具,对目标区域进行**裁剪与放大**,获取更清晰的局部图像后,再结合模型内部知识完成推理。
 
 <div align="center">
 <img src="./image/s1-vl-32b-twi.png"/>
Lines 84-90 (after):
     image_data = base64.b64encode(f.read()).decode("utf-8")
 
 response = client.chat.completions.create(
+    model="ScienceOne-AI/S1-VL-32B-RL",
     messages=[
         {
             "role": "user",
Lines 172-184 (after):
 
 ## 📄 引用
 
+如果您在研究中使用了 S1-VL-32B,欢迎引用:
 ```latex
+@article{li2026s1vl,
+  title   = {S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images},
+  author  = {Li, Qingxiao and Xu, Lifeng and Wang, QingLi and Bai, Yudong and Ou, Mingwei and Hu, Shu and Xu, Nan},
+  journal = {arXiv preprint arXiv:2604.21409},
+  year    = {2026},
 }
 ```
 