OpenGVLab/InternVL2-40B

@@ -62,6 +62,8 @@ InternVL 2.0 is a multimodal large language model series, featuring models of va
 | MathVista<sub>testmini</sub> |      58.1       |      57.7      |     59.4      |     63.7      |
 |  OpenCompass<sub>avg</sub>   |      63.5       |      64.4      |     66.4      |     69.7      |
 - We simultaneously use InternVL and VLMEvalKit repositories for model evaluation. Specifically, the results reported for DocVQA, ChartQA, InfoVQA, TextVQA, MME, AI2D, MMBench, CCBench, MMVet, and SEED-Image were tested using the InternVL repository. OCRBench, RealWorldQA, HallBench, and MathVista were evaluated using the VLMEvalKit.
 - For MMMU, we report both the original scores (left side: evaluated using the InternVL codebase for InternVL series models, and sourced from technical reports or webpages for other models) and the VLMEvalKit scores (right side: collected from the OpenCompass leaderboard).
@@ -321,7 +323,7 @@ tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast
 # set the max number of tiles in `max_num`
 pixel_values = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
-generation_config = dict(max_new_tokens=1024, do_sample=False)
 # pure-text conversation (纯文本对话)
 question = 'Hello, who are you?'
@@ -473,7 +475,7 @@ for new_text in streamer:
 ## Finetune
-SWIFT from ModelScope community has supported the fine-tuning (Image/Video) of InternVL, please check [this link](https://github.com/modelscope/swift/blob/main/docs/source_en/Multi-Modal/internvl-best-practice.md) for more details.
 ## Deployment
@@ -482,7 +484,7 @@ SWIFT from ModelScope community has supported the fine-tuning (Image/Video) of I
 LMDeploy is a toolkit for compressing, deploying, and serving LLM, developed by the MMRazor and MMDeploy teams.
 ```sh
-pip install lmdeploy
 ```
 LMDeploy abstracts the complex inference process of multi-modal Vision-Language Models (VLM) into an easy-to-use pipeline, similar to the Large Language Model (LLM) inference pipeline.
@@ -490,16 +492,12 @@ LMDeploy abstracts the complex inference process of multi-modal Vision-Language
 #### A 'Hello, world' example
 ```python
-from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig
 from lmdeploy.vl import load_image
 model = 'OpenGVLab/InternVL2-40B'
-system_prompt = '我是书生·万象，英文名是InternVL，是由上海人工智能实验室、清华大学及多家合作单位联合开发的多模态大语言模型。'
 image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
-chat_template_config = ChatTemplateConfig('internvl-zh-hermes2')
-chat_template_config.meta_instruction = system_prompt
-pipe = pipeline(model, chat_template_config=chat_template_config,
-                backend_config=TurbomindEngineConfig(session_len=8192))
 response = pipe(('describe this image', image))
 print(response.text)
 ```
@@ -513,16 +511,12 @@ When dealing with multiple images, you can put them all in one list. Keep in min
 > Warning: Due to the scarcity of multi-image conversation data, the performance on multi-image tasks may be unstable, and it may require multiple attempts to achieve satisfactory results.
 ```python
-from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig
 from lmdeploy.vl import load_image
 from lmdeploy.vl.constants import IMAGE_TOKEN
 model = 'OpenGVLab/InternVL2-40B'
-system_prompt = '我是书生·万象，英文名是InternVL，是由上海人工智能实验室、清华大学及多家合作单位联合开发的多模态大语言模型。'
-chat_template_config = ChatTemplateConfig('internvl-zh-hermes2')
-chat_template_config.meta_instruction = system_prompt
-pipe = pipeline(model, chat_template_config=chat_template_config,
-                backend_config=TurbomindEngineConfig(session_len=8192))
 image_urls=[
     'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
@@ -540,15 +534,11 @@ print(response.text)
 Conducting inference with batch prompts is quite straightforward; just place them within a list structure:
 ```python
-from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig
 from lmdeploy.vl import load_image
 model = 'OpenGVLab/InternVL2-40B'
-system_prompt = '我是书生·万象，英文名是InternVL，是由上海人工智能实验室、清华大学及多家合作单位联合开发的多模态大语言模型。'
-chat_template_config = ChatTemplateConfig('internvl-zh-hermes2')
-chat_template_config.meta_instruction = system_prompt
-pipe = pipeline(model, chat_template_config=chat_template_config,
-                backend_config=TurbomindEngineConfig(session_len=8192))
 image_urls=[
     "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg",
@@ -564,15 +554,11 @@ print(response)
 There are two ways to hold multi-turn conversations with the pipeline. One is to construct messages in the OpenAI format and use the method introduced above; the other is to use the `pipeline.chat` interface.
 ```python
-from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig, GenerationConfig
 from lmdeploy.vl import load_image
 model = 'OpenGVLab/InternVL2-40B'
-system_prompt = '我是书生·万象，英文名是InternVL，是由上海人工智能实验室、清华大学及多家合作单位联合开发的多模态大语言模型。'
-chat_template_config = ChatTemplateConfig('internvl-zh-hermes2')
-chat_template_config.meta_instruction = system_prompt
-pipe = pipeline(model, chat_template_config=chat_template_config,
-                backend_config=TurbomindEngineConfig(session_len=8192))
 image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg')
 gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.8)
@@ -584,20 +570,10 @@ print(sess.response.text)
 #### Service
-To deploy InternVL2 as an API, please configure the chat template config first. Create the following JSON file `chat_template.json`.
-```json
-{
-    "model_name":"internvl-zh-hermes2",
-    "meta_instruction":"我是书生·万象，英文名是InternVL，是由上海人工智能实验室、清华大学及多家合作单位联合开发的多模态大语言模型。",
-    "stop_words":["<|im_start|>", "<|im_end|>"]
-}
-```
 LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of service startup:
 ```shell
-lmdeploy serve api_server OpenGVLab/InternVL2-40B --backend turbomind --server-port 23333 --chat-template chat_template.json
 ```
 To use the OpenAI-style interface, you need to install OpenAI:
@@ -634,14 +610,6 @@ response = client.chat.completions.create(
 print(response)
 ```
-### vLLM
-TODO
-### Ollama
-TODO
 ## License
 This project is released under the MIT license, while InternLM2 is licensed under the Apache-2.0 license.
@@ -714,6 +682,8 @@ InternVL 2.0 是一个多模态大语言模型系列，包含各种规模的模
 | MathVista<sub>testmini</sub> |      58.1       |      57.7      |     59.4      |     63.7      |
 |  OpenCompass<sub>avg</sub>   |      63.5       |      64.4      |     66.4      |     69.7      |
 - 我们同时使用 InternVL 和 VLMEvalKit 仓库进行模型评估。具体来说，DocVQA、ChartQA、InfoVQA、TextVQA、MME、AI2D、MMBench、CCBench、MMVet 和 SEED-Image 的结果是使用 InternVL 仓库测试的。OCRBench、RealWorldQA、HallBench 和 MathVista 是使用 VLMEvalKit 进行评估的。
 - 对于MMMU，我们报告了原始分数（左侧：InternVL系列模型使用InternVL代码库评测，其他模型的分数来自其技术报告或网页）和VLMEvalKit分数（右侧：从OpenCompass排行榜收集）。
@@ -772,7 +742,7 @@ InternVL 2.0 是一个多模态大语言模型系列，包含各种规模的模
 ## 微调
-来自ModelScope社区的SWIFT已经支持对InternVL进行微调（图像/视频），详情请查看[此链接](https://github.com/modelscope/swift/blob/main/docs/source_en/Multi-Modal/internvl-best-practice.md)。
 ## 部署
@@ -781,7 +751,7 @@ InternVL 2.0 是一个多模态大语言模型系列，包含各种规模的模
 LMDeploy 是由 MMRazor 和 MMDeploy 团队开发的用于压缩、部署和服务大语言模型（LLM）的工具包。
 ```sh
-pip install lmdeploy
 ```
 LMDeploy 将多模态视觉-语言模型（VLM）的复杂推理过程抽象为一个易于使用的管道，类似于大语言模型（LLM）的推理管道。
@@ -789,16 +759,12 @@ LMDeploy 将多模态视觉-语言模型（VLM）的复杂推理过程抽象为
 #### 一个“你好，世界”示例
 ```python
-from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig
 from lmdeploy.vl import load_image
 model = 'OpenGVLab/InternVL2-40B'
-system_prompt = '我是书生·万象，英文名是InternVL，是由上海人工智能实验室、清华大学及多家合作单位联合开发的多模态大语言模型。'
 image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
-chat_template_config = ChatTemplateConfig('internvl-zh-hermes2')
-chat_template_config.meta_instruction = system_prompt
-pipe = pipeline(model, chat_template_config=chat_template_config,
-                backend_config=TurbomindEngineConfig(session_len=8192))
 response = pipe(('describe this image', image))
 print(response.text)
 ```
@@ -810,16 +776,12 @@ print(response.text)
 在处理多张图像时，可以将它们全部放入一个列表中。请注意，多张图像会导致输入 token 数量增加，因此通常需要增加上下文窗口的大小。
 ```python
-from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig
 from lmdeploy.vl import load_image
 from lmdeploy.vl.constants import IMAGE_TOKEN
 model = 'OpenGVLab/InternVL2-40B'
-system_prompt = '我是书生·万象，英文名是InternVL，是由上海人工智能实验室、清华大学及多家合作单位联合开发的多模态大语言模型。'
-chat_template_config = ChatTemplateConfig('internvl-zh-hermes2')
-chat_template_config.meta_instruction = system_prompt
-pipe = pipeline(model, chat_template_config=chat_template_config,
-                backend_config=TurbomindEngineConfig(session_len=8192))
 image_urls=[
     'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
@@ -827,6 +789,7 @@ image_urls=[
 ]
 images = [load_image(img_url) for img_url in image_urls]
 response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
 print(response.text)
 ```
@@ -836,15 +799,11 @@ print(response.text)
 使用批量Prompt进行推理非常简单；只需将它们放在一个列表结构中：
 ```python
-from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig
 from lmdeploy.vl import load_image
 model = 'OpenGVLab/InternVL2-40B'
-system_prompt = '我是书生·万象，英文名是InternVL，是由上海人工智能实验室、清华大学及多家合作单位联合开发的多模态大语言模型。'
-chat_template_config = ChatTemplateConfig('internvl-zh-hermes2')
-chat_template_config.meta_instruction = system_prompt
-pipe = pipeline(model, chat_template_config=chat_template_config,
-                backend_config=TurbomindEngineConfig(session_len=8192))
 image_urls=[
     "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg",
@@ -860,15 +819,11 @@ print(response)
 使用管道进行多轮对话有两种方法。一种是根据 OpenAI 的格式构建消息并使用上述方法，另一种是使用 `pipeline.chat` 接口。
 ```python
-from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig, GenerationConfig
 from lmdeploy.vl import load_image
 model = 'OpenGVLab/InternVL2-40B'
-system_prompt = '我是书生·万象，英文名是InternVL，是由上海人工智能实验室、清华大学及多家合作单位联合开发的多模态大语言模型。'
-chat_template_config = ChatTemplateConfig('internvl-zh-hermes2')
-chat_template_config.meta_instruction = system_prompt
-pipe = pipeline(model, chat_template_config=chat_template_config,
-                backend_config=TurbomindEngineConfig(session_len=8192))
 image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg')
 gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.8)
@@ -880,20 +835,10 @@ print(sess.response.text)
 #### API部署
-为了将InternVL2部署成API，请先配置聊天模板配置文件。创建如下的 JSON 文件 `chat_template.json`。
-```json
-{
-    "model_name":"internvl-zh-hermes2",
-    "meta_instruction":"我是书生·万象，英文名是InternVL，是由上海人工智能实验室、清华大学及多家合作单位联合开发的多模态大语言模型。",
-    "stop_words":["<|im_start|>", "<|im_end|>"]
-}
-```
 LMDeploy 的 `api_server` 使模型能够通过一个命令轻松打包成服务。提供的 RESTful API 与 OpenAI 的接口兼容。以下是服务启动的示例：
 ```shell
-lmdeploy serve api_server OpenGVLab/InternVL2-40B --backend turbomind --server-port 23333 --chat-template chat_template.json
 ```
 为了使用OpenAI风格的API接口，您需要安装OpenAI:
@@ -930,14 +875,6 @@ response = client.chat.completions.create(
 print(response)
 ```
-### vLLM
-TODO
-### Ollama
-TODO
 ## 开源许可证
 该项目采用 MIT 许可证发布，而 InternLM2 则采用 Apache-2.0 许可证。

 | MathVista<sub>testmini</sub> |      58.1       |      57.7      |     59.4      |     63.7      |
 |  OpenCompass<sub>avg</sub>   |      63.5       |      64.4      |     66.4      |     69.7      |
+- For more details and evaluation reproduction, please refer to our [Evaluation Guide](https://internvl.readthedocs.io/en/latest/internvl2.0/evaluation.html).
 - We simultaneously use InternVL and VLMEvalKit repositories for model evaluation. Specifically, the results reported for DocVQA, ChartQA, InfoVQA, TextVQA, MME, AI2D, MMBench, CCBench, MMVet, and SEED-Image were tested using the InternVL repository. OCRBench, RealWorldQA, HallBench, and MathVista were evaluated using the VLMEvalKit.
 - For MMMU, we report both the original scores (left side: evaluated using the InternVL codebase for InternVL series models, and sourced from technical reports or webpages for other models) and the VLMEvalKit scores (right side: collected from the OpenCompass leaderboard).
 # set the max number of tiles in `max_num`
 pixel_values = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
+generation_config = dict(max_new_tokens=1024, do_sample=True)
 # pure-text conversation (纯文本对话)
 question = 'Hello, who are you?'
 ## Finetune
+Many repositories now support fine-tuning of the InternVL series models, including [InternVL](https://github.com/OpenGVLab/InternVL), [SWIFT](https://github.com/modelscope/ms-swift), [XTuner](https://github.com/InternLM/xtuner), and others. Please refer to their documentation for more details on fine-tuning.
 ## Deployment
 LMDeploy is a toolkit for compressing, deploying, and serving LLM, developed by the MMRazor and MMDeploy teams.
 ```sh
+pip install lmdeploy==0.5.3
 ```
 LMDeploy abstracts the complex inference process of multi-modal Vision-Language Models (VLM) into an easy-to-use pipeline, similar to the Large Language Model (LLM) inference pipeline.
 #### A 'Hello, world' example
 ```python
+from lmdeploy import pipeline, TurbomindEngineConfig
 from lmdeploy.vl import load_image
 model = 'OpenGVLab/InternVL2-40B'
 image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
+pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
 response = pipe(('describe this image', image))
 print(response.text)
 ```
 > Warning: Due to the scarcity of multi-image conversation data, the performance on multi-image tasks may be unstable, and it may require multiple attempts to achieve satisfactory results.
 ```python
+from lmdeploy import pipeline, TurbomindEngineConfig
 from lmdeploy.vl import load_image
 from lmdeploy.vl.constants import IMAGE_TOKEN
 model = 'OpenGVLab/InternVL2-40B'
+pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
 image_urls=[
     'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
 Conducting inference with batch prompts is quite straightforward; just place them within a list structure:
 ```python
+from lmdeploy import pipeline, TurbomindEngineConfig
 from lmdeploy.vl import load_image
 model = 'OpenGVLab/InternVL2-40B'
+pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
 image_urls=[
     "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg",
 There are two ways to hold multi-turn conversations with the pipeline. One is to construct messages in the OpenAI format and use the method introduced above; the other is to use the `pipeline.chat` interface.
 ```python
+from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig
 from lmdeploy.vl import load_image
 model = 'OpenGVLab/InternVL2-40B'
+pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
 image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg')
 gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.8)
 #### Service
 LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of service startup:
 ```shell
+lmdeploy serve api_server OpenGVLab/InternVL2-40B --backend turbomind --server-port 23333
 ```
 To use the OpenAI-style interface, you need to install OpenAI:
 print(response)
 ```
 ## License
 This project is released under the MIT license, while InternLM2 is licensed under the Apache-2.0 license.
 | MathVista<sub>testmini</sub> |      58.1       |      57.7      |     59.4      |     63.7      |
 |  OpenCompass<sub>avg</sub>   |      63.5       |      64.4      |     66.4      |     69.7      |
+- 关于更多的细节以及评测复现，请看我们的[评测指南](https://internvl.readthedocs.io/en/latest/internvl2.0/evaluation.html)。
 - 我们同时使用 InternVL 和 VLMEvalKit 仓库进行模型评估。具体来说，DocVQA、ChartQA、InfoVQA、TextVQA、MME、AI2D、MMBench、CCBench、MMVet 和 SEED-Image 的结果是使用 InternVL 仓库测试的。OCRBench、RealWorldQA、HallBench 和 MathVista 是使用 VLMEvalKit 进行评估的。
 - 对于MMMU，我们报告了原始分数（左侧：InternVL系列模型使用InternVL代码库评测，其他模型的分数来自其技术报告或网页）和VLMEvalKit分数（右侧：从OpenCompass排行榜收集）。
 ## 微调
+许多仓库现在都支持 InternVL 系列模型的微调，包括 [InternVL](https://github.com/OpenGVLab/InternVL)、[SWIFT](https://github.com/modelscope/ms-swift)、[XTuner](https://github.com/InternLM/xtuner) 等。请参阅它们的文档以获取更多微调细节。
 ## 部署
 LMDeploy 是由 MMRazor 和 MMDeploy 团队开发的用于压缩、部署和服务大语言模型（LLM）的工具包。
 ```sh
+pip install lmdeploy==0.5.3
 ```
 LMDeploy 将多模态视觉-语言模型（VLM）的复杂推理过程抽象为一个易于使用的管道，类似于大语言模型（LLM）的推理管道。
 #### 一个“你好，世界”示例
 ```python
+from lmdeploy import pipeline, TurbomindEngineConfig
 from lmdeploy.vl import load_image
 model = 'OpenGVLab/InternVL2-40B'
 image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
+pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
 response = pipe(('describe this image', image))
 print(response.text)
 ```
 在处理多张图像时，可以将它们全部放入一个列表中。请注意，多张图像会导致输入 token 数量增加，因此通常需要增加上下文窗口的大小。
 ```python
+from lmdeploy import pipeline, TurbomindEngineConfig
 from lmdeploy.vl import load_image
 from lmdeploy.vl.constants import IMAGE_TOKEN
 model = 'OpenGVLab/InternVL2-40B'
+pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
 image_urls=[
     'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
 ]
 images = [load_image(img_url) for img_url in image_urls]
+# Numbering images improves multi-image conversations
 response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
 print(response.text)
 ```
 使用批量Prompt进行推理非常简单；只需将它们放在一个列表结构中：
 ```python
+from lmdeploy import pipeline, TurbomindEngineConfig
 from lmdeploy.vl import load_image
 model = 'OpenGVLab/InternVL2-40B'
+pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
 image_urls=[
     "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg",
 使用管道进行多轮对话有两种方法。一种是根据 OpenAI 的格式构建消息并使用上述方法，另一种是使用 `pipeline.chat` 接口。
 ```python
+from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig
 from lmdeploy.vl import load_image
 model = 'OpenGVLab/InternVL2-40B'
+pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
 image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg')
 gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.8)
 #### API部署
 LMDeploy 的 `api_server` 使模型能够通过一个命令轻松打包成服务。提供的 RESTful API 与 OpenAI 的接口兼容。以下是服务启动的示例：
 ```shell
+lmdeploy serve api_server OpenGVLab/InternVL2-40B --backend turbomind --server-port 23333
 ```
 为了使用OpenAI风格的API接口，您需要安装OpenAI:
 print(response)
 ```
 ## 开源许可证
 该项目采用 MIT 许可证发布，而 InternLM2 则采用 Apache-2.0 许可证。
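
The comment added in this diff (`# Numbering images improves multi-image conversations`) refers to a prompt of the form `Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\n...`. As a minimal sketch, the same prompt can be built for any number of images. `build_numbered_prompt` is a hypothetical helper, not part of LMDeploy, and the default `image_token` value is an assumption standing in for `lmdeploy.vl.constants.IMAGE_TOKEN`.

```python
def build_numbered_prompt(question, num_images, image_token='<IMAGE_TOKEN>'):
    """Prefix `question` with one numbered placeholder line per image.

    Hypothetical helper: `image_token` stands in for
    lmdeploy.vl.constants.IMAGE_TOKEN (the default here is an assumption).
    """
    if num_images < 1:
        raise ValueError('num_images must be at least 1')
    # One "Image-i: <token>" line per image, numbered from 1.
    lines = [f'Image-{i}: {image_token}' for i in range(1, num_images + 1)]
    # The actual question goes last, after all image placeholders.
    lines.append(question)
    return '\n'.join(lines)


prompt = build_numbered_prompt('describe these two images', 2)
print(prompt)
```

With LMDeploy installed, one would presumably pass the real `IMAGE_TOKEN` as `image_token` and call `pipe((prompt, images))`, as in the README example.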

modeling_intern_vit.py

@@ -20,18 +20,12 @@ from transformers.utils import logging
 from .configuration_intern_vit import InternVisionConfig
 try:
-    try:  # v1
-        from flash_attn.flash_attn_interface import \
-            flash_attn_unpadded_qkvpacked_func
-    except:  # v2
-        from flash_attn.flash_attn_interface import \
-            flash_attn_varlen_qkvpacked_func as flash_attn_unpadded_qkvpacked_func
     from flash_attn.bert_padding import pad_input, unpad_input
     has_flash_attn = True
 except:
-    print('FlashAttention is not installed.')
     has_flash_attn = False
 logger = logging.get_logger(__name__)
@@ -74,7 +68,7 @@ class FlashAttention(nn.Module):
                 max_s = seqlen
                 cu_seqlens = torch.arange(0, (batch_size + 1) * seqlen, step=seqlen, dtype=torch.int32,
                                           device=qkv.device)
-                output = flash_attn_unpadded_qkvpacked_func(
                     qkv, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
                     softmax_scale=self.softmax_scale, causal=causal
                 )
@@ -84,7 +78,7 @@ class FlashAttention(nn.Module):
                 x = rearrange(qkv, 'b s three h d -> b s (three h d)')
                 x_unpad, indices, cu_seqlens, max_s = unpad_input(x, key_padding_mask)
                 x_unpad = rearrange(x_unpad, 'nnz (three h d) -> nnz three h d', three=3, h=nheads)
-                output_unpad = flash_attn_unpadded_qkvpacked_func(
                     x_unpad, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
                     softmax_scale=self.softmax_scale, causal=causal
                 )
@@ -93,7 +87,7 @@ class FlashAttention(nn.Module):
                                    'b s (h d) -> b s h d', h=nheads)
         else:
             assert max_s is not None
-            output = flash_attn_unpadded_qkvpacked_func(
                 qkv, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
                 softmax_scale=self.softmax_scale, causal=causal
             )
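
The removed lines tried the FlashAttention v1 function name first and fell back to the v2 name under an alias. For readers who still need that kind of fallback, here is a minimal standard-library sketch of the pattern. `resolve_first_attr` is a hypothetical helper and is not how this repository does it.

```python
import importlib


def resolve_first_attr(module_name, candidate_names):
    """Return the first attribute in `candidate_names` that `module_name` exposes.

    Returns None when the module cannot be imported or none of the names exist.
    Hypothetical helper illustrating a "try the old name, then the new name" lookup.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    for name in candidate_names:
        attr = getattr(module, name, None)
        if attr is not None:
            return attr
    return None


# Mirrors the removed v1 -> v2 order; the result is None if flash_attn is absent.
varlen_func = resolve_first_attr(
    'flash_attn.flash_attn_interface',
    ['flash_attn_unpadded_qkvpacked_func', 'flash_attn_varlen_qkvpacked_func'],
)
has_flash_attn = varlen_func is not None
```

Catching only `ImportError` here is a deliberate difference from the bare `except:` in the diff, so that unrelated errors are not silently swallowed.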

 from .configuration_intern_vit import InternVisionConfig
 try:
     from flash_attn.bert_padding import pad_input, unpad_input
+    from flash_attn.flash_attn_interface import \
+        flash_attn_varlen_qkvpacked_func
     has_flash_attn = True
 except:
+    print('FlashAttention2 is not installed.')
     has_flash_attn = False
 logger = logging.get_logger(__name__)
                 max_s = seqlen
                 cu_seqlens = torch.arange(0, (batch_size + 1) * seqlen, step=seqlen, dtype=torch.int32,
                                           device=qkv.device)
+                output = flash_attn_varlen_qkvpacked_func(
                     qkv, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
                     softmax_scale=self.softmax_scale, causal=causal
                 )
                 x = rearrange(qkv, 'b s three h d -> b s (three h d)')
                 x_unpad, indices, cu_seqlens, max_s = unpad_input(x, key_padding_mask)
                 x_unpad = rearrange(x_unpad, 'nnz (three h d) -> nnz three h d', three=3, h=nheads)
+                output_unpad = flash_attn_varlen_qkvpacked_func(
                     x_unpad, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
                     softmax_scale=self.softmax_scale, causal=causal
                 )
                                    'b s (h d) -> b s h d', h=nheads)
         else:
             assert max_s is not None
+            output = flash_attn_varlen_qkvpacked_func(
                 qkv, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
                 softmax_scale=self.softmax_scale, causal=causal
             )
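
The `cu_seqlens` tensor passed to `flash_attn_varlen_qkvpacked_func` in the first hunk holds cumulative sequence boundaries: sequence `i` occupies rows `cu_seqlens[i]:cu_seqlens[i + 1]` of the flattened batch. A pure-Python sketch of the equal-length case, mirroring the `torch.arange` call in the diff (`fixed_length_cu_seqlens` is a hypothetical name used only for illustration):

```python
def fixed_length_cu_seqlens(batch_size, seqlen):
    """Cumulative boundaries for a batch in which every sequence has length `seqlen`.

    Pure-Python analogue of
    torch.arange(0, (batch_size + 1) * seqlen, step=seqlen, dtype=torch.int32).
    """
    return list(range(0, (batch_size + 1) * seqlen, seqlen))


boundaries = fixed_length_cu_seqlens(batch_size=3, seqlen=4)
print(boundaries)  # [0, 4, 8, 12]

# Each consecutive pair of boundaries delimits one sequence of length `seqlen`.
lengths = [end - start for start, end in zip(boundaries, boundaries[1:])]
print(lengths)  # [4, 4, 4]
```

The list always has `batch_size + 1` entries, which is why the upper bound in the diff is `(batch_size + 1) * seqlen` rather than `batch_size * seqlen`.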