Update README.md
README.md (CHANGED)
````diff
@@ -115,7 +115,7 @@ Limitations: Although we have made efforts to ensure the safety of the model dur
 
 ## Quick Start
 
-We provide an example code to run InternVL2-40B using `transformers`.
+We provide an example code to run `InternVL2-40B` using `transformers`.
 
 > Please use transformers>=4.37.2 to ensure the model works normally.
 
````
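For orientation, the `AutoModel.from_pretrained(...)` call that the next hunk tails into looks roughly like the sketch below. It is assembled from the visible context lines plus standard `transformers` arguments, so treat the exact kwargs as assumptions rather than the repository's verbatim code:

```python
import torch
from transformers import AutoModel, AutoTokenizer

path = 'OpenGVLab/InternVL2-40B'
# bfloat16 weights + low_cpu_mem_usage keep the 40B load tractable;
# trust_remote_code pulls in InternVL's custom modeling classes.
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
```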
````diff
@@ -150,10 +150,6 @@ model = AutoModel.from_pretrained(
     trust_remote_code=True).eval()
 ```
 
-#### BNB 4-bit Quantization
-
-> **⚠️ Warning:** Due to significant quantization errors with BNB 4-bit quantization on InternViT-6B, the model may produce nonsensical outputs and fail to understand images. Therefore, please avoid using BNB 4-bit quantization.
-
 #### Multiple GPUs
 
 The reason for writing the code this way is to avoid errors that occur during multi-GPU inference due to tensors not being on the same device. By ensuring that the first and last layers of the large language model (LLM) are on the same device, we prevent such errors.
````
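The `#### Multiple GPUs` paragraph above is the README's rationale for a hand-built `device_map`. A minimal sketch of the idea follows; the module names mirror InternVL's remote code but should be treated as assumptions here, as should the illustrative layer count. It pins the vision encoder and the first and last LLM layers to GPU 0 and spreads the rest:

```python
import math
import torch

def split_model(num_layers=60):  # 60 is illustrative; use the checkpoint's real depth
    world_size = max(torch.cuda.device_count(), 1)
    per_gpu = math.ceil(num_layers / world_size)
    # Spread the transformer layers over the GPUs in contiguous chunks.
    device_map = {f'language_model.model.layers.{i}': min(i // per_gpu, world_size - 1)
                  for i in range(num_layers)}
    # Keep the vision tower, embeddings, final norm, and output head together
    # on GPU 0 so inputs and logits never straddle devices.
    for name in ('vision_model', 'mlp1',
                 'language_model.model.embed_tokens',
                 'language_model.model.norm',
                 'language_model.lm_head'):
        device_map[name] = 0
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0  # last layer joins the first
    return device_map

# Pass device_map=split_model() to AutoModel.from_pretrained(...) as above.
```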
````diff
@@ -443,7 +439,7 @@ response, history = model.chat(tokenizer, pixel_values, question, generation_con
                                num_patches_list=num_patches_list, history=None, return_history=True)
 print(f'User: {question}\nAssistant: {response}')
 
-question = 'Describe this video in detail.
+question = 'Describe this video in detail.'
 response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                                num_patches_list=num_patches_list, history=history, return_history=True)
 print(f'User: {question}\nAssistant: {response}')
````
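The snippet this hunk patches consumes `pixel_values` and `num_patches_list` produced by the README's video-loading step. As a hedged illustration of where those come from (`preprocess_frame` is a hypothetical stand-in for the README's actual frame transform): each sampled frame becomes a stack of image patches, the stacks are concatenated, and the per-frame counts let `model.chat` split them back apart:

```python
import torch

def build_video_inputs(frames):
    """frames: raw video frames (e.g. PIL images) sampled from the clip."""
    # preprocess_frame is a hypothetical stand-in for the README's transform;
    # assume it returns a (n_patches, 3, 448, 448) tensor per frame.
    patch_stacks = [preprocess_frame(f) for f in frames]
    pixel_values = torch.cat(patch_stacks, dim=0).to(torch.bfloat16).cuda()
    num_patches_list = [p.size(0) for p in patch_stacks]  # per-frame patch counts
    return pixel_values, num_patches_list
```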
````diff
@@ -502,7 +498,7 @@ from lmdeploy.vl import load_image
 
 model = 'OpenGVLab/InternVL2-40B'
 image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
-pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
+pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192, tp=2))
 response = pipe(('describe this image', image))
 print(response.text)
 ```
````
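The `tp=2` added here and in the following hunks shards the model across two GPUs with tensor parallelism, so the pipeline now requires two visible devices. A small defensive sketch of the same call (the assertion is our addition, not part of the README):

```python
import torch
from lmdeploy import pipeline, TurbomindEngineConfig

tp = 2  # tensor-parallel degree introduced by this commit
assert torch.cuda.device_count() >= tp, 'tp=2 needs at least two visible GPUs'
pipe = pipeline('OpenGVLab/InternVL2-40B',
                backend_config=TurbomindEngineConfig(session_len=8192, tp=tp))
```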
````diff
@@ -521,7 +517,7 @@ from lmdeploy.vl import load_image
 from lmdeploy.vl.constants import IMAGE_TOKEN
 
 model = 'OpenGVLab/InternVL2-40B'
-pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
+pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192, tp=2))
 
 image_urls=[
     'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
````
````diff
@@ -543,7 +539,7 @@ from lmdeploy import pipeline, TurbomindEngineConfig
 from lmdeploy.vl import load_image
 
 model = 'OpenGVLab/InternVL2-40B'
-pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
+pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192, tp=2))
 
 image_urls=[
     "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg",
````
````diff
@@ -563,7 +559,7 @@ from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig
 from lmdeploy.vl import load_image
 
 model = 'OpenGVLab/InternVL2-40B'
-pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
+pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192, tp=2))
 
 image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg')
 gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.8)
````
````diff
@@ -578,7 +574,7 @@ print(sess.response.text)
 LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of service startup:
 
 ```shell
-lmdeploy serve api_server OpenGVLab/InternVL2-40B --backend turbomind --server-port 23333
+lmdeploy serve api_server OpenGVLab/InternVL2-40B --backend turbomind --server-port 23333 --tp 2
 ```
 
 To use the OpenAI-style interface, you need to install OpenAI:
````
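Once the server is up, any OpenAI-compatible client can talk to it. A sketch of a chat-completion request against the command above (the `base_url` and port follow that command; the API key is a placeholder, since a local server typically accepts any non-empty string):

```python
from openai import OpenAI

client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')
model_name = client.models.list().data[0].id  # the served InternVL2-40B
response = client.chat.completions.create(
    model=model_name,
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'describe this image'},
            {'type': 'image_url',
             'image_url': {'url': 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg'}},
        ],
    }],
    temperature=0.8,
    top_p=0.8)
print(response.choices[0].message.content)
```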