czczup committed on
Commit 9c42ea1 · verified · 1 Parent(s): 42b7cef

Update README.md

Files changed (1)
  1. README.md +2 -17
README.md CHANGED
@@ -115,7 +115,7 @@ Limitations: Although we have made efforts to ensure the safety of the model dur
 
  ## Quick Start
 
- We provide an example code to run InternVL2-8B using `transformers`.
 
  > Please use transformers>=4.37.2 to ensure the model works normally.
 
@@ -150,21 +150,6 @@ model = AutoModel.from_pretrained(
  trust_remote_code=True).eval()
  ```
 
- #### BNB 4-bit Quantization
-
- ```python
- import torch
- from transformers import AutoTokenizer, AutoModel
- path = "OpenGVLab/InternVL2-8B"
- model = AutoModel.from_pretrained(
- path,
- torch_dtype=torch.bfloat16,
- load_in_4bit=True,
- low_cpu_mem_usage=True,
- use_flash_attn=True,
- trust_remote_code=True).eval()
- ```
-
  #### Multiple GPUs
 
  The reason for writing the code this way is to avoid errors that occur during multi-GPU inference due to tensors not being on the same device. By ensuring that the first and last layers of the large language model (LLM) are on the same device, we prevent such errors.
@@ -423,7 +408,7 @@ response, history = model.chat(tokenizer, pixel_values, question, generation_con
  num_patches_list=num_patches_list, history=None, return_history=True)
  print(f'User: {question}\nAssistant: {response}')
 
- question = 'Describe this video in detail. Don\'t repeat.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
  num_patches_list=num_patches_list, history=history, return_history=True)
  print(f'User: {question}\nAssistant: {response}')
 
 
  ## Quick Start
 
+ We provide an example code to run `InternVL2-8B` using `transformers`.
 
  > Please use transformers>=4.37.2 to ensure the model works normally.
 
 
  trust_remote_code=True).eval()
  ```
 
  #### Multiple GPUs
 
  The reason for writing the code this way is to avoid errors that occur during multi-GPU inference due to tensors not being on the same device. By ensuring that the first and last layers of the large language model (LLM) are on the same device, we prevent such errors.
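
Reviewer note: the README code this paragraph refers to is not part of the diff, so the following is only a minimal sketch of the idea it describes, not the author's implementation. It builds a layer-to-GPU mapping in which the first and last LLM layers share one device. The function name `build_device_map` and every module name in it (`vision_model`, `language_model.model.layers.N`, and so on) are assumptions made for illustration and should be checked against the actual model.

```python
import math


def build_device_map(num_layers: int, world_size: int) -> dict[str, int]:
    """Map (hypothetical) module names to GPU indices.

    Decoder layers are spread across GPUs, but the first and last
    layers of the LLM are both kept on GPU 0.
    """
    if num_layers < 1 or world_size < 1:
        raise ValueError("num_layers and world_size must be at least 1")

    device_map: dict[str, int] = {}

    # Spread the decoder layers as evenly as possible across the GPUs.
    per_gpu = math.ceil(num_layers / world_size)
    for i in range(num_layers):
        gpu = min(i // per_gpu, world_size - 1)
        device_map[f"language_model.model.layers.{i}"] = gpu

    # Pin the input-side and output-side modules to GPU 0.
    # These module names are assumptions, not confirmed by the diff.
    for name in (
        "vision_model",
        "language_model.model.embed_tokens",
        "language_model.model.norm",
        "language_model.lm_head",
    ):
        device_map[name] = 0

    # Move the last decoder layer to GPU 0 as well, so the first and
    # last layers of the LLM end up on the same device.
    device_map[f"language_model.model.layers.{num_layers - 1}"] = 0
    return device_map
```

A mapping like this would typically be passed as the `device_map` argument of `AutoModel.from_pretrained`. Whether these exact keys match the model's module tree is something to verify before relying on it.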
 
  num_patches_list=num_patches_list, history=None, return_history=True)
  print(f'User: {question}\nAssistant: {response}')
 
+ question = 'Describe this video in detail.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
  num_patches_list=num_patches_list, history=history, return_history=True)
  print(f'User: {question}\nAssistant: {response}')