Output is empty

#3
by bingw5 - opened

I am able to run the example_inference.py script, and it finishes successfully. However, the model doesn't generate anything. See below:

(intern_clean)s:~/models/internvl-v1_5-4bit$ python example_inference.py
Unused kwargs: ['quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.33it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
dynamic ViT batch size: 7
<|im_start|>system
You are an AI assistant whose name is InternLM (书生·浦语).<|im_end|><|im_start|>user
<image>
What is shown in this image?<|im_end|><|im_start|>assistant

I found that this issue has occurred in other people's testing as well. What's going wrong here? @failspy

Yep, exact same issue here.

Has anyone actually got this to work? I'm having the same issue. It seems to use ~20GB of VRAM on my 3090 Ti for 20 seconds or so, then doesn't generate an actual response, just like OP posted.

Steps to reproduce

1. Inference Example Code

Modify the model usage code provided on the original model card (OpenGVLab/InternVL-Chat-V1-5-Int8), just enough to point it at this model's path and your local test image, e.g. a quick refactor:

MODEL_PATH = "./models/failspy/InternVL-Chat-V1-5-4bit" # "OpenGVLab/InternVL-Chat-V1-5-Int8"
IMAGE_FILE = "./test.jpg"

.
.
.

model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    # load_in_8bit=True  # <--- removed this line, since the quantization_config kwargs are read from config.json
).eval()
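
For completeness, here is roughly the full script I pieced together. Fair warning: the load_image_simple helper below is my own single-tile simplification (the original model card defines a multi-tile dynamic load_image helper instead), and the model.chat(...) call is lifted from the original card, so treat this as a sketch under those assumptions rather than a known-good script:

import torch
from PIL import Image
from torchvision import transforms
from transformers import AutoModel, AutoTokenizer

MODEL_PATH = "./models/failspy/InternVL-Chat-V1-5-4bit"
IMAGE_FILE = "./test.jpg"

# ImageNet normalisation constants used by the InternVL vision tower
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def load_image_simple(path, input_size=448):
    # Single 448x448 tile only -- my simplification of the model card's
    # dynamic multi-tile load_image helper.
    transform = transforms.Compose([
        transforms.Resize((input_size, input_size)),
        transforms.ToTensor(),
        transforms.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
    ])
    image = Image.open(path).convert("RGB")
    return transform(image).unsqueeze(0)  # shape: (1, 3, 448, 448)

# quantization_config is picked up from config.json, so no load_in_8bit/4bit kwarg here
model = AutoModel.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval()
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)

pixel_values = load_image_simple(IMAGE_FILE).to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens=512, do_sample=False)

question = "What is shown in this image?"
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(question, response)

I dropped the dynamic tiling just to keep the example short; if anyone gets actual output with the real load_image helper from the model card, please post your script.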

2. bitsandbytes quantization config

It will spit out some deprecation warnings, because the provided config.json uses some deprecated quantization_config parameters. From some reading on 4-bit quantization, it looks like @failspy attempted a 4-bit NormalFloat (NF4) quantization. For reference, compare against the original 8-bit quant's config.json; a rough in-code equivalent of this repo's settings is sketched after the JSON below.

this repo's config.json

  "quantization_config": {
    "_load_in_4bit": true,       # <- removed this line, seems deprecated
    "_load_in_8bit": false,     # <- removed this line seems deprecated
    "bnb_4bit_compute_dtype": "float32",
    "bnb_4bit_quant_storage": "uint8",
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": false,
    "llm_int8_enable_fp32_cpu_offload": false,
    "llm_int8_has_fp16_weight": false,
    "llm_int8_skip_modules": null,
    "llm_int8_threshold": 6.0,
    "load_in_4bit": true,
    "load_in_8bit": false,
    "quant_method": "bitsandbytes"    # <- this option not used so i removed it
  },
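
For anyone decoding that JSON: below is a rough in-code equivalent using the documented BitsAndBytesConfig keyword names. The underscore-prefixed _load_in_4bit/_load_in_8bit keys look like serialized internal attributes rather than real arguments, and quant_method is what triggers the "Unused kwargs" warning in the OP's log. Whether these values exactly match what @failspy used for the quant is an assumption on my part:

import torch
from transformers import AutoModel, BitsAndBytesConfig

# My reading of this repo's quantization_config, rebuilt with the documented
# BitsAndBytesConfig arguments (values copied from the JSON above).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # 4-bit NormalFloat
    bnb_4bit_compute_dtype=torch.float32,
    bnb_4bit_use_double_quant=False,
)

# One way to sanity-check the settings: quantize the original fp16 model
# on the fly instead of using this repo's pre-quantized weights.
model = AutoModel.from_pretrained(
    "OpenGVLab/InternVL-Chat-V1-5",
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval()

Quantizing the full-precision model on the fly like this obviously needs the disk space and bandwidth for the original fp16 weights, but it at least sidesteps this repo's pre-quantized checkpoint while testing.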

I really dunno what I'm doing, but maybe someone actually got this working and can provide an example inference script?

Thanks! Good luck everybody! 祝大家好运!
