load_in_8bit error

#2
by BBLL3456 - opened

I could load Baichuan version 1 in 8-bit but cannot load version 2; I get the following error:

ValueError:
Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit
the quantized model. If you want to dispatch the model on the CPU or the disk while keeping
these modules in 32-bit, you need to set load_in_8bit_fp32_cpu_offload=True and pass a custom
device_map to from_pretrained. Check
https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu
for more details.
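For reference, the offload route that this error message points to looks roughly like the sketch below. This is only a sketch: the llm_int8_enable_fp32_cpu_offload option in BitsAndBytesConfig is the one described at the linked docs page for CPU/GPU offload, and the module names in the device_map are illustrative and must match the model's actual layer names.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Allow modules that do not fit on the GPU to stay on the CPU in fp32.
quant_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,
)

# Illustrative device_map: keep the transformer layers on GPU 0 and push
# the offloaded module(s) to the CPU; adjust to the real module names.
device_map = {
    "model.embed_tokens": 0,
    "model.layers": 0,
    "model.norm": 0,
    "lm_head": "cpu",
}

model = AutoModelForCausalLM.from_pretrained(
    "./model/Baichuan2-13B-Chat",
    quantization_config=quant_config,
    device_map=device_map,
    trust_remote_code=True,
)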

Baichuan Intelligent Technology org

Can you post your code?

I used the web_demo.py from GitHub and just added load_in_8bit. I can load version 2 with load_in_4bit.

def init_model():
    model = AutoModelForCausalLM.from_pretrained(
        "./model/Baichuan2-13B-Chat",
        torch_dtype=torch.float16,
        device_map="auto",
        load_in_8bit=True,
        trust_remote_code=True,
    )
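For completeness, an equivalent and more explicit way to request 8-bit loading is to pass a BitsAndBytesConfig instead of the bare load_in_8bit flag; a minimal sketch mirroring the snippet above:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

def init_model():
    # Same request as load_in_8bit=True, but with the quantization settings spelled out.
    model = AutoModelForCausalLM.from_pretrained(
        "./model/Baichuan2-13B-Chat",
        torch_dtype=torch.float16,
        device_map="auto",
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
        trust_remote_code=True,
    )
    return model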

Baichuan Intelligent Technology org

I cannot reproduce your error. Did you pull the latest code?

I ran into the same problem when doing the int8 conversion:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained('Baichuan2-13B-Chat', load_in_8bit=True, device_map="auto", trust_remote_code=True)
model.save_pretrained('Baichuan2-13B-Chat-int8')

File ~/.cache/huggingface/modules/transformers_modules/Baichuan2-13B-Chat/modeling_baichuan.py:537, in BaichuanForCausalLM.__init__(self, config, *model_args, **model_kwargs)
535 self.model = BaichuanModel(config)
536 self.lm_head = NormHead(config.hidden_size, config.vocab_size, bias=False)
--> 537 if hasattr(config, "quantization_config") and config.quantization_config['load_in_4bit']:
538 try:
539 from .quantizer import quantize_offline, init_model_weight_int4

TypeError: 'BitsAndBytesConfig' object is not subscriptable

I cannot reproduce your error. Did you pull the latest code?

Yes, it is the latest, including the fix for the 'BitsAndBytesConfig' object is not subscriptable error.

I am not sure if it makes a difference, but I downloaded the files locally and put them in the ./model folder.

@XuWave you need to download the latest modeling_baichuan.py.
But there is still an error when loading in 8-bit; loading in 4-bit is OK.

I think this version may be taking much more memory to load in 8-bit than Baichuan version 1. If you could confirm that, then it could be a memory issue.


@BBLL3456 90GB of RAM, 32GB of GPU memory.

Baichuan Intelligent Technology org

I think this version may be taking much more memory to load in 8-bit than Baichuan version 1. If you could confirm that, then it could be a memory issue.

For int8, 13B-Chat will use about 14.2GiB of memory.

Baichuan Intelligent Technology org

I ran into the same problem when doing the int8 conversion:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained('Baichuan2-13B-Chat', load_in_8bit=True, device_map="auto", trust_remote_code=True)
model.save_pretrained('Baichuan2-13B-Chat-int8')

File ~/.cache/huggingface/modules/transformers_modules/Baichuan2-13B-Chat/modeling_baichuan.py:537, in BaichuanForCausalLM.__init__(self, config, *model_args, **model_kwargs)
535 self.model = BaichuanModel(config)
536 self.lm_head = NormHead(config.hidden_size, config.vocab_size, bias=False)
--> 537 if hasattr(config, "quantization_config") and config.quantization_config['load_in_4bit']:
538 try:
539 from .quantizer import quantize_offline, init_model_weight_int4

TypeError: 'BitsAndBytesConfig' object is not subscriptable

Your code is not the latest. The line

--> 537 if hasattr(config, "quantization_config") and config.quantization_config['load_in_4bit']:

has been changed to:

if hasattr(config, "quantization_config") and isinstance(config.quantization_config, dict) and config.quantization_config.get('load_in_4bit', False):
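A slightly more defensive variant of that check, which works whether quantization_config arrives as a plain dict or as a BitsAndBytesConfig object, could look like this (a sketch of the idea, not the repository's exact code):

qconf = getattr(config, "quantization_config", None)
if qconf is not None:
    if isinstance(qconf, dict):
        # older checkpoints store the quantization settings as a plain dict
        load_in_4bit = qconf.get("load_in_4bit", False)
    else:
        # newer transformers attach a BitsAndBytesConfig object instead
        load_in_4bit = getattr(qconf, "load_in_4bit", False)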

I think this version may be taking much more memory to load in 8-bit than Baichuan version 1. If you could confirm that, then it could be a memory issue.

For int8, 13B-Chat will use about 14.2GiB of memory.

I have a 16GB GPU and 32GB of RAM, and I can't load version 2 in 8-bit.

I ran into the same problem when doing the int8 conversion:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained('Baichuan2-13B-Chat', load_in_8bit=True, device_map="auto", trust_remote_code=True)
model.save_pretrained('Baichuan2-13B-Chat-int8')

File ~/.cache/huggingface/modules/transformers_modules/Baichuan2-13B-Chat/modeling_baichuan.py:537, in BaichuanForCausalLM.__init__(self, config, *model_args, **model_kwargs)
535 self.model = BaichuanModel(config)
536 self.lm_head = NormHead(config.hidden_size, config.vocab_size, bias=False)
--> 537 if hasattr(config, "quantization_config") and config.quantization_config['load_in_4bit']:
538 try:
539 from .quantizer import quantize_offline, init_model_weight_int4

TypeError: 'BitsAndBytesConfig' object is not subscriptable

Your code is not the latest. The line

--> 537 if hasattr(config, "quantization_config") and config.quantization_config['load_in_4bit']:

has been changed to:

if hasattr(config, "quantization_config") and isinstance(config.quantization_config, dict) and config.quantization_config.get('load_in_4bit', False):

Yes, I know; I was just replying to @XuWave. I am using the latest code. Like I said, I can run 4-bit with no issue, and I am pretty sure 13B V2 is using more GPU memory than V1. Could you please compare the code of V1 and V2? I am using the same environment for both.

Baichuan Intelligent Technology org

Yes, V2 will use more memory than V1. This is mainly due to the following factors:

  1. the vocabulary is about 2x the size of V1's (a rough estimate is sketched below);
  2. the quantizer uses a mixed-precision 8-bit quantization op.

If we also take GPU memory fragmentation into account, it is possible that 15GB is not enough.
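A rough back-of-the-envelope estimate of the vocabulary effect, assuming the published Baichuan2-13B config values vocab_size = 125,696 and hidden_size = 5,120, and that the embedding and NormHead (lm_head) weights stay in fp16 rather than int8:

# Two vocab-sized fp16 matrices: the input embedding and the NormHead (lm_head).
vocab_size = 125_696   # Baichuan2 vocabulary, roughly 2x the 64,000 of Baichuan 1
hidden_size = 5_120    # 13B hidden size
bytes_per_fp16 = 2

per_matrix_gib = vocab_size * hidden_size * bytes_per_fp16 / 2**30
total_gib = 2 * per_matrix_gib
print(f"~{per_matrix_gib:.1f} GiB per matrix, ~{total_gib:.1f} GiB for both")  # ~1.2 / ~2.4 GiB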

Sad, I can't run Baichuan2 in 8-bit on my machine then...

[Attached screenshot: GPU Baichuan2.png, showing GPU memory usage]

Well, according to your GitHub page it is supposed to be more efficient than version 1, requiring only 14.2GB as opposed to 15.8GB for version 1. So I should be able to load it on my machine.

Baichuan Intelligent Technology org

I have no idea. On my machine, the memory usage for 8-bit loading is about 15241971712 bytes / 2**30 = 14.2GiB.

Baichuan Intelligent Technology org

I have no idea. On my machine, the memory usage for 8-bit loading is about 15241971712 bytes / 2**30 = 14.2GiB.

Just now, I tested the 13B-int8 GPU memory usage: nvidia-smi shows 16.05GB, while torch.cuda.max_memory_allocated() reports 14.2GB. So is there other memory used by the model that torch cannot see?

Baichuan Intelligent Technology org

I guess some ops use additional memory. I have no idea how to solve it.
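Part of the gap between nvidia-smi and PyTorch's counters is expected: nvidia-smi also counts the CUDA context, library workspaces, and the caching allocator's reserved-but-unused blocks. A quick way to see the breakdown after loading the model (a sketch):

import torch

gib = 2**30
print(f"allocated     : {torch.cuda.memory_allocated() / gib:.2f} GiB")
print(f"max allocated : {torch.cuda.max_memory_allocated() / gib:.2f} GiB")
print(f"reserved      : {torch.cuda.memory_reserved() / gib:.2f} GiB")      # allocator cache
print(f"max reserved  : {torch.cuda.max_memory_reserved() / gib:.2f} GiB")
# The remainder shown by nvidia-smi is mostly the CUDA context and cuBLAS/cuDNN
# workspaces, which PyTorch's counters do not track.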

Is the model first loaded in fp32 instead of 16-bit? It is during the initial loading that the error is raised.

I also saw some discussion of this same issue on your GitHub page.

I am pasting the entire error below:

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/user/miniconda3/envs/baichuan2/lib/python3.10/site-packages/streamlit/runtime/caching/cache_utils.py", line 311, in _handle_cache_miss
cached_result = cache.read_result(value_key)
File "/home/user/miniconda3/envs/baichuan2/lib/python3.10/site-packages/streamlit/runtime/caching/cache_resource_api.py", line 500, in read_result
raise CacheKeyNotFoundError()
streamlit.runtime.caching.cache_errors.CacheKeyNotFoundError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/user/miniconda3/envs/baichuan2/lib/python3.10/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 552, in _run_script
exec(code, module.__dict__)
File "/home/user/baichuan2/web_demo.py", line 72, in
main()
File "/home/user/baichuan2/web_demo.py", line 51, in main
model, tokenizer = init_model()
File "/home/user/miniconda3/envs/baichuan2/lib/python3.10/site-packages/streamlit/runtime/caching/cache_utils.py", line 211, in wrapper
return cached_func(*args, **kwargs)
File "/home/user/miniconda3/envs/baichuan2/lib/python3.10/site-packages/streamlit/runtime/caching/cache_utils.py", line 240, in call
return self._get_or_create_cached_value(args, kwargs)
File "/home/user/miniconda3/envs/baichuan2/lib/python3.10/site-packages/streamlit/runtime/caching/cache_utils.py", line 266, in _get_or_create_cached_value
return self._handle_cache_miss(cache, value_key, func_args, func_kwargs)
File "/home/user/miniconda3/envs/baichuan2/lib/python3.10/site-packages/streamlit/runtime/caching/cache_utils.py", line 320, in _handle_cache_miss
computed_value = self._info.func(*func_args, **func_kwargs)
File "/home/user/baichuan2/web_demo.py", line 13, in init_model
model = AutoModelForCausalLM.from_pretrained(
File "/home/user/miniconda3/envs/baichuan2/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 558, in from_pretrained
return model_class.from_pretrained(
File "/home/user/.cache/huggingface/modules/transformers_modules/baichuan-inc/Baichuan2-13B-Chat/670d17ee403f45334f53121d72feff623cc37de1/modeling_baichuan.py", line 669, in from_pretrained
return super(BaichuanForCausalLM, cls).from_pretrained(pretrained_model_name_or_path, *model_args,
File "/home/user/miniconda3/envs/baichuan2/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3114, in from_pretrained
raise ValueError(
ValueError:
Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit
the quantized model. If you want to dispatch the model on the CPU or the disk while keeping
these modules in 32-bit, you need to set load_in_8bit_fp32_cpu_offload=True and pass a custom
device_map to from_pretrained. Check
https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu
for more details.

Baichuan Intelligent Technology org

Not really.

Would you be able to provide an int8 version?

Baichuan Intelligent Technology org

Would you be able to provide an int8 version?

We have no plan to provide an int8 version for now.

BBLL3456 changed discussion status to closed
