How to enable the fp16 option in fine-tuning?

#45

by comet24082002 - opened Apr 13

comet24082002

Apr 13

I want to set --fp16 True when fine-tuning the BGE-M3 model so that I can increase per_device_train_batch_size. However, when I run it, I get this error: ValueError: Type fp16 is not supported.
Please help me!

Shitao

Beijing Academy of Artificial Intelligence org Apr 13

edited Apr 13

Which script did you use for fine-tuning? You can use our scripts from https://github.com/FlagOpen/FlagEmbedding/tree/master/examples .
Also, fp16 is not supported on CPU.
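
To rule out the CPU case, one option is to check whether PyTorch can see a CUDA device before passing --fp16. This is only a minimal sketch, assuming PyTorch is installed in the training environment; the helper names `cuda_visible` and `precision_args` are hypothetical and not part of FlagEmbedding.

```python
from typing import List


def precision_args(cuda_available: bool) -> List[str]:
    """Return the mixed-precision arguments to append to the training command.

    Hypothetical helper: request --fp16 only when a CUDA device is visible,
    because fp16 training is not supported on CPU.
    """
    return ["--fp16", "True"] if cuda_available else []


def cuda_visible() -> bool:
    """Report whether PyTorch sees a CUDA device (False if torch is missing)."""
    try:
        import torch
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


if __name__ == "__main__":
    visible = cuda_visible()
    print("CUDA visible:", visible)
    print("Extra arguments:", precision_args(visible))
```

If this reports that CUDA is not visible, the run is falling back to CPU and --fp16 should be dropped until the GPU is available.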

comet24082002

Apr 13

edited Apr 13

@Shitao, I used your fine-tuning example script like this:

Shitao

Beijing Academy of Artificial Intelligence org Apr 13

@comet24082002, can you provide the detailed log for this error?

comet24082002

Apr 13

edited Apr 13

2024-04-13 13:46:26.452703: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-13 13:46:26.452810: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-13 13:46:26.595497: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
[2024-04-13 13:46:36,623] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-13 13:46:37,081] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-04-13 13:46:37,081] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
tokenizer_config.json: 100%|███████████████████| 444/444 [00:00<00:00, 2.29MB/s]
sentencepiece.bpe.model: 100%|█████████████| 5.07M/5.07M [00:00<00:00, 54.0MB/s]
special_tokens_map.json: 100%|█████████████████| 964/964 [00:00<00:00, 4.91MB/s]
tokenizer.json: 100%|███████████████████████| 17.1M/17.1M [00:00<00:00, 184MB/s]
config.json: 100%|█████████████████████████████| 687/687 [00:00<00:00, 3.85MB/s]
pytorch_model.bin: 100%|████████████████████| 2.27G/2.27G [00:10<00:00, 213MB/s]
/opt/conda/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.get(instance, owner)()
Generating train split: 5381 examples [00:00, 5492.29 examples/s]
/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py:432: FutureWarning: Passing the following arguments to `Accelerator` is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches', 'even_batches', 'use_seedable_sampler']). Please pass an `accelerate.DataLoaderConfiguration` instead:
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
warnings.warn(
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.10/site-packages/FlagEmbedding/baai_general_embedding/finetune/run.py", line 111, in <module>
main()
File "/opt/conda/lib/python3.10/site-packages/FlagEmbedding/baai_general_embedding/finetune/run.py", line 102, in main
trainer.train()
File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1771, in train
return inner_training_loop(
File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1936, in _inner_training_loop
model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1255, in prepare
result = self._prepare_deepspeed(*args)
File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1640, in _prepare_deepspeed
engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
File "/opt/conda/lib/python3.10/site-packages/deepspeed/__init__.py", line 176, in initialize
engine = DeepSpeedEngine(args=args,
File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 240, in __init__
self._do_sanity_check()
File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1040, in _do_sanity_check
raise ValueError("Type fp16 is not supported.")
ValueError: Type fp16 is not supported.
[2024-04-13 13:46:58,399] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 173) of binary: /opt/conda/bin/python3.10
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

FlagEmbedding.baai_general_embedding.finetune.run FAILED

Failures:

Root Cause (first observed failure):
[0]:
time : 2024-04-13_13:46:58
host : a8eb1bc79923
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 173)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

@Shitao, this is the full error log that I got.

Shitao

Beijing Academy of Artificial Intelligence org Apr 13

This error seems to be related to DeepSpeed. You can upgrade DeepSpeed and try again.
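
One possible cause, offered as an assumption and not a confirmed diagnosis: the traceback shows DeepSpeed's `_do_sanity_check` rejecting fp16, and recent DeepSpeed releases appear to base that decision on the GPU's CUDA compute capability (7.0 or higher), not only on whether a GPU is present. If that is the case here, upgrading would not change the result on an older GPU. The sketch below mirrors that assumed rule and prints the installed versions; the names `fp16_expected_to_pass`, `installed_version`, `current_capability`, and the threshold constant are illustrative and not DeepSpeed's API.

```python
from importlib import metadata
from typing import Optional, Tuple

# Assumption (not confirmed in this thread): fp16 passes the sanity check
# only when the GPU's compute capability major version is at least 7.
ASSUMED_MIN_MAJOR = 7


def fp16_expected_to_pass(capability: Optional[Tuple[int, int]]) -> bool:
    """Apply the assumed rule to a (major, minor) compute capability."""
    if capability is None:  # no CUDA device visible, e.g. CPU-only run
        return False
    major, _minor = capability
    return major >= ASSUMED_MIN_MAJOR


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def current_capability() -> Optional[Tuple[int, int]]:
    """Return the compute capability of GPU 0, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return tuple(torch.cuda.get_device_capability(0))


if __name__ == "__main__":
    for name in ("deepspeed", "torch", "transformers", "accelerate"):
        print(f"{name}: {installed_version(name)}")
    capability = current_capability()
    print("compute capability:", capability)
    print("fp16 expected to pass:", fp16_expected_to_pass(capability))
```

Under this assumption, a GPU reporting capability (6, 0) would be rejected while one reporting (7, 5) would be accepted, so posting this output alongside the error log would help narrow down the cause.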

comet24082002

Apr 13

@Shitao, I ran "!pip install -U deepspeed", but it didn't solve the problem.

zzzzzz69

7 days ago

I got the same error. Would you please share your solution? Many thanks!
