OSError: google/flan-t5-xl does not appear to have a file named pytorch_model.bin, tf_model.h5, model.ckpt or flax_model.msgpack.

#19
by anirudhmittal - opened

I am getting this error while loading the model. It seems my current version of transformers cannot handle sharded checkpoints. Is there a workaround other than upgrading transformers?
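For context, a sharded checkpoint stores the weights as several `pytorch_model-0000X-of-0000Y.bin` files rather than a single `pytorch_model.bin`, which would explain the file-name error on an older release. A minimal sketch of a version check follows. The 4.18.0 threshold is an assumption taken from my recollection of the transformers documentation and should be verified against the release notes; the helper names are hypothetical.

```python
import re

# Assumed minimum transformers version with sharded-checkpoint support.
# Verify this against the transformers release notes before relying on it.
MIN_SHARDED_VERSION = (4, 18, 0)


def parse_version(version: str) -> tuple:
    """Turn a string such as '4.31.0' or '4.18.0.dev0' into (major, minor, patch)."""
    numbers = re.findall(r"\d+", version)[:3]
    # Pad short versions such as '4.18' with zeros.
    numbers += ["0"] * (3 - len(numbers))
    return tuple(int(n) for n in numbers)


def supports_sharded_checkpoints(version: str) -> bool:
    """Return True if the version is at least the assumed minimum."""
    return parse_version(version) >= MIN_SHARDED_VERSION
```

To check an installed copy, call `supports_sharded_checkpoints(transformers.__version__)` after `import transformers`.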

I am having a similar problem.
Here is the log:

Loading checkpoint shards:   0%|          | 0/2 [00:05<?, ?it/s]
Traceback (most recent call last):
  File "/beegfs/.global0/ws/s7949670-diplomarbeit/kernels/diplomarbeit-py310-no-tensorflow/lib/python3.10/site-packages/transformers/modeling_utils.py", line 460, in load_state_dict
    return torch.load(checkpoint_file, map_location="cpu")
  File "/beegfs/.global0/ws/s7949670-diplomarbeit/kernels/diplomarbeit-py310-no-tensorflow/lib/python3.10/site-packages/torch/serialization.py", line 809, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/beegfs/.global0/ws/s7949670-diplomarbeit/kernels/diplomarbeit-py310-no-tensorflow/lib/python3.10/site-packages/torch/serialization.py", line 1172, in _load
    result = unpickler.load()
  File "/beegfs/.global0/ws/s7949670-diplomarbeit/kernels/diplomarbeit-py310-no-tensorflow/lib/python3.10/site-packages/torch/serialization.py", line 1142, in persistent_load
    typed_storage = load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
  File "/beegfs/.global0/ws/s7949670-diplomarbeit/kernels/diplomarbeit-py310-no-tensorflow/lib/python3.10/site-packages/torch/serialization.py", line 1112, in load_tensor
    storage = zip_file.get_storage_from_record(name, numel, torch.UntypedStorage)._typed_storage()._untyped_storage
OSError: [Errno 14] Bad address

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/beegfs/.global0/ws/s7949670-diplomarbeit/kernels/diplomarbeit-py310-no-tensorflow/lib/python3.10/site-packages/transformers/modeling_utils.py", line 464, in load_state_dict
    if f.read(7) == "version":
  File "/sw/installed/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 64: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/beegfs/.global0/ws/s7949670-diplomarbeit/diplomarbeit/DSI-QG/run_arqm_clearml.py", line 391, in <module>
    main()
  File "/beegfs/.global0/ws/s7949670-diplomarbeit/diplomarbeit/DSI-QG/run_arqm_clearml.py", line 329, in main
    model = T5ForConditionalGeneration.from_pretrained(run_args.model_name, cache_dir='cache')
  File "/beegfs/.global0/ws/s7949670-diplomarbeit/kernels/diplomarbeit-py310-no-tensorflow/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2903, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/beegfs/.global0/ws/s7949670-diplomarbeit/kernels/diplomarbeit-py310-no-tensorflow/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3246, in _load_pretrained_model
    state_dict = load_state_dict(shard_file)
  File "/beegfs/.global0/ws/s7949670-diplomarbeit/kernels/diplomarbeit-py310-no-tensorflow/lib/python3.10/site-packages/transformers/modeling_utils.py", line 476, in load_state_dict
    raise OSError(
OSError: Unable to load weights from pytorch checkpoint file for 'cache/models--google--flan-t5-xl/snapshots/8772db7a7a11f7b08e6be7d7088f7a7fd4813bc5/pytorch_model-00001-of-00002.bin' at 'cache/models--google--flan-t5-xl/snapshots/8772db7a7a11f7b08e6be7d7088f7a7fd4813bc5/pytorch_model-00001-of-00002.bin'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 12446) of binary: /beegfs/.global0/ws/s7949670-diplomarbeit/kernels/diplomarbeit-py310-no-tensorflow/bin/python
Traceback (most recent call last):
  File "/beegfs/.global0/ws/s7949670-diplomarbeit/kernels/diplomarbeit-py310-no-tensorflow/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/beegfs/.global0/ws/s7949670-diplomarbeit/kernels/diplomarbeit-py310-no-tensorflow/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/beegfs/.global0/ws/s7949670-diplomarbeit/kernels/diplomarbeit-py310-no-tensorflow/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/beegfs/.global0/ws/s7949670-diplomarbeit/kernels/diplomarbeit-py310-no-tensorflow/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/beegfs/.global0/ws/s7949670-diplomarbeit/kernels/diplomarbeit-py310-no-tensorflow/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/beegfs/.global0/ws/s7949670-diplomarbeit/kernels/diplomarbeit-py310-no-tensorflow/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 

I am using Python 3.10.4, tokenizers==0.13.3, torch==2.0.1+cu118, and transformers==4.31.0.

Apparently, this happened because I did not have enough RAM.

Your error @guicalabria indeed seems linked to a lack of RAM.
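A rough way to check for this before loading is to compare the size of the cached shard files with the memory available on the machine. The sketch below is an illustration only: the helper names are hypothetical, the 1.2 safety factor is an arbitrary assumption, and the `os.sysconf` keys used here exist on Linux but not on every platform.

```python
import os
from pathlib import Path


def checkpoint_size_bytes(snapshot_dir: str) -> int:
    """Sum the sizes of all PyTorch weight files (*.bin) in a snapshot directory."""
    return sum(p.stat().st_size for p in Path(snapshot_dir).glob("*.bin"))


def available_ram_bytes() -> int:
    """Currently available physical memory (Linux; relies on os.sysconf)."""
    return os.sysconf("SC_AVPHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")


def fits_in_ram(
    snapshot_dir: str,
    available: int,
    num_processes: int = 1,
    safety_factor: float = 1.2,  # assumed headroom, not a measured value
) -> bool:
    """Rough check: every process loads its own copy of the weights."""
    needed = checkpoint_size_bytes(snapshot_dir) * num_processes * safety_factor
    return needed <= available
```

For example, `fits_in_ram(snapshot_dir, available_ram_bytes(), num_processes=4)` estimates whether four workers started by `torchrun` on one node could each load the checkpoint. If the estimate fails, passing `low_cpu_mem_usage=True` to `from_pretrained` may lower peak memory, though I have not verified that it resolves this particular error.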

@anirudhmittal this can also be due to a connection error. Could you try again and share the stack trace please?
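If a flaky connection is the suspect, retrying the load a few times is one way to rule it out. The helper below is a generic sketch with a hypothetical name (`retry`); it is not part of transformers, and the attempt count and delay are arbitrary defaults.

```python
import time


def retry(fn, attempts: int = 3, delay_seconds: float = 5.0, exceptions=(OSError,)):
    """Call fn() up to `attempts` times, sleeping between failed tries."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except exceptions as err:
            last_error = err
            # Wait before the next try, but not after the final one.
            if attempt < attempts:
                time.sleep(delay_seconds)
    raise last_error


# Usage sketch (commented out because it downloads several gigabytes):
# from transformers import T5ForConditionalGeneration
# model = retry(
#     lambda: T5ForConditionalGeneration.from_pretrained(
#         "google/flan-t5-xl", force_download=True
#     )
# )
```

Note that `force_download=True` discards any cached files and fetches them again on every attempt, so it is best reserved for cases where the cache itself may be incomplete.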
