Pretraining error reported

#341
by babykai - opened

Hello Author, we hit an error while re-pretraining the model on 5.5 million cells, and tokenization takes more than five days. The error is shown below. I checked the available storage (308T). How can I resolve the error and reduce the time tokenization takes? Thank you.
Code:
from geneformer import TranscriptomeTokenizer
tk = TranscriptomeTokenizer({}, nproc=30)
tk.tokenize_data("all_cellxgene",
                 "./Geneformer",
                 "all_cellxgene",
                 file_format="h5ad",
                 use_generator=True)

Error:
Traceback (most recent call last):
File "/weikai/anaconda3/lib/python3.9/site-packages/datasets/builder.py", line 1678, in _prepare_split_single
num_examples, num_bytes = writer.finalize()
File "/weikai/anaconda3/lib/python3.9/site-packages/datasets/arrow_writer.py", line 594, in finalize
self.stream.close()
File "/weikai/anaconda3/lib/python3.9/site-packages/fsspec/implementations/local.py", line 391, in close
return self.f.close()
OSError: [Errno 122] Disk quota exceeded

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/weikai/anaconda3/lib/python3.9/site-packages/datasets/builder.py", line 1703, in _prepare_split_single
num_examples, num_bytes = writer.finalize()
File "/weikai/anaconda3/lib/python3.9/site-packages/datasets/arrow_writer.py", line 589, in finalize
self._build_writer(self.schema)
File "/weikai/anaconda3/lib/python3.9/site-packages/datasets/arrow_writer.py", line 400, in _build_writer
self.pa_writer = self._WRITER_CLASS(self.stream, schema)
File "/weikai/anaconda3/lib/python3.9/site-packages/pyarrow/ipc.py", line 85, in __init__
self._open(sink, schema, options=options)
File "pyarrow/ipc.pxi", line 582, in pyarrow.lib._RecordBatchStreamWriter._open
File "pyarrow/io.pxi", line 2062, in pyarrow.lib.get_writer
File "pyarrow/io.pxi", line 215, in pyarrow.lib.NativeFile.get_output_stream
File "pyarrow/io.pxi", line 229, in pyarrow.lib.NativeFile._assert_writable
File "pyarrow/io.pxi", line 220, in pyarrow.lib.NativeFile._assert_open
ValueError: I/O operation on closed file

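Thank you for your interest in Geneformer! It appears you have run out of disk space ("OSError: [Errno 122] Disk quota exceeded"). Errno 122 refers to a quota limit rather than total free space, so the 308T free on the filesystem may not all be available to your account. You should write the output to a location with more available space or a higher quota.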
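A rough sketch of how one might check where the space is running out. The traceback passes through `datasets/builder.py`, which suggests that intermediate Arrow files are being written by the Hugging Face `datasets` builder; by default that builder usually writes under `~/.cache/huggingface/datasets`, which often sits on a quota-limited home directory. That this is the cause here is an assumption, not something confirmed in this thread. The helper names below (`nearest_existing`, `free_gib`, `candidate_write_locations`) are hypothetical, and `shutil.disk_usage` reports filesystem-level free space only, not your per-user quota, which has to be checked with your cluster's own quota tool.

```python
import os
import shutil
import tempfile
from pathlib import Path


def nearest_existing(path):
    """Walk up from `path` until a directory that already exists is found."""
    p = Path(path).expanduser().resolve()
    while not p.exists():
        p = p.parent
    return p


def free_gib(path):
    """Filesystem-level free space (GiB) on the volume that would hold `path`.

    This does NOT reflect per-user or per-group quotas.
    """
    return shutil.disk_usage(nearest_existing(path)).free / 2**30


def candidate_write_locations(output_dir):
    """Places the tokenization run may write to (an assumption, see lead-in)."""
    default_cache = os.path.join("~", ".cache", "huggingface", "datasets")
    return {
        "output_dir": output_dir,
        "hf_datasets_cache": os.environ.get("HF_DATASETS_CACHE", default_cache),
        "tmp": tempfile.gettempdir(),
    }


# "./Geneformer" is the output directory used in the question above.
for name, location in candidate_write_locations("./Geneformer").items():
    print(f"{name}: {location} -> {free_gib(location):.1f} GiB free")
```

If the cache location turns out to be the quota-limited one, a possible workaround is to point `HF_DATASETS_CACHE` at a directory on the large volume (for example via `export HF_DATASETS_CACHE=...` in the shell) before `datasets` is imported, and then rerun the tokenization.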

ctheodoris changed discussion status to closed
