There was a problem preparing the pre-training corpus

#315
by babykai - opened

Hello, we are using the Geneformer example notebook tokenizing_scRNAseq_data.ipynb.

When we ran the script on our pre-training corpus (h5ad format, about 3 million cells), it reported the following error:
Traceback (most recent call last):
  File "geneformer_new/Geneformer/examples/pretraining_new_model/script_cellxgene.py", line 24, in <module>
    tk.tokenize_data("all",
  File "anaconda3/envs/geneformer_new1/lib/python3.10/site-packages/geneformer/tokenizer.py", line 157, in tokenize_data
    tokenized_dataset = self.create_dataset(
  File "anaconda3/envs/geneformer_new1/lib/python3.10/site-packages/geneformer/tokenizer.py", line 352, in create_dataset
    output_dataset = Dataset.from_dict(dataset_dict)
  File "anaconda3/envs/geneformer_new1/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 911, in from_dict
    pa_table = InMemoryTable.from_pydict(mapping=mapping)
  File "anaconda3/envs/geneformer_new1/lib/python3.10/site-packages/datasets/table.py", line 762, in from_pydict
    return cls(pa.Table.from_pydict(*args, **kwargs))
  File "pyarrow/table.pxi", line 1813, in pyarrow.lib._Tabular.from_pydict
  File "pyarrow/table.pxi", line 5339, in pyarrow.lib._from_pydict
  File "pyarrow/array.pxi", line 374, in pyarrow.lib.asarray
  File "pyarrow/array.pxi", line 248, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 113, in pyarrow.lib._handle_arrow_array_protocol
  File "anaconda3/envs/geneformer_new1/lib/python3.10/site-packages/datasets/arrow_writer.py", line 188, in arrow_array
    out = list_of_np_array_to_pyarrow_listarray(data)
  File "anaconda3/envs/geneformer_new1/lib/python3.10/site-packages/datasets/features/features.py", line 1428, in list_of_np_array_to_pyarrow_listarray
    return list_of_pa_arrays_to_pyarrow_listarray(
  File "anaconda3/envs/geneformer_new1/lib/python3.10/site-packages/datasets/features/features.py", line 1420, in list_of_pa_arrays_to_pyarrow_listarray
    offsets = pa.array(offsets, type=pa.int32())
  File "pyarrow/array.pxi", line 340, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 86, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Value 2147486083 too large to fit in C integer type

How should this be resolved?
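For readers hitting the same message, the traceback suggests where the limit comes from: the failing line builds the list offsets with `pa.int32()`, so the running total of tokens across all cells has to fit in a signed 32-bit integer. The sketch below only checks that arithmetic. The helper names are hypothetical, and the figure of 2048 tokens per cell is an assumption used for illustration, not a value taken from this thread.

```python
# Largest value a signed 32-bit integer can hold.
INT32_MAX = 2**31 - 1  # 2147483647

# Value reported in the ArrowInvalid message in the traceback.
reported_offset = 2147486083


def fits_int32(value: int) -> bool:
    """Return True if value can be stored as a signed 32-bit integer."""
    return -(2**31) <= value <= INT32_MAX


def max_cells_before_overflow(tokens_per_cell: int) -> int:
    """Rough count of cells whose cumulative token offset still fits in int32.

    tokens_per_cell is an assumed, constant length per cell (hypothetical).
    """
    return INT32_MAX // tokens_per_cell


print(fits_int32(reported_offset))      # False
print(reported_offset - INT32_MAX)      # 2436 past the limit
print(max_cells_before_overflow(2048))  # 1048575 cells at 2048 tokens each
```

Under that assumed length, a single in-memory table would overflow after roughly one million cells, which is consistent with a corpus of about 3 million cells failing.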

Thank you for your question. Please try setting use_generator to True.
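A minimal sketch of what that suggestion might look like in a script, assuming the installed Geneformer version exposes `use_generator` and `file_format` as keyword arguments of `tokenize_data`. The wrapper function `tokenize_corpus`, its default values, and the example paths are hypothetical and only for illustration.

```python
def tokenize_corpus(data_dir, output_dir, prefix="corpus", nproc=4):
    """Tokenize h5ad files in data_dir with generator-based dataset creation.

    Hypothetical wrapper: the argument names and defaults here are
    illustrative, not part of Geneformer itself.
    """
    # Imported here so that defining the wrapper does not require
    # Geneformer to be installed.
    from geneformer import TranscriptomeTokenizer

    tk = TranscriptomeTokenizer(nproc=nproc)
    tk.tokenize_data(
        str(data_dir),
        str(output_dir),
        prefix,
        file_format="h5ad",
        # Assumption: with this flag the dataset is built from a generator
        # rather than from one in-memory dict, as suggested in the reply.
        use_generator=True,
    )


# Example call with placeholder paths (not run here):
# tokenize_corpus("all", "tokenized_output", prefix="pretraining_corpus")
```

The first argument `"all"` mirrors the data directory passed in the traceback; the output directory and prefix are placeholders.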

ctheodoris changed discussion status to closed

I have a related question. My dataset has about half a million cells, and I hit the same "Value 2147486083 too large to fit in C integer type" error. After setting use_generator (a bool) to True, the tokenized file was produced. Now I want to fine-tune a cell classifier on this data, but I get the error "Found dtype Long but expected Float". How can I solve it?

