int32 overflow when using a large enough dataset
Hi,
The
output_dataset = Dataset.from_dict(dataset_dict)
call from within the Tokenizer causes an int32 overflow when the dataset is too big (exceeding max(int32) in size):
This is due to the following function in the Huggingface datasets code:
def numpy_to_pyarrow_listarray(arr: np.ndarray, type: pa.DataType = None) -> pa.ListArray:
    """Build a PyArrow ListArray from a multidimensional NumPy array"""
    arr = np.array(arr)
    values = pa.array(arr.flatten(), type=type)
    for i in range(arr.ndim - 1):
        n_offsets = reduce(mul, arr.shape[: arr.ndim - i - 1], 1)
        step_offsets = arr.shape[arr.ndim - i - 1]
        offsets = pa.array(np.arange(n_offsets + 1) * step_offsets, type=pa.int32())
        values = pa.ListArray.from_arrays(offsets, values)
    return values
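To make the limit concrete: the offsets built in that loop count flattened elements and are created with type=pa.int32(), so once the total number of values passes the int32 maximum they can no longer be represented. A rough sketch of the arithmetic (the shapes are hypothetical, just to illustrate the threshold):

import numpy as np

# Hypothetical example: 25M cells x 100 tokens each = 2.5e9 flattened values,
# which is larger than the maximum int32 offset (2_147_483_647).
n_rows, n_tokens = 25_000_000, 100
last_offset = n_rows * n_tokens              # value of the final list offset
print(last_offset > np.iinfo(np.int32).max)  # True -> int32 offsets cannot hold it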
I found that if the values in the dictionary are saved as np.array (with dtype='object') rather than Python lists, this doesn't happen:
## DEBUG
print('Changing lists to np.arrays..')
dataset_dict['input_ids'] = np.array(dataset_dict['input_ids'], dtype='object')
dataset_dict['gene'] = np.array(dataset_dict['gene'], dtype='object')
## DEBUG
# create dataset
output_dataset = Dataset.from_dict(dataset_dict)
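For reference, here is the same workaround as a self-contained snippet (the toy data is only for illustration; the column names 'input_ids' and 'gene' are taken from the debug code above). Presumably the object-dtype arrays are handled row by row, so the single large int32 offsets array built by numpy_to_pyarrow_listarray is avoided:

import numpy as np
from datasets import Dataset

# Toy stand-in for the real tokenizer output (illustration only).
dataset_dict = {
    "input_ids": [[1, 2, 3], [4, 5]],
    "gene": [[10, 11, 12], [13, 14]],
}

# Convert list columns to object-dtype NumPy arrays before building the dataset.
dataset_dict["input_ids"] = np.array(dataset_dict["input_ids"], dtype="object")
dataset_dict["gene"] = np.array(dataset_dict["gene"], dtype="object")

output_dataset = Dataset.from_dict(dataset_dict)
print(output_dataset)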
Has anyone come across this?
Kind regards,
Eyal.
Thank you for noting this! We did not come across this when tokenizing Genecorpus-30M. It would be very helpful if you could check the Huggingface Datasets issues to see if there is a suggested solution to this, or open a new issue if this question has not already been raised. It would be great if you could update this discussion with any resolution you may find from Huggingface to help other users who may encounter this issue.