int32 overflow when using a large enough dataset

#152
by EyalItskov - opened

Hi,

The

output_dataset = Dataset.from_dict(dataset_dict)

call from within the Tokenizer causes an int32 overflow when the dataset is too big (exceeding max(int32) in size):

[attached screenshot: int32 overflow error]

This is due to this function in the Hugging Face `datasets` code:

from functools import reduce
from operator import mul

import numpy as np
import pyarrow as pa


def numpy_to_pyarrow_listarray(arr: np.ndarray, type: pa.DataType = None) -> pa.ListArray:
    """Build a PyArrow ListArray from a multidimensional NumPy array"""
    arr = np.array(arr)
    values = pa.array(arr.flatten(), type=type)
    for i in range(arr.ndim - 1):
        n_offsets = reduce(mul, arr.shape[: arr.ndim - i - 1], 1)
        step_offsets = arr.shape[arr.ndim - i - 1]
        # int32 offsets overflow once the flattened array holds more than max(int32) elements
        offsets = pa.array(np.arange(n_offsets + 1) * step_offsets, type=pa.int32())
        values = pa.ListArray.from_arrays(offsets, values)
    return values
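
To illustrate, here is a minimal sketch of how the offsets blow past the int32 range; the row count and row length are assumed purely for illustration:

import numpy as np
import pyarrow as pa

# Assumed shape: ~1.1M rows of length 2048 -> 2,252,800,000 flattened
# elements, which exceeds max(int32) = 2**31 - 1 = 2,147,483,647.
n_rows, row_len = 1_100_000, 2048

# Mirrors the offsets computed for the last dimension above; they are
# int64 in NumPy, so the values themselves are fine...
offsets = np.arange(n_rows + 1, dtype=np.int64) * row_len
print(offsets[-1] > np.iinfo(np.int32).max)  # True

# ...but casting them to int32 offsets, as pa.int32() does, should fail
# with an out-of-range error:
try:
    pa.array(offsets, type=pa.int32())
except (pa.ArrowInvalid, OverflowError) as e:
    print(type(e).__name__, e)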

I found a workaround: if the dictionary values are stored as object-dtype np.arrays rather than Python lists, the overflow doesn't happen:

        ## WORKAROUND: store the ragged columns as object-dtype NumPy arrays
        print('Changing lists to np.arrays..')
        dataset_dict['input_ids'] = np.array(dataset_dict['input_ids'], dtype='object')
        dataset_dict['gene'] = np.array(dataset_dict['gene'], dtype='object')

        # create dataset
        output_dataset = Dataset.from_dict(dataset_dict)
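
For reference, Arrow itself offers 64-bit offsets via its large_list type. This is a minimal sketch of that Arrow-level alternative on a toy array (it is not what `datasets` does internally, just a demonstration that pa.LargeListArray sidesteps the int32 limit):

import numpy as np
import pyarrow as pa

# Toy 2x3 array; large_list offsets are int64, so they cannot overflow
# at 2**31 - 1 the way pa.int32() list offsets do.
arr = np.arange(6, dtype=np.int64).reshape(2, 3)
values = pa.array(arr.flatten())
offsets = pa.array(np.arange(arr.shape[0] + 1) * arr.shape[1], type=pa.int64())
large = pa.LargeListArray.from_arrays(offsets, values)
print(large.type)  # large_list<item: int64>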

Has anyone come across this?

Kind regards,
Eyal.

Thank you for noting this! We did not come across this when tokenizing Genecorpus-30M. It would be very helpful if you could check the Hugging Face Datasets issues to see whether there is a suggested solution, or open a new issue if this question has not already been raised. It would also be great if you could update this discussion with any resolution you find from Hugging Face, to help other users who may encounter this issue.

ctheodoris changed discussion status to closed

You can split the dataset into chunks and then concatenate them back together, as in the sketch below.
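
A minimal sketch of that suggestion, assuming the tokenized columns live in a dataset_dict as above (the helper name and chunk size are illustrative):

from datasets import Dataset, concatenate_datasets

def dataset_from_dict_chunked(dataset_dict, chunk_size=100_000):
    # Build each shard small enough that its flattened columns stay well
    # below max(int32) elements, then concatenate the shards; each shard
    # keeps its own offsets, so none of them overflows.
    n_rows = len(next(iter(dataset_dict.values())))
    shards = [
        Dataset.from_dict({k: v[i:i + chunk_size] for k, v in dataset_dict.items()})
        for i in range(0, n_rows, chunk_size)
    ]
    return concatenate_datasets(shards)

# e.g. output_dataset = dataset_from_dict_chunked(dataset_dict)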
