int32 overflow when using a large enough dataset
Hi,
The
output_dataset = Dataset.from_dict(dataset_dict)
call from within the Tokenizer causes an int32 overflow when the dataset is too big (exceeding max(int32) in size):
This is due to the following function in the Huggingface datasets code:
def numpy_to_pyarrow_listarray(arr: np.ndarray, type: pa.DataType = None) -> pa.ListArray:
    """Build a PyArrow ListArray from a multidimensional NumPy array"""
    arr = np.array(arr)
    values = pa.array(arr.flatten(), type=type)
    for i in range(arr.ndim - 1):
        n_offsets = reduce(mul, arr.shape[: arr.ndim - i - 1], 1)
        step_offsets = arr.shape[arr.ndim - i - 1]
        offsets = pa.array(np.arange(n_offsets + 1) * step_offsets, type=pa.int32())
        values = pa.ListArray.from_arrays(offsets, values)
    return values
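To make the limit concrete: the offsets built in that loop count flattened elements and are created with type=pa.int32(), so once the total number of values passes the int32 maximum they can no longer be represented. A rough sketch of the arithmetic (the shapes are hypothetical, just to illustrate the threshold):

import numpy as np

# Hypothetical example: 25M cells x 100 tokens each = 2.5e9 flattened values,
# which is larger than the maximum int32 offset (2_147_483_647).
n_rows, n_tokens = 25_000_000, 100
last_offset = n_rows * n_tokens              # value of the final list offset
print(last_offset > np.iinfo(np.int32).max)  # True -> int32 offsets cannot hold it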
I found that if the values in the dictionary are saved as np.array (with dtype='object') rather than Python lists, this doesn't happen:
## DEBUG
print('Changing lists to np.arrays..')
dataset_dict['input_ids'] = np.array(dataset_dict['input_ids'], dtype='object')
dataset_dict['gene'] = np.array(dataset_dict['gene'], dtype='object')
## DEBUG
# create dataset
output_dataset = Dataset.from_dict(dataset_dict)
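For reference, here is the same workaround as a self-contained snippet (the toy data is only for illustration; the column names 'input_ids' and 'gene' are taken from the debug code above). Presumably the object-dtype arrays are handled row by row, so the single large int32 offsets array built by numpy_to_pyarrow_listarray is avoided:

import numpy as np
from datasets import Dataset

# Toy stand-in for the real tokenizer output (illustration only).
dataset_dict = {
    "input_ids": [[1, 2, 3], [4, 5]],
    "gene": [[10, 11, 12], [13, 14]],
}

# Convert list columns to object-dtype NumPy arrays before building the dataset.
dataset_dict["input_ids"] = np.array(dataset_dict["input_ids"], dtype="object")
dataset_dict["gene"] = np.array(dataset_dict["gene"], dtype="object")

output_dataset = Dataset.from_dict(dataset_dict)
print(output_dataset)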
Has anyone come across this?
Kind regards,
Eyal.
Thank you for noting this! We did not come across this when tokenizing Genecorpus-30M. It would be very helpful if you could check the Huggingface Datasets issues to see if there is a suggested solution to this, or open a new issue if this question has not already been raised. It would be great if you could update this discussion with any resolution you may find from Huggingface to help other users who may encounter this issue.