OpenOrca Dataset 'Fail to generate dataset'

#14
by mattma1970 - opened

I've been trying to download the OpenOrca dataset to use it to fine-tune some other models. It seems there is something wrong with the recently committed Parquet conversion functions. Downloading with load_dataset("Open-Orca/OpenOrca") results in an error when it builds the training dataset. This also happens when I clone the repo manually.
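
For reference, the failing call is just the default loader, nothing custom:

```python
from datasets import load_dataset

# Fails while building the train split with DatasetGenerationError
dataset = load_dataset("Open-Orca/OpenOrca")
```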

ArrowInvalid Traceback (most recent call last)
File ~/Documents/Repos/llama2-4int/.venv/lib/python3.10/site-packages/datasets/builder.py:1879, in ArrowBasedBuilder._prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
1878 _time = time.time()
-> 1879 for _, table in generator:
1880 if max_shard_size is not None and writer._num_bytes > max_shard_size:

File ~/Documents/Repos/llama2-4int/.venv/lib/python3.10/site-packages/datasets/packaged_modules/parquet/parquet.py:73, in Parquet._generate_tables(self, files)
72 with open(file, "rb") as f:
---> 73 parquet_file = pq.ParquetFile(f)
74 try:

File ~/Documents/Repos/llama2-4int/.venv/lib/python3.10/site-packages/pyarrow/parquet/core.py:341, in ParquetFile.__init__(self, source, metadata, common_metadata, read_dictionary, memory_map, buffer_size, pre_buffer, coerce_int96_timestamp_unit, decryption_properties, thrift_string_size_limit, thrift_container_size_limit, filesystem)
340 self.reader = ParquetReader()
--> 341 self.reader.open(
342 source, use_memory_map=memory_map,
343 buffer_size=buffer_size, pre_buffer=pre_buffer,
344 read_dictionary=read_dictionary, metadata=metadata,
345 coerce_int96_timestamp_unit=coerce_int96_timestamp_unit,
346 decryption_properties=decryption_properties,
347 thrift_string_size_limit=thrift_string_size_limit,
348 thrift_container_size_limit=thrift_container_size_limit,
349 )
350 self.common_metadata = common_metadata
...
1911 e = e.context
-> 1912 raise DatasetGenerationError("An error occurred while generating the dataset") from e
1914 yield job_id, True, (total_num_examples, total_num_bytes, writer._features, num_shards, shard_lengths)

DatasetGenerationError: An error occurred while generating the dataset

Anyone else having the same issue? Any solution?

More details after debugging in the IDE:

Exception has occurred: DatasetGenerationError
An error occurred while generating the dataset
pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

The above exception was the direct cause of the following exception:

File "/home/mtman/Documents/Repos/llama2-4int/data.py", line 6, in
dataset = load_dataset("./data/OpenOrca/")
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset
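
The "Parquet magic bytes not found in footer" message means the file PyArrow opened is not actually Parquet on disk. A quick sanity check along these lines can confirm it (the path is illustrative; a real Parquet file starts and ends with the 4-byte magic b'PAR1', whereas something like a Git LFS pointer stub is plain text and fails both checks):

```python
# Point this at one of the cloned .parquet files (path is illustrative)
path = "./data/OpenOrca/1M-GPT4-Augmented.parquet"

with open(path, "rb") as f:
    head = f.read(4)
    f.seek(-4, 2)   # 4 bytes before end of file
    tail = f.read(4)

# Both should print b'PAR1'; anything else explains the ArrowInvalid above
print(head, tail)
```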

UPDATE: I upgraded datasets to v2.14.5 and that solved the problem. I had been using v2.13.
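
For anyone comparing environments, checking the installed version is a one-liner (the upgrade itself was just pip install --upgrade datasets in the same virtualenv):

```python
import datasets

# 2.13.x reproduced the error for me; 2.14.5 loads the dataset cleanly
print(datasets.__version__)
```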

mattma1970 changed discussion status to closed

@mattma1970 FYI, I recommend using the SlimOrca subset of our data. It contains verified answers and is smaller; training on all 4.5M entries will cost hundreds of dollars, and to be frank, some of them are of little to no learning value. https://huggingface.co/Open-Orca/SlimOrca/
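
Loading it follows the same pattern as the full dataset, just pointed at the SlimOrca repo (a minimal sketch, assuming the default config):

```python
from datasets import load_dataset

# Smaller subset with verified answers
dataset = load_dataset("Open-Orca/SlimOrca")
```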

Thanks for the feedback. I'll check it out.
Cheers
Matt
