OpenOrca Dataset 'Fail to generate dataset'

#14
by mattma1970 - opened

I've been trying to download the OpenOrca dataset to use it to fine-tune some other models. It seems there is something wrong with the recently committed Parquet conversion functions. Downloading with load_dataset("Open-Orca/OpenOrca") results in an error when it builds the training dataset. This also happens when I clone the repo manually.
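
For reference, the failing call is just the default loader, nothing custom:

```python
from datasets import load_dataset

# Fails while building the train split with DatasetGenerationError
dataset = load_dataset("Open-Orca/OpenOrca")
```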

ArrowInvalid Traceback (most recent call last)
File ~/Documents/Repos/llama2-4int/.venv/lib/python3.10/site-packages/datasets/builder.py:1879, in ArrowBasedBuilder._prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
1878 _time = time.time()
-> 1879 for _, table in generator:
1880 if max_shard_size is not None and writer._num_bytes > max_shard_size:

File ~/Documents/Repos/llama2-4int/.venv/lib/python3.10/site-packages/datasets/packaged_modules/parquet/parquet.py:73, in Parquet._generate_tables(self, files)
72 with open(file, "rb") as f:
---> 73 parquet_file = pq.ParquetFile(f)
74 try:

File ~/Documents/Repos/llama2-4int/.venv/lib/python3.10/site-packages/pyarrow/parquet/core.py:341, in ParquetFile.__init__(self, source, metadata, common_metadata, read_dictionary, memory_map, buffer_size, pre_buffer, coerce_int96_timestamp_unit, decryption_properties, thrift_string_size_limit, thrift_container_size_limit, filesystem)
340 self.reader = ParquetReader()
--> 341 self.reader.open(
342 source, use_memory_map=memory_map,
343 buffer_size=buffer_size, pre_buffer=pre_buffer,
344 read_dictionary=read_dictionary, metadata=metadata,
345 coerce_int96_timestamp_unit=coerce_int96_timestamp_unit,
346 decryption_properties=decryption_properties,
347 thrift_string_size_limit=thrift_string_size_limit,
348 thrift_container_size_limit=thrift_container_size_limit,
349 )
350 self.common_metadata = common_metadata
...
1911 e = e.context
-> 1912 raise DatasetGenerationError("An error occurred while generating the dataset") from e
1914 yield job_id, True, (total_num_examples, total_num_bytes, writer._features, num_shards, shard_lengths)

DatasetGenerationError: An error occurred while generating the dataset

Anyone else having the same issue? Any solution?

More details after debugging in the IDE:

Exception has occurred: DatasetGenerationError
An error occurred while generating the dataset
pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

The above exception was the direct cause of the following exception:

File "/home/mtman/Documents/Repos/llama2-4int/data.py", line 6, in
dataset = load_dataset("./data/OpenOrca/")
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset
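
The "Parquet magic bytes not found in footer" message means the file PyArrow opened is not actually Parquet on disk. A quick sanity check along these lines can confirm it (the path is illustrative; a real Parquet file starts and ends with the 4-byte magic b'PAR1', whereas something like a Git LFS pointer stub is plain text and fails both checks):

```python
# Point this at one of the cloned .parquet files (path is illustrative)
path = "./data/OpenOrca/1M-GPT4-Augmented.parquet"

with open(path, "rb") as f:
    head = f.read(4)
    f.seek(-4, 2)   # 4 bytes before end of file
    tail = f.read(4)

# Both should print b'PAR1'; anything else explains the ArrowInvalid above
print(head, tail)
```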

UPDATE: I upgraded datasets to v2.14.5 and that solved the problem. I had been using v2.13.
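
For anyone comparing environments, checking the installed version is a one-liner (the upgrade itself was just pip install --upgrade datasets in the same virtualenv):

```python
import datasets

# 2.13.x reproduced the error for me; 2.14.5 loads the dataset cleanly
print(datasets.__version__)
```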

mattma1970 changed discussion status to closed

@mattma1970 FYI, I recommend using the SlimOrca subset of our data. It contains verified answers and is smaller; training on all 4.5M entries will cost hundreds of dollars, and to be frank, some of them are of little to no learning value. https://huggingface.co/Open-Orca/SlimOrca/
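
Loading it follows the same pattern as the full dataset, just pointed at the SlimOrca repo (a minimal sketch, assuming the default config):

```python
from datasets import load_dataset

# Smaller subset with verified answers
dataset = load_dataset("Open-Orca/SlimOrca")
```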

Thanks for the feedback. I'll check it out.
Cheers
Matt
