2023-08-21 20:05:57,721:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:174 Label column name not given. Assuming it's 'label'.
2023-08-21 20:05:57,721:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:280 Proceeding with the following arguments:
2023-08-21 20:05:57,721:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:281 Namespace(dataset='OBELICS', config='default', split='train', feature=['text'], calculation=None, label_field='label', label_names=[], use_cache=False, out_dir='cache_dir', overwrite_previous=False, email=None, push_cache_to_hub=False, prepare_GUI_data=False, keep_local=True)
2023-08-21 20:05:57,722:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:147 Not using any cache; starting afresh
2023-08-21 20:05:58,740:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:333 Couldn't find a dataset script at /Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/OBELICS/OBELICS.py or any data file in the same directory. Couldn't find 'OBELICS' on the Hugging Face Hub either: FileNotFoundError: Couldn't find file at https://raw.githubusercontent.com/huggingface/datasets/master/datasets/OBELICS/OBELICS.py
Traceback (most recent call last):
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py", line 309, in main
    pass_args_to_DMT(
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py", line 160, in pass_args_to_DMT
    load_or_prepare(dataset_args, calculation=calculation, use_cache=use_cache)
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py", line 62, in load_or_prepare
    dstats = dataset_statistics.DatasetStatisticsCacheClass(**dataset_args,
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/data_measurements/dataset_statistics.py", line 229, in __init__
    self.dset = self._get_dataset()
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/data_measurements/dataset_statistics.py", line 240, in _get_dataset
    dset = ds_utils.load_truncated_dataset(self.dset_name, self.dset_config,
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/utils/dataset_utils.py", line 219, in load_truncated_dataset
    full_dataset = load_dataset(
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/venv/lib/python3.9/site-packages/datasets/load.py", line 1656, in load_dataset
    builder_instance = load_dataset_builder(
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/venv/lib/python3.9/site-packages/datasets/load.py", line 1439, in load_dataset_builder
    dataset_module = dataset_module_factory(
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/venv/lib/python3.9/site-packages/datasets/load.py", line 1189, in dataset_module_factory
    raise FileNotFoundError(
FileNotFoundError: Couldn't find a dataset script at /Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/OBELICS/OBELICS.py or any data file in the same directory. Couldn't find 'OBELICS' on the Hugging Face Hub either: FileNotFoundError: Couldn't find file at https://raw.githubusercontent.com/huggingface/datasets/master/datasets/OBELICS/OBELICS.py
2023-08-21 20:05:58,752:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:341 Data measurements not computed. ☹️
2023-08-21 20:05:58,752:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:342 An error occurred in computing data measurements for dataset with arguments: Namespace(dataset='OBELICS', config='default', split='train', feature=['text'], calculation=None, label_field='label', label_names=[], use_cache=False, out_dir='cache_dir', overwrite_previous=False, email=None, push_cache_to_hub=False, prepare_GUI_data=False, keep_local=True). Feel free to make an issue here: https://github.com/huggingface/data-measurements-tool/issues
2023-08-21 20:08:11,924:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:174 Label column name not given. Assuming it's 'label'.
2023-08-21 20:08:11,925:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:280 Proceeding with the following arguments:
2023-08-21 20:08:11,925:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:281 Namespace(dataset='HuggingFaceM4/OBELICS', config='default', split='train', feature=['text'], calculation=None, label_field='label', label_names=[], use_cache=False, out_dir='cache_dir', overwrite_previous=False, email=None, push_cache_to_hub=False, prepare_GUI_data=False, keep_local=True)
2023-08-21 20:08:11,925:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:147 Not using any cache; starting afresh
2023-08-21 21:52:44,287:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:174 Label column name not given. Assuming it's 'label'.
2023-08-21 21:52:44,287:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:280 Proceeding with the following arguments:
2023-08-21 21:52:44,287:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:281 Namespace(dataset='HuggingFaceM4/OBELICS', config='default', split='train[:10%]', feature=['text'], calculation=None, label_field='label', label_names=[], use_cache=False, out_dir='cache_dir', overwrite_previous=False, email=None, push_cache_to_hub=False, prepare_GUI_data=False, keep_local=True)
2023-08-21 21:52:44,287:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:147 Not using any cache; starting afresh
2023-08-21 22:26:00,109:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:174 Label column name not given. Assuming it's 'label'.
2023-08-21 22:26:00,109:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:280 Proceeding with the following arguments:
2023-08-21 22:26:00,109:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:281 Namespace(dataset='HuggingFaceM4/OBELICS', config='default', split='train', feature=['text'], calculation=None, label_field='label', label_names=[], use_cache=False, out_dir='cache_dir', overwrite_previous=False, email=None, push_cache_to_hub=False, prepare_GUI_data=False, keep_local=True)
2023-08-21 22:26:49,878:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:174 Label column name not given. Assuming it's 'label'.
2023-08-21 22:26:49,879:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:280 Proceeding with the following arguments:
2023-08-21 22:26:49,879:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:281 Namespace(dataset='HuggingFaceM4/OBELICS', config='default', split='train', feature=['text'], calculation=None, label_field='label', label_names=[], use_cache=False, out_dir='cache_dir', overwrite_previous=False, email=None, push_cache_to_hub=False, prepare_GUI_data=False, keep_local=True)
2023-08-21 22:27:08,087:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:174 Label column name not given. Assuming it's 'label'.
2023-08-21 22:27:08,088:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:280 Proceeding with the following arguments:
2023-08-21 22:27:08,088:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:281 Namespace(dataset='HuggingFaceM4/OBELICS', config='default', split='train', feature=['text'], calculation=None, label_field='label', label_names=[], use_cache=False, out_dir='cache_dir', overwrite_previous=False, email=None, push_cache_to_hub=False, prepare_GUI_data=False, keep_local=True)
2023-08-21 22:27:08,088:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:147 Not using any cache; starting afresh
2023-08-21 22:43:25,230:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:174 Label column name not given. Assuming it's 'label'.
2023-08-21 22:43:25,231:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:280 Proceeding with the following arguments:
2023-08-21 22:43:25,231:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:281 Namespace(dataset='HuggingFaceM4/OBELICS', config='default', split='ri', feature=['text'], calculation=None, label_field='label', label_names=[], use_cache=False, out_dir='cache_dir', overwrite_previous=False, email=None, push_cache_to_hub=False, prepare_GUI_data=False, keep_local=True)
2023-08-21 22:43:25,231:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:147 Not using any cache; starting afresh
2023-08-21 22:54:30,712:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:174 Label column name not given. Assuming it's 'label'.
2023-08-21 22:54:30,712:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:280 Proceeding with the following arguments:
2023-08-21 22:54:30,713:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:281 Namespace(dataset='HuggingFaceM4/OBELICS', config='default', split='train[:100]', feature=['text'], calculation=None, label_field='label', label_names=[], use_cache=False, out_dir='cache_dir', overwrite_previous=False, email=None, push_cache_to_hub=False, prepare_GUI_data=False, keep_local=True)
2023-08-21 22:54:30,713:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:147 Not using any cache; starting afresh
2023-08-21 22:55:32,445:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:174 Label column name not given. Assuming it's 'label'.
2023-08-21 22:55:32,446:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:280 Proceeding with the following arguments:
2023-08-21 22:55:32,446:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:281 Namespace(dataset='HuggingFaceM4/OBELICS', config='default', split='train[:10]', feature=['text'], calculation=None, label_field='label', label_names=[], use_cache=False, out_dir='cache_dir', overwrite_previous=False, email=None, push_cache_to_hub=False, prepare_GUI_data=False, keep_local=True)
2023-08-21 22:55:32,446:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:147 Not using any cache; starting afresh
2023-08-22 00:14:53,699:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:174 Label column name not given. Assuming it's 'label'.
2023-08-22 00:14:53,699:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:280 Proceeding with the following arguments:
2023-08-22 00:14:53,700:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:281 Namespace(dataset='HuggingFaceM4/OBELICS', config='default', split='train', feature=['text'], calculation=None, label_field='label', label_names=[], use_cache=False, out_dir='cache_dir', overwrite_previous=False, email=None, push_cache_to_hub=False, prepare_GUI_data=False, keep_local=True)
2023-08-22 00:14:53,700:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:147 Not using any cache; starting afresh
2023-08-22 00:26:39,298:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:174 Label column name not given. Assuming it's 'label'.
2023-08-22 00:26:39,299:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:280 Proceeding with the following arguments:
2023-08-22 00:26:39,299:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:281 Namespace(dataset='HuggingFaceM4/OBELICS', config='default', split='train', feature=['text'], calculation=None, label_field='label', label_names=[], use_cache=False, out_dir='cache_dir', overwrite_previous=False, email=None, push_cache_to_hub=False, prepare_GUI_data=False, keep_local=True)
2023-08-22 00:26:58,461:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:174 Label column name not given. Assuming it's 'label'.
2023-08-22 00:26:58,461:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:280 Proceeding with the following arguments:
2023-08-22 00:26:58,461:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:281 Namespace(dataset='HuggingFaceM4/OBELICS', config='default', split='train', feature=['text'], calculation=None, label_field='label', label_names=[], use_cache=False, out_dir='cache_dir', overwrite_previous=False, email=None, push_cache_to_hub=False, prepare_GUI_data=False, keep_local=True)
2023-08-22 00:26:58,461:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:147 Not using any cache; starting afresh
2023-08-22 00:27:30,030:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:333 'text'
Traceback (most recent call last):
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py", line 309, in main
    pass_args_to_DMT(
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py", line 160, in pass_args_to_DMT
    load_or_prepare(dataset_args, calculation=calculation, use_cache=use_cache)
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py", line 62, in load_or_prepare
    dstats = dataset_statistics.DatasetStatisticsCacheClass(**dataset_args,
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/data_measurements/dataset_statistics.py", line 232, in __init__
    self.load_or_prepare_text_dataset()
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/data_measurements/dataset_statistics.py", line 261, in load_or_prepare_text_dataset
    self.prepare_text_dset()
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/data_measurements/dataset_statistics.py", line 274, in prepare_text_dset
    self.text_dset = self.dset.map(
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/venv/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2376, in map
    return self._map_single(
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/venv/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 551, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/venv/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 518, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/venv/lib/python3.9/site-packages/datasets/fingerprint.py", line 458, in wrapper
    out = func(self, *args, **kwargs)
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/venv/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2764, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/venv/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2644, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/venv/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2336, in decorated
    result = f(decorated_item, *args, **kwargs)
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/data_measurements/dataset_statistics.py", line 275, in <lambda>
    lambda examples: ds_utils.extract_field(
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/utils/dataset_utils.py", line 358, in extract_field
    item_list = examples[field_path[0]]
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/venv/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 132, in __getitem__
    values = super().__getitem__(key)
  File "/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/collections/__init__.py", line 1058, in __getitem__
    raise KeyError(key)
KeyError: 'text'
2023-08-22 00:27:30,058:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:341 Data measurements not computed. ☹️
2023-08-22 00:27:30,059:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:342 An error occurred in computing data measurements for dataset with arguments: Namespace(dataset='HuggingFaceM4/OBELICS', config='default', split='train', feature=['text'], calculation=None, label_field='label', label_names=[], use_cache=False, out_dir='cache_dir', overwrite_previous=False, email=None, push_cache_to_hub=False, prepare_GUI_data=False, keep_local=True). Feel free to make an issue here: https://github.com/huggingface/data-measurements-tool/issues
2023-08-22 00:31:42,951:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:174 Label column name not given. Assuming it's 'label'.
2023-08-22 00:31:42,951:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:280 Proceeding with the following arguments:
2023-08-22 00:31:42,951:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:281 Namespace(dataset='HuggingFaceM4/OBELICS', config='default', split='train', feature=['text'], calculation=None, label_field='label', label_names=[], use_cache=False, out_dir='cache_dir', overwrite_previous=False, email=None, push_cache_to_hub=False, prepare_GUI_data=False, keep_local=True)
2023-08-22 00:32:41,235:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:174 Label column name not given. Assuming it's 'label'.
2023-08-22 00:32:41,235:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:280 Proceeding with the following arguments:
2023-08-22 00:32:41,235:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:281 Namespace(dataset='HuggingFaceM4/OBELICS', config='default', split='train', feature=['text'], calculation=None, label_field='label', label_names=[], use_cache=False, out_dir='cache_dir', overwrite_previous=False, email=None, push_cache_to_hub=False, prepare_GUI_data=False, keep_local=True)
2023-08-22 00:32:41,236:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:147 Not using any cache; starting afresh
2023-08-22 00:33:14,893:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:333 'text'
Traceback (most recent call last):
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py", line 309, in main
    pass_args_to_DMT(
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py", line 160, in pass_args_to_DMT
    load_or_prepare(dataset_args, calculation=calculation, use_cache=use_cache)
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py", line 62, in load_or_prepare
    dstats = dataset_statistics.DatasetStatisticsCacheClass(**dataset_args,
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/data_measurements/dataset_statistics.py", line 232, in __init__
    self.load_or_prepare_text_dataset()
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/data_measurements/dataset_statistics.py", line 261, in load_or_prepare_text_dataset
    self.prepare_text_dset()
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/data_measurements/dataset_statistics.py", line 274, in prepare_text_dset
    self.text_dset = self.dset.map(
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/venv/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2376, in map
    return self._map_single(
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/venv/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 551, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/venv/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 518, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/venv/lib/python3.9/site-packages/datasets/fingerprint.py", line 458, in wrapper
    out = func(self, *args, **kwargs)
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/venv/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2764, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/venv/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2644, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/venv/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2336, in decorated
    result = f(decorated_item, *args, **kwargs)
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/data_measurements/dataset_statistics.py", line 275, in <lambda>
    lambda examples: ds_utils.extract_field(
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/utils/dataset_utils.py", line 358, in extract_field
    item_list = examples[field_path[0]]
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/venv/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 132, in __getitem__
    values = super().__getitem__(key)
  File "/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/collections/__init__.py", line 1058, in __getitem__
    raise KeyError(key)
KeyError: 'text'
2023-08-22 00:33:14,916:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:341 Data measurements not computed. ☹️
2023-08-22 00:33:14,916:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:342 An error occurred in computing data measurements for dataset with arguments: Namespace(dataset='HuggingFaceM4/OBELICS', config='default', split='train', feature=['text'], calculation=None, label_field='label', label_names=[], use_cache=False, out_dir='cache_dir', overwrite_previous=False, email=None, push_cache_to_hub=False, prepare_GUI_data=False, keep_local=True). Feel free to make an issue here: https://github.com/huggingface/data-measurements-tool/issues
2023-08-22 00:41:09,513:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:174 Label column name not given. Assuming it's 'label'.
2023-08-22 00:41:09,514:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:280 Proceeding with the following arguments:
2023-08-22 00:41:09,514:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:281 Namespace(dataset='HuggingFaceM4/OBELICS', config='default', split='train', feature=['text'], calculation=None, label_field='label', label_names=[], use_cache=False, out_dir='cache_dir', overwrite_previous=False, email=None, push_cache_to_hub=False, prepare_GUI_data=False, keep_local=True)
2023-08-22 00:41:09,514:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:147 Not using any cache; starting afresh
2023-08-22 00:41:40,048:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:333 'text'
Traceback (most recent call last):
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py", line 309, in main
    pass_args_to_DMT(
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py", line 160, in pass_args_to_DMT
    load_or_prepare(dataset_args, calculation=calculation, use_cache=use_cache)
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py", line 62, in load_or_prepare
    dstats = dataset_statistics.DatasetStatisticsCacheClass(**dataset_args,
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/data_measurements/dataset_statistics.py", line 232, in __init__
    self.load_or_prepare_text_dataset()
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/data_measurements/dataset_statistics.py", line 261, in load_or_prepare_text_dataset
    self.prepare_text_dset()
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/data_measurements/dataset_statistics.py", line 274, in prepare_text_dset
    self.text_dset = self.dset.map(
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/venv/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2376, in map
    return self._map_single(
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/venv/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 551, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/venv/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 518, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/venv/lib/python3.9/site-packages/datasets/fingerprint.py", line 458, in wrapper
    out = func(self, *args, **kwargs)
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/venv/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2764, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/venv/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2644, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/venv/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2336, in decorated
    result = f(decorated_item, *args, **kwargs)
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/data_measurements/dataset_statistics.py", line 275, in <lambda>
    lambda examples: ds_utils.extract_field(
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/utils/dataset_utils.py", line 358, in extract_field
    item_list = examples[field_path[0]]
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/venv/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 132, in __getitem__
    values = super().__getitem__(key)
  File "/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/collections/__init__.py", line 1058, in __getitem__
    raise KeyError(key)
KeyError: 'text'
2023-08-22 00:41:40,065:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:341 Data measurements not computed. ☹️
2023-08-22 00:41:40,065:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:342 An error occurred in computing data measurements for dataset with arguments: Namespace(dataset='HuggingFaceM4/OBELICS', config='default', split='train', feature=['text'], calculation=None, label_field='label', label_names=[], use_cache=False, out_dir='cache_dir', overwrite_previous=False, email=None, push_cache_to_hub=False, prepare_GUI_data=False, keep_local=True). Feel free to make an issue here: https://github.com/huggingface/data-measurements-tool/issues
2023-08-22 01:02:57,529:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:174 Label column name not given. Assuming it's 'label'.
2023-08-22 01:02:57,530:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:280 Proceeding with the following arguments:
2023-08-22 01:02:57,530:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:281 Namespace(dataset='HuggingFaceM4/m4-bias-eval-fair-face', config='default', split='train', feature=['text'], calculation=None, label_field='label', label_names=[], use_cache=False, out_dir='cache_dir', overwrite_previous=False, email=None, push_cache_to_hub=False, prepare_GUI_data=False, keep_local=True)
2023-08-22 01:02:57,530:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:147 Not using any cache; starting afresh
2023-08-22 01:11:04,846:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:333 Object of type bytes is not JSON serializable
Traceback (most recent call last):
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py", line 309, in main
    pass_args_to_DMT(
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py", line 160, in pass_args_to_DMT
    load_or_prepare(dataset_args, calculation=calculation, use_cache=use_cache)
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py", line 62, in load_or_prepare
    dstats = dataset_statistics.DatasetStatisticsCacheClass(**dataset_args,
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/data_measurements/dataset_statistics.py", line 229, in __init__
    self.dset = self._get_dataset()
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/data_measurements/dataset_statistics.py", line 240, in _get_dataset
    dset = ds_utils.load_truncated_dataset(self.dset_name, self.dset_config,
  File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/utils/dataset_utils.py", line 217, in load_truncated_dataset
    _ = f.write(json.dumps(row) + "\n")
  File "/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/json/__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
  File "/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type bytes is not JSON serializable
2023-08-22 01:11:04,879:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:341 Data measurements not computed. ☹️
2023-08-22 01:11:04,880:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:342 An error occurred in computing data measurements for dataset with arguments: Namespace(dataset='HuggingFaceM4/m4-bias-eval-fair-face', config='default', split='train', feature=['text'], calculation=None, label_field='label', label_names=[], use_cache=False, out_dir='cache_dir', overwrite_previous=False, email=None, push_cache_to_hub=False, prepare_GUI_data=False, keep_local=True). Feel free to make an issue here: https://github.com/huggingface/data-measurements-tool/issues
2023-08-22 01:39:04,554:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:174 Label column name not given. Assuming it's 'label'.
2023-08-22 01:39:04,554:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:280 Proceeding with the following arguments:
2023-08-22 01:39:04,554:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:281 Namespace(dataset='HuggingFaceM4/m4-bias-eval-fair-face', config='default', split='train', feature=['text'], calculation=None, label_field='label', label_names=[], use_cache=False, out_dir='cache_dir', overwrite_previous=False, email=None, push_cache_to_hub=False, prepare_GUI_data=False, keep_local=True)
2023-08-22 01:39:20,142:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:174 Label column name not given. Assuming it's 'label'.
2023-08-22 01:39:20,142:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:280 Proceeding with the following arguments: 2023-08-22 01:39:20,142:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:281 Namespace(dataset='HuggingFaceM4/m4-bias-eval-fair-face', config='default', split='train', feature=['text'], calculation=None, label_field='label', label_names=[], use_cache=False, out_dir='cache_dir', overwrite_previous=False, email=None, push_cache_to_hub=False, prepare_GUI_data=False, keep_local=True) 2023-08-22 01:39:20,143:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:147 Not using any cache; starting afresh 2023-08-22 01:48:32,904:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:174 Label column name not given. Assuming it's 'label'. 2023-08-22 01:48:32,905:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:280 Proceeding with the following arguments: 2023-08-22 01:48:32,905:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:281 Namespace(dataset='HuggingFaceM4/m4-bias-eval-fair-face', config='default', split='train', feature=['text'], calculation=None, label_field='label', label_names=[], use_cache=False, out_dir='cache_dir', overwrite_previous=False, email=None, push_cache_to_hub=False, prepare_GUI_data=False, keep_local=True) 2023-08-22 01:48:53,726:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:174 Label column name not given. Assuming it's 'label'. 
2023-08-22 01:48:53,727:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:280 Proceeding with the following arguments: 2023-08-22 01:48:53,727:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:281 Namespace(dataset='HuggingFaceM4/m4-bias-eval-fair-face', config='default', split='train', feature=['text'], calculation=None, label_field='label', label_names=[], use_cache=False, out_dir='cache_dir', overwrite_previous=False, email=None, push_cache_to_hub=False, prepare_GUI_data=False, keep_local=True) 2023-08-22 01:48:53,727:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:147 Not using any cache; starting afresh 2023-08-22 01:56:55,039:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:333 Object of type bytes is not JSON serializable Traceback (most recent call last): File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py", line 309, in main pass_args_to_DMT( File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py", line 160, in pass_args_to_DMT load_or_prepare(dataset_args, calculation=calculation, use_cache=use_cache) File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py", line 62, in load_or_prepare dstats = dataset_statistics.DatasetStatisticsCacheClass(**dataset_args, File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/data_measurements/dataset_statistics.py", line 229, in __init__ self.dset = self._get_dataset() File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/data_measurements/dataset_statistics.py", line 240, in 
_get_dataset dset = ds_utils.load_truncated_dataset(self.dset_name, self.dset_config, File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/utils/dataset_utils.py", line 216, in load_truncated_dataset _ = f.write(json.dumps(row) + "\n") File "/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/json/__init__.py", line 231, in dumps return _default_encoder.encode(obj) File "/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/json/encoder.py", line 199, in encode chunks = self.iterencode(o, _one_shot=True) File "/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/json/encoder.py", line 257, in iterencode return _iterencode(o, 0) File "/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/json/encoder.py", line 179, in default raise TypeError(f'Object of type {o.__class__.__name__} ' TypeError: Object of type bytes is not JSON serializable 2023-08-22 01:56:55,133:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:341 Data measurements not computed. ☹️ 2023-08-22 01:56:55,134:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:342 An error occurred in computing data measurements for dataset with arguments: Namespace(dataset='HuggingFaceM4/m4-bias-eval-fair-face', config='default', split='train', feature=['text'], calculation=None, label_field='label', label_names=[], use_cache=False, out_dir='cache_dir', overwrite_previous=False, email=None, push_cache_to_hub=False, prepare_GUI_data=False, keep_local=True). 
Feel free to make an issue here: https://github.com/huggingface/data-measurements-tool/issues 2023-08-22 02:15:01,934:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:174 Label column name not given. Assuming it's 'label'. 2023-08-22 02:15:01,934:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:280 Proceeding with the following arguments: 2023-08-22 02:15:01,934:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:281 Namespace(dataset='HuggingFaceM4/OBELICS', config='default', split='train', feature=['text'], calculation=None, label_field='label', label_names=[], use_cache=False, out_dir='cache_dir', overwrite_previous=False, email=None, push_cache_to_hub=False, prepare_GUI_data=False, keep_local=True) 2023-08-22 02:15:35,252:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:174 Label column name not given. Assuming it's 'label'. 
2023-08-22 02:15:35,253:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:280 Proceeding with the following arguments: 2023-08-22 02:15:35,253:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:281 Namespace(dataset='HuggingFaceM4/OBELICS', config='default', split='train', feature=['text'], calculation=None, label_field='label', label_names=[], use_cache=False, out_dir='cache_dir', overwrite_previous=False, email=None, push_cache_to_hub=False, prepare_GUI_data=False, keep_local=True) 2023-08-22 02:15:35,253:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:147 Not using any cache; starting afresh 2023-08-22 02:16:08,141:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:333 type object 'Dataset' has no attribute 'from_generator' Traceback (most recent call last): File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py", line 309, in main pass_args_to_DMT( File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py", line 160, in pass_args_to_DMT load_or_prepare(dataset_args, calculation=calculation, use_cache=use_cache) File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py", line 62, in load_or_prepare dstats = dataset_statistics.DatasetStatisticsCacheClass(**dataset_args, File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/data_measurements/dataset_statistics.py", line 229, in __init__ self.dset = self._get_dataset() File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/data_measurements/dataset_statistics.py", line 240, in 
_get_dataset dset = ds_utils.load_truncated_dataset(self.dset_name, self.dset_config, File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/utils/dataset_utils.py", line 214, in load_truncated_dataset dataset = Dataset.from_generator(gen, features=iterable_dataset.features) AttributeError: type object 'Dataset' has no attribute 'from_generator' 2023-08-22 02:16:08,158:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:341 Data measurements not computed. ☹️ 2023-08-22 02:16:08,158:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:342 An error occurred in computing data measurements for dataset with arguments: Namespace(dataset='HuggingFaceM4/OBELICS', config='default', split='train', feature=['text'], calculation=None, label_field='label', label_names=[], use_cache=False, out_dir='cache_dir', overwrite_previous=False, email=None, push_cache_to_hub=False, prepare_GUI_data=False, keep_local=True). Feel free to make an issue here: https://github.com/huggingface/data-measurements-tool/issues 2023-08-22 02:16:54,319:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:174 Label column name not given. Assuming it's 'label'. 
2023-08-22 02:16:54,319:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:280 Proceeding with the following arguments: 2023-08-22 02:16:54,320:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:281 Namespace(dataset='HuggingFaceM4/OBELICS', config='default', split='train', feature=['text'], calculation=None, label_field='label', label_names=[], use_cache=False, out_dir='cache_dir', overwrite_previous=False, email=None, push_cache_to_hub=False, prepare_GUI_data=False, keep_local=True) 2023-08-22 02:16:54,320:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:147 Not using any cache; starting afresh 2023-08-22 02:17:25,827:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:333 type object 'Dataset' has no attribute 'from_generator' Traceback (most recent call last): File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py", line 309, in main pass_args_to_DMT( File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py", line 160, in pass_args_to_DMT load_or_prepare(dataset_args, calculation=calculation, use_cache=use_cache) File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py", line 62, in load_or_prepare dstats = dataset_statistics.DatasetStatisticsCacheClass(**dataset_args, File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/data_measurements/dataset_statistics.py", line 229, in __init__ self.dset = self._get_dataset() File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/data_measurements/dataset_statistics.py", line 240, in 
_get_dataset dset = ds_utils.load_truncated_dataset(self.dset_name, self.dset_config, File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/utils/dataset_utils.py", line 214, in load_truncated_dataset dataset = Dataset.from_generator(gen, features=iterable_dataset.features) AttributeError: type object 'Dataset' has no attribute 'from_generator' 2023-08-22 02:17:25,835:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:341 Data measurements not computed. ☹️ 2023-08-22 02:17:25,835:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:342 An error occurred in computing data measurements for dataset with arguments: Namespace(dataset='HuggingFaceM4/OBELICS', config='default', split='train', feature=['text'], calculation=None, label_field='label', label_names=[], use_cache=False, out_dir='cache_dir', overwrite_previous=False, email=None, push_cache_to_hub=False, prepare_GUI_data=False, keep_local=True). Feel free to make an issue here: https://github.com/huggingface/data-measurements-tool/issues 2023-08-22 02:22:04,256:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:174 Label column name not given. Assuming it's 'label'. 
2023-08-22 02:22:04,257:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:280 Proceeding with the following arguments: 2023-08-22 02:22:04,257:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:281 Namespace(dataset='HuggingFaceM4/OBELICS', config='default', split='train', feature=['text'], calculation=None, label_field='label', label_names=[], use_cache=False, out_dir='cache_dir', overwrite_previous=False, email=None, push_cache_to_hub=False, prepare_GUI_data=False, keep_local=True) 2023-08-22 02:22:24,147:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:174 Label column name not given. Assuming it's 'label'. 2023-08-22 02:22:24,148:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:280 Proceeding with the following arguments: 2023-08-22 02:22:24,148:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:281 Namespace(dataset='HuggingFaceM4/OBELICS', config='default', split='train', feature=['text'], calculation=None, label_field='label', label_names=[], use_cache=False, out_dir='cache_dir', overwrite_previous=False, email=None, push_cache_to_hub=False, prepare_GUI_data=False, keep_local=True) 2023-08-22 02:22:24,148:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:147 Not using any cache; starting afresh 2023-08-22 02:23:00,157:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:333 type object 'Dataset' has no attribute 'from_generator' Traceback (most recent call last): File 
"/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py", line 309, in main pass_args_to_DMT( File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py", line 160, in pass_args_to_DMT load_or_prepare(dataset_args, calculation=calculation, use_cache=use_cache) File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py", line 62, in load_or_prepare dstats = dataset_statistics.DatasetStatisticsCacheClass(**dataset_args, File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/data_measurements/dataset_statistics.py", line 229, in __init__ self.dset = self._get_dataset() File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/data_measurements/dataset_statistics.py", line 240, in _get_dataset dset = ds_utils.load_truncated_dataset(self.dset_name, self.dset_config, File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/utils/dataset_utils.py", line 214, in load_truncated_dataset dataset = Dataset.from_generator(gen, features=iterable_dataset.features) AttributeError: type object 'Dataset' has no attribute 'from_generator' 2023-08-22 02:23:00,176:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:341 Data measurements not computed. ☹️ 2023-08-22 02:23:00,176:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:342 An error occurred in computing data measurements for dataset with arguments: Namespace(dataset='HuggingFaceM4/OBELICS', config='default', split='train', feature=['text'], calculation=None, label_field='label', label_names=[], use_cache=False, out_dir='cache_dir', overwrite_previous=False, email=None, push_cache_to_hub=False, prepare_GUI_data=False, keep_local=True). 
Feel free to make an issue here: https://github.com/huggingface/data-measurements-tool/issues 2023-08-22 02:25:16,788:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:174 Label column name not given. Assuming it's 'label'. 2023-08-22 02:25:16,789:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:280 Proceeding with the following arguments: 2023-08-22 02:25:16,789:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:281 Namespace(dataset='HuggingFaceM4/OBELICS', config='default', split='train', feature=['text'], calculation=None, label_field='label', label_names=[], use_cache=False, out_dir='cache_dir', overwrite_previous=False, email=None, push_cache_to_hub=False, prepare_GUI_data=False, keep_local=True) 2023-08-22 02:25:16,789:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:147 Not using any cache; starting afresh 2023-08-22 02:25:51,681:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:333 'text' Traceback (most recent call last): File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py", line 309, in main pass_args_to_DMT( File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py", line 160, in pass_args_to_DMT load_or_prepare(dataset_args, calculation=calculation, use_cache=use_cache) File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py", line 62, in load_or_prepare dstats = dataset_statistics.DatasetStatisticsCacheClass(**dataset_args, File 
"/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/data_measurements/dataset_statistics.py", line 232, in __init__ self.load_or_prepare_text_dataset() File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/data_measurements/dataset_statistics.py", line 261, in load_or_prepare_text_dataset self.prepare_text_dset() File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/data_measurements/dataset_statistics.py", line 274, in prepare_text_dset self.text_dset = self.dset.map( File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/venv/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 592, in wrapper out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs) File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/venv/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 557, in wrapper out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs) File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/venv/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3097, in map for rank, done, content in Dataset._map_single(**dataset_kwargs): File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/venv/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3474, in _map_single batch = apply_function_on_filtered_inputs( File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/venv/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3353, in apply_function_on_filtered_inputs processed_inputs = function(*fn_args, *additional_args, **fn_kwargs) File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/data_measurements/dataset_statistics.py", line 275, in lambda examples: ds_utils.extract_field( File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/utils/dataset_utils.py", line 361, in extract_field item_list = examples[field_path[0]] File 
"/Users/ezi/Desktop/HF/DMT/data-measurements-tool/venv/lib/python3.9/site-packages/datasets/formatting/formatting.py", line 270, in __getitem__ value = self.data[key] KeyError: 'text' 2023-08-22 02:25:51,704:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:341 Data measurements not computed. ☹️ 2023-08-22 02:25:51,705:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:342 An error occurred in computing data measurements for dataset with arguments: Namespace(dataset='HuggingFaceM4/OBELICS', config='default', split='train', feature=['text'], calculation=None, label_field='label', label_names=[], use_cache=False, out_dir='cache_dir', overwrite_previous=False, email=None, push_cache_to_hub=False, prepare_GUI_data=False, keep_local=True). Feel free to make an issue here: https://github.com/huggingface/data-measurements-tool/issues 2023-08-22 02:28:13,954:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:174 Label column name not given. Assuming it's 'label'. 
2023-08-22 02:28:13,954:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:280 Proceeding with the following arguments: 2023-08-22 02:28:13,954:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:281 Namespace(dataset='HuggingFaceM4/OBELICS', config='default', split='train', feature=['texts'], calculation=None, label_field='label', label_names=[], use_cache=False, out_dir='cache_dir', overwrite_previous=False, email=None, push_cache_to_hub=False, prepare_GUI_data=False, keep_local=True) 2023-08-22 02:28:13,955:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:147 Not using any cache; starting afresh 2023-08-22 02:28:44,379:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:64 Tokenizing dataset. 2023-08-22 02:28:44,521:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:66 Calculating vocab. 
2023-08-22 02:28:44,874:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:333 'CountVectorizer' object has no attribute 'get_feature_names_out' Traceback (most recent call last): File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py", line 309, in main pass_args_to_DMT( File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py", line 160, in pass_args_to_DMT load_or_prepare(dataset_args, calculation=calculation, use_cache=use_cache) File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py", line 67, in load_or_prepare dstats.load_or_prepare_vocab() File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/data_measurements/dataset_statistics.py", line 373, in load_or_prepare_vocab word_count_df = count_vocab_frequencies(self.tokenized_df) File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/data_measurements/dataset_statistics.py", line 558, in count_vocab_frequencies [np.sum(tf, axis=0)], columns=cvec.get_feature_names_out() AttributeError: 'CountVectorizer' object has no attribute 'get_feature_names_out' 2023-08-22 02:28:44,875:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:341 Data measurements not computed. 
☹️ 2023-08-22 02:28:44,875:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:342 An error occurred in computing data measurements for dataset with arguments: Namespace(dataset='HuggingFaceM4/OBELICS', config='default', split='train', feature=['texts'], calculation=None, label_field='label', label_names=[], use_cache=False, out_dir='cache_dir', overwrite_previous=False, email=None, push_cache_to_hub=False, prepare_GUI_data=False, keep_local=True). Feel free to make an issue here: https://github.com/huggingface/data-measurements-tool/issues 2023-08-22 02:29:57,574:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:174 Label column name not given. Assuming it's 'label'. 2023-08-22 02:29:57,574:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:280 Proceeding with the following arguments: 2023-08-22 02:29:57,574:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:281 Namespace(dataset='HuggingFaceM4/OBELICS', config='default', split='train', feature=['texts'], calculation=None, label_field='label', label_names=[], use_cache=False, out_dir='cache_dir', overwrite_previous=False, email=None, push_cache_to_hub=False, prepare_GUI_data=False, keep_local=True) 2023-08-22 02:30:47,498:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:174 Label column name not given. Assuming it's 'label'. 
2023-08-22 02:30:47,499:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:280 Proceeding with the following arguments: 2023-08-22 02:30:47,499:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:281 Namespace(dataset='HuggingFaceM4/OBELICS', config='default', split='train', feature=['texts'], calculation=None, label_field='label', label_names=[], use_cache=False, out_dir='cache_dir', overwrite_previous=False, email=None, push_cache_to_hub=False, prepare_GUI_data=False, keep_local=True) 2023-08-22 02:30:47,499:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:147 Not using any cache; starting afresh 2023-08-22 02:31:19,187:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:64 Tokenizing dataset. 2023-08-22 02:31:19,299:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:66 Calculating vocab. 
2023-08-22 02:31:19,594:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:333 'CountVectorizer' object has no attribute 'get_feature_names_out' Traceback (most recent call last): File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py", line 309, in main pass_args_to_DMT( File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py", line 160, in pass_args_to_DMT load_or_prepare(dataset_args, calculation=calculation, use_cache=use_cache) File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py", line 67, in load_or_prepare dstats.load_or_prepare_vocab() File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/data_measurements/dataset_statistics.py", line 373, in load_or_prepare_vocab word_count_df = count_vocab_frequencies(self.tokenized_df) File "/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/data_measurements/dataset_statistics.py", line 558, in count_vocab_frequencies [np.sum(tf, axis=0)], columns=cvec.get_feature_names_out() AttributeError: 'CountVectorizer' object has no attribute 'get_feature_names_out' 2023-08-22 02:31:19,595:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:341 Data measurements not computed. 
☹️ 2023-08-22 02:31:19,595:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:342 An error occurred in computing data measurements for dataset with arguments: Namespace(dataset='HuggingFaceM4/OBELICS', config='default', split='train', feature=['texts'], calculation=None, label_field='label', label_names=[], use_cache=False, out_dir='cache_dir', overwrite_previous=False, email=None, push_cache_to_hub=False, prepare_GUI_data=False, keep_local=True). Feel free to make an issue here: https://github.com/huggingface/data-measurements-tool/issues 2023-08-22 03:40:26,643:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:174 Label column name not given. Assuming it's 'label'. 2023-08-22 03:40:26,644:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:280 Proceeding with the following arguments: 2023-08-22 03:40:26,644:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:281 Namespace(dataset='HuggingFaceM4/OBELICS', config='default', split='train', feature=['texts'], calculation=None, label_field='label', label_names=[], use_cache=False, out_dir='cache_dir', overwrite_previous=False, email=None, push_cache_to_hub=False, prepare_GUI_data=False, keep_local=True) 2023-08-22 03:41:31,905:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:174 Label column name not given. Assuming it's 'label'. 
2023-08-22 03:41:31,906:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:280 Proceeding with the following arguments: 2023-08-22 03:41:31,906:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:281 Namespace(dataset='HuggingFaceM4/OBELICS', config='default', split='train', feature=['texts'], calculation=None, label_field='label', label_names=[], use_cache=False, out_dir='cache_dir', overwrite_previous=False, email=None, push_cache_to_hub=False, prepare_GUI_data=False, keep_local=True) 2023-08-22 03:42:37,371:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:174 Label column name not given. Assuming it's 'label'. 2023-08-22 03:42:37,371:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:280 Proceeding with the following arguments: 2023-08-22 03:42:37,371:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:281 Namespace(dataset='HuggingFaceM4/OBELICS', config='default', split='train', feature=['texts'], calculation=None, label_field='label', label_names=[], use_cache=False, out_dir='cache_dir', overwrite_previous=False, email=None, push_cache_to_hub=False, prepare_GUI_data=False, keep_local=True) 2023-08-22 03:42:37,372:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:147 Not using any cache; starting afresh 2023-08-22 03:46:15,435:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:64 Tokenizing dataset. 
2023-08-22 03:48:12,829:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:66 Calculating vocab. 2023-08-22 03:50:15,378:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:73 * Calculating general statistics. 2023-08-22 03:50:32,272:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:75 Done! 2023-08-22 03:50:32,272:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:76 Basic text statistics now available at cache_dir/HuggingFaceM4/OBELICS_default_train_texts/general_stats_dict.json. 2023-08-22 03:50:32,272:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:80 * Calculating text duplicates. 2023-08-22 03:50:40,705:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:83 If all went well, then results are in the following files: 2023-08-22 03:50:40,705:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:85 statistics: cache_dir/HuggingFaceM4/OBELICS_default_train_texts/text_duplicates/text_duplicates.json 2023-08-22 03:50:40,706:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:85 html: cache_dir/HuggingFaceM4/OBELICS_default_train_texts/text_duplicates/text_duplicates.html 2023-08-22 03:50:40,706:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:88 * Calculating text lengths. 
2023-08-22 03:52:44,734:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:97 * Calculating label statistics. 2023-08-22 03:52:44,735:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:99 No label field found. 2023-08-22 03:52:44,735:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:100 No label statistics to calculate. 2023-08-22 05:08:25,998:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:174 Label column name not given. Assuming it's 'label'. 2023-08-22 05:08:25,999:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:280 Proceeding with the following arguments: 2023-08-22 05:08:25,999:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:281 Namespace(dataset='HuggingFaceM4/OBELICS', config='default', split='train', feature=['texts'], calculation=None, label_field='label', label_names=[], use_cache=False, out_dir='cache_dir', overwrite_previous=False, email=None, push_cache_to_hub=False, prepare_GUI_data=False, keep_local=True) 2023-08-22 05:08:25,999:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:147 Not using any cache; starting afresh 2023-08-22 05:10:18,037:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:64 Tokenizing dataset. 2023-08-22 05:11:03,365:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:66 Calculating vocab. 
2023-08-22 05:11:51,246:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:73 * Calculating general statistics. 2023-08-22 05:12:01,669:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:75 Done! 2023-08-22 05:12:01,669:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:76 Basic text statistics now available at cache_dir/HuggingFaceM4/OBELICS_default_train_texts/general_stats_dict.json. 2023-08-22 05:12:01,669:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:80 * Calculating text duplicates. 2023-08-22 05:12:07,163:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:83 If all went well, then results are in the following files: 2023-08-22 05:12:07,163:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:85 statistics: cache_dir/HuggingFaceM4/OBELICS_default_train_texts/text_duplicates/text_duplicates.json 2023-08-22 05:12:07,163:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:85 html: cache_dir/HuggingFaceM4/OBELICS_default_train_texts/text_duplicates/text_duplicates.html 2023-08-22 05:12:07,163:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:88 * Calculating text lengths. 2023-08-22 05:12:46,568:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:97 * Calculating label statistics. 
2023-08-22 05:12:46,568:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:99 No label field found. 2023-08-22 05:12:46,568:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:100 No label statistics to calculate. 2023-08-22 05:25:43,095:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:174 Label column name not given. Assuming it's 'label'. 2023-08-22 05:25:43,096:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:280 Proceeding with the following arguments: 2023-08-22 05:25:43,096:/Users/ezi/Desktop/HF/DMT/data-measurements-tool/DMT2023/data-measurements-tool/run_data_measurements.py, run_data_measurements:281 Namespace(dataset='HuggingFaceM4/OBELICS', config='default', split='train', feature=['texts'], calculation=None, label_field='label', label_names=[], use_cache=False, out_dir='cache_dir', overwrite_previous=False, email=None, push_cache_to_hub=False, prepare_GUI_data=False, keep_local=True)