Builder classes

🤗 Datasets relies on two main classes during the dataset building process: DatasetBuilder and BuilderConfig.

class datasets.DatasetBuilder

( cache_dir: typing.Optional[str] = None name: typing.Optional[str] = None hash: typing.Optional[str] = None base_path: typing.Optional[str] = None info: typing.Optional[datasets.info.DatasetInfo] = None features: typing.Optional[datasets.features.features.Features] = None use_auth_token: typing.Union[str, bool, NoneType] = None repo_id: typing.Optional[str] = None data_files: typing.Union[str, list, dict, datasets.data_files.DataFilesDict, NoneType] = None data_dir: typing.Optional[str] = None **config_kwargs )

Abstract base class for all datasets.

DatasetBuilder has 3 key methods:

  • DatasetBuilder.info — documents the dataset, including feature names, types, shapes, version, splits, citation, etc.
  • DatasetBuilder.download_and_prepare() — downloads the source data and writes it to disk.
  • DatasetBuilder.as_dataset() — generates a Dataset.

Configuration: Some DatasetBuilders expose multiple variants of the dataset by defining a datasets.BuilderConfig subclass and accepting a config object (or name) on construction. Configurable datasets expose a pre-defined set of configurations in datasets.DatasetBuilder.builder_configs().
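
As a usage sketch, these methods are typically reached through datasets.load_dataset_builder; the dataset name below is purely illustrative:

from datasets import load_dataset_builder

builder = load_dataset_builder("rotten_tomatoes")  # illustrative dataset name
print(builder.info.features)            # DatasetBuilder.info documents the dataset
builder.download_and_prepare()          # download the source data and write it to disk
ds = builder.as_dataset(split="train")  # generate a Dataset for the requested split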

as_dataset

( split: typing.Optional[datasets.splits.Split] = None run_post_process = True ignore_verifications = False in_memory = False )

Parameters

  • split (datasets.Split) — Which subset of the data to return.
  • run_post_process (bool, default=True) — Whether to run post-processing dataset transforms and/or add indexes.
  • ignore_verifications (bool, default=False) — Whether to ignore the verifications of the downloaded/processed dataset information (checksums/size/splits/…).
  • in_memory (bool, default=False) — Whether to copy the data in-memory.

Return a Dataset for the specified split.

download_and_prepare

( download_config: typing.Optional[datasets.utils.file_utils.DownloadConfig] = None download_mode: typing.Optional[datasets.utils.download_manager.DownloadMode] = None ignore_verifications: bool = False try_from_hf_gcs: bool = True dl_manager: typing.Optional[datasets.utils.download_manager.DownloadManager] = None base_path: typing.Optional[str] = None use_auth_token: typing.Union[str, bool, NoneType] = None **download_and_prepare_kwargs )

Parameters

  • download_config (DownloadConfig, optional) — Specific download configuration parameters.
  • download_mode (DownloadMode, optional) — Select the download/generate mode. Defaults to REUSE_DATASET_IF_EXISTS.
  • ignore_verifications (bool) — Whether to ignore the verifications of the downloaded/processed dataset information (checksums/size/splits/…).
  • try_from_hf_gcs (bool) — If True, try to download the already prepared dataset from the Hugging Face Google Cloud Storage.
  • dl_manager (DownloadManager, optional) — Specific DownloadManager to use.
  • base_path (str, optional) — Base path for relative paths that are used to download files. This can be a remote URL. If not specified, the value of the base_path attribute (self.base_path) will be used instead.
  • use_auth_token (Union[str, bool], optional) — Optional string or boolean to use as Bearer token for remote files on the Datasets Hub. If True, will get token from ~/.huggingface.

Downloads and prepares dataset for reading.

get_all_exported_dataset_infos

( )

Return all the exported dataset infos, or an empty dict if none exist.

get_exported_dataset_info

( )

Return the exported dataset info, or an empty DatasetInfo if it doesn't exist.

get_imported_module_dir

( )

Return the path of the module of this class or subclass.

class datasets.GeneratorBasedBuilder

( *args writer_batch_size = None **kwargs )

Base class for datasets with data generation based on dict generators.

GeneratorBasedBuilder is a convenience class that abstracts away much of the data writing and reading of DatasetBuilder. It expects subclasses to implement generators of feature dictionaries across the dataset splits (_split_generators). See the method docstrings for details.
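
A minimal subclass sketch (the class name, URL, and feature names below are purely illustrative and not part of the library):

import json

import datasets

_URL = "https://example.com/train.jsonl"  # hypothetical source file


class MyDataset(datasets.GeneratorBasedBuilder):
    def _info(self):
        return datasets.DatasetInfo(
            description="Toy dataset used only to illustrate the builder API.",
            features=datasets.Features(
                {"text": datasets.Value("string"), "label": datasets.ClassLabel(names=["neg", "pos"])}
            ),
        )

    def _split_generators(self, dl_manager):
        path = dl_manager.download_and_extract(_URL)
        return [datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"filepath": path})]

    def _generate_examples(self, filepath):
        # Yield (key, example) pairs; each example must match the features declared in _info.
        with open(filepath, encoding="utf-8") as f:
            for idx, line in enumerate(f):
                record = json.loads(line)
                yield idx, {"text": record["text"], "label": record["label"]}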

class datasets.BeamBasedBuilder

( *args **kwargs )

Beam based Builder.

class datasets.ArrowBasedBuilder

( cache_dir: typing.Optional[str] = None name: typing.Optional[str] = None hash: typing.Optional[str] = None base_path: typing.Optional[str] = None info: typing.Optional[datasets.info.DatasetInfo] = None features: typing.Optional[datasets.features.features.Features] = None use_auth_token: typing.Union[str, bool, NoneType] = None repo_id: typing.Optional[str] = None data_files: typing.Union[str, list, dict, datasets.data_files.DataFilesDict, NoneType] = None data_dir: typing.Optional[str] = None **config_kwargs )

Base class for datasets with data generation based on Arrow loading functions (CSV/JSON/Parquet).
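
A rough subclass sketch (illustrative only, assuming local CSV paths were provided through data_files): instead of yielding example dicts, this kind of builder yields whole pyarrow tables from _generate_tables:

import pyarrow.csv as pac

import datasets


class CsvLike(datasets.ArrowBasedBuilder):
    def _info(self):
        return datasets.DatasetInfo()  # features can be inferred from the Arrow schema

    def _split_generators(self, dl_manager):
        files = self.config.data_files["train"]  # assumes data_files={"train": [...]} was provided
        return [datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"files": files})]

    def _generate_tables(self, files):
        for i, file in enumerate(files):
            # Each yielded item is (key, pyarrow.Table); tables are written to Arrow as-is.
            yield i, pac.read_csv(file)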

class datasets.BuilderConfig

( name: str = 'default' version: typing.Union[str, datasets.utils.version.Version, NoneType] = '0.0.0' data_dir: typing.Optional[str] = None data_files: typing.Optional[datasets.data_files.DataFilesDict] = None description: typing.Optional[str] = None )

Parameters

  • name (str, default "default") — The name of the configuration.
  • version (Version or str, optional) — The version of the configuration.
  • data_dir (str, optional) — Path to the directory containing the source data.
  • data_files (str or Sequence or Mapping, optional) — Path(s) to source data file(s).
  • description (str, optional) — A human description of the configuration.

Base class for DatasetBuilder data configuration.

DatasetBuilder subclasses with data configuration options should subclass BuilderConfig and add their own properties.
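
A sketch of a custom configuration (class and attribute names are illustrative), attached to a builder through the usual BUILDER_CONFIG_CLASS / BUILDER_CONFIGS class attributes:

from dataclasses import dataclass

import datasets


@dataclass
class MyConfig(datasets.BuilderConfig):
    # Extra, dataset-specific option on top of the base BuilderConfig fields.
    language: str = "en"


class MyDataset(datasets.GeneratorBasedBuilder):
    BUILDER_CONFIG_CLASS = MyConfig
    BUILDER_CONFIGS = [
        MyConfig(name="en", language="en", description="English subset"),
        MyConfig(name="fr", language="fr", description="French subset"),
    ]
    DEFAULT_CONFIG_NAME = "en"
    # _info, _split_generators and _generate_examples omitted for brevity.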

create_config_id

( config_kwargs: dict custom_features: typing.Optional[datasets.features.features.Features] = None )

The config id is used to build the cache directory. By default it is equal to the config name. However the name of a config is not sufficient to have a unique identifier for the dataset being generated since it doesn’t take into account:

  • the config kwargs that can be used to overwrite attributes
  • the custom features used to write the dataset
  • the data_files for json/text/csv/pandas datasets

Therefore the config id is just the config name with an optional suffix based on these.

class datasets.DownloadManager

( dataset_name: typing.Optional[str] = None data_dir: typing.Optional[str] = None download_config: typing.Optional[datasets.utils.file_utils.DownloadConfig] = None base_path: typing.Optional[str] = None record_checksums = True )

download

( url_or_urls ) β†’ downloaded_path(s)

Returns

downloaded_path(s)

str, The downloaded paths matching the given input url_or_urls.

Download given url(s).

download_and_extract

( url_or_urls ) β†’ extracted_path(s)

Returns

extracted_path(s)

str, extracted paths of given URL(s).

Download and extract given url_or_urls.

Is roughly equivalent to:

extracted_paths = dl_manager.extract(dl_manager.download(url_or_urls))
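
The input can also be a list or dict of URLs, in which case the same nested structure of local paths is returned; a small sketch with placeholder URLs:

from datasets import DownloadManager

dl_manager = DownloadManager()
archives = dl_manager.download_and_extract({
    "train": "https://example.com/train.tar.gz",  # placeholder URL
    "test": "https://example.com/test.tar.gz",    # placeholder URL
})
train_dir = archives["train"]  # local path to the extracted train archive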

download_custom

( url_or_urls custom_download ) β†’ downloaded_path(s)

Returns

downloaded_path(s)

str, The downloaded paths matching the given input url_or_urls.

Download given url(s) by calling custom_download.

extract

( path_or_paths num_proc = None ) β†’ extracted_path(s)

Returns

extracted_path(s)

str, The extracted paths matching the given input path_or_paths.

Extract given path(s).

iter_archive

( path_or_buf: typing.Union[str, _io.BufferedReader] )

Parameters

  • path_or_buf (str or io.BufferedReader) — Archive path or archive binary file object.

Iterate over files within an archive.
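
A common pattern, sketched below with a placeholder URL, is to stream files out of a downloaded TAR archive without extracting it; each item is a (path-inside-archive, file-object) pair:

from datasets import DownloadManager

dl_manager = DownloadManager()
archive_path = dl_manager.download("https://example.com/images.tar.gz")  # placeholder URL
for name, fileobj in dl_manager.iter_archive(archive_path):
    if name.endswith(".txt"):
        text = fileobj.read().decode("utf-8")  # file objects yield raw bytes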

iter_files

( paths: typing.Union[str, typing.List[str]] )

Parameters

  • paths (str or list of str) — Root paths.

Iterate over file paths.

ship_files_with_pipeline

( downloaded_path_or_paths pipeline )

Ship the files using Beam FileSystems to the pipeline temp dir.

class datasets.DownloadMode

( value names = None module = None qualname = None type = None start = 1 )

Enum for how to treat pre-existing downloads and data.

The default mode is REUSE_DATASET_IF_EXISTS, which will reuse both raw downloads and the prepared dataset if they exist.

The generation modes:

Mode                                Downloads   Dataset
REUSE_DATASET_IF_EXISTS (default)   Reuse       Reuse
REUSE_CACHE_IF_EXISTS               Reuse       Fresh
FORCE_REDOWNLOAD                    Fresh       Fresh
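
The mode is typically passed to load_dataset or download_and_prepare; a minimal sketch (the dataset name is only an example):

from datasets import DownloadMode, load_dataset

ds = load_dataset("mnist", download_mode=DownloadMode.FORCE_REDOWNLOAD)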

class datasets.SplitGenerator

( name: str gen_kwargs: typing.Dict = <factory> )

Parameters

  • name (str) — Name of the Split for which the generator will create the examples.
  • gen_kwargs (dict) — Keyword arguments to forward to the DatasetBuilder._generate_examples method of the builder.

Defines the split information for the generator.

This should be used as the return value of GeneratorBasedBuilder._split_generators(). See GeneratorBasedBuilder._split_generators() for more info and an example of usage.

class datasets.Split

( name )

Enum for dataset splits.

Datasets are typically split into different subsets to be used at various stages of training and evaluation.

  • TRAIN: the training data.
  • VALIDATION: the validation data. If present, this is typically used as evaluation data while iterating on a model (e.g. changing hyperparameters, model architecture, etc.).
  • TEST: the testing data. This is the data to report metrics on. Typically you do not want to use this during model iteration as you may overfit to it.
  • ALL: the union of all defined dataset splits.

Note: All splits, including compositions, inherit from datasets.SplitBase.

See the guide on splits for more information.
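
As a usage sketch (the dataset name is only an example), a split can be passed directly to load_dataset:

import datasets

ds = datasets.load_dataset("mnist", split=datasets.Split.TRAIN)  # same as split="train"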

class datasets.NamedSplit

( name )

Descriptor corresponding to a named split (train, test, …).

Example:

Each descriptor can be composed with others using addition or slicing. For example:

split = datasets.Split.TRAIN.subsplit(datasets.percent[0:25]) + datasets.Split.TEST

The resulting split will correspond to 25% of the train split merged with 100% of the test split.

Warning:

A split cannot be added twice, so the following will fail:

split = (
        datasets.Split.TRAIN.subsplit(datasets.percent[:25]) +
        datasets.Split.TRAIN.subsplit(datasets.percent[75:])
)  # Error
split = datasets.Split.TEST + datasets.Split.ALL  # Error

Warning:

Slices can be applied only once, so the following are valid:

split = (
    datasets.Split.TRAIN.subsplit(datasets.percent[:25]) +
    datasets.Split.TEST.subsplit(datasets.percent[:50])
)
split = (datasets.Split.TRAIN + datasets.Split.TEST).subsplit(datasets.percent[:50])

But not:

train = datasets.Split.TRAIN
test = datasets.Split.TEST
split = train.subsplit(datasets.percent[:25]).subsplit(datasets.percent[:25])
split = (train.subsplit(datasets.percent[:25]) + test).subsplit(datasets.percent[:50])

class datasets.NamedSplitAll

( )

Split corresponding to the union of all defined dataset splits.

class datasets.ReadInstruction

( split_name rounding = None from_ = None to = None unit = None )

Reading instruction for a dataset.

Examples:

# The following lines are equivalent:
ds = datasets.load_dataset('mnist', split='test[:33%]')
ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction.from_spec('test[:33%]'))
ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction('test', to=33, unit='%'))
ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction(
    'test', from_=0, to=33, unit='%'))

# The following lines are equivalent:
ds = datasets.load_dataset('mnist', split='test[:33%]+train[1:-1]')
ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction.from_spec(
    'test[:33%]+train[1:-1]'))
ds = datasets.load_dataset('mnist', split=(
    datasets.ReadInstruction('test', to=33, unit='%') +
    datasets.ReadInstruction('train', from_=1, to=-1, unit='abs')))

# The following lines are equivalent:
ds = datasets.load_dataset('mnist', split='test[:33%](pct1_dropremainder)')
ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction.from_spec(
    'test[:33%](pct1_dropremainder)'))
ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction(
    'test', from_=0, to=33, unit='%', rounding="pct1_dropremainder"))

# 10-fold validation:
tests = datasets.load_dataset(
    'mnist',
    split=[datasets.ReadInstruction('train', from_=k, to=k+10, unit='%')
           for k in range(0, 100, 10)])
trains = datasets.load_dataset(
    'mnist',
    split=[datasets.ReadInstruction('train', to=k, unit='%') + datasets.ReadInstruction('train', from_=k+10, unit='%')
           for k in range(0, 100, 10)])

from_spec

( spec )

Parameters

  • spec (str) — Split(s) + optional slice(s) to read + optional rounding if percents are used as the slicing unit. A slice can be specified using absolute numbers (int) or percentages (int). Examples:
    test: test split.
    test + validation: test split + validation split.
    test[10:]: test split, minus its first 10 records.
    test[:10%]: first 10% of the records of the test split.
    test[:20%](pct1_dropremainder): first 20% of the records, rounded with the pct1_dropremainder rounding.
    test[:-5%]+train[40%:60%]: first 95% of test + middle 20% of train.

Creates a ReadInstruction instance out of a string spec.

to_absolute

( name2len )

Translate instruction into a list of absolute instructions.

Those absolute instructions are then to be added together.

class datasets.DownloadConfig

( cache_dir: typing.Union[pathlib.Path, str, NoneType] = None force_download: bool = False resume_download: bool = False local_files_only: bool = False proxies: typing.Optional[typing.Dict] = None user_agent: typing.Optional[str] = None extract_compressed_file: bool = False force_extract: bool = False delete_extracted: bool = False use_etag: bool = True num_proc: typing.Optional[int] = None max_retries: int = 1 use_auth_token: typing.Union[bool, str, NoneType] = None ignore_url_params: bool = False download_desc: typing.Optional[str] = None )

Parameters

  • cache_dir (str or Path, optional) — Specify a cache directory to save the file to (override the default cache dir).
  • force_download (bool, default False) — If True, re-download the file even if it’s already cached in the cache dir.
  • resume_download (bool, default False) — If True, resume the download if an incompletely received file is found.
  • proxies (dict, optional) — Proxy configuration passed on to the HTTP requests.
  • user_agent (str, optional) — Optional string or dict that will be appended to the user-agent on remote requests.
  • extract_compressed_file (bool, default False) — If True and the path points to a zip or tar file, extract the compressed file in a folder alongside the archive.
  • force_extract (bool, default False) — If True, when extract_compressed_file is True and the archive was already extracted, re-extract the archive and overwrite the folder where it was extracted.
  • delete_extracted (bool, default False) — Whether to delete (or keep) the extracted files.
  • use_etag (bool, default True) — Whether to use the ETag HTTP response header to validate the cached files.
  • num_proc (int, optional) — The number of processes to launch to download the files in parallel.
  • max_retries (int, default 1) — The number of times to retry an HTTP request if it fails.
  • use_auth_token (str or bool, optional) — Optional string or boolean to use as Bearer token for remote files on the Datasets Hub. If True, will get token from ~/.huggingface.
  • ignore_url_params (bool, default False) — Whether to strip all query parameters and #fragments from the download URL before using it for caching the file.
  • download_desc (str, optional) — A description to be displayed alongside the progress bar while downloading the files.

Configuration for our cached path manager.
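
A minimal sketch (the dataset name is only an example): a DownloadConfig is usually passed to load_dataset through the download_config argument:

from datasets import DownloadConfig, load_dataset

download_config = DownloadConfig(resume_download=True, max_retries=5)
ds = load_dataset("mnist", download_config=download_config)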

class datasets.Version

( version_str: str description: typing.Optional[str] = None major: typing.Union[str, int, NoneType] = None minor: typing.Union[str, int, NoneType] = None patch: typing.Union[str, int, NoneType] = None )

Parameters

  • version_str (str) — The dataset version string, e.g. “1.2.3”.
  • description (str) — A description of what is new in this version.
  • major (str or int, optional) —
  • minor (str or int, optional) —
  • patch (str or int, optional) —

Dataset version MAJOR.MINOR.PATCH.
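
A typical usage sketch inside a builder (the class name is illustrative):

import datasets


class MyDataset(datasets.GeneratorBasedBuilder):
    # Bump the version when the data or the processing changes, so the cache is regenerated.
    VERSION = datasets.Version("1.0.0")
    # _info, _split_generators and _generate_examples omitted for brevity.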

match

( other_version )

Returns True if other_version matches.