You are viewing version v2.2.1. A newer version, v3.6.0, is available.

Builder classes

🤗 Datasets relies on two main classes during the dataset building process: DatasetBuilder and BuilderConfig.

class datasets.DatasetBuilder

( cache_dir: typing.Optional[str] = None name: typing.Optional[str] = None hash: typing.Optional[str] = None base_path: typing.Optional[str] = None info: typing.Optional[datasets.info.DatasetInfo] = None features: typing.Optional[datasets.features.features.Features] = None use_auth_token: typing.Union[str, bool, NoneType] = None repo_id: typing.Optional[str] = None data_files: typing.Union[str, list, dict, datasets.data_files.DataFilesDict, NoneType] = None data_dir: typing.Optional[str] = None **config_kwargs )

Abstract base class for all datasets.

DatasetBuilder has 3 key methods:

datasets.DatasetBuilder.info: Documents the dataset, including feature names, types, and shapes, version, splits, citation, etc.
datasets.DatasetBuilder.download_and_prepare(): Downloads the source data and writes it to disk.
datasets.DatasetBuilder.as_dataset(): Generates a Dataset.

Configuration: Some DatasetBuilders expose multiple variants of the dataset by defining a datasets.BuilderConfig subclass and accepting a config object (or name) on construction. Configurable datasets expose a pre-defined set of configurations in datasets.DatasetBuilder.builder_configs().
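
The three-step flow described above (describe, prepare, read) can be pictured with a small stand-in. This is a minimal sketch using only the standard library: `ToyBuilder`, its JSON cache file, and its hard-coded rows are hypothetical and do not reflect how DatasetBuilder is actually implemented.

```python
import json
import tempfile
from pathlib import Path


class ToyBuilder:
    """Hypothetical stand-in that mirrors the info / prepare / read flow."""

    def __init__(self, cache_dir):
        self.cache_dir = Path(cache_dir)

    @property
    def info(self):
        # Step 1: describe the dataset (features, version, splits).
        return {"features": {"text": "string"}, "version": "1.0.0", "splits": ["train"]}

    def download_and_prepare(self):
        # Step 2: fetch the source data and write it to disk.
        # The "download" here is just a hard-coded list of rows.
        rows = [{"text": "hello"}, {"text": "world"}]
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        (self.cache_dir / "train.json").write_text(json.dumps(rows))

    def as_dataset(self, split):
        # Step 3: read the prepared files back for one split.
        return json.loads((self.cache_dir / f"{split}.json").read_text())


with tempfile.TemporaryDirectory() as tmp:
    builder = ToyBuilder(tmp)
    builder.download_and_prepare()
    train_rows = builder.as_dataset("train")
```

The point of the split between the last two steps is that preparation happens once and writes to a cache, while reading can be repeated cheaply.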

as_dataset

( split: typing.Optional[datasets.splits.Split] = None run_post_process = True ignore_verifications = False in_memory = False )

Parameters

split (datasets.Split) — Which subset of the data to return.
run_post_process (bool, default=True) — Whether to run post-processing dataset transforms and/or add indexes.
ignore_verifications (bool, default=False) — Whether to ignore the verifications of the downloaded/processed dataset information (checksums/size/splits/…).
in_memory (bool, default=False) — Whether to copy the data in-memory.

Return a Dataset for the specified split.

download_and_prepare

( download_config: typing.Optional[datasets.utils.file_utils.DownloadConfig] = None download_mode: typing.Optional[datasets.utils.download_manager.DownloadMode] = None ignore_verifications: bool = False try_from_hf_gcs: bool = True dl_manager: typing.Optional[datasets.utils.download_manager.DownloadManager] = None base_path: typing.Optional[str] = None use_auth_token: typing.Union[str, bool, NoneType] = None **download_and_prepare_kwargs )

Parameters

download_config (DownloadConfig, optional) — Specific download configuration parameters.
download_mode (DownloadMode, optional) — Select the download/generate mode. Defaults to REUSE_DATASET_IF_EXISTS.
ignore_verifications (bool) — Ignore the verifications of the downloaded/processed dataset information (checksums/size/splits/…).
try_from_hf_gcs (bool) — If True, try to download the already prepared dataset from the Hugging Face Google Cloud Storage.
dl_manager (DownloadManager, optional) — Specific DownloadManager to use.
base_path (str, optional) — Base path for relative paths that are used to download files. This can be a remote URL. If not specified, the value of the base_path attribute (self.base_path) is used instead.
use_auth_token (Union[str, bool], optional) — Optional string or boolean to use as Bearer token for remote files on the Datasets Hub. If True, will get token from ~/.huggingface.

Downloads and prepares dataset for reading.

get_all_exported_dataset_infos

( )

Returns all exported dataset infos, or an empty dict if none exist.

get_exported_dataset_info

( )

Returns the exported DatasetInfo, or an empty DatasetInfo if it doesn’t exist.

get_imported_module_dir

( )

Return the path of the module of this class or subclass.

class datasets.GeneratorBasedBuilder

( *args writer_batch_size = None **kwargs )

Base class for datasets with data generation based on dict generators.

GeneratorBasedBuilder is a convenience class that abstracts away much of the data writing and reading of DatasetBuilder. It expects subclasses to implement generators of feature dictionaries across the dataset splits (_split_generators). See the method docstrings for details.

class datasets.BeamBasedBuilder

( *args **kwargs )

Beam based Builder.

class datasets.ArrowBasedBuilder

( cache_dir: typing.Optional[str] = None name: typing.Optional[str] = None hash: typing.Optional[str] = None base_path: typing.Optional[str] = None info: typing.Optional[datasets.info.DatasetInfo] = None features: typing.Optional[datasets.features.features.Features] = None use_auth_token: typing.Union[str, bool, NoneType] = None repo_id: typing.Optional[str] = None data_files: typing.Union[str, list, dict, datasets.data_files.DataFilesDict, NoneType] = None data_dir: typing.Optional[str] = None **config_kwargs )

Base class for datasets with data generation based on Arrow loading functions (CSV/JSON/Parquet).

class datasets.BuilderConfig

( name: str = 'default' version: typing.Union[str, datasets.utils.version.Version, NoneType] = '0.0.0' data_dir: typing.Optional[str] = None data_files: typing.Optional[datasets.data_files.DataFilesDict] = None description: typing.Optional[str] = None )

Parameters

name (str, default "default") —
version (Version or str, optional) —
data_dir (str, optional) —
data_files (str or Sequence or Mapping, optional) — Path(s) to source data file(s).
description (str, optional) —

Base class for DatasetBuilder data configuration.

DatasetBuilder subclasses with data configuration options should subclass BuilderConfig and add their own properties.

create_config_id

( config_kwargs: dict custom_features: typing.Optional[datasets.features.features.Features] = None )

The config id is used to build the cache directory. By default it is equal to the config name. However the name of a config is not sufficient to have a unique identifier for the dataset being generated since it doesn’t take into account:

the config kwargs that can be used to overwrite attributes
the custom features used to write the dataset
the data_files for json/text/csv/pandas datasets

Therefore the config id is just the config name with an optional suffix based on these.
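
The idea of "config name plus an optional suffix" can be sketched as follows. This is an illustration only: `toy_config_id`, the use of SHA-256, and the 16-character suffix are assumptions of the sketch, since the text above does not specify how the library derives the suffix.

```python
import hashlib
import json


def toy_config_id(name, config_kwargs=None, custom_features=None):
    """Return `name`, plus a suffix when extra inputs change the output."""
    extras = {}
    if config_kwargs:
        extras["config_kwargs"] = config_kwargs
    if custom_features:
        extras["custom_features"] = custom_features
    if not extras:
        # Nothing overrides the named config, so the name is enough.
        return name
    # Serialize with sorted keys so equal inputs always hash the same way.
    payload = json.dumps(extras, sort_keys=True).encode("utf-8")
    suffix = hashlib.sha256(payload).hexdigest()[:16]
    return f"{name}-{suffix}"


plain_id = toy_config_id("default")
custom_id = toy_config_id("default", {"data_files": ["a.csv"]})
```

Because the suffix depends on the inputs, two runs with different data files end up in different cache directories, while identical runs share one.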

class datasets.DownloadManager

( dataset_name: typing.Optional[str] = None data_dir: typing.Optional[str] = None download_config: typing.Optional[datasets.utils.file_utils.DownloadConfig] = None base_path: typing.Optional[str] = None record_checksums = True )

download

( url_or_urls ) → downloaded_path(s)

Returns

downloaded_path(s)

str, The downloaded paths matching the given input url_or_urls.

Download given url(s).

download_and_extract

( url_or_urls ) → extracted_path(s)

Returns

extracted_path(s)

str, extracted paths of given URL(s).

Download and extract given url_or_urls.

Is roughly equivalent to:

extracted_paths = dl_manager.extract(dl_manager.download(url_or_urls))
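
To make the composition concrete without touching the network, here is a minimal sketch in which "downloading" is a local file copy and "extracting" unpacks a zip archive. `ToyDownloadManager` and its `.extracted` folder naming are hypothetical; the real DownloadManager handles remote URLs, caching, and more archive formats.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


class ToyDownloadManager:
    """Hypothetical stand-in: 'downloads' are local copies into a cache dir."""

    def __init__(self, cache_dir):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def download(self, source_path):
        # Copy the file into the cache and return the cached path.
        target = self.cache_dir / Path(source_path).name
        shutil.copy(source_path, target)
        return str(target)

    def extract(self, archive_path):
        # Unpack a zip archive next to the cached file and return the folder.
        out_dir = Path(str(archive_path) + ".extracted")
        with zipfile.ZipFile(archive_path) as archive:
            archive.extractall(out_dir)
        return str(out_dir)

    def download_and_extract(self, source_path):
        # Same composition as the one-liner above.
        return self.extract(self.download(source_path))


# Build a small zip file to act as the "remote" resource.
workdir = Path(tempfile.mkdtemp())
source_zip = workdir / "data.zip"
with zipfile.ZipFile(source_zip, "w") as archive:
    archive.writestr("train.txt", "hello\n")

dl_manager = ToyDownloadManager(workdir / "cache")
extracted_path = dl_manager.download_and_extract(source_zip)
extracted_text = (Path(extracted_path) / "train.txt").read_text()
```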

download_custom

( url_or_urls custom_download ) → downloaded_path(s)

Returns

downloaded_path(s)

str, The downloaded paths matching the given input url_or_urls.

Download given url(s) by calling custom_download.

extract

( path_or_paths num_proc = None ) → extracted_path(s)

Returns

extracted_path(s)

str, The extracted paths matching the given input path_or_paths.

Extract given path(s).

iter_archive

( path_or_buf: typing.Union[str, _io.BufferedReader] )

Parameters

path_or_buf (str or io.BufferedReader) — Archive path or archive binary file object.

Iterate over files within an archive.
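
Iterating over an archive without extracting it to disk can be sketched with the standard library's `tarfile` module. The helper name `iter_tar_members` and the `(path, file object)` shape of the yielded items are assumptions of this sketch, not a statement about what iter_archive returns.

```python
import io
import tarfile


def iter_tar_members(path_or_buf):
    """Yield (member_path, file_object) for each regular file in a tar archive."""
    if isinstance(path_or_buf, str):
        archive = tarfile.open(path_or_buf)
    else:
        archive = tarfile.open(fileobj=path_or_buf)
    with archive:
        for member in archive:
            if member.isfile():
                yield member.name, archive.extractfile(member)


# Build a small in-memory archive to iterate over.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as archive:
    for name, payload in [("a.txt", b"alpha"), ("b.txt", b"beta")]:
        member = tarfile.TarInfo(name)
        member.size = len(payload)
        archive.addfile(member, io.BytesIO(payload))
buffer.seek(0)

# Each file object must be read while the iteration is still in progress.
contents = {name: handle.read() for name, handle in iter_tar_members(buffer)}
```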

iter_files

( paths: typing.Union[str, typing.List[str]] )

Parameters

paths (str or list of str) — Root paths.

Iterate over file paths.

ship_files_with_pipeline

( downloaded_path_or_paths pipeline )

Ship the files using Beam FileSystems to the pipeline temp dir.

class datasets.DownloadMode

( value names = None module = None qualname = None type = None start = 1 )

Enum for how to treat pre-existing downloads and data.

The default mode is REUSE_DATASET_IF_EXISTS, which will reuse both raw downloads and the prepared dataset if they exist.

The generation modes:

	Downloads	Dataset
`REUSE_DATASET_IF_EXISTS` (default)	Reuse	Reuse
`REUSE_CACHE_IF_EXISTS`	Reuse	Fresh
`FORCE_REDOWNLOAD`	Fresh	Fresh
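
The table can be read as a small decision rule. The sketch below encodes it with a stand-in enum; `ToyDownloadMode`, its string values, and the `plan` helper are hypothetical and only restate the Reuse/Fresh columns above.

```python
from enum import Enum


class ToyDownloadMode(Enum):
    """Hypothetical mirror of the three modes in the table above."""

    REUSE_DATASET_IF_EXISTS = "reuse_dataset_if_exists"
    REUSE_CACHE_IF_EXISTS = "reuse_cache_if_exists"
    FORCE_REDOWNLOAD = "force_redownload"


# (reuse raw downloads?, reuse prepared dataset?) for each mode.
REUSE_POLICY = {
    ToyDownloadMode.REUSE_DATASET_IF_EXISTS: (True, True),
    ToyDownloadMode.REUSE_CACHE_IF_EXISTS: (True, False),
    ToyDownloadMode.FORCE_REDOWNLOAD: (False, False),
}


def plan(mode, download_cached, dataset_cached):
    """Return the list of steps that still have to run."""
    reuse_downloads, reuse_dataset = REUSE_POLICY[mode]
    if reuse_dataset and dataset_cached:
        return []
    steps = []
    if not (reuse_downloads and download_cached):
        steps.append("download")
    steps.append("prepare")
    return steps


steps_when_everything_cached = {
    mode.name: plan(mode, download_cached=True, dataset_cached=True)
    for mode in ToyDownloadMode
}
```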

class datasets.SplitGenerator

( name: str gen_kwargs: typing.Dict = <factory> )

Parameters

name (str) — Name of the Split for which the generator will create the examples.
**gen_kwargs — Keyword arguments to forward to the DatasetBuilder._generate_examples method of the builder.

Defines the split information for the generator.

This should be used as the return value of GeneratorBasedBuilder._split_generators(). See GeneratorBasedBuilder._split_generators() for more information and a usage example.
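
How gen_kwargs travels from the split definition to the example generator can be sketched with plain Python. `ToySplitGenerator`, `ToyGeneratorBuilder`, and its `build` method are hypothetical stand-ins, and the `(key, example)` pairs yielded below are an assumption of the sketch.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ToySplitGenerator:
    """Hypothetical stand-in with the same two fields as SplitGenerator."""

    name: str
    gen_kwargs: Dict = field(default_factory=dict)


class ToyGeneratorBuilder:
    def _split_generators(self):
        # One entry per split; gen_kwargs carries whatever the generator needs.
        return [
            ToySplitGenerator(name="train", gen_kwargs={"lines": ["a", "b", "c"]}),
            ToySplitGenerator(name="test", gen_kwargs={"lines": ["d"]}),
        ]

    def _generate_examples(self, lines):
        # Receives gen_kwargs as keyword arguments and yields (key, example).
        for index, line in enumerate(lines):
            yield index, {"text": line}

    def build(self):
        # Forward each split's gen_kwargs to the example generator.
        return {
            generator.name: list(self._generate_examples(**generator.gen_kwargs))
            for generator in self._split_generators()
        }


splits = ToyGeneratorBuilder().build()
```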

class datasets.Split

( name )

Enum for dataset splits.

Datasets are typically split into different subsets to be used at various stages of training and evaluation.

TRAIN: the training data.
VALIDATION: the validation data. If present, this is typically used as evaluation data while iterating on a model (e.g. changing hyperparameters, model architecture, etc.).
TEST: the testing data. This is the data to report metrics on. Typically you do not want to use this during model iteration as you may overfit to it.
ALL: the union of all defined dataset splits.

Note: All splits, including compositions, inherit from datasets.SplitBase.

See the guide on splits (the loading page) for more information.

class datasets.NamedSplit

( name )

Descriptor corresponding to a named split (train, test, …).

Example:

Each descriptor can be composed with others using addition or slicing. For example:

split = datasets.Split.TRAIN.subsplit(datasets.percent[0:25]) + datasets.Split.TEST

The resulting split will correspond to 25% of the train split merged with 100% of the test split.

Warning:

A split cannot be added twice, so the following will fail:

split = (
        datasets.Split.TRAIN.subsplit(datasets.percent[:25]) +
        datasets.Split.TRAIN.subsplit(datasets.percent[75:])
)  # Error
split = datasets.Split.TEST + datasets.Split.ALL  # Error

Warning:

Slices can be applied only once, so the following are valid:

split = (
        datasets.Split.TRAIN.subsplit(datasets.percent[:25]) +
        datasets.Split.TEST.subsplit(datasets.percent[:50])
)
split = (datasets.Split.TRAIN + datasets.Split.TEST).subsplit(datasets.percent[:50])

But not:

train = datasets.Split.TRAIN
test = datasets.Split.TEST
split = train.subsplit(datasets.percent[:25]).subsplit(datasets.percent[:25])
split = (train.subsplit(datasets.percent[:25]) + test).subsplit(datasets.percent[:50])

class datasets.NamedSplitAll

( )

Split corresponding to the union of all defined dataset splits.

class datasets.ReadInstruction

( split_name rounding = None from_ = None to = None unit = None )

Reading instruction for a dataset.

Examples:

# The following lines are equivalent:
ds = datasets.load_dataset('mnist', split='test[:33%]')
ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction.from_spec('test[:33%]'))
ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction('test', to=33, unit='%'))
ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction(
    'test', from_=0, to=33, unit='%'))

# The following lines are equivalent:
ds = datasets.load_dataset('mnist', split='test[:33%]+train[1:-1]')
ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction.from_spec(
    'test[:33%]+train[1:-1]'))
ds = datasets.load_dataset('mnist', split=(
    datasets.ReadInstruction('test', to=33, unit='%') +
    datasets.ReadInstruction('train', from_=1, to=-1, unit='abs')))

# The following lines are equivalent:
ds = datasets.load_dataset('mnist', split='test[:33%](pct1_dropremainder)')
ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction.from_spec(
    'test[:33%](pct1_dropremainder)'))
ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction(
    'test', from_=0, to=33, unit='%', rounding="pct1_dropremainder"))

# 10-fold validation:
# The list of instructions is passed as the split keyword argument;
# the second positional argument of load_dataset is the config name.
tests = datasets.load_dataset(
    'mnist',
    split=[datasets.ReadInstruction('train', from_=k, to=k+10, unit='%')
           for k in range(0, 100, 10)])
trains = datasets.load_dataset(
    'mnist',
    split=[datasets.ReadInstruction('train', to=k, unit='%') + datasets.ReadInstruction('train', from_=k+10, unit='%')
           for k in range(0, 100, 10)])
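
The 10-fold slices can also be written as split strings, using the same `name[from%:to%]` syntax as the string examples above. The helper below only builds the strings; `ten_fold_specs` is a hypothetical name, and the resulting lists could then be passed as the split argument of load_dataset.

```python
def ten_fold_specs(split_name="train", folds=10):
    """Return (test_specs, train_specs) as split strings, one pair per fold."""
    step = 100 // folds
    test_specs = []
    train_specs = []
    for start in range(0, 100, step):
        stop = start + step
        # Held-out slice for this fold.
        test_specs.append(f"{split_name}[{start}%:{stop}%]")
        # Everything before the slice plus everything after it.
        train_specs.append(f"{split_name}[:{start}%]+{split_name}[{stop}%:]")
    return test_specs, train_specs


test_specs, train_specs = ten_fold_specs()
```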

from_spec

( spec )

Parameters

spec (str) — Split(s) + optional slice(s) to read + optional rounding if percents are used as the slicing unit. A slice can be specified using absolute numbers (int) or percentages (int). Examples:
test: test split.
test + validation: test split + validation split.
test[10:]: test split, minus its first 10 records.
test[:10%]: first 10% records of test split.
test[:20%](pct1_dropremainder): first 20% records, rounded with the pct1_dropremainder rounding.
test[:-5%]+train[40%:60%]: first 95% of test + middle 20% of train.

Creates a ReadInstruction instance out of a string spec.
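
The spec grammar described above (split name, optional slice, optional rounding, parts joined with `+`) can be illustrated with a small regular-expression parser. This is a toy model under stated assumptions: `parse_spec`, the regex, and the dictionary layout are hypothetical and are not the library's parser.

```python
import re

# name, optional [start:stop] slice with optional % on each bound, optional (rounding)
_PART = re.compile(
    r"^\s*(?P<name>\w+)"
    r"(?:\[(?P<start>-?\d+)?(?P<start_pct>%)?:(?P<stop>-?\d+)?(?P<stop_pct>%)?\])?"
    r"(?:\((?P<rounding>\w+)\))?\s*$"
)


def parse_spec(spec):
    """Parse a split spec into a list of dicts, one per '+'-separated part."""
    parts = []
    for chunk in spec.split("+"):
        match = _PART.match(chunk)
        if match is None:
            raise ValueError(f"Unrecognized split spec: {chunk!r}")
        uses_pct = bool(match["start_pct"] or match["stop_pct"])
        parts.append({
            "split_name": match["name"],
            "from_": int(match["start"]) if match["start"] else None,
            "to": int(match["stop"]) if match["stop"] else None,
            "unit": "%" if uses_pct else "abs",
            "rounding": match["rounding"],
        })
    return parts


parsed = parse_spec("test[:-5%]+train[40%:60%]")
```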

to_absolute

( name2len )

Translate instruction into a list of absolute instructions.

Those absolute instructions are then to be added together.
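
Resolving a relative instruction against known split sizes can be sketched as below. The function name `to_absolute_bounds`, the round-to-closest rule for percentages, and the clamping are assumptions of this sketch; only the `name2len` mapping from split name to number of examples comes from the signature above.

```python
def to_absolute_bounds(split_name, from_, to, unit, name2len):
    """Resolve one (from_, to) slice into absolute [start, stop) indices."""
    num_examples = name2len[split_name]

    def resolve(bound, default):
        if bound is None:
            return default
        if unit == "%":
            # Assumption of this sketch: percentages round to the closest index.
            bound = int(round(bound * num_examples / 100.0))
        if bound < 0:
            # Negative bounds count back from the end, like Python slices.
            bound += num_examples
        # Clamp into the valid range.
        return max(0, min(bound, num_examples))

    return split_name, resolve(from_, 0), resolve(to, num_examples)


name2len = {"train": 1000, "test": 200}
absolute = [
    to_absolute_bounds("test", None, -5, "%", name2len),
    to_absolute_bounds("train", 40, 60, "%", name2len),
]
```

With 200 test examples, `test[:-5%]` resolves to the first 190 records, which matches the "first 95% of test" reading of that spec.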

class datasets.DownloadConfig

( cache_dir: typing.Union[pathlib.Path, str, NoneType] = None force_download: bool = False resume_download: bool = False local_files_only: bool = False proxies: typing.Optional[typing.Dict] = None user_agent: typing.Optional[str] = None extract_compressed_file: bool = False force_extract: bool = False delete_extracted: bool = False use_etag: bool = True num_proc: typing.Optional[int] = None max_retries: int = 1 use_auth_token: typing.Union[bool, str, NoneType] = None ignore_url_params: bool = False download_desc: typing.Optional[str] = None )

Parameters

cache_dir (str or Path, optional) — Specify a cache directory to save the file to (overwrite the default cache dir).
force_download (bool, default False) — If True, re-download the file even if it’s already cached in the cache dir.
resume_download (bool, default False) — If True, resume the download if an incompletely received file is found.
proxies (dict, optional) —
user_agent (str, optional) — Optional string or dict that will be appended to the user-agent on remote requests.
extract_compressed_file (bool, default False) — If True and the path points to a zip or tar file, extract the compressed file in a folder alongside the archive.
force_extract (bool, default False) — If True when extract_compressed_file is True and the archive was already extracted, re-extract the archive and override the folder where it was extracted.
delete_extracted (bool, default False) — Whether to delete (or keep) the extracted files.
use_etag (bool, default True) — Whether to use the ETag HTTP response header to validate the cached files.
num_proc (int, optional) — The number of processes to launch to download the files in parallel.
max_retries (int, default 1) — The number of times to retry an HTTP request if it fails.
use_auth_token (str or bool, optional) — Optional string or boolean to use as Bearer token for remote files on the Datasets Hub. If True, will get token from ~/.huggingface.
ignore_url_params (bool, default False) — Whether to strip all query parameters and #fragments from the download URL before using it for caching the file.
download_desc (str, optional) — A description to be displayed alongside the progress bar while downloading the files.

Configuration for our cached path manager.
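
The effect of ignore_url_params on caching can be illustrated with the standard library. This is a sketch only: `cache_key` and the use of a SHA-256 digest as the cache file name are assumptions, chosen to show why stripping the query string and fragment lets differently signed URLs share one cache entry.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def cache_key(url, ignore_url_params=False):
    """Return a hex digest used as a hypothetical cache file name for `url`."""
    if ignore_url_params:
        scheme, netloc, path, _query, _fragment = urlsplit(url)
        # Drop the query string and fragment, keep everything else.
        url = urlunsplit((scheme, netloc, path, "", ""))
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


# Two URLs for the same file that differ only in their query/fragment.
signed_a = "https://example.com/data.zip?token=abc#part"
signed_b = "https://example.com/data.zip?token=xyz"

same_when_ignored = cache_key(signed_a, True) == cache_key(signed_b, True)
same_by_default = cache_key(signed_a) == cache_key(signed_b)
```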

class datasets.Version

( version_str: str description: typing.Optional[str] = None major: typing.Union[str, int, NoneType] = None minor: typing.Union[str, int, NoneType] = None patch: typing.Union[str, int, NoneType] = None )

Parameters

version_str (str) — Eg: “1.2.3”.
description (str) — A description of what is new in this version.
major (str) —
minor (str) —
patch (str) —

Dataset version MAJOR.MINOR.PATCH.

match

( other_version )

Returns True if other_version matches.
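
A field-by-field comparison of MAJOR.MINOR.PATCH strings can be sketched as follows. The helpers `parse_version` and `toy_match` are hypothetical, and treating `*` as a wildcard is an assumption of this sketch rather than something stated above.

```python
def parse_version(version_str):
    """Split 'MAJOR.MINOR.PATCH' into a tuple of three string fields."""
    fields = version_str.split(".")
    if len(fields) != 3:
        raise ValueError(f"Expected MAJOR.MINOR.PATCH, got {version_str!r}")
    return tuple(fields)


def toy_match(version_str, other_version):
    """Compare field by field; '*' in other_version matches anything (assumption)."""
    return all(
        theirs == "*" or int(ours) == int(theirs)
        for ours, theirs in zip(parse_version(version_str), parse_version(other_version))
    )


exact = toy_match("1.2.3", "1.2.3")
wildcard = toy_match("1.2.3", "1.*.*")
mismatch = toy_match("1.2.3", "2.*.*")
```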
