Builder classes

Builders

🤗 Datasets relies on two main classes during the dataset building process: DatasetBuilder and BuilderConfig.

class datasets.DatasetBuilder

(
    cache_dir: typing.Optional[str] = None,
    dataset_name: typing.Optional[str] = None,
    config_name: typing.Optional[str] = None,
    hash: typing.Optional[str] = None,
    base_path: typing.Optional[str] = None,
    info: typing.Optional[datasets.info.DatasetInfo] = None,
    features: typing.Optional[datasets.features.features.Features] = None,
    token: typing.Union[bool, str, NoneType] = None,
    use_auth_token = 'deprecated',
    repo_id: typing.Optional[str] = None,
    data_files: typing.Union[str, list, dict, datasets.data_files.DataFilesDict, NoneType] = None,
    data_dir: typing.Optional[str] = None,
    storage_options: typing.Optional[dict] = None,
    writer_batch_size: typing.Optional[int] = None,
    name = 'deprecated',
    **config_kwargs
)

Parameters

cache_dir (str, optional) — Directory to cache data. Defaults to "~/.cache/huggingface/datasets".
dataset_name (str, optional) — Name of the dataset, if different from the builder name. Useful for packaged builders like csv, imagefolder, audiofolder, etc. to reflect the difference between datasets that use the same packaged builder.
config_name (str, optional) — Name of the dataset configuration. It affects the data generated on disk. Different configurations will have their own subdirectories and versions. If not provided, the default configuration is used (if it exists).

Added in 2.3.0: the parameter previously called "name" was renamed to config_name.
hash (str, optional) — Hash specific to the dataset code. Used to update the caching directory when the dataset loading script code is updated (to avoid reusing old data). The typical caching directory (defined in self._relative_data_dir) is name/version/hash/.
base_path (str, optional) — Base path for relative paths that are used to download files. This can be a remote URL.
features (Features, optional) — Feature types to use with this dataset. It can be used, for example, to change the feature types of a dataset.
token (str or bool, optional) — String or boolean to use as Bearer token for remote files on the Datasets Hub. If True, the token is read from "~/.huggingface".
repo_id (str, optional) — ID of the dataset repository. Used to distinguish builders with the same name but not coming from the same namespace, for example “squad” and “lhoestq/squad” repo IDs. In the latter, the builder name would be “lhoestq___squad”.
data_files (str or Sequence or Mapping, optional) — Path(s) to source data file(s). For builders like “csv” or “json” that need the user to specify data files. They can be either local or remote files. For convenience, you can use a DataFilesDict.
data_dir (str, optional) — Path to directory containing source data file(s). Use only if data_files is not passed, in which case it is equivalent to passing os.path.join(data_dir, "**") as data_files. For builders that require manual download, it must be the path to the local directory containing the manually downloaded data.
storage_options (dict, optional) — Key/value pairs to be passed on to the dataset file-system backend, if any.
writer_batch_size (int, optional) — Batch size used by the ArrowWriter. It defines the number of samples that are kept in memory before writing them and also the length of the arrow chunks. None means that the ArrowWriter will use its default value.
name (str) — Configuration name for the dataset.

Deprecated in 2.3.0: use config_name instead.
**config_kwargs (additional keyword arguments) — Keyword arguments to be passed to the corresponding builder configuration class, set on the class attribute DatasetBuilder.BUILDER_CONFIG_CLASS. The builder configuration class is BuilderConfig or a subclass of it.
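
Three of the parameter descriptions above state concrete naming rules: the builder name derived from repo_id, the name/version/hash cache layout, and the data_dir glob equivalence. The sketch below only restates those rules as code. All three helper functions are hypothetical illustrations, not functions of the library, and the real cache layout may contain additional path components.

```python
import os
import posixpath


def builder_name_from_repo_id(repo_id: str) -> str:
    """Hypothetical helper: "lhoestq/squad" becomes "lhoestq___squad"."""
    # A repo ID without a namespace (e.g. "squad") is returned unchanged.
    return repo_id.replace("/", "___")


def relative_data_dir(name: str, version: str, hash_: str) -> str:
    """Hypothetical helper following the documented name/version/hash layout."""
    return posixpath.join(name, version, hash_)


def data_files_pattern(data_dir: str) -> str:
    """Glob that passing data_dir alone is described as equivalent to."""
    return os.path.join(data_dir, "**")


if __name__ == "__main__":
    print(builder_name_from_repo_id("lhoestq/squad"))
    print(relative_data_dir("squad", "1.0.0", "abc123"))
    print(data_files_pattern("my_data"))
```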

Abstract base class for all datasets.

DatasetBuilder has three key members:

DatasetBuilder.info: Documents the dataset, including feature names, types, shapes, version, splits, citation, etc.
DatasetBuilder.download_and_prepare(): Downloads the source data and writes it to disk.
DatasetBuilder.as_dataset(): Generates a Dataset.
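
The three members listed above are typically used in that order. The following is a minimal sketch of that sequence, assuming the datasets package is installed and using the packaged csv builder on a small local file so that no network access is needed. The helper names write_sample_csv and build_from_csv and the sample rows are made up for illustration.

```python
import csv
import os
import tempfile


def write_sample_csv(path: str) -> None:
    """Write a two-row CSV file to use as source data (illustrative content)."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["text", "label"])
        writer.writerow(["hello", 0])
        writer.writerow(["world", 1])


def build_from_csv(csv_path: str, cache_dir: str):
    """Run the builder lifecycle with the packaged "csv" builder."""
    # Third-party dependency (pip install datasets), imported lazily so the
    # helpers above can be loaded without it.
    from datasets import load_dataset_builder

    builder = load_dataset_builder(
        "csv", data_files={"train": csv_path}, cache_dir=cache_dir
    )
    # The file is local, so nothing is fetched; the data is converted and
    # written under cache_dir.
    builder.download_and_prepare()
    # Build a Dataset object from the prepared files.
    dataset = builder.as_dataset(split="train")
    return builder.info, dataset


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        source = os.path.join(tmp, "sample.csv")
        write_sample_csv(source)
        info, dataset = build_from_csv(source, os.path.join(tmp, "cache"))
        print(info.features)
        print(dataset[0])
```

In everyday use, load_dataset() wraps these steps; calling the builder directly is mainly useful when the preparation and loading steps need to be controlled separately.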

Some DatasetBuilders expose multiple variants of the dataset by defining a BuilderConfig subclass and accepting a config object (or name) on construction. Configurable datasets expose a pre-defined set of configurations in DatasetBuilder.builder_configs().
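
The configuration pattern described above can be modeled without the library. The sketch below is a toy model only: ToyConfig and ToyBuilder are hypothetical stand-ins for a BuilderConfig subclass and a configurable builder, and the "en"/"fr" variants are invented. It shows a builder that exposes a predefined set of configurations and accepts either a config object or a config name on construction.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class ToyConfig:
    """Stand-in for a BuilderConfig subclass; the fields are illustrative."""
    name: str
    language: str


class ToyBuilder:
    """Stand-in for a configurable builder; not the library's DatasetBuilder."""

    # Predefined variants of the (imaginary) dataset.
    CONFIGS = [ToyConfig("en", "english"), ToyConfig("fr", "french")]
    DEFAULT_CONFIG_NAME = "en"

    def __init__(self, config: Union[ToyConfig, str, None] = None):
        if isinstance(config, ToyConfig):
            # A config object is used as given.
            self.config = config
            return
        by_name = {c.name: c for c in self.CONFIGS}
        wanted: Optional[str] = config or self.DEFAULT_CONFIG_NAME
        if wanted not in by_name:
            raise ValueError(
                f"Unknown config {wanted!r}; available: {sorted(by_name)}"
            )
        self.config = by_name[wanted]


if __name__ == "__main__":
    print(ToyBuilder().config)        # falls back to the default variant
    print(ToyBuilder("fr").config)    # selected by name
```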

DownloadMode	Downloads	Dataset
`REUSE_DATASET_IF_EXISTS` (default)	Reuse	Reuse
`REUSE_CACHE_IF_EXISTS`	Reuse	Fresh
`FORCE_REDOWNLOAD`	Fresh	Fresh
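
The table above can be read as a lookup from mode to what gets reused. The sketch below encodes it that way. ToyDownloadMode and reuse_plan are hypothetical names that only mirror the table; they are not the library's datasets.DownloadMode implementation.

```python
from enum import Enum


class ToyDownloadMode(Enum):
    """Toy mirror of the modes listed in the table."""
    REUSE_DATASET_IF_EXISTS = "reuse_dataset_if_exists"
    REUSE_CACHE_IF_EXISTS = "reuse_cache_if_exists"
    FORCE_REDOWNLOAD = "force_redownload"


# (reuse downloaded files?, reuse prepared dataset?) for each mode.
_REUSE_TABLE = {
    ToyDownloadMode.REUSE_DATASET_IF_EXISTS: (True, True),
    ToyDownloadMode.REUSE_CACHE_IF_EXISTS: (True, False),
    ToyDownloadMode.FORCE_REDOWNLOAD: (False, False),
}


def reuse_plan(mode: ToyDownloadMode = ToyDownloadMode.REUSE_DATASET_IF_EXISTS) -> dict:
    """Return which artifacts are reused and which are produced fresh."""
    downloads, dataset = _REUSE_TABLE[mode]
    return {"reuse_downloads": downloads, "reuse_dataset": dataset}


if __name__ == "__main__":
    for mode in ToyDownloadMode:
        print(mode.name, reuse_plan(mode))
```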

VerificationMode	Verification checks
`ALL_CHECKS`	Split checks, uniqueness of the keys yielded in the case of GeneratorBasedBuilder, and the validity (number of files, checksums, etc.) of downloaded files
`BASIC_CHECKS` (default)	Same as `ALL_CHECKS` but without checking downloaded files
`NO_CHECKS`	None
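
Each verification mode in the table selects a subset of three checks. The sketch below spells out those subsets. ToyVerificationMode, checks_for, and the check labels are hypothetical and only restate the table; the library's actual verification logic is not shown.

```python
from enum import Enum


class ToyVerificationMode(Enum):
    """Toy mirror of the modes listed in the table."""
    ALL_CHECKS = "all_checks"
    BASIC_CHECKS = "basic_checks"
    NO_CHECKS = "no_checks"


# Illustrative labels for the three kinds of checks named in the table.
SPLIT_CHECKS = "split checks"
KEY_UNIQUENESS = "uniqueness of yielded keys"
DOWNLOADED_FILES = "validity of downloaded files"


def checks_for(mode: ToyVerificationMode = ToyVerificationMode.BASIC_CHECKS) -> frozenset:
    """Return the set of checks that a mode enables."""
    if mode is ToyVerificationMode.NO_CHECKS:
        return frozenset()
    checks = {SPLIT_CHECKS, KEY_UNIQUENESS}
    if mode is ToyVerificationMode.ALL_CHECKS:
        # Only ALL_CHECKS also validates file counts, checksums, etc.
        checks.add(DOWNLOADED_FILES)
    return frozenset(checks)


if __name__ == "__main__":
    for mode in ToyVerificationMode:
        print(mode.name, sorted(checks_for(mode)))
```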
