Builder classes
Builders
🤗 Datasets relies on two main classes during the dataset building process: DatasetBuilder and BuilderConfig.
class datasets.DatasetBuilder
< source >( cache_dir: typing.Optional[str] = None config_name: typing.Optional[str] = None hash: typing.Optional[str] = None base_path: typing.Optional[str] = None info: typing.Optional[datasets.info.DatasetInfo] = None features: typing.Optional[datasets.features.features.Features] = None use_auth_token: typing.Union[str, bool, NoneType] = None repo_id: typing.Optional[str] = None data_files: typing.Union[str, list, dict, datasets.data_files.DataFilesDict, NoneType] = None data_dir: typing.Optional[str] = None name = 'deprecated' **config_kwargs )
Parameters
- cache_dir (str, optional) — Directory to cache data. Defaults to "~/.cache/huggingface/datasets".
- config_name (str, optional) — Name of the dataset configuration. It affects the data generated on disk. Different configurations will have their own subdirectories and versions. If not provided, the default configuration is used (if it exists). Added in 2.3.0: parameter name was renamed to config_name.
- hash (str, optional) — Hash specific to the dataset code. Used to update the caching directory when the dataset loading script code is updated (to avoid reusing old data). The typical caching directory (defined in self._relative_data_dir) is name/version/hash/.
- base_path (str, optional) — Base path for relative paths that are used to download files. This can be a remote URL.
- features (Features, optional) — Features types to use with this dataset. It can be used to change the Features types of a dataset, for example.
- use_auth_token (str or bool, optional) — String or boolean to use as Bearer token for remote files on the Datasets Hub. If True, will get token from "~/.huggingface".
- repo_id (str, optional) — ID of the dataset repository. Used to distinguish builders with the same name but not coming from the same namespace, for example “squad” and “lhoestq/squad” repo IDs. In the latter, the builder name would be “lhoestq___squad”.
- data_files (str or Sequence or Mapping, optional) — Path(s) to source data file(s). For builders like “csv” or “json” that need the user to specify data files. They can be either local or remote files. For convenience, you can use a DataFilesDict.
- data_dir (str, optional) — Path to directory containing source data file(s). Use only if data_files is not passed, in which case it is equivalent to passing os.path.join(data_dir, "**") as data_files. For builders that require manual download, it must be the path to the local directory containing the manually downloaded data.
- name (str) — Configuration name for the dataset. Deprecated in 2.3.0: use config_name instead.
- **config_kwargs (additional keyword arguments) — Keyword arguments to be passed to the corresponding builder configuration class, set on the class attribute DatasetBuilder.BUILDER_CONFIG_CLASS. The builder configuration class is BuilderConfig or a subclass of it.
Abstract base class for all datasets.
DatasetBuilder has 3 key methods:
- DatasetBuilder.info: Documents the dataset, including feature names, types, shapes, version, splits, citation, etc.
- DatasetBuilder.download_and_prepare(): Downloads the source data and writes it to disk.
- DatasetBuilder.as_dataset(): Generates a Dataset.
Some DatasetBuilders expose multiple variants of the dataset by defining a BuilderConfig subclass and accepting a config object (or name) on construction. Configurable datasets expose a pre-defined set of configurations in DatasetBuilder.builder_configs().
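The three key methods are typically used in sequence. The following is a minimal sketch, assuming network access to the Hub and a dataset name such as the rotten_tomatoes dataset used elsewhere on this page. The helper name build_split is hypothetical, and the import is deferred so that defining the helper does not trigger any download.

```python
def build_split(dataset_name, split="train"):
    """Hypothetical helper: run the three key DatasetBuilder steps in order."""
    # Deferred import: defining this helper has no side effects.
    from datasets import load_dataset_builder

    builder = load_dataset_builder(dataset_name)

    # 1. info: inspect the dataset metadata (may be sparse for some datasets).
    print(builder.info.features)

    # 2. download_and_prepare(): fetch the source data and write it to disk.
    #    This call returns nothing; the prepared files live in the cache.
    builder.download_and_prepare()

    # 3. as_dataset(): load the prepared files as a Dataset for one split.
    return builder.as_dataset(split=split)
```

Calling build_split("rotten_tomatoes") would then return the train split as a Dataset, provided the download succeeds.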
as_dataset
< source >( split: typing.Optional[datasets.splits.Split] = None run_post_process = True ignore_verifications = False in_memory = False )
Parameters
- split (datasets.Split) — Which subset of the data to return.
- run_post_process (bool, defaults to True) — Whether to run post-processing dataset transforms and/or add indexes.
- ignore_verifications (bool, defaults to False) — Whether to ignore the verifications of the downloaded/processed dataset information (checksums/size/splits/…).
- in_memory (bool, defaults to False) — Whether to copy the data in-memory.
Return a Dataset for the specified split.
download_and_prepare
< source >( output_dir: typing.Optional[str] = None download_config: typing.Optional[datasets.download.download_config.DownloadConfig] = None download_mode: typing.Optional[datasets.download.download_manager.DownloadMode] = None ignore_verifications: bool = False try_from_hf_gcs: bool = True dl_manager: typing.Optional[datasets.download.download_manager.DownloadManager] = None base_path: typing.Optional[str] = None use_auth_token = 'deprecated' file_format: str = 'arrow' max_shard_size: typing.Union[str, int, NoneType] = None num_proc: typing.Optional[int] = None storage_options: typing.Optional[dict] = None **download_and_prepare_kwargs )
Parameters
- output_dir (str, optional) — Output directory for the dataset. Defaults to this builder’s cache_dir, which is inside ~/.cache/huggingface/datasets by default. Added in 2.5.0.
- download_config (DownloadConfig, optional) — Specific download configuration parameters.
- download_mode (DownloadMode, optional) — Select the download/generate mode. Defaults to REUSE_DATASET_IF_EXISTS.
- ignore_verifications (bool) — Ignore the verifications of the downloaded/processed dataset information (checksums/size/splits/…).
- try_from_hf_gcs (bool) — If True, it will try to download the already prepared dataset from the HF Google cloud storage.
- dl_manager (DownloadManager, optional) — Specific DownloadManager to use.
- base_path (str, optional) — Base path for relative paths that are used to download files. This can be a remote URL. If not specified, the value of the base_path attribute (self.base_path) will be used instead.
- use_auth_token (Union[str, bool], optional) — Optional string or boolean to use as Bearer token for remote files on the Datasets Hub. If True, or not specified, will get token from ~/.huggingface. Deprecated in 2.7.1: pass use_auth_token to the initializer/load_dataset_builder instead.
- file_format (str, optional) — Format of the data files in which the dataset will be written. Supported formats: “arrow”, “parquet”. Defaults to “arrow” format. If the format is “parquet”, then image and audio data are embedded into the Parquet files instead of pointing to local files. Added in 2.5.0.
- max_shard_size (Union[str, int], optional) — Maximum number of bytes written per shard. Defaults to “500MB”. The size is based on uncompressed data size, so in practice your shard files may be smaller than max_shard_size thanks to Parquet compression, for example. Added in 2.5.0.
- num_proc (int, optional, defaults to None) — Number of processes when downloading and generating the dataset locally. Multiprocessing is disabled by default. Added in 2.7.0.
- storage_options (dict, optional) — Key/value pairs to be passed on to the caching file-system backend, if any. Added in 2.5.0.
- **download_and_prepare_kwargs (additional keyword arguments) — Keyword arguments.
Downloads and prepares dataset for reading.
Example:
Download and prepare the dataset as Arrow files that can be loaded as a Dataset using builder.as_dataset():
>>> from datasets import load_dataset_builder
>>> builder = load_dataset_builder("rotten_tomatoes")
>>> builder.download_and_prepare()
Download and prepare the dataset as sharded Parquet files locally:
>>> from datasets import load_dataset_builder
>>> builder = load_dataset_builder("rotten_tomatoes")
>>> builder.download_and_prepare("./output_dir", file_format="parquet")
Download and prepare the dataset as sharded Parquet files in a cloud storage:
>>> from datasets import load_dataset_builder
>>> storage_options = {"key": aws_access_key_id, "secret": aws_secret_access_key}
>>> builder = load_dataset_builder("rotten_tomatoes")
>>> builder.download_and_prepare("s3://my-bucket/my_rotten_tomatoes", storage_options=storage_options, file_format="parquet")
Returns an empty dict if no exported dataset infos exist.
Example:
>>> from datasets import load_dataset_builder
>>> ds_builder = load_dataset_builder('rotten_tomatoes')
>>> ds_builder.get_all_exported_dataset_infos()
{'default': DatasetInfo(description="Movie Review Dataset.
This is a dataset of containing 5,331 positive and 5,331 negative processed
sentences from Rotten Tomatoes movie reviews. This data was first used in Bo
Pang and Lillian Lee, ``Seeing stars: Exploiting class relationships for
sentiment categorization with respect to rating scales.'', Proceedings of the
ACL, 2005.
", citation='@InProceedings{Pang+Lee:05a,
  author =       {Bo Pang and Lillian Lee},
  title =        {Seeing stars: Exploiting class relationships for sentiment
                  categorization with respect to rating scales},
  booktitle =    {Proceedings of the ACL},
  year =         2005
}
', homepage='http://www.cs.cornell.edu/people/pabo/movie-review-data/', license='', features={'text': Value(dtype='string', id=None), 'label': ClassLabel(num_classes=2, names=['neg', 'pos'], id=None)}, post_processed=None, supervised_keys=SupervisedKeysData(input='', output=''), task_templates=[TextClassification(task='text-classification', text_column='text', label_column='label')], builder_name='rotten_tomatoes_movie_review', config_name='default', version=1.0.0, splits={'train': SplitInfo(name='train', num_bytes=1074810, num_examples=8530, dataset_name='rotten_tomatoes_movie_review'), 'validation': SplitInfo(name='validation', num_bytes=134679, num_examples=1066, dataset_name='rotten_tomatoes_movie_review'), 'test': SplitInfo(name='test', num_bytes=135972, num_examples=1066, dataset_name='rotten_tomatoes_movie_review')}, download_checksums={'https://storage.googleapis.com/seldon-datasets/sentence_polarity_v1/rt-polaritydata.tar.gz': {'num_bytes': 487770, 'checksum': 'a05befe52aafda71d458d188a1c54506a998b1308613ba76bbda2e5029409ce9'}}, download_size=487770, post_processing_size=None, dataset_size=1345461, size_in_bytes=1833231)}
Returns an empty DatasetInfo if no exported dataset info exists.
Example:
>>> from datasets import load_dataset_builder
>>> ds_builder = load_dataset_builder('rotten_tomatoes')
>>> ds_builder.get_exported_dataset_info()
DatasetInfo(description="Movie Review Dataset.
This is a dataset of containing 5,331 positive and 5,331 negative processed
sentences from Rotten Tomatoes movie reviews. This data was first used in Bo
Pang and Lillian Lee, ``Seeing stars: Exploiting class relationships for
sentiment categorization with respect to rating scales.'', Proceedings of the
ACL, 2005.
", citation='@InProceedings{Pang+Lee:05a,
  author =       {Bo Pang and Lillian Lee},
  title =        {Seeing stars: Exploiting class relationships for sentiment
                  categorization with respect to rating scales},
  booktitle =    {Proceedings of the ACL},
  year =         2005
}
', homepage='http://www.cs.cornell.edu/people/pabo/movie-review-data/', license='', features={'text': Value(dtype='string', id=None), 'label': ClassLabel(num_classes=2, names=['neg', 'pos'], id=None)}, post_processed=None, supervised_keys=SupervisedKeysData(input='', output=''), task_templates=[TextClassification(task='text-classification', text_column='text', label_column='label')], builder_name='rotten_tomatoes_movie_review', config_name='default', version=1.0.0, splits={'train': SplitInfo(name='train', num_bytes=1074810, num_examples=8530, dataset_name='rotten_tomatoes_movie_review'), 'validation': SplitInfo(name='validation', num_bytes=134679, num_examples=1066, dataset_name='rotten_tomatoes_movie_review'), 'test': SplitInfo(name='test', num_bytes=135972, num_examples=1066, dataset_name='rotten_tomatoes_movie_review')}, download_checksums={'https://storage.googleapis.com/seldon-datasets/sentence_polarity_v1/rt-polaritydata.tar.gz': {'num_bytes': 487770, 'checksum': 'a05befe52aafda71d458d188a1c54506a998b1308613ba76bbda2e5029409ce9'}}, download_size=487770, post_processing_size=None, dataset_size=1345461, size_in_bytes=1833231)
Return the path of the module of this class or subclass.
Base class for datasets with data generation based on dict generators.
GeneratorBasedBuilder is a convenience class that abstracts away much of the data writing and reading of DatasetBuilder. It expects subclasses to implement generators of feature dictionaries across the dataset splits (_split_generators). See the method docstrings for details.
class datasets.BeamBasedBuilder
< source >( *args beam_runner = None beam_options = None **kwargs )
Beam-based Builder.
class datasets.ArrowBasedBuilder
< source >( cache_dir: typing.Optional[str] = None config_name: typing.Optional[str] = None hash: typing.Optional[str] = None base_path: typing.Optional[str] = None info: typing.Optional[datasets.info.DatasetInfo] = None features: typing.Optional[datasets.features.features.Features] = None use_auth_token: typing.Union[str, bool, NoneType] = None repo_id: typing.Optional[str] = None data_files: typing.Union[str, list, dict, datasets.data_files.DataFilesDict, NoneType] = None data_dir: typing.Optional[str] = None name = 'deprecated' **config_kwargs )
Base class for datasets with data generation based on Arrow loading functions (CSV/JSON/Parquet).
class datasets.BuilderConfig
< source >( name: str = 'default' version: typing.Union[str, datasets.utils.version.Version, NoneType] = 0.0.0 data_dir: typing.Optional[str] = None data_files: typing.Optional[datasets.data_files.DataFilesDict] = None description: typing.Optional[str] = None )
Base class for DatasetBuilder data configuration.
DatasetBuilder subclasses with data configuration options should subclass BuilderConfig and add their own properties.
create_config_id
< source >( config_kwargs: dict custom_features: typing.Optional[datasets.features.features.Features] = None )
The config id is used to build the cache directory. By default it is equal to the config name. However the name of a config is not sufficient to have a unique identifier for the dataset being generated since it doesn’t take into account:
- the config kwargs that can be used to overwrite attributes
- the custom features used to write the dataset
- the data_files for json/text/csv/pandas datasets
Therefore the config id is just the config name with an optional suffix based on these.
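The idea of "name plus optional suffix" can be illustrated with a small standalone sketch. The function sketch_config_id and its hashing scheme are assumptions made for illustration only; they are not the library's implementation, and the real suffix format may differ.

```python
import hashlib
import json


def sketch_config_id(config_name, config_kwargs=None, data_files=None):
    """Return config_name, plus a short hash suffix when extra inputs are given.

    Illustrative only: the suffix scheme here is an assumption.
    """
    extras = {}
    if config_kwargs:
        extras["config_kwargs"] = config_kwargs
    if data_files:
        extras["data_files"] = data_files
    if not extras:
        # Nothing overrides the named config, so the name alone identifies it.
        return config_name
    # Serialize deterministically so equal inputs always give the same suffix.
    payload = json.dumps(extras, sort_keys=True).encode("utf-8")
    suffix = hashlib.sha256(payload).hexdigest()[:16]
    return f"{config_name}-{suffix}"


plain = sketch_config_id("default")
custom = sketch_config_id("default", config_kwargs={"sep": ";"})
print(plain)
print(custom)
```

Because the suffix depends on the extra inputs, two builds of the same named config with different kwargs or data files would end up in different cache directories.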
Download
class datasets.DownloadManager
< source >( dataset_name: typing.Optional[str] = None data_dir: typing.Optional[str] = None download_config: typing.Optional[datasets.download.download_config.DownloadConfig] = None base_path: typing.Optional[str] = None record_checksums = True )
download
< source >( url_or_urls ) → str or list or dict
Download given URL(s).
By default, only one process is used for download. Pass customized download_config.num_proc to change this behavior.
download_and_extract
< source >( url_or_urls ) → extracted_path(s)
Download and extract given url_or_urls.
download_custom
< source >( url_or_urls custom_download ) → downloaded_path(s)
Parameters
- url_or_urls (str or list or dict) — URL or list or dict of URLs to download and extract. Each URL is a str.
- custom_download (Callable[src_url, dst_path]) — The source URL and destination path. For example tf.io.gfile.copy, that lets you download from Google storage.
Returns
downloaded_path(s) (str) — The downloaded paths matching the given input url_or_urls.
Download given URL(s) by calling custom_download.
extract
< source >( path_or_paths num_proc = 'deprecated' ) → extracted_path(s)
Parameters
- path_or_paths (path or list or dict) — Path of file to extract. Each path is a str.
- num_proc (int) — Use multi-processing if num_proc > 1 and the length of path_or_paths is larger than num_proc. Deprecated in 2.6.2: pass DownloadConfig(num_proc=<num_proc>) to the initializer instead.
Returns
extracted_path(s) (str) — The extracted paths matching the given input path_or_paths.
Extract given path(s).
iter_archive
< source >( path_or_buf: typing.Union[str, _io.BufferedReader] ) → tuple[str, io.BufferedReader]
Iterate over files within an archive.
iter_files
< source >( paths: typing.Union[str, typing.List[str]] ) → str
Iterate over file paths.
ship_files_with_pipeline
< source >( downloaded_path_or_paths pipeline )
Ship the files using Beam FileSystems to the pipeline temp dir.
class datasets.StreamingDownloadManager
< source >( dataset_name: typing.Optional[str] = None data_dir: typing.Optional[str] = None download_config: typing.Optional[datasets.download.download_config.DownloadConfig] = None base_path: typing.Optional[str] = None )
Download manager that uses the “::” separator to navigate through (possibly remote) compressed archives.
Contrary to the regular DownloadManager, the download and extract methods do not actually download or extract data. Instead, they return the path or URL that can be opened using the xopen function, which extends the built-in open function to stream data from remote files.
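As a rough illustration of the “::” separator, the sketch below builds and splits a chained URL. It assumes the fsspec-style convention in which the inner file comes first and the archive that contains it comes after the separator; the helper names and the example.com URL are hypothetical.

```python
def chain_url(inner_path, archive_url, protocol="zip"):
    """Build a chained URL: the inner file first, then the archive holding it."""
    return f"{protocol}://{inner_path}::{archive_url}"


def split_chained_url(urlpath):
    """Split a chained URL into its hops, innermost file first."""
    return urlpath.split("::")


# Hypothetical example: a CSV file inside a remote ZIP archive.
chained = chain_url("train.csv", "https://example.com/data.zip")
hops = split_chained_url(chained)
print(chained)
print(hops)
```

Nothing is downloaded at this point; the chained string only describes where the data lives so that it can be opened lazily later.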
download
< source >( url_or_urls ) → url(s)
Normalize URL(s) of files to stream data from.
This is the lazy version of DownloadManager.download for streaming.
download_and_extract
< source >( url_or_urls ) → url(s)
Prepare given url_or_urls for streaming (add extraction protocol).
This is the lazy version of DownloadManager.download_and_extract for streaming.
extract
< source >( url_or_urls ) → url(s)
Add extraction protocol for given url(s) for streaming.
This is the lazy version of DownloadManager.extract for streaming.
iter_archive
< source >( urlpath_or_buf: typing.Union[str, _io.BufferedReader] ) → tuple[str, io.BufferedReader]
Iterate over files within an archive.
iter_files
< source >( urlpaths: typing.Union[str, typing.List[str]] ) → str
Iterate over files.
class datasets.DownloadConfig
< source >( cache_dir: typing.Union[str, pathlib.Path, NoneType] = None force_download: bool = False resume_download: bool = False local_files_only: bool = False proxies: typing.Optional[typing.Dict] = None user_agent: typing.Optional[str] = None extract_compressed_file: bool = False force_extract: bool = False delete_extracted: bool = False use_etag: bool = True num_proc: typing.Optional[int] = None max_retries: int = 1 use_auth_token: typing.Union[bool, str, NoneType] = None ignore_url_params: bool = False download_desc: typing.Optional[str] = None )
Parameters
- cache_dir (str or Path, optional) — Specify a cache directory to save the file to (overwrite the default cache dir).
- force_download (bool, defaults to False) — If True, re-download the file even if it’s already cached in the cache dir.
- resume_download (bool, defaults to False) — If True, resume the download if an incompletely received file is found.
- proxies (dict, optional) —
- user_agent (str, optional) — Optional string or dict that will be appended to the user-agent on remote requests.
- extract_compressed_file (bool, defaults to False) — If True and the path points to a zip or tar file, extract the compressed file in a folder alongside the archive.
- force_extract (bool, defaults to False) — If True when extract_compressed_file is True and the archive was already extracted, re-extract the archive and override the folder where it was extracted.
- delete_extracted (bool, defaults to False) — Whether to delete (or keep) the extracted files.
- use_etag (bool, defaults to True) — Whether to use the ETag HTTP response header to validate the cached files.
- num_proc (int, optional) — The number of processes to launch to download the files in parallel.
- max_retries (int, defaults to 1) — The number of times to retry an HTTP request if it fails.
- use_auth_token (str or bool, optional) — Optional string or boolean to use as Bearer token for remote files on the Datasets Hub. If True, or not specified, will get token from ~/.huggingface.
- ignore_url_params (bool, defaults to False) — Whether to strip all query parameters and fragments from the download URL before using it for caching the file.
- download_desc (str, optional) — A description to be displayed alongside the progress bar while downloading the files.
Configuration for our cached path manager.
class datasets.DownloadMode
< source >( value names = None module = None qualname = None type = None start = 1 )
Enum for how to treat pre-existing downloads and data.
The default mode is REUSE_DATASET_IF_EXISTS, which will reuse both raw downloads and the prepared dataset if they exist.
The generation modes:
| | Downloads | Dataset |
|---|---|---|
| REUSE_DATASET_IF_EXISTS (default) | Reuse | Reuse |
| REUSE_CACHE_IF_EXISTS | Reuse | Fresh |
| FORCE_REDOWNLOAD | Fresh | Fresh |
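The table of modes can be read as a small decision function. The sketch below uses a stand-in enum, SketchDownloadMode, and a hypothetical helper, reuse_plan, so that it runs without the library installed; it mirrors the table and is not the library's own code.

```python
from enum import Enum


class SketchDownloadMode(Enum):
    """Hypothetical stand-in mirroring the three documented modes."""

    REUSE_DATASET_IF_EXISTS = "reuse_dataset_if_exists"
    REUSE_CACHE_IF_EXISTS = "reuse_cache_if_exists"
    FORCE_REDOWNLOAD = "force_redownload"


def reuse_plan(mode):
    """Return (reuse_downloads, reuse_dataset) for a mode, per the table."""
    # Raw downloads are reused unless a full redownload is forced.
    reuse_downloads = mode is not SketchDownloadMode.FORCE_REDOWNLOAD
    # The prepared dataset is reused only in the default mode.
    reuse_dataset = mode is SketchDownloadMode.REUSE_DATASET_IF_EXISTS
    return reuse_downloads, reuse_dataset


for mode in SketchDownloadMode:
    print(mode.name, reuse_plan(mode))
```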
Splits
class datasets.SplitGenerator
< source >( name: str gen_kwargs: typing.Dict = <factory> )
Defines the split information for the generator.
This should be used as returned value of GeneratorBasedBuilder._split_generators.
See GeneratorBasedBuilder._split_generators for more info and example of usage.
Enum for dataset splits.
Datasets are typically split into different subsets to be used at various stages of training and evaluation.
- TRAIN: the training data.
- VALIDATION: the validation data. If present, this is typically used as evaluation data while iterating on a model (e.g. changing hyperparameters, model architecture, etc.).
- TEST: the testing data. This is the data to report metrics on. Typically you do not want to use this during model iteration as you may overfit to it.
- ALL: the union of all defined dataset splits.
All splits, including compositions, inherit from datasets.SplitBase.
See the guide on splits for more information.
Example:
>>> datasets.SplitGenerator(
...     name=datasets.Split.TRAIN,
...     gen_kwargs={"split_key": "train", "files": dl_manager.download_and_extract(url)},
... ),
... datasets.SplitGenerator(
...     name=datasets.Split.VALIDATION,
...     gen_kwargs={"split_key": "validation", "files": dl_manager.download_and_extract(url)},
... ),
... datasets.SplitGenerator(
...     name=datasets.Split.TEST,
...     gen_kwargs={"split_key": "test", "files": dl_manager.download_and_extract(url)},
... )
Descriptor corresponding to a named split (train, test, …).
Example:
Each descriptor can be composed with others using addition or slicing:
split = datasets.Split.TRAIN.subsplit(datasets.percent[0:25]) + datasets.Split.TEST
The resulting split will correspond to 25% of the train split merged with 100% of the test split.
A split cannot be added twice, so the following will fail:
split = (
    datasets.Split.TRAIN.subsplit(datasets.percent[:25]) +
    datasets.Split.TRAIN.subsplit(datasets.percent[75:])
)  # Error
split = datasets.Split.TEST + datasets.Split.ALL  # Error
Split corresponding to the union of all defined dataset splits.
class datasets.ReadInstruction
< source >( split_name rounding = None from_ = None to = None unit = None )
Reading instruction for a dataset.
Examples:
# The following lines are equivalent:
ds = datasets.load_dataset('mnist', split='test[:33%]')
ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction.from_spec('test[:33%]'))
ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction('test', to=33, unit='%'))
ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction(
    'test', from_=0, to=33, unit='%'))
# The following lines are equivalent:
ds = datasets.load_dataset('mnist', split='test[:33%]+train[1:-1]')
ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction.from_spec(
    'test[:33%]+train[1:-1]'))
ds = datasets.load_dataset('mnist', split=(
    datasets.ReadInstruction('test', to=33, unit='%') +
    datasets.ReadInstruction('train', from_=1, to=-1, unit='abs')))
# The following lines are equivalent:
ds = datasets.load_dataset('mnist', split='test[:33%](pct1_dropremainder)')
ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction.from_spec(
    'test[:33%](pct1_dropremainder)'))
ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction(
    'test', from_=0, to=33, unit='%', rounding="pct1_dropremainder"))
# 10-fold validation (the list of instructions is passed as `split`):
tests = datasets.load_dataset(
    'mnist',
    split=[datasets.ReadInstruction('train', from_=k, to=k+10, unit='%')
           for k in range(0, 100, 10)])
trains = datasets.load_dataset(
    'mnist',
    split=[datasets.ReadInstruction('train', to=k, unit='%') + datasets.ReadInstruction('train', from_=k+10, unit='%')
           for k in range(0, 100, 10)])
from_spec
< source >( spec )
Creates a ReadInstruction instance out of a string spec.
Examples:
test: test split.
test + validation: test split + validation split.
test[10:]: test split, minus its first 10 records.
test[:10%]: first 10% records of test split.
test[:20%](pct1_dropremainder): first 20% records, rounded with the pct1_dropremainder rounding.
test[:-5%]+train[40%:60%]: first 95% of test + middle 20% of train.
to_absolute
< source >( name2len )
Translate instruction into a list of absolute instructions.
Those absolute instructions are then to be added together.
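To make the translation from a relative instruction to absolute boundaries concrete, here is a standalone sketch. The helpers pct_to_abs and to_absolute_range are hypothetical, and the two rounding rules are assumptions modeled on the rounding names used on this page ("closest" rounds to the nearest record, "pct1_dropremainder" uses whole 1% buckets). The split sizes in name2len are taken from the rotten_tomatoes example output.

```python
import math


def pct_to_abs(boundary, num_examples, rounding="closest"):
    """Convert a percent boundary into an absolute record index."""
    if rounding == "closest":
        return int(round(boundary * num_examples / 100.0))
    if rounding == "pct1_dropremainder":
        # Every 1% bucket has the same size; leftover records are dropped.
        return boundary * math.trunc(num_examples / 100.0)
    raise ValueError(f"Unknown rounding: {rounding!r}")


def to_absolute_range(num_examples, from_=None, to=None, unit="abs",
                      rounding="closest"):
    """Return (start, stop) record indices for one split instruction."""
    if unit == "%":
        from_ = None if from_ is None else pct_to_abs(from_, num_examples, rounding)
        to = None if to is None else pct_to_abs(to, num_examples, rounding)
    start = 0 if from_ is None else from_
    stop = num_examples if to is None else to
    # Negative boundaries count from the end, as in Python slicing.
    if start < 0:
        start = max(num_examples + start, 0)
    if stop < 0:
        stop = max(num_examples + stop, 0)
    # Clamp to the size of the split.
    return min(start, num_examples), min(stop, num_examples)


name2len = {"train": 8530, "test": 1066}
print(to_absolute_range(name2len["test"], to=33, unit="%"))
print(to_absolute_range(name2len["test"], to=33, unit="%",
                        rounding="pct1_dropremainder"))
print(to_absolute_range(name2len["train"], from_=40, to=60, unit="%"))
```

An instruction such as test[:33%]+train[40%:60%] would then correspond to two such absolute ranges, one per split, which are added together.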
Version
class datasets.Version
< source >( version_str: str description: typing.Optional[str] = None major: typing.Union[str, int, NoneType] = None minor: typing.Union[str, int, NoneType] = None patch: typing.Union[str, int, NoneType] = None )
Dataset version MAJOR.MINOR.PATCH.
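The MAJOR.MINOR.PATCH scheme compares numerically, component by component, not as plain strings. The following standalone sketch illustrates this with a hypothetical SketchVersion class; it is not the datasets.Version implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class SketchVersion:
    """Hypothetical MAJOR.MINOR.PATCH version, ordered field by field."""

    major: int
    minor: int
    patch: int

    @classmethod
    def parse(cls, version_str):
        """Parse a 'MAJOR.MINOR.PATCH' string into a SketchVersion."""
        parts = version_str.split(".")
        if len(parts) != 3:
            raise ValueError(f"Expected MAJOR.MINOR.PATCH, got {version_str!r}")
        return cls(*(int(part) for part in parts))

    def __str__(self):
        return f"{self.major}.{self.minor}.{self.patch}"


older = SketchVersion.parse("1.2.0")
newer = SketchVersion.parse("1.10.0")
print(older < newer)
print(str(newer))
```

Comparing the raw strings "1.2.0" and "1.10.0" would order them the other way round, which is why the components are parsed as integers first.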