Builder classes
🤗 Datasets relies on two main classes during the dataset building process: datasets.DatasetBuilder and datasets.BuilderConfig.
class datasets.DatasetBuilder
< source >( cache_dir: typing.Optional[str] = None name: typing.Optional[str] = None hash: typing.Optional[str] = None base_path: typing.Optional[str] = None info: typing.Optional[datasets.info.DatasetInfo] = None features: typing.Optional[datasets.features.features.Features] = None use_auth_token: typing.Union[str, bool, NoneType] = None namespace: typing.Optional[str] = None data_files: typing.Union[str, list, dict, datasets.data_files.DataFilesDict, NoneType] = None data_dir: typing.Optional[str] = None **config_kwargs )
Abstract base class for all datasets.
DatasetBuilder has 3 key methods:
- datasets.DatasetBuilder.info: Documents the dataset, including feature names, types, shapes, version, splits, citation, etc.
- datasets.DatasetBuilder.download_and_prepare(): Downloads the source data and writes it to disk.
- datasets.DatasetBuilder.as_dataset(): Generates a Dataset.
Configuration: Some DatasetBuilders expose multiple variants of the dataset by defining a datasets.BuilderConfig subclass and accepting a config object (or name) on construction. Configurable datasets expose a pre-defined set of configurations in datasets.DatasetBuilder.builder_configs().
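As a rough sketch of how this looks in practice (the dataset name "squad" is only an illustrative placeholder), a builder can be obtained with datasets.load_dataset_builder and inspected before any data is generated:

```python
from datasets import load_dataset_builder

# "squad" is a placeholder for any dataset that ships predefined configurations.
builder = load_dataset_builder("squad")

print(builder.builder_configs)  # mapping of config name -> BuilderConfig
print(builder.info.features)    # feature names and types
print(builder.info.splits)      # may be empty until download_and_prepare() has run
```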
as_dataset
< source >( split: typing.Optional[datasets.splits.Split] = None run_post_process = True ignore_verifications = False in_memory = False )
Parameters
- split (datasets.Split) — Which subset of the data to return.
- run_post_process (bool, default=True) — Whether to run post-processing dataset transforms and/or add indexes.
- ignore_verifications (bool, default=False) — Whether to ignore the verifications of the downloaded/processed dataset information (checksums/size/splits/…).
- in_memory (bool, default=False) — Whether to copy the data in-memory.
Return a Dataset for the specified split.
download_and_prepare
< source >( download_config: typing.Optional[datasets.utils.file_utils.DownloadConfig] = None download_mode: typing.Optional[datasets.utils.download_manager.DownloadMode] = None ignore_verifications: bool = False try_from_hf_gcs: bool = True dl_manager: typing.Optional[datasets.utils.download_manager.DownloadManager] = None base_path: typing.Optional[str] = None use_auth_token: typing.Union[str, bool, NoneType] = None **download_and_prepare_kwargs )
Parameters
- download_config (DownloadConfig, optional) — Specific download configuration parameters.
- download_mode (DownloadMode, optional) — Select the download/generate mode. Defaults to REUSE_DATASET_IF_EXISTS.
- ignore_verifications (bool) — Ignore the verifications of the downloaded/processed dataset information (checksums/size/splits/…).
- save_infos (bool) — Save the dataset information (checksums/size/splits/…).
- try_from_hf_gcs (bool) — If True, try to download the already prepared dataset from the HF Google Cloud Storage.
- dl_manager (DownloadManager, optional) — Specific DownloadManager to use.
- base_path (str, optional) — Base path for relative paths that are used to download files. This can be a remote URL. If not specified, the value of the base_path attribute (self.base_path) will be used instead.
- use_auth_token (Union[str, bool], optional) — Optional string or boolean to use as Bearer token for remote files on the Datasets Hub. If True, will get the token from ~/.huggingface.
Downloads and prepares dataset for reading.
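A hedged sketch of the typical call sequence, continuing with the placeholder dataset name from above:

```python
from datasets import DownloadMode, load_dataset_builder

builder = load_dataset_builder("squad")  # placeholder dataset name

# Download the source data and write the prepared Arrow files to the cache.
builder.download_and_prepare(download_mode=DownloadMode.REUSE_DATASET_IF_EXISTS)

# Generate Dataset objects from the prepared files.
ds_train = builder.as_dataset(split="train")
ds_dict = builder.as_dataset()  # all splits as a DatasetDict when split is None
```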
get_all_exported_dataset_infos
Empty dict if doesn't exist.
get_exported_dataset_info
Empty DatasetInfo if doesn't exist.
get_imported_module_dir
Return the path of the module of this class or subclass.
class datasets.GeneratorBasedBuilder
Base class for datasets with data generation based on dict generators.
GeneratorBasedBuilder is a convenience class that abstracts away much of the data writing and reading of DatasetBuilder. It expects subclasses to implement generators of feature dictionaries across the dataset splits (_split_generators). See the method docstrings for details.
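A minimal sketch of a GeneratorBasedBuilder subclass; the dataset, URLs, and field names are made up for illustration:

```python
import json

import datasets


class MyDataset(datasets.GeneratorBasedBuilder):
    """Hypothetical dataset reading one JSON-lines file per split."""

    # Made-up mapping of split name to download URL.
    _URLS = {
        "train": "https://example.com/train.jsonl",
        "test": "https://example.com/test.jsonl",
    }

    def _info(self):
        return datasets.DatasetInfo(
            description="A toy dataset.",
            features=datasets.Features(
                {
                    "text": datasets.Value("string"),
                    "label": datasets.ClassLabel(names=["neg", "pos"]),
                }
            ),
        )

    def _split_generators(self, dl_manager):
        # download() mirrors the input structure: split name -> local path.
        paths = dl_manager.download(self._URLS)
        return [
            datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"filepath": paths["train"]}),
            datasets.SplitGenerator(name=datasets.Split.TEST, gen_kwargs={"filepath": paths["test"]}),
        ]

    def _generate_examples(self, filepath):
        # Yield (key, example) pairs; keys must be unique within a split.
        with open(filepath, encoding="utf-8") as f:
            for idx, line in enumerate(f):
                record = json.loads(line)
                yield idx, {"text": record["text"], "label": record["label"]}
```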
class datasets.BeamBasedBuilder
Beam-based Builder.
class datasets.ArrowBasedBuilder
< source >( cache_dir: typing.Optional[str] = None name: typing.Optional[str] = None hash: typing.Optional[str] = None base_path: typing.Optional[str] = None info: typing.Optional[datasets.info.DatasetInfo] = None features: typing.Optional[datasets.features.features.Features] = None use_auth_token: typing.Union[str, bool, NoneType] = None namespace: typing.Optional[str] = None data_files: typing.Union[str, list, dict, datasets.data_files.DataFilesDict, NoneType] = None data_dir: typing.Optional[str] = None **config_kwargs )
Base class for datasets with data generation based on Arrow loading functions (CSV/JSON/Parquet).
class datasets.BuilderConfig
< source >( name: str = 'default' version: typing.Union[str, datasets.utils.version.Version, NoneType] = '0.0.0' data_dir: typing.Optional[str] = None data_files: typing.Optional[datasets.data_files.DataFilesDict] = None description: typing.Optional[str] = None )
Parameters
- name (str, default "default") — The name of the configuration.
- version (Version or str, optional) — The version of the configuration.
- data_dir (str, optional) — Path to the directory containing the source data.
- data_files (str or Sequence or Mapping, optional) — Path(s) to source data file(s).
- description (str, optional) — A human description of the configuration.
Base class for DatasetBuilder data configuration.
DatasetBuilder subclasses with data configuration options should subclass BuilderConfig and add their own properties.
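For illustration, a builder with a configurable option might define its own config class along these lines (the class and attribute names are hypothetical):

```python
import datasets


class MyDatasetConfig(datasets.BuilderConfig):
    """Hypothetical BuilderConfig adding a `language` option."""

    def __init__(self, language: str = "en", **kwargs):
        super().__init__(**kwargs)
        self.language = language


class MyDataset(datasets.GeneratorBasedBuilder):
    BUILDER_CONFIG_CLASS = MyDatasetConfig
    BUILDER_CONFIGS = [
        MyDatasetConfig(name="en", language="en", version=datasets.Version("1.0.0")),
        MyDatasetConfig(name="fr", language="fr", version=datasets.Version("1.0.0")),
    ]

    # The selected option is then available as self.config.language inside
    # _info(), _split_generators() and _generate_examples() (not shown here).
```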
create_config_id
< source >( config_kwargs: dict custom_features: typing.Optional[datasets.features.features.Features] = None )
The config id is used to build the cache directory. By default it is equal to the config name. However, the name of a config is not sufficient to have a unique identifier for the dataset being generated since it doesn't take into account:
- the config kwargs that can be used to overwrite attributes
- the custom features used to write the dataset
- the data_files for json/text/csv/pandas datasets
Therefore the config id is just the config name with an optional suffix based on these.
class datasets.DownloadManager
< source >( dataset_name: typing.Optional[str] = None data_dir: typing.Optional[str] = None download_config: typing.Optional[datasets.utils.file_utils.DownloadConfig] = None base_path: typing.Optional[str] = None record_checksums: bool = True )
download
< source >( url_or_urls ) → downloaded_path(s)
Returns
downloaded_path(s) (str) — The downloaded paths matching the given input url_or_urls.
Download given url(s).
download_and_extract
< source >( url_or_urls ) → extracted_path(s)
Returns
extracted_path(s) (str) — Extracted paths of given URL(s).
Download and extract given url_or_urls.
Is roughly equivalent to:
extracted_paths = dl_manager.extract(dl_manager.download(url_or_urls))
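The input structure (a single URL, a list, or a dict) is mirrored in the returned paths, which is convenient inside _split_generators. A small sketch with placeholder URLs:

```python
# Inside _split_generators(self, dl_manager); the URLs are placeholders.
archive_paths = dl_manager.download_and_extract(
    {
        "train": "https://example.com/train.zip",
        "test": "https://example.com/test.zip",
    }
)
# archive_paths is a dict with the same keys, mapping to local extracted directories.
train_dir = archive_paths["train"]
```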
download_custom
< source >( url_or_urls custom_download ) → downloaded_path(s)
Returns
downloaded_path(s) (str) — The downloaded paths matching the given input url_or_urls.
Download given url(s) by calling custom_download.
extract
< source >( path_or_paths num_proc = None ) → extracted_path(s)
Returns
extracted_path(s) (str) — The extracted paths matching the given input path_or_paths.
Extract given path(s).
iter_archive
< source >( path_or_buf: typing.Union[str, _io.BufferedReader] )
Iterate over files within an archive.
iter_files
< source >( paths: typing.Union[str, typing.List[str]] )
Iterate over file paths.
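A sketch of how these iterators are typically used in _generate_examples; the URLs are placeholders. iter_archive streams (relative path, file object) pairs without extracting the archive, while iter_files walks the given file or directory paths:

```python
# Stream the members of a downloaded tar archive without extracting it.
archive_path = dl_manager.download("https://example.com/images.tar.gz")  # placeholder URL
for name, fileobj in dl_manager.iter_archive(archive_path):
    data = fileobj.read()  # bytes of the archive member `name`

# Iterate over the files found under one or more local paths.
data_dir = dl_manager.download_and_extract("https://example.com/data.zip")  # placeholder URL
for filepath in dl_manager.iter_files(data_dir):
    ...
```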
ship_files_with_pipeline
Ship the files using Beam FileSystems to the pipeline temp dir.
class datasets.DownloadMode
< source >( value names = None module = None qualname = None type = None start = 1 )
Enum for how to treat pre-existing downloads and data.
The default mode is REUSE_DATASET_IF_EXISTS, which will reuse both raw downloads and the prepared dataset if they exist.
The generation modes:

| Mode | Downloads | Dataset |
|---|---|---|
| REUSE_DATASET_IF_EXISTS (default) | Reuse | Reuse |
| REUSE_CACHE_IF_EXISTS | Reuse | Fresh |
| FORCE_REDOWNLOAD | Fresh | Fresh |
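The mode is usually passed through load_dataset() or download_and_prepare(); a brief sketch with a placeholder dataset name:

```python
from datasets import DownloadMode, load_dataset

# Ignore any cached downloads and prepared files and start from scratch.
ds = load_dataset("squad", download_mode=DownloadMode.FORCE_REDOWNLOAD)
```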
class datasets.SplitGenerator
< source >( name: str gen_kwargs: typing.Dict = <factory> )
Defines the split information for the generator.
This should be used as the return value of GeneratorBasedBuilder._split_generators(). See GeneratorBasedBuilder._split_generators() for more info and an example of usage.
class datasets.Split
Enum for dataset splits.
Datasets are typically split into different subsets to be used at various stages of training and evaluation.
- TRAIN: the training data.
- VALIDATION: the validation data. If present, this is typically used as evaluation data while iterating on a model (e.g. changing hyperparameters, model architecture, etc.).
- TEST: the testing data. This is the data to report metrics on. Typically you do not want to use this during model iteration as you may overfit to it.
- ALL: the union of all defined dataset splits.
Note: All splits, including compositions, inherit from datasets.SplitBase.
See the guide on splits for more information.
class datasets.NamedSplit
Descriptor corresponding to a named split (train, test, …).
Example:
Each descriptor can be composed with others using addition or slicing. For example:
split = datasets.Split.TRAIN.subsplit(datasets.percent[0:25]) + datasets.Split.TEST
The resulting split will correspond to 25% of the train split merged with 100% of the test split.
Warning: A split cannot be added twice, so the following will fail:
split = (
datasets.Split.TRAIN.subsplit(datasets.percent[:25]) +
datasets.Split.TRAIN.subsplit(datasets.percent[75:])
) # Error
split = datasets.Split.TEST + datasets.Split.ALL # Error
Warning: The slices can be applied only once, so the following are valid:
split = (
datasets.Split.TRAIN.subsplit(datasets.percent[:25]) +
datasets.Split.TEST.subsplit(datasets.percent[:50])
)
split = (datasets.Split.TRAIN + datasets.Split.TEST).subsplit(datasets.percent[:50])
But not:
train = datasets.Split.TRAIN
test = datasets.Split.TEST
split = train.subsplit(datasets.percent[:25]).subsplit(datasets.percent[:25])
split = (train.subsplit(datasets.percent[:25]) + test).subsplit(datasets.percent[:50])
Split corresponding to the union of all defined dataset splits.
class datasets.ReadInstruction
< source >( split_name rounding = None from_ = None to = None unit = None )
Reading instruction for a dataset.
Examples:
# The following lines are equivalent:
ds = datasets.load_dataset('mnist', split='test[:33%]')
ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction.from_spec('test[:33%]'))
ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction('test', to=33, unit='%'))
ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction(
'test', from_=0, to=33, unit='%'))
# The following lines are equivalent:
ds = datasets.load_dataset('mnist', split='test[:33%]+train[1:-1]')
ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction.from_spec(
'test[:33%]+train[1:-1]'))
ds = datasets.load_dataset('mnist', split=(
datasets.ReadInstruction('test', to=33, unit='%') +
datasets.ReadInstruction('train', from_=1, to=-1, unit='abs')))
# The following lines are equivalent:
ds = datasets.load_dataset('mnist', split='test[:33%](pct1_dropremainder)')
ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction.from_spec(
'test[:33%](pct1_dropremainder)'))
ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction(
'test', from_=0, to=33, unit='%', rounding="pct1_dropremainder"))
# 10-fold validation:
tests = datasets.load_dataset(
'mnist',
[datasets.ReadInstruction('train', from_=k, to=k+10, unit='%')
for k in range(0, 100, 10)])
trains = datasets.load_dataset(
'mnist',
[datasets.ReadInstruction('train', to=k, unit='%') + datasets.ReadInstruction('train', from_=k+10, unit='%')
for k in range(0, 100, 10)])
from_spec
< source >( spec )
Parameters
- spec (str) — Split(s) + optional slice(s) to read + optional rounding if percents are used as the slicing unit. A slice can be specified using absolute numbers (int) or percentages (int). Examples:
  - test: test split.
  - test + validation: test split + validation split.
  - test[10:]: test split, minus its first 10 records.
  - test[:10%]: first 10% records of test split.
  - test[:20%](pct1_dropremainder): first 20% records, rounded with the pct1_dropremainder rounding.
  - test[:-5%]+train[40%:60%]: first 95% of test + middle 20% of train.
Creates a ReadInstruction instance out of a string spec.
to_absolute
Translate instruction into a list of absolute instructions. Those absolute instructions are then to be added together.
class datasets.DownloadConfig
< source >( cache_dir: typing.Union[pathlib.Path, str, NoneType] = None force_download: bool = False resume_download: bool = False local_files_only: bool = False proxies: typing.Optional[typing.Dict] = None user_agent: typing.Optional[str] = None extract_compressed_file: bool = False force_extract: bool = False delete_extracted: bool = False use_etag: bool = True num_proc: typing.Optional[int] = None max_retries: int = 1 use_auth_token: typing.Union[bool, str, NoneType] = None ignore_url_params: bool = False download_desc: typing.Optional[str] = None )
Parameters
- cache_dir (str or Path, optional) — Specify a cache directory to save the file to (overwrite the default cache dir).
- force_download (bool, default False) — If True, re-download the file even if it's already cached in the cache dir.
- resume_download (bool, default False) — If True, resume the download if an incompletely received file is found.
- proxies (dict, optional) —
- user_agent (str, optional) — Optional string or dict that will be appended to the user-agent on remote requests.
- extract_compressed_file (bool, default False) — If True and the path points to a zip or tar file, extract the compressed file in a folder along the archive.
- force_extract (bool, default False) — If True, when extract_compressed_file is True and the archive was already extracted, re-extract the archive and override the folder where it was extracted.
- delete_extracted (bool, default False) — Whether to delete (or keep) the extracted files.
- use_etag (bool, default True) — Whether to use the ETag HTTP response header to validate the cached files.
- num_proc (int, optional) — The number of processes to launch to download the files in parallel.
- max_retries (int, default 1) — The number of times to retry an HTTP request if it fails.
- use_auth_token (str or bool, optional) — Optional string or boolean to use as Bearer token for remote files on the Datasets Hub. If True, will get the token from ~/.huggingface.
- ignore_url_params (bool, default False) — Whether to strip all query parameters and #fragments from the download URL before using it for caching the file.
- download_desc (str, optional) — A description to be displayed alongside the progress bar while downloading the files.
Configuration for our cached path manager.
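A hedged sketch of passing a DownloadConfig to load_dataset(); the dataset name and proxy address are placeholders:

```python
from datasets import DownloadConfig, load_dataset

download_config = DownloadConfig(
    max_retries=3,                                   # retry flaky HTTP requests
    num_proc=4,                                      # download files in parallel
    proxies={"https": "http://proxy.example:3128"},  # placeholder proxy
)
ds = load_dataset("squad", download_config=download_config)
```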
class datasets.Version
< source >( version_str: str description: typing.Optional[str] = None major: typing.Union[str, int, NoneType] = None minor: typing.Union[str, int, NoneType] = None patch: typing.Union[str, int, NoneType] = None )
Dataset version MAJOR.MINOR.PATCH.
match
Returns True if other_version matches.