Classes used during the dataset building process
Two main classes are mostly used during the dataset building process.
- class datasets.DatasetBuilder(cache_dir: Optional[str] = None, name: Optional[str] = None, hash: Optional[str] = None, base_path: Optional[str] = None, features: Optional[datasets.features.Features] = None, use_auth_token: Optional[Union[bool, str]] = None, **config_kwargs)
Abstract base class for all datasets.
DatasetBuilder has 3 key methods:
datasets.DatasetBuilder.info(): Documents the dataset, including feature names, types, and shapes, version, splits, citation, etc.
datasets.DatasetBuilder.download_and_prepare(): Downloads the source data and writes it to disk.
datasets.DatasetBuilder.as_dataset(): Generates a Dataset.
Configuration: Some DatasetBuilders expose multiple variants of the dataset by defining a datasets.BuilderConfig subclass and accepting a config object (or name) on construction. Configurable datasets expose a pre-defined set of configurations in datasets.DatasetBuilder.builder_configs().
- class datasets.GeneratorBasedBuilder(*args, writer_batch_size=None, **kwargs)
Base class for datasets with data generation based on dict generators.
GeneratorBasedBuilder is a convenience class that abstracts away much of the data writing and reading of DatasetBuilder. It expects subclasses to implement generators of feature dictionaries across the dataset splits (_split_generators). See the method docstrings for details.
- class datasets.ArrowBasedBuilder(cache_dir: Optional[str] = None, name: Optional[str] = None, hash: Optional[str] = None, base_path: Optional[str] = None, features: Optional[datasets.features.Features] = None, use_auth_token: Optional[Union[bool, str]] = None, **config_kwargs)
Base class for datasets with data generation based on Arrow loading functions (CSV/JSON/Parquet).
- class datasets.BuilderConfig(name: str = 'default', version: Optional[Union[str, datasets.utils.version.Version]] = '0.0.0', data_dir: Optional[str] = None, data_files: Optional[Union[str, Dict, List, Tuple]] = None, description: Optional[str] = None)
Base class for DatasetBuilder data configuration.
DatasetBuilder subclasses with data configuration options should subclass BuilderConfig and add their own properties.
Variables:
name (str, default "default")
version (Version or str, optional)
data_dir (str, optional)
data_files (str or dict or list or tuple, optional)
description (str, optional)
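The subclassing pattern described for BuilderConfig can be sketched without importing the library. In the sketch below, ToyBuilderConfig is a hypothetical stand-in that copies only the documented field names and defaults, and LanguageConfig with its language field is an invented example of a builder-specific property.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple, Union


@dataclass
class ToyBuilderConfig:
    """Hypothetical stand-in mirroring the documented BuilderConfig fields."""
    name: str = "default"
    version: Optional[str] = "0.0.0"
    data_dir: Optional[str] = None
    data_files: Optional[Union[str, Dict, List, Tuple]] = None
    description: Optional[str] = None


@dataclass
class LanguageConfig(ToyBuilderConfig):
    """Invented subclass that adds one builder-specific property."""
    language: str = "en"


# A configurable builder would expose a pre-defined set of such configurations.
CONFIGS = [
    LanguageConfig(name="english", language="en", description="English subset"),
    LanguageConfig(name="french", language="fr", description="French subset"),
]

# Looking a configuration up by name, as a builder might do on construction.
configs_by_name = {config.name: config for config in CONFIGS}
```

With the real library, the same idea would presumably apply to a subclass of datasets.BuilderConfig rather than the stand-in used here.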
- class datasets.DownloadManager(dataset_name: Optional[str] = None, data_dir: Optional[str] = None, download_config: Optional[datasets.utils.file_utils.DownloadConfig] = None, base_path: Optional[str] = None)
- class datasets.GenerateMode(value)
Enum for how to treat pre-existing downloads and data.
The default mode is REUSE_DATASET_IF_EXISTS, which will reuse both raw downloads and the prepared dataset if they exist.
The generation modes:
Mode                               Downloads  Dataset
REUSE_DATASET_IF_EXISTS (default)  Reuse      Reuse
REUSE_CACHE_IF_EXISTS              Reuse      Fresh
FORCE_REDOWNLOAD                   Fresh      Fresh
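The reuse rules listed above can be restated as a small lookup. This is a library-independent sketch: ToyGenerateMode and reuse_plan are hypothetical names, and the string values given to the enum members are placeholders, not values confirmed by this page.

```python
from enum import Enum
from typing import Tuple


class ToyGenerateMode(Enum):
    """Hypothetical stand-in for the three documented modes."""
    REUSE_DATASET_IF_EXISTS = "reuse_dataset_if_exists"  # placeholder value
    REUSE_CACHE_IF_EXISTS = "reuse_cache_if_exists"      # placeholder value
    FORCE_REDOWNLOAD = "force_redownload"                # placeholder value


def reuse_plan(mode: ToyGenerateMode) -> Tuple[bool, bool]:
    """Return (reuse_downloads, reuse_dataset) for a mode, following the listed rules."""
    # Raw downloads are reused in every mode except FORCE_REDOWNLOAD.
    reuse_downloads = mode is not ToyGenerateMode.FORCE_REDOWNLOAD
    # The prepared dataset is reused only in the default mode.
    reuse_dataset = mode is ToyGenerateMode.REUSE_DATASET_IF_EXISTS
    return reuse_downloads, reuse_dataset


for mode in ToyGenerateMode:
    downloads, dataset = reuse_plan(mode)
    print(
        f"{mode.name}: downloads={'Reuse' if downloads else 'Fresh'}, "
        f"dataset={'Reuse' if dataset else 'Fresh'}"
    )
```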
- class datasets.SplitGenerator(name: str, gen_kwargs: Dict = <factory>)
Defines the split information for the generator.
This should be used as the return value of GeneratorBasedBuilder._split_generators(). See GeneratorBasedBuilder._split_generators() for more information and an example of usage.
Parameters:
name (str) – Name of the Split for which the generator will create the examples.
**gen_kwargs – Keyword arguments to forward to the DatasetBuilder._generate_examples() method of the builder.
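The way gen_kwargs travel from a split generator to example generation can be illustrated with a toy model of the contract. This sketch does not import the library: ToySplitGenerator and ToyBuilder are hypothetical stand-ins, the method signatures are simplified, and the build helper is invented purely to show the forwarding step.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List, Tuple


@dataclass
class ToySplitGenerator:
    """Hypothetical stand-in: a split name plus kwargs for example generation."""
    name: str
    gen_kwargs: Dict[str, Any] = field(default_factory=dict)


class ToyBuilder:
    """Hypothetical builder following the generator-based contract."""

    def _split_generators(self) -> List[ToySplitGenerator]:
        # One entry per split; gen_kwargs carries whatever that split needs.
        return [
            ToySplitGenerator(name="train", gen_kwargs={"lines": ["a", "b", "c"]}),
            ToySplitGenerator(name="test", gen_kwargs={"lines": ["d"]}),
        ]

    def _generate_examples(self, lines: List[str]) -> Iterator[Tuple[int, Dict[str, str]]]:
        # Yield (key, feature dictionary) pairs for one split.
        for idx, text in enumerate(lines):
            yield idx, {"text": text}

    def build(self) -> Dict[str, List[Dict[str, str]]]:
        # Forward each split's gen_kwargs to _generate_examples as keyword arguments.
        return {
            split.name: [example for _, example in self._generate_examples(**split.gen_kwargs)]
            for split in self._split_generators()
        }


splits = ToyBuilder().build()
print(splits)
```

The point of the design is that the split definition stays declarative: each split only names itself and the arguments its generator needs, and the builder decides when to call the generator.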
- class datasets.Split(name)
Enum for dataset splits.
Datasets are typically split into different subsets to be used at various stages of training and evaluation.
TRAIN: the training data.
VALIDATION: the validation data. If present, this is typically used as evaluation data while iterating on a model (e.g. changing hyperparameters, model architecture, etc.).
TEST: the testing data. This is the data to report metrics on. Typically you do not want to use this during model iteration as you may overfit to it.
ALL: the union of all defined dataset splits.
Note: All splits, including compositions, inherit from datasets.SplitBase.
See the guide on splits for more information.
- class datasets.NamedSplit(name)
Descriptor corresponding to a named split (train, test, …).
Examples
Each descriptor can be composed with others using addition or slicing. For example:
    split = datasets.Split.TRAIN.subsplit(datasets.percent[0:25]) + datasets.Split.TEST
The resulting split will correspond to 25% of the train split merged with 100% of the test split.
Warning
A split cannot be added twice, so the following will fail:
    split = (
        datasets.Split.TRAIN.subsplit(datasets.percent[:25])
        + datasets.Split.TRAIN.subsplit(datasets.percent[75:])
    )  # Error
    split = datasets.Split.TEST + datasets.Split.ALL  # Error
Warning
Slices can be applied only once, so the following are valid:
    split = (
        datasets.Split.TRAIN.subsplit(datasets.percent[:25])
        + datasets.Split.TEST.subsplit(datasets.percent[:50])
    )
    split = (datasets.Split.TRAIN + datasets.Split.TEST).subsplit(datasets.percent[:50])
But the following are not:
    train = datasets.Split.TRAIN
    test = datasets.Split.TEST
    split = train.subsplit(datasets.percent[:25]).subsplit(datasets.percent[:25])
    split = (train.subsplit(datasets.percent[:25]) + test).subsplit(datasets.percent[:50])
- class datasets.NamedSplitAll
Split corresponding to the union of all defined dataset splits.
- class datasets.ReadInstruction(split_name, rounding=None, from_=None, to=None, unit=None)
Reading instruction for a dataset.
Examples of usage:
    # The following lines are equivalent:
    ds = datasets.load_dataset('mnist', split='test[:33%]')
    ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction.from_spec('test[:33%]'))
    ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction('test', to=33, unit='%'))
    ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction(
        'test', from_=0, to=33, unit='%'))

    # The following lines are equivalent:
    ds = datasets.load_dataset('mnist', split='test[:33%]+train[1:-1]')
    ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction.from_spec(
        'test[:33%]+train[1:-1]'))
    ds = datasets.load_dataset('mnist', split=(
        datasets.ReadInstruction('test', to=33, unit='%') +
        datasets.ReadInstruction('train', from_=1, to=-1, unit='abs')))

    # The following lines are equivalent:
    ds = datasets.load_dataset('mnist', split='test[:33%](pct1_dropremainder)')
    ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction.from_spec(
        'test[:33%](pct1_dropremainder)'))
    ds = datasets.load_dataset('mnist', split=datasets.ReadInstruction(
        'test', from_=0, to=33, unit='%', rounding="pct1_dropremainder"))

    # 10-fold validation:
    tests = datasets.load_dataset(
        'mnist',
        split=[datasets.ReadInstruction('train', from_=k, to=k+10, unit='%')
               for k in range(0, 100, 10)])
    trains = datasets.load_dataset(
        'mnist',
        split=[datasets.ReadInstruction('train', to=k, unit='%') +
               datasets.ReadInstruction('train', from_=k+10, unit='%')
               for k in range(0, 100, 10)])
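To make the spec strings above more concrete, here is a simplified, library-independent sketch of how a spec such as 'test[:33%]+train[1:-1]' could be resolved to absolute index ranges. The resolve function, the regular expression, and the split sizes are all hypothetical. The rounding arithmetic is an assumption about what "closest" and "pct1_dropremainder" mean, not something stated on this page, and the sketch clamps out-of-range boundaries instead of validating them.

```python
import math
import re
from typing import Dict, List, Optional, Tuple

# One part of a spec: split name, optional [start:stop] slice, optional (rounding).
_PART = re.compile(
    r"^(?P<split>\w+)"
    r"(\[(?P<start>-?\d+)?(?P<start_pct>%)?:(?P<stop>-?\d+)?(?P<stop_pct>%)?\])?"
    r"(\((?P<rounding>\w+)\))?$"
)


def _to_abs(value: Optional[str], is_pct: bool, n: int, rounding: str, default: int) -> int:
    """Convert one slice boundary to an absolute index clamped to [0, n]."""
    if value is None:
        return default
    boundary = int(value)
    if is_pct:
        if rounding == "pct1_dropremainder":
            # Assumption: every 1% covers the same number of examples, remainder dropped.
            boundary = boundary * math.trunc(n / 100)
        else:
            # Assumption: "closest" rounds the boundary to the nearest example.
            boundary = int(round(boundary * n / 100))
    if boundary < 0:
        boundary += n  # negative boundaries count from the end of the split
    return max(0, min(boundary, n))


def resolve(spec: str, sizes: Dict[str, int]) -> List[Tuple[str, int, int]]:
    """Resolve a spec to a list of (split, start, stop) triples."""
    resolved = []
    for part in spec.split("+"):
        match = _PART.match(part.strip())
        if match is None:
            raise ValueError(f"Unsupported spec part: {part!r}")
        n = sizes[match.group("split")]
        rounding = match.group("rounding") or "closest"
        start = _to_abs(match.group("start"), bool(match.group("start_pct")), n, rounding, 0)
        stop = _to_abs(match.group("stop"), bool(match.group("stop_pct")), n, rounding, n)
        resolved.append((match.group("split"), start, stop))
    return resolved


# Hypothetical split sizes, used only for illustration.
sizes = {"test": 10000, "train": 60000}
print(resolve("test[:33%]+train[1:-1]", sizes))
```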
- class datasets.utils.DownloadConfig(cache_dir: Optional[Union[pathlib.Path, str]] = None, force_download: bool = False, resume_download: bool = False, local_files_only: bool = False, proxies: Optional[Dict] = None, user_agent: Optional[str] = None, extract_compressed_file: bool = False, force_extract: bool = False, delete_extracted: bool = False, use_etag: bool = True, num_proc: Optional[int] = None, max_retries: int = 1, use_auth_token: Optional[Union[bool, str]] = None)
Configuration for our cached path manager.
Variables:
cache_dir (str or Path, optional) – Specify a cache directory to save the file to (overrides the default cache dir).
force_download (bool, default False) – If True, re-download the file even if it is already cached in the cache dir.
resume_download (bool, default False) – If True, resume the download if an incompletely received file is found.
proxies (dict, optional)
user_agent (str, optional) – Optional string or dict that will be appended to the user-agent on remote requests.
extract_compressed_file (bool, default False) – If True and the path points to a zip or tar file, extract the compressed file into a folder alongside the archive.
force_extract (bool, default False) – If True when extract_compressed_file is True and the archive was already extracted, re-extract the archive and overwrite the folder where it was extracted.
delete_extracted (bool, default False) – Whether to delete (or keep) the extracted files.
use_etag (bool, default True)
num_proc (int, optional)
max_retries (int, default 1) – The number of times to retry an HTTP request if it fails.
use_auth_token (str or bool, optional) – Optional string or boolean to use as Bearer token for remote files on the Datasets Hub. If True, will get token from ~/.huggingface.
- class datasets.utils.Version(version_str: str, description: Optional[str] = None, major: Optional[Union[str, int]] = None, minor: Optional[Union[str, int]] = None, patch: Optional[Union[str, int]] = None)
Dataset version MAJOR.MINOR.PATCH.
Parameters:
version_str (str) – E.g. "1.2.3".
description (str) – A description of what is new in this version.
Variables:
version_str (str) – E.g. "1.2.3".
description (str) – A description of what is new in this version.
major (str)
minor (str)
patch (str)
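The MAJOR.MINOR.PATCH format can be illustrated with a small parser. This is a sketch under stated assumptions, not the library's implementation: parse_version is a hypothetical helper, it accepts only three dot-separated non-negative integers, and it returns integers so that versions compare numerically, whereas the listing above documents the three components as str.

```python
import re
from typing import Tuple

# Exactly three dot-separated groups of digits, e.g. "1.2.3".
_VERSION = re.compile(r"^(\d+)\.(\d+)\.(\d+)$")


def parse_version(version_str: str) -> Tuple[int, int, int]:
    """Split a 'MAJOR.MINOR.PATCH' string into a (major, minor, patch) tuple."""
    match = _VERSION.match(version_str)
    if match is None:
        raise ValueError(f"Expected MAJOR.MINOR.PATCH, got {version_str!r}")
    major, minor, patch = (int(group) for group in match.groups())
    return major, minor, patch


major, minor, patch = parse_version("1.2.3")
print(major, minor, patch)

# Tuples of integers compare component by component, so 1.10.0 sorts after 1.9.3,
# which a plain string comparison would get wrong.
print(parse_version("1.10.0") > parse_version("1.9.3"))
```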