Main classes

DatasetInfo

class datasets.DatasetInfo

( description: str = <factory>, citation: str = <factory>, homepage: str = <factory>, license: str = <factory>, features: Optional = None, post_processed: Optional = None, supervised_keys: Optional = None, task_templates: Optional = None, builder_name: Optional = None, dataset_name: Optional = None, config_name: Optional = None, version: Union = None, splits: Optional = None, download_checksums: Optional = None, download_size: Optional = None, post_processing_size: Optional = None, dataset_size: Optional = None, size_in_bytes: Optional = None )

Parameters

description (str) — A description of the dataset.
citation (str) — A BibTeX citation of the dataset.
homepage (str) — A URL to the official homepage for the dataset.
license (str) — The dataset’s license. It can be the name of the license or a paragraph containing the terms of the license.
features (Features, optional) — The features used to specify the dataset’s column types.
post_processed (PostProcessedInfo, optional) — Information regarding the resources of a possible post-processing of a dataset. For example, it can contain the information of an index.
supervised_keys (SupervisedKeysData, optional) — Specifies the input feature and the label for supervised learning if applicable for the dataset (legacy from TFDS).
builder_name (str, optional) — The name of the GeneratorBasedBuilder subclass used to create the dataset. Usually matched to the corresponding script name. It is also the snake_case version of the dataset builder class name.
config_name (str, optional) — The name of the configuration derived from BuilderConfig.
version (str or Version, optional) — The version of the dataset.
splits (dict, optional) — The mapping between split name and metadata.
download_checksums (dict, optional) — The mapping between each URL downloaded for the dataset and its corresponding metadata, such as the file checksum.
download_size (int, optional) — The size of the files to download to generate the dataset, in bytes.
post_processing_size (int, optional) — Size of the dataset in bytes after post-processing, if any.
dataset_size (int, optional) — The combined size in bytes of the Arrow tables for all splits.
size_in_bytes (int, optional) — The combined size in bytes of all files associated with the dataset (downloaded files + Arrow files).
task_templates (List[TaskTemplate], optional) — The task templates to prepare the dataset for during training and evaluation. Each template casts the dataset’s Features to standardized column names and types as detailed in datasets.tasks.
**config_kwargs (additional keyword arguments) — Keyword arguments to be passed to the BuilderConfig and used in the DatasetBuilder.

Information about a dataset.

DatasetInfo documents a dataset, including its name, version, and features. See the constructor arguments and properties for a full list.

Not all fields are known at construction time; some may be filled in or updated later.
