Main classes

DatasetInfo

class datasets.DatasetInfo(description: str = <factory>, citation: str = <factory>, homepage: str = <factory>, license: str = <factory>, features: Optional[datasets.features.Features] = None, post_processed: Optional[datasets.info.PostProcessedInfo] = None, supervised_keys: Optional[datasets.info.SupervisedKeysData] = None, task_templates: Optional[List[datasets.tasks.base.TaskTemplate]] = None, builder_name: Optional[str] = None, config_name: Optional[str] = None, version: Optional[Union[str, datasets.utils.version.Version]] = None, splits: Optional[dict] = None, download_checksums: Optional[dict] = None, download_size: Optional[int] = None, post_processing_size: Optional[int] = None, dataset_size: Optional[int] = None, size_in_bytes: Optional[int] = None)[source]

Information about a dataset.

DatasetInfo documents a dataset, including its name, version, and features. See the constructor arguments and properties for a full list.

Note: Not all fields are known at construction time; some may be updated later.

Variables
  • description (str) – A description of the dataset.

  • citation (str) – A BibTeX citation of the dataset.

  • homepage (str) – A URL to the official homepage for the dataset.

  • license (str) – The dataset’s license. It can be the name of the license or a paragraph containing the terms of the license.

  • features (Features, optional) – The features used to specify the dataset’s column types.

  • post_processed (PostProcessedInfo, optional) – Information regarding the resources of a possible post-processing of a dataset. For example, it can contain the information of an index.

  • supervised_keys (SupervisedKeysData, optional) – Specifies the input feature and the label for supervised learning if applicable for the dataset (legacy from TFDS).

  • builder_name (str, optional) – The name of the GeneratorBasedBuilder subclass used to create the dataset. Usually matched to the corresponding script name. It is also the snake_case version of the dataset builder class name.

  • config_name (str, optional) – The name of the configuration derived from BuilderConfig.

  • version (str or Version, optional) – The version of the dataset.

  • splits (dict, optional) – The mapping between split name and metadata.

  • download_checksums (dict, optional) – The mapping between the URL to download the dataset’s checksums and corresponding metadata.

  • download_size (int, optional) – The size of the files to download to generate the dataset, in bytes.

  • post_processing_size (int, optional) – Size of the dataset in bytes after post-processing, if any.

  • dataset_size (int, optional) – The combined size in bytes of the Arrow tables for all splits.

  • size_in_bytes (int, optional) – The combined size in bytes of all files associated with the dataset (downloaded files + Arrow files).

  • task_templates (List[TaskTemplate], optional) – The task templates to prepare the dataset for during training and evaluation. Each template casts the dataset’s Features to standardized column names and types as detailed in datasets.tasks.

  • **config_kwargs – Keyword arguments to be passed to the BuilderConfig and used in the DatasetBuilder.

classmethod from_directory(dataset_info_dir: str) → datasets.info.DatasetInfo[source]

Create DatasetInfo from the JSON file in dataset_info_dir.

This function updates all the dynamically generated fields (num_examples, hash, time of creation,…) of the DatasetInfo.

This will overwrite all previous metadata.

Parameters

dataset_info_dir (str) – The directory containing the metadata file. This should be the root directory of a specific dataset version.

write_to_directory(dataset_info_dir)[source]

Write DatasetInfo as JSON to dataset_info_dir.

Also save the license separately in LICENSE.

Dataset

The base class datasets.Dataset implements a Dataset backed by an Apache Arrow table.

class datasets.Dataset(arrow_table: datasets.table.Table, info: Optional[datasets.info.DatasetInfo] = None, split: Optional[datasets.splits.NamedSplit] = None, indices_table: Optional[datasets.table.Table] = None, fingerprint: Optional[str] = None)[source]

A Dataset backed by an Arrow table.

__getitem__(key: Union[int, slice, str]) → Union[Dict, List][source]

Can be used to index columns (by string names) or rows (by integer index or iterable of indices or bools).

__iter__()[source]

Iterate through the examples.

If a format is set with Dataset.set_format(), rows will be returned with the selected format.

__len__()[source]

Number of rows in the dataset.

add_column(name: str, column: Union[list, numpy.array], new_fingerprint: str)[source]

Add column to Dataset.

New in version 1.7.

Parameters
  • name (str) – Column name.

  • column (list or np.array) – Column data to be added.

Returns

Dataset

add_elasticsearch_index(column: str, index_name: Optional[str] = None, host: Optional[str] = None, port: Optional[int] = None, es_client: Optional[elasticsearch.Elasticsearch] = None, es_index_name: Optional[str] = None, es_index_config: Optional[dict] = None)[source]

Add a text index using ElasticSearch for fast retrieval. This is done in-place.

Parameters
  • column (str) – The column of the documents to add to the index.

  • index_name (Optional str) – The index_name/identifier of the index. This is the index name that is used to call Dataset.get_nearest_examples() or Dataset.search(). By default it corresponds to column.

  • host (Optional str, defaults to localhost) – Host where ElasticSearch is running.

  • port (Optional int, defaults to 9200) – Port where ElasticSearch is running.

  • es_client (Optional elasticsearch.Elasticsearch) – The elasticsearch client used to create the index if host and port are None.

  • es_index_name (Optional str) – The elasticsearch index name used to create the index.

  • es_index_config (Optional dict) –

    The configuration of the elasticsearch index. Default config is:

    {
        "settings": {
            "number_of_shards": 1,
            "analysis": {"analyzer": {"stop_standard": {"type": "standard", "stopwords": "_english_"}}},
        },
        "mappings": {
            "properties": {
                "text": {
                    "type": "text",
                    "analyzer": "standard",
                    "similarity": "BM25"
                },
            }
        },
    }
    

Example

import datasets
import elasticsearch

# Assumes an ElasticSearch server is reachable with the client's default settings.
es_client = elasticsearch.Elasticsearch()
ds = datasets.load_dataset('crime_and_punish', split='train')
ds.add_elasticsearch_index(column='line', es_client=es_client, es_index_name="my_es_index")
scores, retrieved_examples = ds.get_nearest_examples('line', 'my new query', k=10)

add_faiss_index(column: str, index_name: Optional[str] = None, device: Optional[int] = None, string_factory: Optional[str] = None, metric_type: Optional[int] = None, custom_index: Optional[faiss.Index] = None, train_size: Optional[int] = None, faiss_verbose: bool = False, dtype=<class 'numpy.float32'>)[source]

Add a dense index using Faiss for fast retrieval. By default the index is built over the vectors of the specified column. You can specify device to run it on GPU (device must be the GPU index). See the Faiss documentation for more information.

Parameters
  • column (str) – The column of the vectors to add to the index.

  • index_name (Optional str) – The index_name/identifier of the index. This is the index_name that is used to call datasets.Dataset.get_nearest_examples() or datasets.Dataset.search(). By default it corresponds to column.

  • device (Optional int) – If not None, this is the index of the GPU to use. By default it uses the CPU.

  • string_factory (Optional str) – This is passed to the index factory of Faiss to create the index. Default index class is IndexFlat.

  • metric_type (Optional int) – Type of metric. Ex: faiss.METRIC_INNER_PRODUCT or faiss.METRIC_L2.

  • custom_index (Optional faiss.Index) – Custom Faiss index that you already have instantiated and configured for your needs.

  • train_size (Optional int) – If the index needs a training step, specifies how many vectors will be used to train the index.

  • faiss_verbose (bool, defaults to False) – Enable the verbosity of the Faiss index.

  • dtype (data-type) – The dtype of the numpy arrays that are indexed. Default is np.float32.

Example

import datasets

# embed is a placeholder for a user-supplied function that maps a string
# to a vector; it is not provided by datasets.
ds = datasets.load_dataset('crime_and_punish', split='train')
ds_with_embeddings = ds.map(lambda example: {'embeddings': embed(example['line'])})
ds_with_embeddings.add_faiss_index(column='embeddings')
# query
scores, retrieved_examples = ds_with_embeddings.get_nearest_examples('embeddings', embed('my new query'), k=10)
# save index
ds_with_embeddings.save_faiss_index('embeddings', 'my_index.faiss')

ds = datasets.load_dataset('crime_and_punish', split='train')
# load index
ds.load_faiss_index('embeddings', 'my_index.faiss')
# query
scores, retrieved_examples = ds.get_nearest_examples('embeddings', embed('my new query'), k=10)

add_faiss_index_from_external_arrays(external_arrays: numpy.array, index_name: str, device: Optional[int] = None, string_factory: Optional[str] = None, metric_type: Optional[int] = None, custom_index: Optional[faiss.Index] = None, train_size: Optional[int] = None, faiss_verbose: bool = False, dtype=<class 'numpy.float32'>)[source]

Add a dense index using Faiss for fast retrieval. The index is created using the vectors of external_arrays. You can specify device to run it on GPU (device must be the GPU index). See the Faiss documentation for more information.

Parameters
  • external_arrays (np.array) – If you want to use arrays from outside the lib for the index, you can set external_arrays. It will use external_arrays to create the Faiss index instead of the arrays in the given column.

  • index_name (str) – The index_name/identifier of the index. This is the index_name that is used to call datasets.Dataset.get_nearest_examples() or datasets.Dataset.search().

  • device (Optional int) – If not None, this is the index of the GPU to use. By default it uses the CPU.

  • string_factory (Optional str) – This is passed to the index factory of Faiss to create the index. Default index class is IndexFlat.

  • metric_type (Optional int) – Type of metric. Ex: faiss.METRIC_INNER_PRODUCT or faiss.METRIC_L2.

  • custom_index (Optional faiss.Index) – Custom Faiss index that you already have instantiated and configured for your needs.

  • train_size (Optional int) – If the index needs a training step, specifies how many vectors will be used to train the index.

  • faiss_verbose (bool, defaults to False) – Enable the verbosity of the Faiss index.

  • dtype (numpy.dtype) – The dtype of the numpy arrays that are indexed. Default is np.float32.

add_item(item: dict, new_fingerprint: str)[source]

Add item to Dataset.

New in version 1.7.

Parameters

item (dict) – Item data to be added.

Returns

Dataset

property cache_files

The cache files containing the Apache Arrow table backing the dataset.

cast(features: datasets.features.Features, batch_size: Optional[int] = 10000, keep_in_memory: bool = False, load_from_cache_file: bool = True, cache_file_name: Optional[str] = None, writer_batch_size: Optional[int] = 10000, num_proc: Optional[int] = None) → datasets.arrow_dataset.Dataset[source]

Cast the dataset to a new set of features.

Parameters
  • features (datasets.Features) – New features to cast the dataset to. The name of the fields in the features must match the current column names. The type of the data must also be convertible from one type to the other. For non-trivial conversion, e.g. string <-> ClassLabel you should use map() to update the Dataset.

  • batch_size (Optional[int], defaults to 10000) – Number of examples per batch provided to cast. If batch_size <= 0 or batch_size == None, the full dataset is provided as a single batch to cast.

  • keep_in_memory (bool, default False) – Whether to copy the data in-memory.

  • load_from_cache_file (bool, default True if caching is enabled) – If a cache file storing the current computation from function can be identified, use it instead of recomputing.

  • cache_file_name (Optional[str], default None) – Provide the name of a path for the cache file. It is used to store the results of the computation instead of the automatically generated cache file name.

  • writer_batch_size (int, default 10000) – Number of rows per write operation for the cache file writer. This value is a good trade-off between memory usage during the processing, and processing speed. A higher value makes the processing do fewer lookups; a lower value consumes less temporary memory while running .map().

  • num_proc (Optional[int], default None) – Number of processes for multiprocessing. By default it doesn’t use multiprocessing.

Returns

Dataset – A copy of the dataset with cast features.

cast_(features: datasets.features.Features, batch_size: Optional[int] = 10000, keep_in_memory: bool = False, load_from_cache_file: bool = True, cache_file_name: Optional[str] = None, writer_batch_size: Optional[int] = 10000, num_proc: Optional[int] = None)[source]

In-place version of Dataset.cast().

Deprecated since version 1.4.0: Use Dataset.cast() instead.

Parameters
  • features (datasets.Features) – New features to cast the dataset to. The name of the fields in the features must match the current column names. The type of the data must also be convertible from one type to the other. For non-trivial conversion, e.g. string <-> ClassLabel you should use map() to update the Dataset.

  • batch_size (Optional[int], defaults to 10000) – Number of examples per batch provided to cast. If batch_size <= 0 or batch_size == None, the full dataset is provided as a single batch to cast.

  • keep_in_memory (bool, default False) – Whether to copy the data in-memory.

  • load_from_cache_file (bool, default True if caching is enabled) – If a cache file storing the current computation from function can be identified, use it instead of recomputing.

  • cache_file_name (Optional[str], default None) – Provide the name of a path for the cache file. It is used to store the results of the computation instead of the automatically generated cache file name.

  • writer_batch_size (int, default 10000) – Number of rows per write operation for the cache file writer. This value is a good trade-off between memory usage during the processing, and processing speed. A higher value makes the processing do fewer lookups; a lower value consumes less temporary memory while running .map().

  • num_proc (Optional[int], default None) – Number of processes for multiprocessing. By default it doesn’t use multiprocessing.

class_encode_column(column: str) → datasets.arrow_dataset.Dataset[source]

Casts the given column as datasets.features.ClassLabel and updates the table.

Parameters

column (str) – The name of the column to cast (list all the column names with datasets.Dataset.column_names).

cleanup_cache_files() → int[source]

Clean up all cache files in the dataset cache directory, except the currently used cache file if there is one.

Be careful when running this command that no other process is currently using other cache files.

Returns

int – Number of removed files.

property column_names

Names of the columns in the dataset.

property data

The Apache Arrow table backing the dataset.

drop_index(index_name: str)

Drop the index with the specified index_name.

Parameters

index_name (str) – The index_name/identifier of the index.

export(filename: str, format: str = 'tfrecord')[source]

Writes the Arrow dataset to a TFRecord file.

The dataset must already be in tensorflow format. The records will be written with keys from dataset._format_columns.

Parameters
  • filename (str) – The filename, including the .tfrecord extension, to write to.

  • format (str, optional, default “tfrecord”) – The type of output file. Currently this is a no-op, as TFRecords are the only option. This enables a more flexible function signature later.

filter(function: Optional[Callable] = None, with_indices=False, input_columns: Optional[Union[str, List[str]]] = None, batch_size: Optional[int] = 1000, remove_columns: Optional[List[str]] = None, keep_in_memory: bool = False, load_from_cache_file: bool = True, cache_file_name: Optional[str] = None, writer_batch_size: Optional[int] = 1000, fn_kwargs: Optional[dict] = None, num_proc: Optional[int] = None, suffix_template: str = '_{rank:05d}_of_{num_proc:05d}', new_fingerprint: Optional[str] = None) → datasets.arrow_dataset.Dataset[source]

Apply a filter function to all the elements in the table in batches and update the table so that the dataset only includes examples for which the filter function returns True.

Parameters
  • function (Callable) –

    Callable with one of the following signatures:

    • function(example: Union[Dict, Any]) -> bool if with_indices=False

    • function(example: Union[Dict, Any], indices: int) -> bool if with_indices=True

    If no function is provided, defaults to an always True function: lambda x: True.

  • with_indices (bool, default False) – Provide example indices to function. Note that in this case the signature of function should be def function(example, idx): ….

  • input_columns (str or List[str], optional) – The columns to be passed into function as positional arguments. If None, a dict mapping to all formatted columns is passed as one argument.

  • batch_size (int, optional, default 1000) – Number of examples per batch provided to function if batched = True. If batch_size <= 0 or batch_size == None: provide the full dataset as a single batch to function

  • remove_columns (List[str], optional) – Remove a selection of columns while doing the mapping. Columns will be removed before updating the examples with the output of function, i.e. if function is adding columns with names in remove_columns, these columns will be kept.

  • keep_in_memory (bool, default False) – Keep the dataset in memory instead of writing it to a cache file.

  • load_from_cache_file (bool, default True) – If a cache file storing the current computation from function can be identified, use it instead of recomputing.

  • cache_file_name (str, optional) – Provide the name of a path for the cache file. It is used to store the results of the computation instead of the automatically generated cache file name.

  • writer_batch_size (int, default 1000) – Number of rows per write operation for the cache file writer. This value is a good trade-off between memory usage during the processing, and processing speed. A higher value makes the processing do fewer lookups; a lower value consumes less temporary memory while running .map().

  • fn_kwargs (dict, optional) – Keyword arguments to be passed to function

  • num_proc (int, optional) – Number of processes for multiprocessing. By default it doesn’t use multiprocessing.

  • suffix_template (str) – If cache_file_name is specified, this suffix is added at the end of the base name of each cache file. For example, if cache_file_name is “processed.arrow”, then for rank = 1 and num_proc = 4, the resulting file would be “processed_00001_of_00004.arrow” with the default suffix (default _{rank:05d}_of_{num_proc:05d}).

  • new_fingerprint (str, optional) – The new fingerprint of the dataset after transform. If None, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments.

flatten(new_fingerprint, max_depth=16) → Dataset[source]

Flatten the table. Each column with a struct type is flattened into one column per struct field. Other columns are left unchanged.

Returns

Dataset – A copy of the dataset with flattened columns.

flatten_(max_depth=16)[source]

In-place version of Dataset.flatten().

Deprecated since version 1.4.0: Use Dataset.flatten() instead.

flatten_indices(keep_in_memory: bool = False, cache_file_name: Optional[str] = None, writer_batch_size: Optional[int] = 1000, features: Optional[datasets.features.Features] = None, disable_nullable: bool = True, new_fingerprint: Optional[str] = None) → datasets.arrow_dataset.Dataset[source]

Create and cache a new Dataset by flattening the indices mapping.

Parameters
  • keep_in_memory (bool, default False) – Keep the dataset in memory instead of writing it to a cache file.

  • cache_file_name (Optional[str], default None) – Provide the name of a path for the cache file. It is used to store the results of the computation instead of the automatically generated cache file name.

  • writer_batch_size (int, default 1000) – Number of rows per write operation for the cache file writer. This value is a good trade-off between memory usage during the processing, and processing speed. A higher value makes the processing do fewer lookups; a lower value consumes less temporary memory while running .map().

  • features (Optional[datasets.Features], default None) – Use a specific Features to store the cache file instead of the automatically generated one.

  • disable_nullable (bool, default True) – Disallow null values in the table.

  • new_fingerprint (Optional[str], default None) – The new fingerprint of the dataset after transform. If None, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments.

formatted_as(type: Optional[str] = None, columns: Optional[List] = None, output_all_columns: bool = False, **format_kwargs)[source]

To be used in a with statement. Set __getitem__ return format (type and columns).

Parameters
  • type (Optional str) – Output type, one of [None, ‘numpy’, ‘torch’, ‘tensorflow’, ‘pandas’, ‘arrow’]. None means __getitem__ returns Python objects (default).

  • columns (Optional List[str]) – Columns to format in the output. None means __getitem__ returns all columns (default).

  • output_all_columns (bool, defaults to False) – Keep un-formatted columns in the output as well (as Python objects).

  • format_kwargs – Keyword arguments passed to the convert function, such as np.array, torch.tensor or tensorflow.ragged.constant.

classmethod from_buffer(buffer: pyarrow.lib.Buffer, info: Optional[datasets.info.DatasetInfo] = None, split: Optional[datasets.splits.NamedSplit] = None, indices_buffer: Optional[pyarrow.lib.Buffer] = None) → datasets.arrow_dataset.Dataset[source]

Instantiate a Dataset backed by an Arrow buffer.

Parameters
  • buffer (pyarrow.Buffer) – Arrow buffer.

  • info (DatasetInfo, optional) – Dataset information, like description, citation, etc.

  • split (NamedSplit, optional) – Name of the dataset split.

  • indices_buffer (pyarrow.Buffer, optional) – Indices Arrow buffer.

Returns

Dataset

static from_csv(path_or_paths: Union[str, bytes, os.PathLike, List[Union[str, bytes, os.PathLike]]], split: Optional[datasets.splits.NamedSplit] = None, features: Optional[datasets.features.Features] = None, cache_dir: str = None, keep_in_memory: bool = False, **kwargs)[source]

Create Dataset from CSV file(s).

Parameters
  • path_or_paths (path-like or list of path-like) – Path(s) of the CSV file(s).

  • split (NamedSplit, optional) – Split name to be assigned to the dataset.

  • features (Features, optional) – Dataset features.

  • cache_dir (str, optional, default "~/.cache/huggingface/datasets") – Directory to cache data.

  • keep_in_memory (bool, default False) – Whether to copy the data in-memory.

  • **kwargs – Keyword arguments to be passed to pandas.read_csv().

Returns

Dataset

classmethod from_dict(mapping: dict, features: Optional[datasets.features.Features] = None, info: Optional[Any] = None, split: Optional[Any] = None) → datasets.arrow_dataset.Dataset[source]

Convert dict to a pyarrow.Table to create a Dataset.

Parameters
  • mapping (Mapping) – Mapping of strings to Arrays or Python lists.

  • features (Features, optional) – Dataset features.

  • info (DatasetInfo, optional) – Dataset information, like description, citation, etc.

  • split (NamedSplit, optional) – Name of the dataset split.

Returns

Dataset

classmethod from_file(filename: str, info: Optional[datasets.info.DatasetInfo] = None, split: Optional[datasets.splits.NamedSplit] = None, indices_filename: Optional[str] = None, in_memory: bool = False) → datasets.arrow_dataset.Dataset[source]

Instantiate a Dataset backed by an Arrow table at filename.

Parameters
  • filename (str) – File name of the dataset.

  • info (DatasetInfo, optional) – Dataset information, like description, citation, etc.

  • split (NamedSplit, optional) – Name of the dataset split.

  • indices_filename (str, optional) – File names of the indices.

  • in_memory (bool, default False) – Whether to copy the data in-memory.

Returns

Dataset

static from_json(path_or_paths: Union[str, bytes, os.PathLike, List[Union[str, bytes, os.PathLike]]], split: Optional[datasets.splits.NamedSplit] = None, features: Optional[datasets.features.Features] = None, cache_dir: str = None, keep_in_memory: bool = False, field: Optional[str] = None, **kwargs)[source]

Create Dataset from JSON or JSON Lines file(s).

Parameters
  • path_or_paths (path-like or list of path-like) – Path(s) of the JSON or JSON Lines file(s).

  • split (NamedSplit, optional) – Split name to be assigned to the dataset.

  • features (Features, optional) – Dataset features.

  • cache_dir (str, optional, default "~/.cache/huggingface/datasets") – Directory to cache data.

  • keep_in_memory (bool, default False) – Whether to copy the data in-memory.

  • field (str, optional) – Name of the field in the JSON file that contains the dataset.

  • **kwargs – Keyword arguments to be passed to JsonConfig.

Returns

Dataset

classmethod from_pandas(df: pandas.core.frame.DataFrame, features: Optional[datasets.features.Features] = None, info: Optional[datasets.info.DatasetInfo] = None, split: Optional[datasets.splits.NamedSplit] = None) → datasets.arrow_dataset.Dataset[source]

Convert pandas.DataFrame to a pyarrow.Table to create a Dataset.

The column types in the resulting Arrow Table are inferred from the dtypes of the pandas.Series in the DataFrame. In the case of non-object Series, the NumPy dtype is translated to its Arrow equivalent. In the case of object, we need to guess the datatype by looking at the Python objects in this Series.

Be aware that Series of the object dtype don’t carry enough information to always lead to a meaningful Arrow type. In the case that we cannot infer a type, e.g. because the DataFrame is of length 0 or the Series only contains None/nan objects, the type is set to null. This behavior can be avoided by constructing explicit features and passing it to this function.

Parameters
  • df (pandas.DataFrame) – Dataframe that contains the dataset.

  • features (Features, optional) – Dataset features.

  • info (DatasetInfo, optional) – Dataset information, like description, citation, etc.

  • split (NamedSplit, optional) – Name of the dataset split.

Returns

Dataset

static from_text(path_or_paths: Union[str, bytes, os.PathLike, List[Union[str, bytes, os.PathLike]]], split: Optional[datasets.splits.NamedSplit] = None, features: Optional[datasets.features.Features] = None, cache_dir: str = None, keep_in_memory: bool = False, **kwargs)[source]

Create Dataset from text file(s).

Parameters
  • path_or_paths (path-like or list of path-like) – Path(s) of the text file(s).

  • split (NamedSplit, optional) – Split name to be assigned to the dataset.

  • features (Features, optional) – Dataset features.

  • cache_dir (str, optional, default "~/.cache/huggingface/datasets") – Directory to cache data.

  • keep_in_memory (bool, default False) – Whether to copy the data in-memory.

  • **kwargs – Keyword arguments to be passed to TextConfig.

Returns

Dataset

get_index(index_name: str) → datasets.search.BaseIndex

Return the index with the specified index_name.

Parameters

index_name (str) – Index name.

Returns

BaseIndex

get_nearest_examples(index_name: str, query: Union[str, numpy.array], k: int = 10) → datasets.search.NearestExamplesResults

Find the nearest examples in the dataset to the query.

Parameters
  • index_name (str) – The index_name/identifier of the index.

  • query (Union[str, np.ndarray]) – The query as a string if index_name is a text index or as a numpy array if index_name is a vector index.

  • k (int) – The number of examples to retrieve.

Returns

  • scores (List[float]) – The retrieval scores of the retrieved examples.

  • examples (dict) – The retrieved examples.

get_nearest_examples_batch(index_name: str, queries: Union[List[str], numpy.array], k: int = 10) → datasets.search.BatchedNearestExamplesResults

Find the nearest examples in the dataset to each query.

Parameters
  • index_name (str) – The index_name/identifier of the index.

  • queries (Union[List[str], np.ndarray]) – The queries as a list of strings if index_name is a text index or as a numpy array if index_name is a vector index.

  • k (int) – The number of examples to retrieve per query.

Returns

  • total_scores (List[List[float]]) – The retrieval scores of the retrieved examples per query.

  • total_examples (List[dict]) – The retrieved examples per query.

property info

datasets.DatasetInfo object containing all the metadata in the dataset.

list_indexes() → List[str]

List the index_name/identifiers of all the attached indexes.

load_elasticsearch_index(index_name: str, es_index_name: str, host: Optional[str] = None, port: Optional[int] = None, es_client: Optional[Elasticsearch] = None, es_index_config: Optional[dict] = None)

Load an existing text index using ElasticSearch for fast retrieval.

Parameters
  • index_name (str) – The index_name/identifier of the index. This is the index name that is used to call Dataset.get_nearest_examples() or Dataset.search().

  • es_index_name (str) – The name of elasticsearch index to load.

  • host (Optional str, defaults to localhost) – Host where ElasticSearch is running.

  • port (Optional int, defaults to 9200) – Port where ElasticSearch is running.

  • es_client (Optional elasticsearch.Elasticsearch) – The elasticsearch client used to create the index if host and port are None.

  • es_index_config (Optional dict) –

    The configuration of the elasticsearch index. Default config is:

    {
        "settings": {
            "number_of_shards": 1,
            "analysis": {"analyzer": {"stop_standard": {"type": "standard", "stopwords": "_english_"}}},
        },
        "mappings": {
            "properties": {
                "text": {
                    "type": "text",
                    "analyzer": "standard",
                    "similarity": "BM25"
                },
            }
        },
    }
    

load_faiss_index(index_name: str, file: Union[str, pathlib.PurePath], device: Optional[int] = None)

Load a FaissIndex from disk.

For additional configuration, you can access the Faiss index object with .get_index(index_name).faiss_index and adjust it to fit your needs.

Parameters
  • index_name (str) – The index_name/identifier of the index. This is the index_name that is used to call Dataset.get_nearest_examples() or Dataset.search().

  • file (str or pathlib.PurePath) – The path to the serialized Faiss index on disk.

  • device (Optional int) – If not None, this is the index of the GPU to use. By default it uses the CPU.

static load_from_disk(dataset_path: str, fs=None, keep_in_memory: Optional[bool] = None) → datasets.arrow_dataset.Dataset[source]

Loads a dataset that was previously saved using save_to_disk() from a dataset directory, or from a filesystem using either S3FileSystem or any implementation of fsspec.spec.AbstractFileSystem.

Parameters
  • dataset_path (str) – Path (e.g. dataset/train) or remote URI (e.g. s3://my-bucket/dataset/train) of the dataset directory where the dataset will be loaded from.

  • fs (S3FileSystem, fsspec.spec.AbstractFileSystem, optional, default None) – Instance of the remote filesystem used to download the files from.

  • keep_in_memory (bool, default None) – Whether to copy the dataset in-memory. If None, the dataset will not be copied in-memory unless explicitly enabled by setting datasets.config.IN_MEMORY_MAX_SIZE to nonzero. See more details in the Enhancing performance section.

Returns

Dataset or DatasetDict.
  • if dataset_path is a path of a dataset directory: the Dataset requested,

  • if dataset_path is a path of a dataset dict directory: a DatasetDict with each split.

map(function: Optional[Callable] = None, with_indices: bool = False, input_columns: Optional[Union[str, List[str]]] = None, batched: bool = False, batch_size: Optional[int] = 1000, drop_last_batch: bool = False, remove_columns: Optional[List[str]] = None, keep_in_memory: bool = False, load_from_cache_file: bool = None, cache_file_name: Optional[str] = None, writer_batch_size: Optional[int] = 1000, features: Optional[datasets.features.Features] = None, disable_nullable: bool = False, fn_kwargs: Optional[dict] = None, num_proc: Optional[int] = None, suffix_template: str = '_{rank:05d}_of_{num_proc:05d}', new_fingerprint: Optional[str] = None, desc: Optional[str] = None) → datasets.arrow_dataset.Dataset[source]

Apply a function to all the elements in the table (individually or in batches) and update the table (if function does update examples).
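The difference between per-example and batched functions can be illustrated with a small stand-in. This is only a sketch on a plain dict of lists: toy_map is a hypothetical helper, not the library implementation, and it ignores caching, input_columns, remove_columns and multiprocessing.

```python
from typing import Any, Callable, Dict, List

Columns = Dict[str, List[Any]]


def toy_map(columns: Columns, function: Callable, batched: bool = False,
            with_indices: bool = False, batch_size: int = 1000) -> Columns:
    """Hypothetical stand-in for map() on a dict of equal-length lists."""
    num_rows = len(next(iter(columns.values()))) if columns else 0
    if not batched:
        step = 1
    elif batch_size is None or batch_size <= 0:
        step = max(num_rows, 1)  # the whole table as a single batch
    else:
        step = batch_size

    result: Columns = {}
    for start in range(0, num_rows, step):
        stop = min(start + step, num_rows)
        if batched:
            # Batched: the function receives a dict of lists and a list of indices.
            inputs = {name: values[start:stop] for name, values in columns.items()}
            indices = list(range(start, stop))
        else:
            # Per example: the function receives a dict of values and one index.
            inputs = {name: values[start] for name, values in columns.items()}
            indices = start
        output = function(inputs, indices) if with_indices else function(inputs)
        # The function output updates (adds or overwrites) the input columns.
        for name, value in {**inputs, **output}.items():
            target = result.setdefault(name, [])
            if batched:
                target.extend(value)
            else:
                target.append(value)
    return result


data = {"text": ["a", "bb", "ccc"]}
per_example = toy_map(data, lambda ex: {"length": len(ex["text"])})
per_batch = toy_map(data, lambda b: {"length": [len(t) for t in b["text"]]},
                    batched=True, batch_size=2)
with_index = toy_map(data, lambda ex, i: {"row": i}, with_indices=True)
```

Both per_example and per_batch end up with the same "length" column; the batched function simply sees lists instead of single values.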

Parameters
  • function (Callable) –

    Function with one of the following signatures:

    • function(example: Union[Dict, Any]) -> Union[Dict, Any] if batched=False and with_indices=False

    • function(example: Union[Dict, Any], indices: int) -> Union[Dict, Any] if batched=False and with_indices=True

    • function(batch: Union[Dict[List], List[Any]]) -> Union[Dict, Any] if batched=True and with_indices=False

    • function(batch: Union[Dict[List], List[Any]], indices: List[int]) -> Union[Dict, Any] if batched=True and with_indices=True

    If no function is provided, default to identity function: lambda x: x.

  • with_indices (bool, default False) – Provide example indices to function. Note that in this case the signature of function should be def function(example, idx): ….

  • input_columns (Optional[Union[str, List[str]]], default None) – The columns to be passed into function as positional arguments. If None, a dict mapping to all formatted columns is passed as one argument.

  • batched (bool, default False) – Provide batch of examples to function.

  • batch_size (Optional[int], default 1000) – Number of examples per batch provided to function if batched=True. If batch_size <= 0 or batch_size == None, the full dataset is provided as a single batch to function.

  • drop_last_batch (bool, default False) – Whether a last batch smaller than the batch_size should be dropped instead of being processed by the function.

  • remove_columns (Optional[List[str]], default None) – Remove a selection of columns while doing the mapping. Columns will be removed before updating the examples with the output of function, i.e. if function is adding columns with names in remove_columns, these columns will be kept.

  • keep_in_memory (bool, default False) – Keep the dataset in memory instead of writing it to a cache file.

  • load_from_cache_file (bool, default True if caching is enabled) – If a cache file storing the current computation from function can be identified, use it instead of recomputing.

  • cache_file_name (Optional[str], default None) – Provide the name of a path for the cache file. It is used to store the results of the computation instead of the automatically generated cache file name.

  • writer_batch_size (int, default 1000) – Number of rows per write operation for the cache file writer. This value is a good trade-off between memory usage during the processing, and processing speed. A higher value makes the processing do fewer lookups; a lower value consumes less temporary memory while running .map().

  • features (Optional[datasets.Features], default None) – Use a specific Features to store the cache file instead of the automatically generated one.

  • disable_nullable (bool, default False) – Disallow null values in the table.

  • fn_kwargs (Optional[Dict], default None) – Keyword arguments to be passed to function.

  • num_proc (Optional[int], default None) – Number of processes for multiprocessing. By default it doesn’t use multiprocessing.

  • suffix_template (str) – If cache_file_name is specified, then this suffix will be added at the end of the base name of each cache file. Defaults to “_{rank:05d}_of_{num_proc:05d}”. For example, if cache_file_name is “processed.arrow”, then for rank=1 and num_proc=4, the resulting file would be “processed_00001_of_00004.arrow” for the default suffix.

  • new_fingerprint (Optional[str], default None) – the new fingerprint of the dataset after transform. If None, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments.

  • desc (Optional[str], defaults to None) – Meaningful description to be displayed alongside the progress bar while mapping examples.
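The suffix_template naming rule can be reproduced with plain string formatting. This is a minimal sketch; shard_cache_file_name is a hypothetical helper written for illustration and is not part of the library.

```python
import os

DEFAULT_SUFFIX_TEMPLATE = "_{rank:05d}_of_{num_proc:05d}"


def shard_cache_file_name(cache_file_name, rank, num_proc,
                          suffix_template=DEFAULT_SUFFIX_TEMPLATE):
    """Insert the formatted suffix between the base name and the extension."""
    base, extension = os.path.splitext(cache_file_name)
    return base + suffix_template.format(rank=rank, num_proc=num_proc) + extension


name = shard_cache_file_name("processed.arrow", rank=1, num_proc=4)
print(name)  # processed_00001_of_00004.arrow
```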

property num_columns

Number of columns in the dataset.

property num_rows

Number of rows in the dataset (same as Dataset.__len__()).

prepare_for_task(task: Union[str, datasets.tasks.base.TaskTemplate]) → datasets.arrow_dataset.Dataset[source]

Prepare a dataset for the given task by casting the dataset’s Features to standardized column names and types as detailed in datasets.tasks.

Casts datasets.DatasetInfo.features according to a task-specific schema. Intended for single-use only, so all task templates are removed from datasets.DatasetInfo.task_templates after casting.

Parameters

task (Union[str, TaskTemplate]) –

The task to prepare the dataset for during training and evaluation. If str, supported tasks include:

  • "text-classification"

  • "question-answering"

If TaskTemplate, must be one of the task templates in datasets.tasks.

remove_columns(column_names: Union[str, List[str]], new_fingerprint) → Dataset[source]

Remove one or several column(s) in the dataset and the features associated to them.

You can also remove a column using Dataset.map() with remove_columns but the present method is in-place (doesn’t copy the data to a new dataset) and is thus faster.

Parameters
  • column_names (Union[str, List[str]]) – Name of the column(s) to remove.

  • new_fingerprint – The new fingerprint of the dataset after the transform.

Returns

Dataset – A copy of the dataset object without the columns to remove.

remove_columns_(column_names: Union[str, List[str]])[source]

In-place version of Dataset.remove_columns().

Deprecated since version 1.4.0: Use Dataset.remove_columns() instead.

Parameters

column_names (Union[str, List[str]]) – Name of the column(s) to remove.

rename_column(original_column_name: str, new_column_name: str, new_fingerprint) → Dataset[source]

Rename a column in the dataset, and move the features associated to the original column under the new column name.

Parameters
  • original_column_name (str) – Name of the column to rename.

  • new_column_name (str) – New name for the column.

  • new_fingerprint – The new fingerprint of the dataset after the transform.

Returns

Dataset – A copy of the dataset with a renamed column.

rename_column_(original_column_name: str, new_column_name: str)[source]

In-place version of Dataset.rename_column().

Deprecated since version 1.4.0: Use Dataset.rename_column() instead.

Parameters
  • original_column_name (str) – Name of the column to rename.

  • new_column_name (str) – New name for the column.

reset_format()[source]

Reset __getitem__ return format to python objects and all columns.

Same as self.set_format()

save_faiss_index(index_name: str, file: Union[str, pathlib.PurePath])

Save a FaissIndex on disk.

Parameters
  • index_name (str) – The index_name/identifier of the index. This is the index_name that is used to call .get_nearest or .search.

  • file (str) – The path to the serialized faiss index on disk.

save_to_disk(dataset_path: str, fs=None)[source]

Saves a dataset to a dataset directory, or in a filesystem using either S3FileSystem or any implementation of fsspec.spec.AbstractFileSystem.

Note regarding sliced datasets:

If you sliced the dataset in some way (using shard, train_test_split or select for example), then an indices mapping is added to avoid having to rewrite a new arrow Table (this saves time and disk/memory usage). It maps the indices used by __getitem__ to the right rows of the arrow Table. By default save_to_disk saves the full dataset table plus the mapping.

If you want to only save the shard of the dataset instead of the original arrow file and the indices, then you have to call datasets.Dataset.flatten_indices() before saving. This will create a new arrow table by using the right rows of the original table.
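The idea of an indices mapping, and of flattening it, can be sketched on plain Python lists. The names below (get_row, flatten) are hypothetical and only model the behaviour described above; they are not the library's internals.

```python
# Full table, as originally loaded (stands in for the arrow Table).
table = {"text": ["a", "b", "c", "d", "e"]}

# Mapping produced by a slicing operation such as select([4, 0, 2]).
indices = [4, 0, 2]


def get_row(table, indices, i):
    """Look up logical row i through the indices mapping."""
    row = indices[i]
    return {name: values[row] for name, values in table.items()}


def flatten(table, indices):
    """Build a new, smaller table that contains only the mapped rows."""
    return {name: [values[row] for row in indices] for name, values in table.items()}


first = get_row(table, indices, 0)   # reads row 4 of the full table
flat = flatten(table, indices)       # the mapping is no longer needed
```

Saving table together with indices keeps all five rows on disk; saving flat keeps only the three selected rows.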

Parameters
  • dataset_path (str) – Path (e.g. dataset/train) or remote URI (e.g. s3://my-bucket/dataset/train) of the dataset directory where the dataset will be saved to.

  • fs (S3FileSystem, fsspec.spec.AbstractFileSystem, optional, defaults None) – Instance of the remote filesystem used to upload the files to.

search(index_name: str, query: Union[str, numpy.array], k: int = 10) → datasets.search.SearchResults

Find the nearest examples indices in the dataset to the query.

Parameters
  • index_name (str) – The name/identifier of the index.

  • query (Union[str, np.ndarray]) – The query as a string if index_name is a text index or as a numpy array if index_name is a vector index.

  • k (int) – The number of examples to retrieve.

Returns

  • scores (List[float]) – The retrieval scores of the retrieved examples.

  • indices (List[int]) – The indices of the retrieved examples.

search_batch(index_name: str, queries: Union[List[str], numpy.array], k: int = 10) → datasets.search.BatchedSearchResults

Find the nearest examples indices in the dataset to each of the queries.

Parameters
  • index_name (str) – The index_name/identifier of the index.

  • queries (Union[List[str], np.ndarray]) – The queries as a list of strings if index_name is a text index or as a numpy array if index_name is a vector index.

  • k (int) – The number of examples to retrieve per query.

Returns

  • total_scores (List[List[float]]) – The retrieval scores of the retrieved examples per query.

  • total_indices (List[List[int]]) – The indices of the retrieved examples per query.

select(indices: collections.abc.Iterable, keep_in_memory: bool = False, indices_cache_file_name: Optional[str] = None, writer_batch_size: Optional[int] = 1000, new_fingerprint: Optional[str] = None) → datasets.arrow_dataset.Dataset[source]

Create a new dataset with rows selected following the list/array of indices.

Parameters
  • indices (sequence, iterable, ndarray or Series) – List or 1D-array of integer indices for indexing.

  • keep_in_memory (bool, default False) – Keep the indices mapping in memory instead of writing it to a cache file.

  • indices_cache_file_name (Optional[str], default None) – Provide the name of a path for the cache file. It is used to store the indices mapping instead of the automatically generated cache file name.

  • writer_batch_size (int, default 1000) – Number of rows per write operation for the cache file writer. This value is a good trade-off between memory usage during the processing, and processing speed. A higher value makes the processing do fewer lookups; a lower value consumes less temporary memory while running .map().

  • new_fingerprint (Optional[str], default None) – the new fingerprint of the dataset after transform. If None, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments

set_format(type: Optional[str] = None, columns: Optional[List] = None, output_all_columns: bool = False, **format_kwargs)[source]

Set __getitem__ return format (type and columns). The data formatting is applied on-the-fly. The format type (for example “numpy”) is used to format batches when using __getitem__. It’s also possible to use custom transforms for formatting using datasets.Dataset.set_transform().

Parameters
  • type (Optional str) – Output type, selected from [None, ‘numpy’, ‘torch’, ‘tensorflow’, ‘pandas’, ‘arrow’]. None means __getitem__ returns python objects (default).

  • columns (Optional List[str]) – columns to format in the output. None means __getitem__ returns all columns (default).

  • output_all_columns (bool default to False) – keep un-formatted columns as well in the output (as python objects)

  • format_kwargs – keywords arguments passed to the convert function like np.array, torch.tensor or tensorflow.ragged.constant.

It is possible to call map after calling set_format. Since map may add new columns, then the list of formatted columns gets updated. In this case, if you apply map on a dataset to add a new column, then this column will be formatted:

new formatted columns = (all columns - previously unformatted columns)
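The rule above is simple set arithmetic on column names. As a sketch (formatted_columns_after_map is a hypothetical helper and the column names are made up for illustration):

```python
def formatted_columns_after_map(columns_before, formatted_before, columns_after):
    """Apply: new formatted columns = all columns - previously unformatted columns."""
    unformatted = [c for c in columns_before if c not in formatted_before]
    return [c for c in columns_after if c not in unformatted]


new_formatted = formatted_columns_after_map(
    columns_before=["text", "label", "idx"],
    formatted_before=["label"],                       # set_format(columns=["label"])
    columns_after=["text", "label", "idx", "input_ids"],  # map added "input_ids"
)
```

Here new_formatted is ["label", "input_ids"]: the column added by map is formatted, while "text" and "idx" stay unformatted.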

set_transform(transform: Optional[Callable], columns: Optional[List] = None, output_all_columns: bool = False)[source]

Set __getitem__ return format using this transform. The transform is applied on-the-fly on batches when __getitem__ is called. As with datasets.Dataset.set_format(), this can be reset using datasets.Dataset.reset_format().

Parameters
  • transform (Optional Callable) – user-defined formatting transform, replaces the format defined by datasets.Dataset.set_format(). A formatting function is a callable that takes a batch (as a dict) as input and returns a batch. This function is applied right before returning the objects in __getitem__.

  • columns (Optional List[str]) – columns to format in the output. If specified, then the input batch of the transform only contains those columns.

  • output_all_columns (bool default to False) – keep un-formatted columns as well in the output (as python objects). If set to True, then the other un-formatted columns are kept with the output of the transform.

property shape

Shape of the dataset (number of rows, number of columns).

shard(num_shards: int, index: int, contiguous: bool = False, keep_in_memory: bool = False, indices_cache_file_name: Optional[str] = None, writer_batch_size: Optional[int] = 1000) → datasets.arrow_dataset.Dataset[source]

Return the index-nth shard from dataset split into num_shards pieces.

This shards deterministically. dset.shard(n, i) will contain all elements of dset whose index mod n = i.

dset.shard(n, i, contiguous=True) will instead split dset into contiguous chunks, so it can be easily concatenated back together after processing. If len(dset) % n == l, then the first l shards will have length (len(dset) // n) + 1, and the remaining shards will have length (len(dset) // n). datasets.concatenate_datasets([dset.shard(n, i, contiguous=True) for i in range(n)]) will return a dataset with the same order as the original.

Be sure to shard before using any randomizing operator (such as shuffle). It is best if the shard operator is used early in the dataset pipeline.
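The two sharding rules can be written out as index arithmetic. The following is a sketch of the rule as described above, using a hypothetical shard_indices helper on row counts only; it is not the library code.

```python
def shard_indices(num_rows, num_shards, index, contiguous=False):
    """Return the row indices that belong to shard `index` out of `num_shards`."""
    if not 0 <= index < num_shards:
        raise ValueError("index should be in [0, num_shards - 1]")
    if contiguous:
        # The first (num_rows % num_shards) shards get one extra row.
        div, mod = divmod(num_rows, num_shards)
        start = div * index + min(index, mod)
        end = start + div + (1 if index < mod else 0)
        return list(range(start, end))
    # Default: every row whose index mod num_shards equals `index`.
    return list(range(index, num_rows, num_shards))


strided = shard_indices(10, 3, 1)
chunks = [shard_indices(10, 3, i, contiguous=True) for i in range(3)]
rejoined = [row for chunk in chunks for row in chunk]
```

With 10 rows and 3 shards, the contiguous shards have lengths 4, 3 and 3, and concatenating them restores the original order.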

Parameters
  • num_shards (int) – How many shards to split the dataset into.

  • index (int) – Which shard to select and return.

  • contiguous (bool, default False) – Whether to select contiguous blocks of indices for shards.

  • keep_in_memory (bool, default False) – Keep the dataset in memory instead of writing it to a cache file.

  • load_from_cache_file (bool, default True) – If a cache file storing the current computation from function can be identified, use it instead of recomputing.

  • indices_cache_file_name (str, optional) – Provide the name of a path for the cache file. It is used to store the indices of each shard instead of the automatically generated cache file name.

  • writer_batch_size (int, default 1000) – Number of rows per write operation for the cache file writer. This value is a good trade-off between memory usage during the processing, and processing speed. A higher value makes the processing do fewer lookups; a lower value consumes less temporary memory while running .map().

shuffle(seed: Optional[int] = None, generator: Optional[numpy.random._generator.Generator] = None, keep_in_memory: bool = False, load_from_cache_file: bool = True, indices_cache_file_name: Optional[str] = None, writer_batch_size: Optional[int] = 1000, new_fingerprint: Optional[str] = None) → datasets.arrow_dataset.Dataset[source]

Create a new Dataset where the rows are shuffled.

Currently shuffling uses NumPy random generators. You can either supply a NumPy BitGenerator to use, or a seed to initialize NumPy’s default random generator (PCG64).

Parameters
  • seed (int, optional) – A seed to initialize the default BitGenerator if generator=None. If None, then fresh, unpredictable entropy will be pulled from the OS. If an int or array_like[ints] is passed, then it will be passed to SeedSequence to derive the initial BitGenerator state.

  • generator (numpy.random.Generator, optional) – Numpy random Generator to use to compute the permutation of the dataset rows. If generator=None (default), uses np.random.default_rng (the default BitGenerator (PCG64) of NumPy).

  • keep_in_memory (bool, default False) – Keep the shuffled indices in memory instead of writing it to a cache file.

  • load_from_cache_file (bool, default True) – If a cache file storing the shuffled indices can be identified, use it instead of recomputing.

  • indices_cache_file_name (str, optional) – Provide the name of a path for the cache file. It is used to store the shuffled indices instead of the automatically generated cache file name.

  • writer_batch_size (int, default 1000) – Number of rows per write operation for the cache file writer. This value is a good trade-off between memory usage during the processing, and processing speed. A higher value makes the processing do fewer lookups; a lower value consumes less temporary memory while running .map().

  • new_fingerprint (str, optional, default None) – the new fingerprint of the dataset after transform. If None, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments

sort(column: str, reverse: bool = False, kind: str = None, keep_in_memory: bool = False, load_from_cache_file: bool = True, indices_cache_file_name: Optional[str] = None, writer_batch_size: Optional[int] = 1000, new_fingerprint: Optional[str] = None) → datasets.arrow_dataset.Dataset[source]

Create a new dataset sorted according to a column.

Currently sorting according to a column name uses numpy sorting algorithm under the hood. The column should thus be a numpy compatible type (in particular not a nested type). This also means that the column used for sorting is fully loaded in memory (which should be fine in most cases).

Parameters
  • column (str) – column name to sort by.

  • reverse (bool, default False) – If True, sort in descending order rather than ascending.

  • kind (str, optional) – Numpy algorithm for sorting selected in {‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’}. The default is ‘quicksort’. Note that both ‘stable’ and ‘mergesort’ use timsort under the covers and, in general, the actual implementation will vary with data type. The ‘mergesort’ option is retained for backwards compatibility.

  • keep_in_memory (bool, default False) – Keep the sorted indices in memory instead of writing it to a cache file.

  • load_from_cache_file (bool, default True) – If a cache file storing the sorted indices can be identified, use it instead of recomputing.

  • indices_cache_file_name (Optional[str], default None) – Provide the name of a path for the cache file. It is used to store the sorted indices instead of the automatically generated cache file name.

  • writer_batch_size (int, default 1000) – Number of rows per write operation for the cache file writer. A higher value gives smaller cache files; a lower value consumes less temporary memory.

  • new_fingerprint (Optional[str], default None) – the new fingerprint of the dataset after transform. If None, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments

property split

datasets.NamedSplit object corresponding to a named dataset split.

to_csv(path_or_buf: Union[str, bytes, os.PathLike, BinaryIO], batch_size: Optional[int] = None, **to_csv_kwargs) → int[source]

Export the dataset to CSV.

Parameters
  • path_or_buf (PathLike or FileOrBuffer) – Either a path to a file or a BinaryIO.

  • batch_size (Optional int) – Size of the batch to load in memory and write at once. Defaults to datasets.config.DEFAULT_MAX_BATCH_SIZE.

  • to_csv_kwargs – Parameters to pass to pandas’s pandas.DataFrame.to_csv()

Returns

int – The number of characters or bytes written

to_dict(batch_size: Optional[int] = None, batched: bool = False) → Union[dict, Iterator[dict]][source]

Returns the dataset as a Python dict. Can also return a generator for large datasets.

Parameters
  • batched (bool) – Set to True to return a generator that yields the dataset as batches of batch_size rows. Defaults to False (returns the whole dataset at once).

  • batch_size (Optional int) – The size (number of rows) of the batches if batched is True. Defaults to datasets.config.DEFAULT_MAX_BATCH_SIZE.

Returns

dict or Iterator[dict]

to_json(path_or_buf: Union[str, bytes, os.PathLike, BinaryIO], batch_size: Optional[int] = None, **to_json_kwargs) → int[source]

Export the dataset to JSON Lines or JSON.

Parameters
  • path_or_buf (PathLike or FileOrBuffer) – Either a path to a file or a BinaryIO.

  • batch_size (int, optional) – Size of the batch to load in memory and write at once. Defaults to datasets.config.DEFAULT_MAX_BATCH_SIZE.

  • lines (bool, default True) – Whether to output JSON Lines format. Only possible if orient="records". A ValueError is raised if orient is different from "records", since the other formats are not list-like.

  • orient (str, default "records") –

    Format of the JSON:

    • "records": list like [{column -> value}, ..., {column -> value}]

    • "split": dict like {"index" -> [index], "columns" -> [columns], "data" -> [values]}

    • "index": dict like {index -> {column -> value}}

    • "columns": dict like {column -> {index -> value}}

    • "values": just the values array

    • "table": dict like {"schema": {schema}, "data": {data}}

  • **to_json_kwargs – Parameters to pass to pandas’s pandas.DataFrame.to_json.

Returns

int – The number of characters or bytes written.
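The difference between JSON Lines and a single JSON array for orient="records" can be sketched with the standard json module. This only illustrates the layout of the output; to_records and dumps_records are hypothetical helpers, and the exact whitespace produced by pandas may differ.

```python
import json


def to_records(columns):
    """Turn a dict of equal-length lists into a list of {column -> value} rows."""
    names = list(columns)
    rows = zip(*(columns[name] for name in names))
    return [dict(zip(names, row)) for row in rows]


def dumps_records(columns, lines=True):
    """Serialize as JSON Lines (one object per line) or as one JSON array."""
    records = to_records(columns)
    if lines:
        return "".join(json.dumps(record) + "\n" for record in records)
    return json.dumps(records)


columns = {"text": ["a", "b"], "label": [0, 1]}
as_lines = dumps_records(columns, lines=True)
as_array = dumps_records(columns, lines=False)
```

With lines=True each row can be parsed independently, which is why that mode requires the list-like "records" orientation.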

to_pandas(batch_size: Optional[int] = None, batched: bool = False) → Union[pandas.core.frame.DataFrame, Iterator[pandas.core.frame.DataFrame]][source]

Returns the dataset as a pandas.DataFrame. Can also return a generator for large datasets.

Parameters
  • batched (bool) – Set to True to return a generator that yields the dataset as batches of batch_size rows. Defaults to False (returns the whole dataset at once).

  • batch_size (Optional int) – The size (number of rows) of the batches if batched is True. Defaults to datasets.config.DEFAULT_MAX_BATCH_SIZE.

Returns

pandas.DataFrame or Iterator[pandas.DataFrame]

train_test_split(test_size: Optional[Union[float, int]] = None, train_size: Optional[Union[float, int]] = None, shuffle: bool = True, seed: Optional[int] = None, generator: Optional[numpy.random._generator.Generator] = None, keep_in_memory: bool = False, load_from_cache_file: bool = True, train_indices_cache_file_name: Optional[str] = None, test_indices_cache_file_name: Optional[str] = None, writer_batch_size: Optional[int] = 1000, train_new_fingerprint: Optional[str] = None, test_new_fingerprint: Optional[str] = None) → DatasetDict[source]

Return a dictionary (datasets.DatasetDict) with two random train and test subsets (train and test Dataset splits). Splits are created from the dataset according to test_size, train_size and shuffle.

This method is similar to scikit-learn train_test_split with the omission of the stratified options.

Parameters
  • test_size (float or int, optional) – Size of the test split. If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.25.

  • train_size (float or int, optional) – Size of the train split. If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.

  • shuffle (bool, optional, default True) – Whether or not to shuffle the data before splitting.

  • seed (int, optional) – A seed to initialize the default BitGenerator if generator=None. If None, then fresh, unpredictable entropy will be pulled from the OS. If an int or array_like[ints] is passed, then it will be passed to SeedSequence to derive the initial BitGenerator state.

  • generator (numpy.random.Generator, optional) – Numpy random Generator to use to compute the permutation of the dataset rows. If generator=None (default), uses np.random.default_rng (the default BitGenerator (PCG64) of NumPy).

  • keep_in_memory (bool, default False) – Keep the splits indices in memory instead of writing it to a cache file.

  • load_from_cache_file (bool, default True) – If a cache file storing the splits indices can be identified, use it instead of recomputing.

  • train_indices_cache_file_name (str, optional) – Provide the name of a path for the cache file. It is used to store the train split indices instead of the automatically generated cache file name.

  • test_indices_cache_file_name (str, optional) – Provide the name of a path for the cache file. It is used to store the test split indices instead of the automatically generated cache file name.

  • writer_batch_size (int, default 1000) – Number of rows per write operation for the cache file writer. This value is a good trade-off between memory usage during the processing, and processing speed. A higher value makes the processing do fewer lookups; a lower value consumes less temporary memory while running .map().

  • train_new_fingerprint (str, optional, defaults to None) – the new fingerprint of the train set after transform. If None, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments

  • test_new_fingerprint (str, optional, defaults to None) – the new fingerprint of the test set after transform. If None, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments
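How test_size and train_size combine can be sketched as a small function. resolve_split_sizes is a hypothetical helper; the defaults follow the parameter descriptions above, while the rounding (ceil for the test split, floor for the train split) is an assumption borrowed from scikit-learn's convention.

```python
import math


def resolve_split_sizes(n_rows, test_size=None, train_size=None):
    """Return (n_train, n_test) from float proportions or absolute int sizes."""
    if test_size is None and train_size is None:
        test_size = 0.25  # documented default when both are None

    n_test = n_train = None
    if isinstance(test_size, float):
        n_test = math.ceil(test_size * n_rows)     # assumed rounding
    elif test_size is not None:
        n_test = int(test_size)
    if isinstance(train_size, float):
        n_train = math.floor(train_size * n_rows)  # assumed rounding
    elif train_size is not None:
        n_train = int(train_size)

    # A missing size is the complement of the other one.
    if n_train is None:
        n_train = n_rows - n_test
    elif n_test is None:
        n_test = n_rows - n_train

    if n_train + n_test > n_rows:
        raise ValueError("train and test sizes exceed the number of rows")
    return n_train, n_test


default_split = resolve_split_sizes(10)
absolute_split = resolve_split_sizes(10, test_size=2)
half_split = resolve_split_sizes(10, train_size=0.5)
```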

unique(column: str) → List[Any][source]

Return a list of the unique elements in a column.

This is implemented in the low-level backend and is therefore very fast.

Parameters

column (str) – Column name (list all the column names with datasets.Dataset.column_names).

Returns

list – List of unique elements in the given column.

with_format(type: Optional[str] = None, columns: Optional[List] = None, output_all_columns: bool = False, **format_kwargs)[source]

Set __getitem__ return format (type and columns). The data formatting is applied on-the-fly. The format type (for example “numpy”) is used to format batches when using __getitem__.

It’s also possible to use custom transforms for formatting using datasets.Dataset.with_transform().

Contrary to datasets.Dataset.set_format(), with_format returns a new Dataset object.

Parameters
  • type (Optional str) – Output type, selected from [None, ‘numpy’, ‘torch’, ‘tensorflow’, ‘pandas’, ‘arrow’]. None means __getitem__ returns python objects (default).

  • columns (Optional List[str]) – columns to format in the output. None means __getitem__ returns all columns (default).

  • output_all_columns (bool default to False) – keep un-formatted columns as well in the output (as python objects)

  • format_kwargs – keywords arguments passed to the convert function like np.array, torch.tensor or tensorflow.ragged.constant.

with_transform(transform: Optional[Callable], columns: Optional[List] = None, output_all_columns: bool = False)[source]

Set __getitem__ return format using this transform. The transform is applied on-the-fly on batches when __getitem__ is called.

As with datasets.Dataset.set_format(), this can be reset using datasets.Dataset.reset_format().

Contrary to datasets.Dataset.set_transform(), with_transform returns a new Dataset object.

Parameters
  • transform (Optional Callable) – user-defined formatting transform, replaces the format defined by datasets.Dataset.set_format(). A formatting function is a callable that takes a batch (as a dict) as input and returns a batch. This function is applied right before returning the objects in __getitem__.

  • columns (Optional List[str]) – columns to format in the output. If specified, then the input batch of the transform only contains those columns.

  • output_all_columns (bool default to False) – keep un-formatted columns as well in the output (as python objects). If set to True, then the other un-formatted columns are kept with the output of the transform.

datasets.concatenate_datasets(dsets: List[datasets.arrow_dataset.Dataset], info: Optional[Any] = None, split: Optional[Any] = None, axis: int = 0)[source]

Converts a list of Dataset with the same schema into a single Dataset.

Parameters
  • dsets (List[datasets.Dataset]) – List of Datasets to concatenate.

  • info (DatasetInfo, optional) – Dataset information, like description, citation, etc.

  • split (NamedSplit, optional) – Name of the dataset split.

  • axis ({0, 1}, default 0, meaning over rows) –

    Axis to concatenate over, where 0 means over rows (vertically) and 1 means over columns (horizontally).

    New in version 1.6.0.
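The meaning of the two axis values can be sketched on plain dicts of lists. concatenate_columns is a hypothetical helper that only models the row-wise versus column-wise behaviour described above; the validation it performs is an assumption, not the library's exact checks.

```python
def concatenate_columns(tables, axis=0):
    """Concatenate dict-of-lists tables over rows (axis=0) or columns (axis=1)."""
    if axis == 0:
        names = list(tables[0])
        for table in tables:
            if list(table) != names:
                raise ValueError("all tables need the same columns for axis=0")
        # Stack vertically: each column gets the rows of every table in order.
        return {name: [v for table in tables for v in table[name]] for name in names}
    if axis == 1:
        n_rows = len(next(iter(tables[0].values())))
        merged = {}
        for table in tables:
            for name, values in table.items():
                if name in merged:
                    raise ValueError(f"duplicate column name: {name}")
                if len(values) != n_rows:
                    raise ValueError("all tables need the same number of rows for axis=1")
                # Stack horizontally: columns are placed side by side.
                merged[name] = list(values)
        return merged
    raise ValueError("axis must be 0 or 1")


rows = concatenate_columns([{"text": ["a"]}, {"text": ["b", "c"]}], axis=0)
cols = concatenate_columns([{"text": ["a", "b"]}, {"label": [0, 1]}], axis=1)
```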

datasets.set_caching_enabled(boolean: bool)[source]

When applying transforms on a dataset, the data are stored in cache files. The caching mechanism allows reloading an existing cache file if it has already been computed.

Reloading a dataset is possible since the cache files are named using the dataset fingerprint, which is updated after each transform.

If disabled, the library will no longer reload cached datasets files when applying transforms to the datasets. More precisely, if the caching is disabled:

  • cache files are always recreated

  • cache files are written to a temporary directory that is deleted when session closes

  • cache files are named using a random hash instead of the dataset fingerprint

  • use datasets.Dataset.save_to_disk() to save a transformed dataset or it will be deleted when session closes

  • caching doesn’t affect datasets.load_dataset(). If you want to regenerate a dataset from scratch you should use the download_mode parameter in datasets.load_dataset().

datasets.is_caching_enabled() → bool[source]

When applying transforms on a dataset, the data are stored in cache files. The caching mechanism allows reloading an existing cache file if it has already been computed.

Reloading a dataset is possible since the cache files are named using the dataset fingerprint, which is updated after each transform.

If disabled, the library will no longer reload cached datasets files when applying transforms to the datasets. More precisely, if the caching is disabled:

  • cache files are always recreated

  • cache files are written to a temporary directory that is deleted when session closes

  • cache files are named using a random hash instead of the dataset fingerprint

  • use datasets.Dataset.save_to_disk() to save a transformed dataset or it will be deleted when session closes

  • caching doesn’t affect datasets.load_dataset(). If you want to regenerate a dataset from scratch you should use the download_mode parameter in datasets.load_dataset().

DatasetDict

Dictionary with split names as keys (‘train’, ‘test’ for example), and datasets.Dataset objects as values. It also has dataset transform methods like map or filter, to process all the splits at once.

class datasets.DatasetDict[source]

A dictionary (dict of str: datasets.Dataset) with dataset transform methods (map, filter, etc.).

property cache_files

The cache files containing the Apache Arrow table backing each split.

cast(features: datasets.features.Features)[source]

Cast the dataset to a new set of features. The transformation is applied to all the datasets of the dataset dictionary.

You can also cast the features using Dataset.map() with the features argument, but cast_() is in-place (doesn’t copy the data to a new dataset) and is thus faster.

Parameters

features (datasets.Features) – New features to cast the dataset to. The name and order of the fields in the features must match the current column names. The type of the data must also be convertible from one type to the other. For non-trivial conversion, e.g. string <-> ClassLabel you should use map() to update the Dataset.

cast_(features: datasets.features.Features)[source]

In-place version of DatasetDict.cast().

Deprecated since version 1.4.0: Use DatasetDict.cast() instead.

Parameters

features (datasets.Features) – New features to cast the dataset to. The name and order of the fields in the features must match the current column names. The type of the data must also be convertible from one type to the other. For non-trivial conversion, e.g. string <-> ClassLabel you should use map() to update the Dataset.

class_encode_column(column: str)[source]

Casts the given column as datasets.features.ClassLabel and updates the tables.

Parameters

column (str) – The name of the column to cast

cleanup_cache_files() → Dict[str, int][source]

Clean up all cache files in the dataset cache directory, except the currently used cache file if there is one. Be careful when running this command that no other process is currently using other cache files.

Returns

Dict with the number of removed files for each split

property column_names

Names of the columns in each split of the dataset.

property data

The Apache Arrow tables backing each split.

filter(function, with_indices=False, input_columns: Optional[Union[str, List[str]]] = None, batch_size: Optional[int] = 1000, remove_columns: Optional[List[str]] = None, keep_in_memory: bool = False, load_from_cache_file: bool = True, cache_file_names: Optional[Dict[str, Optional[str]]] = None, writer_batch_size: Optional[int] = 1000, fn_kwargs: Optional[dict] = None, num_proc: Optional[int] = None) → datasets.dataset_dict.DatasetDict[source]

Apply a filter function to all the elements in the table in batches and update the table so that the dataset only includes the examples for which the filter function returns True. The transformation is applied to all the datasets of the dataset dictionary.

Parameters
  • function (callable) – A callable with one of the following signatures:
      - function(example: Dict) -> bool if with_indices=False
      - function(example: Dict, indices: int) -> bool if with_indices=True

  • with_indices (bool, defaults to False) – Provide example indices to function. Note that in this case the signature of function should be def function(example, idx): ….

  • input_columns (Optional[Union[str, List[str]]], defaults to None) – The columns to be passed into function as positional arguments. If None, a dict mapping to all formatted columns is passed as one argument.

  • batch_size (Optional[int], defaults to 1000) – Number of examples per batch provided to function if batched=True. If batch_size <= 0 or batch_size is None, the full dataset is provided to function as a single batch.

  • remove_columns (Optional[List[str]], defaults to None) – Remove a selection of columns while doing the mapping. Columns will be removed before updating the examples with the output of function, i.e. if function is adding columns with names in remove_columns, these columns will be kept.

  • keep_in_memory (bool, defaults to False) – Keep the dataset in memory instead of writing it to a cache file.

  • load_from_cache_file (bool, defaults to True) – If a cache file storing the current computation from function can be identified, use it instead of recomputing.

  • cache_file_names (Optional[Dict[str, str]], defaults to None) – Provide the name of a path for the cache file. It is used to store the results of the computation instead of the automatically generated cache file name. You have to provide one cache_file_name per dataset in the dataset dictionary.

  • writer_batch_size (int, default 1000) – Number of rows per write operation for the cache file writer. This value is a good trade-off between memory usage during the processing, and processing speed. A higher value makes the processing do fewer lookups; a lower value consumes less temporary memory while running .map().

  • fn_kwargs (Optional[Dict], defaults to None) – Keyword arguments to be passed to function

  • num_proc (Optional[int], defaults to None) – Number of processes for multiprocessing. By default it doesn’t use multiprocessing.
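As a minimal sketch of the function signatures listed above, the predicates below are written as plain Python functions and exercised on a list of dicts rather than on a real DatasetDict. The column names "text" and "label", the helper names, and the `wanted` argument are assumptions made for illustration.

```python
from typing import Any, Dict


def has_long_text(example: Dict[str, Any]) -> bool:
    # Signature for with_indices=False: receives one example as a dict.
    return len(example["text"]) >= 5


def is_even_row(example: Dict[str, Any], idx: int) -> bool:
    # Signature for with_indices=True: also receives the example index.
    return idx % 2 == 0


def has_label(label: int, wanted: int = 1) -> bool:
    # With input_columns="label", the column value arrives as a positional
    # argument; `wanted` could be supplied through fn_kwargs.
    return label == wanted


# Plain-Python stand-in for one split, to exercise the predicates directly.
rows = [
    {"text": "short", "label": 1},
    {"text": "tiny", "label": 0},
    {"text": "a longer sentence", "label": 1},
]
kept_long = [r for r in rows if has_long_text(r)]
kept_even = [r for i, r in enumerate(rows) if is_even_row(r, i)]
kept_label = [r for r in rows if has_label(r["label"], wanted=0)]
```

With a real dataset dictionary (here a hypothetical `dataset_dict`), the same functions would be passed as `dataset_dict.filter(has_long_text)`, `dataset_dict.filter(is_even_row, with_indices=True)`, or `dataset_dict.filter(has_label, input_columns="label", fn_kwargs={"wanted": 0})`.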

flatten(max_depth=16)[source]

Flatten the Apache Arrow Table of each split (nested features are flattened). Each column with a struct type is flattened into one column per struct field. Other columns are left unchanged.

flatten_(max_depth=16)[source]

In-place version of DatasetDict.flatten().

Deprecated since version 1.4.0: Use DatasetDict.flatten() instead.

formatted_as(type: Optional[str] = None, columns: Optional[List] = None, output_all_columns: bool = False, **format_kwargs)[source]

To be used in a with statement. Set the __getitem__ return format (type and columns). The transformation is applied to all the datasets of the dataset dictionary.

Parameters
  • type (Optional str) – Output type selected in [None, ‘numpy’, ‘torch’, ‘tensorflow’, ‘pandas’, ‘arrow’]. None means __getitem__ returns python objects (default).

  • columns (Optional List[str]) – Columns to format in the output. None means __getitem__ returns all columns (default).

  • output_all_columns (bool, defaults to False) – Keep un-formatted columns as well in the output (as python objects).

  • format_kwargs – Keyword arguments passed to the convert function like np.array, torch.tensor or tensorflow.ragged.constant.

static from_csv(path_or_paths: Dict[str, Union[str, bytes, os.PathLike]], features: Optional[datasets.features.Features] = None, cache_dir: str = None, keep_in_memory: bool = False, **kwargs)[source]

Create DatasetDict from CSV file(s).

Parameters
  • path_or_paths (dict of path-like) – Path(s) of the CSV file(s).

  • features (Features, optional) – Dataset features.

  • cache_dir (str, optional, default="~/datasets") – Directory to cache data.

  • keep_in_memory (bool, default=False) – Whether to copy the data in-memory.

  • **kwargs – Keyword arguments to be passed to pandas.read_csv().

Returns

DatasetDict

static from_json(path_or_paths: Dict[str, Union[str, bytes, os.PathLike]], features: Optional[datasets.features.Features] = None, cache_dir: str = None, keep_in_memory: bool = False, **kwargs)[source]

Create DatasetDict from JSON Lines file(s).

Parameters
  • path_or_paths (dict of path-like) – Path(s) of the JSON Lines file(s).

  • features (Features, optional) – Dataset features.

  • cache_dir (str, optional, default="~/datasets") – Directory to cache data.

  • keep_in_memory (bool, default=False) – Whether to copy the data in-memory.

  • **kwargs – Keyword arguments to be passed to JsonConfig.

Returns

DatasetDict

static from_text(path_or_paths: Dict[str, Union[str, bytes, os.PathLike]], features: Optional[datasets.features.Features] = None, cache_dir: str = None, keep_in_memory: bool = False, **kwargs)[source]

Create DatasetDict from text file(s).

Parameters
  • path_or_paths (dict of path-like) – Path(s) of the text file(s).

  • features (Features, optional) – Dataset features.

  • cache_dir (str, optional, default="~/datasets") – Directory to cache data.

  • keep_in_memory (bool, default=False) – Whether to copy the data in-memory.

  • **kwargs – Keyword arguments to be passed to TextConfig.

Returns

DatasetDict

static load_from_disk(dataset_dict_path: str, fs=None, keep_in_memory: Optional[bool] = None) → datasets.dataset_dict.DatasetDict[source]

Load a dataset that was previously saved using save_to_disk() from a filesystem using either S3FileSystem or fsspec.spec.AbstractFileSystem.

Parameters
  • dataset_dict_path (str) – Path (e.g. "dataset/train") or remote URI (e.g. "s3://my-bucket/dataset/train") of the dataset dict directory where the dataset dict will be loaded from.

  • fs (S3FileSystem or fsspec.spec.AbstractFileSystem, optional, default None) – Instance of the remote filesystem used to download the files from.

  • keep_in_memory (bool, default None) – Whether to copy the dataset in-memory. If None, the dataset will not be copied in-memory unless explicitly enabled by setting datasets.config.IN_MEMORY_MAX_SIZE to nonzero. See more details in the Enhancing performance section.

Returns

DatasetDict

map(function, with_indices: bool = False, input_columns: Optional[Union[str, List[str]]] = None, batched: bool = False, batch_size: Optional[int] = 1000, remove_columns: Optional[List[str]] = None, keep_in_memory: bool = False, load_from_cache_file: bool = True, cache_file_names: Optional[Dict[str, Optional[str]]] = None, writer_batch_size: Optional[int] = 1000, features: Optional[datasets.features.Features] = None, disable_nullable: bool = False, fn_kwargs: Optional[dict] = None, num_proc: Optional[int] = None, desc: Optional[str] = None) → datasets.dataset_dict.DatasetDict[source]

Apply a function to all the elements in the table (individually or in batches) and update the table (if function does update examples). The transformation is applied to all the datasets of the dataset dictionary.

Parameters
  • function (callable) – A callable with one of the following signatures:
      - function(example: Dict) -> Union[Dict, Any] if batched=False and with_indices=False
      - function(example: Dict, indices: int) -> Union[Dict, Any] if batched=False and with_indices=True
      - function(batch: Dict[List]) -> Union[Dict, Any] if batched=True and with_indices=False
      - function(batch: Dict[List], indices: List[int]) -> Union[Dict, Any] if batched=True and with_indices=True

  • with_indices (bool, defaults to False) – Provide example indices to function. Note that in this case the signature of function should be def function(example, idx): ….

  • input_columns (Optional[Union[str, List[str]]], defaults to None) – The columns to be passed into function as positional arguments. If None, a dict mapping to all formatted columns is passed as one argument.

  • batched (bool, defaults to False) – Provide batch of examples to function

  • batch_size (Optional[int], defaults to 1000) – Number of examples per batch provided to function if batched=True. If batch_size <= 0 or batch_size is None, the full dataset is provided to function as a single batch.

  • remove_columns (Optional[List[str]], defaults to None) – Remove a selection of columns while doing the mapping. Columns will be removed before updating the examples with the output of function, i.e. if function is adding columns with names in remove_columns, these columns will be kept.

  • keep_in_memory (bool, defaults to False) – Keep the dataset in memory instead of writing it to a cache file.

  • load_from_cache_file (bool, defaults to True) – If a cache file storing the current computation from function can be identified, use it instead of recomputing.

  • cache_file_names (Optional[Dict[str, str]], defaults to None) – Provide the name of a path for the cache file. It is used to store the results of the computation instead of the automatically generated cache file name. You have to provide one cache_file_name per dataset in the dataset dictionary.

  • writer_batch_size (int, default 1000) – Number of rows per write operation for the cache file writer. This value is a good trade-off between memory usage during the processing, and processing speed. A higher value makes the processing do fewer lookups; a lower value consumes less temporary memory while running .map().

  • features (Optional[datasets.Features], defaults to None) – Use a specific Features to store the cache file instead of the automatically generated one.

  • disable_nullable (bool, defaults to False) – Disallow null values in the table.

  • fn_kwargs (Optional[Dict], defaults to None) – Keyword arguments to be passed to function

  • num_proc (Optional[int], defaults to None) – Number of processes for multiprocessing. By default it doesn’t use multiprocessing.

  • desc (Optional[str], defaults to None) – Meaningful description to be displayed alongside the progress bar while mapping examples.
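The difference between the per-example and batched signatures listed above can be sketched with plain functions. The example below calls them directly on Python dicts instead of a real DatasetDict; the column names "text" and "label" and the function names are assumptions, and the dict merge only models the idea that returned columns update the existing ones.

```python
from typing import Any, Dict, List


def add_length(example: Dict[str, Any]) -> Dict[str, Any]:
    # batched=False, with_indices=False: one example in, new fields out.
    return {"length": len(example["text"])}


def add_length_batched(batch: Dict[str, List[Any]]) -> Dict[str, List[Any]]:
    # batched=True, with_indices=False: each column is a list of values.
    return {"length": [len(text) for text in batch["text"]]}


def add_row_id_batched(batch: Dict[str, List[Any]], indices: List[int]) -> Dict[str, List[Any]]:
    # batched=True, with_indices=True: one index per example in the batch.
    return {"row_id": [f"row-{i}" for i in indices]}


batch = {"text": ["hello", "hi"], "label": [0, 1]}
# Modelling the update step: returned columns are merged into the batch.
updated = {**batch, **add_length_batched(batch)}
with_ids = {**updated, **add_row_id_batched(batch, [0, 1])}
single = add_length({"text": "hello", "label": 0})
```

With a real dataset dictionary (a hypothetical `dataset_dict`), these would be passed as `dataset_dict.map(add_length)`, `dataset_dict.map(add_length_batched, batched=True)`, and `dataset_dict.map(add_row_id_batched, batched=True, with_indices=True)`.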

property num_columns

Number of columns in each split of the dataset.

property num_rows

Number of rows in each split of the dataset (same as datasets.Dataset.__len__()).

prepare_for_task(task: Union[str, datasets.tasks.base.TaskTemplate])[source]

Prepare a dataset for the given task by casting the dataset’s Features to standardized column names and types as detailed in datasets.tasks.

Casts datasets.DatasetInfo.features according to a task-specific schema. Intended for single-use only, so all task templates are removed from datasets.DatasetInfo.task_templates after casting.

Parameters

task (Union[str, TaskTemplate]) –

The task to prepare the dataset for during training and evaluation. If str, supported tasks include:

  • "text-classification"

  • "question-answering"

If TaskTemplate, must be one of the task templates in datasets.tasks.

remove_columns(column_names: Union[str, List[str]])[source]

Remove one or several column(s) from each split in the dataset and the features associated to the column(s).

The transformation is applied to all the splits of the dataset dictionary.

You can also remove a column using Dataset.map() with remove_columns but the present method is in-place (doesn’t copy the data to a new dataset) and is thus faster.

Parameters

column_names (Union[str, List[str]]) – Name of the column(s) to remove.

remove_columns_(column_names: Union[str, List[str]])[source]

In-place version of DatasetDict.remove_columns().

Deprecated since version 1.4.0: Use DatasetDict.remove_columns() instead.

Parameters

column_names (Union[str, List[str]]) – Name of the column(s) to remove.

rename_column(original_column_name: str, new_column_name: str)[source]

Rename a column in the dataset and move the features associated to the original column under the new column name. The transformation is applied to all the datasets of the dataset dictionary.

You can also rename a column using Dataset.map() with remove_columns but the present method:
  • takes care of moving the original features under the new column name.

  • doesn’t copy the data to a new dataset and is thus much faster.

Parameters
  • original_column_name (str) – Name of the column to rename.

  • new_column_name (str) – New name for the column.

rename_column_(original_column_name: str, new_column_name: str)[source]

In-place version of DatasetDict.rename_column().

Deprecated since version 1.4.0: Use DatasetDict.rename_column() instead.

Parameters
  • original_column_name (str) – Name of the column to rename.

  • new_column_name (str) – New name for the column.

reset_format()[source]

Reset __getitem__ return format to python objects and all columns. The transformation is applied to all the datasets of the dataset dictionary.

Same as self.set_format()

save_to_disk(dataset_dict_path: str, fs=None)[source]

Saves a dataset dict to a filesystem using either S3FileSystem or fsspec.spec.AbstractFileSystem.

Parameters
  • dataset_dict_path (str) – Path (e.g. dataset/train) or remote URI (e.g. s3://my-bucket/dataset/train) of the dataset dict directory where the dataset dict will be saved to.

  • fs (S3FileSystem, fsspec.spec.AbstractFileSystem, optional, defaults None) – Instance of the remote filesystem used to upload the files to.

set_format(type: Optional[str] = None, columns: Optional[List] = None, output_all_columns: bool = False, **format_kwargs)[source]

Set the __getitem__ return format (type and columns). The format is set for every dataset in the dataset dictionary.

Parameters
  • type (Optional str) – Output type selected in [None, ‘numpy’, ‘torch’, ‘tensorflow’, ‘pandas’, ‘arrow’]. None means __getitem__ returns python objects (default).

  • columns (Optional List[str]) – Columns to format in the output. None means __getitem__ returns all columns (default).

  • output_all_columns (bool, defaults to False) – Keep un-formatted columns as well in the output (as python objects).

  • format_kwargs – Keyword arguments passed to the convert function like np.array, torch.tensor or tensorflow.ragged.constant.

It is possible to call map after calling set_format. Since map may add new columns, the list of formatted columns gets updated. In this case, if you apply map on a dataset to add a new column, then this column will be formatted:

new formatted columns = (all columns - previously unformatted columns)
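The rule above is plain set arithmetic. The following sketch spells it out on column-name lists; `formatted_columns_after_map` and the column names are hypothetical and only illustrate the formula, not the library's internals.

```python
def formatted_columns_after_map(columns_before, formatted_before, columns_after):
    """Apply: new formatted columns = all columns - previously unformatted columns."""
    previously_unformatted = set(columns_before) - set(formatted_before)
    # Keep the column order of the dataset after map.
    return [name for name in columns_after if name not in previously_unformatted]


columns_before = ["text", "label", "input_ids"]
formatted_before = ["label", "input_ids"]  # e.g. chosen through set_format(columns=...)
columns_after = ["text", "label", "input_ids", "attention_mask"]  # map added a column

new_formatted = formatted_columns_after_map(columns_before, formatted_before, columns_after)
```

Here "text" stays unformatted because it was unformatted before, while the newly added "attention_mask" joins the formatted columns.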

property shape

Shape of each split of the dataset (number of rows, number of columns).

shuffle(seeds: Optional[Union[int, Dict[str, Optional[int]]]] = None, seed: Optional[int] = None, generators: Optional[Dict[str, numpy.random._generator.Generator]] = None, keep_in_memory: bool = False, load_from_cache_file: bool = True, indices_cache_file_names: Optional[Dict[str, Optional[str]]] = None, writer_batch_size: Optional[int] = 1000)[source]

Create a new Dataset where the rows are shuffled.

The transformation is applied to all the datasets of the dataset dictionary.

Currently shuffling uses numpy random generators. You can either supply a NumPy BitGenerator to use, or a seed to initiate NumPy’s default random generator (PCG64).

Parameters
  • seeds (Dict[str, int] or int, optional) – A seed to initialize the default BitGenerator if generators=None. If None, then fresh, unpredictable entropy will be pulled from the OS. If an int or array_like[ints] is passed, then it will be passed to SeedSequence to derive the initial BitGenerator state. You can provide one seed per dataset in the dataset dictionary.

  • seed (Optional int) – A seed to initialize the default BitGenerator if generators=None. Alias for seeds (the seed argument has priority over seeds if both arguments are provided).

  • generators (Optional Dict[str, np.random.Generator]) – NumPy random Generator to use to compute the permutation of the dataset rows. If generators=None (default), uses np.random.default_rng (the default BitGenerator (PCG64) of NumPy). You have to provide one generator per dataset in the dataset dictionary.

  • keep_in_memory (bool, defaults to False) – Keep the dataset in memory instead of writing it to a cache file.

  • load_from_cache_file (bool, defaults to True) – If a cache file storing the current computation can be identified, use it instead of recomputing.

  • indices_cache_file_names (Dict[str, str], optional) – Provide the name of a path for the cache file. It is used to store the indices mappings instead of the automatically generated cache file name. You have to provide one cache_file_name per dataset in the dataset dictionary.

  • writer_batch_size (int, default 1000) – Number of rows per write operation for the cache file writer. This value is a good trade-off between memory usage during the processing, and processing speed. A higher value makes the processing do fewer lookups; a lower value consumes less temporary memory while running .map().

sort(column: str, reverse: bool = False, kind: str = None, keep_in_memory: bool = False, load_from_cache_file: bool = True, indices_cache_file_names: Optional[Dict[str, Optional[str]]] = None, writer_batch_size: Optional[int] = 1000) → datasets.dataset_dict.DatasetDict[source]

Create a new dataset sorted according to a column. The transformation is applied to all the datasets of the dataset dictionary.

Currently sorting according to a column name uses numpy sorting algorithm under the hood. The column should thus be a numpy compatible type (in particular not a nested type). This also means that the column used for sorting is fully loaded in memory (which should be fine in most cases).

Parameters
  • column (str) – column name to sort by.

  • reverse (bool, defaults to False) – If True, sort in descending order rather than ascending.

  • kind (Optional str) – NumPy algorithm for sorting, selected in {‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’}. The default is ‘quicksort’. Note that both ‘stable’ and ‘mergesort’ use timsort under the covers and, in general, the actual implementation will vary with data type. The ‘mergesort’ option is retained for backwards compatibility.

  • keep_in_memory (bool, defaults to False) – Keep the dataset in memory instead of writing it to a cache file.

  • load_from_cache_file (bool, defaults to True) – If a cache file storing the current computation can be identified, use it instead of recomputing.

  • indices_cache_file_names (Optional[Dict[str, str]], defaults to None) – Provide the name of a path for the cache file. It is used to store the indices mapping instead of the automatically generated cache file name. You have to provide one cache_file_name per dataset in the dataset dictionary.

  • writer_batch_size (int, default 1000) – Number of rows per write operation for the cache file writer. This value is a good trade-off between memory usage during the processing, and processing speed. A higher value makes the processing do fewer lookups; a lower value consumes less temporary memory while running .map().
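The parameters above mention an indices mapping. A rough way to picture it: sorting one column yields a list of row positions, and that list is then used to read every column in the new order. The sketch below uses Python's built-in sorted instead of NumPy, and the helper names and sample splits are assumptions rather than the library's code.

```python
def sort_indices(column_values, reverse=False):
    """Return row positions ordered by the values of a single column."""
    return sorted(range(len(column_values)), key=column_values.__getitem__, reverse=reverse)


def take(columns, indices):
    """Read every column through the indices mapping."""
    return {name: [values[i] for i in indices] for name, values in columns.items()}


# Plain-Python stand-in for a dataset dictionary: split name -> columns.
splits = {
    "train": {"score": [3, 1, 2], "text": ["c", "a", "b"]},
    "test": {"score": [10, 30], "text": ["x", "y"]},
}

# The same sort is applied to each split independently.
ascending = {name: take(cols, sort_indices(cols["score"])) for name, cols in splits.items()}
descending = {
    name: take(cols, sort_indices(cols["score"], reverse=True)) for name, cols in splits.items()
}
```

Only the sort column has to be examined to build the mapping, which matches the note that the column used for sorting is loaded in memory.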

unique(column: str) → Dict[str, List[Any]][source]

Return a list of the unique elements in a column for each split.

This is implemented in the low-level backend and is therefore very fast.

Parameters

column (str) – Column name (list all the column names with datasets.Dataset.column_names).

Returns

Dict[str, list] – Dictionary of unique elements in the given column.
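To make the shape of the return value concrete, here is a plain-Python sketch of a per-split unique. It does not use the Arrow backend, and `unique_per_split`, the sample splits, and the first-seen ordering of the results are assumptions for illustration.

```python
def unique_per_split(splits, column):
    """Return {split name: unique values of `column`} for plain-Python columns."""
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return {name: list(dict.fromkeys(table[column])) for name, table in splits.items()}


splits = {
    "train": {"label": [1, 0, 1, 1, 2]},
    "test": {"label": [0, 0]},
}
uniques = unique_per_split(splits, "label")
```

The result has one entry per split, so a value missing from one split (here label 2 in "test") is easy to spot.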

with_format(type: Optional[str] = None, columns: Optional[List] = None, output_all_columns: bool = False, **format_kwargs)[source]

Set the __getitem__ return format (type and columns). The data formatting is applied on-the-fly. The format type (for example “numpy”) is used to format batches when using __getitem__. The format is set for every dataset in the dataset dictionary.

It’s also possible to use custom transforms for formatting using datasets.Dataset.with_transform().

Contrary to datasets.DatasetDict.set_format(), with_format returns a new DatasetDict object with new Dataset objects.

Parameters
  • type (Optional str) – Output type selected in [None, ‘numpy’, ‘torch’, ‘tensorflow’, ‘pandas’, ‘arrow’]. None means __getitem__ returns python objects (default).

  • columns (Optional List[str]) – Columns to format in the output. None means __getitem__ returns all columns (default).

  • output_all_columns (bool, defaults to False) – Keep un-formatted columns as well in the output (as python objects).

  • format_kwargs – Keyword arguments passed to the convert function like np.array, torch.tensor or tensorflow.ragged.constant.

with_transform(transform: Optional[Callable], columns: Optional[List] = None, output_all_columns: bool = False)[source]

Set __getitem__ return format using this transform. The transform is applied on-the-fly on batches when __getitem__ is called. The transform is set for every dataset in the dataset dictionary.

As with datasets.Dataset.set_format(), this can be reset using datasets.Dataset.reset_format().

Contrary to datasets.DatasetDict.set_transform(), with_transform returns a new DatasetDict object with new Dataset objects.

Parameters
  • transform (Optional Callable) – User-defined formatting transform that replaces the format defined by datasets.Dataset.set_format(). A formatting function is a callable that takes a batch (as a dict) as input and returns a batch. This function is applied right before returning the objects in __getitem__.

  • columns (Optional List[str]) – Columns to format in the output. If specified, then the input batch of the transform only contains those columns.

  • output_all_columns (bool, defaults to False) – Keep un-formatted columns as well in the output (as python objects). If set to True, then the other un-formatted columns are kept with the output of the transform.
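A formatting transform is just a function from a batch dict to a batch dict. The sketch below shows one such function together with a small model of how columns and output_all_columns are described above; `apply_transform` is a hypothetical helper written for this illustration, not a library function, and the column names are assumptions.

```python
def uppercase_text(batch):
    """Formatting transform: takes a batch (dict of lists) and returns a batch."""
    return {"text": [text.upper() for text in batch["text"]]}


def apply_transform(batch, transform, columns=None, output_all_columns=False):
    """Model of the documented behaviour of columns and output_all_columns."""
    # If columns is given, the transform only sees those columns.
    selected = batch if columns is None else {name: batch[name] for name in columns}
    output = transform(selected)
    if output_all_columns:
        # The other, un-formatted columns are kept next to the transform output.
        rest = {name: values for name, values in batch.items() if name not in selected}
        output = {**rest, **output}
    return output


batch = {"text": ["hello", "world"], "label": [0, 1]}
only_text = apply_transform(batch, uppercase_text, columns=["text"])
everything = apply_transform(batch, uppercase_text, columns=["text"], output_all_columns=True)
```

With a real dataset dictionary (a hypothetical `dataset_dict`), the transform itself would be passed as `dataset_dict.with_transform(uppercase_text, columns=["text"])`.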

Features

class datasets.Features[source]
copy() → a shallow copy of D[source]
reorder_fields_as(other: datasets.features.Features) → datasets.features.Features[source]

The order of the fields is important since it matters for the underlying arrow data. This method is used to re-order your features to match the field order of other features.

Re-ordering the fields makes it possible to have the underlying arrow data types match.

Example:

>>> from datasets import Features, Sequence, Value
>>> # let's say we have two features with a different order of nested fields (for a and b for example)
>>> f1 = Features({"root": Sequence({"a": Value("string"), "b": Value("string")})})
>>> f2 = Features({"root": {"b": Sequence(Value("string")), "a": Sequence(Value("string"))}})
>>> assert f1.type != f2.type
>>> # re-ordering keeps the base structure (here Sequence is defined at the root level), but makes the field order match
>>> f1.reorder_fields_as(f2)
{'root': Sequence(feature={'b': Value(dtype='string', id=None), 'a': Value(dtype='string', id=None)}, length=-1, id=None)}
>>> assert f1.reorder_fields_as(f2).type == f2.type
class datasets.Sequence(feature: Any, length: int = - 1, id: Optional[str] = None)[source]

Construct a list of features from a single type or a dict of types. Mostly here for compatibility with tfds.

class datasets.ClassLabel(num_classes: int = None, names: List[str] = None, names_file: Optional[str] = None, id: Optional[str] = None)[source]

Handle integer class labels. Here for compatibility with tfds.

There are 3 ways to define a ClassLabel, which correspond to the 3 arguments:
  • num_classes: create 0 to (num_classes-1) labels

  • names: a list of label strings

  • names_file: a file containing the list of labels.

Note: On python2, the strings are encoded as utf-8.

Parameters
  • num_classes (int) – Number of classes. All labels must be < num_classes.

  • names (list<str>) – String names for the integer classes. The order in which the names are provided is kept.

  • names_file (str) – Path to a file with names for the integer classes, one per line.

int2str(values: Union[int, collections.abc.Iterable])[source]

Conversion integer => class name string.

str2int(values: collections.abc.Iterable)[source]

Conversion class name string => integer.
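The two conversions are inverse lookups between a list of names and their positions. The toy class below is not datasets.ClassLabel; `ToyClassLabel` is a hypothetical stand-in that only models the names and num_classes arguments, and naming the labels "0" to "num_classes-1" when only num_classes is given is an assumption.

```python
class ToyClassLabel:
    """Hypothetical stand-in modelling the int <-> str label mapping."""

    def __init__(self, num_classes=None, names=None):
        if names is None:
            if num_classes is None:
                raise ValueError("provide num_classes or names")
            # Assumption: labels are named "0", "1", ... when no names are given.
            names = [str(i) for i in range(num_classes)]
        self.names = list(names)  # the order of the names is kept
        self.num_classes = len(self.names)
        self._str2int = {name: i for i, name in enumerate(self.names)}

    def int2str(self, values):
        if isinstance(values, int):
            return self._name(values)
        return [self._name(v) for v in values]

    def str2int(self, values):
        if isinstance(values, str):
            return self._str2int[values]
        return [self._str2int[v] for v in values]

    def _name(self, value):
        # All labels must be < num_classes (and non-negative).
        if not 0 <= value < self.num_classes:
            raise ValueError(f"invalid label {value}")
        return self.names[value]


label = ToyClassLabel(names=["negative", "positive"])
as_names = label.int2str([1, 0, 1])
as_ints = label.str2int(["positive", "negative"])
numbered = ToyClassLabel(num_classes=3)
```

Both methods accept either a single value or an iterable of values, mirroring the signatures shown above.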

class datasets.Value(dtype: str, id: Optional[str] = None)[source]

The Value dtypes are as follows:

null bool int8 int16 int32 int64 uint8 uint16 uint32 uint64 float16 float32 (alias float) float64 (alias double) timestamp[(s|ms|us|ns)] timestamp[(s|ms|us|ns), tz=(tzstring)] binary large_binary string large_string

class datasets.Translation(languages: List[str], id: Optional[str] = None)[source]

FeatureConnector for translations with fixed languages per example. Here for compatibility with tfds.

Input: The Translation feature accepts a dictionary for each example mapping string language codes to string translations.

Output: A dictionary mapping string language codes to translations as Text features.

Example:

# At construction time:

datasets.features.Translation(languages=['en', 'fr', 'de'])

# During data generation:

yield {
        'en': 'the cat',
        'fr': 'le chat',
        'de': 'die katze'
}
class datasets.TranslationVariableLanguages(languages: Optional[List] = None, num_languages: Optional[int] = None, id: Optional[str] = None)[source]

FeatureConnector for translations with variable languages per example. Here for compatibility with tfds.

Input: The TranslationVariableLanguages feature accepts a dictionary for each example mapping string language codes to one or more string translations. The languages present may vary from example to example.

Output:
  • language: variable-length 1D tf.Tensor of tf.string language codes, sorted in ascending order.

  • translation: variable-length 1D tf.Tensor of tf.string plain text translations, sorted to align with language codes.

Example:

# At construction time:

datasets.features.TranslationVariableLanguages(languages=['en', 'fr', 'de'])

# During data generation:

yield {
        'en': 'the cat',
        'fr': ['le chat', 'la chatte'],
        'de': 'die katze'
}

# Tensor returned :

{
        'language': ['de', 'en', 'fr', 'fr'],
        'translation': ['die katze', 'the cat', 'la chatte', 'le chat'],
}
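The input-to-output conversion described for TranslationVariableLanguages can be sketched in a few lines: expand each language into (language, translation) pairs, sort the pairs, and split them into two aligned lists. `encode_variable_languages` is a hypothetical helper written for this illustration, and breaking ties between translations of the same language by sorting the translation text is an assumption.

```python
def encode_variable_languages(translations):
    """Flatten {language: text or list of texts} into two aligned, sorted lists."""
    pairs = []
    for language, text in translations.items():
        texts = [text] if isinstance(text, str) else list(text)
        pairs.extend((language, t) for t in texts)
    # Sorting the pairs orders by language code first, then by translation.
    pairs.sort()
    return {
        "language": [language for language, _ in pairs],
        "translation": [t for _, t in pairs],
    }


encoded = encode_variable_languages(
    {
        "en": "the cat",
        "fr": ["le chat", "la chatte"],
        "de": "die katze",
    }
)
```

Because the pairs are sorted together, each entry of "translation" stays aligned with the language code at the same position.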
class datasets.Array2D(shape: tuple, dtype: str, id: Union[str, NoneType] = None)[source]
class datasets.Array3D(shape: tuple, dtype: str, id: Union[str, NoneType] = None)[source]
class datasets.Array4D(shape: tuple, dtype: str, id: Union[str, NoneType] = None)[source]
class datasets.Array5D(shape: tuple, dtype: str, id: Union[str, NoneType] = None)[source]

MetricInfo

class datasets.MetricInfo(description: str, citation: str, features: datasets.features.Features, inputs_description: str = <factory>, homepage: str = <factory>, license: str = <factory>, codebase_urls: List[str] = <factory>, reference_urls: List[str] = <factory>, streamable: bool = False, format: Optional[str] = None, metric_name: Optional[str] = None, config_name: Optional[str] = None, experiment_id: Optional[str] = None)[source]

Information about a metric.

MetricInfo documents a metric, including its name, version, and features. See the constructor arguments and properties for a full list.

Note: Not all fields are known on construction and may be updated later.

classmethod from_directory(metric_info_dir) → datasets.info.MetricInfo[source]

Create MetricInfo from the JSON file in metric_info_dir.

Parameters

metric_info_dir (str) – The directory containing the metadata file. This should be the root directory of a specific dataset version.

write_to_directory(metric_info_dir)[source]

Write MetricInfo as JSON to metric_info_dir. Also save the license separately in LICENCE.

Metric

The base class Metric implements a Metric backed by one or several datasets.Dataset.

class datasets.Metric(config_name: Optional[str] = None, keep_in_memory: bool = False, cache_dir: Optional[str] = None, num_process: int = 1, process_id: int = 0, seed: Optional[int] = None, experiment_id: Optional[str] = None, max_concurrent_cache_files: int = 10000, timeout: Union[int, float] = 100, **kwargs)[source]

A Metric is the base class and common API for all metrics.

Parameters
  • config_name (str) – This is used to define a hash specific to a metrics computation script and prevents the metric’s data from being overridden when the metric loading script is modified.

  • keep_in_memory (bool) – Keep all predictions and references in memory. Not possible in distributed settings.

  • cache_dir (str) – Path to a directory in which temporary prediction/references data will be stored. The data directory should be located on a shared file-system in distributed setups.

  • num_process (int) – Specify the total number of nodes in a distributed setting. This is useful to compute metrics in distributed setups (in particular non-additive metrics like F1).

  • process_id (int) – Specify the id of the current process in a distributed setup (between 0 and num_process-1). This is useful to compute metrics in distributed setups (in particular non-additive metrics like F1).

  • seed (Optional int) – If specified, this will temporarily set numpy’s random seed when datasets.Metric.compute() is run.

  • experiment_id (str) – A specific experiment id. This is used if several distributed evaluations share the same file system. This is useful to compute metrics in distributed setups (in particular non-additive metrics like F1).

  • max_concurrent_cache_files (int) – Max number of concurrent metrics cache files (default 10000).

  • timeout (Union[int, float]) – Timeout in seconds for distributed setting synchronization.

add(*, prediction=None, reference=None)[source]

Add one prediction and reference to the metric’s stack.

Parameters
  • prediction (list/array/tensor, optional) – Predictions.

  • reference (list/array/tensor, optional) – References.

add_batch(*, predictions=None, references=None)[source]

Add a batch of predictions and references to the metric’s stack.

Parameters
  • predictions (list/array/tensor, optional) – Predictions.

  • references (list/array/tensor, optional) – References.

compute(*, predictions=None, references=None, **kwargs) → Optional[dict][source]

Compute the metrics.

Positional arguments are not allowed, to prevent mistakes.

Parameters
  • predictions (list/array/tensor, optional) – Predictions.

  • references (list/array/tensor, optional) – References.

  • **kwargs (optional) – Keyword arguments that will be forwarded to the metric’s _compute() method (see details in the docstring).

Returns

dict or None

  • Dictionary with the metrics if this metric is run on the main process (process_id == 0).

  • None if the metric is not run on the main process (process_id != 0).
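
The calling pattern of add(), add_batch() and compute() can be sketched with a toy class. This is a hypothetical stand-in written for illustration, not the library’s implementation: ToyAccuracy and its internals are assumptions, and only the keyword-only signatures and the process_id behaviour mirror what is documented above. It also assumes that inputs passed to compute() are added to the stack before the score is computed.

```python
from typing import Optional


class ToyAccuracy:
    """Hypothetical stand-in that mimics the add / add_batch / compute calls."""

    def __init__(self, process_id: int = 0):
        self.process_id = process_id
        self._predictions = []
        self._references = []

    def add(self, *, prediction=None, reference=None):
        # The bare * makes both arguments keyword-only.
        self._predictions.append(prediction)
        self._references.append(reference)

    def add_batch(self, *, predictions=None, references=None):
        self._predictions.extend(predictions)
        self._references.extend(references)

    def compute(self, *, predictions=None, references=None) -> Optional[dict]:
        # Assumption: inputs given here are stacked before computing.
        if predictions is not None:
            self.add_batch(predictions=predictions, references=references)
        if self.process_id != 0:
            return None  # only the main process returns a result
        correct = sum(p == r for p, r in zip(self._predictions, self._references))
        return {"accuracy": correct / len(self._predictions)}


metric = ToyAccuracy()
metric.add(prediction=1, reference=1)
metric.add_batch(predictions=[0, 1], references=[1, 1])
result = metric.compute(predictions=[0], references=[0])
print(result)
```

Calling metric.add(1, 1) with positional arguments raises a TypeError in this sketch, which is the kind of mistake the keyword-only signatures are meant to catch.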

download_and_prepare(download_config: Optional[datasets.utils.file_utils.DownloadConfig] = None, dl_manager: Optional[datasets.utils.download_manager.DownloadManager] = None)[source]

Downloads and prepares dataset for reading.

Parameters
  • download_config (DownloadConfig, optional) – Specific download configuration parameters.

  • dl_manager (DownloadManager, optional) – Specific download manager to use.

Filesystems

class datasets.filesystems.S3FileSystem(anon=False, key=None, secret=None, token=None, use_ssl=True, client_kwargs=None, requester_pays=False, default_block_size=None, default_fill_cache=True, default_cache_type='bytes', version_aware=False, config_kwargs=None, s3_additional_kwargs=None, session=None, username=None, password=None, asynchronous=False, loop=None, **kwargs)[source]

Access S3 as if it were a file system.

This exposes a filesystem-like API (ls, cp, open, etc.) on top of S3 storage.

Provide credentials either explicitly (key=, secret=) or rely on boto’s credential methods. See the botocore documentation for more information. If no credentials are available, use anon=True.

Parameters
  • anon (bool (False)) – Whether to use anonymous connection (public buckets only). If False, uses the key/secret given, or boto’s credential resolver (environment variables, config files, EC2 IAM server, in that order)

  • key (string (None)) – If not anonymous, use this access key ID, if specified

  • secret (string (None)) – If not anonymous, use this secret access key, if specified

  • token (string (None)) – If not anonymous, use this security token, if specified

  • use_ssl (bool (True)) – Whether to use SSL in connections to S3; may be faster without, but insecure

  • s3_additional_kwargs (dict) – Parameters that are used when calling S3 API methods. Typically used for things like “ServerSideEncryption”.

  • client_kwargs (dict) – Parameters for the botocore client.

  • requester_pays (bool (False)) – Whether RequesterPays buckets are supported.

  • default_block_size (int (None)) – If given, the default block size value used for open(), if no specific value is given at call time. The built-in default is 5MB.

  • default_fill_cache (bool (True)) – Whether to use cache filling with open by default. Refer to S3File.open.

  • default_cache_type (string ('bytes')) – If given, the default cache_type value used for open(). Set to “none” if no caching is desired. See fsspec’s documentation for other available cache_type values. Default cache_type is ‘bytes’.

  • version_aware (bool (False)) – Whether to support bucket versioning. If enabled, this requires the user to have the necessary IAM permissions for dealing with versioned objects.

  • config_kwargs (dict) – Parameters passed to botocore.client.Config.

  • kwargs – Other parameters for the core session.

  • session (botocore Session object) – Session to be used for all connections. This session will be used in place of creating a new session inside S3FileSystem.

  • skip_instance_cache – Passed on to fsspec; controls reuse of instances.

  • use_listings_cache, listings_expiry_time, max_paths – Passed on to fsspec; control reuse of directory listings.

datasets.filesystems.S3FileSystem is a subclass of s3fs.S3FileSystem, which is a known implementation of fsspec. Filesystem Spec (FSSPEC) is a project to unify various projects and classes to work with remote filesystems and file-system-like abstractions using a standard pythonic interface.

Examples

Listing files from a public S3 bucket.

>>> import datasets
>>> s3 = datasets.filesystems.S3FileSystem(anon=True)
>>> s3.ls('public-datasets/imdb/train')
['dataset_info.json','dataset.arrow','state.json']

Listing files from a private S3 bucket using aws_access_key_id and aws_secret_access_key.

>>> import datasets
>>> s3 = datasets.filesystems.S3FileSystem(key=aws_access_key_id, secret=aws_secret_access_key)
>>> s3.ls('my-private-datasets/imdb/train')
['dataset_info.json','dataset.arrow','state.json']

Using S3FileSystem with botocore.session.Session and a custom AWS profile.

>>> import botocore.session
>>> from datasets.filesystems import S3FileSystem
>>> s3_session = botocore.session.Session(profile='my_profile_name')
>>>
>>> s3 = S3FileSystem(session=s3_session)

Loading a dataset from S3 using S3FileSystem and load_from_disk().

>>> from datasets import load_from_disk
>>> from datasets.filesystems import S3FileSystem
>>>
>>> s3 = S3FileSystem(key=aws_access_key_id, secret=aws_secret_access_key)  
>>>
>>> dataset = load_from_disk('s3://my-private-datasets/imdb/train',fs=s3)  
>>>
>>> print(len(dataset))
25000

Saving a dataset to S3 using S3FileSystem and dataset.save_to_disk().

>>> from datasets import load_dataset
>>> from datasets.filesystems import S3FileSystem
>>>
>>> dataset = load_dataset("imdb")
>>>
>>> s3 = S3FileSystem(key=aws_access_key_id, secret=aws_secret_access_key)
>>>
>>> dataset.save_to_disk('s3://my-private-datasets/imdb/train',fs=s3)

datasets.filesystems.extract_path_from_uri(dataset_path: str) → str[source]

Preprocesses dataset_path and removes the remote filesystem prefix (e.g. s3://).

Parameters

dataset_path (str) – Path (e.g. dataset/train) or remote URI (e.g. s3://my-bucket/dataset/train).
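
The behaviour described above can be approximated in a few lines. This is a minimal sketch under the assumption that the prefix is everything up to and including the first “://”; strip_protocol is a hypothetical name and not the library’s function.

```python
def strip_protocol(dataset_path: str) -> str:
    """Return dataset_path without a leading '<protocol>://' prefix, if any."""
    if "://" in dataset_path:
        # Keep everything after the first '://' separator.
        dataset_path = dataset_path.split("://", 1)[1]
    return dataset_path


print(strip_protocol("s3://my-bucket/dataset/train"))
print(strip_protocol("dataset/train"))
```

A local path without a protocol is returned unchanged, so the same call can be used for both local and remote locations.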

datasets.filesystems.is_remote_filesystem(fs: fsspec.spec.AbstractFileSystem) → bool[source]

Validates if filesystem has remote protocol.

Parameters

fs (fsspec.spec.AbstractFileSystem) – An abstract super-class for pythonic file-systems, e.g. fsspec.filesystem('file') or datasets.filesystems.S3FileSystem
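
One way such a check could work is by inspecting the filesystem’s protocol attribute, which fsspec filesystems expose as a string or a tuple of strings. The sketch below is an assumption about the approach, not the library’s code: FakeFileSystem and looks_remote are hypothetical names, and the stand-in class is used so that the example runs without fsspec installed.

```python
class FakeFileSystem:
    """Hypothetical stand-in exposing only a `protocol` attribute."""

    def __init__(self, protocol):
        self.protocol = protocol


def looks_remote(fs) -> bool:
    """Treat any filesystem whose protocol is not 'file' as remote."""
    if fs is None:
        return False
    protocol = fs.protocol
    # Normalize a single protocol string to a tuple of protocols.
    protocols = protocol if isinstance(protocol, (tuple, list)) else (protocol,)
    return "file" not in protocols


print(looks_remote(FakeFileSystem("s3")))
print(looks_remote(FakeFileSystem(("file", "local"))))
```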