Main classes
DatasetInfo
class datasets.DatasetInfo
< source >( description: str = <factory> citation: str = <factory> homepage: str = <factory> license: str = <factory> features: typing.Optional[datasets.features.features.Features] = None post_processed: typing.Optional[datasets.info.PostProcessedInfo] = None supervised_keys: typing.Optional[datasets.info.SupervisedKeysData] = None task_templates: typing.Optional[typing.List[datasets.tasks.base.TaskTemplate]] = None builder_name: typing.Optional[str] = None config_name: typing.Optional[str] = None version: typing.Union[str, datasets.utils.version.Version, NoneType] = None splits: typing.Optional[dict] = None download_checksums: typing.Optional[dict] = None download_size: typing.Optional[int] = None post_processing_size: typing.Optional[int] = None dataset_size: typing.Optional[int] = None size_in_bytes: typing.Optional[int] = None )
Parameters
- description (str) — A description of the dataset.
- citation (str) — A BibTeX citation of the dataset.
- homepage (str) — A URL to the official homepage for the dataset.
- license (str) — The dataset’s license. It can be the name of the license or a paragraph containing the terms of the license.
- features (Features, optional) — The features used to specify the dataset’s column types.
- post_processed (PostProcessedInfo, optional) — Information regarding the resources of a possible post-processing of a dataset. For example, it can contain the information of an index.
- supervised_keys (SupervisedKeysData, optional) — Specifies the input feature and the label for supervised learning if applicable for the dataset (legacy from TFDS).
- builder_name (str, optional) — The name of the GeneratorBasedBuilder subclass used to create the dataset. Usually matched to the corresponding script name. It is also the snake_case version of the dataset builder class name.
- config_name (str, optional) — The name of the configuration derived from BuilderConfig
- version (str or Version, optional) — The version of the dataset.
- splits (dict, optional) — The mapping between split name and metadata.
- download_checksums (dict, optional) — The mapping between the URL to download the dataset’s checksums and corresponding metadata.
- download_size (int, optional) — The size of the files to download to generate the dataset, in bytes.
- post_processing_size (int, optional) — Size of the dataset in bytes after post-processing, if any.
- dataset_size (int, optional) — The combined size in bytes of the Arrow tables for all splits.
- size_in_bytes (int, optional) — The combined size in bytes of all files associated with the dataset (downloaded files + Arrow files).
- task_templates (List[TaskTemplate], optional) — The task templates to prepare the dataset for during training and evaluation. Each template casts the dataset’s Features to standardized column names and types as detailed in datasets.tasks.
- **config_kwargs — Keyword arguments to be passed to the BuilderConfig and used in the DatasetBuilder.
Information about a dataset.
DatasetInfo documents a dataset, including its name, version, and features. See the constructor arguments and properties for a full list.
Note: Not all fields are known on construction and may be updated later.
from_directory
< source >( dataset_info_dir: str )
Create DatasetInfo from the JSON file in dataset_info_dir.
This function updates all the dynamically generated fields (num_examples, hash, time of creation,…) of the DatasetInfo.
This will overwrite all previous metadata.
Write DatasetInfo as JSON to dataset_info_dir.
Also save the license separately in LICENCE.
Dataset
The base class datasets.Dataset implements a Dataset backed by an Apache Arrow table.
class datasets.Dataset
< source >( arrow_table: Table info: typing.Optional[datasets.info.DatasetInfo] = None split: typing.Optional[datasets.splits.NamedSplit] = None indices_table: typing.Optional[datasets.table.Table] = None fingerprint: typing.Optional[str] = None )
A Dataset backed by an Arrow table.
add_column
< source >( name: str column: typing.Union[list, numpy.array] new_fingerprint: str ) → Dataset
Add column to Dataset.
New in version 1.7.
add_item
< source >( item: dict new_fingerprint: str ) → Dataset
Add item to Dataset.
New in version 1.7.
from_file
< source >( filename: str info: typing.Optional[datasets.info.DatasetInfo] = None split: typing.Optional[datasets.splits.NamedSplit] = None indices_filename: typing.Optional[str] = None in_memory: bool = False ) → Dataset
Parameters
- filename (str) — File name of the dataset.
- info (DatasetInfo, optional) — Dataset information, like description, citation, etc.
- split (NamedSplit, optional) — Name of the dataset split.
- indices_filename (str, optional) — File names of the indices.
- in_memory (bool, default False) — Whether to copy the data in-memory.
Instantiate a Dataset backed by an Arrow table at filename.
from_buffer
< source >( buffer: Buffer info: typing.Optional[datasets.info.DatasetInfo] = None split: typing.Optional[datasets.splits.NamedSplit] = None indices_buffer: typing.Optional[pyarrow.lib.Buffer] = None ) → Dataset
Parameters
- buffer (pyarrow.Buffer) — Arrow buffer.
- info (DatasetInfo, optional) — Dataset information, like description, citation, etc.
- split (NamedSplit, optional) — Name of the dataset split.
- indices_buffer (pyarrow.Buffer, optional) — Indices Arrow buffer.
Instantiate a Dataset backed by an Arrow buffer.
from_pandas
< source >( df: DataFrame features: typing.Optional[datasets.features.features.Features] = None info: typing.Optional[datasets.info.DatasetInfo] = None split: typing.Optional[datasets.splits.NamedSplit] = None preserve_index: typing.Optional[bool] = None ) → Dataset
Parameters
- df (pandas.DataFrame) — Dataframe that contains the dataset.
- features (Features, optional) — Dataset features.
- info (DatasetInfo, optional) — Dataset information, like description, citation, etc.
- split (NamedSplit, optional) — Name of the dataset split.
- preserve_index (bool, optional) — Whether to store the index as an additional column in the resulting Dataset. The default of None will store the index as a column, except for RangeIndex which is stored as metadata only. Use preserve_index=True to force it to be stored as a column.
Convert pandas.DataFrame to a pyarrow.Table to create a Dataset.
The column types in the resulting Arrow Table are inferred from the dtypes of the pandas.Series in the DataFrame. In the case of non-object Series, the NumPy dtype is translated to its Arrow equivalent. In the case of object, we need to guess the datatype by looking at the Python objects in this Series.
Be aware that Series of the object dtype don’t carry enough information to always lead to a meaningful Arrow type. If a type cannot be inferred, e.g. because the DataFrame is of length 0 or the Series only contains None/nan objects, the type is set to null. This behavior can be avoided by constructing explicit features and passing them to this function.
from_dict
< source >( mapping: dict features: typing.Optional[datasets.features.features.Features] = None info: typing.Optional[typing.Any] = None split: typing.Optional[typing.Any] = None ) → Dataset
Parameters
- mapping (Mapping) — Mapping of strings to Arrays or Python lists.
- features (Features, optional) — Dataset features.
- info (DatasetInfo, optional) — Dataset information, like description, citation, etc.
- split (NamedSplit, optional) — Name of the dataset split.
Convert dict to a pyarrow.Table to create a Dataset.
The Apache Arrow table backing the dataset.
The cache files containing the Apache Arrow table backing the dataset.
Number of columns in the dataset.
Number of rows in the dataset (same as Dataset.__len__()).
Names of the columns in the dataset.
Shape of the dataset (number of rows, number of columns).
unique
< source >( column: str ) → list
Parameters
- column (str) — Column name (list all the column names with datasets.Dataset.column_names).
Returns
list — List of unique elements in the given column.
Return a list of the unique elements in a column.
This is implemented in the low-level backend and is therefore very fast.
flatten
< source >( new_fingerprint max_depth = 16 ) → Dataset
Flatten the table. Each column with a struct type is flattened into one column per struct field. Other columns are left unchanged.
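The effect of flattening can be pictured on a plain dictionary of columns. The sketch below is not the library's implementation, which operates on the Arrow table. It assumes struct values are represented as Python dicts with identical fields in every row, and that flattened columns are named `parent.field`; that naming is an assumption not stated above. The helper name `flatten_columns` is hypothetical.

```python
def flatten_columns(columns, max_depth=16):
    """Flatten struct-like columns (lists of dicts) into one column per field."""
    for _ in range(max_depth):
        flattened = {}
        changed = False
        for name, values in columns.items():
            # Treat a column as a struct column only if every value is a dict.
            if values and all(isinstance(v, dict) for v in values):
                changed = True
                for field in list(values[0]):
                    # Assumed naming scheme: "<parent>.<field>".
                    flattened[f"{name}.{field}"] = [v[field] for v in values]
            else:
                # Non-struct columns are left unchanged.
                flattened[name] = values
        columns = flattened
        if not changed:
            break
    return columns


columns = {
    "id": [1, 2],
    "answers": [
        {"text": "yes", "span": {"start": 0, "end": 3}},
        {"text": "no", "span": {"start": 4, "end": 6}},
    ],
}
flat = flatten_columns(columns)
print(sorted(flat))
```

Nested structs are flattened one level per pass, which is why the loop repeats up to `max_depth` times.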
cast
< source >( features: Features batch_size: typing.Optional[int] = 10000 keep_in_memory: bool = False load_from_cache_file: bool = True cache_file_name: typing.Optional[str] = None writer_batch_size: typing.Optional[int] = 10000 num_proc: typing.Optional[int] = None ) → Dataset
Parameters
- features (datasets.Features) — New features to cast the dataset to. The name of the fields in the features must match the current column names. The type of the data must also be convertible from one type to the other. For non-trivial conversion, e.g. string <-> ClassLabel, you should use map to update the Dataset.
- batch_size (int, defaults to 10000) — Number of examples per batch provided to cast. If batch_size <= 0 or batch_size == None, the full dataset is provided as a single batch to cast.
- keep_in_memory (bool, default False) — Whether to copy the data in-memory.
- load_from_cache_file (bool, default True if caching is enabled) — If a cache file storing the current computation from function can be identified, use it instead of recomputing.
- cache_file_name (str, optional, default None) — Provide the name of a path for the cache file. It is used to store the results of the computation instead of the automatically generated cache file name.
- writer_batch_size (int, default 10000) — Number of rows per write operation for the cache file writer. This value is a good trade-off between memory usage during the processing, and processing speed. A higher value makes the processing do fewer lookups; a lower value consumes less temporary memory while running .map().
- num_proc (int, optional, default None) — Number of processes for multiprocessing. By default it doesn’t use multiprocessing.
Returns
A copy of the dataset with cast features.
Cast the dataset to a new set of features.
cast_column
< source >( column: str feature: typing.Union[dict, list, tuple, datasets.features.features.Value, datasets.features.features.ClassLabel, datasets.features.translation.Translation, datasets.features.translation.TranslationVariableLanguages, datasets.features.features.Sequence, datasets.features.features.Array2D, datasets.features.features.Array3D, datasets.features.features.Array4D, datasets.features.features.Array5D, datasets.features.audio.Audio, datasets.features.image.Image] new_fingerprint: str ) → Dataset
Cast column to feature for decoding.
remove_columns
< source >( column_names: typing.Union[str, typing.List[str]] new_fingerprint ) → Dataset
Parameters
- column_names (Union[str, List[str]]) — Name of the column(s) to remove.
- new_fingerprint (str, optional) — The new fingerprint of the dataset after transform. If None, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments.
Returns
A copy of the dataset object without the columns to remove.
Remove one or several column(s) in the dataset and the features associated with them.
You can also remove a column using Dataset.map() with remove_columns, but the present method doesn’t copy the data to a new dataset and is thus faster.
rename_column
< source >( original_column_name: str new_column_name: str new_fingerprint ) → Dataset
Parameters
- original_column_name (str) — Name of the column to rename.
- new_column_name (str) — New name for the column.
- new_fingerprint (str, optional) — The new fingerprint of the dataset after transform. If None, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments.
Returns
A copy of the dataset with a renamed column.
Rename a column in the dataset, and move the features associated with the original column under the new column name.
rename_columns
< source >( column_mapping: typing.Dict[str, str] new_fingerprint ) → Dataset
Returns
A copy of the dataset with renamed columns.
Rename several columns in the dataset, and move the features associated with the original columns under the new column names.
class_encode_column
< source >( column: str include_nulls: bool = False )
Parameters
- column (str) — The name of the column to cast (list all the column names with datasets.Dataset.column_names).
- include_nulls (bool, default False) — Whether to include null values in the class labels. If True, the null values will be encoded as the “None” class label. New in version 1.14.2.
Casts the given column as datasets.features.ClassLabel and updates the table.
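As a rough picture of what class encoding does to a column, the following plain-Python sketch maps raw values to integer class ids. It is an illustration under assumptions, not the library's code: the helper `class_encode` is hypothetical, and ordering the class names by sorting their string forms is an assumption not stated above.

```python
def class_encode(values, include_nulls=False):
    """Return (class_names, encoded_values) for a column of raw values."""
    # Assumption: class names are the sorted string forms of the unique values.
    names = sorted({str(v) for v in values if include_nulls or v is not None})
    index = {name: i for i, name in enumerate(names)}
    encoded = [
        # Without include_nulls, null values stay null instead of becoming a class.
        index[str(v)] if (include_nulls or v is not None) else None
        for v in values
    ]
    return names, encoded


raw = ["pos", "neg", None, "pos"]
names, encoded = class_encode(raw)
names_with_nulls, encoded_with_nulls = class_encode(raw, include_nulls=True)
print(names, encoded)
print(names_with_nulls, encoded_with_nulls)
```

With `include_nulls=True`, the null value gets its own “None” class label, as described for the parameter above.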
Number of rows in the dataset.
Iterate through the examples.
If a formatting is set with Dataset.set_format() rows will be returned with the selected format.
formatted_as
< source >( type: typing.Optional[str] = None columns: typing.Optional[typing.List] = None output_all_columns: bool = False **format_kwargs )
Parameters
- type (str, optional) — Output type selected in [None, 'numpy', 'torch', 'tensorflow', 'pandas', 'arrow']. None means __getitem__ returns python objects (default).
- columns (List[str], optional) — Columns to format in the output. None means __getitem__ returns all columns (default).
- output_all_columns (bool, default to False) — Keep un-formatted columns as well in the output (as python objects).
- **format_kwargs — Keyword arguments passed to the convert function like np.array, torch.tensor or tensorflow.ragged.constant.
To be used in a with statement. Set __getitem__ return format (type and columns).
set_format
< source >( type: typing.Optional[str] = None columns: typing.Optional[typing.List] = None output_all_columns: bool = False **format_kwargs )
Parameters
- type (str, optional) — Either output type selected in [None, ‘numpy’, ‘torch’, ‘tensorflow’, ‘pandas’, ‘arrow’]. None means __getitem__ returns python objects (default).
- columns (List[str], optional) — Columns to format in the output. None means __getitem__ returns all columns (default).
- output_all_columns (bool, default to False) — Keep un-formatted columns as well in the output (as python objects).
- **format_kwargs — Keyword arguments passed to the convert function like np.array, torch.tensor or tensorflow.ragged.constant.
Set __getitem__ return format (type and columns). The data formatting is applied on-the-fly. The format type (for example “numpy”) is used to format batches when using __getitem__. It’s also possible to use custom transforms for formatting using datasets.Dataset.set_transform().
It is possible to call map after calling set_format. Since map may add new columns, the list of formatted columns gets updated. In this case, if you apply map on a dataset to add a new column, then this column will be formatted:
new formatted columns = (all columns - previously unformatted columns)
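The rule above is plain set arithmetic. The short sketch below works it through with hypothetical column names (`text`, `label`, `tokens`), assuming the format was restricted to `text` before a `map` call added `tokens`.

```python
# Columns present before map, and the subset selected for formatting.
columns_before_map = {"text", "label"}
formatted_before_map = {"text"}

# map adds a new column.
all_columns = columns_before_map | {"tokens"}

# Columns that were deliberately left unformatted stay unformatted.
previously_unformatted = columns_before_map - formatted_before_map

# new formatted columns = (all columns - previously unformatted columns)
new_formatted = all_columns - previously_unformatted
print(sorted(new_formatted))
```

The new column ends up formatted, while `label` remains excluded.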
set_transform
< source >( transform: typing.Optional[typing.Callable] columns: typing.Optional[typing.List] = None output_all_columns: bool = False )
Parameters
- transform (Callable, optional) — User-defined formatting transform, replaces the format defined by datasets.Dataset.set_format(). A formatting function is a callable that takes a batch (as a dict) as input and returns a batch. This function is applied right before returning the objects in __getitem__.
- columns (List[str], optional) — Columns to format in the output. If specified, then the input batch of the transform only contains those columns.
- output_all_columns (bool, default to False) — Keep un-formatted columns as well in the output (as python objects). If set to True, then the other un-formatted columns are kept with the output of the transform.
Set __getitem__ return format using this transform. The transform is applied on-the-fly on batches when __getitem__ is called. As with datasets.Dataset.set_format(), this can be reset using datasets.Dataset.reset_format().
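To make the "applied on-the-fly" behaviour concrete, here is a minimal stand-in written in plain Python. `TinyDataset` is a hypothetical class, not part of the library; it assumes columns are stored as a dict of lists and that a single row is handled as a batch of one, which is an assumption about the mechanics rather than something stated above.

```python
class TinyDataset:
    """Hypothetical stand-in: columns stored as a dict of equal-length lists."""

    def __init__(self, columns):
        self.columns = columns
        self.transform = None

    def set_transform(self, transform):
        self.transform = transform

    def reset_format(self):
        self.transform = None

    def __getitem__(self, index):
        # Supports non-negative integers and slices only.
        single = isinstance(index, int)
        rows = slice(index, index + 1) if single else index
        batch = {name: values[rows] for name, values in self.columns.items()}
        if self.transform is not None:
            # The transform sees a batch (dict of lists) and returns a batch.
            batch = self.transform(batch)
        if single:
            return {name: values[0] for name, values in batch.items()}
        return batch


def upper_case(batch):
    return {"text": [t.upper() for t in batch["text"]]}


ds = TinyDataset({"text": ["a", "b", "c"]})
plain_row = ds[0]
ds.set_transform(upper_case)
transformed_row = ds[0]
transformed_batch = ds[0:2]
ds.reset_format()
restored_row = ds[1]
print(plain_row, transformed_row, transformed_batch, restored_row)
```

The stored columns are never modified; only the values returned by `__getitem__` change, and resetting the format restores the plain output.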
Reset __getitem__ return format to python objects and all columns.
Same as self.set_format().
with_format
< source >( type: typing.Optional[str] = None columns: typing.Optional[typing.List] = None output_all_columns: bool = False **format_kwargs )
Parameters
- type (str, optional) — Either output type selected in [None, ‘numpy’, ‘torch’, ‘tensorflow’, ‘pandas’, ‘arrow’]. None means __getitem__ returns python objects (default).
- columns (List[str], optional) — Columns to format in the output. None means __getitem__ returns all columns (default).
- output_all_columns (bool, default to False) — Keep un-formatted columns as well in the output (as python objects).
- **format_kwargs — Keyword arguments passed to the convert function like np.array, torch.tensor or tensorflow.ragged.constant.
Set __getitem__ return format (type and columns). The data formatting is applied on-the-fly. The format type (for example “numpy”) is used to format batches when using __getitem__.
It’s also possible to use custom transforms for formatting using datasets.Dataset.with_transform().
Contrary to datasets.Dataset.set_format(), with_format returns a new Dataset object.
with_transform
< source >( transform: typing.Optional[typing.Callable] columns: typing.Optional[typing.List] = None output_all_columns: bool = False )
Parameters
- transform (Callable, optional) — User-defined formatting transform, replaces the format defined by datasets.Dataset.set_format(). A formatting function is a callable that takes a batch (as a dict) as input and returns a batch. This function is applied right before returning the objects in __getitem__.
- columns (List[str], optional) — Columns to format in the output. If specified, then the input batch of the transform only contains those columns.
- output_all_columns (bool, default to False) — Keep un-formatted columns as well in the output (as python objects). If set to True, then the other un-formatted columns are kept with the output of the transform.
Set __getitem__ return format using this transform. The transform is applied on-the-fly on batches when __getitem__ is called.
As with datasets.Dataset.set_format(), this can be reset using datasets.Dataset.reset_format().
Contrary to datasets.Dataset.set_transform(), with_transform returns a new Dataset object.
Can be used to index columns (by string names) or rows (by integer index or iterable of indices or bools).
Clean up all cache files in the dataset cache directory, except for the currently used cache file if there is one.
Be careful when running this command that no other process is currently using other cache files.
map
< source >( function: typing.Optional[typing.Callable] = None with_indices: bool = False with_rank: bool = False input_columns: typing.Union[str, typing.List[str], NoneType] = None batched: bool = False batch_size: typing.Optional[int] = 1000 drop_last_batch: bool = False remove_columns: typing.Union[str, typing.List[str], NoneType] = None keep_in_memory: bool = False load_from_cache_file: bool = None cache_file_name: typing.Optional[str] = None writer_batch_size: typing.Optional[int] = 1000 features: typing.Optional[datasets.features.features.Features] = None disable_nullable: bool = False fn_kwargs: typing.Optional[dict] = None num_proc: typing.Optional[int] = None suffix_template: str = '_{rank:05d}_of_{num_proc:05d}' new_fingerprint: typing.Optional[str] = None desc: typing.Optional[str] = None )
Parameters
- function (Callable) — Function with one of the following signatures:
  - function(example: Union[Dict, Any]) -> Union[Dict, Any] if batched=False and with_indices=False and with_rank=False
  - function(example: Union[Dict, Any], *extra_args) -> Union[Dict, Any] if batched=False and with_indices=True and/or with_rank=True (one extra arg for each)
  - function(batch: Union[Dict[List], List[Any]]) -> Union[Dict, Any] if batched=True and with_indices=False and with_rank=False
  - function(batch: Union[Dict[List], List[Any]], *extra_args) -> Union[Dict, Any] if batched=True and with_indices=True and/or with_rank=True (one extra arg for each)
  If no function is provided, default to identity function: lambda x: x.
- with_indices (bool, default False) — Provide example indices to function. Note that in this case the signature of function should be def function(example, idx[, rank]): ….
- with_rank (bool, default False) — Provide process rank to function. Note that in this case the signature of function should be def function(example[, idx], rank): ….
- input_columns (Optional[Union[str, List[str]]], default None) — The columns to be passed into function as positional arguments. If None, a dict mapping to all formatted columns is passed as one argument.
- batched (bool, default False) — Provide batch of examples to function.
- batch_size (int, optional, default 1000) — Number of examples per batch provided to function if batched=True. If batch_size <= 0 or batch_size == None, the full dataset is provided as a single batch to function.
- drop_last_batch (bool, default False) — Whether a last batch smaller than the batch_size should be dropped instead of being processed by the function.
- remove_columns (Optional[Union[str, List[str]]], default None) — Remove a selection of columns while doing the mapping. Columns will be removed before updating the examples with the output of function, i.e. if function is adding columns with names in remove_columns, these columns will be kept.
- keep_in_memory (bool, default False) — Keep the dataset in memory instead of writing it to a cache file.
- load_from_cache_file (bool, default True if caching is enabled) — If a cache file storing the current computation from function can be identified, use it instead of recomputing.
- cache_file_name (str, optional, default None) — Provide the name of a path for the cache file. It is used to store the results of the computation instead of the automatically generated cache file name.
- writer_batch_size (int, default 1000) — Number of rows per write operation for the cache file writer. This value is a good trade-off between memory usage during the processing, and processing speed. A higher value makes the processing do fewer lookups; a lower value consumes less temporary memory while running .map().
- features (Optional[datasets.Features], default None) — Use a specific Features to store the cache file instead of the automatically generated one.
- disable_nullable (bool, default False) — Disallow null values in the table.
- fn_kwargs (Dict, optional, default None) — Keyword arguments to be passed to function.
- num_proc (int, optional, default None) — Max number of processes when generating cache. Already cached shards are loaded sequentially.
- suffix_template (str) — If cache_file_name is specified, then this suffix will be added at the end of the base name of each cache file. Defaults to “_{rank:05d}_of_{num_proc:05d}”. For example, if cache_file_name is “processed.arrow”, then for rank=1 and num_proc=4, the resulting file would be “processed_00001_of_00004.arrow” for the default suffix.
- new_fingerprint (str, optional, default None) — The new fingerprint of the dataset after transform. If None, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments.
- desc (str, optional, defaults to None) — Meaningful description to be displayed alongside the progress bar while mapping examples.
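The suffix_template naming can be reproduced with ordinary string formatting. The sketch below uses the default template from the signature above; the helper name `shard_cache_file_name` is hypothetical, and splitting the file name with `os.path.splitext` is an assumption about how the suffix is placed before the extension.

```python
import os


def shard_cache_file_name(cache_file_name, rank, num_proc,
                          suffix_template="_{rank:05d}_of_{num_proc:05d}"):
    """Insert the formatted suffix between the base name and the extension."""
    base, ext = os.path.splitext(cache_file_name)
    return base + suffix_template.format(rank=rank, num_proc=num_proc) + ext


shard_name = shard_cache_file_name("processed.arrow", rank=1, num_proc=4)
print(shard_name)  # processed_00001_of_00004.arrow
```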
Apply a function to all the examples in the table (individually or in batches) and update the table. If your function returns a column that already exists, then it overwrites it.
You can specify whether the function should be batched or not with the batched parameter:
- If batched is False, then the function takes 1 example in and should return 1 example. An example is a dictionary, e.g. {“text”: “Hello there !”}
- If batched is True and batch_size is 1, then the function takes a batch of 1 example as input and can return a batch with 1 or more examples. A batch is a dictionary, e.g. a batch of 1 example is {“text”: [“Hello there !”]}
- If batched is True and batch_size is n > 1, then the function takes a batch of n examples as input and can return a batch with n examples, or with an arbitrary number of examples. Note that the last batch may have fewer than n examples. A batch is a dictionary, e.g. a batch of n examples is {“text”: [“Hello there !”] * n}
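The batched contract described above can be sketched without the library. `apply_batched` below is a hypothetical helper that feeds dict-of-lists batches to a function and concatenates whatever it returns. As a simplification, it keeps only the columns the function returns and does not merge them with the input columns.

```python
def apply_batched(columns, function, batch_size=1000):
    """Apply `function` to successive batches and concatenate the results."""
    num_rows = len(next(iter(columns.values()))) if columns else 0
    if batch_size is None or batch_size <= 0:
        # Provide the full dataset as a single batch.
        batch_size = max(num_rows, 1)
    output = {}
    for start in range(0, num_rows, batch_size):
        # The last batch may hold fewer than batch_size examples.
        batch = {name: values[start:start + batch_size]
                 for name, values in columns.items()}
        for name, values in function(batch).items():
            output.setdefault(name, []).extend(values)
    return output


def split_words(batch):
    # Returns more examples than it receives: one example per word.
    return {"word": [w for text in batch["text"] for w in text.split()]}


columns = {"text": ["Hello there !", "General Kenobi"]}
words = apply_batched(columns, split_words, batch_size=1)
print(words)
```

Two input examples become five output examples, which is only possible because the function operates on batches rather than on single examples.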
filter
< source >( function: typing.Optional[typing.Callable] = None with_indices = False input_columns: typing.Union[str, typing.List[str], NoneType] = None batched: bool = False batch_size: typing.Optional[int] = 1000 keep_in_memory: bool = False load_from_cache_file: bool = True cache_file_name: typing.Optional[str] = None writer_batch_size: typing.Optional[int] = 1000 fn_kwargs: typing.Optional[dict] = None num_proc: typing.Optional[int] = None suffix_template: str = '_{rank:05d}_of_{num_proc:05d}' new_fingerprint: typing.Optional[str] = None desc: typing.Optional[str] = None )
Parameters
- function (Callable) — Callable with one of the following signatures:
  - function(example: Union[Dict, Any]) -> bool if with_indices=False, batched=False
  - function(example: Union[Dict, Any], indices: int) -> bool if with_indices=True, batched=False
  - function(example: Union[Dict, Any]) -> List[bool] if with_indices=False, batched=True
  - function(example: Union[Dict, Any], indices: int) -> List[bool] if with_indices=True, batched=True
  If no function is provided, defaults to an always True function: lambda x: True.
- with_indices (bool, default False) — Provide example indices to function. Note that in this case the signature of function should be def function(example, idx): ….
- input_columns (str or List[str], optional) — The columns to be passed into function as positional arguments. If None, a dict mapping to all formatted columns is passed as one argument.
- batched (bool, defaults to False) — Provide batch of examples to function.
- batch_size (int, optional, default 1000) — Number of examples per batch provided to function if batched = True. If batched = False, one example per batch is passed to function. If batch_size <= 0 or batch_size == None, the full dataset is provided as a single batch to function.
- keep_in_memory (bool, default False) — Keep the dataset in memory instead of writing it to a cache file.
- load_from_cache_file (bool, default True) — If a cache file storing the current computation from function can be identified, use it instead of recomputing.
- cache_file_name (str, optional) — Provide the name of a path for the cache file. It is used to store the results of the computation instead of the automatically generated cache file name.
- writer_batch_size (int, default 1000) — Number of rows per write operation for the cache file writer. This value is a good trade-off between memory usage during the processing, and processing speed. A higher value makes the processing do fewer lookups; a lower value consumes less temporary memory while running .map().
- fn_kwargs (dict, optional) — Keyword arguments to be passed to function.
- num_proc (int, optional) — Number of processes for multiprocessing. By default it doesn’t use multiprocessing.
- suffix_template (str) — If cache_file_name is specified, then this suffix will be added at the end of the base name of each cache file. For example, if cache_file_name is “processed.arrow”, then for rank = 1 and num_proc = 4, the resulting file would be “processed_00001_of_00004.arrow” for the default suffix (default “_{rank:05d}_of_{num_proc:05d}”).
- new_fingerprint (str, optional) — The new fingerprint of the dataset after transform. If None, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments.
- desc (str, optional, defaults to None) — Meaningful description to be displayed alongside the progress bar while filtering examples.
Apply a filter function to all the elements in the table in batches and update the table so that the dataset only includes examples for which the filter function returns True.
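For the batched case, the function returns one boolean per example in the batch. The following plain-Python sketch shows that contract; `filter_batched` is a hypothetical helper, not the library's implementation, and the column names are made up for illustration.

```python
def filter_batched(columns, function, batch_size=1000):
    """Keep only the rows whose mask entry is True, batch by batch."""
    num_rows = len(next(iter(columns.values()))) if columns else 0
    kept = {name: [] for name in columns}
    for start in range(0, num_rows, batch_size):
        batch = {name: values[start:start + batch_size]
                 for name, values in columns.items()}
        mask = function(batch)  # one bool per example in the batch
        for name, values in batch.items():
            kept[name].extend(v for v, keep in zip(values, mask) if keep)
    return kept


def is_long(batch):
    return [len(text) > 3 for text in batch["text"]]


columns = {"text": ["hi", "hello", "hey", "greetings"], "id": [0, 1, 2, 3]}
filtered = filter_batched(columns, is_long, batch_size=2)
print(filtered)
```

All columns are filtered with the same mask, so rows stay aligned across columns.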
select
< source >( indices: typing.Iterable keep_in_memory: bool = False indices_cache_file_name: typing.Optional[str] = None writer_batch_size: typing.Optional[int] = 1000 new_fingerprint: typing.Optional[str] = None )
Parameters
- indices (sequence, iterable, ndarray or Series) — List or 1D-array of integer indices for indexing.
- keep_in_memory (bool, default False) — Keep the indices mapping in memory instead of writing it to a cache file.
- indices_cache_file_name (str, optional, default None) — Provide the name of a path for the cache file. It is used to store the indices mapping instead of the automatically generated cache file name.
- writer_batch_size (int, default 1000) — Number of rows per write operation for the cache file writer. This value is a good trade-off between memory usage during the processing, and processing speed. A higher value makes the processing do fewer lookups; a lower value consumes less temporary memory while running .map().
- new_fingerprint (str, optional, default None) — The new fingerprint of the dataset after transform. If None, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments.
Create a new dataset with rows selected following the list/array of indices.
sort
< source >( column: str reverse: bool = False kind: str = None null_placement: str = 'last' keep_in_memory: bool = False load_from_cache_file: bool = True indices_cache_file_name: typing.Optional[str] = None writer_batch_size: typing.Optional[int] = 1000 new_fingerprint: typing.Optional[str] = None )
Parameters
- column (str) — Column name to sort by.
- reverse (bool, default False) — If True, sort by descending order rather than ascending.
- kind (str, optional) — Pandas algorithm for sorting selected in {‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’}. The default is ‘quicksort’. Note that both ‘stable’ and ‘mergesort’ use timsort under the covers and, in general, the actual implementation will vary with data type. The ‘mergesort’ option is retained for backwards compatibility.
- null_placement (str, default ‘last’) — Put None values at the beginning if ‘first’; ‘last’ puts None values at the end. New in version 1.14.2.
- keep_in_memory (bool, default False) — Keep the sorted indices in memory instead of writing them to a cache file.
- load_from_cache_file (bool, default True) — If a cache file storing the sorted indices can be identified, use it instead of recomputing.
- indices_cache_file_name (str, optional, default None) — Provide the name of a path for the cache file. It is used to store the sorted indices instead of the automatically generated cache file name.
- writer_batch_size (int, default 1000) — Number of rows per write operation for the cache file writer. A higher value gives smaller cache files; a lower value consumes less temporary memory.
- new_fingerprint (str, optional, default None) — The new fingerprint of the dataset after transform. If None, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments.
Create a new dataset sorted according to a column.
Currently, sorting according to a column name uses the pandas sorting algorithm under the hood. The column should thus be a pandas-compatible type (in particular not a nested type). This also means that the column used for sorting is fully loaded in memory (which should be fine in most cases).
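The interaction between reverse and null_placement can be sketched by computing sorted row indices in plain Python. This swaps the pandas sort described above for the standard library's `list.sort`, so it only illustrates the intended ordering. The helper `sorted_indices` is hypothetical, and treating null placement as independent of `reverse` is an assumption.

```python
def sorted_indices(values, reverse=False, null_placement="last"):
    """Return row indices ordered by value, with None values grouped together."""
    non_null = [i for i, v in enumerate(values) if v is not None]
    nulls = [i for i, v in enumerate(values) if v is None]
    non_null.sort(key=lambda i: values[i], reverse=reverse)
    # Assumption: nulls go first or last regardless of the sort direction.
    return nulls + non_null if null_placement == "first" else non_null + nulls


scores = [3, None, 1, 2]
ascending = sorted_indices(scores)
descending_nulls_first = sorted_indices(scores, reverse=True, null_placement="first")
print(ascending, descending_nulls_first)
```

The result is an index permutation rather than a reordered copy of the data, mirroring the "sorted indices" wording in the parameter descriptions.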
shuffle
< source >( seed: typing.Optional[int] = None generator: typing.Optional[numpy.random._generator.Generator] = None keep_in_memory: bool = False load_from_cache_file: bool = True indices_cache_file_name: typing.Optional[str] = None writer_batch_size: typing.Optional[int] = 1000 new_fingerprint: typing.Optional[str] = None )
Parameters
- seed (int, optional) — A seed to initialize the default BitGenerator if generator=None. If None, then fresh, unpredictable entropy will be pulled from the OS. If an int or array_like[ints] is passed, then it will be passed to SeedSequence to derive the initial BitGenerator state.
- generator (numpy.random.Generator, optional) — NumPy random Generator to use to compute the permutation of the dataset rows. If generator=None (default), uses np.random.default_rng (the default BitGenerator (PCG64) of NumPy).
- keep_in_memory (bool, default False) — Keep the shuffled indices in memory instead of writing them to a cache file.
- load_from_cache_file (bool, default True) — If a cache file storing the shuffled indices can be identified, use it instead of recomputing.
- indices_cache_file_name (str, optional) — Provide the name of a path for the cache file. It is used to store the shuffled indices instead of the automatically generated cache file name.
- writer_batch_size (int, default 1000) — Number of rows per write operation for the cache file writer. This value is a good trade-off between memory usage during the processing, and processing speed. A higher value makes the processing do fewer lookups; a lower value consumes less temporary memory while running .map().
- new_fingerprint (str, optional, default None) — The new fingerprint of the dataset after the transform. If None, the new fingerprint is computed using a hash of the previous fingerprint and the transform arguments.
Create a new Dataset where the rows are shuffled.
Currently shuffling uses NumPy random generators. You can either supply a NumPy Generator through generator, or a seed to initialize NumPy’s default random generator (PCG64).
train_test_split
< source >( test_size: typing.Union[float, int, NoneType] = None train_size: typing.Union[float, int, NoneType] = None shuffle: bool = True seed: typing.Optional[int] = None generator: typing.Optional[numpy.random._generator.Generator] = None keep_in_memory: bool = False load_from_cache_file: bool = True train_indices_cache_file_name: typing.Optional[str] = None test_indices_cache_file_name: typing.Optional[str] = None writer_batch_size: typing.Optional[int] = 1000 train_new_fingerprint: typing.Optional[str] = None test_new_fingerprint: typing.Optional[str] = None )
Parameters
- test_size (float or int, optional) — Size of the test split. If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.25.
- train_size (float or int, optional) — Size of the train split. If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.
- shuffle (bool, optional, default True) — Whether or not to shuffle the data before splitting.
- seed (int, optional) — A seed to initialize the default BitGenerator if generator=None. If None, then fresh, unpredictable entropy will be pulled from the OS. If an int or array_like[ints] is passed, then it will be passed to SeedSequence to derive the initial BitGenerator state.
- generator (numpy.random.Generator, optional) — NumPy random Generator to use to compute the permutation of the dataset rows. If generator=None (default), uses np.random.default_rng (the default BitGenerator (PCG64) of NumPy).
- keep_in_memory (bool, default False) — Keep the split indices in memory instead of writing them to a cache file.
- load_from_cache_file (bool, default True) — If a cache file storing the split indices can be identified, use it instead of recomputing.
- train_indices_cache_file_name (str, optional) — Provide the name of a path for the cache file. It is used to store the train split indices instead of the automatically generated cache file name.
- test_indices_cache_file_name (str, optional) — Provide the name of a path for the cache file. It is used to store the test split indices instead of the automatically generated cache file name.
- writer_batch_size (int, default 1000) — Number of rows per write operation for the cache file writer. This value is a good trade-off between memory usage during the processing, and processing speed. A higher value makes the processing do fewer lookups; a lower value consumes less temporary memory while running .map().
- train_new_fingerprint (str, optional, defaults to None) — The new fingerprint of the train set after the transform. If None, the new fingerprint is computed using a hash of the previous fingerprint and the transform arguments.
- test_new_fingerprint (str, optional, defaults to None) — The new fingerprint of the test set after the transform. If None, the new fingerprint is computed using a hash of the previous fingerprint and the transform arguments.
Return a dictionary (datasets.DatasetDict) with two random train and test subsets (train and test Dataset splits).
Splits are created from the dataset according to test_size, train_size and shuffle.
This method is similar to scikit-learn train_test_split with the omission of the stratified options.
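How test_size and train_size combine into absolute row counts can be sketched in plain Python. `resolve_split_sizes` is a hypothetical helper, and the rounding (ceil for the test fraction, floor for the train fraction) is an assumption borrowed from the scikit-learn convention rather than something stated above.

```python
import math
from typing import Tuple, Union

Size = Union[float, int, None]


def resolve_split_sizes(n_samples: int, test_size: Size = None, train_size: Size = None) -> Tuple[int, int]:
    """Turn test_size/train_size into absolute (n_train, n_test) row counts.

    Assumption: fractions are rounded as in scikit-learn (ceil for test, floor for train).
    """
    if test_size is None and train_size is None:
        test_size = 0.25  # documented default when neither size is given

    n_test = None
    n_train = None
    if isinstance(test_size, float):
        n_test = math.ceil(test_size * n_samples)
    elif isinstance(test_size, int):
        n_test = test_size
    if isinstance(train_size, float):
        n_train = math.floor(train_size * n_samples)
    elif isinstance(train_size, int):
        n_train = train_size

    # A missing size is the complement of the other one.
    if n_test is None:
        n_test = n_samples - n_train
    if n_train is None:
        n_train = n_samples - n_test
    if n_train + n_test > n_samples or n_train <= 0 or n_test <= 0:
        raise ValueError("train and test sizes do not fit the dataset")
    return n_train, n_test


print(resolve_split_sizes(10))                  # (7, 3)
print(resolve_split_sizes(10, test_size=2))     # (8, 2)
print(resolve_split_sizes(10, train_size=0.5))  # (5, 5)
```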
shard
< source >( num_shards: int index: int contiguous: bool = False keep_in_memory: bool = False indices_cache_file_name: typing.Optional[str] = None writer_batch_size: typing.Optional[int] = 1000 )
Parameters
- num_shards (int) — How many shards to split the dataset into.
- index (int) — Which shard to select and return.
- contiguous (bool, default False) — Whether to select contiguous blocks of indices for shards.
- keep_in_memory (bool, default False) — Keep the dataset in memory instead of writing it to a cache file.
- load_from_cache_file (bool, default True) — If a cache file storing the current computation from function can be identified, use it instead of recomputing.
- indices_cache_file_name (str, optional) — Provide the name of a path for the cache file. It is used to store the indices of each shard instead of the automatically generated cache file name.
- writer_batch_size (int, default 1000) — Number of rows per write operation for the cache file writer. This value is a good trade-off between memory usage during the processing, and processing speed. A higher value makes the processing do fewer lookups; a lower value consumes less temporary memory while running .map().
Return the index-nth shard from dataset split into num_shards pieces.
This shards deterministically. dset.shard(n, i) will contain all elements of dset whose index mod n = i.
dset.shard(n, i, contiguous=True) will instead split dset into contiguous chunks, so it can be easily concatenated back together after processing. If len(dset) % n == l, then the first l shards will have length (len(dset) // n) + 1, and the remaining shards will have length (len(dset) // n). datasets.concatenate_datasets([dset.shard(n, i, contiguous=True) for i in range(n)]) will return a dataset with the same order as the original.
Be sure to shard before using any randomizing operator (such as shuffle). It is best if the shard operator is used early in the dataset pipeline.
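The two sharding modes can be compared with a short pure-Python sketch of the index arithmetic. `shard_indices` is a hypothetical helper that only computes row positions; it is an illustration of the rule described above, not the library's code.

```python
from typing import List


def shard_indices(num_rows: int, num_shards: int, index: int, contiguous: bool = False) -> List[int]:
    """Return the row indices that belong to shard `index` out of `num_shards`."""
    if not 0 <= index < num_shards:
        raise ValueError("index must be in [0, num_shards)")
    if contiguous:
        div, mod = divmod(num_rows, num_shards)
        # The first `mod` shards get one extra row.
        start = div * index + min(index, mod)
        end = start + div + (1 if index < mod else 0)
        return list(range(start, end))
    # Strided: every row whose position modulo num_shards equals index.
    return list(range(index, num_rows, num_shards))


print(shard_indices(10, 3, 0))                   # [0, 3, 6, 9]
print(shard_indices(10, 3, 0, contiguous=True))  # [0, 1, 2, 3]
print(shard_indices(10, 3, 2, contiguous=True))  # [7, 8, 9]
```

Concatenating the contiguous shards in order of index reproduces the original row order, which is why that mode is convenient when the pieces are processed separately and joined again afterwards.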
to_tf_dataset
< source >( columns: typing.Union[str, typing.List[str]] batch_size: int shuffle: bool collate_fn: typing.Callable drop_remainder: bool = None collate_fn_args: typing.Dict[str, typing.Any] = None label_cols: typing.Union[str, typing.List[str]] = None dummy_labels: bool = False prefetch: bool = True )
Parameters
- columns (List[str] or str) — Dataset column(s) to load in the tf.data.Dataset. In general, only columns that the model can use as input should be included here (numeric data only).
- batch_size (int) — Size of batches to load from the dataset.
- shuffle (bool) — Shuffle the dataset order when loading. Recommended True for training, False for validation/evaluation.
- drop_remainder (bool, default None) — Drop the last incomplete batch when loading. If not provided, defaults to the same setting as shuffle.
- collate_fn (Callable) — A function or callable object (such as a DataCollator) that will collate lists of samples into a batch.
- collate_fn_args (Dict, optional) — An optional dict of keyword arguments to be passed to the collate_fn.
- label_cols (List[str] or str, default None) — Dataset column(s) to load as labels. Note that many models compute loss internally rather than letting Keras do it, in which case it is not necessary to actually pass the labels here, as long as they’re in the input columns.
- dummy_labels (bool, default False) — If no label_cols are set, output an array of “dummy” labels with each batch. This can avoid problems with fit() or train_on_batch() that expect labels to be a Tensor or np.ndarray, but should (hopefully) not be necessary with our standard train_step().
- prefetch (bool, default True) — Whether to run the dataloader in a separate thread and maintain a small buffer of batches for training. Improves performance by allowing data to be loaded in the background while the model is training.
Create a tf.data.Dataset from the underlying Dataset. This tf.data.Dataset will load and collate batches from the Dataset, and is suitable for passing to methods like model.fit() or model.predict().
push_to_hub
< source >( repo_id: str split: typing.Optional[str] = None private: typing.Optional[bool] = False token: typing.Optional[str] = None branch: typing.Optional[str] = None shard_size: typing.Optional[int] = 524288000 embed_external_files: bool = True )
Parameters
- repo_id (str) — The ID of the repository to push to in the following format: <user>/<dataset_name> or <org>/<dataset_name>. Also accepts <dataset_name>, which will default to the namespace of the logged-in user.
- split (Optional str) — The name of the split that will be given to that dataset. Defaults to self.split.
- private (Optional bool, defaults to False) — Whether the dataset repository should be set to private or not. Only affects repository creation: a repository that already exists will not be affected by that parameter.
- token (Optional str) — An optional authentication token for the Hugging Face Hub. If no token is passed, will default to the token saved locally when logging in with huggingface-cli login. Will raise an error if no token is passed and the user is not logged in.
- branch (Optional str) — The git branch on which to push the dataset. This defaults to the default branch as specified in your repository, which defaults to “main”.
- shard_size (Optional int) — The size of the dataset shards to be uploaded to the hub. The dataset will be pushed in files of the size specified here, in bytes. Defaults to a shard size of 500MB.
- embed_external_files (bool, default True) — Whether to embed file bytes in the shards. In particular, this will do the following before the push for the fields of type Audio and Image: remove local path information and embed file content in the Parquet files.
Pushes the dataset to the hub. The dataset is pushed using HTTP requests and does not require git or git-lfs to be installed.
Example:
>>> dataset.push_to_hub("<organization>/<dataset_id>", split="evaluation")
save_to_disk
< source >( dataset_path: str fs = None )
Parameters
- dataset_path (str) — Path (e.g. dataset/train) or remote URI (e.g. s3://my-bucket/dataset/train) of the dataset directory where the dataset will be saved to.
- fs (S3FileSystem, fsspec.spec.AbstractFileSystem, optional, defaults None) — Instance of the remote filesystem where the dataset will be saved to.
Saves a dataset to a dataset directory, or in a filesystem using either S3FileSystem or any implementation of fsspec.spec.AbstractFileSystem.
load_from_disk
< source >( dataset_path: str fs = None keep_in_memory: typing.Optional[bool] = None ) → Dataset or DatasetDict
Parameters
- dataset_path (str) — Path (e.g. “dataset/train”) or remote URI (e.g. “s3://my-bucket/dataset/train”) of the dataset directory where the dataset will be loaded from.
- fs (S3FileSystem, fsspec.spec.AbstractFileSystem, optional, default None) — Instance of the remote filesystem used to download the files from.
- keep_in_memory (bool, default None) — Whether to copy the dataset in-memory. If None, the dataset will not be copied in-memory unless explicitly enabled by setting datasets.config.IN_MEMORY_MAX_SIZE to nonzero. See more details in the load_dataset_enhancing_performance section.
Returns
- If dataset_path is a path of a dataset directory: the dataset requested.
- If dataset_path is a path of a dataset dict directory: a datasets.DatasetDict with each split.
Loads a dataset that was previously saved using save_to_disk from a dataset directory, or from a filesystem using either S3FileSystem or any implementation of fsspec.spec.AbstractFileSystem.
flatten_indices
< source >( keep_in_memory: bool = False cache_file_name: typing.Optional[str] = None writer_batch_size: typing.Optional[int] = 1000 features: typing.Optional[datasets.features.features.Features] = None disable_nullable: bool = False new_fingerprint: typing.Optional[str] = None )
Parameters
- keep_in_memory (bool, default False) — Keep the dataset in memory instead of writing it to a cache file.
- cache_file_name (str, optional, default None) — Provide the name of a path for the cache file. It is used to store the results of the computation instead of the automatically generated cache file name.
- writer_batch_size (int, default 1000) — Number of rows per write operation for the cache file writer. This value is a good trade-off between memory usage during the processing, and processing speed. A higher value makes the processing do fewer lookups; a lower value consumes less temporary memory while running .map().
- features (Optional[datasets.Features], default None) — Use a specific Features to store the cache file instead of the automatically generated one.
- disable_nullable (bool, default False) — Allow null values in the table.
- new_fingerprint (str, optional, default None) — The new fingerprint of the dataset after the transform. If None, the new fingerprint is computed using a hash of the previous fingerprint and the transform arguments.
Create and cache a new Dataset by flattening the indices mapping.
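What "flattening the indices mapping" means can be shown with a toy model. The sketch below assumes that an indices mapping is a list of row positions layered over unchanged stored data, as suggested by the cached indices that sort, shuffle, and shard mention. `TinyTable` is a hypothetical class for illustration and does not reflect how the library stores data.

```python
from typing import Dict, List, Optional


class TinyTable:
    """Toy stand-in for a dataset: columnar data plus an optional indices mapping."""

    def __init__(self, data: Dict[str, list], indices: Optional[List[int]] = None):
        self.data = data
        self.indices = indices  # None means "rows are read in storage order"

    def __len__(self) -> int:
        if self.indices is not None:
            return len(self.indices)
        return len(next(iter(self.data.values()), []))

    def row(self, i: int) -> dict:
        pos = self.indices[i] if self.indices is not None else i
        return {name: col[pos] for name, col in self.data.items()}

    def select(self, indices: List[int]) -> "TinyTable":
        # Cheap: only the mapping changes, the stored columns are shared.
        base = self.indices if self.indices is not None else range(len(self))
        return TinyTable(self.data, [base[i] for i in indices])

    def flatten_indices(self) -> "TinyTable":
        # Materialize the rows in mapped order and drop the mapping.
        if self.indices is None:
            return self
        data = {name: [col[p] for p in self.indices] for name, col in self.data.items()}
        return TinyTable(data, None)


table = TinyTable({"text": ["a", "b", "c", "d"]})
view = table.select([3, 1])
flat = view.flatten_indices()
print(view.row(0), flat.data)  # {'text': 'd'} {'text': ['d', 'b']}
```

In this model the flattened table reads rows directly from storage, at the cost of writing a new copy of the selected rows.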
to_csv
< source >( path_or_buf: typing.Union[str, bytes, os.PathLike, typing.BinaryIO] batch_size: typing.Optional[int] = None num_proc: typing.Optional[int] = None **to_csv_kwargs ) → int
Parameters
- path_or_buf (PathLike or FileOrBuffer) — Either a path to a file or a BinaryIO.
- batch_size (int, optional) — Size of the batch to load in memory and write at once. Defaults to datasets.config.DEFAULT_MAX_BATCH_SIZE.
- num_proc (int, optional) — Number of processes for multiprocessing. By default it doesn’t use multiprocessing. batch_size in this case defaults to datasets.config.DEFAULT_MAX_BATCH_SIZE but feel free to make it 5x or 10x of the default value if you have sufficient compute power.
- **to_csv_kwargs — Parameters to pass to pandas’s pandas.DataFrame.to_csv.
Returns
int
The number of characters or bytes written
Exports the dataset to CSV.
to_pandas
< source >( batch_size: typing.Optional[int] = None batched: bool = False )
Parameters
- batched (bool) — Set to True to return a generator that yields the dataset as batches of batch_size rows. Defaults to False (returns the whole dataset at once).
- batch_size (int, optional) — The size (number of rows) of the batches if batched is True. Defaults to datasets.config.DEFAULT_MAX_BATCH_SIZE.
Returns the dataset as a pandas.DataFrame. Can also return a generator for large datasets.
to_dict
< source >( batch_size: typing.Optional[int] = None batched: bool = False )
Parameters
- batched (bool) — Set to True to return a generator that yields the dataset as batches of batch_size rows. Defaults to False (returns the whole dataset at once).
- batch_size (int, optional) — The size (number of rows) of the batches if batched is True. Defaults to datasets.config.DEFAULT_MAX_BATCH_SIZE.
Returns the dataset as a Python dict. Can also return a generator for large datasets.
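The batched mode can be pictured as a generator over slices of the columns. The sketch below assumes the dict is columnar (column name mapped to a list of values); `iter_batches` is a hypothetical helper, not a library function.

```python
from typing import Dict, Iterator, List


def iter_batches(columns: Dict[str, List], batch_size: int) -> Iterator[Dict[str, List]]:
    """Yield successive slices of a columnar dict, `batch_size` rows at a time."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    num_rows = len(next(iter(columns.values()), []))
    for start in range(0, num_rows, batch_size):
        yield {name: values[start:start + batch_size] for name, values in columns.items()}


data = {"id": [0, 1, 2, 3, 4], "text": ["a", "b", "c", "d", "e"]}
for batch in iter_batches(data, batch_size=2):
    print(batch)
```

Only one batch is held at a time, which is the reason a generator is preferable for datasets that do not fit comfortably in memory.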
to_json
< source >( path_or_buf: typing.Union[str, bytes, os.PathLike, typing.BinaryIO] batch_size: typing.Optional[int] = None num_proc: typing.Optional[int] = None **to_json_kwargs ) → int
Parameters
- path_or_buf (PathLike or FileOrBuffer) — Either a path to a file or a BinaryIO.
- batch_size (int, optional) — Size of the batch to load in memory and write at once. Defaults to datasets.config.DEFAULT_MAX_BATCH_SIZE.
- num_proc (int, optional) — Number of processes for multiprocessing. By default it doesn’t use multiprocessing. batch_size in this case defaults to datasets.config.DEFAULT_MAX_BATCH_SIZE but feel free to make it 5x or 10x of the default value if you have sufficient compute power.
- lines (bool, default True) — Whether to output JSON Lines format. Only possible if orient="records". It will throw ValueError with orient different from "records", since the others are not list-like.
- orient (str, default "records") — Format of the JSON:
  - "records": list like [{column -> value}, … , {column -> value}]
  - "split": dict like {"index" -> [index], "columns" -> [columns], "data" -> [values]}
  - "index": dict like {index -> {column -> value}}
  - "columns": dict like {column -> {index -> value}}
  - "values": just the values array
  - "table": dict like {"schema": {schema}, "data": {data}}
- **to_json_kwargs — Parameters to pass to pandas’s pandas.DataFrame.to_json.
Returns
int
The number of characters or bytes written.
Export the dataset to JSON Lines or JSON.
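The difference between the "records" layout written as JSON Lines and as a single JSON array can be sketched with the standard library alone. `to_records` and `write_json` are hypothetical helpers, and a text buffer is used here for readability even though the method above documents a binary one.

```python
import io
import json
from typing import Dict, List


def to_records(columns: Dict[str, List]) -> List[dict]:
    """Convert a columnar dict into a list of row dicts (the "records" layout)."""
    names = list(columns)
    return [dict(zip(names, row)) for row in zip(*columns.values())]


def write_json(columns: Dict[str, List], buf: io.StringIO, lines: bool = True) -> int:
    """Write records as JSON Lines (lines=True) or as one JSON array; return characters written."""
    records = to_records(columns)
    if lines:
        text = "".join(json.dumps(rec) + "\n" for rec in records)
    else:
        text = json.dumps(records)
    return buf.write(text)


data = {"id": [1, 2], "text": ["a", "b"]}
buf = io.StringIO()
written = write_json(data, buf, lines=True)
print(buf.getvalue(), end="")
# {"id": 1, "text": "a"}
# {"id": 2, "text": "b"}
```

JSON Lines needs a list-like layout because each line must be a complete object on its own, which is consistent with lines=True being restricted to orient="records".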
to_parquet
< source >( path_or_buf: typing.Union[str, bytes, os.PathLike, typing.BinaryIO] batch_size: typing.Optional[int] = None **parquet_writer_kwargs ) → int
Parameters
- path_or_buf (PathLike or FileOrBuffer) — Either a path to a file or a BinaryIO.
- batch_size (int, optional) — Size of the batch to load in memory and write at once. Defaults to datasets.config.DEFAULT_MAX_BATCH_SIZE.
- **parquet_writer_kwargs — Parameters to pass to PyArrow’s pyarrow.parquet.ParquetWriter.
Returns
int
The number of characters or bytes written
Exports the dataset to Parquet.
add_faiss_index
< source >( column: str index_name: typing.Optional[str] = None device: typing.Optional[int] = None string_factory: typing.Optional[str] = None metric_type: typing.Optional[int] = None custom_index: typing.Optional[ForwardRef('faiss.Index')] = None train_size: typing.Optional[int] = None faiss_verbose: bool = False dtype = <class 'numpy.float32'> )
Parameters
- column (str) — The column of the vectors to add to the index.
- index_name (Optional str) — The index_name/identifier of the index. This is the index_name that is used to call datasets.Dataset.get_nearest_examples() or datasets.Dataset.search(). By default it corresponds to column.
- device (Optional Union[int, List[int]]) — If positive integer, this is the index of the GPU to use. If negative integer, use all GPUs. If a list of positive integers is passed in, run only on those GPUs. By default it uses the CPU.
- string_factory (Optional str) — This is passed to the index factory of Faiss to create the index. Default index class is IndexFlat.
- metric_type (Optional int) — Type of metric. Ex: faiss.METRIC_INNER_PRODUCT or faiss.METRIC_L2.
- custom_index (Optional faiss.Index) — Custom Faiss index that you already have instantiated and configured for your needs.
- train_size (Optional int) — If the index needs a training step, specifies how many vectors will be used to train the index.
- faiss_verbose (bool, defaults to False) — Enable the verbosity of the Faiss index.
- dtype (data-type) — The dtype of the numpy arrays that are indexed. Default is np.float32.
Add a dense index using Faiss for fast retrieval.
By default the index is done over the vectors of the specified column.
You can specify device if you want to run it on GPU (device must be the GPU index).
You can find more information about Faiss here:
- For string factory
Example:
>>> ds = datasets.load_dataset('crime_and_punish', split='train')
>>> ds_with_embeddings = ds.map(lambda example: {'embeddings': embed(example['line'])})
>>> ds_with_embeddings.add_faiss_index(column='embeddings')
>>> # query
>>> scores, retrieved_examples = ds_with_embeddings.get_nearest_examples('embeddings', embed('my new query'), k=10)
>>> # save index
>>> ds_with_embeddings.save_faiss_index('embeddings', 'my_index.faiss')
>>> ds = datasets.load_dataset('crime_and_punish', split='train')
>>> # load index
>>> ds.load_faiss_index('embeddings', 'my_index.faiss')
>>> # query
>>> scores, retrieved_examples = ds.get_nearest_examples('embeddings', embed('my new query'), k=10)
add_faiss_index_from_external_arrays
< source >( external_arrays: array index_name: str device: typing.Optional[int] = None string_factory: typing.Optional[str] = None metric_type: typing.Optional[int] = None custom_index: typing.Optional[ForwardRef('faiss.Index')] = None train_size: typing.Optional[int] = None faiss_verbose: bool = False dtype = <class 'numpy.float32'> )
Parameters
- external_arrays (np.array) — If you want to use arrays from outside the lib for the index, you can set external_arrays. It will use external_arrays to create the Faiss index instead of the arrays in the given column.
- index_name (str) — The index_name/identifier of the index. This is the index_name that is used to call datasets.Dataset.get_nearest_examples() or datasets.Dataset.search().
- device (Optional Union[int, List[int]]) — If positive integer, this is the index of the GPU to use. If negative integer, use all GPUs. If a list of positive integers is passed in, run only on those GPUs. By default it uses the CPU.
- string_factory (Optional str) — This is passed to the index factory of Faiss to create the index. Default index class is IndexFlat.
- metric_type (Optional int) — Type of metric. Ex: faiss.METRIC_INNER_PRODUCT or faiss.METRIC_L2.
- custom_index (Optional faiss.Index) — Custom Faiss index that you already have instantiated and configured for your needs.
- train_size (Optional int) — If the index needs a training step, specifies how many vectors will be used to train the index.
- faiss_verbose (bool, defaults to False) — Enable the verbosity of the Faiss index.
- dtype (numpy.dtype) — The dtype of the numpy arrays that are indexed. Default is np.float32.
Add a dense index using Faiss for fast retrieval. The index is created using the vectors of external_arrays. You can specify device if you want to run it on GPU (device must be the GPU index). You can find more information about Faiss here:
- For string factory
save_faiss_index
< source >( index_name: str file: typing.Union[str, pathlib.PurePath] )
Save a FaissIndex on disk.
load_faiss_index
< source >( index_name: str file: typing.Union[str, pathlib.PurePath] device: typing.Union[int, typing.List[int], NoneType] = None )
Parameters
- index_name (str) — The index_name/identifier of the index. This is the index_name that is used to call .get_nearest or .search.
- file (str) — The path to the serialized faiss index on disk.
- device (Optional Union[int, List[int]]) — If positive integer, this is the index of the GPU to use. If negative integer, use all GPUs. If a list of positive integers is passed in, run only on those GPUs. By default it uses the CPU.
Load a FaissIndex from disk.
If you want to do additional configurations, you can have access to the faiss index object by doing .get_index(index_name).faiss_index to make it fit your needs.
add_elasticsearch_index
< source >( column: str index_name: typing.Optional[str] = None host: typing.Optional[str] = None port: typing.Optional[int] = None es_client: typing.Optional[ForwardRef('elasticsearch.Elasticsearch')] = None es_index_name: typing.Optional[str] = None es_index_config: typing.Optional[dict] = None )
Parameters
- column (str) — The column of the documents to add to the index.
- index_name (Optional str) — The index_name/identifier of the index. This is the index name that is used to call Dataset.get_nearest_examples() or Dataset.search(). By default it corresponds to column.
- host (Optional str, defaults to localhost) — Host where ElasticSearch is running.
- port (Optional int, defaults to 9200) — Port where ElasticSearch is running.
- es_client (Optional elasticsearch.Elasticsearch) — The elasticsearch client used to create the index if host and port are None.
- es_index_name (Optional str) — The elasticsearch index name used to create the index.
- es_index_config (Optional dict) — The configuration of the elasticsearch index.
Add a text index using ElasticSearch for fast retrieval. This is done in-place.
Default config is:
{
    "settings": {
        "number_of_shards": 1,
        "analysis": {"analyzer": {"stop_standard": {"type": "standard", " stopwords": "_english_"}}},
    },
    "mappings": {
        "properties": {
            "text": {
                "type": "text",
                "analyzer": "standard",
                "similarity": "BM25"
            },
        }
    },
}
Example:
>>> es_client = elasticsearch.Elasticsearch()
>>> ds = datasets.load_dataset('crime_and_punish', split='train')
>>> ds.add_elasticsearch_index(column='line', es_client=es_client, es_index_name="my_es_index")
>>> scores, retrieved_examples = ds.get_nearest_examples('line', 'my new query', k=10)
load_elasticsearch_index
< source >( index_name: str es_index_name: str host: typing.Optional[str] = None port: typing.Optional[int] = None es_client: typing.Optional[ForwardRef('Elasticsearch')] = None es_index_config: typing.Optional[dict] = None )
Parameters
- index_name (str) — The index_name/identifier of the index. This is the index name that is used to call .get_nearest or .search.
- es_index_name (str) — The name of the elasticsearch index to load.
- host (Optional str, defaults to localhost) — Host where ElasticSearch is running.
- port (Optional int, defaults to 9200) — Port where ElasticSearch is running.
- es_client (Optional elasticsearch.Elasticsearch) — The elasticsearch client used to create the index if host and port are None.
- es_index_config (Optional dict) — The configuration of the elasticsearch index.
Load an existing text index using ElasticSearch for fast retrieval.
Default config is:
{
    "settings": {
        "number_of_shards": 1,
        "analysis": {"analyzer": {"stop_standard": {"type": "standard", " stopwords": "_english_"}}},
    },
    "mappings": {
        "properties": {
            "text": {
                "type": "text",
                "analyzer": "standard",
                "similarity": "BM25"
            },
        }
    },
}
List the index_name/identifiers of all the attached indexes.
get_index
< source >( index_name: str ) → BaseIndex
Get the attached index identified by index_name.
drop_index
< source >( index_name: str )
Drop the index with the specified column.
search
< source >( index_name: str query: typing.Union[str, <built-in function array>] k: int = 10 ) → scores (List[List[float]])
Parameters
- index_name (str) — The name/identifier of the index.
- query (Union[str, np.ndarray]) — The query as a string if index_name is a text index or as a numpy array if index_name is a vector index.
- k (int) — The number of examples to retrieve.
Returns
- scores (List[List[float]]) — The retrieval scores of the retrieved examples.
- indices (List[List[int]]) — The indices of the retrieved examples.
Find the nearest examples indices in the dataset to the query.
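For a vector index, the shape of the result (scores paired with row indices) can be illustrated by an exhaustive search in plain Python. This is a sketch of brute-force nearest-neighbour search under stated assumptions, not the Faiss or library implementation: `flat_l2_search` is a hypothetical helper, and it scores by Euclidean distance, whereas a real index may use another metric or score convention.

```python
import math
from typing import List, Sequence, Tuple


def flat_l2_search(
    vectors: Sequence[Sequence[float]], query: Sequence[float], k: int = 10
) -> Tuple[List[float], List[int]]:
    """Exhaustive nearest-neighbour search: return (scores, indices) of the k closest rows.

    Scores here are Euclidean distances, so smaller means closer.
    """
    scored = [(math.dist(vec, query), i) for i, vec in enumerate(vectors)]
    scored.sort()  # ascending distance; ties broken by row index
    top = scored[:k]
    return [score for score, _ in top], [i for _, i in top]


embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 4.0]]
scores, indices = flat_l2_search(embeddings, [0.0, 0.0], k=2)
print(scores, indices)  # [0.0, 1.0] [0, 1]
```

The returned indices are row positions in the dataset, so they can be used to look up the matching examples afterwards.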
search_batch
< source >( index_name: str queries: typing.Union[typing.List[str], <built-in function array>] k: int = 10 ) → total_scores (List[List[float]])
Parameters
- index_name (str) — The index_name/identifier of the index.
- queries (Union[List[str], np.ndarray]) — The queries as a list of strings if index_name is a text index or as a numpy array if index_name is a vector index.
- k (int) — The number of examples to retrieve per query.
Returns
- total_scores (List[List[float]]) — The retrieval scores of the retrieved examples per query.
- total_indices (List[List[int]]) — The indices of the retrieved examples per query.
Find the nearest examples indices in the dataset to each query.
get_nearest_examples
< source >( index_name: str query: typing.Union[str, <built-in function array>] k: int = 10 ) → scores (List[float])
Parameters
- index_name (str) — The index_name/identifier of the index.
- query (Union[str, np.ndarray]) — The query as a string if index_name is a text index or as a numpy array if index_name is a vector index.
- k (int) — The number of examples to retrieve.
Returns
- scores (List[float]) — The retrieval scores of the retrieved examples.
- examples (dict) — The retrieved examples.
Find the nearest examples in the dataset to the query.
get_nearest_examples_batch
< source >( index_name: str queries: typing.Union[typing.List[str], <built-in function array>] k: int = 10 ) → total_scores (List[List[float]])
Parameters
- index_name (str) — The index_name/identifier of the index.
- queries (Union[List[str], np.ndarray]) — The queries as a list of strings if index_name is a text index or as a numpy array if index_name is a vector index.
- k (int) — The number of examples to retrieve per query.
Returns
- total_scores (List[List[float]]) — The retrieval scores of the retrieved examples per query.
- total_examples (List[dict]) — The retrieved examples per query.
Find the nearest examples in the dataset to each query.
datasets.DatasetInfo object containing all the metadata in the dataset.
datasets.NamedSplit object corresponding to a named dataset split.
from_csv
< source >( path_or_paths: typing.Union[str, bytes, os.PathLike, typing.List[typing.Union[str, bytes, os.PathLike]]] split: typing.Optional[datasets.splits.NamedSplit] = None features: typing.Optional[datasets.features.features.Features] = None cache_dir: str = None keep_in_memory: bool = False **kwargs ) → Dataset
Parameters
- path_or_paths (path-like or list of path-like) — Path(s) of the CSV file(s).
- split (NamedSplit, optional) — Split name to be assigned to the dataset.
- features (Features, optional) — Dataset features.
- cache_dir (str, optional, default "~/.cache/huggingface/datasets") — Directory to cache data.
- keep_in_memory (bool, default False) — Whether to copy the data in-memory.
- **kwargs — Keyword arguments to be passed to pandas.read_csv.
Returns
Dataset
Create Dataset from CSV file(s).
from_json
< source >( path_or_paths: typing.Union[str, bytes, os.PathLike, typing.List[typing.Union[str, bytes, os.PathLike]]] split: typing.Optional[datasets.splits.NamedSplit] = None features: typing.Optional[datasets.features.features.Features] = None cache_dir: str = None keep_in_memory: bool = False field: typing.Optional[str] = None **kwargs ) → Dataset
Parameters
- path_or_paths (path-like or list of path-like) — Path(s) of the JSON or JSON Lines file(s).
- split (NamedSplit, optional) — Split name to be assigned to the dataset.
- features (Features, optional) — Dataset features.
- cache_dir (str, optional, default "~/.cache/huggingface/datasets") — Directory to cache data.
- keep_in_memory (bool, default False) — Whether to copy the data in-memory.
- field (str, optional) — Field name of the JSON file where the dataset is contained in.
- **kwargs — Keyword arguments to be passed to JsonConfig.
Returns
Dataset
Create Dataset from JSON or JSON Lines file(s).
from_parquet
< source >( path_or_paths: typing.Union[str, bytes, os.PathLike, typing.List[typing.Union[str, bytes, os.PathLike]]] split: typing.Optional[datasets.splits.NamedSplit] = None features: typing.Optional[datasets.features.features.Features] = None cache_dir: str = None keep_in_memory: bool = False columns: typing.Optional[typing.List[str]] = None **kwargs ) → Dataset
Parameters
- path_or_paths (path-like or list of path-like) — Path(s) of the Parquet file(s).
- split (NamedSplit, optional) — Split name to be assigned to the dataset.
- features (Features, optional) — Dataset features.
- cache_dir (str, optional, default "~/.cache/huggingface/datasets") — Directory to cache data.
- keep_in_memory (bool, default False) — Whether to copy the data in-memory.
- columns (List[str], optional) — If not None, only these columns will be read from the file. A column name may be a prefix of a nested field, e.g. ‘a’ will select ‘a.b’, ‘a.c’, and ‘a.d.e’.
- **kwargs — Keyword arguments to be passed to ParquetConfig.
Returns
Dataset
Create Dataset from Parquet file(s).
from_text
< source >( path_or_paths: typing.Union[str, bytes, os.PathLike, typing.List[typing.Union[str, bytes, os.PathLike]]] split: typing.Optional[datasets.splits.NamedSplit] = None features: typing.Optional[datasets.features.features.Features] = None cache_dir: str = None keep_in_memory: bool = False **kwargs ) → Dataset
Parameters
- path_or_paths (path-like or list of path-like) — Path(s) of the text file(s).
- split (NamedSplit, optional) — Split name to be assigned to the dataset.
- features (Features, optional) — Dataset features.
- cache_dir (str, optional, default "~/.cache/huggingface/datasets") — Directory to cache data.
- keep_in_memory (bool, default False) — Whether to copy the data in-memory.
- **kwargs — Keyword arguments to be passed to TextConfig.
Returns
Dataset
Create Dataset from text file(s).
prepare_for_task
< source >( task: typing.Union[str, datasets.tasks.base.TaskTemplate] id: int = 0 )
Parameters
- task (Union[str, TaskTemplate]) — The task to prepare the dataset for during training and evaluation. If str, supported tasks include:
  - "text-classification"
  - "question-answering"
  If TaskTemplate, must be one of the task templates in datasets.tasks.
- id (int, defaults to 0) — The id required to unambiguously identify the task template when multiple task templates of the same type are supported.
Prepare a dataset for the given task by casting the dataset’s Features to standardized column names and types as detailed in datasets.tasks.
Casts datasets.DatasetInfo.features according to a task-specific schema. Intended for single-use only, so all task templates are removed from datasets.DatasetInfo.task_templates after casting.
align_labels_with_mapping
< source >( label2id: typing.Dict label_column: str )
Align the dataset’s label ID and label name mapping to match an input label2id mapping.
This is useful when you want to ensure that a model’s predicted labels are aligned with the dataset.
The alignment is done using the lowercase label names.
Example:
>>> # dataset with mapping {'entailment': 0, 'neutral': 1, 'contradiction': 2}
>>> ds = load_dataset("glue", "mnli", split="train")
>>> # mapping to align with
>>> label2id = {'CONTRADICTION': 0, 'NEUTRAL': 1, 'ENTAILMENT': 2}
>>> ds_aligned = ds.align_labels_with_mapping(label2id, "label")
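The effect of matching on lowercase label names can be sketched in plain Python, without the library. The helper name `remap_label_ids` and the sample label ids are made up for illustration; this is not the library's implementation, which works on the dataset's ClassLabel feature.

```python
def remap_label_ids(dataset_label2id, target_label2id, label_ids):
    """Re-express label ids from the dataset's mapping in the target mapping.

    Label names are compared in lowercase, as described above.
    """
    # id -> lowercase name, according to the dataset's own mapping
    id2name = {idx: name.lower() for name, idx in dataset_label2id.items()}
    # lowercase name -> id, according to the mapping to align with
    target = {name.lower(): idx for name, idx in target_label2id.items()}
    return [target[id2name[idx]] for idx in label_ids]


dataset_label2id = {"entailment": 0, "neutral": 1, "contradiction": 2}
label2id = {"CONTRADICTION": 0, "NEUTRAL": 1, "ENTAILMENT": 2}

# Hypothetical label column values, expressed in the dataset's ids
print(remap_label_ids(dataset_label2id, label2id, [0, 1, 2, 2]))  # [2, 1, 0, 0]
```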
datasets.concatenate_datasets
< source >( dsets: typing.List[datasets.arrow_dataset.Dataset] info: typing.Optional[typing.Any] = None split: typing.Optional[typing.Any] = None axis: int = 0 )
Parameters
- dsets (List[datasets.Dataset]) — List of Datasets to concatenate.
- info (DatasetInfo, optional) — Dataset information, like description, citation, etc.
- split (NamedSplit, optional) — Name of the dataset split.
- axis ({0, 1}, default 0, meaning over rows) — Axis to concatenate over, where 0 means over rows (vertically) and 1 means over columns (horizontally). New in version 1.6.0.
Converts a list of Dataset with the same schema into a single Dataset.
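To make the two axis values concrete, here is a plain-Python sketch that treats each table as a dict of column lists. The helper `concat_columns` is hypothetical and only mimics the row-wise versus column-wise idea; it does not use or reproduce the library's Arrow-based implementation.

```python
def concat_columns(tables, axis=0):
    """Concatenate dict-of-lists tables over rows (axis=0) or columns (axis=1)."""
    if axis == 0:
        # Over rows: every table has the same column names; values are appended.
        names = list(tables[0])
        return {name: [v for t in tables for v in t[name]] for name in names}
    # Over columns: tables have the same number of rows and distinct column names.
    merged = {}
    for t in tables:
        merged.update(t)
    return merged


t1 = {"a": [0, 1]}
t2 = {"a": [2, 3]}
t3 = {"b": ["x", "y"]}

print(concat_columns([t1, t2], axis=0))  # {'a': [0, 1, 2, 3]}
print(concat_columns([t1, t3], axis=1))  # {'a': [0, 1], 'b': ['x', 'y']}
```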
datasets.interleave_datasets
< source >( datasets: typing.List[~DatasetType] probabilities: typing.Optional[typing.List[float]] = None seed: typing.Optional[int] = None ) → Dataset or IterableDataset
Parameters
- datasets (List[Dataset] or List[IterableDataset]) — List of datasets to interleave.
- probabilities (List[float], optional, default None) — If specified, the new dataset is constructed by sampling examples from one source at a time according to these probabilities.
- seed (int, optional, default None) — The random seed used to choose a source for each example.
- **kwargs — For map-style datasets: keyword arguments to be passed to Dataset.select() when selecting the indices used to interleave the datasets.
Returns
Return type depends on the input datasets parameter. Dataset if the input is a list of Dataset, IterableDataset if the input is a list of IterableDataset.
Interleave several datasets (sources) into a single dataset. The new dataset is constructed by alternating between the sources to get the examples.
You can use this function on a list of Dataset objects, or on a list of IterableDataset objects.
If probabilities is None (default), the new dataset is constructed by cycling between each source to get the examples.
If probabilities is not None, the new dataset is constructed by getting examples from a random source at a time according to the provided probabilities.
The resulting dataset ends when one of the source datasets runs out of examples.
Example:
For regular datasets (map-style):
>>> from datasets import Dataset, interleave_datasets
>>> d1 = Dataset.from_dict({"a": [0, 1, 2]})
>>> d2 = Dataset.from_dict({"a": [10, 11, 12]})
>>> d3 = Dataset.from_dict({"a": [20, 21, 22]})
>>> dataset = interleave_datasets([d1, d2, d3])
>>> dataset["a"]
[0, 10, 20, 1, 11, 21, 2, 12, 22]
>>> dataset = interleave_datasets([d1, d2, d3], probabilities=[0.7, 0.2, 0.1], seed=42)
>>> dataset["a"]
[10, 0, 11, 1, 2, 20, 12]
For datasets in streaming mode (iterable):
>>> from datasets import load_dataset, interleave_datasets
>>> d1 = load_dataset("oscar", "unshuffled_deduplicated_en", split="train", streaming=True)
>>> d2 = load_dataset("oscar", "unshuffled_deduplicated_fr", split="train", streaming=True)
>>> dataset = interleave_datasets([d1, d2])
>>> iterator = iter(dataset)
>>> next(iterator)
{'text': 'Mtendere Village was inspired by the vision...
>>> next(iterator)
{'text': "Média de débat d'idées, de culture...
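The cycling behavior used when probabilities is None can be sketched with plain lists. The helper `round_robin` is hypothetical; it keeps only complete rounds, which is an assumption of this sketch about the stopping rule rather than a statement of the library's exact behavior when sources have different lengths.

```python
def round_robin(sources):
    """Alternate between the sources, one example from each per round."""
    # zip stops at the shortest source, so only complete rounds are kept
    return [item for group in zip(*sources) for item in group]


print(round_robin([[0, 1, 2], [10, 11, 12], [20, 21, 22]]))
# [0, 10, 20, 1, 11, 21, 2, 12, 22]

# In this sketch, the shorter second source ends the result early
print(round_robin([[0, 1, 2], [10, 11]]))
# [0, 10, 1, 11]
```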
When applying transforms on a dataset, the data are stored in cache files. The caching mechanism makes it possible to reload an existing cache file if it has already been computed.
Reloading a dataset is possible since the cache files are named using the dataset fingerprint, which is updated after each transform.
If disabled, the library will no longer reload cached dataset files when applying transforms to the datasets. More precisely, if caching is disabled:
- cache files are always recreated
- cache files are written to a temporary directory that is deleted when the session closes
- cache files are named using a random hash instead of the dataset fingerprint
- use datasets.Dataset.save_to_disk() to save a transformed dataset or it will be deleted when the session closes
- caching doesn’t affect datasets.load_dataset(). If you want to regenerate a dataset from scratch you should use the download_mode parameter in datasets.load_dataset().
DatasetDict
Dictionary with split names as keys (‘train’, ‘test’ for example), and datasets.Dataset objects as values.
It also has dataset transform methods like map or filter, to process all the splits at once.
A dictionary (dict of str: datasets.Dataset) with dataset transform methods (map, filter, etc.).
The Apache Arrow tables backing each split.
The cache files containing the Apache Arrow table backing each split.
Number of columns in each split of the dataset.
Number of rows in each split of the dataset (same as datasets.Dataset.__len__()).
Names of the columns in each split of the dataset.
Shape of each split of the dataset (number of rows, number of columns).
unique
< source >( column: str ) → Dict[str, list]
Parameters
- column (str) — Column name (list all the column names with datasets.Dataset.column_names).
Returns
Dict[str, list]
Dictionary of unique elements in the given column.
Return a list of the unique elements in a column for each split.
This is implemented in the low-level backend and as such, very fast.
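The shape of the return value, one list of unique values per split, can be illustrated with a plain-Python sketch. The helper `unique_per_split` and the sample data are hypothetical, and the ordering shown is that of this sketch only; the library's backend is not used here.

```python
def unique_per_split(splits, column):
    """Return {split name: unique values of `column`} for dict-of-lists splits."""
    # dict.fromkeys drops duplicates while keeping first-seen order
    return {name: list(dict.fromkeys(rows[column])) for name, rows in splits.items()}


splits = {
    "train": {"label": [0, 1, 1, 0, 2]},
    "test": {"label": [1, 1]},
}

print(unique_per_split(splits, "label"))  # {'train': [0, 1, 2], 'test': [1]}
```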
Clean up all cache files in the dataset cache directory, except the currently used cache file if there is one. Be careful when running this command that no other process is currently using other cache files.
map
< source >( function: typing.Optional[typing.Callable] = None with_indices: bool = False with_rank: bool = False input_columns: typing.Union[str, typing.List[str], NoneType] = None batched: bool = False batch_size: typing.Optional[int] = 1000 drop_last_batch: bool = False remove_columns: typing.Union[str, typing.List[str], NoneType] = None keep_in_memory: bool = False load_from_cache_file: bool = True cache_file_names: typing.Union[typing.Dict[str, typing.Optional[str]], NoneType] = None writer_batch_size: typing.Optional[int] = 1000 features: typing.Optional[datasets.features.features.Features] = None disable_nullable: bool = False fn_kwargs: typing.Optional[dict] = None num_proc: typing.Optional[int] = None desc: typing.Optional[str] = None )
Parameters
- function (callable) — A function with one of the following signatures:
  - function(example: Dict) -> Union[Dict, Any] if batched=False and with_indices=False
  - function(example: Dict, indices: int) -> Union[Dict, Any] if batched=False and with_indices=True
  - function(batch: Dict[List]) -> Union[Dict, Any] if batched=True and with_indices=False
  - function(batch: Dict[List], indices: List[int]) -> Union[Dict, Any] if batched=True and with_indices=True
- with_indices (bool, defaults to False) — Provide example indices to function. Note that in this case the signature of function should be def function(example, idx): ….
- with_rank (bool, default False) — Provide process rank to function. Note that in this case the signature of function should be def function(example[, idx], rank): ….
- input_columns (Optional[Union[str, List[str]]], defaults to None) — The columns to be passed into function as positional arguments. If None, a dict mapping to all formatted columns is passed as one argument.
- batched (bool, defaults to False) — Provide batches of examples to function.
- batch_size (int, optional, defaults to 1000) — Number of examples per batch provided to function if batched=True. If batch_size <= 0 or batch_size == None, the full dataset is provided as a single batch to function.
- drop_last_batch (bool, default False) — Whether a last batch smaller than the batch_size should be dropped instead of being processed by the function.
- remove_columns (Optional[Union[str, List[str]]], defaults to None) — Remove a selection of columns while doing the mapping. Columns will be removed before updating the examples with the output of function, i.e. if function is adding columns with names in remove_columns, these columns will be kept.
- keep_in_memory (bool, defaults to False) — Keep the dataset in memory instead of writing it to a cache file.
- load_from_cache_file (bool, defaults to True) — If a cache file storing the current computation from function can be identified, use it instead of recomputing.
- cache_file_names (Optional[Dict[str, str]], defaults to None) — Provide the name of a path for the cache file. It is used to store the results of the computation instead of the automatically generated cache file name. You have to provide one cache_file_name per dataset in the dataset dictionary.
- writer_batch_size (int, default 1000) — Number of rows per write operation for the cache file writer. This value is a good trade-off between memory usage during the processing, and processing speed. A higher value makes the processing do fewer lookups; a lower value consumes less temporary memory while running .map().
- features (Optional[datasets.Features], defaults to None) — Use a specific Features to store the cache file instead of the automatically generated one.
- disable_nullable (bool, defaults to False) — Disallow null values in the table.
- fn_kwargs (Dict, optional, defaults to None) — Keyword arguments to be passed to function.
- num_proc (int, optional, defaults to None) — Number of processes for multiprocessing. By default it doesn’t use multiprocessing.
- desc (str, optional, defaults to None) — Meaningful description to be displayed alongside the progress bar while mapping examples.
Apply a function to all the elements in the table (individually or in batches) and update the table (if function does update examples). The transformation is applied to all the datasets of the dataset dictionary.
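As a rough illustration of the function signatures listed in the parameters, the sketch below defines a per-example function and a batched function with indices, then calls them directly on plain Python dicts. The function names, the "text" column, and the added "length" and "idx" columns are assumptions made for this example.

```python
def add_length(example):
    # batched=False, with_indices=False: receives one example as a dict
    return {"length": len(example["text"])}


def add_length_batched(batch, indices):
    # batched=True, with_indices=True: receives a dict of lists and a list of indices
    return {
        "length": [len(text) for text in batch["text"]],
        "idx": list(indices),
    }


example = {"text": "hello"}
batch = {"text": ["hello", "hi"]}

print(add_length(example))                # {'length': 5}
print(add_length_batched(batch, [0, 1]))  # {'length': [5, 2], 'idx': [0, 1]}
```

With a DatasetDict (here a hypothetical variable `dataset_dict`), these would be passed as `dataset_dict.map(add_length)` and `dataset_dict.map(add_length_batched, batched=True, with_indices=True)`.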
filter
< source >( function with_indices = False input_columns: typing.Union[str, typing.List[str], NoneType] = None batched: bool = False batch_size: typing.Optional[int] = 1000 keep_in_memory: bool = False load_from_cache_file: bool = True cache_file_names: typing.Union[typing.Dict[str, typing.Optional[str]], NoneType] = None writer_batch_size: typing.Optional[int] = 1000 fn_kwargs: typing.Optional[dict] = None num_proc: typing.Optional[int] = None desc: typing.Optional[str] = None )
Parameters
- function (callable) — A function with one of the following signatures:
  - function(example: Union[Dict, Any]) -> bool if with_indices=False, batched=False
  - function(example: Union[Dict, Any], indices: int) -> bool if with_indices=True, batched=False
  - function(example: Union[Dict, Any]) -> List[bool] if with_indices=False, batched=True
  - function(example: Union[Dict, Any], indices: List[int]) -> List[bool] if with_indices=True, batched=True
- with_indices (bool, defaults to False) — Provide example indices to function. Note that in this case the signature of function should be def function(example, idx): ….
- input_columns (Optional[Union[str, List[str]]], defaults to None) — The columns to be passed into function as positional arguments. If None, a dict mapping to all formatted columns is passed as one argument.
- batched (bool, defaults to False) — Provide batches of examples to function.
- batch_size (int, optional, defaults to 1000) — Number of examples per batch provided to function if batched=True. If batch_size <= 0 or batch_size == None, the full dataset is provided as a single batch to function.
- keep_in_memory (bool, defaults to False) — Keep the dataset in memory instead of writing it to a cache file.
- load_from_cache_file (bool, defaults to True) — If a cache file storing the current computation from function can be identified, use it instead of recomputing.
- cache_file_names (Optional[Dict[str, str]], defaults to None) — Provide the name of a path for the cache file. It is used to store the results of the computation instead of the automatically generated cache file name. You have to provide one cache_file_name per dataset in the dataset dictionary.
- writer_batch_size (int, default 1000) — Number of rows per write operation for the cache file writer. This value is a good trade-off between memory usage during the processing, and processing speed. A higher value makes the processing do fewer lookups; a lower value consumes less temporary memory while running .map().
- fn_kwargs (Dict, optional, defaults to None) — Keyword arguments to be passed to function.
- num_proc (int, optional, defaults to None) — Number of processes for multiprocessing. By default it doesn’t use multiprocessing.
- desc (str, optional, defaults to None) — Meaningful description to be displayed alongside the progress bar while filtering examples.
Apply a filter function to all the elements in the table in batches and update the table so that the dataset only includes examples according to the filter function. The transformation is applied to all the datasets of the dataset dictionary.
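The difference between the non-batched and batched predicate signatures can be sketched by calling two predicates on plain dicts. The names `is_long` and `is_long_batched`, the "text" column, and the length threshold are assumptions for illustration only.

```python
def is_long(example):
    # batched=False: one example in, one bool out
    return len(example["text"]) > 3


def is_long_batched(batch):
    # batched=True: a dict of lists in, one bool per example out
    return [len(text) > 3 for text in batch["text"]]


print(is_long({"text": "hello"}))                  # True
print(is_long_batched({"text": ["hello", "hi"]}))  # [True, False]
```

With a hypothetical `dataset_dict`, these would be passed as `dataset_dict.filter(is_long)` and `dataset_dict.filter(is_long_batched, batched=True)`.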
sort
< source >( column: str reverse: bool = False kind: str = None null_placement: str = 'last' keep_in_memory: bool = False load_from_cache_file: bool = True indices_cache_file_names: typing.Union[typing.Dict[str, typing.Optional[str]], NoneType] = None writer_batch_size: typing.Optional[int] = 1000 )
Parameters
- column (str) — Column name to sort by.
- reverse (bool, default False) — If True, sort by descending order rather than ascending.
- kind (str, optional) — Pandas algorithm for sorting selected in {‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’}. The default is ‘quicksort’. Note that both ‘stable’ and ‘mergesort’ use timsort under the covers and, in general, the actual implementation will vary with data type. The ‘mergesort’ option is retained for backwards compatibility.
- null_placement (str, default last) — Put None values at the beginning if ‘first’; ‘last’ puts None values at the end. New in version 1.14.2.
- keep_in_memory (bool, default False) — Keep the sorted indices in memory instead of writing them to a cache file.
- load_from_cache_file (bool, default True) — If a cache file storing the sorted indices can be identified, use it instead of recomputing.
- indices_cache_file_names (Optional[Dict[str, str]], defaults to None) — Provide the name of a path for the cache file. It is used to store the indices mapping instead of the automatically generated cache file name. You have to provide one cache_file_name per dataset in the dataset dictionary.
- writer_batch_size (int, default 1000) — Number of rows per write operation for the cache file writer. A higher value gives smaller cache files; a lower value consumes less temporary memory.
Create a new dataset sorted according to a column. The transformation is applied to all the datasets of the dataset dictionary.
Currently sorting according to a column name uses pandas sorting algorithm under the hood. The column should thus be a pandas compatible type (in particular not a nested type). This also means that the column used for sorting is fully loaded in memory (which should be fine in most cases).
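The interaction of reverse and null_placement can be pictured with a small plain-Python sketch. The helper `sort_with_nulls` is hypothetical, and it assumes that None values are placed independently of the sort direction; the text above does not confirm how the library combines the two options.

```python
def sort_with_nulls(values, reverse=False, null_placement="last"):
    """Sort values, putting None entries first or last regardless of direction."""
    present = sorted((v for v in values if v is not None), reverse=reverse)
    nulls = [v for v in values if v is None]
    return nulls + present if null_placement == "first" else present + nulls


print(sort_with_nulls([3, None, 1, 2]))
# [1, 2, 3, None]
print(sort_with_nulls([3, None, 1, 2], reverse=True, null_placement="first"))
# [None, 3, 2, 1]
```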
shuffle
< source >( seeds: typing.Union[int, typing.Dict[str, typing.Optional[int]], NoneType] = None seed: typing.Optional[int] = None generators: typing.Union[typing.Dict[str, numpy.random._generator.Generator], NoneType] = None keep_in_memory: bool = False load_from_cache_file: bool = True indices_cache_file_names: typing.Union[typing.Dict[str, typing.Optional[str]], NoneType] = None writer_batch_size: typing.Optional[int] = 1000 )
Parameters
- seeds (Dict[str, int] or int, optional) — A seed to initialize the default BitGenerator if generator=None. If None, then fresh, unpredictable entropy will be pulled from the OS. If an int or array_like[ints] is passed, then it will be passed to SeedSequence to derive the initial BitGenerator state. You can provide one seed per dataset in the dataset dictionary.
- seed (Optional int) — A seed to initialize the default BitGenerator if generator=None. Alias for seeds (the seed argument has priority over seeds if both arguments are provided).
- generators (Optional Dict[str, np.random.Generator]) — NumPy random Generator to use to compute the permutation of the dataset rows. If generator=None (default), uses np.random.default_rng (the default BitGenerator (PCG64) of NumPy). You have to provide one generator per dataset in the dataset dictionary.
- keep_in_memory (bool, defaults to False) — Keep the dataset in memory instead of writing it to a cache file.
- load_from_cache_file (bool, defaults to True) — If a cache file storing the current computation from function can be identified, use it instead of recomputing.
- indices_cache_file_names (Dict[str, str], optional) — Provide the name of a path for the cache file. It is used to store the indices mappings instead of the automatically generated cache file name. You have to provide one cache_file_name per dataset in the dataset dictionary.
- writer_batch_size (int, default 1000) — Number of rows per write operation for the cache file writer. This value is a good trade-off between memory usage during the processing, and processing speed. A higher value makes the processing do fewer lookups; a lower value consumes less temporary memory while running .map().
Create a new Dataset where the rows are shuffled.
The transformation is applied to all the datasets of the dataset dictionary.
Currently shuffling uses numpy random generators. You can either supply a NumPy BitGenerator to use, or a seed to initiate NumPy’s default random generator (PCG64).
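The idea of one seed per split can be sketched in plain Python. The helper `shuffle_splits` is hypothetical and uses the standard library's random.Random as a stand-in for a NumPy Generator, so the permutations it produces will not match those of the library.

```python
import random


def shuffle_splits(splits, seeds):
    """Shuffle each split's rows with its own seeded generator."""
    shuffled = {}
    for name, rows in splits.items():
        rng = random.Random(seeds[name])  # stand-in for a per-split NumPy Generator
        order = list(range(len(rows)))
        rng.shuffle(order)  # a permutation of the row indices
        shuffled[name] = [rows[i] for i in order]
    return shuffled


splits = {"train": [0, 1, 2, 3, 4], "test": [10, 11, 12]}
seeds = {"train": 42, "test": 7}

first = shuffle_splits(splits, seeds)
second = shuffle_splits(splits, seeds)
print(first == second)  # True: the same seeds give the same permutations
```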
set_format
< source >( type: typing.Optional[str] = None columns: typing.Optional[typing.List] = None output_all_columns: bool = False **format_kwargs )
Parameters
- type (str, optional) — Output type selected in [None, ‘numpy’, ‘torch’, ‘tensorflow’, ‘pandas’, ‘arrow’]. None means __getitem__ returns python objects (default).
- columns (List[str], optional) — Columns to format in the output. None means __getitem__ returns all columns (default).
- output_all_columns (bool, default to False) — Keep un-formatted columns as well in the output (as python objects).
- **format_kwargs — Keyword arguments passed to the convert function like np.array, torch.tensor or tensorflow.ragged.constant.
Set __getitem__ return format (type and columns). The format is set for every dataset in the dataset dictionary.
It is possible to call map after calling set_format. Since map may add new columns, the list of formatted columns gets updated. In this case, if you apply map on a dataset to add a new column, then this column will be formatted:
new formatted columns = (all columns - previously unformatted columns)
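The rule above is plain set arithmetic, which the following sketch works through. The column names are made up: "label" is assumed to be the only formatted column, and "length" is assumed to be the column that map adds.

```python
all_columns_before = {"text", "label"}
formatted = {"label"}  # e.g. after set_format(columns=["label"])

# Columns that were left unformatted before calling map
previously_unformatted = all_columns_before - formatted  # {"text"}

# map adds a new column, so the set of all columns grows
all_columns_after = all_columns_before | {"length"}

# new formatted columns = (all columns - previously unformatted columns)
new_formatted = all_columns_after - previously_unformatted
print(sorted(new_formatted))  # ['label', 'length']
```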
Reset __getitem__ return format to python objects and all columns.
The transformation is applied to all the datasets of the dataset dictionary.
Same as self.set_format()
formatted_as
< source >( type: typing.Optional[str] = None columns: typing.Optional[typing.List] = None output_all_columns: bool = False **format_kwargs )
Parameters
- type (str, optional) — Output type selected in [None, ‘numpy’, ‘torch’, ‘tensorflow’, ‘pandas’, ‘arrow’]. None means __getitem__ returns python objects (default).
- columns (List[str], optional) — Columns to format in the output. None means __getitem__ returns all columns (default).
- output_all_columns (bool, default to False) — Keep un-formatted columns as well in the output (as python objects).
- **format_kwargs — Keyword arguments passed to the convert function like np.array, torch.tensor or tensorflow.ragged.constant.
To be used in a with statement. Set __getitem__ return format (type and columns).
The transformation is applied to all the datasets of the dataset dictionary.
with_format
< source >( type: typing.Optional[str] = None columns: typing.Optional[typing.List] = None output_all_columns: bool = False **format_kwargs )
Parameters
- type (str, optional) — Output type selected in [None, ‘numpy’, ‘torch’, ‘tensorflow’, ‘pandas’, ‘arrow’]. None means __getitem__ returns python objects (default).
- columns (List[str], optional) — Columns to format in the output. None means __getitem__ returns all columns (default).
- output_all_columns (bool, default to False) — Keep un-formatted columns as well in the output (as python objects).
- **format_kwargs — Keyword arguments passed to the convert function like np.array, torch.tensor or tensorflow.ragged.constant.
Set __getitem__ return format (type and columns). The data formatting is applied on-the-fly. The format type (for example “numpy”) is used to format batches when using __getitem__. The format is set for every dataset in the dataset dictionary.
It’s also possible to use custom transforms for formatting using datasets.Dataset.with_transform().
Contrary to datasets.DatasetDict.set_format(), with_format returns a new DatasetDict object with new Dataset objects.
with_transform
< source >( transform: typing.Optional[typing.Callable] columns: typing.Optional[typing.List] = None output_all_columns: bool = False )
Parameters
- transform (Callable, optional) — User-defined formatting transform; replaces the format defined by datasets.Dataset.set_format(). A formatting function is a callable that takes a batch (as a dict) as input and returns a batch. This function is applied right before returning the objects in __getitem__.
- columns (List[str], optional) — Columns to format in the output. If specified, then the input batch of the transform only contains those columns.
- output_all_columns (bool, default to False) — Keep un-formatted columns as well in the output (as python objects). If set to True, then the other un-formatted columns are kept with the output of the transform.
Set __getitem__ return format using this transform. The transform is applied on-the-fly on batches when __getitem__ is called. The transform is set for every dataset in the dataset dictionary.
As with datasets.Dataset.set_format(), this can be reset using datasets.Dataset.reset_format().
Contrary to datasets.DatasetDict.set_transform(), with_transform returns a new DatasetDict object with new Dataset objects.
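A formatting transform is just a callable that takes a batch (a dict of lists) and returns a batch. The sketch below defines one and calls it directly on a plain dict; the name `uppercase_text` and the "text" column are assumptions for illustration.

```python
def uppercase_text(batch):
    # Receives a batch as a dict of lists and returns a batch of the same shape
    return {"text": [text.upper() for text in batch["text"]]}


print(uppercase_text({"text": ["hello", "hi"]}))  # {'text': ['HELLO', 'HI']}
```

With a hypothetical `dataset_dict`, it would be attached with `dataset_dict.with_transform(uppercase_text, columns=["text"])`, so that the input batch of the transform contains only the "text" column.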
Flatten the Apache Arrow Table of each split (nested features are flattened). Each column with a struct type is flattened into one column per struct field. Other columns are left unchanged.
cast
< source >( features: Features )
Parameters
- features (datasets.Features) — New features to cast the dataset to. The name and order of the fields in the features must match the current column names. The type of the data must also be convertible from one type to the other. For non-trivial conversion, e.g. string <-> ClassLabel, you should use map to update the Dataset.
Cast the dataset to a new set of features. The transformation is applied to all the datasets of the dataset dictionary.
You can also cast a dataset using Dataset.map() with features, but cast_ is in-place (doesn’t copy the data to a new dataset) and is thus faster.
cast_column
< source >( column: str feature ) → DatasetDict
Cast column to feature for decoding.
remove_columns
< source >( column_names: typing.Union[str, typing.List[str]] )
Remove one or several column(s) from each split in the dataset and the features associated to the column(s).
The transformation is applied to all the splits of the dataset dictionary.
You can also remove a column using Dataset.map() with remove_columns but the present method is in-place (doesn’t copy the data to a new dataset) and is thus faster.
rename_column
< source >( original_column_name: str new_column_name: str )
Rename a column in the dataset and move the features associated to the original column under the new column name. The transformation is applied to all the datasets of the dataset dictionary.
You can also rename a column using Dataset.map() with remove_columns but the present method:
- takes care of moving the original features under the new column name.
- doesn’t copy the data to a new dataset and is thus much faster.
rename_columns
< source >( column_mapping: typing.Dict[str, str] ) → DatasetDict
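Parameters
- column_mapping (Dict[str, str]) — A mapping from the current column names to their new names.
Returns
DatasetDict
A copy of the dataset with renamed columns.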
Rename several columns in the dataset, and move the features associated to the original columns under the new column names. The transformation is applied to all the datasets of the dataset dictionary.
class_encode_column
< source >( column: str include_nulls: bool = False )
Casts the given column as datasets.features.ClassLabel and updates the tables.
push_to_hub
< source >( repo_id private: typing.Optional[bool] = False token: typing.Optional[str] = None branch: NoneType = None shard_size: typing.Optional[int] = 524288000 embed_external_files: bool = True )
Parameters
- repo_id (str) — The ID of the repository to push to, in the following format: <user>/<dataset_name> or <org>/<dataset_name>. Also accepts <dataset_name>, which will default to the namespace of the logged-in user.
- private (Optional bool) — Whether the dataset repository should be set to private or not. Only affects repository creation: a repository that already exists will not be affected by that parameter.
- token (Optional str) — An optional authentication token for the Hugging Face Hub. If no token is passed, will default to the token saved locally when logging in with huggingface-cli login. Will raise an error if no token is passed and the user is not logged in.
- branch (Optional str) — The git branch on which to push the dataset.
- shard_size (Optional int) — The size of the dataset shards to be uploaded to the hub. The dataset will be pushed in files of the size specified here, in bytes.
- embed_external_files (bool, default True) — Whether to embed file bytes in the shards. In particular, this will do the following before the push for the fields of type:
  - Audio and Image: remove local path information and embed file content in the Parquet files.
Pushes the DatasetDict to the hub.
The DatasetDict is pushed using HTTP requests and does not require git or git-lfs to be installed.
Each dataset split will be pushed independently. The pushed dataset will keep the original split names.
Example:
>>> dataset_dict.push_to_hub("<organization>/<dataset_id>")
save_to_disk
< source >( dataset_dict_path: str fs = None )
Parameters
- dataset_dict_path (str) — Path (e.g. dataset/train) or remote URI (e.g. s3://my-bucket/dataset/train) of the dataset dict directory where the dataset dict will be saved to.
- fs (S3FileSystem or fsspec.spec.AbstractFileSystem, optional, defaults None) — Instance of the remote filesystem where the dataset dict will be saved to.
Saves a dataset dict to a filesystem using either S3FileSystem or fsspec.spec.AbstractFileSystem.
load_from_disk
< source >( dataset_dict_path: str fs = None keep_in_memory: typing.Optional[bool] = None ) → DatasetDict
Parameters
- dataset_dict_path (str) — Path (e.g. "dataset/train") or remote URI (e.g. "s3://my-bucket/dataset/train") of the dataset dict directory where the dataset dict will be loaded from.
- fs (S3FileSystem or fsspec.spec.AbstractFileSystem, optional, default None) — Instance of the remote filesystem used to download the files from.
- keep_in_memory (bool, default None) — Whether to copy the dataset in-memory. If None, the dataset will not be copied in-memory unless explicitly enabled by setting datasets.config.IN_MEMORY_MAX_SIZE to nonzero. See more details in the load_dataset_enhancing_performance section.
Returns
DatasetDict
Load a dataset that was previously saved using save_to_disk from a filesystem using either S3FileSystem or fsspec.spec.AbstractFileSystem.
from_csv
< source >( path_or_paths: typing.Dict[str, typing.Union[str, bytes, os.PathLike]] features: typing.Optional[datasets.features.features.Features] = None cache_dir: str = None keep_in_memory: bool = False **kwargs ) → DatasetDict
Parameters
- path_or_paths (dict of path-like) — Path(s) of the CSV file(s).
- features (Features, optional) — Dataset features.
- cache_dir (str, optional, default "~/.cache/huggingface/datasets") — Directory to cache data.
- keep_in_memory (bool, default False) — Whether to copy the data in-memory.
- **kwargs — Keyword arguments to be passed to pandas.read_csv.
Returns
DatasetDict
Create DatasetDict from CSV file(s).
from_json
< source >( path_or_paths: typing.Dict[str, typing.Union[str, bytes, os.PathLike]] features: typing.Optional[datasets.features.features.Features] = None cache_dir: str = None keep_in_memory: bool = False **kwargs ) → DatasetDict
Parameters
- path_or_paths (dict of path-like) — Path(s) of the JSON Lines file(s).
- features (Features, optional) — Dataset features.
- cache_dir (str, optional, default="~/.cache/huggingface/datasets") — Directory to cache data.
- keep_in_memory (bool, default=False) — Whether to copy the data in-memory.
- **kwargs — Keyword arguments to be passed to JsonConfig.
Returns
Create DatasetDict from JSON Lines file(s).
from_parquet
< source >( path_or_paths: typing.Dict[str, typing.Union[str, bytes, os.PathLike]] features: typing.Optional[datasets.features.features.Features] = None cache_dir: str = None keep_in_memory: bool = False columns: typing.Optional[typing.List[str]] = None **kwargs ) → DatasetDict
Parameters
- path_or_paths (dict of path-like) — Path(s) of the Parquet file(s).
- features (Features, optional) — Dataset features.
- cache_dir (str, optional, default="~/.cache/huggingface/datasets") — Directory to cache data.
- keep_in_memory (bool, default=False) — Whether to copy the data in-memory.
- columns (List[str], optional) — If not None, only these columns will be read from the file. A column name may be a prefix of a nested field, e.g. 'a' will select 'a.b', 'a.c', and 'a.d.e'.
- **kwargs — Keyword arguments to be passed to ParquetConfig.
Returns
Create DatasetDict from Parquet file(s).
from_text
< source >( path_or_paths: typing.Dict[str, typing.Union[str, bytes, os.PathLike]] features: typing.Optional[datasets.features.features.Features] = None cache_dir: str = None keep_in_memory: bool = False **kwargs ) → DatasetDict
Parameters
- path_or_paths (dict of path-like) — Path(s) of the text file(s).
- features (Features, optional) — Dataset features.
- cache_dir (str, optional, default="~/.cache/huggingface/datasets") — Directory to cache data.
- keep_in_memory (bool, default=False) — Whether to copy the data in-memory.
- **kwargs — Keyword arguments to be passed to TextConfig.
Returns
Create DatasetDict from text file(s).
prepare_for_task
< source >( task: typing.Union[str, datasets.tasks.base.TaskTemplate] id: int = 0 )
Parameters
- task (Union[str, TaskTemplate]) — The task to prepare the dataset for during training and evaluation. If str, supported tasks include:
  - "text-classification"
  - "question-answering"
  If TaskTemplate, must be one of the task templates in datasets.tasks.
- id (int, defaults to 0) — The id required to unambiguously identify the task template when multiple task templates of the same type are supported.
Prepare a dataset for the given task by casting the dataset’s Features to standardized column names and types as detailed in datasets.tasks.
Casts datasets.DatasetInfo.features according to a task-specific schema. Intended for single-use only, so all task templates are removed from datasets.DatasetInfo.task_templates after casting.
IterableDataset
The base class datasets.IterableDataset implements an iterable Dataset backed by Python generators.
class datasets.IterableDataset
< source >( ex_iterable: _BaseExamplesIterable info: typing.Optional[datasets.info.DatasetInfo] = None split: typing.Optional[datasets.splits.NamedSplit] = None format_type: typing.Optional[str] = None shuffling: typing.Optional[datasets.iterable_dataset.ShufflingConfig] = None )
A Dataset backed by an iterable.
remove_columns
< source >( column_names: typing.Union[str, typing.List[str]] ) → IterableDataset
Parameters
Returns
A copy of the dataset object without the columns to remove.
Remove one or several column(s) in the dataset and the features associated to them. The removal is done on-the-fly on the examples when iterating over the dataset.
cast_column
< source >( column: str feature: typing.Union[dict, list, tuple, datasets.features.features.Value, datasets.features.features.ClassLabel, datasets.features.translation.Translation, datasets.features.translation.TranslationVariableLanguages, datasets.features.features.Sequence, datasets.features.features.Array2D, datasets.features.features.Array3D, datasets.features.features.Array4D, datasets.features.features.Array5D, datasets.features.audio.Audio, datasets.features.image.Image] ) → IterableDataset
Cast column to feature for decoding.
map
< source >( function: typing.Optional[typing.Callable] = None with_indices: bool = False input_columns: typing.Union[str, typing.List[str], NoneType] = None batched: bool = False batch_size: int = 1000 remove_columns: typing.Union[str, typing.List[str], NoneType] = None )
Parameters
- function (Callable, optional, default None) — Function applied on-the-fly on the examples when you iterate on the dataset. It must have one of the following signatures:
  - function(example: Union[Dict, Any]) -> dict if batched=False and with_indices=False
  - function(example: Union[Dict, Any], idx: int) -> dict if batched=False and with_indices=True
  - function(batch: Union[Dict[List], List[Any]]) -> dict if batched=True and with_indices=False
  - function(batch: Union[Dict[List], List[Any]], indices: List[int]) -> dict if batched=True and with_indices=True
  If no function is provided, default to identity function: lambda x: x.
- with_indices (bool, defaults to False) — Provide example indices to function. Note that in this case the signature of function should be def function(example, idx[, rank]): ….
- input_columns (Optional[Union[str, List[str]]], default None) — The columns to be passed into function as positional arguments. If None, a dict mapping to all formatted columns is passed as one argument.
- batched (bool, default False) — Provide batch of examples to function.
- batch_size (int, optional, default 1000) — Number of examples per batch provided to function if batched=True.
- remove_columns (Optional[List[str]], defaults to None) — Remove a selection of columns while doing the mapping. Columns will be removed before updating the examples with the output of function, i.e. if function is adding columns with names in remove_columns, these columns will be kept.
Apply a function to all the examples in the iterable dataset (individually or in batches) and update them. If your function returns a column that already exists, then it overwrites it. The function is applied on-the-fly on the examples when iterating over the dataset.
You can specify whether the function should be batched or not with the batched parameter:
- If batched is False, then the function takes 1 example in and should return 1 example. An example is a dictionary, e.g. {"text": "Hello there !"}
- If batched is True and batch_size is 1, then the function takes a batch of 1 example as input and can return a batch with 1 or more examples. A batch is a dictionary, e.g. a batch of 1 example is {"text": ["Hello there !"]}
- If batched is True and batch_size is n > 1, then the function takes a batch of n examples as input and can return a batch with n examples, or with an arbitrary number of examples. Note that the last batch may have fewer than n examples. A batch is a dictionary, e.g. a batch of n examples is {"text": ["Hello there !"] * n}
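The batching contract above can be illustrated without the library. The following is a plain-Python sketch, not the datasets implementation: lazy_batched_map and split_words are hypothetical names, and the sketch simply replaces each batch with the function's output while ignoring with_indices, input_columns and remove_columns.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List

Example = Dict[str, object]
Batch = Dict[str, List[object]]


def lazy_batched_map(
    examples: Iterable[Example],
    function: Callable[[Batch], Batch],
    batch_size: int = 1000,
) -> Iterator[Example]:
    """Hypothetical stand-in: apply `function` batch by batch, lazily."""
    iterator = iter(examples)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        # A batch is a dict of lists, one list per column.
        batch = {key: [example[key] for example in chunk] for key in chunk[0]}
        output = function(batch)
        # The output batch may hold any number of rows, not only len(chunk).
        n_rows = len(next(iter(output.values()))) if output else 0
        for i in range(n_rows):
            yield {key: column[i] for key, column in output.items()}


def split_words(batch: Batch) -> Batch:
    """Return one output row per word, so the batch grows."""
    return {"text": [word for text in batch["text"] for word in str(text).split()]}


rows = [{"text": "Hello there !"}, {"text": "General Kenobi"}, {"text": "Hi"}]
# With batch_size=2 the last batch holds a single example.
words = list(lazy_batched_map(rows, split_words, batch_size=2))
```

Nothing is computed until the generator is consumed, which mirrors the on-the-fly behaviour described for map.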
filter
< source >( function: typing.Optional[typing.Callable] = None with_indices = False input_columns: typing.Union[str, typing.List[str], NoneType] = None batched: bool = False batch_size: typing.Optional[int] = 1000 )
Parameters
- function (Callable) — Callable with one of the following signatures:
  - function(example: Union[Dict, Any]) -> bool if with_indices=False, batched=False
  - function(example: Union[Dict, Any], indices: int) -> bool if with_indices=True, batched=False
  - function(example: Union[Dict, Any]) -> List[bool] if with_indices=False, batched=True
  - function(example: Union[Dict, Any], indices: int) -> List[bool] if with_indices=True, batched=True
  If no function is provided, defaults to an always True function: lambda x: True.
- with_indices (bool, default False) — Provide example indices to function. Note that in this case the signature of function should be def function(example, idx): ….
- input_columns (str or List[str], optional) — The columns to be passed into function as positional arguments. If None, a dict mapping to all formatted columns is passed as one argument.
- batched (bool, defaults to False) — Provide batch of examples to function.
- batch_size (int, optional, default 1000) — Number of examples per batch provided to function if batched=True.
Apply a filter function to all the elements so that the dataset only includes examples according to the filter function. The filtering is done on-the-fly when iterating over the dataset.
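As an illustration of the two calling conventions, here is a plain-Python sketch of a lazy filter. It is an assumption-laden stand-in rather than the library's code: lazy_filter is a hypothetical name, and with_indices and input_columns are left out.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator

Example = Dict[str, object]


def lazy_filter(
    examples: Iterable[Example],
    function: Callable,
    batched: bool = False,
    batch_size: int = 1000,
) -> Iterator[Example]:
    """Hypothetical stand-in: keep examples for which `function` is true."""
    iterator = iter(examples)
    if not batched:
        # Non-batched: the function receives one example and returns a bool.
        for example in iterator:
            if function(example):
                yield example
        return
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        # Batched: the function receives a dict of lists and returns one bool per row.
        batch = {key: [example[key] for example in chunk] for key in chunk[0]}
        mask = function(batch)
        for example, keep in zip(chunk, mask):
            if keep:
                yield example


rows = [{"n": i} for i in range(10)]
evens = list(lazy_filter(rows, lambda example: example["n"] % 2 == 0))
evens_batched = list(
    lazy_filter(
        rows,
        lambda batch: [n % 2 == 0 for n in batch["n"]],
        batched=True,
        batch_size=3,
    )
)
```

Both calls select the same rows; the batched form only changes how many examples the function sees per call.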
shuffle
< source >( seed = None generator: typing.Optional[numpy.random._generator.Generator] = None buffer_size: int = 1000 )
Parameters
- seed (int, optional, default None) — Random seed that will be used to shuffle the dataset. It is used to sample from the shuffle buffer and also to shuffle the data shards.
- generator (numpy.random.Generator, optional) — Numpy random Generator to use to compute the permutation of the dataset rows. If generator=None (default), uses np.random.default_rng (the default BitGenerator (PCG64) of NumPy).
- buffer_size (int, default 1000) — Size of the buffer.
Randomly shuffles the elements of this dataset.
This dataset fills a buffer with buffer_size elements, then randomly samples elements from this buffer, replacing the selected elements with new elements. For perfect shuffling, a buffer size greater than or equal to the full size of the dataset is required.
For instance, if your dataset contains 10,000 elements but buffer_size is set to 1,000, then shuffle will initially select a random element from only the first 1,000 elements in the buffer. Once an element is selected, its space in the buffer is replaced by the next (i.e. 1,001-st) element, maintaining the 1,000 element buffer.
If the dataset is made of several shards, the order of the shards is shuffled as well. However, if the order has been fixed by using datasets.IterableDataset.skip() or datasets.IterableDataset.take(), then the order of the shards is kept unchanged.
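The buffer mechanism described above can be sketched in a few lines. This is a minimal illustration that assumes Python's random module in place of a NumPy generator and ignores shard shuffling; buffer_shuffle is a hypothetical name, not a library function.

```python
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def buffer_shuffle(
    items: Iterable[T], buffer_size: int = 1000, seed: Optional[int] = None
) -> Iterator[T]:
    """Hypothetical stand-in for buffer-based approximate shuffling."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in items:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(item)
            continue
        # Emit a random buffered element and put the new item in its slot.
        index = rng.randrange(buffer_size)
        yield buffer[index]
        buffer[index] = item
    # The input is exhausted: emit what remains in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffer_shuffle(range(10_000), buffer_size=1_000, seed=42))
```

Because an element can only be emitted once it has entered the buffer, the element at output position k always comes from the first k + buffer_size inputs. That is why a buffer at least as large as the dataset is needed for a perfect shuffle.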
Create a new IterableDataset that skips the first n elements.
Create a new IterableDataset with only the first n elements.
datasets.DatasetInfo object containing all the metadata in the dataset.
datasets.NamedSplit object corresponding to a named dataset split.
IterableDatasetDict
Dictionary with split names as keys ('train', 'test' for example), and datasets.IterableDataset objects as values.
map
< source >( function: typing.Optional[typing.Callable] = None with_indices: bool = False input_columns: typing.Union[str, typing.List[str], NoneType] = None batched: bool = False batch_size: int = 1000 remove_columns: typing.Union[str, typing.List[str], NoneType] = None )
Parameters
- function (Callable, optional, default None) — Function applied on-the-fly on the examples when you iterate on the dataset. It must have one of the following signatures:
  - function(example: Union[Dict, Any]) -> dict if batched=False and with_indices=False
  - function(example: Union[Dict, Any], idx: int) -> dict if batched=False and with_indices=True
  - function(batch: Union[Dict[List], List[Any]]) -> dict if batched=True and with_indices=False
  - function(batch: Union[Dict[List], List[Any]], indices: List[int]) -> dict if batched=True and with_indices=True
  If no function is provided, default to identity function: lambda x: x.
- with_indices (bool, defaults to False) — Provide example indices to function. Note that in this case the signature of function should be def function(example, idx[, rank]): ….
- input_columns (Optional[Union[str, List[str]]], default None) — The columns to be passed into function as positional arguments. If None, a dict mapping to all formatted columns is passed as one argument.
- batched (bool, default False) — Provide batch of examples to function.
- batch_size (int, optional, default 1000) — Number of examples per batch provided to function if batched=True.
- remove_columns (Optional[List[str]], defaults to None) — Remove a selection of columns while doing the mapping. Columns will be removed before updating the examples with the output of function, i.e. if function is adding columns with names in remove_columns, these columns will be kept.
Apply a function to all the examples in the iterable dataset (individually or in batches) and update them. If your function returns a column that already exists, then it overwrites it. The function is applied on-the-fly on the examples when iterating over the dataset. The transformation is applied to all the datasets of the dataset dictionary.
You can specify whether the function should be batched or not with the batched parameter:
- If batched is False, then the function takes 1 example in and should return 1 example. An example is a dictionary, e.g. {"text": "Hello there !"}
- If batched is True and batch_size is 1, then the function takes a batch of 1 example as input and can return a batch with 1 or more examples. A batch is a dictionary, e.g. a batch of 1 example is {"text": ["Hello there !"]}
- If batched is True and batch_size is n > 1, then the function takes a batch of n examples as input and can return a batch with n examples, or with an arbitrary number of examples. Note that the last batch may have fewer than n examples. A batch is a dictionary, e.g. a batch of n examples is {"text": ["Hello there !"] * n}
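To make "applied to all the datasets of the dataset dictionary" concrete, the sketch below maps a function lazily over every split of a plain dictionary. It is only an analogy built on ordinary Python generators; map_all_splits is a hypothetical helper and covers the non-batched case only.

```python
from typing import Callable, Dict, Iterable, Iterator

Example = Dict[str, object]


def map_all_splits(
    splits: Dict[str, Iterable[Example]],
    function: Callable[[Example], Example],
) -> Dict[str, Iterator[Example]]:
    """Hypothetical stand-in: wrap every split in the same lazy transformation."""

    def lazy(examples: Iterable[Example]) -> Iterator[Example]:
        for example in examples:
            # Update the example with the function output;
            # a returned column that already exists overwrites the old one.
            yield {**example, **function(example)}

    return {name: lazy(examples) for name, examples in splits.items()}


splits = {
    "train": [{"text": "Hello there !"}],
    "test": [{"text": "General Kenobi"}],
}
mapped = map_all_splits(splits, lambda example: {"n_chars": len(str(example["text"]))})
# Work happens only here, when each split is iterated.
result = {name: list(examples) for name, examples in mapped.items()}
```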
filter
< source >( function: typing.Optional[typing.Callable] = None with_indices = False input_columns: typing.Union[str, typing.List[str], NoneType] = None batched: bool = False batch_size: typing.Optional[int] = 1000 )
Parameters
- function (Callable) — Callable with one of the following signatures:
  - function(example: Union[Dict, Any]) -> bool if with_indices=False, batched=False
  - function(example: Union[Dict, Any], indices: int) -> bool if with_indices=True, batched=False
  - function(example: Union[Dict, Any]) -> List[bool] if with_indices=False, batched=True
  - function(example: Union[Dict, Any], indices: int) -> List[bool] if with_indices=True, batched=True
  If no function is provided, defaults to an always True function: lambda x: True.
- with_indices (bool, default False) — Provide example indices to function. Note that in this case the signature of function should be def function(example, idx): ….
- input_columns (str or List[str], optional) — The columns to be passed into function as positional arguments. If None, a dict mapping to all formatted columns is passed as one argument.
- batched (bool, defaults to False) — Provide batch of examples to function.
- batch_size (int, optional, default 1000) — Number of examples per batch provided to function if batched=True.
Apply a filter function to all the elements so that the dataset only includes examples according to the filter function. The filtering is done on-the-fly when iterating over the dataset. The filtering is applied to all the datasets of the dataset dictionary.
shuffle
< source >( seed = None generator: typing.Optional[numpy.random._generator.Generator] = None buffer_size: int = 1000 )
Parameters
- seed (int, optional, default None) — Random seed that will be used to shuffle the dataset. It is used to sample from the shuffle buffer and also to shuffle the data shards.
- generator (numpy.random.Generator, optional) — Numpy random Generator to use to compute the permutation of the dataset rows. If generator=None (default), uses np.random.default_rng (the default BitGenerator (PCG64) of NumPy).
- buffer_size (int, default 1000) — Size of the buffer.
Randomly shuffles the elements of this dataset. The shuffling is applied to all the datasets of the dataset dictionary.
This dataset fills a buffer with buffer_size elements, then randomly samples elements from this buffer, replacing the selected elements with new elements. For perfect shuffling, a buffer size greater than or equal to the full size of the dataset is required.
For instance, if your dataset contains 10,000 elements but buffer_size is set to 1,000, then shuffle will initially select a random element from only the first 1,000 elements in the buffer. Once an element is selected, its space in the buffer is replaced by the next (i.e. 1,001-st) element, maintaining the 1,000 element buffer.
If the dataset is made of several shards, the order of the shards is shuffled as well. However, if the order has been fixed by using datasets.IterableDataset.skip() or datasets.IterableDataset.take(), then the order of the shards is kept unchanged.
with_format
< source >( type: typing.Optional[str] = None )
Return a dataset with the specified format. This method only supports the “torch” format for now. The format is set to all the datasets of the dataset dictionary.
cast
< source >( features: Features ) → IterableDatasetDict
Parameters
- features (datasets.Features) — New features to cast the dataset to. The name of the fields in the features must match the current column names. The type of the data must also be convertible from one type to the other. For non-trivial conversion, e.g. string <-> ClassLabel you should use map to update the Dataset.
Returns
A copy of the dataset with casted features.
Cast the dataset to a new set of features. The type casting is applied to all the datasets of the dataset dictionary.
cast_column
< source >( column: str feature: typing.Union[dict, list, tuple, datasets.features.features.Value, datasets.features.features.ClassLabel, datasets.features.translation.Translation, datasets.features.translation.TranslationVariableLanguages, datasets.features.features.Sequence, datasets.features.features.Array2D, datasets.features.features.Array3D, datasets.features.features.Array4D, datasets.features.features.Array5D, datasets.features.audio.Audio, datasets.features.image.Image] ) → IterableDatasetDict
Parameters
Returns
Cast column to feature for decoding. The type casting is applied to all the datasets of the dataset dictionary.
remove_columns
< source >( column_names: typing.Union[str, typing.List[str]] ) → IterableDatasetDict
Parameters
Returns
A copy of the dataset object without the columns to remove.
Remove one or several column(s) in the dataset and the features associated to them. The removal is done on-the-fly on the examples when iterating over the dataset. The removal is applied to all the datasets of the dataset dictionary.
rename_column
< source >( original_column_name: str new_column_name: str ) → IterableDatasetDict
Rename a column in the dataset, and move the features associated to the original column under the new column name. The renaming is applied to all the datasets of the dataset dictionary.
rename_columns
< source >( column_mapping: typing.Dict[str, str] ) → IterableDatasetDict
Parameters
Returns
A copy of the dataset with renamed columns
Rename several columns in the dataset, and move the features associated to the original columns under the new column names. The renaming is applied to all the datasets of the dataset dictionary.
Features
A special dictionary that defines the internal structure of a dataset.
Instantiated with a dictionary of type dict[str, FieldType], where keys are the desired column names, and values are the type of that column.
FieldType can be one of the following:
- a datasets.Value feature specifies a single typed value, e.g. int64 or string
- a datasets.ClassLabel feature specifies a field with a predefined set of classes which can have labels associated to them and will be stored as integers in the dataset
- a python dict which specifies that the field is a nested field containing a mapping of sub-fields to sub-fields features. It’s possible to have nested fields of nested fields in an arbitrary manner
- a python list or a datasets.Sequence specifies that the field contains a list of objects. The python list or datasets.Sequence should be provided with a single sub-feature as an example of the feature type hosted in this list
  A datasets.Sequence with an internal dictionary feature will be automatically converted into a dictionary of lists. This behavior is implemented to have a compatibility layer with the TensorFlow Datasets library but may be unwanted in some cases. If you don’t want this behavior, you can use a python list instead of the datasets.Sequence.
- an Array2D, Array3D, Array4D or Array5D feature for multidimensional arrays
- an Audio feature to store the absolute path to an audio file or a dictionary with the relative path to an audio file (“path” key) and its bytes content (“bytes” key). This feature extracts the audio data.
- an Image feature to store the absolute path to an image file, an np.ndarray object, a PIL.Image.Image object or a dictionary with the relative path to an image file (“path” key) and its bytes content (“bytes” key). This feature extracts the image data.
- datasets.Translation and datasets.TranslationVariableLanguages, the two features specific to Machine Translation
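The "dictionary of lists" layout mentioned for a datasets.Sequence with an internal dictionary feature can be pictured with plain Python data. The helper below is a hypothetical illustration of that reshaping only, not the library's conversion code.

```python
from typing import Dict, List


def list_of_dicts_to_dict_of_lists(rows: List[Dict[str, object]]) -> Dict[str, List[object]]:
    """Hypothetical helper: turn [{'a': 1, 'b': 2}, ...] into {'a': [1, ...], 'b': [2, ...]}."""
    if not rows:
        return {}
    # Assumes every row has the same keys as the first one.
    return {key: [row[key] for row in rows] for key in rows[0]}


# Layout kept by a python list of dicts: one dict per item.
as_list = [{"a": "x", "b": "u"}, {"a": "y", "b": "v"}]
# Layout produced for a Sequence of dicts: one list per sub-field.
as_dict_of_lists = list_of_dicts_to_dict_of_lists(as_list)
```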
Make a deep copy of Features.
decode_batch
< source >( batch: dict )
Decode batch with custom feature decoding.
decode_column
< source >( column: list column_name: str )
Decode column with custom feature decoding.
Decode example with custom feature decoding.
encode_batch
< source >( batch )
Encode batch into a format for Arrow.
Encode example into a format for Arrow.
Flatten the features. Every dictionary column is removed and is replaced by all the subfields it contains. The new fields are named by concatenating the name of the original column and the subfield name like this: "<original>.<subfield>".
If a column contains nested dictionaries, then all the lower-level subfields names are also concatenated to form new columns: "<original>.<subfield>.<subsubfield>", etc.
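The naming rule can be sketched on a plain nested dictionary. This is an illustrative stand-in that uses strings in place of feature objects; flatten_names is a hypothetical function, not part of the library.

```python
from typing import Dict


def flatten_names(features: Dict[str, object], prefix: str = "") -> Dict[str, object]:
    """Hypothetical helper: replace nested dicts by dotted column names."""
    flat: Dict[str, object] = {}
    for name, value in features.items():
        full_name = f"{prefix}.{name}" if prefix else name
        if isinstance(value, dict):
            # Recurse so that deeper levels are concatenated as well.
            flat.update(flatten_names(value, full_name))
        else:
            flat[full_name] = value
    return flat


nested = {"a": {"b": "string", "c": {"d": "int64"}}, "e": "bool"}
flat = flatten_names(nested)
```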
from_arrow_schema
< source >( pa_schema: Schema ) → Features
Construct Features from Arrow Schema. It also checks the schema metadata for Hugging Face Datasets features.
from_dict
< source >( dic ) → Features
Construct Features from dict.
Regenerate the nested feature object from a deserialized dict. We use the ‘_type’ key to infer the dataclass name of the feature FieldType.
It allows for a convenient constructor syntax to define features from deserialized JSON dictionaries. This function is used in particular when deserializing a DatasetInfo that was dumped to a JSON object. This acts as an analogue to Features.from_arrow_schema() and handles the recursive field-by-field instantiation, but doesn’t require any mapping to/from pyarrow, except for the fact that it takes advantage of the mapping of pyarrow primitive dtypes that Value automatically performs.
Example:
>>> Features.from_dict({'_type': {'dtype': 'string', 'id': None, '_type': 'Value'}})
{'_type': Value(dtype='string', id=None)}
reorder_fields_as
< source >( other: Features ) → Features
Reorder Features fields to match the field order of other Features.
The order of the fields is important since it matters for the underlying arrow data. Re-ordering the fields allows to make the underlying arrow data type match.
Example:
>>> from datasets import Features, Sequence, Value
>>> # let's say we have two features with a different order of nested fields (for a and b for example)
>>> f1 = Features({"root": Sequence({"a": Value("string"), "b": Value("string")})})
>>> f2 = Features({"root": {"b": Sequence(Value("string")), "a": Sequence(Value("string"))}})
>>> assert f1.type != f2.type
>>> # re-ordering keeps the base structure (here Sequence is defined at the root level), but make the fields order match
>>> f1.reorder_fields_as(f2)
{'root': Sequence(feature={'b': Value(dtype='string', id=None), 'a': Value(dtype='string', id=None)}, length=-1, id=None)}
>>> assert f1.reorder_fields_as(f2).type == f2.type
class datasets.Sequence
< source >( feature: typing.Any length: int = -1 id: typing.Optional[str] = None )
Construct a list of features from a single type or a dict of types. Mostly here for compatibility with tfds.
class datasets.ClassLabel
< source >( num_classes: int = None names: typing.List[str] = None names_file: dataclasses.InitVar[typing.Optional[str]] = None id: typing.Optional[str] = None )
Parameters
Feature type for integer class labels.
There are 3 ways to define a ClassLabel, which correspond to the 3 arguments:
- num_classes: Create 0 to (num_classes-1) labels.
- names: List of label strings.
- names_file: File containing the list of labels.
Conversion integer => class name string.
Conversion class name string => integer.
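The two conversions amount to a lookup in each direction. The class below is a minimal stand-in written for illustration, assuming labels are numbered in list order; LabelCodec and its method names are hypothetical and are not the ClassLabel API.

```python
from typing import Dict, List, Optional


class LabelCodec:
    """Hypothetical stand-in for an integer <-> class-name mapping."""

    def __init__(self, num_classes: Optional[int] = None, names: Optional[List[str]] = None):
        if names is None:
            if num_classes is None:
                raise ValueError("Provide num_classes or names")
            # Only a count is known: use "0" .. "num_classes-1" as names.
            names = [str(i) for i in range(num_classes)]
        elif num_classes is not None and num_classes != len(names):
            raise ValueError("num_classes does not match len(names)")
        self.names: List[str] = list(names)
        self._index: Dict[str, int] = {name: i for i, name in enumerate(self.names)}

    def int_to_name(self, value: int) -> str:
        if not 0 <= value < len(self.names):
            raise ValueError(f"Invalid integer class label {value}")
        return self.names[value]

    def name_to_int(self, name: str) -> int:
        return self._index[name]


codec = LabelCodec(names=["negative", "positive"])
label_id = codec.name_to_int("positive")
label_name = codec.int_to_name(label_id)
```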
The Value dtypes are as follows:
- null
- bool
- int8
- int16
- int32
- int64
- uint8
- uint16
- uint32
- uint64
- float16
- float32 (alias float)
- float64 (alias double)
- time32[(s|ms)]
- time64[(us|ns)]
- timestamp[(s|ms|us|ns)]
- timestamp[(s|ms|us|ns), tz=(tzstring)]
- date32
- date64
- duration[(s|ms|us|ns)]
- decimal128(precision, scale)
- decimal256(precision, scale)
- binary
- large_binary
- string
- large_string
class datasets.Translation
< source >( languages: typing.List[str] id: typing.Optional[str] = None )
FeatureConnector for translations with fixed languages per example. Here for compatibility with tfds.
Input: The Translation feature accepts a dictionary for each example mapping string language codes to string translations.
Output: A dictionary mapping string language codes to translations as Text features.
Example:
# At construction time:
datasets.features.Translation(languages=['en', 'fr', 'de'])
# During data generation:
yield {
'en': 'the cat',
'fr': 'le chat',
'de': 'die katze'
}
class datasets.TranslationVariableLanguages
< source >( languages: typing.Optional[typing.List] = None num_languages: typing.Optional[int] = None id: typing.Optional[str] = None )
FeatureConnector for translations with variable languages per example. Here for compatibility with tfds.
Input: The TranslationVariableLanguages feature accepts a dictionary for each example mapping string language codes to one or more string translations. The languages present may vary from example to example.
Output: language: variable-length 1D tf.Tensor of tf.string language codes, sorted in ascending order. translation: variable-length 1D tf.Tensor of tf.string plain text translations, sorted to align with language codes.
Example:
# At construction time:
datasets.features.TranslationVariableLanguages(languages=['en', 'fr', 'de'])
# During data generation:
yield {
'en': 'the cat',
'fr': ['le chat', 'la chatte'],
'de': 'die katze'
}
# Tensor returned:
{
'language': ['de', 'en', 'fr', 'fr'],
'translation': ['die katze', 'the cat', 'la chatte', 'le chat'],
}
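The Output description (language codes sorted in ascending order, translations aligned with them) can be reproduced with a short plain-Python sketch. It returns lists rather than tensors and is an illustration of the stated rule, not the library's encoder; encode_variable_translations is a hypothetical name.

```python
from typing import Dict, List, Tuple, Union


def encode_variable_translations(
    example: Dict[str, Union[str, List[str]]]
) -> Dict[str, List[str]]:
    """Hypothetical helper: flatten to (language, translation) pairs and sort them."""
    pairs: List[Tuple[str, str]] = []
    for language, text in example.items():
        if isinstance(text, str):
            pairs.append((language, text))
        else:
            # One pair per translation when a language has several.
            pairs.extend((language, one_text) for one_text in text)
    # Sorting the pairs orders by language code first, then by translation.
    pairs.sort()
    return {
        "language": [language for language, _ in pairs],
        "translation": [text for _, text in pairs],
    }


encoded = encode_variable_translations(
    {"en": "the cat", "fr": ["le chat", "la chatte"], "de": "die katze"}
)
```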
class datasets.Audio
< source >( sampling_rate: typing.Optional[int] = None mono: bool = True decode: bool = True id: typing.Optional[str] = None )
Parameters
- sampling_rate (int, optional) — Target sampling rate. If None, the native sampling rate is used.
- mono (bool, default True) — Whether to convert the audio signal to mono by averaging samples across channels.
- decode (bool, default True) — Whether to decode the audio data. If False, returns the underlying dictionary in the format {"path": audio_path, "bytes": audio_bytes}.
Audio Feature to extract audio data from an audio file.
Input: The Audio feature accepts as input:
- A str: Absolute path to the audio file (i.e. random access is allowed).
- A dict with the keys:
  - path: String with relative path of the audio file to the archive file.
  - bytes: Bytes content of the audio file.
  This is useful for archived files with sequential access.
- A dict with the keys:
  - path: String with relative path of the audio file to the archive file.
  - array: Array containing the audio sample.
  - sampling_rate: Integer corresponding to the sampling rate of the audio sample.
  This is useful for archived files with sequential access.
cast_storage
< source >( storage: typing.Union[pyarrow.lib.StringArray, pyarrow.lib.StructArray] ) → pa.StructArray
Cast an Arrow array to the Audio arrow storage type. The Arrow types that can be converted to the Audio pyarrow storage type are:
- pa.string() - it must contain the "path" data
- pa.struct({"bytes": pa.binary()})
- pa.struct({"path": pa.string()})
- pa.struct({"bytes": pa.binary(), "path": pa.string()}) - order doesn’t matter
decode_example
< source >( value: dict )
Decode example audio file into audio data.
embed_storage
< source >( storage: StructArray drop_paths: bool = True ) → pa.StructArray
Embed audio files into the Arrow array.
encode_example
< source >( value: typing.Union[str, dict] ) → dict
Encode example into a format for Arrow.
class datasets.Image
< source >( decode: bool = True id: typing.Optional[str] = None )
Image feature to read image data from an image file.
Input: The Image feature accepts as input:
- A str: Absolute path to the image file (i.e. random access is allowed).
- A dict with the keys:
  - path: String with relative path of the image file to the archive file.
  - bytes: Bytes of the image file.
  This is useful for archived files with sequential access.
- An np.ndarray: NumPy array representing an image.
- A PIL.Image.Image: PIL image object.
cast_storage
< source >( storage: typing.Union[pyarrow.lib.StringArray, pyarrow.lib.StructArray, pyarrow.lib.ListArray] ) → pa.StructArray
Cast an Arrow array to the Image arrow storage type. The Arrow types that can be converted to the Image pyarrow storage type are:
- pa.string() - it must contain the "path" data
- pa.struct({"bytes": pa.binary()})
- pa.struct({"path": pa.string()})
- pa.struct({"bytes": pa.binary(), "path": pa.string()}) - order doesn’t matter
- pa.list(*) - it must contain the image array data
decode_example
< source >( value: dict )
Decode example image file into image data.
embed_storage
< source >( storage: StructArray drop_paths: bool = True ) → pa.StructArray
Embed image files into the Arrow array.
encode_example
< source >( value: typing.Union[str, dict, numpy.ndarray, ForwardRef('PIL.Image.Image')] )
Encode example into a format for Arrow.
MetricInfo
class datasets.MetricInfo
< source >( description: str citation: str features: Features inputs_description: str = <factory> homepage: str = <factory> license: str = <factory> codebase_urls: typing.List[str] = <factory> reference_urls: typing.List[str] = <factory> streamable: bool = False format: typing.Optional[str] = None metric_name: typing.Optional[str] = None config_name: typing.Optional[str] = None experiment_id: typing.Optional[str] = None )
Information about a metric.
MetricInfo documents a metric, including its name, version, and features.
See the constructor arguments and properties for a full list.
Note: Not all fields are known on construction and may be updated later.
Create MetricInfo from the JSON file in metric_info_dir.
Write MetricInfo as JSON to metric_info_dir. Also save the license separately in LICENCE.
Metric
The base class Metric implements a Metric backed by one or several datasets.Dataset.
class datasets.Metric
< source >( config_name: typing.Optional[str] = None keep_in_memory: bool = False cache_dir: typing.Optional[str] = None num_process: int = 1 process_id: int = 0 seed: typing.Optional[int] = None experiment_id: typing.Optional[str] = None max_concurrent_cache_files: int = 10000 timeout: typing.Union[int, float] = 100 **kwargs )
Parameters
- config_name (str) — This is used to define a hash specific to a metric's computation script and prevents the metric’s data from being overridden when the metric loading script is modified.
- keep_in_memory (bool) — Keep all predictions and references in memory. Not possible in distributed settings.
- cache_dir (str) — Path to a directory in which temporary prediction/references data will be stored. The data directory should be located on a shared file-system in distributed setups.
- num_process (int) — Specify the total number of nodes in a distributed setting. This is useful to compute metrics in distributed setups (in particular non-additive metrics like F1).
- process_id (int) — Specify the id of the current process in a distributed setup (between 0 and num_process-1). This is useful to compute metrics in distributed setups (in particular non-additive metrics like F1).
- seed (int, optional) — If specified, this will temporarily set numpy’s random seed when datasets.Metric.compute() is run.
- experiment_id (str) — A specific experiment id. This is used if several distributed evaluations share the same file system. This is useful to compute metrics in distributed setups (in particular non-additive metrics like F1).
- max_concurrent_cache_files (int) — Max number of concurrent metrics cache files (default 10000).
- timeout (Union[int, float]) — Timeout in seconds for distributed setting synchronization.
A Metric is the base class and common API for all metrics.
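Several parameters above refer to non-additive metrics such as F1. The following worked sketch, in plain Python with a hypothetical `f1_score` helper and made-up labels, shows why per-process scores cannot simply be averaged and why all predictions and references have to be gathered before the final computation.

```python
def f1_score(predictions, references):
    """Binary F1 for label 1; returns 0.0 when there are no true positives."""
    tp = sum(p == 1 and r == 1 for p, r in zip(predictions, references))
    fp = sum(p == 1 and r == 0 for p, r in zip(predictions, references))
    fn = sum(p == 0 and r == 1 for p, r in zip(predictions, references))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up data, split across two processes.
shard_0 = {"predictions": [1, 1, 0, 0], "references": [1, 0, 0, 0]}
shard_1 = {"predictions": [1, 1, 1, 0], "references": [1, 1, 1, 1]}

# Averaging the per-shard scores ...
per_shard = [f1_score(s["predictions"], s["references"]) for s in (shard_0, shard_1)]
averaged_f1 = sum(per_shard) / len(per_shard)

# ... gives a different number than scoring the gathered data.
global_f1 = f1_score(
    shard_0["predictions"] + shard_1["predictions"],
    shard_0["references"] + shard_1["references"],
)
```

Here the gathered data yields an F1 of 0.8, while the average of the two per-shard scores is roughly 0.76.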
add
< source >( prediction = None reference = None **kwargs )
Add one prediction and reference for the metric’s stack.
add_batch
< source >( predictions = None references = None **kwargs )
Add a batch of predictions and references for the metric’s stack.
compute
< source >( predictions = None references = None **kwargs )
Compute the metrics.
Usage of positional arguments is not allowed to prevent mistakes.
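The three methods above follow an accumulate-then-compute pattern. The following is a minimal sketch with a hypothetical `ToyAccuracy` class. It is not a `datasets.Metric` subclass and keeps everything in a Python list rather than in cache files; it only mirrors the method names and the keyword-only calling convention.

```python
class ToyAccuracy:
    """Hypothetical in-memory illustration of the add / add_batch / compute flow."""

    def __init__(self):
        self._pairs = []  # accumulated (prediction, reference) pairs

    def add(self, *, prediction=None, reference=None):
        # Keyword-only arguments, so the two cannot be swapped by accident.
        self._pairs.append((prediction, reference))

    def add_batch(self, *, predictions=None, references=None):
        if len(predictions) != len(references):
            raise ValueError("predictions and references must have the same length")
        self._pairs.extend(zip(predictions, references))

    def compute(self, *, predictions=None, references=None):
        # Optionally accept a final batch, then score everything accumulated.
        if predictions is not None:
            self.add_batch(predictions=predictions, references=references)
        if not self._pairs:
            return None
        correct = sum(p == r for p, r in self._pairs)
        result = {"accuracy": correct / len(self._pairs)}
        self._pairs = []  # reset for the next evaluation
        return result


metric = ToyAccuracy()
metric.add(prediction=1, reference=1)
metric.add_batch(predictions=[0, 1], references=[1, 1])
score = metric.compute(predictions=[0], references=[0])
```

Calling `metric.compute([0], [0])` with positional arguments raises a `TypeError` in this sketch, which is one simple way to enforce the keyword-only rule.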
download_and_prepare
< source >( download_config: typing.Optional[datasets.utils.file_utils.DownloadConfig] = None dl_manager: typing.Optional[datasets.utils.download_manager.DownloadManager] = None )
Parameters
- download_config (DownloadConfig, optional) — Specific download configuration parameters.
- dl_manager (DownloadManager, optional) — Specific download manager to use.
Downloads and prepares dataset for reading.
Filesystems
Access S3 as if it were a file system.
This exposes a filesystem-like API (ls, cp, open, etc.) on top of S3 storage.
Provide credentials either explicitly (key=, secret=) or rely on boto’s credential methods. See the botocore documentation for more information. If no credentials are available, use anon=True.
The following parameters are passed on to fsspec:
- skip_instance_cache: to control reuse of instances
- use_listings_cache, listings_expiry_time, max_paths: to control reuse of directory listings
datasets.filesystems.S3FileSystem is a subclass of s3fs.S3FileSystem (https://s3fs.readthedocs.io/en/latest/api.html), which is a known implementation of fsspec. Filesystem Spec (fsspec) is a project to unify various projects and classes to work with remote filesystems and file-system-like abstractions using a standard pythonic interface.
Examples:
Listing files from a public S3 bucket.
>>> import datasets
>>> s3 = datasets.filesystems.S3FileSystem(anon=True) # doctest: +SKIP
>>> s3.ls('public-datasets/imdb/train') # doctest: +SKIP
['dataset_info.json','dataset.arrow','state.json']
Listing files from a private S3 bucket using aws_access_key_id and aws_secret_access_key.
>>> import datasets
>>> s3 = datasets.filesystems.S3FileSystem(key=aws_access_key_id, secret=aws_secret_access_key) # doctest: +SKIP
>>> s3.ls('my-private-datasets/imdb/train') # doctest: +SKIP
['dataset_info.json','dataset.arrow','state.json']
Using S3FileSystem with botocore.session.Session and a custom AWS profile.
>>> import botocore.session
>>> from datasets.filesystems import S3FileSystem
>>>
>>> s3_session = botocore.session.Session(profile='my_profile_name')
>>>
>>> s3 = S3FileSystem(session=s3_session) # doctest: +SKIP
Loading a dataset from S3 using S3FileSystem and load_from_disk().
>>> from datasets import load_from_disk
>>> from datasets.filesystems import S3FileSystem
>>>
>>> s3 = S3FileSystem(key=aws_access_key_id, secret=aws_secret_access_key) # doctest: +SKIP
>>>
>>> dataset = load_from_disk('s3://my-private-datasets/imdb/train',fs=s3) # doctest: +SKIP
>>>
>>> print(len(dataset))
25000
Saving a dataset to S3 using S3FileSystem and dataset.save_to_disk().
>>> from datasets import load_dataset
>>> from datasets.filesystems import S3FileSystem
>>>
>>> dataset = load_dataset("imdb")
>>>
>>> s3 = S3FileSystem(key=aws_access_key_id, secret=aws_secret_access_key) # doctest: +SKIP
>>>
>>> dataset.save_to_disk('s3://my-private-datasets/imdb/train',fs=s3) # doctest: +SKIP
Parameters
- anon (bool, defaults to False) — Whether to use an anonymous connection (public buckets only). If False, uses the key/secret given, or boto’s credential resolver (client_kwargs, environment variables, config files, EC2 IAM server, in that order).
- key (string, defaults to None) — If not anonymous, use this access key ID, if specified.
- secret (string, defaults to None) — If not anonymous, use this secret access key, if specified.
- token (string, defaults to None) — If not anonymous, use this security token, if specified.
- use_ssl (bool, defaults to True) — Whether to use SSL in connections to S3; may be faster without, but insecure. If use_ssl is also set in client_kwargs, the value set in client_kwargs will take priority.
- s3_additional_kwargs (dict) — Parameters that are used when calling S3 API methods. Typically used for things like “ServerSideEncryption”.
- client_kwargs (dict) — Parameters for the botocore client.
- requester_pays (bool, defaults to False) — Whether RequesterPays buckets are supported.
- default_block_size (int, defaults to None) — If given, the default block size value used for open(), if no specific value is given at call time. The built-in default is 5MB.
- default_fill_cache (bool, defaults to True) — Whether to use cache filling with open by default. Refer to S3File.open.
- default_cache_type (string, defaults to ‘bytes’) — If given, the default cache_type value used for open(). Set to “none” if no caching is desired. See fsspec’s documentation for other available cache_type values.
- version_aware (bool, defaults to False) — Whether to support bucket versioning. If enabled, this will require the user to have the necessary IAM permissions for dealing with versioned objects.
- cache_regions (bool, defaults to False) — Whether to cache bucket regions or not. Whenever a new bucket is used, it will first find out which region it belongs to and then use the client for that region.
- config_kwargs (dict) — Parameters passed to botocore.client.Config.
- kwargs — Other parameters for the core session.
- session — aiobotocore AioSession object to be used for all connections. This session will be used in place of creating a new session inside S3FileSystem. For example: aiobotocore.session.AioSession(profile='test_user').
datasets.filesystems.extract_path_from_uri
< source >( dataset_path: str )
Preprocesses dataset_path and removes the remote filesystem prefix (e.g. s3://).
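The behavior described above can be illustrated with a few lines of plain Python. This is a sketch under the assumption that the prefix is everything up to and including the first `://`. `strip_protocol` is a hypothetical helper written for this example, not the library function itself.

```python
def strip_protocol(dataset_path: str) -> str:
    """Hypothetical helper: drop a leading '<protocol>://' if present."""
    if "://" in dataset_path:
        # Keep only what follows the first '://' separator.
        dataset_path = dataset_path.split("://", 1)[1]
    return dataset_path


remote = strip_protocol("s3://my-bucket/imdb/train")
local = strip_protocol("/home/user/imdb/train")
```

Paths without a protocol prefix pass through unchanged, so the same call can be applied to local and remote paths alike.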
datasets.filesystems.is_remote_filesystem
< source >( fs: AbstractFileSystem )
Parameters
- fs (fsspec.spec.AbstractFileSystem) — An abstract super-class for pythonic file systems, e.g. fsspec.filesystem('file') or datasets.filesystems.S3FileSystem.
Validates if filesystem has remote protocol.
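One way such a check could work is to inspect the filesystem object's `protocol` attribute. The sketch below assumes that approach and uses two hypothetical stand-in classes instead of real fsspec filesystems; `has_remote_protocol` is an illustrative name, not the library function.

```python
class FakeLocalFS:
    """Hypothetical stand-in for a local filesystem object."""

    protocol = "file"


class FakeS3FS:
    """Hypothetical stand-in for an S3 filesystem object."""

    protocol = ("s3", "s3a")


def has_remote_protocol(fs) -> bool:
    """Return True when fs advertises a protocol other than 'file'."""
    if fs is None:
        return False
    protocol = fs.protocol
    # A protocol may be a single string or a tuple of aliases.
    protocols = (protocol,) if isinstance(protocol, str) else tuple(protocol)
    return "file" not in protocols


local_is_remote = has_remote_protocol(FakeLocalFS())
s3_is_remote = has_remote_protocol(FakeS3FS())
```

Treating `None` as "not remote" lets callers pass an optional filesystem argument straight through without an extra guard.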
Fingerprint
Hasher that accepts python objects as inputs.