Main classes
DatasetInfo
-
class
datasets.
DatasetInfo
(description: str = <factory>, citation: str = <factory>, homepage: str = <factory>, license: str = <factory>, features: Optional[datasets.features.Features] = None, post_processed: Optional[datasets.info.PostProcessedInfo] = None, supervised_keys: Optional[datasets.info.SupervisedKeysData] = None, builder_name: Optional[str] = None, config_name: Optional[str] = None, version: Optional[Union[str, datasets.utils.version.Version]] = None, splits: Optional[dict] = None, download_checksums: Optional[dict] = None, download_size: Optional[int] = None, post_processing_size: Optional[int] = None, dataset_size: Optional[int] = None, size_in_bytes: Optional[int] = None)[source]¶ Information about a dataset.
DatasetInfo documents a dataset, including its name, version, and features. See the constructor arguments and properties for a full list.
Note: Not all fields are known on construction and may be updated later.
-
classmethod
from_directory
(dataset_info_dir: dict) → datasets.info.DatasetInfo[source]¶ Create DatasetInfo from the JSON file in dataset_info_dir.
This function updates all the dynamically generated fields (num_examples, hash, time of creation,…) of the DatasetInfo.
This will overwrite all previous metadata.
- Parameters
dataset_info_dir (str) – The directory containing the metadata file. This should be the root directory of a specific dataset version.
Dataset
The base class datasets.Dataset implements a Dataset backed by an Apache Arrow table.
-
class
datasets.
Dataset
(arrow_table: pyarrow.lib.Table, data_files: Optional[List[dict]] = None, info: Optional[datasets.info.DatasetInfo] = None, split: Optional[datasets.splits.NamedSplit] = None, indices_table: Optional[pyarrow.lib.Table] = None, indices_data_files: Optional[List[dict]] = None, fingerprint: Optional[str] = None, inplace_history: Optional[List[dict]] = None)[source]¶ A Dataset backed by an Arrow table or Record Batch.
-
__getitem__
(key: Union[int, slice, str]) → Union[Dict, List][source]¶ Can be used to index columns (by string names) or rows (by integer index or iterable of indices or bools)
-
__iter__
()
Iterate through the examples. If a format is set with datasets.Dataset.set_format(), rows are returned in the selected format.
-
add_elasticsearch_index
(column: str, index_name: Optional[str] = None, host: Optional[str] = None, port: Optional[int] = None, es_client: Optional[elasticsearch.Elasticsearch] = None, es_index_name: Optional[str] = None, es_index_config: Optional[dict] = None)[source]¶ Add a text index using ElasticSearch for fast retrieval. This is done in-place.
- Parameters
column (str) – The column of the documents to add to the index.
index_name (Optional str) – The index_name/identifier of the index. This is the index name that is used to call datasets.Dataset.get_nearest_examples() or datasets.Dataset.search(). By default it corresponds to column.
host (Optional str, defaults to localhost) – Host where ElasticSearch is running.
port (Optional int, defaults to 9200) – Port where ElasticSearch is running.
es_client (Optional elasticsearch.Elasticsearch) – The elasticsearch client used to create the index if host and port are None.
es_index_name (Optional str) – The elasticsearch index name used to create the index.
es_index_config (Optional dict) – The configuration of the elasticsearch index. Default config is:
Config:
    {
        "settings": {
            "number_of_shards": 1,
            "analysis": {"analyzer": {"stop_standard": {"type": "standard", " stopwords": "_english_"}}},
        },
        "mappings": {
            "properties": {
                "text": {"type": "text", "analyzer": "standard", "similarity": "BM25"},
            }
        },
    }
Example:
    import datasets
    import elasticsearch

    es_client = elasticsearch.Elasticsearch()
    ds = datasets.load_dataset('crime_and_punish', split='train')
    ds.add_elasticsearch_index(column='line', es_client=es_client, es_index_name="my_es_index")
    scores, retrieved_examples = ds.get_nearest_examples('line', 'my new query', k=10)
-
add_faiss_index
(column: str, index_name: Optional[str] = None, device: Optional[int] = None, string_factory: Optional[str] = None, metric_type: Optional[int] = None, custom_index: Optional[faiss.Index] = None, train_size: Optional[int] = None, faiss_verbose: bool = False, dtype=<class 'numpy.float32'>)
Add a dense index using Faiss for fast retrieval. By default the index is built over the vectors of the specified column. You can specify device if you want to run it on GPU (device must be the GPU index). More information about Faiss and its string factory can be found in the Faiss documentation.
- Parameters
column (str) – The column of the vectors to add to the index.
index_name (Optional str) – The index_name/identifier of the index. This is the index_name that is used to call datasets.Dataset.get_nearest_examples() or datasets.Dataset.search(). By default it corresponds to column.
device (Optional int) – If not None, this is the index of the GPU to use. By default it uses the CPU.
string_factory (Optional str) – This is passed to the index factory of Faiss to create the index. Default index class is IndexFlat.
metric_type (Optional int) – Type of metric. Ex: faiss.METRIC_INNER_PRODUCT or faiss.METRIC_L2.
custom_index (Optional faiss.Index) – Custom Faiss index that you already have instantiated and configured for your needs.
train_size (Optional int) – If the index needs a training step, specifies how many vectors will be used to train the index.
faiss_verbose (bool, defaults to False) – Enable the verbosity of the Faiss index.
dtype (data-type) – The dtype of the numpy arrays that are indexed. Default is np.float32.
Example:
    import datasets

    # embed() stands for a user-supplied function that maps a string to a vector
    ds = datasets.load_dataset('crime_and_punish', split='train')
    ds_with_embeddings = ds.map(lambda example: {'embeddings': embed(example['line'])})
    ds_with_embeddings.add_faiss_index(column='embeddings')
    # query
    scores, retrieved_examples = ds_with_embeddings.get_nearest_examples('embeddings', embed('my new query'), k=10)
    # save index
    ds_with_embeddings.save_faiss_index('embeddings', 'my_index.faiss')

    ds = datasets.load_dataset('crime_and_punish', split='train')
    # load index
    ds.load_faiss_index('embeddings', 'my_index.faiss')
    # query
    scores, retrieved_examples = ds.get_nearest_examples('embeddings', embed('my new query'), k=10)
-
add_faiss_index_from_external_arrays
(external_arrays: numpy.array, index_name: str, device: Optional[int] = None, string_factory: Optional[str] = None, metric_type: Optional[int] = None, custom_index: Optional[faiss.Index] = None, train_size: Optional[int] = None, faiss_verbose: bool = False, dtype=<class 'numpy.float32'>)
Add a dense index using Faiss for fast retrieval. The index is created using the vectors of external_arrays. You can specify device if you want to run it on GPU (device must be the GPU index). More information about Faiss and its string factory can be found in the Faiss documentation.
- Parameters
external_arrays (np.array) – If you want to use arrays from outside the lib for the index, you can set external_arrays. It will use external_arrays to create the Faiss index instead of the arrays in the given column.
index_name (str) – The index_name/identifier of the index. This is the index_name that is used to call datasets.Dataset.get_nearest_examples() or datasets.Dataset.search().
device (Optional int) – If not None, this is the index of the GPU to use. By default it uses the CPU.
string_factory (Optional str) – This is passed to the index factory of Faiss to create the index. Default index class is IndexFlat.
metric_type (Optional int) – Type of metric. Ex: faiss.METRIC_INNER_PRODUCT or faiss.METRIC_L2.
custom_index (Optional faiss.Index) – Custom Faiss index that you already have instantiated and configured for your needs.
train_size (Optional int) – If the index needs a training step, specifies how many vectors will be used to train the index.
faiss_verbose (bool, defaults to False) – Enable the verbosity of the Faiss index.
dtype (numpy.dtype) – The dtype of the numpy arrays that are indexed. Default is np.float32.
-
property
cache_files
¶ The cache file containing the Apache Arrow table backing the dataset.
-
cast_
(features: datasets.features.Features)
Cast the dataset to a new set of features.
You can also cast the dataset using Dataset.map() with features, but cast_() is in-place (doesn’t copy the data to a new dataset) and is thus faster.
- Parameters
features (datasets.Features) – New features to cast the dataset to. The names of the fields in the features must match the current column names. The type of the data must also be convertible from one type to the other. For non-trivial conversions, e.g. string <-> ClassLabel, you should use map() to update the Dataset.
-
cleanup_cache_files
()
Clean up all cache files in the dataset cache directory, except the currently used cache file if there is one. Be careful when running this command that no other process is currently using other cache files.
- Returns
Number of removed files
-
property
column_names
¶ Names of the columns in the dataset.
-
property
data
¶ The Apache Arrow table backing the dataset.
-
drop_index
(index_name: str)
Drop the index with the specified index_name.
- Parameters
index_name (
str
) – The index_name/identifier of the index.
-
export
(filename: str, format: str = 'tfrecord')[source]¶ Writes the Arrow dataset to a TFRecord file.
The dataset must already be in tensorflow format. The records will be written with keys from dataset._format_columns.
- Parameters
filename (str) – The filename, including the .tfrecord extension, to write to.
format (Optional[str], defaults to “tfrecord”) – The type of output file. Currently this is a no-op, as TFRecords are the only option. This enables a more flexible function signature later.
-
filter
(function: Optional[Callable] = None, with_indices=False, input_columns: Optional[Union[str, List[str]]] = None, batch_size: Optional[int] = 1000, remove_columns: Optional[List[str]] = None, keep_in_memory: bool = False, load_from_cache_file: bool = True, cache_file_name: Optional[str] = None, writer_batch_size: Optional[int] = 1000, fn_kwargs: Optional[dict] = None, num_proc: Optional[int] = None, suffix_template: str = '_{rank:05d}_of_{num_proc:05d}', new_fingerprint: Optional[str] = None) → datasets.arrow_dataset.Dataset[source]¶ Apply a filter function to all the elements in the table in batches and update the table so that the dataset only includes examples according to the filter function.
- Parameters
function (callable) – A function with one of the following signatures:
    - function(example: Union[Dict, Any]) -> bool if with_indices=False
    - function(example: Union[Dict, Any], indices: int) -> bool if with_indices=True
If no function is provided, defaults to an always-True function: lambda x: True
with_indices (bool, defaults to False) – Provide example indices to function. Note that in this case the signature of function should be def function(example, idx): ….
input_columns (Optional[Union[str, List[str]]], defaults to None) – The columns to be passed into function as positional arguments. If None, a dict mapping to all formatted columns is passed as one argument.
batch_size (Optional[int], defaults to 1000) – Number of examples per batch. If batch_size <= 0 or batch_size == None, the full dataset is processed as a single batch.
remove_columns (Optional[List[str]], defaults to None) – Remove a selection of columns while doing the mapping. Columns will be removed before updating the examples with the output of function, i.e. if function is adding columns with names in remove_columns, these columns will be kept.
keep_in_memory (bool, defaults to False) – Keep the dataset in memory instead of writing it to a cache file.
load_from_cache_file (bool, defaults to True) – If a cache file storing the current computation from function can be identified, use it instead of recomputing.
cache_file_name (Optional[str], defaults to None) – Provide the name of a path for the cache file. It is used to store the results of the computation instead of the automatically generated cache file name.
writer_batch_size (int, defaults to 1000) – Number of rows per write operation for the cache file writer. A higher value gives smaller cache files; a lower value consumes less temporary memory while running .map().
fn_kwargs (Optional[Dict], defaults to None) – Keyword arguments to be passed to function
num_proc (Optional[int], defaults to None) – Number of processes for multiprocessing. By default it doesn’t use multiprocessing.
suffix_template (str, defaults to "_{rank:05d}_of_{num_proc:05d}") – If cache_file_name is specified, then this suffix will be added at the end of the base name of each cache file. For example, if cache_file_name is “processed.arrow”, then for rank=1 and num_proc=4, the resulting file would be “processed_00001_of_00004.arrow” for the default suffix.
new_fingerprint (Optional[str], defaults to None) – the new fingerprint of the dataset after transform. If None, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments
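The suffix_template naming rule described above can be reproduced with plain string formatting. This is only a sketch of the rule as documented; shard_cache_file_name is a hypothetical helper written for illustration, not a function of the library.

```python
import os


def shard_cache_file_name(cache_file_name, rank, num_proc,
                          suffix_template="_{rank:05d}_of_{num_proc:05d}"):
    """Insert the formatted suffix between the base name and the extension."""
    base, extension = os.path.splitext(cache_file_name)
    return base + suffix_template.format(rank=rank, num_proc=num_proc) + extension


# Hypothetical helper applied to the example from the parameter description.
name = shard_cache_file_name("processed.arrow", rank=1, num_proc=4)
print(name)  # processed_00001_of_00004.arrow
```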
-
flatten_
(max_depth=16)[source]¶ Flatten the Table. Each column with a struct type is flattened into one column per struct field. Other columns are left unchanged.
-
flatten_indices
(keep_in_memory: bool = False, cache_file_name: Optional[str] = None, writer_batch_size: Optional[int] = 1000, features: Optional[datasets.features.Features] = None, disable_nullable: bool = True, new_fingerprint: Optional[str] = None) → datasets.arrow_dataset.Dataset[source]¶ Create and cache a new Dataset by flattening the indices mapping.
- Parameters
keep_in_memory (bool, defaults to False) – Keep the dataset in memory instead of writing it to a cache file.
cache_file_name (Optional[str], defaults to None) – Provide the name of a path for the cache file. It is used to store the results of the computation instead of the automatically generated cache file name.
writer_batch_size (int, defaults to 1000) – Number of rows per write operation for the cache file writer. A higher value gives smaller cache files; a lower value consumes less temporary memory while running .map().
features (Optional[datasets.Features], defaults to None) – Use a specific Features to store the cache file instead of the automatically generated one.
disable_nullable (bool, defaults to True) – Disallow null values in the table.
new_fingerprint (Optional[str], defaults to None) – the new fingerprint of the dataset after transform. If None, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments
-
formatted_as
(type: Optional[str] = None, columns: Optional[List] = None, output_all_columns: bool = False, **format_kwargs)[source]¶ To be used in a with statement. Set __getitem__ return format (type and columns)
- Parameters
type (Optional str) – Output type selected in [None, ‘numpy’, ‘torch’, ‘tensorflow’, ‘pandas’]. None means __getitem__ returns python objects (default).
columns (Optional List[str]) – Columns to format in the output. None means __getitem__ returns all columns (default).
output_all_columns (bool, defaults to False) – Keep un-formatted columns as well in the output (as python objects).
format_kwargs – Keyword arguments passed to the convert function like np.array, torch.tensor or tensorflow.ragged.constant.
-
classmethod
from_buffer
(buffer: pyarrow.lib.Buffer, info: Optional[datasets.info.DatasetInfo] = None, split: Optional[datasets.splits.NamedSplit] = None, indices_buffer: Optional[pyarrow.lib.Buffer] = None) → datasets.arrow_dataset.Dataset[source]¶ Instantiate a Dataset backed by an Arrow buffer
-
classmethod
from_dict
(mapping: dict, features: Optional[datasets.features.Features] = None, info: Optional[Any] = None, split: Optional[Any] = None) → datasets.arrow_dataset.Dataset
Convert dict to a pyarrow.Table to create a datasets.Dataset.
- Parameters
mapping (mapping) – A mapping of strings to Arrays or Python lists.
features (datasets.Features, optional, defaults to None) – If specified, the features types of the dataset.
info (datasets.DatasetInfo, optional, defaults to None) – If specified, the dataset info containing info like description, citation, etc.
split (datasets.NamedSplit, optional, defaults to None) – If specified, the name of the dataset split.
-
classmethod
from_file
(filename: str, info: Optional[datasets.info.DatasetInfo] = None, split: Optional[datasets.splits.NamedSplit] = None, indices_filename: Optional[str] = None) → datasets.arrow_dataset.Dataset[source]¶ Instantiate a Dataset backed by an Arrow table at filename
-
classmethod
from_pandas
(df: pandas.core.frame.DataFrame, features: Optional[datasets.features.Features] = None, info: Optional[datasets.info.DatasetInfo] = None, split: Optional[datasets.splits.NamedSplit] = None) → datasets.arrow_dataset.Dataset
Convert pandas.DataFrame to a pyarrow.Table to create a datasets.Dataset.
The column types in the resulting Arrow Table are inferred from the dtypes of the pandas.Series in the DataFrame. In the case of non-object Series, the NumPy dtype is translated to its Arrow equivalent. In the case of object, we need to guess the datatype by looking at the Python objects in this Series.
Be aware that Series of the object dtype don’t carry enough information to always lead to a meaningful Arrow type. In the case that we cannot infer a type, e.g. because the DataFrame is of length 0 or the Series only contains None/nan objects, the type is set to null. This behavior can be avoided by constructing explicit features and passing it to this function.
- Parameters
df (pandas.DataFrame) – The dataframe that contains the dataset.
features (datasets.Features, optional, defaults to None) – If specified, the features types of the dataset.
info (datasets.DatasetInfo, optional, defaults to None) – If specified, the dataset info containing info like description, citation, etc.
split (datasets.NamedSplit, optional, defaults to None) – If specified, the name of the dataset split.
-
get_index
(index_name: str) → datasets.search.BaseIndex
Return the index attached under the given index_name/identifier.
-
get_nearest_examples
(index_name: str, query: Union[str, numpy.array], k: int = 10) → datasets.search.NearestExamplesResults¶ Find the nearest examples in the dataset to the query.
- Parameters
index_name (str) – The index_name/identifier of the index.
query (Union[str, np.ndarray]) – The query as a string if index_name is a text index or as a numpy array if index_name is a vector index.
k (int) – The number of examples to retrieve.
- Output:
scores (List[float]): The retrieval scores of the retrieved examples.
examples (dict): The retrieved examples.
-
get_nearest_examples_batch
(index_name: str, queries: Union[List[str], numpy.array], k: int = 10) → datasets.search.BatchedNearestExamplesResults¶ Find the nearest examples in the dataset to the query.
- Parameters
index_name (str) – The index_name/identifier of the index.
queries (Union[List[str], np.ndarray]) – The queries as a list of strings if index_name is a text index or as a numpy array if index_name is a vector index.
k (int) – The number of examples to retrieve per query.
- Output:
total_scores (List[List[float]]): The retrieval scores of the retrieved examples per query.
total_examples (List[dict]): The retrieved examples per query.
-
property
info
datasets.DatasetInfo object containing all the metadata in the dataset.
-
list_indexes
() → List[str]
List the index_name/identifiers of all the attached indexes.
-
load_elasticsearch_index
(index_name: str, es_index_name: str, host: Optional[str] = None, port: Optional[int] = None, es_client: Optional[Elasticsearch] = None, es_index_config: Optional[dict] = None)¶ Load an existing text index using ElasticSearch for fast retrieval.
- Parameters
index_name (str) – The index_name/identifier of the index. This is the index name that is used to call .get_nearest_examples() or .search().
es_index_name (str) – The name of the elasticsearch index to load.
host (Optional str, defaults to localhost) – Host where ElasticSearch is running.
port (Optional int, defaults to 9200) – Port where ElasticSearch is running.
es_client (Optional elasticsearch.Elasticsearch) – The elasticsearch client used to create the index if host and port are None.
es_index_config (Optional dict) – The configuration of the elasticsearch index. Default config is:
Config:
    {
        "settings": {
            "number_of_shards": 1,
            "analysis": {"analyzer": {"stop_standard": {"type": "standard", " stopwords": "_english_"}}},
        },
        "mappings": {
            "properties": {
                "text": {"type": "text", "analyzer": "standard", "similarity": "BM25"},
            }
        },
    }
-
load_faiss_index
(index_name: str, file: str, device: Optional[int] = None)¶ Load a FaissIndex from disk. If you want to do additional configurations, you can have access to the faiss index object by doing .get_index(index_name).faiss_index to make it fit your needs
- Parameters
index_name (str) – The index_name/identifier of the index. This is the index_name that is used to call .get_nearest_examples() or .search().
file (str) – The path to the serialized faiss index on disk.
device (Optional int) – If not None, this is the index of the GPU to use. By default it uses the CPU.
-
static
load_from_disk
(dataset_path: str) → datasets.arrow_dataset.Dataset[source]¶ Load the dataset from a dataset directory
- Parameters
dataset_path (
str
) – path of the dataset directory where the dataset will be loaded from
-
map
(function: Optional[Callable] = None, with_indices: bool = False, input_columns: Optional[Union[str, List[str]]] = None, batched: bool = False, batch_size: Optional[int] = 1000, drop_last_batch: bool = False, remove_columns: Optional[List[str]] = None, keep_in_memory: bool = False, load_from_cache_file: bool = True, cache_file_name: Optional[str] = None, writer_batch_size: Optional[int] = 1000, features: Optional[datasets.features.Features] = None, disable_nullable: bool = False, fn_kwargs: Optional[dict] = None, num_proc: Optional[int] = None, suffix_template: str = '_{rank:05d}_of_{num_proc:05d}', new_fingerprint: Optional[str] = None) → datasets.arrow_dataset.Dataset
Apply a function to all the elements in the table (individually or in batches) and update the table (if the function returns updated examples).
- Parameters
function (callable) – A function with one of the following signatures:
    - function(example: Union[Dict, Any]) -> Union[Dict, Any] if batched=False and with_indices=False
    - function(example: Union[Dict, Any], indices: int) -> Union[Dict, Any] if batched=False and with_indices=True
    - function(batch: Union[Dict[List], List[Any]]) -> Union[Dict, Any] if batched=True and with_indices=False
    - function(batch: Union[Dict[List], List[Any]], indices: List[int]) -> Union[Dict, Any] if batched=True and with_indices=True
If no function is provided, defaults to the identity function: lambda x: x
with_indices (bool, defaults to False) – Provide example indices to function. Note that in this case the signature of function should be def function(example, idx): ….
input_columns (Optional[Union[str, List[str]]], defaults to None) – The columns to be passed into function as positional arguments. If None, a dict mapping to all formatted columns is passed as one argument.
batched (bool, defaults to False) – Provide batch of examples to function
batch_size (Optional[int], defaults to 1000) – Number of examples per batch provided to function if batched=True. If batch_size <= 0 or batch_size == None, the full dataset is provided as a single batch to function.
drop_last_batch (bool, defaults to False) – Whether a last batch smaller than the batch_size should be dropped instead of being processed by the function.
remove_columns (Optional[List[str]], defaults to None) – Remove a selection of columns while doing the mapping. Columns will be removed before updating the examples with the output of function, i.e. if function is adding columns with names in remove_columns, these columns will be kept.
keep_in_memory (bool, defaults to False) – Keep the dataset in memory instead of writing it to a cache file.
load_from_cache_file (bool, defaults to True) – If a cache file storing the current computation from function can be identified, use it instead of recomputing.
cache_file_name (Optional[str], defaults to None) – Provide the name of a path for the cache file. It is used to store the results of the computation instead of the automatically generated cache file name.
writer_batch_size (int, defaults to 1000) – Number of rows per write operation for the cache file writer. A higher value gives smaller cache files; a lower value consumes less temporary memory while running .map().
features (Optional[datasets.Features], defaults to None) – Use a specific Features to store the cache file instead of the automatically generated one.
disable_nullable (bool, defaults to False) – Disallow null values in the table.
fn_kwargs (Optional[Dict], defaults to None) – Keyword arguments to be passed to function
num_proc (Optional[int], defaults to None) – Number of processes for multiprocessing. By default it doesn’t use multiprocessing.
suffix_template (str, defaults to "_{rank:05d}_of_{num_proc:05d}") – If cache_file_name is specified, then this suffix will be added at the end of the base name of each cache file. For example, if cache_file_name is “processed.arrow”, then for rank=1 and num_proc=4, the resulting file would be “processed_00001_of_00004.arrow” for the default suffix.
new_fingerprint (Optional[str], defaults to None) – the new fingerprint of the dataset after transform. If None, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments
-
property
num_columns
¶ Number of columns in the dataset.
-
property
num_rows
Number of rows in the dataset (same as datasets.Dataset.__len__()).
-
remove_columns_
(column_names: Union[str, List[str]])[source]¶ Remove one or several column(s) in the dataset and the features associated to them.
You can also remove a column using Dataset.map() with remove_columns, but the present method is in-place (doesn’t copy the data to a new dataset) and is thus faster.
- Parameters
column_names (Union[str, List[str]]) – Name of the column(s) to remove.
-
rename_column_
(original_column_name: str, new_column_name: str)[source]¶ Rename a column in the dataset and move the features associated to the original column under the new column name.
You can also rename a column using Dataset.map() with remove_columns, but the present method:
    - takes care of moving the original features under the new column name.
    - doesn’t copy the data to a new dataset and is thus much faster.
- Parameters
original_column_name (str) – Name of the column to rename.
new_column_name (str) – New name for the column.
-
reset_format
()[source]¶ Reset __getitem__ return format to python objects and all columns.
Same as self.set_format().
-
save_faiss_index
(index_name: str, file: str)¶ Save a FaissIndex on disk
- Parameters
index_name (str) – The index_name/identifier of the index. This is the index_name that is used to call .get_nearest_examples() or .search().
file (str) – The path to the serialized faiss index on disk.
-
save_to_disk
(dataset_path: str)[source]¶ Save the dataset in a dataset directory
- Parameters
dataset_path (
str
) – path of the dataset directory where the dataset will be saved to
-
search
(index_name: str, query: Union[str, numpy.array], k: int = 10) → datasets.search.SearchResults¶ Find the nearest examples indices in the dataset to the query.
- Parameters
index_name (str) – The name/identifier of the index.
query (Union[str, np.ndarray]) – The query as a string if index_name is a text index or as a numpy array if index_name is a vector index.
k (int) – The number of examples to retrieve.
- Output:
scores (List[float]): The retrieval scores of the retrieved examples.
indices (List[int]): The indices of the retrieved examples.
-
search_batch
(index_name: str, queries: Union[List[str], numpy.array], k: int = 10) → datasets.search.BatchedSearchResults¶ Find the nearest examples indices in the dataset to the query.
- Parameters
index_name (str) – The index_name/identifier of the index.
queries (Union[List[str], np.ndarray]) – The queries as a list of strings if index_name is a text index or as a numpy array if index_name is a vector index.
k (int) – The number of examples to retrieve per query.
- Output:
total_scores (List[List[float]]): The retrieval scores of the retrieved examples per query.
total_indices (List[List[int]]): The indices of the retrieved examples per query.
-
select
(indices: collections.abc.Iterable, keep_in_memory: bool = False, indices_cache_file_name: Optional[str] = None, writer_batch_size: Optional[int] = 1000, new_fingerprint: Optional[str] = None) → datasets.arrow_dataset.Dataset[source]¶ Create a new dataset with rows selected following the list/array of indices.
- Parameters
indices (sequence, iterable, ndarray or Series) – List or 1D-array of integer indices for indexing.
keep_in_memory (bool, defaults to False) – Keep the indices mapping in memory instead of writing it to a cache file.
indices_cache_file_name (Optional[str], defaults to None) – Provide the name of a path for the cache file. It is used to store the indices mapping instead of the automatically generated cache file name.
writer_batch_size (int, defaults to 1000) – Number of rows per write operation for the cache file writer. A higher value gives smaller cache files; a lower value consumes less temporary memory while running .map().
new_fingerprint (Optional[str], defaults to None) – the new fingerprint of the dataset after transform. If None, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments
-
set_format
(type: Optional[str] = None, columns: Optional[List] = None, output_all_columns: bool = False, **format_kwargs)[source]¶ Set __getitem__ return format (type and columns)
- Parameters
type (Optional str) – Output type selected in [None, ‘numpy’, ‘torch’, ‘tensorflow’, ‘pandas’]. None means __getitem__ returns python objects (default).
columns (Optional List[str]) – Columns to format in the output. None means __getitem__ returns all columns (default).
output_all_columns (bool, defaults to False) – Keep un-formatted columns as well in the output (as python objects).
format_kwargs – Keyword arguments passed to the convert function like np.array, torch.tensor or tensorflow.ragged.constant.
-
property
shape
Shape of the dataset (number of rows, number of columns).
-
shard
(num_shards: int, index: int, contiguous: bool = False, keep_in_memory: bool = False, indices_cache_file_name: Optional[str] = None, writer_batch_size: Optional[int] = 1000) → datasets.arrow_dataset.Dataset[source]¶ Return the index-nth shard from dataset split into num_shards pieces.
This shards deterministically. dset.shard(n, i) will contain all elements of dset whose index mod n = i.
dset.shard(n, i, contiguous=True) will instead split dset into contiguous chunks, so it can be easily concatenated back together after processing. If len(dset) % n == l, then the first l shards will have length (len(dset) // n) + 1, and the remaining shards will have length (len(dset) // n). datasets.concatenate_datasets([dset.shard(n, i, contiguous=True) for i in range(n)]) will return a dataset with the same order as the original.
Be sure to shard before using any randomizing operator (such as shuffle). It is best if the shard operator is used early in the dataset pipeline.
- Parameters
num_shards (int) – How many shards to split the dataset into.
index (int) – Which shard to select and return.
contiguous (bool, defaults to False) – Whether to select contiguous blocks of indices for shards.
keep_in_memory (bool, defaults to False) – Keep the dataset in memory instead of writing it to a cache file.
load_from_cache_file (bool, defaults to True) – If a cache file storing the current computation from function can be identified, use it instead of recomputing.
indices_cache_file_name (Optional[str], defaults to None) – Provide the name of a path for the cache file. It is used to store the indices of each shard instead of the automatically generated cache file name.
writer_batch_size (int, defaults to 1000) – Number of rows per write operation for the cache file writer. A higher value gives smaller cache files; a lower value consumes less temporary memory while running .map().
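The two sharding modes can be illustrated on plain row indices. This is a minimal sketch rather than the library's implementation: shard_indices is a hypothetical helper, and the contiguous arithmetic assumes that the first len(dset) % num_shards shards each receive one extra row.

```python
from typing import List


def shard_indices(num_rows: int, num_shards: int, index: int,
                  contiguous: bool = False) -> List[int]:
    """Return the row indices that shard `index` would contain (hypothetical helper)."""
    if not 0 <= index < num_shards:
        raise ValueError("index must satisfy 0 <= index < num_shards")
    if contiguous:
        # Assumption: the first `rem` shards get one extra row each.
        div, rem = divmod(num_rows, num_shards)
        start = div * index + min(index, rem)
        end = start + div + (1 if index < rem else 0)
        return list(range(start, end))
    # Default mode: every row whose position modulo num_shards equals `index`.
    return list(range(index, num_rows, num_shards))


rows = 10
strided = [shard_indices(rows, 3, i) for i in range(3)]
chunks = [shard_indices(rows, 3, i, contiguous=True) for i in range(3)]
print(strided)  # [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
print(chunks)   # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

Concatenating the contiguous chunks in order gives back the original row order, which is why contiguous=True suits a shard, process, concatenate workflow.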
-
shuffle
(seed: Optional[int] = None, generator: Optional[numpy.random._generator.Generator] = None, keep_in_memory: bool = False, load_from_cache_file: bool = True, indices_cache_file_name: Optional[str] = None, writer_batch_size: Optional[int] = 1000, new_fingerprint: Optional[str] = None) → datasets.arrow_dataset.Dataset[source]¶ Create a new Dataset where the rows are shuffled.
Currently shuffling uses numpy random generators. You can either supply a NumPy BitGenerator to use, or a seed to initiate NumPy’s default random generator (PCG64).
- Parameters
seed (Optional int) – A seed to initialize the default BitGenerator if generator=None. If None, then fresh, unpredictable entropy will be pulled from the OS. If an int or array_like[ints] is passed, then it will be passed to SeedSequence to derive the initial BitGenerator state.
generator (Optional np.random.Generator) – NumPy random Generator to use to compute the permutation of the dataset rows. If generator=None (default), uses np.random.default_rng (the default BitGenerator (PCG64) of NumPy).
keep_in_memory (bool, defaults to False) – Keep the shuffled indices in memory instead of writing them to a cache file.
load_from_cache_file (bool, defaults to True) – If a cache file storing the shuffled indices can be identified, use it instead of recomputing.
indices_cache_file_name (Optional[str], defaults to None) – Provide the name of a path for the cache file. It is used to store the shuffled indices instead of the automatically generated cache file name.
writer_batch_size (int, defaults to 1000) – Number of rows per write operation for the cache file writer. A higher value gives smaller cache files; a lower value consumes less temporary memory while running .map().
new_fingerprint (Optional[str], defaults to None) – The new fingerprint of the dataset after the transform. If None, the new fingerprint is computed using a hash of the previous fingerprint and the transform arguments.
-
sort
(column: str, reverse: bool = False, kind: str = None, keep_in_memory: bool = False, load_from_cache_file: bool = True, indices_cache_file_name: Optional[str] = None, writer_batch_size: Optional[int] = 1000, new_fingerprint: Optional[str] = None) → datasets.arrow_dataset.Dataset[source]¶ Create a new dataset sorted according to a column.
Currently sorting according to a column name uses numpy sorting algorithm under the hood. The column should thus be a numpy compatible type (in particular not a nested type). This also means that the column used for sorting is fully loaded in memory (which should be fine in most cases).
- Parameters
column (str) – column name to sort by.
reverse (bool, defaults to False) – If True, sort in descending order rather than ascending.
kind (Optional str) – NumPy sorting algorithm, selected from {‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’}. The default is ‘quicksort’. Note that both ‘stable’ and ‘mergesort’ use timsort under the covers and, in general, the actual implementation will vary with data type. The ‘mergesort’ option is retained for backwards compatibility.
keep_in_memory (bool, defaults to False) – Keep the sorted indices in memory instead of writing it to a cache file.
load_from_cache_file (bool, defaults to True) – If a cache file storing the sorted indices can be identified, use it instead of recomputing.
indices_cache_file_name (Optional[str], defaults to None) – Provide the name of a path for the cache file. It is used to store the sorted indices instead of the automatically generated cache file name.
writer_batch_size (int, defaults to 1000) – Number of rows per write operation for the cache file writer. A higher value gives smaller cache files; a lower value consumes less temporary memory.
new_fingerprint (Optional[str], defaults to None) – The new fingerprint of the dataset after the transform. If None, the new fingerprint is computed using a hash of the previous fingerprint and the transform arguments.
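The parameters above refer to "sorted indices" rather than sorted data, which suggests that sorting yields a mapping of row positions. The sketch below illustrates that idea in plain Python. sorted_indices is a hypothetical helper, Python's built-in sort stands in for the NumPy sort mentioned above, and treating reverse as a reversal of the ascending order is an assumption.

```python
from typing import List, Sequence


def sorted_indices(column: Sequence, reverse: bool = False) -> List[int]:
    """Return row positions ordered by the values in `column` (hypothetical helper)."""
    # Python's built-in sort stands in for the NumPy sort the text mentions.
    order = sorted(range(len(column)), key=lambda i: column[i])
    # Assumption: descending order is modelled as the reversed ascending order.
    return order[::-1] if reverse else order


table = {"label": [2, 0, 1], "text": ["c", "a", "b"]}
order = sorted_indices(table["label"])
# Rows are read through the indices; the underlying columns are left untouched.
sorted_text = [table["text"][i] for i in order]
print(order)        # [1, 2, 0]
print(sorted_text)  # ['a', 'b', 'c']
```

Only the sort column has to be materialised to compute the order; the other columns are simply read through the resulting indices.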
-
property
split
¶ datasets.NamedSplit
object corresponding to a named dataset split.
-
train_test_split
(test_size: Optional[Union[float, int]] = None, train_size: Optional[Union[float, int]] = None, shuffle: bool = True, seed: Optional[int] = None, generator: Optional[numpy.random._generator.Generator] = None, keep_in_memory: bool = False, load_from_cache_file: bool = True, train_indices_cache_file_name: Optional[str] = None, test_indices_cache_file_name: Optional[str] = None, writer_batch_size: Optional[int] = 1000, train_new_fingerprint: Optional[str] = None, test_new_fingerprint: Optional[str] = None) → DatasetDict[source]¶ Return a dictionary (
datasets.DatasetDict
) with two random train and test subsets (train and test Dataset
splits). Splits are created from the dataset according to test_size, train_size and shuffle.
This method is similar to scikit-learn train_test_split with the omission of the stratified options.
- Parameters
test_size (Optional[Union[float, int]]) – Size of the test split. If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.25.
train_size (Optional[Union[float, int]]) – Size of the train split. If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.
shuffle (Optional bool, defaults to True) – Whether or not to shuffle the data before splitting.
seed (Optional int) – A seed to initialize the default BitGenerator if generator=None. If None, then fresh, unpredictable entropy will be pulled from the OS. If an int or array_like[ints] is passed, then it will be passed to SeedSequence to derive the initial BitGenerator state.
generator (Optional np.random.Generator) – NumPy random Generator to use to compute the permutation of the dataset rows. If generator=None (default), uses np.random.default_rng (the default BitGenerator (PCG64) of NumPy).
keep_in_memory (bool, defaults to False) – Keep the split indices in memory instead of writing them to a cache file.
load_from_cache_file (bool, defaults to True) – If a cache file storing the splits indices can be identified, use it instead of recomputing.
train_indices_cache_file_name (Optional[str], defaults to None) – Provide the name of a path for the cache file. It is used to store the train split indices instead of the automatically generated cache file name.
test_indices_cache_file_name (Optional[str], defaults to None) – Provide the name of a path for the cache file. It is used to store the test split indices instead of the automatically generated cache file name.
writer_batch_size (int, defaults to 1000) – Number of rows per write operation for the cache file writer. A higher value gives smaller cache files; a lower value consumes less temporary memory while running .map().
train_new_fingerprint (Optional[str], defaults to None) – the new fingerprint of the train set after transform. If None, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments
test_new_fingerprint (Optional[str], defaults to None) – the new fingerprint of the test set after transform. If None, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments
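The interplay between test_size and train_size can be sketched as a small resolution routine. This is an illustration of the rules stated above, not the library's code: resolve_split_sizes is a hypothetical helper, and the rounding direction for float sizes (test rounds up, train rounds down, following scikit-learn's convention) is an assumption.

```python
import math
from typing import Optional, Tuple, Union

Size = Optional[Union[float, int]]


def resolve_split_sizes(n_rows: int, test_size: Size = None,
                        train_size: Size = None) -> Tuple[int, int]:
    """Return (n_train, n_test) for a dataset of n_rows rows (hypothetical helper)."""
    if test_size is None and train_size is None:
        test_size = 0.25  # documented default when both sizes are None

    def to_count(size: Size, round_up: bool) -> Optional[int]:
        if size is None:
            return None
        if isinstance(size, float):
            if not 0.0 < size < 1.0:
                raise ValueError("float sizes must lie between 0.0 and 1.0")
            # Assumption: test counts round up and train counts round down.
            scaled = size * n_rows
            return math.ceil(scaled) if round_up else math.floor(scaled)
        return size

    n_test = to_count(test_size, round_up=True)
    n_train = to_count(train_size, round_up=False)
    # A missing size is the complement of the other one.
    if n_test is None:
        n_test = n_rows - n_train
    if n_train is None:
        n_train = n_rows - n_test
    if n_train < 0 or n_test < 0 or n_train + n_test > n_rows:
        raise ValueError("requested sizes do not fit the dataset")
    return n_train, n_test


print(resolve_split_sizes(100))                  # (75, 25)
print(resolve_split_sizes(100, test_size=10))    # (90, 10)
print(resolve_split_sizes(100, train_size=0.5))  # (50, 50)
```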
-
unique
(column: str) → List[Any][source]¶ Return a list of the unique elements in a column.
This is implemented in the low-level backend and is therefore very fast.
- Parameters
column (str) – Column name (list all the column names with datasets.Dataset.column_names()).
- Returns
list of unique elements in the given column.
-
-
datasets.
concatenate_datasets
(dsets: List[datasets.arrow_dataset.Dataset], info: Optional[Any] = None, split: Optional[Any] = None)[source]¶ Converts a list of datasets.Dataset with the same schema into a single datasets.Dataset.
- Parameters
dsets (List[datasets.Dataset]) – A list of Datasets to concatenate.
info (datasets.DatasetInfo, optional, defaults to None) – If specified, the dataset info containing info like description, citation, etc.
split (datasets.NamedSplit, optional, defaults to None) – If specified, the name of the dataset split.
DatasetDict
¶
Dictionary with split names as keys (‘train’, ‘test’ for example), and datasets.Dataset
objects as values.
It also has dataset transform methods like map or filter, to process all the splits at once.
-
class
datasets.
DatasetDict
[source]¶ A dictionary (dict of str: datasets.Dataset) with dataset transforms methods (map, filter, etc.)
-
property
cache_files
¶ The cache files containing the Apache Arrow table backing each split.
-
cast_
(features: datasets.features.Features)[source]¶ Cast the dataset to a new set of features. The transformation is applied to all the datasets of the dataset dictionary.
You can also cast the dataset using Dataset.map() with features, but cast_() is in-place (doesn’t copy the data to a new dataset) and is thus faster.
- Parameters
features (datasets.Features) – New features to cast the dataset to. The name and order of the fields in the features must match the current column names. The type of the data must also be convertible from one type to the other. For non-trivial conversions, e.g. string <-> ClassLabel, you should use map() to update the Dataset.
-
cleanup_cache_files
() → Dict[str, int][source]¶ Clean up all cache files in the dataset cache directory, except the currently used cache file if there is one. Be careful when running this command that no other process is currently using other cache files.
- Returns
Dict with the number of removed files for each split
-
property
column_names
¶ Names of the columns in each split of the dataset.
-
property
data
¶ The Apache Arrow tables backing each split.
-
filter
(function, with_indices=False, input_columns: Optional[Union[str, List[str]]] = None, batch_size: Optional[int] = 1000, remove_columns: Optional[List[str]] = None, keep_in_memory: bool = False, load_from_cache_file: bool = True, cache_file_names: Optional[Dict[str, str]] = None, writer_batch_size: Optional[int] = 1000, fn_kwargs: Optional[dict] = None, num_proc: Optional[int] = None) → datasets.dataset_dict.DatasetDict[source]¶ Apply a filter function to all the elements in the table in batches and update the table so that the dataset only includes examples according to the filter function. The transformation is applied to all the datasets of the dataset dictionary.
- Parameters
function (callable) – A callable with one of the following signatures:
- function(example: Dict) -> bool if with_indices=False
- function(example: Dict, indices: int) -> bool if with_indices=True
with_indices (bool, defaults to False) – Provide example indices to function. Note that in this case the signature of function should be def function(example, idx): ….
input_columns (Optional[Union[str, List[str]]], defaults to None) – The columns to be passed into function as positional arguments. If None, a dict mapping to all formatted columns is passed as one argument.
batch_size (Optional[int], defaults to 1000) – Number of examples per batch provided to function if batched=True. If batch_size <= 0 or batch_size == None, the full dataset is provided as a single batch to function.
remove_columns (Optional[List[str]], defaults to None) – Remove a selection of columns while doing the mapping. Columns will be removed before updating the examples with the output of function, i.e. if function is adding columns with names in remove_columns, these columns will be kept.
keep_in_memory (bool, defaults to False) – Keep the dataset in memory instead of writing it to a cache file.
load_from_cache_file (bool, defaults to True) – If a cache file storing the current computation from function can be identified, use it instead of recomputing.
cache_file_names (Optional[Dict[str, str]], defaults to None) – Provide the name of a path for the cache file. It is used to store the results of the computation instead of the automatically generated cache file name. You have to provide one cache_file_name per dataset in the dataset dictionary.
writer_batch_size (int, defaults to 1000) – Number of rows per write operation for the cache file writer. A higher value gives smaller cache files; a lower value consumes less temporary memory while running .map().
fn_kwargs (Optional[Dict], defaults to None) – Keyword arguments to be passed to function
num_proc (Optional[int], defaults to None) – Number of processes for multiprocessing. By default it doesn’t use multiprocessing.
-
flatten_
(max_depth=16)[source]¶ Flatten the Apache Arrow Table of each split (nested features are flattened). Each column with a struct type is flattened into one column per struct field. Other columns are left unchanged.
-
formatted_as
(type: Optional[str] = None, columns: Optional[List] = None, output_all_columns: bool = False, **format_kwargs)[source]¶ To be used in a with statement. Set __getitem__ return format (type and columns). The transformation is applied to all the datasets of the dataset dictionary.
- Parameters
type (Optional str) – Output type, selected from [None, ‘numpy’, ‘torch’, ‘tensorflow’, ‘pandas’]. None means __getitem__ returns Python objects (default).
columns (Optional List[str]) – Columns to format in the output. None means __getitem__ returns all columns (default).
output_all_columns (bool, defaults to False) – Keep unformatted columns in the output as well (as Python objects).
format_kwargs – Keyword arguments passed to the conversion function, such as np.array, torch.tensor or tensorflow.ragged.constant.
-
static
load_from_disk
(dataset_dict_path: str) → datasets.dataset_dict.DatasetDict[source]¶ Load the dataset dict from a dataset dict directory
- Parameters
dataset_dict_path (str) – Path of the dataset dict directory from which the dataset dict will be loaded.
-
map
(function, with_indices: bool = False, input_columns: Optional[Union[str, List[str]]] = None, batched: bool = False, batch_size: Optional[int] = 1000, remove_columns: Optional[List[str]] = None, keep_in_memory: bool = False, load_from_cache_file: bool = True, cache_file_names: Optional[Dict[str, str]] = None, writer_batch_size: Optional[int] = 1000, features: Optional[datasets.features.Features] = None, disable_nullable: bool = False, fn_kwargs: Optional[dict] = None, num_proc: Optional[int] = None) → datasets.dataset_dict.DatasetDict[source]¶ Apply a function to all the elements in the table (individually or in batches) and update the table (if function returns updated examples). The transformation is applied to all the datasets of the dataset dictionary.
- Parameters
function (callable) – A callable with one of the following signatures:
- function(example: Dict) -> Union[Dict, Any] if batched=False and with_indices=False
- function(example: Dict, indices: int) -> Union[Dict, Any] if batched=False and with_indices=True
- function(batch: Dict[List]) -> Union[Dict, Any] if batched=True and with_indices=False
- function(batch: Dict[List], indices: List[int]) -> Union[Dict, Any] if batched=True and with_indices=True
with_indices (bool, defaults to False) – Provide example indices to function. Note that in this case the signature of function should be def function(example, idx): ….
input_columns (Optional[Union[str, List[str]]], defaults to None) – The columns to be passed into function as positional arguments. If None, a dict mapping to all formatted columns is passed as one argument.
batched (bool, defaults to False) – Provide batch of examples to function
batch_size (Optional[int], defaults to 1000) – Number of examples per batch provided to function if batched=True. If batch_size <= 0 or batch_size == None, the full dataset is provided as a single batch to function.
remove_columns (Optional[List[str]], defaults to None) – Remove a selection of columns while doing the mapping. Columns will be removed before updating the examples with the output of function, i.e. if function is adding columns with names in remove_columns, these columns will be kept.
keep_in_memory (bool, defaults to False) – Keep the dataset in memory instead of writing it to a cache file.
load_from_cache_file (bool, defaults to True) – If a cache file storing the current computation from function can be identified, use it instead of recomputing.
cache_file_names (Optional[Dict[str, str]], defaults to None) – Provide the name of a path for the cache file. It is used to store the results of the computation instead of the automatically generated cache file name. You have to provide one cache_file_name per dataset in the dataset dictionary.
writer_batch_size (int, defaults to 1000) – Number of rows per write operation for the cache file writer. A higher value gives smaller cache files; a lower value consumes less temporary memory while running .map().
features (Optional[datasets.Features], defaults to None) – Use a specific Features to store the cache file instead of the automatically generated one.
disable_nullable (bool, defaults to False) – Disallow null values in the table.
fn_kwargs (Optional[Dict], defaults to None) – Keyword arguments to be passed to function
num_proc (Optional[int], defaults to None) – Number of processes for multiprocessing. By default it doesn’t use multiprocessing.
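The four function signatures listed for map differ only in what the function receives. The following sketch mimics that calling convention on a plain dict of lists. toy_map is a hypothetical stand-in, not the library's map, and it assumes the function returns a dict whose entries update or add columns.

```python
from typing import Any, Callable, Dict, List

Table = Dict[str, List[Any]]


def toy_map(table: Table, function: Callable, with_indices: bool = False,
            batched: bool = False, batch_size: int = 1000) -> Table:
    """Apply `function` to a dict-of-lists table (hypothetical stand-in for map)."""
    num_rows = len(next(iter(table.values()))) if table else 0
    out: Table = {name: [] for name in table}
    step = batch_size if batched else 1
    for start in range(0, num_rows, step):
        idx = list(range(start, min(start + step, num_rows)))
        if batched:
            # Batched mode: the function sees a dict of lists (and a list of indices).
            inputs = {name: [col[i] for i in idx] for name, col in table.items()}
            args = (inputs, idx) if with_indices else (inputs,)
        else:
            # Per-example mode: the function sees a dict of single values (and one index).
            inputs = {name: col[start] for name, col in table.items()}
            args = (inputs, start) if with_indices else (inputs,)
        # Assumption: the returned dict updates existing columns or adds new ones.
        result = {**inputs, **function(*args)}
        for name, value in result.items():
            out.setdefault(name, [])
            out[name].extend(value if batched else [value])
    return out


table = {"text": ["a", "bb", "ccc"]}
per_example = toy_map(table, lambda ex, i: {"n": len(ex["text"]) + i}, with_indices=True)
per_batch = toy_map(table, lambda b: {"n": [len(t) for t in b["text"]]},
                    batched=True, batch_size=2)
print(per_example)  # {'text': ['a', 'bb', 'ccc'], 'n': [1, 3, 5]}
print(per_batch)    # {'text': ['a', 'bb', 'ccc'], 'n': [1, 2, 3]}
```

In batched mode the function must return lists with one entry per example in the batch, which is why the second lambda builds a list.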
-
property
num_columns
¶ Number of columns in each split of the dataset.
-
property
num_rows
¶ Number of rows in each split of the dataset (same as
datasets.Dataset.__len__()
).
-
remove_columns_
(column_names: Union[str, List[str]])[source]¶ Remove one or several column(s) from each split in the dataset and the features associated to the column(s).
The transformation is applied to all the splits of the dataset dictionary.
You can also remove a column using Dataset.map() with remove_columns, but the present method is in-place (doesn’t copy the data to a new dataset) and is thus faster.
- Parameters
column_names (Union[str, List[str]]) – Name of the column(s) to remove.
-
rename_column_
(original_column_name: str, new_column_name: str)[source]¶ Rename a column in the dataset and move the features associated to the original column under the new column name. The transformation is applied to all the datasets of the dataset dictionary.
You can also rename a column using Dataset.map() with remove_columns, but the present method:
- takes care of moving the original features under the new column name.
- doesn’t copy the data to a new dataset and is thus much faster.
- Parameters
original_column_name (str) – Name of the column to rename.
new_column_name (str) – New name for the column.
-
reset_format
()[source]¶ Reset __getitem__ return format to python objects and all columns. The transformation is applied to all the datasets of the dataset dictionary.
Same as
self.set_format()
-
save_to_disk
(dataset_dict_path: str)[source]¶ Save the dataset dict in a dataset dict directory.
- Parameters
dataset_dict_path (str) – Path of the dataset dict directory where the dataset dict will be saved.
-
set_format
(type: Optional[str] = None, columns: Optional[List] = None, output_all_columns: bool = False, **format_kwargs)[source]¶ Set __getitem__ return format (type and columns). The transformation is applied to all the datasets of the dataset dictionary.
- Parameters
type (Optional str) – Output type, selected from [None, ‘numpy’, ‘torch’, ‘tensorflow’, ‘pandas’]. None means __getitem__ returns Python objects (default).
columns (Optional List[str]) – Columns to format in the output. None means __getitem__ returns all columns (default).
output_all_columns (bool, defaults to False) – Keep unformatted columns in the output as well (as Python objects).
format_kwargs – Keyword arguments passed to the conversion function, such as np.array, torch.tensor or tensorflow.ragged.constant.
-
property
shape
¶ Shape of each split of the dataset (number of rows, number of columns).
-
shuffle
(seeds: Optional[Union[int, Dict[str, int]]] = None, seed: Optional[int] = None, generators: Optional[Dict[str, numpy.random._generator.Generator]] = None, keep_in_memory: bool = False, load_from_cache_file: bool = True, indices_cache_file_names: Optional[Dict[str, str]] = None, writer_batch_size: Optional[int] = 1000)[source]¶ Create a new Dataset where the rows are shuffled. The transformation is applied to all the datasets of the dataset dictionary.
Currently shuffling uses numpy random generators. You can either supply a NumPy BitGenerator to use, or a seed to initiate NumPy’s default random generator (PCG64).
- Parameters
seeds (Optional Dict[str, int] or int) – A seed to initialize the default BitGenerator if generator=None. If None, then fresh, unpredictable entropy will be pulled from the OS. If an int or array_like[ints] is passed, then it will be passed to SeedSequence to derive the initial BitGenerator state. You can provide one seed per dataset in the dataset dictionary.
seed (Optional int) – A seed to initialize the default BitGenerator if generator=None. Alias for seeds (the seed argument has priority over seeds if both arguments are provided).
generators (Optional Dict[str, np.random.Generator]) – NumPy random Generator to use to compute the permutation of the dataset rows. If generator=None (default), uses np.random.default_rng (the default BitGenerator (PCG64) of NumPy). You have to provide one generator per dataset in the dataset dictionary.
keep_in_memory (bool, defaults to False) – Keep the dataset in memory instead of writing it to a cache file.
load_from_cache_file (bool, defaults to True) – If a cache file storing the current computation from function can be identified, use it instead of recomputing.
indices_cache_file_names (Optional[Dict[str, str]], defaults to None) – Provide the name of a path for the cache file. It is used to store the indices mappings instead of the automatically generated cache file name. You have to provide one cache_file_name per dataset in the dataset dictionary.
writer_batch_size (int, defaults to 1000) – Number of rows per write operation for the cache file writer. A higher value gives smaller cache files; a lower value consumes less temporary memory while running .map().
-
sort
(column: str, reverse: bool = False, kind: str = None, keep_in_memory: bool = False, load_from_cache_file: bool = True, indices_cache_file_names: Optional[Dict[str, str]] = None, writer_batch_size: Optional[int] = 1000) → datasets.dataset_dict.DatasetDict[source]¶ Create a new dataset sorted according to a column. The transformation is applied to all the datasets of the dataset dictionary.
Currently sorting according to a column name uses numpy sorting algorithm under the hood. The column should thus be a numpy compatible type (in particular not a nested type). This also means that the column used for sorting is fully loaded in memory (which should be fine in most cases).
- Parameters
column (str) – column name to sort by.
reverse (bool, defaults to False) – If True, sort in descending order rather than ascending.
kind (Optional str) – NumPy sorting algorithm, selected from {‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’}. The default is ‘quicksort’. Note that both ‘stable’ and ‘mergesort’ use timsort under the covers and, in general, the actual implementation will vary with data type. The ‘mergesort’ option is retained for backwards compatibility.
keep_in_memory (bool, defaults to False) – Keep the dataset in memory instead of writing it to a cache file.
load_from_cache_file (bool, defaults to True) – If a cache file storing the current computation from function can be identified, use it instead of recomputing.
indices_cache_file_names (Optional[Dict[str, str]], defaults to None) – Provide the name of a path for the cache file. It is used to store the indices mapping instead of the automatically generated cache file name. You have to provide one cache_file_name per dataset in the dataset dictionary.
writer_batch_size (int, defaults to 1000) – Number of rows per write operation for the cache file writer. A higher value gives smaller cache files; a lower value consumes less temporary memory while running .map().
-
unique
(column: str) → Dict[str, List[Any]][source]¶ Return a list of the unique elements in a column for each split.
This is implemented in the low-level backend and is therefore very fast.
- Parameters
column (str) – Column name (list all the column names with datasets.Dataset.column_names()).
- Returns
Dict[str, list] of unique elements in the given column.
-
property
Features
¶
-
class
datasets.
Sequence
(feature: Any, length: int = - 1, id: Optional[str] = None)[source]¶ Construct a list of features from a single type or a dict of types. Mostly here for compatibility with tfds.
-
class
datasets.
ClassLabel
(num_classes: int = None, names: List[str] = None, names_file: str = None, id: Optional[str] = None)[source]¶ Handle integer class labels. Here for compatibility with tfds.
There are 3 ways to define a ClassLabel, which correspond to the 3 arguments:
- num_classes: create 0 to (num_classes-1) labels
- names: a list of label strings
- names_file: a file containing the list of labels.
Note: On python2, the strings are encoded as utf-8.
- Parameters
num_classes – int, number of classes. All labels must be < num_classes.
names – list<str>, string names for the integer classes. The order in which the names are provided is kept.
names_file – str, path to a file with names for the integer classes, one per line.
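The three ways of defining the labels can be pictured as one resolution step that yields an ordered list of names. This is a sketch only: resolve_label_names is a hypothetical helper, and both the "exactly one argument" check and the naming of num_classes labels by their string form are assumptions, not documented behaviour.

```python
from pathlib import Path
from typing import List, Optional


def resolve_label_names(num_classes: Optional[int] = None,
                        names: Optional[List[str]] = None,
                        names_file: Optional[str] = None) -> List[str]:
    """Return the ordered label names implied by one argument (hypothetical helper)."""
    given = [arg is not None for arg in (num_classes, names, names_file)]
    if sum(given) != 1:
        # Assumption: exactly one of the three arguments is expected.
        raise ValueError("pass exactly one of num_classes, names, names_file")
    if num_classes is not None:
        # Labels 0 .. num_classes-1, named here by their string form (assumption).
        return [str(i) for i in range(num_classes)]
    if names is not None:
        return list(names)  # the order in which the names are provided is kept
    # One name per line; blank lines are skipped (assumption).
    lines = Path(names_file).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


names = resolve_label_names(names=["neg", "pos"])
str2int = {name: i for i, name in enumerate(names)}
print(str2int)  # {'neg': 0, 'pos': 1}
```

Because the order of names is kept, the position of each name is its integer class id.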
-
class
datasets.
Value
(dtype: str, id: Optional[str] = None)[source]¶ Encapsulate an Arrow datatype for easy serialization.
-
class
datasets.
Translation
(languages: List[str], id: Optional[str] = None)[source]¶ FeatureConnector for translations with fixed languages per example. Here for compatibility with tfds.
- Input: The Translation feature accepts a dictionary for each example mapping string language codes to string translations.
- Output: A dictionary mapping string language codes to translations as Text features.
Example:
# At construction time:
datasets.features.Translation(languages=['en', 'fr', 'de'])
# During data generation:
yield {
    'en': 'the cat',
    'fr': 'le chat',
    'de': 'die katze'
}
-
class
datasets.
TranslationVariableLanguages
(languages: List = None, num_languages: int = None, id: Optional[str] = None)[source]¶ FeatureConnector for translations with variable languages per example. Here for compatibility with tfds.
- Input: The TranslationVariableLanguages feature accepts a dictionary for each example mapping string language codes to one or more string translations. The languages present may vary from example to example.
- Output:
- language: variable-length 1D tf.Tensor of tf.string language codes, sorted in ascending order.
- translation: variable-length 1D tf.Tensor of tf.string plain text translations, sorted to align with language codes.
Example:
# At construction time:
datasets.features.TranslationVariableLanguages(languages=['en', 'fr', 'de'])
# During data generation:
yield {
    'en': 'the cat',
    'fr': ['le chat', 'la chatte'],
    'de': 'die katze'
}
# Tensor returned:
{
    'language': ['de', 'en', 'fr', 'fr'],
    'translation': ['die katze', 'the cat', 'la chatte', 'le chat'],
}
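The input-to-output transformation described for this feature can be sketched in plain Python. encode_translations is a hypothetical helper, and it assumes that "sorted in ascending order" means (language, translation) pairs are sorted lexicographically, first by language code and then by text.

```python
from typing import Dict, List, Union


def encode_translations(example: Dict[str, Union[str, List[str]]]) -> Dict[str, List[str]]:
    """Flatten a language -> translation(s) mapping into two aligned lists (hypothetical helper)."""
    pairs = []
    for lang, text in example.items():
        # A single string is treated as a one-element list of translations.
        texts = [text] if isinstance(text, str) else text
        pairs.extend((lang, t) for t in texts)
    # Assumption: pairs are sorted by language code, then by translation text.
    pairs.sort()
    return {
        "language": [lang for lang, _ in pairs],
        "translation": [t for _, t in pairs],
    }


encoded = encode_translations(
    {"en": "the cat", "fr": ["le chat", "la chatte"], "de": "die katze"}
)
print(encoded["language"])     # ['de', 'en', 'fr', 'fr']
print(encoded["translation"])  # ['die katze', 'the cat', 'la chatte', 'le chat']
```

Keeping the two lists aligned by position is what allows a language to appear several times when it has several translations.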
MetricInfo
¶
-
class
datasets.
MetricInfo
(description: str, citation: str, features: datasets.features.Features, inputs_description: str = <factory>, homepage: str = <factory>, license: str = <factory>, codebase_urls: List[str] = <factory>, reference_urls: List[str] = <factory>, streamable: bool = False, format: Optional[str] = None, metric_name: Optional[str] = None, config_name: Optional[str] = None, experiment_id: Optional[str] = None)[source]¶ Information about a metric.
MetricInfo documents a metric, including its name, version, and features. See the constructor arguments and properties for a full list.
Note: Not all fields are known on construction and may be updated later.
Metric
¶
The base class Metric
implements a Metric backed by one or several datasets.Dataset
.
-
class
datasets.
Metric
(config_name: Optional[str] = None, keep_in_memory: bool = False, cache_dir: Optional[str] = None, num_process: int = 1, process_id: int = 0, seed: Optional[int] = None, experiment_id: Optional[str] = None, max_concurrent_cache_files: int = 10000, timeout: Union[int, float] = 100, **kwargs)[source]¶ A Metric is the base class and common API for all metrics.
- Parameters
config_name (str) – This is used to define a hash specific to a metric’s computation script and prevents the metric’s data from being overridden when the metric loading script is modified.
keep_in_memory (bool) – Keep all predictions and references in memory. Not possible in distributed settings.
cache_dir (str) – Path to a directory in which temporary prediction/references data will be stored. The data directory should be located on a shared file-system in distributed setups.
num_process (int) – Specify the total number of nodes in a distributed setting. This is useful to compute metrics in distributed setups (in particular non-additive metrics like F1).
process_id (int) – Specify the id of the current process in a distributed setup (between 0 and num_process-1). This is useful to compute metrics in distributed setups (in particular non-additive metrics like F1).
seed (Optional int) – If specified, this will temporarily set numpy’s random seed when datasets.Metric.compute() is run.
experiment_id (str) – A specific experiment id. This is used if several distributed evaluations share the same file system. This is useful to compute metrics in distributed setups (in particular non-additive metrics like F1).
max_concurrent_cache_files (int) – Max number of concurrent metrics cache files (default 10000).
timeout (Union[int, float]) – Timeout in seconds for distributed setting synchronization.
-
add
(*, prediction=None, reference=None)[source]¶ Add one prediction and reference for the metric’s stack.
-
add_batch
(*, predictions=None, references=None)[source]¶ Add a batch of predictions and references for the metric’s stack.
-
compute
(*args, **kwargs) → Optional[dict][source]¶ Compute the metrics.
The usage of positional arguments is disallowed to prevent mistakes.
- Parameters
predictions (Optional list/array/tensor) – predictions
references (Optional list/array/tensor) – references
**kwargs (Optional other kwargs) – will be forwarded to the metric’s _compute() method (see details in the docstring)
- Returns
Dictionary with the metrics if this metric is run on the main process (process_id == 0). None if the metric is not run on the main process (process_id != 0).
-
download_and_prepare
(download_config: Optional[datasets.utils.file_utils.DownloadConfig] = None, dl_manager: Optional[datasets.utils.download_manager.DownloadManager] = None, **download_and_prepare_kwargs)[source]¶ Downloads and prepares dataset for reading.
- Parameters
download_config (Optional datasets.DownloadConfig) – Specific download configuration parameters.
dl_manager (Optional datasets.DownloadManager) – Specific download manager to use.