Main classes

DatasetInfo

class nlp.DatasetInfo(description: str = <factory>, citation: str = <factory>, homepage: str = <factory>, license: str = <factory>, features: Optional[nlp.features.Features] = None, post_processed: Optional[nlp.info.PostProcessedInfo] = None, supervised_keys: Optional[nlp.info.SupervisedKeysData] = None, builder_name: Optional[str] = None, config_name: Optional[str] = None, version: Optional[Union[str, nlp.utils.version.Version]] = None, splits: Optional[dict] = None, download_checksums: Optional[dict] = None, download_size: Optional[int] = None, post_processing_size: Optional[int] = None, dataset_size: Optional[int] = None, size_in_bytes: Optional[int] = None)[source]

Information about a dataset.

DatasetInfo documents a dataset, including its name, version, and features. See the constructor arguments and properties for a full list.

Note: Not all fields are known on construction and may be updated later.

classmethod from_directory(dataset_info_dir: str) → nlp.info.DatasetInfo[source]

Create DatasetInfo from the JSON file in dataset_info_dir.

This function updates all the dynamically generated fields (num_examples, hash, time of creation,…) of the DatasetInfo.

This will overwrite all previous metadata.

Parameters

dataset_info_dir (str) – The directory containing the metadata file. This should be the root directory of a specific dataset version.

write_to_directory(dataset_info_dir)[source]

Write DatasetInfo as JSON to dataset_info_dir. Also save the license separately in a LICENSE file.
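
Example (a minimal sketch; the dataset name and output path are placeholders):

ds = nlp.load_dataset('crime_and_punish', split='train')
print(ds.info.description)
ds.info.write_to_directory('/path/to/dataset_info')  # writes the JSON metadata and the license
info = nlp.DatasetInfo.from_directory('/path/to/dataset_info')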

Dataset

The base class nlp.Dataset implements a Dataset backed by an Apache Arrow table.

class nlp.Dataset(arrow_table: pyarrow.lib.Table, data_files: Optional[List[dict]] = None, info: Optional[nlp.info.DatasetInfo] = None, split: Optional[nlp.splits.NamedSplit] = None)[source]

A Dataset backed by an Arrow table or Record Batch.

__getitem__(key: Union[int, slice, str]) → Union[Dict, List][source]

Can be used to index columns (by string name) or rows (by integer index).
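
Example (a sketch assuming a dataset ds with a 'line' column):

ds[0]         # first row, as a dict of column -> value
ds['line']    # full 'line' column, as a list
ds[:3]        # first three rows, as a dict of column -> list of values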

__iter__()[source]

Iterate through the examples. If a format is set with nlp.Dataset.set_format(), rows will be returned in the selected format.

__len__()[source]

Number of rows in the dataset

add_elasticsearch_index(column: str, index_name: Optional[str] = None, host: Optional[str] = None, port: Optional[int] = None, es_client: Optional[elasticsearch.Elasticsearch] = None, es_index_name: Optional[str] = None, es_index_config: Optional[dict] = None)[source]

Add a text index using ElasticSearch for fast retrieval. This is done in-place.

Parameters
  • column (str) – The column of the documents to add to the index.

  • index_name (Optional str) – The index_name/identifier of the index. This is the index name that is used to call nlp.Dataset.get_nearest_examples() or nlp.Dataset.search(). By default it corresponds to column.

  • host (Optional str) – The host of the elasticsearch server.

  • port (Optional int) – The port of the elasticsearch server.

  • es_client (elasticsearch.Elasticsearch) – The elasticsearch client used to create the index.

  • es_index_name (Optional str) – The elasticsearch index name used to create the index.

  • es_index_config (Optional dict) – The configuration of the elasticsearch index. Default config is:

Config:

{
    "settings": {
        "number_of_shards": 1,
        "analysis": {"analyzer": {"stop_standard": {"type": "standard", " stopwords": "_english_"}}},
    },
    "mappings": {
        "properties": {
            "text": {
                "type": "text",
                "analyzer": "standard",
                "similarity": "BM25"
            },
        }
    },
}

Example:

es_client = elasticsearch.Elasticsearch()
ds = nlp.load_dataset('crime_and_punish', split='train')
ds.add_elasticsearch_index(column='line', es_client=es_client, es_index_name="my_es_index")
scores, retrieved_examples = ds.get_nearest_examples('line', 'my new query', k=10)

add_faiss_index(column: str, index_name: Optional[str] = None, device: Optional[int] = None, string_factory: Optional[str] = None, metric_type: Optional[int] = None, custom_index: Optional[faiss.Index] = None, train_size: Optional[int] = None, faiss_verbose: bool = False, dtype=<class 'numpy.float32'>)[source]

Add a dense index using Faiss for fast retrieval. By default the index is built over the vectors of the specified column. You can specify device if you want to run it on GPU (device must be the GPU index). You can find more information about Faiss in the Faiss documentation.

Parameters
  • column (str) – The column of the vectors to add to the index.

  • index_name (Optional str) – The index_name/identifier of the index. This is the index_name that is used to call nlp.Dataset.get_nearest_examples() or nlp.Dataset.search(). By default it corresponds to column.

  • device (Optional int) – If not None, this is the index of the GPU to use. By default it uses the CPU.

  • string_factory (Optional str) – This is passed to the index factory of Faiss to create the index. Default index class is IndexFlat.

  • metric_type (Optional int) – Type of metric. Ex: faiss.METRIC_INNER_PRODUCT or faiss.METRIC_L2.

  • custom_index (Optional faiss.Index) – Custom Faiss index that you already have instantiated and configured for your needs.

  • train_size (Optional int) – If the index needs a training step, specifies how many vectors will be used to train the index.

  • faiss_verbose (bool, defaults to False) – Enable the verbosity of the Faiss index.

  • dtype (data-type) – The dtype of the numpy arrays that are indexed. Default is np.float32.

Example:

ds = nlp.load_dataset('crime_and_punish', split='train')
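# `embed` below stands for a user-defined function mapping text to a 1D numpy array (it is not part of nlp)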
ds_with_embeddings = ds.map(lambda example: {'embeddings': embed(example['line'])})
ds_with_embeddings.add_faiss_index(column='embeddings')
# query
scores, retrieved_examples = ds_with_embeddings.get_nearest_examples('embeddings', embed('my new query'), k=10)
# save index
ds_with_embeddings.save_faiss_index('embeddings', 'my_index.faiss')

ds = nlp.load_dataset('crime_and_punish', split='train')
# load index
ds.load_faiss_index('embeddings', 'my_index.faiss')
# query
scores, retrieved_examples = ds.get_nearest_examples('embeddings', embed('my new query'), k=10)

add_faiss_index_from_external_arrays(external_arrays: numpy.array, index_name: str, device: Optional[int] = None, string_factory: Optional[str] = None, metric_type: Optional[int] = None, custom_index: Optional[faiss.Index] = None, train_size: Optional[int] = None, faiss_verbose: bool = False, dtype=<class 'numpy.float32'>)[source]

Add a dense index using Faiss for fast retrieval. The index is created using the vectors of external_arrays. You can specify device if you want to run it on GPU (device must be the GPU index). You can find more information about Faiss, in particular about index string factories, in the Faiss documentation.

Parameters
  • external_arrays (np.array) – If you want to use arrays from outside the lib for the index, you can set external_arrays. It will use external_arrays to create the Faiss index instead of the arrays in the given column.

  • index_name (str) – The index_name/identifier of the index. This is the index_name that is used to call nlp.Dataset.get_nearest_examples() or nlp.Dataset.search().

  • device (Optional int) – If not None, this is the index of the GPU to use. By default it uses the CPU.

  • string_factory (Optional str) – This is passed to the index factory of Faiss to create the index. Default index class is IndexFlat.

  • metric_type (Optional int) – Type of metric. Ex: faiss.METRIC_INNER_PRODUCT or faiss.METRIC_L2.

  • custom_index (Optional faiss.Index) – Custom Faiss index that you already have instantiated and configured for your needs.

  • train_size (Optional int) – If the index needs a training step, specifies how many vectors will be used to train the index.

  • faiss_verbose (bool, defaults to False) – Enable the verbosity of the Faiss index.

  • dtype (numpy.dtype) – The dtype of the numpy arrays that are indexed. Default is np.float32.

cast_(features: nlp.features.Features)[source]

Cast the dataset to a new set of features.

You can also change the feature types using Dataset.map() with the features argument, but cast_() is in-place (doesn’t copy the data to a new dataset) and is thus faster.

Parameters

features (nlp.Features) – New features to cast the dataset to. The name and order of the fields in the features must match the current column names. The type of the data must also be convertible from one type to the other. For non-trivial conversion, e.g. string <-> ClassLabel you should use map() to update the Dataset.

cleanup_cache_files()[source]

Clean up all cache files in the dataset cache directory, except the currently used cache file if there is one. Be careful when running this command that no other process is currently using other cache files.

Returns

Number of removed files

dictionary_encode_column(column: str)[source]

Dictionary encode a column.

Dictionary encoding can reduce the size of a column with many repeated values (e.g. string label columns) by storing a dictionary of the strings. This only affects the internal storage.

Parameters

column (str) – The column to dictionary encode.

drop(columns: Union[str, List[str]])[source]

Drop one or more columns.

Parameters

columns (str or List[str]) – Column or list of columns to remove from the dataset.

drop_index(index_name: str)

Drop the index with the specified index_name.

Parameters

index_name (str) – The index_name/identifier of the index.

export(filename: str, format: str = 'tfrecord')[source]

Writes the Arrow dataset to a TFRecord file.

The dataset must already be in tensorflow format. The records will be written with keys from dataset._format_columns.

Parameters
  • filename (str) – The filename, including the .tfrecord extension, to write to.

  • format (Optional[str], default “tfrecord”) – The type of output file. Currently this is a no-op, as TFRecords are the only option. This enables a more flexible function signature later.

filter(function, with_indices=False, batch_size: Optional[int] = 1000, remove_columns: Optional[List[str]] = None, keep_in_memory: bool = False, load_from_cache_file: bool = True, cache_file_name: Optional[str] = None, writer_batch_size: Optional[int] = 1000, verbose: bool = True) → nlp.arrow_dataset.Dataset[source]

Apply a filter function to all the elements in the table in batches and update the table so that the dataset only includes examples according to the filter function.

Parameters
  • function (callable) – with one of the following signatures: function(example: Dict) -> bool if with_indices=False; function(example: Dict, indices: int) -> bool if with_indices=True.

  • with_indices (bool, default False) – Provide example indices to function. Note that in this case the signature of function should be def function(example, idx): ….

  • batch_size (Optional[int], default 1000) – Number of examples per batch provided to function. batch_size <= 0 or batch_size == None: provide the full dataset as a single batch to function.

  • remove_columns (Optional[List[str]], default None) – Remove a selection of columns while doing the mapping. Columns will be removed before updating the examples with the output of function, i.e. if function is adding columns with names in remove_columns, these columns will be kept.

  • keep_in_memory (bool, default False) – Keep the dataset in memory instead of writing it to a cache file.

  • load_from_cache_file (bool, default True) – If a cache file storing the current computation from function can be identified, use it instead of recomputing.

  • cache_file_name (Optional[str], default None) – Provide the name of a cache file to use to store the results of the computation instead of the automatically generated cache file name.

  • writer_batch_size (int, default 1000) – Number of rows per write operation for the cache file writer. A higher value gives smaller cache files, a lower value consumes less temporary memory while running .map().

  • verbose (bool, default True) – Set to False to deactivate the tqdm progress bar and informational messages.
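
Example (a minimal sketch; the dataset and column names are placeholders):

ds = nlp.load_dataset('glue', 'mrpc', split='train')
positives = ds.filter(lambda example: example['label'] == 1)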

flatten(max_depth=16)[source]

Flatten the Table. Each column with a struct type is flattened into one column per struct field. Other columns are left unchanged.

formated_as(type: Optional[str] = None, columns: Optional[List] = None, output_all_columns: bool = False, **format_kwargs)[source]

To be used in a with statement. Set __getitem__ return format (type and columns)

Parameters
  • type (Optional str) – output type selected in [None, ‘numpy’, ‘torch’, ‘tensorflow’, ‘pandas’]. None means __getitem__ returns python objects (default).

  • columns (Optional List[str]) – columns to format in the output. None means __getitem__ returns all columns (default).

  • output_all_columns (bool, default False) – keep un-formatted columns as well in the output (as python objects).

  • format_kwargs – keyword arguments passed to the convert function like np.array, torch.tensor or tensorflow.ragged.constant.
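
Example (a sketch assuming PyTorch is installed; the column name is a placeholder):

with ds.formated_as(type='torch', columns=['label']):
    batch = ds[:8]  # 'label' is returned as a torch tensor inside the block
# outside the block the previous format is restored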

classmethod from_buffer(buffer: pyarrow.lib.Buffer, info: Optional[nlp.info.DatasetInfo] = None, split: Optional[nlp.splits.NamedSplit] = None) → nlp.arrow_dataset.Dataset[source]

Instantiate a Dataset backed by an Arrow buffer

classmethod from_file(filename: str, info: Optional[nlp.info.DatasetInfo] = None, split: Optional[nlp.splits.NamedSplit] = None) → nlp.arrow_dataset.Dataset[source]

Instantiate a Dataset backed by an Arrow table at filename

get_index(index_name: str) → nlp.search.BaseIndex

Return the index with the specified index_name.

get_nearest_examples(index_name: str, query: Union[str, numpy.array], k: int = 10) → nlp.search.NearestExamplesResults

Find the nearest examples in the dataset to the query.

Parameters
  • index_name (str) – The index_name/identifier of the index.

  • query (Union[str, np.ndarray]) – The query as a string if index_name is a text index or as a numpy array if index_name is a vector index.

  • k (int) – The number of examples to retrieve.

Output:

scores (List[float]): The retrieval scores of the retrieved examples.
examples (dict): The retrieved examples.

get_nearest_examples_batch(index_name: str, queries: Union[List[str], numpy.array], k: int = 10) → nlp.search.BatchedNearestExamplesResults

Find the nearest examples in the dataset to the batch of queries.

Parameters
  • index_name (str) – The index_name/identifier of the index.

  • queries (Union[List[str], np.ndarray]) – The queries as a list of strings if index_name is a text index or as a numpy array if index_name is a vector index.

  • k (int) – The number of examples to retrieve per query.

Output:

total_scores (List[List[float]]): The retrieval scores of the retrieved examples per query.
total_examples (List[dict]): The retrieved examples per query.

property info

nlp.DatasetInfo object containing all the metadata in the dataset.

list_indexes() → List[str]

List the index_name/identifiers of all the attached indexes.

load_faiss_index(index_name: str, file: str, device: Optional[int] = None)

Load a FaissIndex from disk. If you want to do additional configuration, you can access the faiss index object with .get_index(index_name).faiss_index to make it fit your needs.

Parameters
  • index_name (str) – The index_name/identifier of the index. This is the index_name that is used to call .get_nearest or .search.

  • file (str) – The path to the serialized faiss index on disk.

  • device (Optional int) – If not None, this is the index of the GPU to use. By default it uses the CPU.

map(function, with_indices: bool = False, batched: bool = False, batch_size: Optional[int] = 1000, remove_columns: Optional[List[str]] = None, keep_in_memory: bool = False, load_from_cache_file: bool = True, cache_file_name: Optional[str] = None, writer_batch_size: Optional[int] = 1000, features: Optional[nlp.features.Features] = None, disable_nullable: bool = True, verbose: bool = True) → nlp.arrow_dataset.Dataset[source]

Apply a function to all the elements in the table (individually or in batches) and update the table (if function does updated examples).

Parameters
  • function (callable) – with one of the following signatures: function(example: Dict) -> Union[Dict, Any] if batched=False and with_indices=False; function(example: Dict, indices: int) -> Union[Dict, Any] if batched=False and with_indices=True; function(batch: Dict[List]) -> Union[Dict, Any] if batched=True and with_indices=False; function(batch: Dict[List], indices: List[int]) -> Union[Dict, Any] if batched=True and with_indices=True.

  • with_indices (bool, default False) – Provide example indices to function. Note that in this case the signature of function should be def function(example, idx): ….

  • batched (bool, default False) – Provide a batch of examples to function.

  • batch_size (Optional[int], default 1000) – Number of examples per batch provided to function if batched=True. batch_size <= 0 or batch_size == None: provide the full dataset as a single batch to function.

  • remove_columns (Optional[List[str]], default None) – Remove a selection of columns while doing the mapping. Columns will be removed before updating the examples with the output of function, i.e. if function is adding columns with names in remove_columns, these columns will be kept.

  • keep_in_memory (bool, default False) – Keep the dataset in memory instead of writing it to a cache file.

  • load_from_cache_file (bool, default True) – If a cache file storing the current computation from function can be identified, use it instead of recomputing.

  • cache_file_name (Optional[str], default None) – Provide the name of a cache file to use to store the results of the computation instead of the automatically generated cache file name.

  • writer_batch_size (int, default 1000) – Number of rows per write operation for the cache file writer. A higher value gives smaller cache files, a lower value consumes less temporary memory while running .map().

  • features (Optional[nlp.Features], default None) – Use a specific Features to store the cache file instead of the automatically generated one.

  • disable_nullable (bool, default True) – Allow null values in the table.

  • verbose (bool, default True) – Set to False to deactivate the tqdm progress bar and informational messages.
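
Example (a minimal sketch; the dataset and column names are placeholders):

ds = nlp.load_dataset('glue', 'mrpc', split='train')
ds = ds.map(lambda example: {'length': len(example['sentence1'])})                          # one example at a time
ds = ds.map(lambda batch: {'length': [len(s) for s in batch['sentence1']]}, batched=True)   # in batches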

remove_column_(column_name: str)[source]

Remove a column in the dataset and the features associated to the column.

You can also remove a column using Dataset.map() with remove_columns but the present method is in-place (doesn’t copy the data to a new dataset) and is thus faster.

Parameters

column_name (str) – Name of the column to remove.

rename_column_(original_column_name: str, new_column_name: str)[source]

Rename a column in the dataset and move the features associated to the original column under the new column name.

You can also rename a column using Dataset.map() with remove_columns but the present method:
  • takes care of moving the original features under the new column name.

  • doesn’t copy the data to a new dataset and is thus much faster.

Parameters
  • original_column_name (str) – Name of the column to rename.

  • new_column_name (str) – New name for the column.
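
Example (a sketch; column names are placeholders):

ds.rename_column_('sentence1', 'text')  # in-place: the features of 'sentence1' move under 'text'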

reset_format()[source]

Reset __getitem__ return format to python objects and all columns.

Same as self.set_format()

save_faiss_index(index_name: str, file: str)

Save a FaissIndex on disk

Parameters
  • index_name (str) – The index_name/identifier of the index. This is the index_name that is used to call .get_nearest or .search.

  • file (str) – The path to the serialized faiss index on disk.

search(index_name: str, query: Union[str, numpy.array], k: int = 10) → nlp.search.SearchResults

Find the indices of the nearest examples in the dataset to the query.

Parameters
  • index_name (str) – The name/identifier of the index.

  • query (Union[str, np.ndarray]) – The query as a string if index_name is a text index or as a numpy array if index_name is a vector index.

  • k (int) – The number of examples to retrieve.

Output:

scores (List[float]): The retrieval scores of the retrieved examples.
indices (List[int]): The indices of the retrieved examples.

search_batch(index_name: str, queries: Union[List[str], numpy.array], k: int = 10) → nlp.search.BatchedSearchResults

Find the indices of the nearest examples in the dataset to the batch of queries.

Parameters
  • index_name (str) – The index_name/identifier of the index.

  • queries (Union[List[str], np.ndarray]) – The queries as a list of strings if index_name is a text index or as a numpy array if index_name is a vector index.

  • k (int) – The number of examples to retrieve per query.

Output:

total_scores (List[List[float]]): The retrieval scores of the retrieved examples per query.
total_indices (List[List[int]]): The indices of the retrieved examples per query.

select(indices: Union[List[int], numpy.ndarray], keep_in_memory: bool = False, load_from_cache_file: bool = True, cache_file_name: Optional[str] = None, writer_batch_size: Optional[int] = 1000, reader_batch_size: Optional[int] = 1000, verbose: bool = True)[source]

Create a new dataset with rows selected following the list/array of indices.

Parameters
  • indices (Union[List[int], np.ndarray]) – List or 1D-NumPy array of integer indices for indexing.

  • keep_in_memory (bool, default False) – Keep the dataset in memory instead of writing it to a cache file.

  • load_from_cache_file (bool, default True) – If a cache file storing the current computation from function can be identified, use it instead of recomputing.

  • cache_file_name (Optional[str], default None) – Provide the name of a cache file to use to store the results of the computation instead of the automatically generated cache file name.

  • writer_batch_size (int, default 1000) – Number of rows per write operation for the cache file writer. A higher value gives smaller cache files, a lower value consumes less temporary memory while running .map().

  • reader_batch_size (int, default 1000) – Number of rows per __getitem__ operation when reading from disk. Higher values may make reading faster but will also consume more temporary memory and make the progress bar less responsive.

  • verbose (bool, default True) – Set to False to deactivate the tqdm progress bar and informational messages.
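
Example (a minimal sketch):

first_ten = ds.select(list(range(10)))  # new dataset containing the first 10 rows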

set_format(type: Optional[str] = None, columns: Optional[List] = None, output_all_columns: bool = False, **format_kwargs)[source]

Set __getitem__ return format (type and columns)

Parameters
  • type (Optional str) – output type selected in [None, ‘numpy’, ‘torch’, ‘tensorflow’, ‘pandas’]. None means __getitem__ returns python objects (default).

  • columns (Optional List[str]) – columns to format in the output. None means __getitem__ returns all columns (default).

  • output_all_columns (bool, default False) – keep un-formatted columns as well in the output (as python objects).

  • format_kwargs – keyword arguments passed to the convert function like np.array, torch.tensor or tensorflow.ragged.constant.
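
Example (a sketch assuming NumPy; the column name is a placeholder):

ds.set_format(type='numpy', columns=['label'])
ds[0]['label']     # returned as a numpy object
ds.reset_format()  # back to python objects and all columns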

shard(num_shards: int, index: int, contiguous: bool = False, keep_in_memory: bool = False, load_from_cache_file: bool = True, cache_file_name: Optional[str] = None, writer_batch_size: Optional[int] = 1000, verbose: bool = True) → nlp.arrow_dataset.Dataset[source]

Return the index-nth shard from dataset split into num_shards pieces.

This shards deterministically. dset.shard(n, i) will contain all elements of dset whose index mod n = i.

dset.shard(n, i, contiguous=True) will instead split dset into contiguous chunks, so it can be easily concatenated back together after processing. If len(dset) % n == l, then the first l shards will have length (len(dset) // n) + 1, and the remaining shards will have length (len(dset) // n). nlp.concatenate_datasets([dset.shard(n, i, contiguous=True) for i in range(n)]) will return a dataset with the same order as the original.

Be sure to shard before using any randomizing operator (such as shuffle). It is best if the shard operator is used early in the dataset pipeline.

Parameters
  • num_shards (int) – How many shards to split the dataset into.

  • index (int) – Which shard to select and return.

  • contiguous (bool, default False) – Whether to select contiguous blocks of indices for shards.

  • keep_in_memory (bool, default False) – Keep the dataset in memory instead of writing it to a cache file.

  • load_from_cache_file (bool, default True) – If a cache file storing the current computation from function can be identified, use it instead of recomputing.

  • cache_file_name (Optional[str], default None) – Provide the name of a cache file to use to store the results of the computation instead of the automatically generated cache file name.

  • writer_batch_size (int, default 1000) – Number of rows per write operation for the cache file writer. A higher value gives smaller cache files, a lower value consumes less temporary memory while running .map().

  • verbose (bool, default True) – Set to False to deactivate the tqdm progress bar and informational messages.
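
Example (a minimal sketch, using the contiguous mode described above):

shards = [ds.shard(num_shards=4, index=i, contiguous=True) for i in range(4)]
restored = nlp.concatenate_datasets(shards)  # same order as the original dataset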

shuffle(seed: Optional[int] = None, generator: Optional[numpy.random._generator.Generator] = None, keep_in_memory: bool = False, load_from_cache_file: bool = True, cache_file_name: Optional[str] = None, writer_batch_size: Optional[int] = 1000, verbose: bool = True) → nlp.arrow_dataset.Dataset[source]

Create a new Dataset where the rows are shuffled.

Currently shuffling uses numpy random generators. You can either supply a NumPy BitGenerator to use, or a seed to initiate NumPy’s default random generator (PCG64).

Parameters
  • seed (Optional int) – A seed to initialize the default BitGenerator if generator=None. If None, then fresh, unpredictable entropy will be pulled from the OS. If an int or array_like[ints] is passed, then it will be passed to SeedSequence to derive the initial BitGenerator state.

  • generator (Optional np.random.Generator) – Numpy random Generator to use to compute the permutation of the dataset rows. If generator=None (default), uses np.random.default_rng (the default BitGenerator (PCG64) of NumPy).

  • keep_in_memory (bool, default False) – Keep the dataset in memory instead of writing it to a cache file.

  • load_from_cache_file (bool, default True) – If a cache file storing the current computation from function can be identified, use it instead of recomputing.

  • cache_file_name (Optional[str], default None) – Provide the name of a cache file to use to store the results of the computation instead of the automatically generated cache file name.

  • writer_batch_size (int, default 1000) – Number of rows per write operation for the cache file writer. A higher value gives smaller cache files, a lower value consumes less temporary memory while running .map().

  • verbose (bool, default True) – Set to False to deactivate the tqdm progress bar and informational messages.
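
Example (a minimal sketch):

shuffled = ds.shuffle(seed=42)  # reproducible shuffle via a seeded generator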

sort(column: str, reverse: bool = False, kind: str = None, keep_in_memory: bool = False, load_from_cache_file: bool = True, cache_file_name: Optional[str] = None, writer_batch_size: Optional[int] = 1000, verbose: bool = True) → nlp.arrow_dataset.Dataset[source]

Create a new dataset sorted according to a column.

Currently sorting according to a column name uses numpy sorting algorithm under the hood. The column should thus be a numpy compatible type (in particular not a nested type). This also means that the column used for sorting is fully loaded in memory (which should be fine in most cases).

Parameters
  • column (str) – column name to sort by.

  • reverse (bool, default False) – If True, sort in descending order rather than ascending.

  • kind (Optional str) – Numpy algorithm for sorting, selected in {‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’}. The default is ‘quicksort’. Note that both ‘stable’ and ‘mergesort’ use timsort under the covers and, in general, the actual implementation will vary with data type. The ‘mergesort’ option is retained for backwards compatibility.

  • keep_in_memory (bool, default False) – Keep the dataset in memory instead of writing it to a cache file.

  • load_from_cache_file (bool, default True) – If a cache file storing the current computation from function can be identified, use it instead of recomputing.

  • cache_file_name (Optional[str], default None) – Provide the name of a cache file to use to store the results of the computation instead of the automatically generated cache file name.

  • writer_batch_size (int, default 1000) – Number of rows per write operation for the cache file writer. A higher value gives smaller cache files, a lower value consumes less temporary memory while running .map().

  • verbose (bool, default True) – Set to False to deactivate the tqdm progress bar and informational messages.
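
Example (a sketch; the column name is a placeholder):

ds_asc = ds.sort('label')
ds_desc = ds.sort('label', reverse=True)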

property split

nlp.NamedSplit object corresponding to a named dataset split.

train_test_split(test_size: Optional[Union[float, int]] = None, train_size: Optional[Union[float, int]] = None, shuffle: bool = True, seed: Optional[int] = None, generator: Optional[numpy.random._generator.Generator] = None, keep_in_memory: bool = False, load_from_cache_file: bool = True, train_cache_file_name: Optional[str] = None, test_cache_file_name: Optional[str] = None, writer_batch_size: Optional[int] = 1000, verbose: bool = True) → DatasetDict[source]

Return a dictionary (nlp.DatasetDict) with two random train and test subsets (train and test Dataset splits). Splits are created from the dataset according to test_size, train_size and shuffle.

This method is similar to scikit-learn train_test_split with the omission of the stratified options.

Parameters
  • test_size (Optional Union[float, int]) – Size of the test split. If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.25.

  • train_size (Optional Union[float, int]) – Size of the train split. If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.

  • shuffle (Optional bool, default True) – Whether or not to shuffle the data before splitting.

  • seed (Optional int) – A seed to initialize the default BitGenerator if generator=None. If None, then fresh, unpredictable entropy will be pulled from the OS. If an int or array_like[ints] is passed, then it will be passed to SeedSequence to derive the initial BitGenerator state.

  • generator (Optional np.random.Generator) – Numpy random Generator to use to compute the permutation of the dataset rows. If generator=None (default), uses np.random.default_rng (the default BitGenerator (PCG64) of NumPy).

  • keep_in_memory (bool, default False) – Keep the dataset in memory instead of writing it to a cache file.

  • load_from_cache_file (bool, default True) – If a cache file storing the current computation from function can be identified, use it instead of recomputing.

  • train_cache_file_name (Optional[str], default None) – Provide the name of a cache file to use to store the train split cache file instead of the automatically generated cache file name.

  • test_cache_file_name (Optional[str], default None) – Provide the name of a cache file to use to store the test split cache file instead of the automatically generated cache file name.

  • writer_batch_size (int, default 1000) – Number of rows per write operation for the cache file writer. A higher value gives smaller cache files, a lower value consumes less temporary memory while running .map().

  • verbose (bool, default True) – Set to False to deactivate the tqdm progress bar and informational messages.
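
Example (a minimal sketch):

splits = ds.train_test_split(test_size=0.1, seed=42)
train_ds, test_ds = splits['train'], splits['test']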

unique(column: str) → List[source]

Return a list of the unique elements in a column.

This is implemented in the low-level backend and as such, very fast.

Parameters

column (str) – column name (list all the column names with nlp.Dataset.column_names)

Returns: list of unique elements in the given column.
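
Example (a sketch; the column name is a placeholder):

labels = ds.unique('label')  # e.g. [0, 1]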

DatasetDict

Dictionary with split names as keys (‘train’, ‘test’ for example), and nlp.Dataset objects as values. It also has dataset transform methods like map or filter, to process all the splits at once.

class nlp.DatasetDict[source]

A dictionary (dict of str: nlp.Dataset) with dataset transforms methods (map, filter, etc.)

cast_(features: nlp.features.Features)[source]

Cast the dataset to a new set of features. The transformation is applied to all the datasets of the dataset dictionary.

You can also change the feature types using Dataset.map() with the features argument, but cast_() is in-place (doesn’t copy the data to a new dataset) and is thus faster.

Parameters

features (nlp.Features) – New features to cast the dataset to. The name and order of the fields in the features must match the current column names. The type of the data must also be convertible from one type to the other. For non-trivial conversion, e.g. string <-> ClassLabel you should use map() to update the Dataset.

filter(function, with_indices=False, batch_size: Optional[int] = 1000, remove_columns: Optional[List[str]] = None, keep_in_memory: bool = False, load_from_cache_file: bool = True, cache_file_names: Optional[Dict[str, str]] = None, writer_batch_size: Optional[int] = 1000, verbose: bool = True) → nlp.dataset_dict.DatasetDict[source]

Apply a filter function to all the elements in the table in batches and update the table so that the dataset only includes examples according to the filter function. The transformation is applied to all the datasets of the dataset dictionary.

Parameters
  • function (callable) – with one of the following signatures: function(example: Dict) -> bool if with_indices=False; function(example: Dict, indices: int) -> bool if with_indices=True.

  • with_indices (bool, default False) – Provide example indices to function. Note that in this case the signature of function should be def function(example, idx): ….

  • batch_size (Optional[int], default 1000) – Number of examples per batch provided to function. batch_size <= 0 or batch_size == None: provide the full dataset as a single batch to function.

  • remove_columns (Optional[List[str]], default None) – Remove a selection of columns while doing the mapping. Columns will be removed before updating the examples with the output of function, i.e. if function is adding columns with names in remove_columns, these columns will be kept.

  • keep_in_memory (bool, default False) – Keep the dataset in memory instead of writing it to a cache file.

  • load_from_cache_file (bool, default True) – If a cache file storing the current computation from function can be identified, use it instead of recomputing.

  • cache_file_names (Optional[Dict[str, str]], default None) – Provide the name of a cache file to use to store the results of the computation instead of the automatically generated cache file name. You have to provide one cache_file_name per dataset in the dataset dictionary.

  • writer_batch_size (int, default 1000) – Number of rows per write operation for the cache file writer. A higher value gives smaller cache files, a lower value consumes less temporary memory while running .map().

  • verbose (bool, default True) – Set to False to deactivate the tqdm progress bar and informational messages.

formated_as(type: Optional[str] = None, columns: Optional[List] = None, output_all_columns: bool = False, **format_kwargs)[source]

To be used in a with statement. Set __getitem__ return format (type and columns). The transformation is applied to all the datasets of the dataset dictionary.

Parameters
  • type (Optional str) – output type selected in [None, ‘numpy’, ‘torch’, ‘tensorflow’, ‘pandas’]. None means __getitem__ returns python objects (default).

  • columns (Optional List[str]) – columns to format in the output. None means __getitem__ returns all columns (default).

  • output_all_columns (bool, default False) – keep un-formatted columns as well in the output (as python objects).

  • format_kwargs – keyword arguments passed to the convert function like np.array, torch.tensor or tensorflow.ragged.constant.

map(function, with_indices: bool = False, batched: bool = False, batch_size: Optional[int] = 1000, remove_columns: Optional[List[str]] = None, keep_in_memory: bool = False, load_from_cache_file: bool = True, cache_file_names: Optional[Dict[str, str]] = None, writer_batch_size: Optional[int] = 1000, features: Optional[nlp.features.Features] = None, disable_nullable: bool = True, verbose: bool = True) → nlp.dataset_dict.DatasetDict[source]

Apply a function to all the elements in the table (individually or in batches) and update the table (if function does updated examples). The transformation is applied to all the datasets of the dataset dictionary.

Parameters
  • function (callable) – with one of the following signatures: function(example: Dict) -> Union[Dict, Any] if batched=False and with_indices=False; function(example: Dict, indices: int) -> Union[Dict, Any] if batched=False and with_indices=True; function(batch: Dict[List]) -> Union[Dict, Any] if batched=True and with_indices=False; function(batch: Dict[List], indices: List[int]) -> Union[Dict, Any] if batched=True and with_indices=True.

  • with_indices (bool, default False) – Provide example indices to function. Note that in this case the signature of function should be def function(example, idx): ….

  • batched (bool, default False) – Provide a batch of examples to function.

  • batch_size (Optional[int], default 1000) – Number of examples per batch provided to function if batched=True. batch_size <= 0 or batch_size == None: provide the full dataset as a single batch to function.

  • remove_columns (Optional[List[str]], default None) – Remove a selection of columns while doing the mapping. Columns will be removed before updating the examples with the output of function, i.e. if function is adding columns with names in remove_columns, these columns will be kept.

  • keep_in_memory (bool, default False) – Keep the dataset in memory instead of writing it to a cache file.

  • load_from_cache_file (bool, default True) – If a cache file storing the current computation from function can be identified, use it instead of recomputing.

  • cache_file_names (Optional[Dict[str, str]], default None) – Provide the name of a cache file to use to store the results of the computation instead of the automatically generated cache file name. You have to provide one cache_file_name per dataset in the dataset dictionary.

  • writer_batch_size (int, default 1000) – Number of rows per write operation for the cache file writer. A higher value gives smaller cache files, a lower value consumes less temporary memory while running .map().

  • features (Optional[nlp.Features], default None) – Use a specific Features to store the cache file instead of the automatically generated one.

  • disable_nullable (bool, default True) – Allow null values in the table.

  • verbose (bool, default True) – Set to False to deactivate the tqdm progress bar and informational messages.
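
Example (a minimal sketch; the dataset and column names are placeholders):

dsdict = nlp.load_dataset('glue', 'mrpc')  # DatasetDict with 'train', 'validation' and 'test'
dsdict = dsdict.map(lambda example: {'length': len(example['sentence1'])})  # applied to every split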

remove_column_(column_name: str)[source]

Remove a column in the dataset and the features associated to the column. The transformation is applied to all the datasets of the dataset dictionary.

You can also remove a column using Dataset.map() with remove_columns but the present method is in-place (doesn’t copy the data to a new dataset) and is thus faster.

Parameters

column_name (str) – Name of the column to remove.

rename_column_(original_column_name: str, new_column_name: str)[source]

Rename a column in the dataset and move the features associated to the original column under the new column name. The transformation is applied to all the datasets of the dataset dictionary.

You can also rename a column using Dataset.map() with remove_columns but the present method:
  • takes care of moving the original features under the new column name.

  • doesn’t copy the data to a new dataset and is thus much faster.

Parameters
  • original_column_name (str) – Name of the column to rename.

  • new_column_name (str) – New name for the column.

reset_format()[source]

Reset __getitem__ return format to python objects and all columns. The transformation is applied to all the datasets of the dataset dictionary.

Same as self.set_format()

set_format(type: Optional[str] = None, columns: Optional[List] = None, output_all_columns: bool = False, **format_kwargs)[source]

Set __getitem__ return format (type and columns). The transformation is applied to all the datasets of the dataset dictionary.

Parameters
  • type (Optional str) – output type selected in [None, ‘numpy’, ‘torch’, ‘tensorflow’, ‘pandas’]. None means __getitem__ returns python objects (default).

  • columns (Optional List[str]) – columns to format in the output. None means __getitem__ returns all columns (default).

  • output_all_columns (bool, default False) – keep un-formatted columns as well in the output (as python objects).

  • format_kwargs – keyword arguments passed to the convert function like np.array, torch.tensor or tensorflow.ragged.constant.

shuffle(seeds: Optional[Dict[str, int]] = None, generators: Optional[Dict[str, numpy.random._generator.Generator]] = None, keep_in_memory: bool = False, load_from_cache_file: bool = True, cache_file_names: Optional[Dict[str, str]] = None, writer_batch_size: Optional[int] = 1000, verbose: bool = True)[source]

Create a new Dataset where the rows are shuffled. The transformation is applied to all the datasets of the dataset dictionary.

Currently shuffling uses numpy random generators. You can either supply a NumPy BitGenerator to use, or a seed to initiate NumPy’s default random generator (PCG64).

Parameters
  • seeds (Optional Dict[str, int]) – A seed to initialize the default BitGenerator if generator=None. If None, then fresh, unpredictable entropy will be pulled from the OS. If an int or array_like[ints] is passed, then it will be passed to SeedSequence to derive the initial BitGenerator state. You have to provide one seed per dataset in the dataset dictionary.

  • generators (Optional Dict[str, np.random.Generator]) – Numpy random Generator to use to compute the permutation of the dataset rows. If generator=None (default), uses np.random.default_rng (the default BitGenerator (PCG64) of NumPy). You have to provide one generator per dataset in the dataset dictionary.

  • keep_in_memory (bool, default False) – Keep the dataset in memory instead of writing it to a cache file.

  • load_from_cache_file (bool, default True) – If a cache file storing the current computation from function can be identified, use it instead of recomputing.

  • cache_file_names (Optional[Dict[str, str]], default None) – Provide the name of a cache file to use to store the results of the computation instead of the automatically generated cache file name. You have to provide one cache_file_name per dataset in the dataset dictionary.

  • writer_batch_size (int, default 1000) – Number of rows per write operation for the cache file writer. A higher value gives smaller cache files, a lower value consumes less temporary memory while running .map().

  • verbose (bool, default True) – Set to False to deactivate the tqdm progress bar and informational messages.

sort(column: str, reverse: bool = False, kind: str = None, keep_in_memory: bool = False, load_from_cache_file: bool = True, cache_file_names: Optional[Dict[str, str]] = None, writer_batch_size: Optional[int] = 1000, verbose: bool = True) → nlp.dataset_dict.DatasetDict[source]

Create a new dataset sorted according to a column. The transformation is applied to all the datasets of the dataset dictionary.

Currently sorting according to a column name uses numpy sorting algorithm under the hood. The column should thus be a numpy compatible type (in particular not a nested type). This also means that the column used for sorting is fully loaded in memory (which should be fine in most cases).

Parameters
  • column (str) – column name to sort by.

  • reverse (bool, default False) – If True, sort in descending order rather than ascending.

  • kind (Optional str) – Numpy algorithm for sorting, selected in {‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’}. The default is ‘quicksort’. Note that both ‘stable’ and ‘mergesort’ use timsort under the covers and, in general, the actual implementation will vary with data type. The ‘mergesort’ option is retained for backwards compatibility.

  • keep_in_memory (bool, default False) – Keep the dataset in memory instead of writing it to a cache file.

  • load_from_cache_file (bool, default True) – If a cache file storing the current computation from function can be identified, use it instead of recomputing.

  • cache_file_names (Optional[Dict[str, str]], default None) – Provide the name of a cache file to use to store the results of the computation instead of the automatically generated cache file name. You have to provide one cache_file_name per dataset in the dataset dictionary.

  • writer_batch_size (int, default 1000) – Number of rows per write operation for the cache file writer. A higher value gives smaller cache files, a lower value consumes less temporary memory while running .map().

  • verbose (bool, default True) – Set to False to deactivate the tqdm progress bar and informational messages.

Features

class nlp.Features[source]
class nlp.Sequence(feature: Any, length: int = - 1, id: Optional[str] = None)[source]

Construct a list of features from a single type or a dict of types. Mostly here for compatibility with tfds.

class nlp.ClassLabel(num_classes: int = None, names: List[str] = None, names_file: str = None, id: Optional[str] = None)[source]

Handle integer class labels. Here for compatibility with tfds.

There are 3 ways to define a ClassLabel, which correspond to the 3 arguments: num_classes (creates labels 0 to num_classes - 1), names (a list of label strings), and names_file (a file containing the list of labels).

Note: On python2, the strings are encoded as utf-8.

Parameters
  • num_classes – int, number of classes. All labels must be < num_classes.

  • names – list<str>, string names for the integer classes. The order in which the names are provided is kept.

  • names_file – str, path to a file with names for the integer classes, one per line.

int2str(values: Union[int, collections.abc.Iterable])[source]

Conversion integer => class name string.

str2int(values: collections.abc.Iterable)[source]

Conversion class name string => integer.
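
Example (a minimal sketch):

labels = nlp.ClassLabel(names=['negative', 'positive'])
labels.str2int('positive')  # 1
labels.int2str(0)           # 'negative'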

class nlp.Value(dtype: str, id: Optional[str] = None)[source]

Encapsulate an Arrow datatype for easy serialization.

class nlp.Tensor(shape: Union[Tuple[int], List[int]], dtype: str, id: Optional[str] = None)[source]

Construct a 0D or 1D Tensor feature. If 0D, the Tensor is a dtype element; if 1D, it will be a fixed-length list of dtype elements. Mostly here for compatibility with tfds.

class nlp.Translation(languages: List[str], id: Optional[str] = None)[source]

FeatureConnector for translations with fixed languages per example. Here for compatibility with tfds.

Input: The Translation feature accepts a dictionary for each example, mapping string language codes to string translations.

Output: A dictionary mapping string language codes to translations as Text features.

Example:

# At construction time:

nlp.features.Translation(languages=['en', 'fr', 'de'])

# During data generation:

yield {
        'en': 'the cat',
        'fr': 'le chat',
        'de': 'die katze'
}

class nlp.TranslationVariableLanguages(languages: List = None, num_languages: int = None, id: Optional[str] = None)[source]

FeatureConnector for translations with variable languages per example. Here for compatibility with tfds.

Input: The TranslationVariableLanguages feature accepts a dictionary for each example, mapping string language codes to one or more string translations. The languages present may vary from example to example.

Output:

language: variable-length 1D tf.Tensor of tf.string language codes, sorted in ascending order.

translation: variable-length 1D tf.Tensor of tf.string plain text translations, sorted to align with language codes.

Example:

# At construction time:

nlp.features.TranslationVariableLanguages(languages=['en', 'fr', 'de'])

# During data generation:

yield {
        'en': 'the cat',
        'fr': ['le chat', 'la chatte'],
        'de': 'die katze'
}

# Tensor returned :

{
        'language': ['en', 'de', 'fr', 'fr'],
        'translation': ['the cat', 'die katze', 'la chatte', 'le chat'],
}

MetricInfo

class nlp.MetricInfo(description: str, citation: str, features: nlp.features.Features, inputs_description: str = <factory>, homepage: str = <factory>, license: str = <factory>, codebase_urls: List[str] = <factory>, reference_urls: List[str] = <factory>, streamable: bool = False, format: Optional[str] = None, metric_name: Optional[str] = None, config_name: Optional[str] = None, version: Optional[str] = None)[source]

Information about a metric.

MetricInfo documents a metric, including its name, version, and features. See the constructor arguments and properties for a full list.

Note: Not all fields are known on construction and may be updated later.

classmethod from_directory(metric_info_dir) → nlp.info.MetricInfo[source]

Create MetricInfo from the JSON file in metric_info_dir.

Parameters

metric_info_dir (str) – The directory containing the metadata file. This should be the root directory of a specific metric version.

write_to_directory(metric_info_dir)[source]

Write MetricInfo as JSON to metric_info_dir. Also save the license separately in a LICENSE file.

Metric

The base class Metric implements a Metric backed by one or several nlp.Dataset.

class nlp.Metric(name: str = None, experiment_id: Optional[str] = None, process_id: int = 0, num_process: int = 1, data_dir: Optional[str] = None, in_memory: bool = False, hash: str = None, seed: Optional[int] = None, **kwargs)[source]

add(prediction=None, reference=None, **kwargs)[source]

Add one prediction and reference for the metric’s stack.

add_batch(predictions=None, references=None, **kwargs)[source]

Add a batch of predictions and references for the metric’s stack.

compute(predictions=None, references=None, timeout=120, **metrics_kwargs)[source]

Compute the metrics.
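
Example (a minimal sketch, assuming the ‘glue’ metric script with its ‘mrpc’ config is available):

metric = nlp.load_metric('glue', 'mrpc')
metric.add_batch(predictions=[0, 1, 0], references=[0, 1, 1])
score = metric.compute()  # gathers everything added so far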

download_and_prepare(download_config: Optional[nlp.utils.file_utils.DownloadConfig] = None, dl_manager: Optional[nlp.utils.download_manager.DownloadManager] = None, **download_and_prepare_kwargs)[source]

Downloads and prepares dataset for reading.

Parameters
  • download_config (Optional nlp.DownloadConfig) – specific download configuration parameters.

  • dl_manager (Optional nlp.DownloadManager) – specific Download Manager to use.

finalize(timeout=120)[source]

Close all the writing processes and load/gather the data from all the nodes if on the main node or if all_process is True.