Main classes
EvaluationModuleInfo
The base class EvaluationModuleInfo implements the logic for the subclasses MetricInfo, ComparisonInfo, and MeasurementInfo.
class evaluate.EvaluationModuleInfo(description: str, citation: str, features: typing.Union[datasets.features.features.Features, typing.List[datasets.features.features.Features]], inputs_description: str = <factory>, homepage: str = <factory>, license: str = <factory>, codebase_urls: typing.List[str] = <factory>, reference_urls: typing.List[str] = <factory>, streamable: bool = False, format: typing.Optional[str] = None, module_type: str = 'metric', module_name: typing.Optional[str] = None, config_name: typing.Optional[str] = None, experiment_id: typing.Optional[str] = None)
Base class to store information about an evaluation, used for MetricInfo, ComparisonInfo, and MeasurementInfo.
EvaluationModuleInfo documents an evaluation, including its name, version, and features.
See the constructor arguments and properties for a full list.
Note: Not all fields are known at construction time; some may be updated later.
from_directory(metric_info_dir)
Create EvaluationModuleInfo from the JSON file in metric_info_dir.
write_to_directory(metric_info_dir)
Write EvaluationModuleInfo as JSON to metric_info_dir.
Also save the license separately in LICENSE.
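The round trip described by from_directory and write_to_directory can be sketched with the standard library alone. The ToyInfo dataclass below is a hypothetical stand-in, not the library's class, and the file name info.json is an assumption made for illustration; only the separate LICENSE file comes from the description above.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class ToyInfo:
    """Hypothetical stand-in for an info dataclass; not the library's class."""

    description: str
    citation: str
    license: str = ""
    reference_urls: List[str] = field(default_factory=list)

    def write_to_directory(self, info_dir: str) -> None:
        # "info.json" is an assumed file name chosen for this sketch.
        with open(os.path.join(info_dir, "info.json"), "w", encoding="utf-8") as f:
            json.dump(asdict(self), f)
        # The license text is also saved on its own, in a file named LICENSE.
        with open(os.path.join(info_dir, "LICENSE"), "w", encoding="utf-8") as f:
            f.write(self.license)

    @classmethod
    def from_directory(cls, info_dir: str) -> "ToyInfo":
        # Rebuild the dataclass from the JSON written above.
        with open(os.path.join(info_dir, "info.json"), encoding="utf-8") as f:
            return cls(**json.load(f))


info = ToyInfo(description="Toy accuracy", citation="n/a", license="MIT")
with tempfile.TemporaryDirectory() as tmp:
    info.write_to_directory(tmp)
    restored = ToyInfo.from_directory(tmp)
    saved_files = sorted(os.listdir(tmp))
```

Because the dataclass is serialized field by field, the restored object compares equal to the original as long as every field is JSON-serializable.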
class evaluate.MetricInfo(description: str, citation: str, features: typing.Union[datasets.features.features.Features, typing.List[datasets.features.features.Features]], inputs_description: str = <factory>, homepage: str = <factory>, license: str = <factory>, codebase_urls: typing.List[str] = <factory>, reference_urls: typing.List[str] = <factory>, streamable: bool = False, format: typing.Optional[str] = None, module_type: str = 'metric', module_name: typing.Optional[str] = None, config_name: typing.Optional[str] = None, experiment_id: typing.Optional[str] = None)
Information about a metric.
EvaluationModuleInfo documents a metric, including its name, version, and features.
See the constructor arguments and properties for a full list.
Note: Not all fields are known at construction time; some may be updated later.
class evaluate.ComparisonInfo(description: str, citation: str, features: typing.Union[datasets.features.features.Features, typing.List[datasets.features.features.Features]], inputs_description: str = <factory>, homepage: str = <factory>, license: str = <factory>, codebase_urls: typing.List[str] = <factory>, reference_urls: typing.List[str] = <factory>, streamable: bool = False, format: typing.Optional[str] = None, module_type: str = 'comparison', module_name: typing.Optional[str] = None, config_name: typing.Optional[str] = None, experiment_id: typing.Optional[str] = None)
Information about a comparison.
EvaluationModuleInfo documents a comparison, including its name, version, and features.
See the constructor arguments and properties for a full list.
Note: Not all fields are known at construction time; some may be updated later.
class evaluate.MeasurementInfo(description: str, citation: str, features: typing.Union[datasets.features.features.Features, typing.List[datasets.features.features.Features]], inputs_description: str = <factory>, homepage: str = <factory>, license: str = <factory>, codebase_urls: typing.List[str] = <factory>, reference_urls: typing.List[str] = <factory>, streamable: bool = False, format: typing.Optional[str] = None, module_type: str = 'measurement', module_name: typing.Optional[str] = None, config_name: typing.Optional[str] = None, experiment_id: typing.Optional[str] = None)
Information about a measurement.
EvaluationModuleInfo documents a measurement, including its name, version, and features.
See the constructor arguments and properties for a full list.
Note: Not all fields are known at construction time; some may be updated later.
EvaluationModule
The base class EvaluationModule implements the logic for the subclasses Metric, Comparison, and Measurement.
class evaluate.EvaluationModule(config_name: typing.Optional[str] = None, keep_in_memory: bool = False, cache_dir: typing.Optional[str] = None, num_process: int = 1, process_id: int = 0, seed: typing.Optional[int] = None, experiment_id: typing.Optional[str] = None, hash: str = None, max_concurrent_cache_files: int = 10000, timeout: typing.Union[int, float] = 100, **kwargs)
Parameters
- config_name (str) — This is used to define a hash specific to a module computation script and prevents the module’s data from being overridden when the module loading script is modified.
- keep_in_memory (bool) — Keep all predictions and references in memory. Not possible in distributed settings.
- cache_dir (str) — Path to a directory in which temporary prediction/reference data will be stored. The data directory should be located on a shared file system in distributed setups.
- num_process (int) — Specify the total number of nodes in a distributed setting. This is useful to compute modules in distributed setups (in particular non-additive modules like F1).
- process_id (int) — Specify the id of the current process in a distributed setup (between 0 and num_process-1). This is useful to compute modules in distributed setups (in particular non-additive metrics like F1).
- seed (int, optional) — If specified, this will temporarily set numpy’s random seed when compute() is run.
- experiment_id (str) — A specific experiment id. This is used if several distributed evaluations share the same file system. This is useful to compute modules in distributed setups (in particular non-additive metrics like F1).
- hash (str) — Used to identify the evaluation module according to the hashed file contents.
- max_concurrent_cache_files (int) — Max number of concurrent module cache files (default 10000).
- timeout (Union[int, float]) — Timeout in seconds for distributed setting synchronization.
An EvaluationModule is the base class and common API for metrics, comparisons, and measurements.
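The parameter descriptions above single out non-additive metrics such as F1. A minimal sketch of why that matters, using a hand-written binary F1 and two made-up shards (one per hypothetical process): averaging per-shard scores generally gives a different number than scoring the pooled data, so predictions and references from every process have to be brought together before the score is computed. Nothing here calls the library; the function and the data are illustrative assumptions.

```python
def f1(predictions, references):
    """Binary F1 for the positive class (label 1)."""
    pairs = list(zip(predictions, references))
    tp = sum(p == 1 and r == 1 for p, r in pairs)
    fp = sum(p == 1 and r == 0 for p, r in pairs)
    fn = sum(p == 0 and r == 1 for p, r in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Two made-up shards, as if each had been produced by a separate process.
shard_0 = {"predictions": [1, 1, 1, 1], "references": [1, 1, 1, 0]}
shard_1 = {"predictions": [1, 0, 0, 0], "references": [0, 1, 0, 0]}

# Naive approach: score each shard separately, then average the scores.
mean_of_shards = (f1(**shard_0) + f1(**shard_1)) / 2

# Pooled approach: gather all predictions and references, then score once.
pooled_f1 = f1(
    shard_0["predictions"] + shard_1["predictions"],
    shard_0["references"] + shard_1["references"],
)

print(mean_of_shards, pooled_f1)
```

With these shards the averaged score is 3/7 while the pooled score is 2/3, which is the kind of discrepancy the shared cache directory and the num_process / process_id settings are meant to avoid.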
add(prediction = None, reference = None, **kwargs)
Add one prediction and reference to the evaluation module’s stack.
add_batch(predictions = None, references = None, **kwargs)
Add a batch of predictions and references to the evaluation module’s stack.
compute(predictions = None, references = None, **kwargs) → dict or None
Parameters
- predictions (list/array/tensor, optional) — Predictions.
- references (list/array/tensor, optional) — References.
- **kwargs (optional) — Keyword arguments that will be forwarded to the evaluation module compute() method (see details in the docstring).
Returns
dict or None
- Dictionary with the results if this evaluation module is run on the main process (process_id == 0).
- None if the evaluation module is not run on the main process (process_id != 0).
Compute the evaluation module.
Positional arguments are not allowed, to prevent mistakes.
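The add, add_batch, and compute flow described above can be sketched with a small stand-in class. ToyAccuracy is hypothetical and is not part of the library: it keeps its stack in plain lists, clears the stack after compute() (an assumption of this sketch), and does not model how the main process gathers data from other processes. The bare `*` in each signature is one way to make arguments keyword-only in Python.

```python
class ToyAccuracy:
    """Hypothetical stand-in that mimics the add / add_batch / compute flow."""

    def __init__(self, process_id: int = 0):
        self.process_id = process_id
        self._predictions = []
        self._references = []

    # The bare * makes every argument keyword-only, so positional calls fail.
    def add(self, *, prediction=None, reference=None):
        self._predictions.append(prediction)
        self._references.append(reference)

    def add_batch(self, *, predictions=None, references=None):
        self._predictions.extend(predictions)
        self._references.extend(references)

    def compute(self, *, predictions=None, references=None):
        # Inputs passed directly to compute() are added to the stack first.
        if predictions is not None and references is not None:
            self.add_batch(predictions=predictions, references=references)
        preds, refs = self._predictions, self._references
        # Assumption of this sketch: the stack is cleared after each compute().
        self._predictions, self._references = [], []
        if self.process_id != 0:
            return None  # only the main process returns results
        correct = sum(p == r for p, r in zip(preds, refs))
        return {"accuracy": correct / len(preds)}


metric = ToyAccuracy()
metric.add(prediction=1, reference=1)
metric.add_batch(predictions=[0, 1], references=[1, 1])
result = metric.compute(predictions=[0], references=[0])
print(result)
```

Calling `metric.compute([0], [0])` on this stand-in raises a TypeError, which is the kind of mistake keyword-only arguments are meant to catch: predictions and references cannot be swapped silently.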
download_and_prepare(download_config: typing.Optional[datasets.download.download_config.DownloadConfig] = None, dl_manager: typing.Optional[datasets.download.download_manager.DownloadManager] = None)
Downloads and prepares the evaluation module for reading.
class evaluate.Metric(config_name: typing.Optional[str] = None, keep_in_memory: bool = False, cache_dir: typing.Optional[str] = None, num_process: int = 1, process_id: int = 0, seed: typing.Optional[int] = None, experiment_id: typing.Optional[str] = None, hash: str = None, max_concurrent_cache_files: int = 10000, timeout: typing.Union[int, float] = 100, **kwargs)
Parameters
- config_name (str) — This is used to define a hash specific to a metric computation script and prevents the metric’s data from being overridden when the metric loading script is modified.
- keep_in_memory (bool) — Keep all predictions and references in memory. Not possible in distributed settings.
- cache_dir (str) — Path to a directory in which temporary prediction/reference data will be stored. The data directory should be located on a shared file system in distributed setups.
- num_process (int) — Specify the total number of nodes in a distributed setting. This is useful to compute metrics in distributed setups (in particular non-additive metrics like F1).
- process_id (int) — Specify the id of the current process in a distributed setup (between 0 and num_process-1). This is useful to compute metrics in distributed setups (in particular non-additive metrics like F1).
- seed (int, optional) — If specified, this will temporarily set numpy’s random seed when compute() is run.
- experiment_id (str) — A specific experiment id. This is used if several distributed evaluations share the same file system. This is useful to compute metrics in distributed setups (in particular non-additive metrics like F1).
- max_concurrent_cache_files (int) — Max number of concurrent metric cache files (default 10000).
- timeout (Union[int, float]) — Timeout in seconds for distributed setting synchronization.
A Metric is the base class and common API for all metrics.
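The seed parameter above is described as temporarily setting a random seed while compute() runs. The general pattern can be sketched as a context manager that saves the generator state, seeds it, and restores the state afterwards. This sketch swaps in Python's standard random module for numpy so that it has no dependencies, and temp_seed is a hypothetical helper name, not a documented library function.

```python
import random
from contextlib import contextmanager


@contextmanager
def temp_seed(seed):
    """Seed the global generator inside the block, then restore its state."""
    state = random.getstate()
    random.seed(seed)
    try:
        yield
    finally:
        # Restore the previous state even if the block raises.
        random.setstate(state)


# Record what the outer random stream would produce next.
random.seed(123)
expected_next = random.random()

random.seed(123)
with temp_seed(42):
    a = random.random()  # drawn from the temporary seed
after = random.random()  # the outer stream continues as if undisturbed

with temp_seed(42):
    b = random.random()  # same temporary seed, same draw
```

The two draws inside the seeded blocks are identical, and the draw made between them matches what the outer stream would have produced without the temporary seed.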
class evaluate.Comparison(config_name: typing.Optional[str] = None, keep_in_memory: bool = False, cache_dir: typing.Optional[str] = None, num_process: int = 1, process_id: int = 0, seed: typing.Optional[int] = None, experiment_id: typing.Optional[str] = None, hash: str = None, max_concurrent_cache_files: int = 10000, timeout: typing.Union[int, float] = 100, **kwargs)
Parameters
- config_name (str) — This is used to define a hash specific to a comparison computation script and prevents the comparison’s data from being overridden when the comparison loading script is modified.
- keep_in_memory (bool) — Keep all predictions and references in memory. Not possible in distributed settings.
- cache_dir (str) — Path to a directory in which temporary prediction/reference data will be stored. The data directory should be located on a shared file system in distributed setups.
- num_process (int) — Specify the total number of nodes in a distributed setting. This is useful to compute comparisons in distributed setups (in particular non-additive comparisons).
- process_id (int) — Specify the id of the current process in a distributed setup (between 0 and num_process-1). This is useful to compute comparisons in distributed setups (in particular non-additive comparisons).
- seed (int, optional) — If specified, this will temporarily set numpy’s random seed when compute() is run.
- experiment_id (str) — A specific experiment id. This is used if several distributed evaluations share the same file system. This is useful to compute comparisons in distributed setups (in particular non-additive comparisons).
- max_concurrent_cache_files (int) — Max number of concurrent comparison cache files (default 10000).
- timeout (Union[int, float]) — Timeout in seconds for distributed setting synchronization.
A Comparison is the base class and common API for all comparisons.
class evaluate.Measurement(config_name: typing.Optional[str] = None, keep_in_memory: bool = False, cache_dir: typing.Optional[str] = None, num_process: int = 1, process_id: int = 0, seed: typing.Optional[int] = None, experiment_id: typing.Optional[str] = None, hash: str = None, max_concurrent_cache_files: int = 10000, timeout: typing.Union[int, float] = 100, **kwargs)
Parameters
- config_name (str) — This is used to define a hash specific to a measurement computation script and prevents the measurement’s data from being overridden when the measurement loading script is modified.
- keep_in_memory (bool) — Keep all predictions and references in memory. Not possible in distributed settings.
- cache_dir (str) — Path to a directory in which temporary prediction/reference data will be stored. The data directory should be located on a shared file system in distributed setups.
- num_process (int) — Specify the total number of nodes in a distributed setting. This is useful to compute measurements in distributed setups (in particular non-additive measurements).
- process_id (int) — Specify the id of the current process in a distributed setup (between 0 and num_process-1). This is useful to compute measurements in distributed setups (in particular non-additive measurements).
- seed (int, optional) — If specified, this will temporarily set numpy’s random seed when compute() is run.
- experiment_id (str) — A specific experiment id. This is used if several distributed evaluations share the same file system. This is useful to compute measurements in distributed setups (in particular non-additive measurements).
- max_concurrent_cache_files (int) — Max number of concurrent measurement cache files (default 10000).
- timeout (Union[int, float]) — Timeout in seconds for distributed setting synchronization.
A Measurement is the base class and common API for all measurements.
CombinedEvaluations
The combine function combines multiple EvaluationModules into a single CombinedEvaluations object.
evaluate.combine(evaluations, force_prefix = False)
Parameters
- evaluations (Union[list, dict]) — A list or dictionary of evaluation modules. The modules can either be passed as strings or as loaded EvaluationModules. If a dictionary is passed, its keys are the names used and its values are the modules. The names are used as prefixes in case there are name overlaps in the returned results of each module or if force_prefix=True.
- force_prefix (bool, optional, defaults to False) — If True, all scores from the modules are prefixed with their name. If a dictionary is passed, the keys are used as names; otherwise the module’s name is used.
Combines several metrics, comparisons, or measurements into a single CombinedEvaluations object that can be used like a single evaluation module.
If two scores have the same name, they are prefixed with their module names. If two modules have the same name, use a dictionary to give them different names; otherwise an integer id is appended to the prefix.
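The naming rules above can be illustrated with a small merge function. This is a sketch of the described behavior, not the library's code: merge_results is a hypothetical helper, and the underscore separator and the zero-based integer id are assumptions made for illustration.

```python
from collections import Counter


def merge_results(named_results, force_prefix=False):
    """Merge per-module result dicts into one dict.

    named_results is a list of (module_name, result_dict) pairs.
    """
    key_counts = Counter(key for _, result in named_results for key in result)
    name_counts = Counter(name for name, _ in named_results)
    seen = Counter()
    merged = {}
    for name, result in named_results:
        if name_counts[name] > 1:
            # Modules sharing a name get an integer id appended to the prefix.
            prefix = f"{name}_{seen[name]}"
            seen[name] += 1
        else:
            prefix = name
        for key, value in result.items():
            # Prefix only when score names clash or when it is forced.
            if force_prefix or key_counts[key] > 1:
                merged[f"{prefix}_{key}"] = value
            else:
                merged[key] = value
    return merged


results = [("accuracy", {"accuracy": 0.9}), ("f1", {"f1": 0.8})]
plain = merge_results(results)
prefixed = merge_results(results, force_prefix=True)
clash = merge_results([("bleu", {"score": 30.0}), ("sacrebleu", {"score": 31.0})])
same_name = merge_results([("f1", {"f1": 0.8}), ("f1", {"f1": 0.7})])
```

In this sketch, plain keeps the bare score names, prefixed and clash carry the module name in front of each score, and same_name falls back to numbered prefixes because both modules are called f1.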
add(prediction = None, reference = None, **kwargs)
Add one prediction and reference to each evaluation module’s stack.
add_batch(predictions = None, references = None, **kwargs)
Add a batch of predictions and references to each evaluation module’s stack.
compute(predictions = None, references = None, **kwargs) → dict or None
Parameters
- predictions (list/array/tensor, optional) — Predictions.
- references (list/array/tensor, optional) — References.
- **kwargs (optional) — Keyword arguments that will be forwarded to the evaluation module compute() method (see details in the docstring).
Returns
dict or None
- Dictionary with the results if this evaluation module is run on the main process (process_id == 0).
- None if the evaluation module is not run on the main process (process_id != 0).
Compute each evaluation module.
Positional arguments are not allowed, to prevent mistakes.