Creating and sharing a new evaluation
All evaluation modules, be they metrics, comparisons, or measurements, live on the 🤗 Hub in a Space (see for example Accuracy). In principle, you could set up a new Space and add a new module following the same structure. However, we added a CLI that makes creating a new evaluation module much easier:
evaluate-cli create "My Metric" --module_type "metric"
This will create a new Space on the 🤗 Hub, clone it locally, and populate it with a template. Instructions on how to fill in the template are displayed in the terminal and are also explained here in more detail.
For more information about Spaces, see the Spaces documentation.
Module script
The evaluation module script (the file with suffix *.py) is the core of the new module and includes all the code for computing the evaluation.
Attributes
Start by adding some information about your evaluation module in EvaluationModule._info(). The most important attributes you should specify are:
EvaluationModuleInfo.description provides a brief description of your evaluation module.
EvaluationModuleInfo.citation contains a BibTeX citation for the evaluation module.
EvaluationModuleInfo.inputs_description describes the expected inputs and outputs. It may also provide an example usage of the evaluation module.
EvaluationModuleInfo.features defines the name and type of the predictions and references. This has to be either a single datasets.Features object or a list of datasets.Features objects if multiple input types are allowed.
Then, we can move on to prepare everything before the actual computation.
Download
Some evaluation modules require external data: NLTK, for example, needs additional resources, and the BLEURT metric needs model checkpoints. You can implement these downloads in EvaluationModule._download_and_prepare(), which downloads and caches the resources via the dl_manager argument. A simplified example of how BLEURT downloads and loads a checkpoint:
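def _download_and_prepare(self, dl_manager):
    # Download and unpack the checkpoint for the selected configuration
    model_path = dl_manager.download_and_extract(CHECKPOINT_URLS[self.config_name])
    # Load the scorer from the extracted checkpoint directory
    self.scorer = score.BleurtScorer(os.path.join(model_path, self.config_name))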
Or, if you need to download the NLTK "punkt" resources:
def _download_and_prepare(self, dl_manager):
    import nltk

    nltk.download("punkt")
Next, we need to define how the computation of the evaluation module works.
Compute
The computation is performed in the EvaluationModule._compute method. It takes the same arguments as those defined in EvaluationModuleInfo.features and should return the result as a dictionary. Here is an example of an exact match metric:
def _compute(self, references, predictions):
    em = sum([r == p for r, p in zip(references, predictions)]) / len(references)
    return {"exact_match": em}
This method is used when you call .compute() later on.
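To sanity-check the logic before wiring it into a module, the exact match calculation can also be tried as a plain function. The following is a minimal sketch: the function name exact_match and the two input checks are additions made for this illustration and are not part of the library.

```python
def exact_match(references, predictions):
    """Return the fraction of predictions equal to their reference."""
    # Input checks added for this sketch (not in the original example).
    if len(references) != len(predictions):
        raise ValueError("references and predictions must have the same length")
    if not references:
        raise ValueError("at least one example is required")
    # Booleans sum as 0/1, so this counts the matching pairs.
    matches = sum(r == p for r, p in zip(references, predictions))
    return {"exact_match": matches / len(references)}


# Three of the four predictions match their reference.
result = exact_match(references=[0, 1, 2, 3], predictions=[0, 1, 1, 3])
print(result)  # {'exact_match': 0.75}
```

Returning a dictionary keeps the result in the same shape that _compute is expected to produce.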
Readme
When you use the evaluate-cli to set up the evaluation module, the Readme structure and instructions are created automatically. The Readme should include a general description of the metric, information about its input/output format, examples, information about its limitations or biases, and references.
Requirements
If your evaluation module has additional dependencies (e.g. sklearn or nltk), the requirements.txt file is the place to put them. The file follows the pip format and you can list all dependencies there.
App
The app.py is where the Spaces widget lives. In general it looks like the following and does not require any changes:
import evaluate
from evaluate.utils import launch_gradio_widget
module = evaluate.load("lvwerra/element_count")
launch_gradio_widget(module)
If you want a custom widget, you could add your Gradio app here.
Push to Hub
Finally, when you are done with all the above changes, it is time to push your evaluation module to the Hub. To do so, navigate to the folder of your module and git add/commit/push the changes to the Hub:
cd PATH_TO_MODULE
git add .
git commit -m "Add my new, shiny module."
git push
Tada 🎉! Your evaluation module is now on the 🤗 Hub and ready to be used by everybody!