---
title: Model Evaluator
emoji: πŸ“Š
colorFrom: red
colorTo: red
sdk: streamlit
sdk_version: 1.10.0
app_file: app.py
---

# Model Evaluator

> Submit evaluation jobs to AutoTrain from the Hugging Face Hub

**⚠️ This project has been archived. If you want to evaluate LLMs, check out [this collection](https://huggingface.co/collections/clefourrier/llm-leaderboards-and-benchmarks-✨-64f99d2e11e92ca5568a7cce) of leaderboards.**

## Supported tasks

The table below shows which tasks are currently supported for evaluation in the AutoTrain backend:

| Task                               | Supported |
|:-----------------------------------|:---------:|
| `binary_classification`            |     βœ…     |
| `multi_class_classification`       |     βœ…     |
| `multi_label_classification`       |     ❌     |
| `entity_extraction`                |     βœ…     |
| `extractive_question_answering`    |     βœ…     |
| `translation`                      |     βœ…     |
| `summarization`                    |     βœ…     |
| `image_binary_classification`      |     βœ…     |
| `image_multi_class_classification` |     βœ…     |
| `text_zero_shot_evaluation`        |     βœ…     |


## Installation

To run the application locally, first clone this repository and install the dependencies as follows:

```
pip install -r requirements.txt
```

Next, copy the example file of environment variables:

```
cp .env.template .env
```

and set the `HF_TOKEN` variable to a valid API token from the [`autoevaluator`](https://huggingface.co/autoevaluator) bot user. Finally, spin up the application by running:

```
streamlit run app.py
```
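
For reference, a populated `.env` might look like the sketch below. The token value is a placeholder, and whether `.env.template` also lists `AUTOTRAIN_BACKEND_API` is an assumption based on the configuration section further down:

```
HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx
AUTOTRAIN_BACKEND_API=https://api.autotrain.huggingface.co
```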

## Usage

Evaluation on the Hub involves two main steps:

1. Submitting an evaluation job via the UI. This creates an AutoTrain project with `N` models for evaluation. At this stage, the dataset is also processed and prepared for evaluation.
2. Triggering the evaluation itself once the dataset is processed.

From the user's perspective, only step (1) is needed, since step (2) is handled by a cron job on GitHub Actions that executes the `run_evaluation_jobs.py` script every 15 minutes.

See below for details on manually triggering evaluation jobs.

### Triggering an evaluation

To evaluate the models in an AutoTrain project, run:

```
python run_evaluation_jobs.py
```

This will download the [`autoevaluate/evaluation-job-logs`](https://huggingface.co/datasets/autoevaluate/evaluation-job-logs) dataset from the Hub and check which projects are ready for evaluation (i.e. those whose dataset has been processed).
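
As a rough sketch of what this step amounts to, the logs dataset can be inspected with the `datasets` library. The split name, the `status` field, and its value below are assumptions for illustration only, not the script's actual schema:

```python
import os

from datasets import load_dataset

# Download the evaluation job logs from the Hub; authentication may be
# required if the dataset is not public.
logs = load_dataset(
    "autoevaluate/evaluation-job-logs",
    split="train",
    use_auth_token=os.environ.get("HF_TOKEN"),
)

# Keep only the projects whose dataset has been processed and is therefore
# ready for evaluation (the "status" field and its value are hypothetical).
ready = [row for row in logs if row.get("status") == "data_processing_success"]
print(f"{len(ready)} project(s) ready for evaluation")
```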

## AutoTrain configuration details

Models are evaluated by the [`autoevaluator`](https://huggingface.co/autoevaluator) bot user in AutoTrain, with the evaluation payload sent to the backend API defined by the `AUTOTRAIN_BACKEND_API` environment variable. Evaluation projects are created and run in either the `prod` or `staging` environment. You can view the status of projects in the AutoTrain UI by navigating to one of the links below (ask internally for access to the staging UI):

| AutoTrain environment |                                                AutoTrain UI URL                                                |           `AUTOTRAIN_BACKEND_API`            |
|:---------------------:|:--------------------------------------------------------------------------------------------------------------:|:--------------------------------------------:|
|        `prod`         |         [`https://ui.autotrain.huggingface.co/projects`](https://ui.autotrain.huggingface.co/projects)         |     https://api.autotrain.huggingface.co     |
|       `staging`       | [`https://ui-staging.autotrain.huggingface.co/projects`](https://ui-staging.autotrain.huggingface.co/projects) | https://api-staging.autotrain.huggingface.co |


The current configuration for evaluation jobs running on [Spaces](https://huggingface.co/spaces/autoevaluate/model-evaluator) is:

```
AUTOTRAIN_BACKEND_API=https://api.autotrain.huggingface.co
```

To evaluate models with a _local_ instance of AutoTrain, change the environment variable to:

```
AUTOTRAIN_BACKEND_API=http://localhost:8000
```
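
To make the role of these two environment variables concrete, here is a minimal Python sketch of how the app might talk to the backend. The endpoint path and project ID are hypothetical and purely illustrative; the real AutoTrain API routes may differ:

```python
import os

import requests

# Both values come from the environment, so switching between prod, staging,
# and a local instance only requires changing .env (or the relevant secrets).
backend = os.environ["AUTOTRAIN_BACKEND_API"]
token = os.environ["HF_TOKEN"]

project_id = 123  # hypothetical project ID, for illustration only

# Hypothetical status-check request; the actual route and response schema
# are defined by the AutoTrain backend.
response = requests.get(
    f"{backend}/projects/{project_id}",
    headers={"Authorization": f"Bearer {token}"},
)
response.raise_for_status()
print(response.json())
```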

### Migrating from staging to production (and vice versa)

In general, evaluation jobs should run in AutoTrain's `prod` environment, which is defined by the following environment variable:

```
AUTOTRAIN_BACKEND_API=https://api.autotrain.huggingface.co
```

However, there are times when it is necessary to run evaluation jobs in AutoTrain's `staging` environment (e.g. because a new evaluation pipeline is being deployed). In these cases, the corresponding environment variable is:

```
AUTOTRAIN_BACKEND_API=https://api-staging.autotrain.huggingface.co
```

To migrate between these two environments, update the `AUTOTRAIN_BACKEND_API` secret in two places:

* In the [repo secrets](https://huggingface.co/spaces/autoevaluate/model-evaluator/settings) associated with the `model-evaluator` Space. This will ensure evaluation projects are created in the desired environment.
* In the [GitHub Actions secrets](https://github.com/huggingface/model-evaluator/settings/secrets/actions) associated with this repo. This will ensure that the correct evaluation jobs are approved and launched via the `run_evaluation_jobs.py` script.