---
title: DGEB
app_file : leaderboard/app.py
sdk: docker
sdk_version: 4.36.1
---
<h1 align="center">Diverse Genomic Embedding Benchmark</h1>

<p align="center">
    <a href="https://github.com/tattabio/dgeb/releases">
        <img alt="GitHub release" src="https://img.shields.io/github/v/release/tattabio/dgeb.svg">
    </a>
    <a href="">
        <img alt="arXiv URL" src="">
    </a>
    <a href="https://github.com/tattabio/dgeb/blob/main/LICENSE">
        <img alt="License" src="https://img.shields.io/github/license/tattabio/dgeb.svg">
    </a>
    <a href="https://pepy.tech/project/dgeb">
        <img alt="Downloads" src="https://static.pepy.tech/personalized-badge/dgeb?period=total&units=international_system&left_color=grey&right_color=orange&left_text=Downloads">
    </a>
</p>

<h4 align="center">
    <p>
        <a href="#installation">Installation</a> |
        <a href="#usage">Usage</a> |
        <a href="https://huggingface.co/spaces/tattabio/DGEB">Leaderboard</a> |
        <a href="#citing">Citing</a>
    </p>
</h4>

<h3 align="center">
    <a href="https://huggingface.co/spaces/dgeb"><img style="float: middle; padding: 10px 10px 10px 10px;" width="100" height="100" src="./docs/images/tatta_logo.png" /></a>
</h3>

DGEB is a benchmark for evaluating biological sequence models on functional and evolutionary information.

DGEB is designed to evaluate model embeddings using:

- Diverse sequences across the tree of life.
- Diverse tasks that capture different aspects of biological function.
- Both amino acid and nucleotide sequences.

The current version of DGEB consists of 18 datasets covering all three domains of life (Bacteria, Archaea and Eukarya). DGEB evaluates embeddings using six different embedding tasks: Classification, BiGene mining, Evolutionary Distance Similarity (EDS), Pair Classification, Clustering, and Retrieval.

We welcome contributions of new tasks and datasets.

## Installation

Install DGEB using pip.

```bash
pip install dgeb
```

## Usage

- Launch an evaluation with the `dgeb` command-line script (see [cli.py](https://github.com/tattabio/dgeb/blob/main/dgeb/cli.py)):

```bash
dgeb --model facebook/esm2_t6_8M_UR50D
```

- To see all supported models and tasks:

```bash
dgeb --help
```

- Use the Python API:

```py
import dgeb

model = dgeb.get_model("facebook/esm2_t6_8M_UR50D")
tasks = dgeb.get_tasks_by_modality(dgeb.Modality.PROTEIN)
evaluation = dgeb.DGEB(tasks=tasks)
evaluation.run(model, output_folder="results")
```
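
After a run, it can be useful to check what was written to `output_folder`. The helper below is a minimal sketch, not part of the DGEB API: `list_result_files` is a hypothetical name, and it assumes only that results are saved as `.json` files somewhere under the output folder. It makes no assumption about the result schema and just reports each file's top-level keys.

```python
import json
from pathlib import Path


def list_result_files(output_folder: str) -> dict[str, list[str]]:
    """Map each JSON file under output_folder to its top-level keys.

    Hypothetical helper: assumes results are stored as .json files and
    does not rely on any particular DGEB result schema.
    """
    summary = {}
    for path in sorted(Path(output_folder).rglob("*.json")):
        with path.open() as f:
            data = json.load(f)
        # Only dict-shaped files have keys; anything else is reported as empty.
        keys = sorted(data) if isinstance(data, dict) else []
        summary[str(path.relative_to(output_folder))] = keys
    return summary
```

For example, `list_result_files("results")` returns an empty dict if nothing has been written yet, so it is safe to call before or after an evaluation.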

### Using a custom model

Custom models should subclass the `dgeb.models.BioSeqTransformer` abstract class and specify the modality, number of layers, and embedding dimension. See [models.py](https://github.com/tattabio/dgeb/blob/main/dgeb/models.py) for additional examples of custom model loading and inference.

```python
import dgeb
from dgeb.models import BioSeqTransformer
from dgeb.tasks.tasks import Modality

class MyModel(BioSeqTransformer):

    @property
    def modality(self) -> Modality:
        return Modality.PROTEIN

    @property
    def num_layers(self) -> int:
        return self.config.num_hidden_layers

    @property
    def embed_dim(self) -> int:
        return self.config.hidden_size


model = MyModel(model_name='path_to/huggingface_model')
tasks = dgeb.get_tasks_by_modality(model.modality)
evaluation = dgeb.DGEB(tasks=tasks)
evaluation.run(model)
```
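
To illustrate why all three properties need to be defined, here is a self-contained sketch of the general Python abstract-property pattern. `SketchBase` is a stand-in written for this example, not DGEB's `BioSeqTransformer`, and whether the real class declares its properties exactly this way is an assumption. `missing_properties` is likewise a hypothetical helper.

```python
from abc import ABC, abstractmethod


class SketchBase(ABC):
    """Stand-in abstract base class; NOT dgeb's BioSeqTransformer."""

    @property
    @abstractmethod
    def modality(self) -> str: ...

    @property
    @abstractmethod
    def num_layers(self) -> int: ...

    @property
    @abstractmethod
    def embed_dim(self) -> int: ...


class Incomplete(SketchBase):
    # Overrides only one of the three required properties.
    @property
    def modality(self) -> str:
        return "protein"


class Complete(Incomplete):
    # Illustrative values for a small 6-layer model.
    @property
    def num_layers(self) -> int:
        return 6

    @property
    def embed_dim(self) -> int:
        return 320


def missing_properties(cls) -> list[str]:
    """Return the abstract members a class still has to override."""
    return sorted(getattr(cls, "__abstractmethods__", ()))
```

With this pattern, `Incomplete()` raises a `TypeError` at instantiation time, while `Complete()` can be constructed normally. Checking `missing_properties(MyClass)` before instantiating is one way to spot a forgotten property early.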

### Evaluating on a custom dataset

**We strongly encourage users to contribute their custom datasets to DGEB. Please open a PR adding your dataset so that the community can benefit!**

To evaluate on a custom dataset, first upload your dataset to the [Hugging Face Hub](https://huggingface.co/docs/hub/en/datasets-adding). Then define a `Task` subclass whose `TaskMetadata` points to your Hugging Face dataset. For example, a classification task on a custom dataset can be defined as follows:

```python
import dgeb
from dgeb.models import BioSeqTransformer
from dgeb.tasks import Dataset, Task, TaskMetadata, TaskResult
from dgeb.tasks.classification_tasks import run_classification_task
from dgeb.tasks.tasks import Modality  # needed for Modality.PROTEIN below

class MyCustomTask(Task):
    metadata = TaskMetadata(
        id="my_custom_classification",
        display_name="...",
        description="...",
        type="classification",
        modality=Modality.PROTEIN,
        datasets=[
            Dataset(
                path="path_to/huggingface_dataset",
                revision="...",
            )
        ],
        primary_metric_id="f1",
    )

    def run(self, model: BioSeqTransformer) -> TaskResult:
        return run_classification_task(model, self.metadata)

model = dgeb.get_model("facebook/esm2_t6_8M_UR50D")
evaluation = dgeb.DGEB(tasks=[MyCustomTask])
evaluation.run(model)
```

## Leaderboard

To add your submission to the DGEB leaderboard, follow these steps.

1. Fork the DGEB repository by following GitHub's [forking workflow](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request-from-a-fork) instructions.

2. Add your submission `.json` file to the `leaderboard/submissions/<HF_MODEL_NAME>/` directory.

```bash
mv /path/to/<SUBMISSION_FILE>.json /path/to/DGEB/leaderboard/submissions/<HF_MODEL_NAME>/
```

3. Update your fork with the new submission:

```bash
git add leaderboard/submissions/<HF_MODEL_NAME>/<SUBMISSION_FILE>.json
git commit -m "Add submission for <HF_MODEL_NAME>"
git push
```

4. Open a pull request to the main branch of the repository via the GitHub interface.

5. Once the PR is reviewed and merged, your submission will be added to the leaderboard!
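
The file placement step above can also be scripted. The following is a minimal sketch, not part of DGEB: `stage_submission` is a hypothetical helper that copies (rather than moves) the submission into `leaderboard/submissions/<HF_MODEL_NAME>/` inside a local checkout. It only checks that the file parses as JSON and does not validate whatever schema the leaderboard expects.

```python
import json
import shutil
from pathlib import Path


def stage_submission(submission_file: str, repo_root: str, hf_model_name: str) -> Path:
    """Copy a submission .json into leaderboard/submissions/<hf_model_name>/.

    Hypothetical helper, not part of DGEB. It only checks that the file
    parses as JSON; it does not validate the leaderboard's schema.
    """
    src = Path(submission_file)
    if src.suffix != ".json":
        raise ValueError(f"expected a .json file, got {src.name}")
    with src.open() as f:
        json.load(f)  # raises json.JSONDecodeError on malformed input

    dest_dir = Path(repo_root) / "leaderboard" / "submissions" / hf_model_name
    dest_dir.mkdir(parents=True, exist_ok=True)
    # copy2 keeps the original file in place and preserves its metadata.
    return Path(shutil.copy2(src, dest_dir / src.name))
```

The returned path is the file to pass to `git add`. Copying instead of moving leaves the original results untouched in case the submission needs to be regenerated.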


## Acknowledgements

DGEB follows the design of the text embedding benchmark [MTEB](https://github.com/embeddings-benchmark/mteb) developed by Huggingface 🤗. The evaluation code is adapted from the MTEB codebase.

## Citing

DGEB was introduced in "[Diverse Genomic Embedding Benchmark for Functional Evaluation Across the Tree of Life]()". Feel free to cite:

TODO