Empathi
Embedding-based Phage Protein Annotation Tool by Hierarchical Assignment

Table of Contents

About the Project
Getting Started
- Prerequisites
- Installation
Usage details

About the Project

Empathi is a tool for the prediction of bacteriophage protein functions. It utilizes the highly informative ProtT5 protein embeddings to make predictions. In addition, new functional groups were defined to be better suited for machine-learning than the often-overlapping PHROG categories.

Citation: Boulay, A., Leprince, A., Enault, F. et al. Empathi: embedding-based phage protein annotation tool by hierarchical assignment. Nat Commun 16, 9114 (2025). https://doi.org/10.1038/s41467-025-64177-5.

Getting Started

Empathi is available on PyPI and as an Apptainer container for ease of use.
The source code can also be downloaded from Hugging Face.

Prerequisites

A GPU is recommended for large datasets.

The full list of dependencies and versions can be found in requirements.txt.

Either git-lfs or Apptainer will be required. See instructions below.

Other dependencies are taken care of by pip and Apptainer.

python/3.11.5
joblib==1.2.0
numpy==1.26.4
pandas==2.2.1
torch==2.3.0
scipy==1.13.1
scikit-learn==1.5.0
transformers==4.43.1
sentencepiece==0.2.0

Installation

There are three ways of installing Empathi: through PyPI, as an Apptainer container, or from source code. Installation should take less than 10 minutes. A small FASTA file is provided to test the installation; the test should run in under 1 minute.

1. PIP

First, create a conda environment with Python 3.11.5.

conda create -n empathi_env python=3.11.5
conda activate empathi_env

Download the models for Empathi. You will need git-lfs: on WSL or Linux, use sudo apt-get install git-lfs; on Windows, either use Git Bash or get it from here. Then:

git lfs install
git clone https://huggingface.co/AlexandreBoulay/empathi
export PATH="/path/to/empathi/models:$PATH"

Install Empathi (pip also installs its dependencies):

pip install empathi

Usage

empathi input_file name

2. Apptainer

Install Apptainer or Singularity. On Windows, this requires a virtual machine; WSL works well.

Fetch Empathi from Sylabs:

apptainer pull empathi.sif library://alexandreboulay/empathi/empathi

Usage

apptainer run empathi.sif path/to/input_file name --confidence 0.95

3. From source code

First, create a conda environment with Python 3.11.5.

conda create -n empathi_env python=3.11.5
conda activate empathi_env

Clone the repo. You will need git-lfs: on WSL or Linux, use sudo apt-get install git-lfs; on Windows, either use Git Bash or get it from here. Then:

git lfs install
git clone https://huggingface.co/AlexandreBoulay/empathi

Install dependencies:

cd empathi
pip install -r requirements.txt

Usage

python src/empathi/empathi.py input_file name

Usage details

The input can be either a FASTA file of protein sequences or a CSV file of protein embeddings.

By default, a function is assigned only when its confidence is >0.95 (--confidence 0.95). This is well suited for metagenomic datasets. A high threshold gives more precise predictions (lower false positive rate) but lower sensitivity (fewer proteins are assigned a function). If your objective is to annotate as many proteins as possible, consider a threshold of 0.5 (or somewhere in between), especially when working with cultured phages.
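To illustrate the trade-off, here is a minimal sketch that applies a threshold to per-category confidences like those in the output file. It is not Empathi's internal code and it ignores the hierarchical structure of the models; the function name `assign_functions` and the example scores are hypothetical. The strict `>` comparison and the "|" separator follow the descriptions in this README.

```python
def assign_functions(scores, confidence=0.95):
    """Return the categories whose confidence exceeds the threshold, joined by '|'.

    Illustrative only: `scores` maps a category name to a confidence in [0, 1].
    """
    kept = [category for category, score in scores.items() if score > confidence]
    return "|".join(kept)


# Hypothetical confidences for a single protein.
example = {"PVP": 0.98, "cell wall depolymerase": 0.72, "DNA-associated": 0.005}

strict = assign_functions(example, confidence=0.95)   # fewer, more precise labels
lenient = assign_functions(example, confidence=0.5)   # more labels, more false positives
print(strict)
print(lenient)
```

With the stricter threshold only the highest-scoring category is kept; lowering it to 0.5 also keeps the second category.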

Specifying the option --only_embeddings computes the embeddings without predicting functions. This step is much faster with a GPU. The resulting embeddings file can then be passed back to the same command (without --only_embeddings) as the input file.

Options:

input_file: Path to input file containing protein sequences (.fa*) or protein embeddings (.pkl/.csv).
name: Name of the file you want to save to (without extension). Should be different between runs to avoid overwriting files.
--models_folder: Path to folder containing Empathi models. Can be left unspecified if it was added to PATH earlier.
--only_embeddings: Whether to only calculate embeddings (no functional prediction).
--output_folder: Path to the output folder. Default is ./empathi_out/.
--threads: Number of threads (default 1).
--confidence: Confidence threshold used to assign predictions (default 0.95).
--mode: Which types of proteins you want to predict. Accepted arguments are "all", "pvp", "DNA-associated", "adsorption-related", "lysis", "regulator", "cell_wall_depolymerase", "packaging", "RNA-associated", "ejection", "phosphorylation", "transferase", "nucleotide_metabolism", "reductase" and "defense_systems".
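The options above can be assembled programmatically, which helps when scripting many runs. The sketch below is only an illustration: the helper `build_command` and the file names are hypothetical, and it assumes that `--only_embeddings` is a bare flag that takes no value. The list of modes is copied from the option list.

```python
import shlex

# Accepted values for --mode, copied from the option list above.
MODES = (
    "all", "pvp", "DNA-associated", "adsorption-related", "lysis", "regulator",
    "cell_wall_depolymerase", "packaging", "RNA-associated", "ejection",
    "phosphorylation", "transferase", "nucleotide_metabolism", "reductase",
    "defense_systems",
)


def build_command(input_file, name, confidence=0.95, mode="all", threads=1,
                  only_embeddings=False, models_folder=None, output_folder=None):
    """Build an argument list for the empathi command line (hypothetical helper)."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    cmd = ["empathi", str(input_file), name,
           "--confidence", str(confidence),
           "--mode", mode,
           "--threads", str(threads)]
    if models_folder is not None:
        cmd += ["--models_folder", str(models_folder)]
    if output_folder is not None:
        cmd += ["--output_folder", str(output_folder)]
    if only_embeddings:
        # Assumption: the option is a flag without an argument.
        cmd.append("--only_embeddings")
    return cmd


# Step 1: compute embeddings only (ideally on a GPU machine).
embed_cmd = build_command("proteins.faa", "run1", only_embeddings=True)
# Step 2: predict lysis proteins with a lenient threshold.
predict_cmd = build_command("proteins.faa", "run1_lysis", confidence=0.5, mode="lysis")
print(shlex.join(embed_cmd))
print(shlex.join(predict_cmd))
```

Once Empathi is installed, either list can be passed to `subprocess.run(cmd, check=True)`. For the two-step workflow, the input of the second command would be the embeddings file written by the first; its exact name is not specified here, so check the output folder.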

Output format

The output is a CSV file with an Annotation column listing all annotations assigned to each protein (separated by "|"), plus one column per functional category giving the confidence of each prediction.

Example:

Annotation	PVP	cell wall depolymerase	DNA-associated	...
PVP|cell wall depolymerase	0.98	0.99	0.005	...
DNA-associated	0.01	0.05	0.998	...
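As a rough sketch of how such a file could be read downstream, the snippet below parses rows with Python's standard `csv` module. It assumes a comma delimiter and a column named "Annotation" as in the example above; the helper `read_annotations` and the sample data are hypothetical, and real output files may contain extra columns (for instance a protein identifier), which this sketch simply skips when they are not numeric.

```python
import csv
import io


def read_annotations(handle, delimiter=","):
    """Return a list of (labels, confidences) tuples, one per protein."""
    rows = []
    for row in csv.DictReader(handle, delimiter=delimiter):
        # Multiple annotations are separated by "|"; an empty cell gives no labels.
        labels = [label for label in (row.pop("Annotation") or "").split("|") if label]
        confidences = {}
        for column, value in row.items():
            try:
                confidences[column] = float(value)
            except (TypeError, ValueError):
                continue  # skip non-numeric columns such as identifiers
        rows.append((labels, confidences))
    return rows


# Sample data mirroring the example table above.
sample = io.StringIO(
    "Annotation,PVP,cell wall depolymerase,DNA-associated\n"
    "PVP|cell wall depolymerase,0.98,0.99,0.005\n"
    "DNA-associated,0.01,0.05,0.998\n"
)
rows = read_annotations(sample)
for labels, confidences in rows:
    print(labels, max(confidences, key=confidences.get))
```

For a real file, replace the `io.StringIO` object with `open(path, newline="")`.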

Hierarchical classification
