Antibody ESM2 Paired Model

Model Description

This model is a fine-tuned version of ESM2-3B for paired antibody sequences (heavy and light chains).

Key Features:

Trained on paired antibody sequences
Fine-tuned in two stages: 15% WC masking followed by 50% CDR masking
Input format: Heavy-Light chains separated by "-"
Output: 2560-dimensional embeddings
Optimized for antibody CDR region understanding

Preprocessing

Sequences were:

Combined as HEAVY-LIGHT (with "-" as the separator)
Tokenized with the ESM2 tokenizer
Annotated with CDR regions for masking
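
The training code is not part of this card, so the following is only a minimal sketch of what region-restricted masking could look like. It assumes that "WC" means the whole chain and that the percentages are masking rates. The helper `sample_mask_positions` and the CDR spans are hypothetical and chosen purely for illustration.

```python
import random

def sample_mask_positions(candidates, rate, rng=None):
    """Pick round(rate * n) distinct positions to mask (hypothetical helper)."""
    if rng is None:
        rng = random.Random()
    candidates = list(candidates)
    if not candidates:
        return []
    n_mask = max(1, round(rate * len(candidates)))
    return sorted(rng.sample(candidates, n_mask))

# Hypothetical CDR spans as (start, end) residue indices, end exclusive.
cdr_spans = [(25, 33), (50, 58), (96, 110)]
cdr_positions = [i for start, end in cdr_spans for i in range(start, end)]

rng = random.Random(0)
# Assumed stage 1: mask 15% of all positions in a 120-residue chain.
whole_chain_mask = sample_mask_positions(range(120), 0.15, rng)
# Assumed stage 2: mask 50% of the annotated CDR positions only.
cdr_mask = sample_mask_positions(cdr_positions, 0.50, rng)
```

In a real pipeline the selected positions would be replaced by the tokenizer's mask token before the forward pass; that step is omitted here.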

Usage

Loading the Model

from transformers import EsmModel, AutoTokenizer
import torch

# Load model and tokenizer
model = EsmModel.from_pretrained("NOC-Lab/AbCDR-ESM2")
tokenizer = AutoTokenizer.from_pretrained("NOC-Lab/AbCDR-ESM2")
model.eval()

Extract Embeddings

# Prepare paired sequence
SEP_TOKEN = "-" 
heavy_chain = (
    "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVAVISYDGSNKYYADSVKGRF"
    "TISADTSKNTAYLQMNSLRAEDTAVYYCAREGYYGSSYWYFDYWGQGTLVTVSS"
)
light_chain = (
    "DIQMTQSPSSLSASVGDRVTITCRASQSISSYLNWYQQKPGKAPKLLIYAASSLQSGVPSRFSGSGS"
    "GTDFTLTISSLQPEDFATYYCQQSYSTPLTFGGGTKVEIK"
)
paired_sequence = f"{heavy_chain}{SEP_TOKEN}{light_chain}"

# Tokenize
inputs = tokenizer(paired_sequence, return_tensors="pt", add_special_tokens=True)

# Extract embeddings
with torch.no_grad():
    outputs = model(**inputs)
    embeddings = outputs.last_hidden_state
    
# Mean pooling
mask = inputs["attention_mask"].unsqueeze(-1)
pooled = (embeddings * mask).sum(1) / mask.sum(1)

print(f"Embedding shape: {pooled.shape}")  # (1, 2560)
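
If per-chain embeddings are needed instead of one pooled vector, the token positions of each chain can be computed from the chain lengths. The sketch below assumes the tokenizer emits one token per residue, one token for the "-" separator, and wraps the sequence in a start and an end token. The helper `chain_token_slices` is hypothetical; it is worth confirming the layout with `tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())` before relying on it.

```python
def chain_token_slices(heavy_len, light_len):
    """Return (heavy_slice, light_slice) over token positions.

    Assumed layout: <cls> H1..Hn - L1..Lm <eos>
    """
    heavy = slice(1, 1 + heavy_len)            # skip the leading <cls>
    light_start = 1 + heavy_len + 1            # skip <cls>, heavy chain, "-"
    light = slice(light_start, light_start + light_len)
    return heavy, light

# Toy check on a token list that follows the assumed layout.
tokens = ["<cls>", "E", "V", "Q", "L", "V", "-", "D", "I", "Q", "<eos>"]
heavy_slice, light_slice = chain_token_slices(5, 3)
heavy_tokens = tokens[heavy_slice]
light_tokens = tokens[light_slice]
```

With the tensors from the example above, `embeddings[0, heavy_slice].mean(0)` and `embeddings[0, light_slice].mean(0)` would then give one vector per chain.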

Input Format

Required Format: HEAVY_CHAIN-LIGHT_CHAIN

Heavy and light chains must be separated by a hyphen (-)
Use standard single-letter amino acid codes
No spaces within the sequence
Replace uncommon residues with X

Example:

sequence = "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMS...-DIQMTQSPSSLSASVGDRVTITCRASQSISS..."
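
The formatting rules above can be applied with a small helper before tokenization. This is a minimal sketch, not part of the released code: the names `clean_chain` and `format_pair` are hypothetical, and treating every character outside the 20 standard amino acids as "uncommon" is an assumption.

```python
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")

def clean_chain(seq):
    """Remove whitespace, uppercase, and map non-standard residues to X."""
    seq = "".join(seq.split()).upper()
    return "".join(c if c in STANDARD_AA else "X" for c in seq)

def format_pair(heavy, light, sep="-"):
    """Build the HEAVY-LIGHT input string expected by the model."""
    heavy, light = clean_chain(heavy), clean_chain(light)
    if not heavy or not light:
        raise ValueError("both heavy and light chains are required")
    return f"{heavy}{sep}{light}"

example = format_pair("evq lv", "DIQUB")
```

Here the lowercase input is uppercased, the space is dropped, and the non-standard letters U and B are replaced by X.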

Output

Embedding dimension: 2560
Sequence length: Variable (up to ~1024 tokens including special tokens)
Format: PyTorch tensor
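
A quick length check before tokenization can catch pairs that would exceed the stated limit. The sketch below assumes one token per residue, one token for the separator, and two special tokens; the default of 1024 is taken from the approximate figure above, and both function names are hypothetical.

```python
def estimated_token_count(heavy, light):
    """Residues of both chains + 1 separator + 2 special tokens (assumed layout)."""
    return len(heavy) + len(light) + 3

def fits_context(heavy, light, max_tokens=1024):
    """True if the paired input is expected to fit within max_tokens."""
    return estimated_token_count(heavy, light) <= max_tokens

short_ok = fits_context("A" * 120, "C" * 110)
long_ok = fits_context("A" * 600, "C" * 500)
```

Typical variable-domain pairs are far below this limit, so the check mainly guards against malformed or concatenated inputs.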

Citation

If you use this model, please cite:

@article{talaei2025preferential,
  title={Preferential CDR masking in paired antibody language models improves binding affinity prediction},
  author={Talaei, Mahtab and Walker, Kenji C. and Hao, Boran and Jolley, Eliot and Jin, Yeping and Kozakov, Dima and Misasi, John and Vajda, Sandor and Paschalidis, Ioannis Ch. and Joseph-McCarthy, Diane},
  journal={bioRxiv},
  year={2025},
  doi={10.1101/2025.10.31.685149}
}

Contact

Maintainer: Network Optimization & Control (NOC) Lab
Email: mtalaei@bu.edu
GitHub: https://github.com/Mah-Tala/AbCDR-ESM
Paper: bioRxiv preprint (doi: 10.1101/2025.10.31.685149)

License

This model is released under the MIT License.

Acknowledgments

Base model: ESM2 by Meta AI
Data: OAS database

Note: For private repositories, you'll need to authenticate:

# Option 1: CLI login
huggingface-cli login

# Option 2: Environment variable
export HF_TOKEN="your_token_here"

Model size: 3B params
Tensor type: BF16 (Safetensors)

Model tree for NOC-Lab/AbCDR-ESM2

Base model: facebook/esm2_t36_3B_UR50D
Fine-tuned: this model