Antibody ESM2 Paired Model

Model Description

This model is a fine-tuned version of ESM2-3B for paired antibody sequences (heavy and light chains).

Key Features:

  • Trained on paired antibody sequences
  • 15% WC followed by 50% CDR fine-tuning
  • Input format: Heavy-Light chains separated by "-"
  • Output: 2560-dimensional embeddings
  • Optimized for antibody CDR region understanding

Preprocessing

Sequences were:

  1. Combined as: HEAVY-LIGHT (with "-" separator)
  2. Tokenized with ESM2 tokenizer
  3. CDR regions annotated for masking
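
The steps above can be sketched roughly as follows. This is a minimal illustration and not the authors' pipeline: the function name `pair_with_cdr_flags`, the span format (0-based start, exclusive end, per chain, as might come from an external numbering tool), and the toy sequence fragments are all assumptions made for the example.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # hypothetical format: 0-based start, exclusive end


def pair_with_cdr_flags(
    heavy: str,
    light: str,
    heavy_cdrs: List[Span],
    light_cdrs: List[Span],
    sep: str = "-",
) -> Tuple[str, List[bool]]:
    """Join the chains as HEAVY-LIGHT and flag CDR residues in paired coordinates."""
    paired = f"{heavy}{sep}{light}"
    flags = [False] * len(paired)

    # Heavy-chain spans keep their original indices.
    for start, end in heavy_cdrs:
        for i in range(start, end):
            flags[i] = True

    # Light-chain spans are shifted past the heavy chain and the separator.
    offset = len(heavy) + len(sep)
    for start, end in light_cdrs:
        for i in range(offset + start, offset + end):
            flags[i] = True

    return paired, flags


# Toy fragments, not real antibody chains.
paired, flags = pair_with_cdr_flags(
    "EVQLGFTFSSYAMS", "DIQMQSISSYLN", [(4, 12)], [(4, 10)]
)
```

The flags are indexed by residue position in the paired string, so they still need to be shifted to account for any special tokens the tokenizer adds before being used as a masking guide.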

Usage

Loading the Model

from transformers import EsmModel, AutoTokenizer
import torch

# Load model and tokenizer
model = EsmModel.from_pretrained("NOC-Lab/AbCDR-ESM2")
tokenizer = AutoTokenizer.from_pretrained("NOC-Lab/AbCDR-ESM2")
model.eval()

Extract Embeddings

# Prepare paired sequence
SEP_TOKEN = "-" 
heavy_chain = (
    "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVAVISYDGSNKYYADSVKGRF"
    "TISADTSKNTAYLQMNSLRAEDTAVYYCAREGYYGSSYWYFDYWGQGTLVTVSS"
)
light_chain = (
    "DIQMTQSPSSLSASVGDRVTITCRASQSISSYLNWYQQKPGKAPKLLIYAASSLQSGVPSRFSGSGS"
    "GTDFTLTISSLQPEDFATYYCQQSYSTPLTFGGGTKVEIK"
)
paired_sequence = f"{heavy_chain}{SEP_TOKEN}{light_chain}"

# Tokenize
inputs = tokenizer(paired_sequence, return_tensors="pt", add_special_tokens=True)

# Extract embeddings
with torch.no_grad():
    outputs = model(**inputs)
    embeddings = outputs.last_hidden_state
    
# Mean pooling over all attended tokens (special tokens included)
mask = inputs["attention_mask"].unsqueeze(-1)
pooled = (embeddings * mask).sum(1) / mask.sum(1)

print(f"Embedding shape: {pooled.shape}")  # (1, 2560)

Input Format

Required Format: HEAVY_CHAIN-LIGHT_CHAIN

  • Heavy and light chains must be separated by a hyphen (-)
  • Use standard single-letter amino acid codes
  • No spaces within the sequence
  • Uncommon residues should be replaced with X
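
One way to enforce these rules before tokenization is sketched below. The helper name `format_paired_input` is hypothetical, and two choices are assumptions of this sketch and not stated requirements of the model: input is upper-cased, and any letter outside the 20 standard amino acids is treated as "uncommon" and replaced with X.

```python
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard residues


def format_paired_input(heavy: str, light: str, sep: str = "-") -> str:
    """Return HEAVY-LIGHT with whitespace removed and uncommon residues set to X."""

    def clean(chain: str) -> str:
        # Drop spaces, tabs, and newlines, then normalise case (assumption).
        chain = "".join(chain.split()).upper()
        if not chain or not chain.isalpha():
            raise ValueError("each chain must be a non-empty string of letters")
        # Any letter outside the standard alphabet becomes X.
        return "".join(aa if aa in STANDARD_AA else "X" for aa in chain)

    return f"{clean(heavy)}{sep}{clean(light)}"


# Toy fragments: U, B, and Z are replaced; whitespace is stripped.
example = format_paired_input("evq lUB", "DIQ\nMZ")
```

Rejecting non-letter characters inside a chain keeps a stray hyphen from being mistaken for the chain separator.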

Example:

sequence = "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMS...-DIQMTQSPSSLSASVGDRVTITCRASQSISS..."

Output

  • Embedding dimension: 2560
  • Sequence length: Variable (up to ~1024 tokens including special tokens)
  • Format: PyTorch tensor
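
If per-chain embeddings are needed, the token positions of each chain can be computed from the chain lengths. The sketch below assumes a token layout of one leading special token, the heavy chain, a single separator token, the light chain, and one trailing special token; this layout is an assumption and is worth confirming with `tokenizer.convert_ids_to_tokens` on a real input. The function name `chain_token_slices` is hypothetical.

```python
from typing import Dict, Union


def chain_token_slices(heavy_len: int, light_len: int) -> Dict[str, Union[slice, int]]:
    """Locate each chain in the token axis under the assumed layout."""
    heavy_start = 1                      # position 0: leading special token (assumed)
    heavy_end = heavy_start + heavy_len
    light_start = heavy_end + 1          # skip the single "-" token (assumed)
    light_end = light_start + light_len
    return {
        "heavy": slice(heavy_start, heavy_end),
        "light": slice(light_start, light_end),
        "n_tokens": light_end + 1,       # plus the trailing special token (assumed)
    }


# Illustrative lengths only.
layout = chain_token_slices(heavy_len=120, light_len=107)
```

With these slices, something like `embeddings[0, layout["heavy"]].mean(0)` would pool the heavy chain alone, and `layout["n_tokens"]` can be compared against the approximate 1024-token limit listed above.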

Citation

If you use this model, please cite:

@article{talaei2025preferential,
  title={Preferential CDR masking in paired antibody language models improves binding affinity prediction},
  author={Talaei, Mahtab and Walker, Kenji C. and Hao, Boran and Jolley, Eliot and Jin, Yeping and Kozakov, Dima and Misasi, John and Vajda, Sandor and Paschalidis, Ioannis Ch. and Joseph-McCarthy, Diane},
  journal={bioRxiv},
  year={2025},
  doi={10.1101/2025.10.31.685149}
}

Contact

License

This model is released under the MIT License.

Acknowledgments

  • Base model: ESM2 by Meta AI
  • Data: OAS database

Note: For private repositories, you'll need to authenticate:

# Option 1: CLI login
huggingface-cli login

# Option 2: Environment variable
export HF_TOKEN="your_token_here"