AbCDR-ESMC: Antibody ESMC Paired Model

Model Description

This model is a fine-tuned version of ESMC-600M (ESM Cambrian) for paired antibody sequences (heavy and light chains).

Key Features:

Trained on paired antibody sequences
Fine-tuned with preferential CDR masking (50% CDR)
Input format: heavy and light chains joined by a "-" separator
Output: 1152-dimensional embeddings
Optimized for antibody CDR region understanding

Preprocessing

Sequences were preprocessed as follows:

Heavy and light chains combined as HEAVY-LIGHT (with "-" separator)
Uncommon amino acids replaced with X
Sequences tokenized with the ESMC tokenizer
CDR regions annotated for masking
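
The first two preprocessing steps can be sketched in plain Python. This is a minimal illustration, not the authors' pipeline: the helper name `prepare_paired_sequence` is hypothetical, and treating everything outside the 20 canonical residues as "uncommon" is an assumption. Tokenization and CDR annotation are not shown.

```python
# Illustrative sketch only; not the authors' preprocessing code.
# Assumption: "uncommon" means anything outside the 20 canonical amino acids.
CANONICAL_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
SEP_TOKEN = "-"


def prepare_paired_sequence(heavy: str, light: str) -> str:
    """Join heavy and light chains as HEAVY-LIGHT, mapping uncommon residues to X."""

    def clean(chain: str) -> str:
        # Normalize whitespace and case, then replace non-canonical residues
        chain = chain.strip().upper()
        return "".join(aa if aa in CANONICAL_AMINO_ACIDS else "X" for aa in chain)

    return f"{clean(heavy)}{SEP_TOKEN}{clean(light)}"


print(prepare_paired_sequence("EVQLU", "diqmb"))  # EVQLX-DIQMX
```

The resulting string can be passed directly to the tokenizer or to `ESMProtein(sequence=...)` as in the usage examples.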

Installation & Requirements

pip install torch
pip install safetensors
pip install huggingface_hub
pip install esm==3.1.4

Usage

Loading the Model

import os
import torch
from huggingface_hub import hf_hub_download
from esm.tokenization import get_esmc_model_tokenizers
from esm.models.esmc import ESMC
from safetensors import safe_open

# Configuration
REPO_ID = "NOC-Lab/AbCDR-ESMC"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load tokenizer and base model
tokenizer = get_esmc_model_tokenizers()
model = ESMC.from_pretrained("esmc_600m").to(device)

# Download fine-tuned weights
local_ckpt_path = hf_hub_download(
    repo_id=REPO_ID,
    filename="model.safetensors",
    token=os.getenv("HF_TOKEN", None)  # For private repos
)

# Load and rename state dict
original_state_dict = {}
with safe_open(local_ckpt_path, framework="pt") as sf:
    for key in sf.keys():
        original_state_dict[key] = sf.get_tensor(key)

# Remove "esmC_model." prefix
renamed_state_dict = {}
for key, value in original_state_dict.items():
    new_key = key.replace("esmC_model.", "") if key.startswith("esmC_model.") else key
    renamed_state_dict[new_key] = value

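# Load weights; strict=False tolerates key mismatches, so report them explicitly
load_result = model.load_state_dict(renamed_state_dict, strict=False)
print(f"Missing keys: {load_result.missing_keys}")
print(f"Unexpected keys: {load_result.unexpected_keys}")
model.eval()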

Extract Embeddings - Method 1 (High-Level API)

from esm.sdk.api import ESMProtein, LogitsConfig

SEP_TOKEN = "-"

# Example sequences
heavy_chain = (
    "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVAVISYDGSNKYYADSVKGRF"
    "TISADTSKNTAYLQMNSLRAEDTAVYYCAREGYYGSSYWYFDYWGQGTLVTVSS"
)
light_chain = (
    "DIQMTQSPSSLSASVGDRVTITCRASQSISSYLNWYQQKPGKAPKLLIYAASSLQSGVPSRFSGSGS"
    "GTDFTLTISSLQPEDFATYYCQQSYSTPLTFGGGTKVEIK"
)

# Combine with separator
paired_sequence = f"{heavy_chain}{SEP_TOKEN}{light_chain}"

# Create protein object and encode
protein = ESMProtein(sequence=paired_sequence)
protein_tensor = model.encode(protein)

# Get embeddings
logits_output = model.logits(
    protein_tensor,
    LogitsConfig(sequence=True, return_embeddings=True)
)

embeddings = logits_output.embeddings  # Shape: (1, seq_len, 1152)
logits = logits_output.logits.sequence  # Shape: (1, seq_len, 64)

print(f"Embeddings shape: {embeddings.shape}")  # (1, L, 1152)
print(f"Embeddings dtype: {embeddings.dtype}")  # float32

Extract Embeddings - Method 2 (Low-Level Direct)

# Tokenize sequence
seq_encoded = tokenizer(paired_sequence, return_tensors="pt")
seq_input_ids = seq_encoded["input_ids"].to(device)

# Forward pass
with torch.no_grad():
    outputs = model(sequence_tokens=seq_input_ids)

embeddings_direct = outputs.embeddings  # Shape: (1, seq_len, 1152)
logits_direct = outputs.sequence_logits  # Shape: (1, seq_len, 64)

print(f"Embeddings shape: {embeddings_direct.shape}")  # (1, L, 1152)
print(f"Embeddings dtype: {embeddings_direct.dtype}")  # bfloat16

Mean Pooling for Fixed-Size Representation

# Mean pooling over all token positions (including the special start/end tokens)
sequence_representation = embeddings_direct.mean(dim=1)  # (1, 1152)
print(f"Pooled embedding shape: {sequence_representation.shape}")

# Get interface embedding (at separator position)
# The tokenizer prepends a start token, so token indices are offset by 1
separator_pos = len(heavy_chain) + 1
interface_embedding = embeddings_direct[0, separator_pos, :]  # (1152,)
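
To pool each chain separately, residue positions need to be mapped to token positions. The sketch below shows one way to do that bookkeeping. It assumes one token per residue and a single special token prepended by the tokenizer; the helper `chain_token_spans` and its `n_prefix_tokens` parameter are hypothetical names introduced here for illustration.

```python
# Illustrative sketch; assumes one token per residue and a single leading
# special token (hypothetical parameter n_prefix_tokens=1).
from typing import NamedTuple


class ChainSpans(NamedTuple):
    heavy: slice      # token positions of the heavy chain
    separator: int    # token position of the "-" separator
    light: slice      # token positions of the light chain


def chain_token_spans(heavy: str, light: str, n_prefix_tokens: int = 1) -> ChainSpans:
    heavy_start = n_prefix_tokens
    separator = heavy_start + len(heavy)
    light_start = separator + 1
    return ChainSpans(
        heavy=slice(heavy_start, separator),
        separator=separator,
        light=slice(light_start, light_start + len(light)),
    )


spans = chain_token_spans("EVQL", "DIQ")
print(spans.heavy, spans.separator, spans.light)
# slice(1, 5, None) 5 slice(6, 9, None)
```

With the tensors from Method 2, `embeddings_direct[0, spans.heavy].mean(dim=0)` would then give a heavy-chain-only representation, and likewise for `spans.light`.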

Batch Processing

# Multiple sequences
sequences = [
    f"{heavy_chain}{SEP_TOKEN}{light_chain}",
    f"{heavy_chain[:100]}{SEP_TOKEN}{light_chain[:100]}",
]

# Tokenize with padding
batch_encoded = tokenizer(sequences, return_tensors="pt", padding=True)
batch_input_ids = batch_encoded["input_ids"].to(device)

# Forward pass
with torch.no_grad():
    batch_outputs = model(sequence_tokens=batch_input_ids)

batch_embeddings = batch_outputs.embeddings  # (batch_size, max_seq_len, 1152)
print(f"Batch embeddings shape: {batch_embeddings.shape}")
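
When sequences in a batch differ in length, a plain `.mean(dim=1)` also averages over padded positions. A possible remedy is masked mean pooling, sketched below on synthetic tensors. The function name `masked_mean_pool` is hypothetical, and the pad id used in the toy check is chosen only for the demo; in real use it should be read from the tokenizer.

```python
# Illustrative sketch; pad_token_id must come from the tokenizer in real use.
import torch


def masked_mean_pool(
    embeddings: torch.Tensor, input_ids: torch.Tensor, pad_token_id: int
) -> torch.Tensor:
    """Average embeddings over non-padding positions only."""
    embeddings = embeddings.float()  # avoid accumulating in bfloat16
    mask = (input_ids != pad_token_id).unsqueeze(-1).float()  # (B, L, 1)
    summed = (embeddings * mask).sum(dim=1)                   # (B, D)
    counts = mask.sum(dim=1).clamp(min=1)                     # (B, 1)
    return summed / counts


# Toy check with synthetic tensors (pad id 1 is used here purely for the demo)
ids = torch.tensor([[0, 5, 6, 2], [0, 5, 2, 1]])
emb = torch.ones(2, 4, 3)
emb[1, 3] = 100.0  # value at the padded position; it should be ignored
pooled = masked_mean_pool(emb, ids, pad_token_id=1)
print(pooled.shape)  # torch.Size([2, 3])
```

With the batch above, this could be called as `masked_mean_pool(batch_embeddings, batch_input_ids, tokenizer.pad_token_id)`, assuming the tokenizer exposes `pad_token_id` as Hugging Face tokenizers generally do.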

Input Format

Required Format: HEAVY_CHAIN-LIGHT_CHAIN

Heavy and light chains must be separated by a single hyphen (-)
Use standard single-letter amino acid codes
Do not include spaces in the sequence

Example:

sequence = "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMS...-DIQMTQSPSSLSASVGDRVTITCRASQSISS..."
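
A quick format check before inference can catch malformed inputs early. The sketch below is one way to express the rules above as a regular expression. The function name `is_valid_paired_input` is hypothetical, and restricting the alphabet to the 20 canonical residues plus X is an assumption based on the preprocessing description.

```python
# Illustrative sketch; the allowed alphabet (20 canonical residues plus X)
# is an assumption, not a constraint enforced by the tokenizer.
import re

_PAIRED_PATTERN = re.compile(r"[ACDEFGHIKLMNPQRSTVWYX]+-[ACDEFGHIKLMNPQRSTVWYX]+")


def is_valid_paired_input(sequence: str) -> bool:
    """Return True for HEAVY-LIGHT strings: one hyphen, no spaces, uppercase codes."""
    return _PAIRED_PATTERN.fullmatch(sequence) is not None


print(is_valid_paired_input("EVQLVESGG-DIQMTQSPS"))  # True
print(is_valid_paired_input("EVQLVESGG DIQMTQSPS"))  # False
```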

Output

Embeddings

Dimension: 1152 (ESMC hidden size)
Sequence length: Variable (up to model's max length)
Format: PyTorch tensor
Dtype:
- High-level API: float32
- Low-level API: bfloat16

Logits

Dimension: 64 (ESMC vocabulary size)
Format: PyTorch tensor
Dtype: bfloat16

Citation

If you use this model, please cite:

@article{talaei2025preferential,
  title={Preferential CDR masking in paired antibody language models improves binding affinity prediction},
  author={Talaei, Mahtab and Walker, Kenji C. and Hao, Boran and Jolley, Eliot and Jin, Yeping and Kozakov, Dima and Misasi, John and Vajda, Sandor and Paschalidis, Ioannis Ch. and Joseph-McCarthy, Diane},
  journal={bioRxiv},
  year={2025},
  doi={10.1101/2025.10.31.685149}
}

@article{hayes2025simulating,
  title={Simulating 500 million years of evolution with a language model},
  author={Hayes, Thomas and Rao, Roshan and Akin, Halil and Sofroniew, Nicholas J. and Oktay, Deniz and Lin, Zeming and Verkuil, Robert and Tran, Vincent Q. and Deaton, Jonathan and Wiggert, Marius and Badkundri, Rohil and Shafkat, Irhum and Gong, Jun and Derry, Alexander and Molina, Raul S. and Thomas, Neil and Khan, Yousuf A. and Mishra, Chetan and Kim, Carolyn and Bartie, Liam J. and Nemeth, Matthew and Hsu, Patrick D. and Sercu, Tom and Candido, Salvatore and Rives, Alexander},
  journal={Science},
  volume={387},
  number={6736},
  pages={850--858},
  year={2025},
  doi={10.1126/science.ads0018}
}

@misc{esm2024cambrian,
  author={{ESM Team}},
  title={ESM Cambrian: Revealing the mysteries of proteins with unsupervised learning},
  year={2024},
  publisher={EvolutionaryScale},
  url={https://evolutionaryscale.ai/blog/esm-cambrian}
}

Contact

Maintainer: Network Optimization & Control (NOC) Lab
Email: mtalaei@bu.edu
GitHub: https://github.com/Mah-Tala/AbCDR-ESM
Paper: bioRxiv preprint (doi: 10.1101/2025.10.31.685149)

License

This model is released under the MIT License.

Acknowledgments

Base model: ESMC (ESM Cambrian) by EvolutionaryScale
Data: OAS database

Note: For private repositories, you'll need to authenticate:

# Option 1: CLI login
huggingface-cli login

# Option 2: Environment variable
export HF_TOKEN="your_token_here"
