AbCDR-ESMC: Antibody ESMC Paired Model
Model Description
This model is a fine-tuned version of ESMC-600M (ESM Cambrian) for paired antibody sequences (heavy and light chains).
Key Features:
- Trained on paired antibody sequences
- 50% CDR fine-tuning
- Input format: Heavy-Light chains separated by "-"
- Output: 1152-dimensional embeddings
- Optimized for antibody CDR region understanding
Preprocessing
Sequences were preprocessed as follows:
- Heavy and light chains combined as HEAVY-LIGHT (with "-" separator)
- Uncommon amino acids replaced with X
- Tokenized with the ESMC tokenizer
- CDR regions annotated for masking
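The first two preprocessing steps can be sketched in plain Python. This is a minimal illustration rather than the authors' pipeline: the helper names `clean_chain` and `pair_chains` are hypothetical, and the sketch assumes that "uncommon" means any character outside the 20 canonical amino acids.

```python
# Hypothetical helpers; assumes "uncommon" = not one of the 20 canonical residues.
CANONICAL_AA = set("ACDEFGHIKLMNPQRSTVWY")


def clean_chain(seq: str) -> str:
    """Uppercase a chain and replace non-canonical residues with X."""
    return "".join(aa if aa in CANONICAL_AA else "X" for aa in seq.strip().upper())


def pair_chains(heavy: str, light: str, sep: str = "-") -> str:
    """Join the cleaned heavy and light chains as HEAVY-LIGHT."""
    return f"{clean_chain(heavy)}{sep}{clean_chain(light)}"


print(pair_chains("EVQLUES", "DIQMTB"))  # EVQLXES-DIQMTX
```

The resulting string can be passed to the tokenizer or wrapped in an ESMProtein as in the usage examples.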
Installation & Requirements
pip install torch
pip install safetensors
pip install huggingface_hub
pip install esm==3.1.4
Usage
Loading the Model
import os
import torch
from huggingface_hub import hf_hub_download
from esm.tokenization import get_esmc_model_tokenizers
from esm.models.esmc import ESMC
from safetensors import safe_open
# Configuration
REPO_ID = "NOC-Lab/AbCDR-ESMC"
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load tokenizer and base model
tokenizer = get_esmc_model_tokenizers()
model = ESMC.from_pretrained("esmc_600m").to(device)
# Download fine-tuned weights
local_ckpt_path = hf_hub_download(
    repo_id=REPO_ID,
    filename="model.safetensors",
    token=os.getenv("HF_TOKEN", None),  # For private repos
)
# Load and rename state dict
original_state_dict = {}
with safe_open(local_ckpt_path, framework="pt") as sf:
    for key in sf.keys():
        original_state_dict[key] = sf.get_tensor(key)
# Remove "esmC_model." prefix
renamed_state_dict = {}
for key, value in original_state_dict.items():
    new_key = key.replace("esmC_model.", "") if key.startswith("esmC_model.") else key
    renamed_state_dict[new_key] = value
# Load weights
model.load_state_dict(renamed_state_dict, strict=False)
model.eval()
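Because strict=False lets load_state_dict skip keys that do not match, a renaming mistake could go unnoticed and leave the base weights in place. The sketch below shows one way to compare key sets before loading. The helper names and the toy key names are hypothetical and are not the real ESMC parameter names.

```python
def strip_prefix(keys, prefix="esmC_model."):
    """Drop a leading prefix from each key, leaving other keys unchanged."""
    return [k[len(prefix):] if k.startswith(prefix) else k for k in keys]


def key_mismatch(model_keys, ckpt_keys):
    """Return (missing, unexpected) key lists, i.e. what strict=False would tolerate."""
    model_set, ckpt_set = set(model_keys), set(ckpt_keys)
    missing = sorted(model_set - ckpt_set)     # in the model, absent from the checkpoint
    unexpected = sorted(ckpt_set - model_set)  # in the checkpoint, unknown to the model
    return missing, unexpected


# Toy key names for illustration only.
model_keys = ["embed.weight", "layer0.weight", "head.bias"]
ckpt_keys = strip_prefix(
    ["esmC_model.embed.weight", "esmC_model.layer0.weight", "classifier.bias"]
)
missing, unexpected = key_mismatch(model_keys, ckpt_keys)
print(missing)     # ['head.bias']
print(unexpected)  # ['classifier.bias']
```

With the real objects, the inputs would be model.state_dict().keys() and renamed_state_dict.keys(). PyTorch's load_state_dict also returns an object with missing_keys and unexpected_keys fields that can be inspected after loading.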
Extract Embeddings - Method 1 (High-Level API)
from esm.sdk.api import ESMProtein, LogitsConfig
SEP_TOKEN = "-"
# Example sequences
heavy_chain = (
    "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVAVISYDGSNKYYADSVKGRF"
    "TISADTSKNTAYLQMNSLRAEDTAVYYCAREGYYGSSYWYFDYWGQGTLVTVSS"
)
light_chain = (
    "DIQMTQSPSSLSASVGDRVTITCRASQSISSYLNWYQQKPGKAPKLLIYAASSLQSGVPSRFSGSGS"
    "GTDFTLTISSLQPEDFATYYCQQSYSTPLTFGGGTKVEIK"
)
# Combine with separator
paired_sequence = f"{heavy_chain}{SEP_TOKEN}{light_chain}"
# Create protein object and encode
protein = ESMProtein(sequence=paired_sequence)
protein_tensor = model.encode(protein)
# Get embeddings
logits_output = model.logits(
    protein_tensor,
    LogitsConfig(sequence=True, return_embeddings=True),
)
embeddings = logits_output.embeddings # Shape: (1, seq_len, 1152)
logits = logits_output.logits.sequence # Shape: (1, seq_len, 64)
print(f"Embeddings shape: {embeddings.shape}") # (1, L, 1152)
print(f"Embeddings dtype: {embeddings.dtype}") # float32
Extract Embeddings - Method 2 (Low-Level Direct)
# Tokenize sequence
seq_encoded = tokenizer(paired_sequence, return_tensors="pt")
seq_input_ids = seq_encoded["input_ids"].to(device)
# Forward pass
with torch.no_grad():
    outputs = model(sequence_tokens=seq_input_ids)
embeddings_direct = outputs.embeddings # Shape: (1, seq_len, 1152)
logits_direct = outputs.sequence_logits # Shape: (1, seq_len, 64)
print(f"Embeddings shape: {embeddings_direct.shape}") # (1, L, 1152)
print(f"Embeddings dtype: {embeddings_direct.dtype}") # bfloat16
Mean Pooling for Fixed-Size Representation
# Mean pooling over sequence length
sequence_representation = embeddings_direct.mean(dim=1) # (1, 1152)
print(f"Pooled embedding shape: {sequence_representation.shape}")
# Get interface embedding (at separator position)
# Assumes the tokenizer prepends one <cls> token, so token index = residue index + 1
separator_pos = len(heavy_chain) + 1
interface_embedding = embeddings_direct[0, separator_pos, :] # (1152,)
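Token positions are offset from residue positions by any special tokens the tokenizer prepends. The sketch below computes token-coordinate ranges for each chain, assuming one leading <cls> token and one trailing <eos> token. The helper name `chain_token_slices` is hypothetical.

```python
def chain_token_slices(heavy_len: int, light_len: int, n_prefix_tokens: int = 1):
    """Return (heavy_slice, separator_index, light_slice) in token coordinates.

    n_prefix_tokens is the number of special tokens before the first residue
    (assumed to be 1, for <cls>).
    """
    heavy_start = n_prefix_tokens
    sep_index = heavy_start + heavy_len
    light_start = sep_index + 1
    return (
        slice(heavy_start, sep_index),
        sep_index,
        slice(light_start, light_start + light_len),
    )


heavy_sl, sep_idx, light_sl = chain_token_slices(heavy_len=3, light_len=2)
tokens = ["<cls>", "E", "V", "Q", "-", "D", "I", "<eos>"]
print(tokens[heavy_sl], tokens[sep_idx], tokens[light_sl])
# ['E', 'V', 'Q'] - ['D', 'I']
```

The same slices can index the embedding tensor, for example embeddings_direct[0, heavy_sl] for a per-chain representation of the heavy chain.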
Batch Processing
# Multiple sequences
sequences = [
    f"{heavy_chain}{SEP_TOKEN}{light_chain}",
    f"{heavy_chain[:100]}{SEP_TOKEN}{light_chain[:100]}",
]
# Tokenize with padding
batch_encoded = tokenizer(sequences, return_tensors="pt", padding=True)
batch_input_ids = batch_encoded["input_ids"].to(device)
# Forward pass
with torch.no_grad():
    batch_outputs = model(sequence_tokens=batch_input_ids)
batch_embeddings = batch_outputs.embeddings # (batch_size, max_seq_len, 1152)
print(f"Batch embeddings shape: {batch_embeddings.shape}")
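When sequences in a batch differ in length, a plain mean over the sequence axis also averages the padded positions. The sketch below shows masked mean pooling as one possible way to get fixed-size vectors from a padded batch. The function name `masked_mean_pool` is hypothetical, and the toy tensors stand in for real model outputs.

```python
import torch


def masked_mean_pool(embeddings: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Average embeddings over positions where mask is True.

    embeddings: (batch, seq_len, dim); mask: (batch, seq_len) boolean.
    Returns a float32 tensor of shape (batch, dim).
    """
    emb = embeddings.float()                    # avoid accumulating in bfloat16
    weights = mask.unsqueeze(-1).to(emb.dtype)  # (batch, seq_len, 1)
    summed = (emb * weights).sum(dim=1)
    counts = weights.sum(dim=1).clamp(min=1.0)  # guard against all-padding rows
    return summed / counts


# Toy example: the second sequence has one padded position.
emb = torch.tensor([[[1.0, 1.0], [3.0, 3.0]],
                    [[2.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[True, True], [True, False]])
pooled = masked_mean_pool(emb, mask)
print(pooled)  # rows: [2., 2.] and [2., 4.]
```

For the batch above, the mask could be built as batch_input_ids != tokenizer.pad_token_id, assuming the tokenizer exposes pad_token_id as Hugging Face tokenizers usually do. Special tokens such as <cls> and <eos> are still included unless they are masked out as well.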
Input Format
Required Format: HEAVY_CHAIN-LIGHT_CHAIN
- Heavy and light chains must be separated by a hyphen (-)
- Use standard single-letter amino acid codes
- No spaces in sequence
Example:
sequence = "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMS...-DIQMTQSPSSLSASVGDRVTITCRASQSISS..."
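The format rules above can be checked before tokenization. The following is a minimal sketch with a hypothetical helper, `split_paired`. It assumes the accepted alphabet is the 20 canonical residues plus X, which may be stricter than what the tokenizer itself accepts.

```python
import re

# Assumption: accept the 20 canonical residues plus X (used for uncommon residues).
_CHAIN_RE = re.compile(r"[ACDEFGHIKLMNPQRSTVWYX]+")


def split_paired(sequence: str):
    """Validate a HEAVY-LIGHT string and return (heavy, light)."""
    if any(ch.isspace() for ch in sequence):
        raise ValueError("sequence must not contain whitespace")
    parts = sequence.split("-")
    if len(parts) != 2:
        raise ValueError("expected exactly one '-' separator")
    for name, chain in zip(("heavy", "light"), parts):
        if not _CHAIN_RE.fullmatch(chain):
            raise ValueError(f"{name} chain is empty or has invalid characters")
    return parts[0], parts[1]


print(split_paired("EVQLVESGG-DIQMTQSP"))  # ('EVQLVESGG', 'DIQMTQSP')
```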
Output
Embeddings
- Dimension: 1152 (ESMC hidden size)
- Sequence length: Variable (up to model's max length)
- Format: PyTorch tensor
- Dtype:
- High-level API: float32
- Low-level API: bfloat16
Logits
- Dimension: 64 (ESMC vocabulary size)
- Format: PyTorch tensor
- Dtype: bfloat16
Citation
If you use this model, please cite:
@article{talaei2025preferential,
title={Preferential CDR masking in paired antibody language models improves binding affinity prediction},
author={Talaei, Mahtab and Walker, Kenji C. and Hao, Boran and Jolley, Eliot and Jin, Yeping and Kozakov, Dima and Misasi, John and Vajda, Sandor and Paschalidis, Ioannis Ch. and Joseph-McCarthy, Diane},
journal={bioRxiv},
year={2025},
doi={10.1101/2025.10.31.685149}
}
@article{hayes2025simulating,
title={Simulating 500 million years of evolution with a language model},
author={Hayes, Thomas and Rao, Roshan and Akin, Halil and Sofroniew, Nicholas J. and Oktay, Deniz and Lin, Zeming and Verkuil, Robert and Tran, Vincent Q. and Deaton, Jonathan and Wiggert, Marius and Badkundri, Rohil and Shafkat, Irhum and Gong, Jun and Derry, Alexander and Molina, Raul S. and Thomas, Neil and Khan, Yousuf A. and Mishra, Chetan and Kim, Carolyn and Bartie, Liam J. and Nemeth, Matthew and Hsu, Patrick D. and Sercu, Tom and Candido, Salvatore and Rives, Alexander},
journal={Science},
volume={387},
number={6736},
pages={850--858},
year={2025},
doi={10.1126/science.ads0018}
}
@misc{esm2024cambrian,
author={{ESM Team}},
title={ESM Cambrian: Revealing the mysteries of proteins with unsupervised learning},
year={2024},
publisher={EvolutionaryScale},
url={https://evolutionaryscale.ai/blog/esm-cambrian}
}
Contact
- Maintainer: Network Optimization & Control (NOC) Lab
- Email: mtalaei@bu.edu
- GitHub: https://github.com/Mah-Tala/AbCDR-ESM
- Paper: bioRxiv preprint (doi: 10.1101/2025.10.31.685149)
License
This model is released under the MIT License.
Acknowledgments
- Base model: ESMC (ESM Cambrian) by EvolutionaryScale
- Data: OAS database
Note: For private repositories, you'll need to authenticate:
# Option 1: CLI login
huggingface-cli login
# Option 2: Environment variable
export HF_TOKEN="your_token_here"
Base model: biohub/esmc-600m-2024-12