@@ -1,10 +1,98 @@
 ---
 license: apache-2.0
 base_model:
-- facebook/esm2_t12_35M_UR50D
-- EvolutionaryScale/esm3-sm-open-v1
 tags:
-- biology
-- contrastive-learning
-- proteins
----

 ---
 license: apache-2.0
+pipeline_tag: feature-extraction
+library_name: pytorch
 base_model:
+  - facebook/esm2_t12_35M_UR50D
+  - EvolutionaryScale/esm3-sm-open-v1
 tags:
+  - biology
+  - bioinformatics
+  - protein
+  - protein-embeddings
+  - contrastive-learning
+  - multimodal
+  - structure
+  - sequence
+  - sequence-segments
+  - pytorch
+---
+# CLSS (Contrastive Learning Sequence–Structure)
+CLSS is a **self-supervised, two-tower contrastive model** that **co-embeds protein sequences and protein structures into a shared latent space**, enabling unified analysis of protein space across modalities.
+**Links**
+- Hugging Face model repo: https://huggingface.co/guyyanai/CLSS
+- Code + examples (`clss-model`): https://github.com/guyyanai/CLSS
+- Paper (bioRxiv): https://doi.org/10.1101/2025.09.05.674454
+- Interactive CLSS viewer: https://gabiaxel.github.io/clss-viewer/
+---
+## Model description
+### Architecture (high level)
+CLSS follows a **two-tower architecture**:
+- **Sequence tower:** a trainable ESM2-like sequence encoder
+- **Structure tower:** a frozen ESM3 structure encoder
+- Each tower is followed by a lightweight **linear projection head** mapping into a shared embedding space, with **L2-normalized outputs**
+The result is a pair of embeddings (sequence and structure) that live in the **same latent space**, making cosine similarity directly comparable across modalities.
+The paper’s primary configuration uses **32-dimensional embeddings**; this repository also provides checkpoints with 8, 16, 64, and 128 dimensions.
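The projection and normalization step described above can be sketched in a few lines. This is a minimal NumPy illustration, not the CLSS implementation: the tower outputs are replaced by random feature vectors, the hidden sizes `SEQ_HIDDEN` and `STRUCT_HIDDEN` are placeholder values, the weights are untrained, and the names `ProjectionHead` and `cross_modal_similarity` are hypothetical. It also assumes each tower has already produced one pooled feature vector per protein.

```python
import numpy as np

EMBED_DIM = 32        # the paper's primary configuration, per this card
SEQ_HIDDEN = 480      # placeholder size for the sequence tower output (assumption)
STRUCT_HIDDEN = 1024  # placeholder size for the structure tower output (assumption)


def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit Euclidean length."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


class ProjectionHead:
    """Linear map from a tower's feature size into the shared space, then L2 normalization."""

    def __init__(self, in_dim: int, out_dim: int, rng: np.random.Generator):
        # Random weights stand in for trained parameters.
        self.weight = rng.normal(scale=in_dim ** -0.5, size=(in_dim, out_dim))
        self.bias = np.zeros(out_dim)

    def __call__(self, features: np.ndarray) -> np.ndarray:
        return l2_normalize(features @ self.weight + self.bias)


def cross_modal_similarity(seq_emb: np.ndarray, struct_emb: np.ndarray) -> np.ndarray:
    """Cosine similarity between every sequence and every structure embedding.

    Both inputs are unit-length, so the dot product is the cosine.
    """
    return seq_emb @ struct_emb.T


rng = np.random.default_rng(0)
seq_head = ProjectionHead(SEQ_HIDDEN, EMBED_DIM, rng)
struct_head = ProjectionHead(STRUCT_HIDDEN, EMBED_DIM, rng)

# Random stand-ins for pooled tower outputs of four proteins.
seq_emb = seq_head(rng.normal(size=(4, SEQ_HIDDEN)))
struct_emb = struct_head(rng.normal(size=(4, STRUCT_HIDDEN)))

similarity = cross_modal_similarity(seq_emb, struct_emb)
print(similarity.shape)  # (4, 4)
```

Because both heads write into the same unit sphere, row `i` of `similarity` ranks every structure against sequence `i` without any modality-specific rescaling.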
+### Training objective
+CLSS is trained with a **CLIP-style contrastive objective** that aligns **random sequence segments** with their corresponding **full-domain protein structures**.
+**No** hierarchical labels (e.g. ECOD or CATH) are used during training; structural and evolutionary organization emerges implicitly.
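As a rough illustration of what a CLIP-style objective over segment/structure pairs looks like, the sketch below samples a random contiguous segment and computes a symmetric cross-entropy loss over a batch similarity matrix. It is written in NumPy for readability and is not the authors' training code: the temperature value, the symmetric averaging, the uniform segment-length sampling, and the function names `random_segment` and `clip_style_loss` are all assumptions.

```python
import numpy as np


def random_segment(sequence: str, min_len: int, rng: np.random.Generator) -> str:
    """Return a random contiguous segment with at least min_len residues.

    Uniform sampling of length and start is an assumption made for illustration.
    """
    n = len(sequence)
    if n <= min_len:
        return sequence
    length = int(rng.integers(min_len, n + 1))    # upper bound is exclusive
    start = int(rng.integers(0, n - length + 1))
    return sequence[start:start + length]


def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit Euclidean length."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def clip_style_loss(seq_emb: np.ndarray, struct_emb: np.ndarray,
                    temperature: float = 0.07) -> float:
    """Symmetric cross-entropy over pairwise similarities.

    Row i of seq_emb and row i of struct_emb form a matched pair;
    all other rows in the batch act as negatives.
    The temperature is an assumed value, not taken from the paper.
    """
    seq_emb = l2_normalize(seq_emb)
    struct_emb = l2_normalize(struct_emb)
    logits = seq_emb @ struct_emb.T / temperature  # shape (batch, batch)

    def diagonal_cross_entropy(lg: np.ndarray) -> float:
        # Numerically stable log-softmax per row, then pick the matched pair.
        shifted = lg - lg.max(axis=1, keepdims=True)
        log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
        return float(-np.mean(np.diag(log_probs)))

    # Average the sequence-to-structure and structure-to-sequence directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))
```

The loss is low when each segment embedding is closest to its own domain's structure embedding and high when the pairing is scrambled, which is the pressure that pulls the two modalities into one space.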
+---
+## Files in this repository
+This Hugging Face repository contains multiple PyTorch Lightning checkpoints, differing only in **embedding dimensionality**:
+- `h8_r10.lckpt`   → 8-dimensional embeddings
+- `h16_r10.lckpt`  → 16-dimensional embeddings
+- `h32_r10.lckpt`  → 32-dimensional embeddings (paper default)
+- `h64_r10.lckpt`  → 64-dimensional embeddings
+- `h128_r10.lckpt` → 128-dimensional embeddings
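A checkpoint from the list above can be fetched with the `huggingface_hub` client. The sketch below is one possible way to do it: `hf_hub_download` is the library's real download function, while the helper names `checkpoint_filename` and `download_checkpoint` are hypothetical, and the filename pattern is simply inferred from the file list. Loading the downloaded file into a model is left to the `clss-model` library.

```python
REPO_ID = "guyyanai/CLSS"
AVAILABLE_DIMS = (8, 16, 32, 64, 128)


def checkpoint_filename(embedding_dim: int = 32) -> str:
    """Map an embedding size to the checkpoint filename listed in this repository."""
    if embedding_dim not in AVAILABLE_DIMS:
        raise ValueError(
            f"No checkpoint for dimension {embedding_dim}; choose from {AVAILABLE_DIMS}"
        )
    return f"h{embedding_dim}_r10.lckpt"


def download_checkpoint(embedding_dim: int = 32) -> str:
    """Download the checkpoint (or reuse the local cache) and return its local path."""
    # Imported lazily so the filename helper works without the dependency;
    # requires `pip install huggingface_hub` and network access.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=REPO_ID, filename=checkpoint_filename(embedding_dim))
```

Calling `download_checkpoint(32)` would retrieve the paper-default checkpoint; smaller sizes trade representational capacity for more compact embeddings.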
+---
+## How to use CLSS
+CLSS is intended to be used via the **`clss-model` Python library**, which provides:
+- Model loading from Lightning checkpoints
+- End-to-end inference examples
+- Scripts used for generating interactive protein space maps
+---
+## License
+The CLSS codebase is released under the **Apache 2.0 License**.
+The base models listed in the metadata (ESM2 and ESM3) are third-party dependencies distributed under their own licenses; please consult the repository for details.
+---
+## Citation
+If you use CLSS, please cite:
+```bibtex
+@article{Yanai2025CLSS,
+  title   = {Contrastive learning unites sequence and structure in a global representation of protein space},
+  author  = {Yanai, Guy and Axel, Gabriel and Longo, Liam M. and Ben-Tal, Nir and Kolodny, Rachel},
+  journal = {bioRxiv},
+  year    = {2025},
+  doi     = {10.1101/2025.09.05.674454},
+  url     = {https://doi.org/10.1101/2025.09.05.674454}
+}
+```