n0w0f committed on
Commit
283f249
·
verified ·
1 Parent(s): 7949a14

Update README for v2: NL queries, 1024 ctx, LaCLIP architecture

Files changed (1)
  1. README.md +158 -138
README.md CHANGED
@@ -1,111 +1,139 @@
1
- # MatText Aligned Embeddings: Multi-Modal Material Retrieval
2
 
3
- **A CLIP-style multi-modal embedding model that aligns 10 different material text representations into a shared 128-d vector space for cross-modal retrieval.**
4
 
5
- Query with *any* modality (composition, CIF, SLICES, natural language, z-matrix...) → retrieve materials with similar properties across *all* modalities.
 
 
 
 
 
 
 
 
 
 
6
 
7
  ## 🏗️ Architecture
8
 
9
  ```
10
- ┌────────────────────────────────────────────────────────────┐
- │                       MatTextEncoder                        │
- │                                                            │
- │ ┌────────────────────────────────────────────────────┐     │
- │ │ Shared Backbone: ModernBERT-base (150M params)     │     │
- │ │ - 8192 token context window (handles long CIFs)    │     │
- │ │ - Mean pooling → 768-d representation              │     │
- │ └────────────────────────────────────────────────────┘     │
- │                           │                                │
- │          ┌────────────────┼────────────────┐               │
- │          ▼                ▼                ▼               │
- │   ┌──────────────┐ ┌──────────────┐ ┌──────────────┐       │
- │   │ Projection   │ │ Projection   │ │ Projection   │ ...   │
- │   │ composition  │ │ cif_sym      │ │ slices       │       │
- │   │ 768→768→128  │ │ 768→768→128  │ │ 768→768→128  │       │
- │   └──────┬───────┘ └──────┬───────┘ └──────┬───────┘       │
- │          ▼                ▼                ▼               │
- │    128-d L2-norm    128-d L2-norm    128-d L2-norm         │
- │                                                            │
- │              ──── Shared Embedding Space ────              │
- └────────────────────────────────────────────────────────────┘
31
  ```
32
 
33
- ### Key Design Decisions
34
-
35
- | Decision | Choice | Rationale |
36
- |----------|--------|-----------|
37
- | Backbone | ModernBERT-base | 8192 ctx handles long CIFs; fast RoPE attention |
38
- | Projection | 2-layer MLP per modality | MultiMat recipe: modality-specific heads preserve specialization |
39
- | Embedding dim | 128 | Standard for contrastive learning; compact for FAISS |
40
- | Loss | AllPairsCLIP + Property-MSE | Aligns all N(N-1)/2 modality pairs; property regularization |
41
- | Temperature | Learnable (init 0.07) | CLIP standard; learned τ improves convergence |
42
-
43
- ## 📊 Modalities Supported
44
-
45
- | Modality | Column | Example | Query Type |
46
- |----------|--------|---------|------------|
47
- | Composition | `composition` | `Fe2O3` | "Find iron oxides" |
48
- | Atom Sequence | `atom_sequences` | `Fe Fe Fe O O O` | Element lists |
49
- | CIF (symmetrized) | `cif_symmetrized` | Full CIF text | Paste CIF data |
50
- | CIF (P1) | `cif_p1` | Full CIF in P1 | Paste CIF data |
51
- | Z-matrix | `zmatrix` | `Fe\nO 1 2.0\nO 1 2.0 2 90` | Internal coords |
52
- | Atom Seq++ | `atom_sequences_plusplus` | `Fe O 3.57 3.57 90 90` | Elements + lattice |
53
- | SLICES | `slices` | `Fe O 0 1 o o o` | SLICES encoding |
54
- | Crystal Text (LLM) | `crystal_text_llm` | `3.6 3.6 3.6\n90 90 90\nFe...` | Gruver format |
55
- | Local Environment | `local_env` | SMILES-like env | Local bonding |
56
- | Natural Language | `robocrys_rep` | "FeO crystallizes in..." | Plain English |
57
- | **Property Query** | property text | "bandgap: 1.5 eV" | Property search |
58
 
59
- ## 🧪 Training Recipe
60
 
61
- Based on these key papers:
 
 
 
 
 
62
 
63
- 1. **MultiMat** (AllPairsCLIP, [arxiv:2312.00111](https://arxiv.org/abs/2312.00111)): Sum of symmetric InfoNCE over all modality pairs
64
- 2. **MatExpert** ([arxiv:2410.21317](https://arxiv.org/abs/2410.21317)): Property↔structure contrastive alignment
65
- 3. **CrystalCLR** ([arxiv:2211.13408](https://arxiv.org/abs/2211.13408)): Composition similarity loss
66
- 4. **SupReMix** ([arxiv:2309.16633](https://arxiv.org/abs/2309.16633)): Property-label-aware soft contrastive
 
67
 
68
  ### Two-Phase Training
69
 
70
- **Phase 1 — Multi-modal alignment** (pretrain100k_v2, 50k samples):
71
- - AllPairsCLIP loss across all 10 modalities
72
- - Random modality sampling (4/10 per step) for VRAM efficiency
73
- - Each step aligns C(4,2)=6 modality pairs
74
 
75
- **Phase 2 — Property-conditioned alignment** (bandgap + form_energy, 50k samples):
76
- - Same CLIP loss + property similarity MSE loss
77
- - Property text "composition: Fe2O3 | bandgap: 2.1000" aligned with structure representations
78
- - Materials with similar property values cluster in embedding space
 
 
 
 
 
 
 
 
 
 
 
 
79
 
80
  ### Hyperparameters
81
 
82
- ```
83
  encoder: answerdotai/ModernBERT-base
84
  embed_dim: 128
85
- max_length: 512 tokens
86
- batch_size: 32 × 8 grad_accum = 256 effective
87
- learning_rate: 2e-5 (cosine decay, 10% warmup)
88
  temperature: learnable (init 0.07)
89
  epochs: 3 per phase
90
  optimizer: AdamW (weight_decay=0.01)
91
- fp16: True
92
  gradient_checkpointing: True
 
93
  ```
94
 
95
  ## 🚀 Quick Start
96
 
97
- ### Training
98
 
99
  ```bash
100
- pip install torch transformers datasets faiss-cpu huggingface_hub trackio
101
 
102
- # Local GPU
103
- python train_mattext_embeddings.py
104
 
105
- # HF Jobs (recommended: a10g-large, 24GB VRAM)
106
- # Set timeout to 6h
107
  ```
108
 
 
 
 
 
 
109
  ### Inference & Search
110
 
111
  ```python
@@ -113,119 +141,111 @@ import torch
113
  import faiss
114
  import json
115
  import numpy as np
116
- from transformers import AutoModel, AutoTokenizer
117
-
118
- # Load model
119
  from train_mattext_embeddings import MatTextEncoder, Config, search_vector_db
120
 
 
121
  config = Config()
122
  config.device = "cuda" if torch.cuda.is_available() else "cpu"
123
-
124
  model = MatTextEncoder(config)
125
  model.load_state_dict(torch.load("mattext-embeddings/model.pt", map_location=config.device))
126
- model = model.to(config.device)
127
- model.eval()
128
-
129
  tokenizer = AutoTokenizer.from_pretrained(config.encoder_name)
130
 
131
  # Load FAISS indices
132
  indices = {}
133
- for mod in ["composition", "crystal_text_llm", "slices", "cif_symmetrized"]:
134
  index = faiss.read_index(f"mattext-embeddings/faiss/{mod}.index")
135
  with open(f"mattext-embeddings/faiss/{mod}_metadata.json") as f:
136
  metadata = json.load(f)
137
  indices[mod] = {"index": index, "metadata": metadata}
138
-
139
- # Search!
140
- results = search_vector_db("Fe2O3", "composition", model, tokenizer, indices, config, k=5)
141
- for score, meta in results:
142
- print(f"Score: {score:.4f} | {meta['composition']}")
143
  ```
144
 
145
- ### Cross-Modal Query Examples
146
 
147
  ```python
148
- # Query by composition → find across all modalities
149
- search_vector_db("SiO2", "composition", model, tokenizer, indices, config)
 
 
 
150
 
151
- # Query by natural language → find materials
152
- search_vector_db("perovskite with high bandgap", "robocrys_rep", model, tokenizer, indices, config)
 
153
 
154
- # Query by SLICES representation
155
- search_vector_db("Si O 0 1 o o o", "slices", model, tokenizer, indices, config)
156
 
157
- # Query by CIF data
158
- search_vector_db("data_SiO2\n_symmetry P1\n...", "cif_symmetrized", model, tokenizer, indices, config)
159
 
160
- # Property-conditioned query
161
- search_vector_db("composition: Si | bandgap: 1.1200", "property", model, tokenizer, indices, config)
 
 
 
162
  ```
163
 
164
- ## 🔬 Evaluation Metrics
 
 
 
 
 
 
 
 
 
 
165
 
166
- Cross-modal Recall@k: for each material, embed in modality A, retrieve in modality B, check if correct match is in top-k.
167
 
168
- | Pair | R@1 | R@5 | R@10 |
169
- |------|-----|-----|------|
170
- | composition → crystal_text_llm | TBD | TBD | TBD |
- | composition → cif_symmetrized | TBD | TBD | TBD |
- | slices → crystal_text_llm | TBD | TBD | TBD |
- | robocrys_rep → composition | TBD | TBD | TBD |
174
 
175
  *Results populated after training.*
176
 
177
  ## 🧩 Extending: Graph Embeddings
178
 
179
- The architecture supports adding graph neural network (GNN) embeddings:
180
 
181
  ```python
182
- # Add a GNN projection head
183
- from torch_geometric.nn import SchNet, DimeNet # or CGCNN
184
 
185
  class GraphEncoder(nn.Module):
186
  def __init__(self, embed_dim=128):
187
  super().__init__()
188
- self.gnn = SchNet(hidden_channels=256, num_filters=128, num_interactions=6)
189
  self.proj = ModalityProjection(256, embed_dim)
190
 
191
  def forward(self, data):
192
- # data: PyG Data with pos, z (atomic numbers), batch
193
  h = self.gnn(data.z, data.pos, data.batch)
194
  return self.proj(h)
195
 
196
- # Add to MatTextEncoder:
197
- model.graph_encoder = GraphEncoder(config.embed_dim)
198
- model.projections["graph"] = model.graph_encoder.proj
199
-
200
- # Training: treat graph embeddings as another modality in AllPairsCLIP
201
  ```
202
 
203
- For graph embeddings, convert CIF → PyG Data (using `pymatgen` + `torch_geometric`):
204
- ```python
205
- from pymatgen.core import Structure
206
- from torch_geometric.data import Data
207
-
208
- def cif_to_graph(cif_string, cutoff=5.0):
209
- struct = Structure.from_str(cif_string, fmt="cif")
210
- # Get neighbors within cutoff
211
- neighbors = struct.get_all_neighbors(cutoff)
212
- # Build edge_index, pos, z ...
213
- return Data(z=atomic_numbers, pos=positions, edge_index=edge_index)
214
- ```
215
 
216
  ## 📚 References
217
 
218
- - **MatText**: [arxiv:2406.17295](https://arxiv.org/abs/2406.17295) — Dataset and text representations
- - **MultiMat**: [arxiv:2312.00111](https://arxiv.org/abs/2312.00111) — AllPairsCLIP for materials
- - **MatExpert**: [arxiv:2410.21317](https://arxiv.org/abs/2410.21317) — Property↔structure alignment
- - **CrystalCLR**: [arxiv:2211.13408](https://arxiv.org/abs/2211.13408) — Contrastive learning for crystals
- - **SupReMix**: [arxiv:2309.16633](https://arxiv.org/abs/2309.16633) — Property-aware hard negatives
- - **Symile**: [arxiv:2411.01053](https://arxiv.org/abs/2411.01053) — Total-correlation loss for M modalities
 
224
 
225
  ## 📄 License
226
 
227
  MIT
228
-
229
- ## 🔗 Dataset
-
- [n0w0f/MatText](https://huggingface.co/datasets/n0w0f/MatText) — 100k+ crystal structures in 10 text representations
 
1
+ # MatText Aligned Embeddings v2: Multi-Modal Material Retrieval with Natural Language Queries
2
 
3
+ **A CLIP-style multi-modal embedding model that aligns 10+ material text representations into a shared 128-d vector space. Query with natural language ("oxide with high bandgap"), composition, CIF, SLICES, or any modality → retrieve matching materials.**
4
 
5
+ ## 🆕 v2 Key Features
6
+
7
+ | Feature | v1 | v2 |
8
+ |---------|----|----|
9
+ | Context length | 512 tokens | **1024 tokens** (captures long CIFs) |
10
+ | Natural language queries | ❌ | **✅ "oxide with high bandgap"** |
11
+ | Property-aware retrieval | Basic | **LaCLIP-style diverse NL descriptions** |
12
+ | GPU optimization | fp16 / 24GB | **bf16 / 80GB A100 optimized** |
13
+ | Effective batch size | 256 | **288** |
14
+ | Modalities per step | 4 | **5** |
15
+ | Flash Attention 2 | ❌ | **✅ (auto-detect)** |
16
 
17
  ## 🏗️ Architecture
18
 
19
  ```
20
+ ┌──────────────────────────────────────────────────────────────────────┐
+ │                     MatTextEncoder (157M params)                     │
+ │                                                                      │
+ │ ┌────────────────────────────────────────────────────────────┐       │
+ │ │ Shared Backbone: ModernBERT-base (150M params, 8192 ctx)   │       │
+ │ │ Mean pooling → 768-d representation                        │       │
+ │ │ Gradient checkpointing + bf16                              │       │
+ │ └────────────────────────────────────────────────────────────┘       │
+ │                           │                                          │
+ │      ┌───────────┬────────┴────────┬────────────────┐                │
+ │      ▼           ▼                 ▼                ▼                │
+ │ ┌─────────┐ ┌──────────┐ ┌───────────────────┐ ┌──────────┐          │
+ │ │comp     │ │cif_sym   │ │nl_property_desc   │ │property  │ ...×12   │
+ │ │768→768  │ │768→768   │ │768→768→128        │ │768→768   │          │
+ │ │→128     │ │→128      │ │"oxide with high   │ │→128      │          │
+ │ │         │ │          │ │ bandgap" queries  │ │          │          │
+ │ └────┬────┘ └────┬─────┘ └─────────┬─────────┘ └────┬─────┘          │
+ │      ▼           ▼                 ▼                ▼                │
+ │   128-d L2     128-d L2         128-d L2         128-d L2            │
+ │                                                                      │
+ │                ──── Shared 128-d Embedding Space ────                │
+ │           (FAISS IndexFlatIP for cosine similarity search)           │
+ └──────────────────────────────────────────────────────────────────────┘
43
  ```
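+
+ Every projection head L2-normalizes its output, so inner product equals cosine similarity and a plain `IndexFlatIP` index per modality is enough. A minimal sketch of building one such index (random vectors stand in for real head outputs):
+
+ ```python
+ import faiss
+ import numpy as np
+
+ # Stand-in embeddings; in practice these come from a per-modality projection head.
+ embs = np.random.rand(1000, 128).astype(np.float32)
+ embs /= np.linalg.norm(embs, axis=1, keepdims=True)   # unit vectors -> inner product == cosine
+
+ index = faiss.IndexFlatIP(128)                        # exact inner-product index
+ index.add(embs)
+ scores, ids = index.search(embs[:1], 5)               # top-5 most similar materials
+ ```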
44
 
45
+ ### 12 Projection Heads
 
46
 
47
+ | # | Head | Input | Purpose |
48
+ |---|------|-------|---------|
49
+ | 1 | `composition` | "Fe2O3" | Formula queries |
50
+ | 2 | `atom_sequences` | "Fe Fe O O O" | Element list queries |
51
+ | 3 | `cif_symmetrized` | Full CIF | Paste CIF data |
52
+ | 4 | `cif_p1` | CIF in P1 | P1 space group CIF |
53
+ | 5 | `zmatrix` | Z-matrix coords | Internal coordinates |
54
+ | 6 | `atom_sequences_plusplus` | Elements + lattice | Atom sequence + cell |
55
+ | 7 | `slices` | SLICES encoding | Compact structure encoding |
56
+ | 8 | `crystal_text_llm` | Gruver format | Lattice + coords text |
57
+ | 9 | `local_env` | SMILES-like env | Local bonding environment |
58
+ | 10 | `robocrys_rep` | NL description | "FeO crystallizes in..." |
59
+ | 11 | **`nl_property_description`** | **Free-form NL** | **"oxide with high bandgap"** |
60
+ | 12 | `property` | Structured props | "bandgap: 2.1 eV" |
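+
+ For orientation, a minimal sketch of how the shared backbone and per-modality heads fit together (illustrative only; the real `MatTextEncoder` in `train_mattext_embeddings.py` may differ in details such as dropout or initialization):
+
+ ```python
+ import torch.nn as nn
+ import torch.nn.functional as F
+ from transformers import AutoModel
+
+ class ModalityProjection(nn.Module):
+     """2-layer MLP head: 768 -> 768 -> 128, L2-normalized."""
+     def __init__(self, in_dim=768, embed_dim=128):
+         super().__init__()
+         self.net = nn.Sequential(nn.Linear(in_dim, in_dim), nn.GELU(), nn.Linear(in_dim, embed_dim))
+
+     def forward(self, x):
+         return F.normalize(self.net(x), dim=-1)
+
+ class MatTextEncoderSketch(nn.Module):
+     """Shared ModernBERT backbone + one projection head per modality."""
+     def __init__(self, modalities, encoder_name="answerdotai/ModernBERT-base"):
+         super().__init__()
+         self.backbone = AutoModel.from_pretrained(encoder_name)
+         self.projections = nn.ModuleDict({m: ModalityProjection() for m in modalities})
+
+     def forward(self, input_ids, attention_mask, modality):
+         h = self.backbone(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
+         mask = attention_mask.unsqueeze(-1).float()
+         pooled = (h * mask).sum(1) / mask.sum(1).clamp(min=1e-6)   # mean pooling over tokens
+         return self.projections[modality](pooled)                  # 128-d, L2-normalized
+ ```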
61
+
62
+ ## 🔍 How NL Queries Work
63
+
64
+ The key innovation is a **LaCLIP-style** training approach ([arxiv:2305.20088](https://arxiv.org/abs/2305.20088)):
65
 
66
+ 1. **During Phase 2 training**, for each material with known properties (bandgap, formation energy), we generate **diverse natural language descriptions** from templates:
67
+ - `"A wide bandgap oxide suitable for UV applications, bandgap 3.20 eV"`
68
+ - `"TiO2: oxide semiconductor with wide band gap of 3.20 electron volts"`
69
+ - `"This binary oxide (TiO2) exhibits a wide bandgap of approximately 3.20 eV"`
70
+
71
+ 2. These NL descriptions are passed through a **dedicated `nl_property_description` projection head** and aligned with ALL structure modalities via InfoNCE.
72
 
73
+ 3. **At inference**, when you query `"oxide with high bandgap"`, the model maps it through the same NL head into the shared embedding space, and FAISS finds the nearest materials — those that were trained to be close to similar descriptions.
74
+
75
+ This is distinct from `robocrys_rep` (which describes crystal *structure*: "FeO crystallizes in the rock salt structure..."). The NL query head describes *properties* ("wide bandgap oxide").
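+
+ The description generator can be as simple as string templates over binned property values. A hypothetical sketch in the spirit of the examples above (the actual templates live in the training script and will differ):
+
+ ```python
+ import random
+
+ TEMPLATES = [
+     "A {gap_class} bandgap {family} suitable for {use}, bandgap {gap:.2f} eV",
+     "{formula}: {family} semiconductor with {gap_class} band gap of {gap:.2f} electron volts",
+     "This {family} ({formula}) exhibits a {gap_class} bandgap of approximately {gap:.2f} eV",
+ ]
+
+ def describe(formula: str, gap: float) -> str:
+     """Generate one diverse NL property description for a material."""
+     gap_class = "wide" if gap > 3.0 else "moderate" if gap > 1.0 else "narrow"
+     use = {"wide": "UV applications", "moderate": "visible-light absorption", "narrow": "IR detection"}[gap_class]
+     family = "oxide" if "O" in formula else "compound"   # crude proxy, for illustration only
+     return random.choice(TEMPLATES).format(formula=formula, gap=gap, gap_class=gap_class, family=family, use=use)
+
+ print(describe("TiO2", 3.20))   # e.g. "A wide bandgap oxide suitable for UV applications, bandgap 3.20 eV"
+ ```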
76
+
77
+ ## 🧪 Training Recipe
78
 
79
  ### Two-Phase Training
80
 
81
+ **Phase 1 — Multi-modal alignment** (pretrain100k_v2, 60k samples, 3 epochs):
+ - AllPairsCLIP loss across 10 modalities
+ - Random modality sampling (5/10 per step) — always includes composition + crystal_text_llm (see the sketch below)
+ - Effective batch size 288
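+
+ A rough sketch of the per-step modality sampling (the exact sampling logic in the training script may differ):
+
+ ```python
+ import random
+
+ ALL_MODALITIES = [
+     "composition", "atom_sequences", "cif_symmetrized", "cif_p1", "zmatrix",
+     "atom_sequences_plusplus", "slices", "crystal_text_llm", "local_env", "robocrys_rep",
+ ]
+ ALWAYS = ["composition", "crystal_text_llm"]
+
+ def sample_step_modalities(k=5):
+     rest = [m for m in ALL_MODALITIES if m not in ALWAYS]
+     return ALWAYS + random.sample(rest, k - len(ALWAYS))   # 5 modalities -> C(5,2) = 10 CLIP pairs per step
+ ```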
85
 
86
+ **Phase 2 — Property-conditioned + NL query alignment** (bandgap + formation_energy, 60k samples, 3 epochs):
87
+ - AllPairsCLIP loss (structure modalities)
88
+ - **NL description ↔ structure InfoNCE** (the key NL query loss)
89
+ - Property ↔ composition/crystal_text_llm InfoNCE ([MatExpert](https://arxiv.org/abs/2410.21317))
90
+ - SupReMix-style property similarity MSE ([arxiv:2309.16633](https://arxiv.org/abs/2309.16633))
91
+ - Loss weights: `L = L_clip + 0.3 * L_property + 0.5 * L_nl`
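+
+ Putting the pieces together, a minimal sketch of the Phase 2 objective (symmetric InfoNCE terms combined with the weights above; the SupReMix property-MSE term and exact reductions are omitted here and may differ in the actual script):
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ def info_nce(za, zb, temperature):
+     """Symmetric InfoNCE over a batch of matched, L2-normalized embeddings."""
+     logits = za @ zb.t() / temperature
+     targets = torch.arange(za.size(0), device=za.device)   # positives sit on the diagonal
+     return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
+
+ def phase2_loss(struct_embs, nl_emb, prop_emb, temperature):
+     """struct_embs: dict modality -> [B, 128]; nl_emb, prop_emb: [B, 128]."""
+     mods = list(struct_embs)
+     l_clip = sum(info_nce(struct_embs[a], struct_embs[b], temperature)        # AllPairsCLIP
+                  for i, a in enumerate(mods) for b in mods[i + 1:])
+     anchors = [m for m in ("composition", "crystal_text_llm") if m in struct_embs]
+     l_property = sum(info_nce(prop_emb, struct_embs[m], temperature) for m in anchors)
+     l_nl = sum(info_nce(nl_emb, struct_embs[m], temperature) for m in mods)   # NL <-> every structure modality
+     return l_clip + 0.3 * l_property + 0.5 * l_nl
+ ```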
92
+
93
+ ### Based On
94
+
95
+ | Paper | Contribution | ArXiv |
96
+ |-------|-------------|-------|
97
+ | **MultiMat** | AllPairsCLIP loss | [2312.00111](https://arxiv.org/abs/2312.00111) |
98
+ | **MatExpert** | Property↔structure InfoNCE | [2410.21317](https://arxiv.org/abs/2410.21317) |
99
+ | **LaCLIP** | LLM text augmentation for CLIP | [2305.20088](https://arxiv.org/abs/2305.20088) |
100
+ | **SupReMix** | Property-label-aware soft contrastive | [2309.16633](https://arxiv.org/abs/2309.16633) |
101
+ | **CrystalCLR** | Composition similarity | [2211.13408](https://arxiv.org/abs/2211.13408) |
102
 
103
  ### Hyperparameters
104
 
105
+ ```yaml
106
  encoder: answerdotai/ModernBERT-base
107
  embed_dim: 128
108
+ max_length: 1024 tokens
109
+ batch_size: 48 × 6 grad_accum = 288 effective
110
+ learning_rate: 2e-5 (phase 1), 1e-5 (phase 2)
111
  temperature: learnable (init 0.07)
112
  epochs: 3 per phase
113
  optimizer: AdamW (weight_decay=0.01)
114
+ precision: bf16 (A100) / fp16 (T4/V100)
115
  gradient_checkpointing: True
116
+ max_modalities_per_step: 5
117
  ```
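+
+ The learnable temperature is usually parameterized on a log scale for stability. A small sketch of how the config above could translate to PyTorch (a stand-in linear layer replaces the full encoder; the actual script may parameterize τ differently):
+
+ ```python
+ import math
+ import torch
+ import torch.nn as nn
+
+ logit_scale = nn.Parameter(torch.tensor(math.log(1.0 / 0.07)))   # learnable temperature, init 0.07
+ encoder = nn.Linear(768, 128)                                     # stand-in for MatTextEncoder
+ optimizer = torch.optim.AdamW(
+     list(encoder.parameters()) + [logit_scale], lr=2e-5, weight_decay=0.01
+ )
+ temperature = 1.0 / logit_scale.exp()                             # plug into the InfoNCE logits
+ ```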
118
 
119
  ## 🚀 Quick Start
120
 
121
+ ### Training (your GPU)
122
 
123
  ```bash
124
+ pip install torch transformers datasets faiss-cpu huggingface_hub trackio accelerate
125
 
126
+ # Optional but recommended for A100/H100:
127
+ pip install flash-attn --no-build-isolation
128
 
129
+ python train_mattext_embeddings.py
 
130
  ```
131
 
132
+ The script auto-detects:
133
+ - GPU capability (bf16 for Ampere+, fp16 otherwise)
134
+ - Flash Attention 2 availability
135
+ - CUDA vs CPU
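+
+ Roughly, that detection boils down to a few standard checks (a sketch; the actual logic in the script may differ):
+
+ ```python
+ import torch
+
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ use_bf16 = device == "cuda" and torch.cuda.is_bf16_supported()   # Ampere+ -> bf16, else fp16
+
+ try:
+     import flash_attn  # noqa: F401
+     attn_implementation = "flash_attention_2"
+ except ImportError:
+     attn_implementation = "sdpa"                                  # PyTorch fallback
+
+ print(device, "bf16" if use_bf16 else "fp16", attn_implementation)
+ ```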
136
+
137
  ### Inference & Search
138
 
139
  ```python
 
141
  import faiss
142
  import json
143
  import numpy as np
144
+ from transformers import AutoTokenizer
 
 
145
  from train_mattext_embeddings import MatTextEncoder, Config, search_vector_db
146
 
147
+ # Load
148
  config = Config()
149
  config.device = "cuda" if torch.cuda.is_available() else "cpu"
 
150
  model = MatTextEncoder(config)
151
  model.load_state_dict(torch.load("mattext-embeddings/model.pt", map_location=config.device))
152
+ model = model.to(config.device).eval()
 
 
153
  tokenizer = AutoTokenizer.from_pretrained(config.encoder_name)
154
 
155
  # Load FAISS indices
156
  indices = {}
157
+ for mod in ["composition", "crystal_text_llm", "slices", "cif_symmetrized", "robocrys_rep"]:
      index = faiss.read_index(f"mattext-embeddings/faiss/{mod}.index")
      with open(f"mattext-embeddings/faiss/{mod}_metadata.json") as f:
          metadata = json.load(f)
      indices[mod] = {"index": index, "metadata": metadata}
 
 
 
 
 
162
  ```
163
 
164
+ ### Query Examples
165
 
166
  ```python
167
+ # 🔍 Natural language property queries (THE KEY FEATURE)
168
+ search_vector_db("oxide with high bandgap", "nl_property_description", model, tokenizer, indices, config)
169
+ search_vector_db("stable ternary nitride", "nl_property_description", model, tokenizer, indices, config)
170
+ search_vector_db("narrow bandgap semiconductor for IR", "nl_property_description", model, tokenizer, indices, config)
171
+ search_vector_db("metallic binary compound", "nl_property_description", model, tokenizer, indices, config)
172
 
173
+ # 🧪 Composition queries
174
+ search_vector_db("Fe2O3", "composition", model, tokenizer, indices, config)
175
+ search_vector_db("BaTiO3", "composition", model, tokenizer, indices, config)
176
 
177
+ # 📖 Structure description queries
178
+ search_vector_db("perovskite with octahedral coordination", "robocrys_rep", model, tokenizer, indices, config)
179
 
180
+ # 📊 Structured property queries
181
+ search_vector_db("composition: TiO2 | bandgap: 3.2000", "property", model, tokenizer, indices, config)
182
 
183
+ # 🔬 CIF queries (paste your CIF)
184
+ search_vector_db("data_TiO2\n_symmetry P1\n_cell 4.59 4.59 2.96 90 90 90", "cif_symmetrized", ...)
185
+
186
+ # 🧬 SLICES queries
187
+ search_vector_db("Ti O 0 1 o o o", "slices", model, tokenizer, indices, config)
188
  ```
189
 
190
+ ## 📊 Evaluation Metrics
191
+
192
+ Cross-modal Recall@k on test set:
193
+
194
+ | Pair | R@1 | R@5 | R@10 | R@20 |
195
+ |------|-----|-----|------|------|
196
+ | composition → crystal_text_llm | TBD | TBD | TBD | TBD |
+ | composition → cif_symmetrized | TBD | TBD | TBD | TBD |
+ | composition → slices | TBD | TBD | TBD | TBD |
+ | slices → crystal_text_llm | TBD | TBD | TBD | TBD |
+ | robocrys_rep → composition | TBD | TBD | TBD | TBD |
201
 
202
+ NL Query Results:
203
 
204
+ | Query | Top-1 Match | Score |
205
+ |-------|------------|-------|
206
+ | "oxide with high bandgap" | TBD | TBD |
207
+ | "narrow bandgap semiconductor" | TBD | TBD |
208
+ | "stable binary oxide" | TBD | TBD |
 
209
 
210
  *Results populated after training.*
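+
+ Recall@k is computed by embedding each test material in modality A, searching the FAISS index built on modality B, and checking whether the matching material appears in the top k. A sketch, assuming queries and the index cover the same materials in the same order:
+
+ ```python
+ import numpy as np
+
+ def recall_at_k(query_embs, target_index, k=5):
+     """query_embs: [N, 128] L2-normalized array; target_index: FAISS index over the same N materials."""
+     _, nn_ids = target_index.search(query_embs.astype(np.float32), k)
+     gold = np.arange(len(query_embs))
+     return float((nn_ids == gold[:, None]).any(axis=1).mean())
+ ```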
211
 
212
  ## 🧩 Extending: Graph Embeddings
213
 
214
+ The architecture is plug-and-play for new modalities:
215
 
216
  ```python
217
+ # Add a GNN modality (assumes ModalityProjection is defined in train_mattext_embeddings.py)
+ import torch.nn as nn
+ from torch_geometric.nn import SchNet
+ from train_mattext_embeddings import ModalityProjection
219
 
220
  class GraphEncoder(nn.Module):
      def __init__(self, embed_dim=128):
          super().__init__()
+         self.gnn = SchNet(hidden_channels=256)
          self.proj = ModalityProjection(256, embed_dim)

      def forward(self, data):
          h = self.gnn(data.z, data.pos, data.batch)
          return self.proj(h)
229
 
230
+ # Register as a new modality
+ graph_encoder = GraphEncoder(embed_dim=128)
+ model.projections["graph"] = graph_encoder.proj
+ # It gets aligned automatically through AllPairsCLIP
 
 
233
  ```
234
 
235
+ ## 📦 Dataset
+
+ [n0w0f/MatText](https://huggingface.co/datasets/n0w0f/MatText) — 100k+ crystal structures in 10+ text representations
 
 
 
 
 
 
 
 
 
238
 
239
  ## 📚 References
240
 
241
+ - **MatText**: [arxiv:2406.17295](https://arxiv.org/abs/2406.17295)
242
+ - **MultiMat**: [arxiv:2312.00111](https://arxiv.org/abs/2312.00111)
243
+ - **MatExpert**: [arxiv:2410.21317](https://arxiv.org/abs/2410.21317)
244
+ - **LaCLIP**: [arxiv:2305.20088](https://arxiv.org/abs/2305.20088)
245
+ - **SupReMix**: [arxiv:2309.16633](https://arxiv.org/abs/2309.16633)
246
+ - **CrystalCLR**: [arxiv:2211.13408](https://arxiv.org/abs/2211.13408)
247
+ - **Symile**: [arxiv:2411.01053](https://arxiv.org/abs/2411.01053)
248
 
249
  ## 📄 License
250
 
251
  MIT