minishlab/potion-code-16M

@@ -1,92 +1,401 @@
 ---
-library_name: model2vec
 license: mit
-model_name: static-coderankembed-potion-code-16m-contrastive
 tags:
 - embeddings
 - static-embeddings
-- sentence-transformers
 ---
-# static-coderankembed-potion-code-16m-contrastive Model Card
-This [Model2Vec](https://github.com/MinishLab/model2vec) model is a distilled version of a Sentence Transformer. It uses static embeddings, allowing text embeddings to be computed orders of magnitude faster on both GPU and CPU. It is designed for applications where computational resources are limited or where real-time performance is critical. Model2Vec models are the smallest, fastest, and most performant static embedders available. The distilled models are up to 50 times smaller and 500 times faster than traditional Sentence Transformers.
 ## Installation
-Install model2vec using pip:
-```
 pip install model2vec
 ```
 ## Usage
-### Using Model2Vec
-The [Model2Vec library](https://github.com/MinishLab/model2vec) is the fastest and most lightweight way to run Model2Vec models.
-Load this model using the `from_pretrained` method:
 ```python
 from model2vec import StaticModel
-# Load a pretrained Model2Vec model
-model = StaticModel.from_pretrained("static-coderankembed-potion-code-16m-contrastive")
-# Compute text embeddings
-embeddings = model.encode(["Example sentence"])
 ```
-### Using Sentence Transformers
-You can also use the [Sentence Transformers library](https://github.com/UKPLab/sentence-transformers) to load and use the model:
-```python
-from sentence_transformers import SentenceTransformer
-# Load a pretrained Sentence Transformer model
-model = SentenceTransformer("static-coderankembed-potion-code-16m-contrastive")
-# Compute text embeddings
-embeddings = model.encode(["Example sentence"])
-```
-### Distilling a Model2Vec model
-You can distill a Model2Vec model from a Sentence Transformer model using the `distill` method. First, install the `distill` extra with `pip install model2vec[distill]`. Then, run the following code:
 ```python
-from model2vec.distill import distill
-# Distill a Sentence Transformer model, in this case the BAAI/bge-base-en-v1.5 model
-m2v_model = distill(model_name="BAAI/bge-base-en-v1.5", pca_dims=256)
-# Save the model
-m2v_model.save_pretrained("m2v_model")
-```
-## How it works
-Model2vec creates a small, fast, and powerful model that outperforms other static embedding models by a large margin on all tasks we could find, while being much faster to create than traditional static embedding models such as GloVe. Best of all, you don't need any data to distill a model using Model2Vec.
-It works by passing a vocabulary through a sentence transformer model, then reducing the dimensionality of the resulting embeddings using PCA, and finally weighting the embeddings using [SIF weighting](https://openreview.net/pdf?id=SyK00v5xx). During inference, we simply take the mean of all token embeddings occurring in a sentence.
-## Additional Resources
-- [Model2Vec Repo](https://github.com/MinishLab/model2vec)
-- [Model2Vec Base Models](https://huggingface.co/collections/minishlab/model2vec-base-models-66fd9dd9b7c3b3c0f25ca90e)
-- [Model2Vec Results](https://github.com/MinishLab/model2vec/tree/main/results)
-- [Model2Vec Docs](https://minish.ai/packages/model2vec/introduction)
-## Library Authors
-Model2Vec was developed by the [Minish Lab](https://github.com/MinishLab) team consisting of [Stephan Tulkens](https://github.com/stephantul) and [Thomas van Dongen](https://github.com/Pringled).
-## Citation
-Please cite the [Model2Vec repository](https://github.com/MinishLab/model2vec) if you use this model in your work.
 ```
 @software{minishlab2024model2vec,
   author       = {Stephan Tulkens and {van Dongen}, Thomas},
   title        = {Model2Vec: Fast State-of-the-Art Static Embeddings},
@@ -96,4 +405,4 @@ Please cite the [Model2Vec repository](https://github.com/MinishLab/model2vec) i
   url          = {https://github.com/MinishLab/model2vec},
   license      = {MIT}
 }
-```

 ---
+language:
+- code
 license: mit
+library_name: model2vec
 tags:
+- model2vec
 - embeddings
+- code
+- retrieval
 - static-embeddings
 ---
+# potion-code-16M Model Card
+## Overview
+**potion-code-16M** is a fast static code embedding model optimized for code retrieval tasks. It is distilled from [nomic-ai/CodeRankEmbed](https://huggingface.co/nomic-ai/CodeRankEmbed) and trained on the [CornStack](https://huggingface.co/datasets/nomic-ai/cornstack-python-v1) code corpus using [Tokenlearn](https://github.com/MinishLab/tokenlearn) and contrastive fine-tuning.
+It uses static embeddings, allowing text and code embeddings to be computed orders of magnitude faster than transformer-based models on both GPU and CPU.
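To make "static embeddings" concrete, the sketch below embeds a text by looking up one fixed vector per token and averaging the results. It is a toy illustration only: the three-entry `TABLE`, its values, and the whitespace tokenizer are hypothetical stand-ins, whereas the real model uses a subword tokenizer and a learned embedding matrix.

```python
# Toy embedding table; the tokens and values are made up for illustration.
TABLE = {
    "open": [1.0, 0.0],
    "file": [0.0, 1.0],
    "read": [1.0, 1.0],
}


def embed(text: str) -> list[float]:
    """Return the mean of the static vectors of all known tokens in `text`."""
    vectors = [TABLE[tok] for tok in text.lower().split() if tok in TABLE]
    if not vectors:
        # No known token: fall back to a zero vector of the table's width.
        return [0.0, 0.0]
    dims = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]


print(embed("open file"))  # [0.5, 0.5]
```

Because no transformer forward pass is involved, the cost per text is one table lookup per token plus an average.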
 ## Installation
+```bash
 pip install model2vec
 ```
 ## Usage
 ```python
 from model2vec import StaticModel
+model = StaticModel.from_pretrained("Pringled/potion-code-16M")
+# Embed natural language queries
+query_embeddings = model.encode(["How to read a file in Python?"])
+# Embed code documents
+code_embeddings = model.encode(["def read_file(path):\n    with open(path) as f:\n        return f.read()"])
 ```
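Once queries and code are embedded, retrieval typically reduces to ranking documents by similarity to the query. The sketch below assumes cosine similarity as the scoring function, which this card does not state explicitly, and uses small hypothetical vectors in place of the rows that `model.encode` would return.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_documents(query_vec: list[float], doc_vecs: list[list[float]]) -> list[int]:
    """Return document indices sorted from most to least similar to the query."""
    scores = [cosine(query_vec, doc) for doc in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


# Hypothetical 3-dimensional embeddings standing in for real model output.
query = [0.9, 0.1, 0.0]
docs = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.5, 0.5, 0.0]]
print(rank_documents(query, docs))  # [1, 2, 0]
```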
+## How it works
+potion-code-16M is created using the following pipeline:
+1. **Vocabulary mining**: up to 42k candidate code tokens are mined from CornStack, and those not already in the base CodeRankEmbed tokenizer are added to it (~62.5k tokens in total)
+2. **Distillation**: the extended vocabulary is distilled from CodeRankEmbed using Model2Vec (256-dimensional embeddings, PCA whitening)
+3. **Tokenlearn**: the distilled model is fine-tuned on 240k CornStack texts (20k documents and 20k queries for each of 6 languages) to match PCA-reduced teacher embeddings, using a cosine similarity loss
+4. **Contrastive fine-tuning**: the model is further fine-tuned using MultipleNegativesRankingLoss on 120k CornStack query-document pairs
+5. **Post-SIF re-regularization**: token weights are re-regularized using SIF weighting after each training stage
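The SIF weighting used in the steps above assigns each token the weight a / (a + p(w)), where p(w) is the token's probability and a is the SIF coefficient. The sketch below is a minimal illustration under one explicit assumption: token probabilities are approximated from vocabulary rank with a Zipf-like 1/rank curve. The function name `sif_weights` is hypothetical, and this is not the code behind `post_process_embeddings`.

```python
def sif_weights(vocab_size: int, sif_coefficient: float = 1e-4) -> list[float]:
    """SIF weight a / (a + p(w)) per token, with p(w) assumed Zipfian by rank."""
    # Assumption: the token at index i has probability proportional to 1 / (i + 2).
    inv_rank = [1.0 / rank for rank in range(2, vocab_size + 2)]
    total = sum(inv_rank)
    return [sif_coefficient / (sif_coefficient + value / total) for value in inv_rank]


weights = sif_weights(5)
print(weights)
```

Under this assumption, frequent tokens (low rank) receive small weights and rare tokens receive larger ones, so common tokens contribute less to the averaged embedding.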
+## Results
+Results on the [CoIR benchmark](https://github.com/CoIR-team/coir) (NDCG@10, `mteb>=2.10`):
+| Model | Params | AppsRetrieval | COIRCodeSearchNet | CodeFeedbackMT | CodeFeedbackST | CodeSearchNetCC | CodeTransContest | CodeTransDL | CosQA | StackOverflow | Text2SQL | **AVG** |
+|---|---|---|---|---|---|---|---|---|---|---|---|---|
+| CodeRankEmbed | 137M | - | - | - | - | - | - | - | - | - | - | - |
+| BM25 | — | 4.76 | 32.45 | 59.69 | 67.85 | 33.00 | 47.29 | 32.97 | 15.53 | 69.54 | 28.07 | 39.11 |
+| **potion-code-16M** | **16M** | **3.97** | **42.99** | **36.26** | **50.27** | **43.40** | **39.76** | **31.72** | **21.37** | **57.47** | **43.34** | **37.05** |
+*Results for CodeRankEmbed coming soon.*
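For readers unfamiliar with the metric, the sketch below computes NDCG@10 for a single query using the linear-gain formulation (relevance divided by log2 of rank + 1). It is an illustration of the metric only, not the evaluation code used by `mteb`; the function name and the example document ids are hypothetical.

```python
import math


def ndcg_at_k(ranked_ids: list[str], relevance: dict[str, int], k: int = 10) -> float:
    """NDCG@k for one query, given a ranking and graded relevance judgments."""
    # Discounted cumulative gain of the top-k retrieved documents.
    dcg = sum(
        relevance.get(doc_id, 0) / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
    )
    # Ideal DCG: the best achievable ordering of all judged documents.
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


judgments = {"doc_a": 1}
print(ndcg_at_k(["doc_a", "doc_b"], judgments))  # 1.0
print(ndcg_at_k(["doc_b", "doc_a"], judgments))
```

The table reports this value averaged over all queries of each task and multiplied by 100.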
+## Model Details
+| Property | Value |
+|---|---|
+| Parameters | ~16M |
+| Embedding dimensions | 256 |
+| Vocabulary size | ~62,500 |
+| Teacher model | nomic-ai/CodeRankEmbed |
+| Training corpus | CornStack (6 languages: Python, Java, JavaScript, Go, PHP, Ruby) |
+| Max sequence length | 1,000,000 tokens (static, no limit in practice) |
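The parameter count follows directly from the table: a static model is essentially one embedding matrix of vocabulary size × embedding dimensions. The short calculation below uses the approximate vocabulary size from the table and assumes float32 storage for the size estimate, which may not match the published weights.

```python
VOCAB_SIZE = 62_500       # approximate, from the table above
EMBEDDING_DIMS = 256
BYTES_PER_FLOAT32 = 4     # assumption: weights stored as float32

param_count = VOCAB_SIZE * EMBEDDING_DIMS
size_mb = param_count * BYTES_PER_FLOAT32 / 1_000_000

print(param_count)  # 16000000
print(size_mb)      # 64.0
```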
+## Additional Resources
+- [Model2Vec repository](https://github.com/MinishLab/model2vec)
+- [Tokenlearn repository](https://github.com/MinishLab/tokenlearn)
+- [CornStack dataset](https://huggingface.co/datasets/nomic-ai/cornstack-python-v1)
+- [CoIR benchmark](https://github.com/CoIR-team/coir)
+## Reproducibility
+The following script reproduces this model end-to-end. It requires the tokenlearn training data from `Pringled/cornstack-docs-tokenlearn` and `Pringled/cornstack-queries-tokenlearn`; 20k samples per language are taken from each dataset.
 ```python
+"""Reproduction script for potion-code-16M.
+Runs the full pipeline: distill → tokenlearn → contrastive fine-tuning.
+Requirements:
+    pip install model2vec tokenlearn sentence-transformers datasets skeletoken einops
+The three model checkpoints are saved to:
+    ./models/potion-code-16M-distilled
+    ./models/potion-code-16M-tokenlearn
+    ./models/potion-code-16M-contrastive  ← final model
+"""
+from __future__ import annotations
+import logging
+import random
+import numpy as np
+import torch
+from datasets import Dataset, concatenate_datasets, load_dataset
+from huggingface_hub import snapshot_download
+from model2vec import StaticModel
+from model2vec.distill import distill_from_model
+from model2vec.distill.inference import post_process_embeddings
+from pathlib import Path
+from sentence_transformers import (
+    SentenceTransformer,
+    SentenceTransformerTrainer,
+    SentenceTransformerTrainingArguments,
+)
+from sentence_transformers.losses import MultipleNegativesRankingLoss
+from sentence_transformers.models import StaticEmbedding
+from sentence_transformers.training_args import BatchSamplers
+from skeletoken import TokenizerModel
+from sklearn.decomposition import PCA
+from tokenlearn.losses import Loss
+from tokenlearn.model import StaticModelForFineTuning
+from tokenlearn.utils import create_vocab
+from transformers import AutoModel, AutoTokenizer
+logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
+logger = logging.getLogger(__name__)
+# ---------------------------------------------------------------------------
+# Hyperparameters
+# ---------------------------------------------------------------------------
+TEACHER_MODEL = "nomic-ai/CodeRankEmbed"
+OUTPUT_DIR = Path("models")
+# Distill
+VOCAB_SIZE = 42_000      # extra tokens mined from CornStack → ~62.5k total → ~16M params
+PCA_DIMS = 256
+SIF_COEFFICIENT = 1e-4
+# Tokenlearn
+TOKENLEARN_DOCS_DATASET = "Pringled/cornstack-docs-tokenlearn"
+TOKENLEARN_QUERIES_DATASET = "Pringled/cornstack-queries-tokenlearn"
+TOKENLEARN_LANGUAGES = ["go", "java", "javascript", "php", "python", "ruby"]
+TOKENLEARN_MAX_PER_LANGUAGE = 20_000   # 20k docs + 20k queries × 6 langs = 240k total
+TOKENLEARN_LR = 1e-3
+TOKENLEARN_MAX_EPOCHS = 20             # early stopping (patience=5) typically kicks in earlier
+TOKENLEARN_BATCH_SIZE = 128
+# Contrastive
+CORNSTACK_DATASETS = {
+    "python": "nomic-ai/cornstack-python-v1",
+    "java": "nomic-ai/cornstack-java-v1",
+    "php": "nomic-ai/cornstack-php-v1",
+    "go": "nomic-ai/cornstack-go-v1",
+    "javascript": "nomic-ai/cornstack-javascript-v1",
+    "ruby": "nomic-ai/cornstack-ruby-v1",
+}
+CONTRASTIVE_MAX_PER_LANGUAGE = 20_000  # 20k × 6 langs = 120k pairs total
+CONTRASTIVE_LR = 5e-3
+CONTRASTIVE_EPOCHS = 3
+CONTRASTIVE_BATCH_SIZE = 512
+CONTRASTIVE_SEED = 42
+# ---------------------------------------------------------------------------
+# Helpers
+# ---------------------------------------------------------------------------
+def apply_post_sif(model: StaticModel, pca_dims: int, sif_coefficient: float) -> StaticModel:
+    embeddings_np = model.embedding.astype(np.float32)
+    processed, weights = post_process_embeddings(
+        embeddings_np, pca_dims=pca_dims, sif_coefficient=sif_coefficient
+    )
+    logger.info("post_process_embeddings: %s → %s", embeddings_np.shape, processed.shape)
+    model.embedding = processed
+    model.weights = weights
+    return model
+# ---------------------------------------------------------------------------
+# Step 1: Distill
+# ---------------------------------------------------------------------------
+def run_distill(save_path: Path) -> None:
+    logger.info("Downloading %s ...", TEACHER_MODEL)
+    local_path = snapshot_download(TEACHER_MODEL)
+    model = AutoModel.from_pretrained(local_path, trust_remote_code=True)
+    tokenizer = AutoTokenizer.from_pretrained(local_path, trust_remote_code=True, use_fast=True)
+    # Load tokenlearn corpus texts for vocab mining (docs + queries, 20k/lang)
+    logger.info("Loading texts for vocabulary mining ...")
+    shards = []
+    for lang in TOKENLEARN_LANGUAGES:
+        docs = load_dataset(TOKENLEARN_DOCS_DATASET, name=lang, split=f"train[:{TOKENLEARN_MAX_PER_LANGUAGE}]")
+        queries = load_dataset(TOKENLEARN_QUERIES_DATASET, name=lang, split=f"train[:{TOKENLEARN_MAX_PER_LANGUAGE}]")
+        shards.extend([docs, queries])
+    corpus = concatenate_datasets(shards)
+    texts: list[str] = list(corpus["text"])
+    logger.info("Loaded %d texts for vocab mining.", len(texts))
+    logger.info("Mining vocabulary (target size=%d) ...", VOCAB_SIZE)
+    vocab = create_vocab(texts=texts, vocab_size=VOCAB_SIZE)
+    logger.info("Mined %d tokens.", len(vocab))
+    # Filter: keep only new single-token entries not already in CodeRankEmbed vocabulary.
+    tokenizer_model = TokenizerModel.from_transformers_tokenizer(tokenizer).prune_added_tokens()
+    preprocessor = tokenizer_model.preprocessor
+    seen = set(tokenizer_model.sorted_vocabulary)
+    filtered = []
+    for token in vocab:
+        preprocessed = preprocessor.preprocess(token)
+        if len(preprocessed) == 1 and preprocessed[0] not in seen:
+            seen.add(preprocessed[0])
+            filtered.append(preprocessed[0])
+    logger.info("Vocabulary after filtering: %d tokens added to CodeRankEmbed.", len(filtered))
+    # NomicBERT requires monkey-patched embedding accessors.
+    model.get_input_embeddings = lambda: model.embeddings.word_embeddings
+    model.set_input_embeddings = lambda v: setattr(model.embeddings, "word_embeddings", v)
+    logger.info("Distilling (pca_dims=%d, sif=%g) ...", PCA_DIMS, SIF_COEFFICIENT)
+    static_model = distill_from_model(
+        model=model,
+        tokenizer=tokenizer,
+        vocabulary=filtered,
+        pca_dims=PCA_DIMS,
+        sif_coefficient=SIF_COEFFICIENT,
+        pooling="mean",
+        quantize_to="float32",
+    )
+    save_path.mkdir(parents=True, exist_ok=True)
+    static_model.save_pretrained(str(save_path))
+    logger.info("Distilled model saved to %s  (vocab=%d, dims=%d)",
+                save_path, static_model.embedding.shape[0], static_model.embedding.shape[1])
+# ---------------------------------------------------------------------------
+# Step 2: Tokenlearn
+# ---------------------------------------------------------------------------
+def run_tokenlearn(base_model_path: Path, save_path: Path) -> None:
+    # Load 20k docs + 20k queries per language → 240k total
+    logger.info("Loading tokenlearn data (docs + queries, %d/lang × %d langs) ...",
+                TOKENLEARN_MAX_PER_LANGUAGE, len(TOKENLEARN_LANGUAGES))
+    shards = []
+    for lang in TOKENLEARN_LANGUAGES:
+        docs = load_dataset(TOKENLEARN_DOCS_DATASET, name=lang, split=f"train[:{TOKENLEARN_MAX_PER_LANGUAGE}]")
+        queries = load_dataset(TOKENLEARN_QUERIES_DATASET, name=lang, split=f"train[:{TOKENLEARN_MAX_PER_LANGUAGE}]")
+        shards.extend([docs, queries])
+    dataset = concatenate_datasets(shards)
+    logger.info("Total samples: %d", len(dataset))
+    train_txt: list[str] = list(dataset["text"])
+    train_vec = np.array(dataset["embedding"], dtype=np.float32)
+    non_nan_mask = ~np.isnan(train_vec).any(axis=1)
+    train_txt = np.array(train_txt)[non_nan_mask].tolist()
+    train_vec = train_vec[non_nan_mask]
+    logger.info("Loaded %d samples, raw vector shape: %s", len(train_txt), train_vec.shape)
+    logger.info("Fitting PCA to %d dims ...", PCA_DIMS)
+    pca = PCA(n_components=PCA_DIMS)
+    train_vec = pca.fit_transform(train_vec)
+    logger.info("Explained variance: %.4f. Shape: %s",
+                pca.explained_variance_ratio_.cumsum()[-1], train_vec.shape)
+    logger.info("Loading base model from %s ...", base_model_path)
+    base_model = StaticModel.from_pretrained(str(base_model_path), force_download=False)
+    if base_model.embedding.dtype != np.float32:
+        base_model.embedding = base_model.embedding.astype(np.float32)
+    trainable = StaticModelForFineTuning.from_static_model(
+        model=base_model,
+        out_dim=PCA_DIMS,
+        loss=Loss("cosine"),
+    )
+    logger.info("Training tokenlearn (lr=%g, max_epochs=%d, batch=%d) ...",
+                TOKENLEARN_LR, TOKENLEARN_MAX_EPOCHS, TOKENLEARN_BATCH_SIZE)
+    trainable.fit(
+        X=train_txt,
+        y=torch.from_numpy(train_vec.astype(np.float32)),
+        batch_size=TOKENLEARN_BATCH_SIZE,
+        learning_rate=TOKENLEARN_LR,
+        max_epochs=TOKENLEARN_MAX_EPOCHS,
+        early_stopping_patience=5,
+        use_wandb=False,
+    )
+    logger.info("Tokenlearn training complete.")
+    trained_model = trainable.to_static_model()
+    trained_model = apply_post_sif(trained_model, pca_dims=PCA_DIMS, sif_coefficient=SIF_COEFFICIENT)
+    save_path.mkdir(parents=True, exist_ok=True)
+    trained_model.save_pretrained(str(save_path))
+    logger.info("Tokenlearn model saved to %s", save_path)
+# ---------------------------------------------------------------------------
+# Step 3: Contrastive fine-tuning (MNRL)
+# ---------------------------------------------------------------------------
+def run_contrastive(base_model_path: Path, save_path: Path) -> None:
+    random.seed(CONTRASTIVE_SEED)
+    logger.info("Streaming CornStack pairs (%d/lang × %d langs) ...",
+                CONTRASTIVE_MAX_PER_LANGUAGE, len(CORNSTACK_DATASETS))
+    all_queries: list[str] = []
+    all_docs: list[str] = []
+    for lang, hf_name in CORNSTACK_DATASETS.items():
+        hf_ds = load_dataset(hf_name, split="train", streaming=True)
+        hf_ds = hf_ds.shuffle(seed=CONTRASTIVE_SEED, buffer_size=10_000)
+        kept = 0
+        seen_q: set[str] = set()
+        seen_d: set[str] = set()
+        for row in hf_ds:
+            q, d = row.get("query"), row.get("document")
+            if not isinstance(q, str) or not isinstance(d, str):
+                continue
+            if len(q) < 32 or len(d) < 32:
+                continue
+            if q in seen_q or d in seen_d:
+                continue
+            seen_q.add(q)
+            seen_d.add(d)
+            all_queries.append(q)
+            all_docs.append(d)
+            kept += 1
+            if kept >= CONTRASTIVE_MAX_PER_LANGUAGE:
+                break
+        logger.info("  %s: %d pairs", lang, kept)
+    logger.info("Total pairs: %d", len(all_queries))
+    train_dataset = Dataset.from_dict({"anchor": all_queries, "positive": all_docs})
+    static_embedding = StaticEmbedding.from_model2vec(str(base_model_path))
+    model = SentenceTransformer(modules=[static_embedding])
+    loss = MultipleNegativesRankingLoss(model)
+    training_args = SentenceTransformerTrainingArguments(
+        output_dir=str(save_path) + "-checkpoints",
+        num_train_epochs=CONTRASTIVE_EPOCHS,
+        per_device_train_batch_size=CONTRASTIVE_BATCH_SIZE,
+        learning_rate=CONTRASTIVE_LR,
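+        warmup_steps=0.1,  # a float below 1 is read as a warmup fraction by recent transformers releases; on older releases use warmup_ratio=0.1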
+        fp16=False,
+        bf16=False,
+        batch_sampler=BatchSamplers.NO_DUPLICATES,
+        save_strategy="no",
+        logging_steps=100,
+        logging_first_step=True,
+        report_to=[],
+    )
+    logger.info("Training contrastive (lr=%g, epochs=%d, batch=%d) ...",
+                CONTRASTIVE_LR, CONTRASTIVE_EPOCHS, CONTRASTIVE_BATCH_SIZE)
+    trainer = SentenceTransformerTrainer(
+        model=model, args=training_args, train_dataset=train_dataset, loss=loss,
+    )
+    trainer.train()
+    logger.info("Contrastive training complete.")
+    base_m2v = StaticModel.from_pretrained(str(base_model_path), force_download=False)
+    base_m2v.embedding = model[0].embedding.weight.detach().cpu().float().numpy()
+    final_model = apply_post_sif(base_m2v, pca_dims=PCA_DIMS, sif_coefficient=SIF_COEFFICIENT)
+    save_path.mkdir(parents=True, exist_ok=True)
+    final_model.save_pretrained(str(save_path))
+    logger.info("Final model saved to %s", save_path)
+# ---------------------------------------------------------------------------
+# Main
+# ---------------------------------------------------------------------------
+if __name__ == "__main__":
+    distilled_path = OUTPUT_DIR / "potion-code-16M-distilled"
+    tokenlearn_path = OUTPUT_DIR / "potion-code-16M-tokenlearn"
+    contrastive_path = OUTPUT_DIR / "potion-code-16M-contrastive"
+    logger.info("=== Step 1/3: Distill ===")
+    run_distill(save_path=distilled_path)
+    logger.info("=== Step 2/3: Tokenlearn ===")
+    run_tokenlearn(base_model_path=distilled_path, save_path=tokenlearn_path)
+    logger.info("=== Step 3/3: Contrastive ===")
+    run_contrastive(base_model_path=tokenlearn_path, save_path=contrastive_path)
+    logger.info("Done. Final model: %s", contrastive_path)
 ```
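Step 3 of the script relies on `MultipleNegativesRankingLoss`, which treats every other document in a batch as a negative for a given query. The sketch below illustrates that idea in plain Python as a cross-entropy over scaled cosine similarities. It is not the library's implementation; the `scale=20.0` value is assumed to mirror the library default, and the vectors are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mnrl_loss(anchors: list[list[float]], positives: list[list[float]], scale: float = 20.0) -> float:
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]."""
    losses = []
    for i, anchor in enumerate(anchors):
        # Similarity of this anchor to every positive in the batch.
        logits = [scale * cosine(anchor, positive) for positive in positives]
        # Numerically stable log-sum-exp over the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [aligned[1], aligned[0]]
print(mnrl_loss(anchors, aligned) < mnrl_loss(anchors, swapped))  # True
```

Larger batches supply more in-batch negatives per query, which is one plausible reason for the relatively large `CONTRASTIVE_BATCH_SIZE` in the script.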
+## Citation
+```bibtex
 @software{minishlab2024model2vec,
   author       = {Stephan Tulkens and {van Dongen}, Thomas},
   title        = {Model2Vec: Fast State-of-the-Art Static Embeddings},
   url          = {https://github.com/MinishLab/model2vec},
   license      = {MIT}
 }
+```