nomic-codesearch-onnx (INT8 Quantized)

This model is a fine-tuned version of nomic-ai/nomic-embed-text-v1.5 trained specifically for semantic code search on Python code snippets, then exported to ONNX and dynamically quantized to INT8 for efficient on-device execution (CPU/Mobile).

Quantization shrinks the model from 530 MB to 100 MB (about 5.3x smaller), while NDCG@10 on code search drops only from ~0.71 to ~0.68 (see Metrics). This makes it well suited to on-device deployment on Android, iOS, or other resource-constrained environments.
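The exact export script is not reproduced here. As a rough sketch, dynamic INT8 quantization of an ONNX file can be done with ONNX Runtime's `quantize_dynamic`, which also needs the `onnx` package installed. The helper names `size_ratio` and `quantize_to_int8` are hypothetical and are not part of this repository.

```python
from pathlib import Path


def size_ratio(before_bytes: int, after_bytes: int) -> float:
    """How many times smaller the quantized file is than the original."""
    return before_bytes / after_bytes


def quantize_to_int8(fp32_path: str, int8_path: str) -> float:
    """Write a dynamically quantized copy of an ONNX model; return the size ratio."""
    # Imported here so size_ratio stays usable without onnxruntime installed.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    quantize_dynamic(
        model_input=fp32_path,
        model_output=int8_path,
        weight_type=QuantType.QInt8,  # signed 8-bit weights
    )
    return size_ratio(Path(fp32_path).stat().st_size, Path(int8_path).stat().st_size)


# The sizes reported above: 530 MB before, 100 MB after.
print(f"{size_ratio(530, 100):.1f}x smaller")
```

A call such as `quantize_to_int8("model_fp32.onnx", "model_int8.onnx")` would then produce the quantized file; the FP32 file name is an assumption.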

Model Details

Base Model: nomic-ai/nomic-embed-text-v1.5 (137M parameters, 768-dimensional embeddings)
Fine-Tuning Dataset: code-search-net/code_search_net (Python split). Trained on 50,000 positive (docstring, function) pairs using Multiple Negatives Ranking Loss (MNR).
Training Acceleration: Apple Silicon (M4 MPS)
Export Format: ONNX (Opset 17)
Quantization: Dynamic INT8 Quantization (weights stored as QInt8, activations quantized on the fly at inference time)
Dimensions: 768 (trained with Matryoshka Representation Learning, so embeddings can be truncated to as few as 256 dimensions)
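A minimal sketch of Matryoshka-style truncation: keep the leading components and L2-normalise again. The function name `truncate_embeddings` is hypothetical. Whether plain truncation is enough for this export, or an extra layer normalisation is needed first as the base model's own usage notes describe, is an assumption to verify on your data.

```python
import numpy as np


def truncate_embeddings(embeddings: np.ndarray, dim: int = 256) -> np.ndarray:
    """Keep the first `dim` components of each row and L2-normalise again."""
    if not 1 <= dim <= embeddings.shape[1]:
        raise ValueError(f"dim must be between 1 and {embeddings.shape[1]}")
    truncated = embeddings[:, :dim]
    # Re-normalise so dot product still equals cosine similarity.
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.maximum(norms, 1e-12)
```

Queries and snippets must be truncated to the same size before they are compared.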

Metrics

Config	Size	Mean Cosine Drift	NDCG@10 (Code Search)
Baseline Model	530 MB	0.0	~0.48
Fine-Tuned FP32 ONNX	530 MB	0.0	~0.71
Fine-Tuned INT8 ONNX	100 MB	~0.07	~0.68
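The evaluation script is not included in this card. The sketch below shows one plausible way to compute the two metrics. It assumes "mean cosine drift" means the average of 1 minus the cosine similarity between the FP32 and INT8 embeddings of the same inputs, and that NDCG@10 uses the standard log2 discount. Both function names are hypothetical.

```python
import math

import numpy as np


def mean_cosine_drift(reference: np.ndarray, quantized: np.ndarray) -> float:
    """Average of (1 - cosine similarity) between matching rows."""
    ref = reference / np.maximum(np.linalg.norm(reference, axis=1, keepdims=True), 1e-12)
    qnt = quantized / np.maximum(np.linalg.norm(quantized, axis=1, keepdims=True), 1e-12)
    cosine = np.sum(ref * qnt, axis=1)
    return float(np.mean(1.0 - cosine))


def ndcg_at_k(ranked_relevance: list[float], k: int = 10) -> float:
    """NDCG@k for one query; ranked_relevance[i] is the relevance at rank i."""

    def dcg(rels: list[float]) -> float:
        # Rank 0 gets discount log2(2) = 1, rank 1 gets log2(3), and so on.
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels[:k]))

    ideal = dcg(sorted(ranked_relevance, reverse=True))
    return dcg(ranked_relevance) / ideal if ideal > 0 else 0.0
```

A dataset-level score would be the mean of `ndcg_at_k` over all queries.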

Python Quickstart

To run semantic code search or generate embeddings locally using this ONNX model:

1. Install Dependencies

pip install onnxruntime transformers numpy

2. Run Inference

import os
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# Load tokenizer and ONNX session
# Ensure config.json, tokenizer.json, vocab.txt, etc., are in the same directory
model_dir = "./"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
session = ort.InferenceSession(os.path.join(model_dir, "model_int8.onnx"))

def embed(texts: list[str], max_length: int = 512) -> np.ndarray:
    """Return L2-normalised sentence embeddings, shape (len(texts), 768)."""
    encoded = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="np",
    )
    outputs = session.run(
        ["sentence_embedding"],
        {
            "input_ids": encoded["input_ids"].astype(np.int64),
            "attention_mask": encoded["attention_mask"].astype(np.int64),
        },
    )
    embeddings = outputs[0]  # (batch, 768)
    # L2 normalise so dot-product == cosine similarity
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.maximum(norms, 1e-12)

# Embed query and snippets
snippets = [
    "def add(a, b): return a + b",
    "def binary_search(arr, target): ...",
    "SELECT * FROM users WHERE age > 18"
]
query = "function that sums two numbers"

query_emb = embed([query])
code_embs = embed(snippets)

# Calculate similarity (dot product of L2-normalized embeddings)
scores = (query_emb @ code_embs.T)[0]
for idx, score in enumerate(scores):
    print(f"[{score:.4f}] {snippets[idx]}")
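For a larger snippet collection you usually want only the best matches, not every score. A small sketch, assuming the embeddings are already L2-normalised as `embed` returns them; the helper name `top_k` is hypothetical.

```python
import numpy as np


def top_k(query_emb: np.ndarray, code_embs: np.ndarray, k: int = 3) -> list[tuple[int, float]]:
    """Return (snippet index, score) pairs for the k highest-scoring snippets."""
    # Accepts a query of shape (dim,) or (1, dim).
    scores = code_embs @ query_emb.reshape(-1)
    order = np.argsort(-scores)[:k]  # indices sorted by descending score
    return [(int(i), float(scores[i])) for i in order]
```

With the variables from the quickstart, `top_k(query_emb, code_embs, k=2)` returns the two closest snippets by index.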

On-Device Deployment (Android)

This model has been successfully deployed inside a native Android application using:

ONNX Runtime Android AAR (com.microsoft.onnxruntime:onnxruntime-android) for CPU inference.
Custom WordPiece Tokenizer in Kotlin (BertTokenizer.kt) to tokenize strings directly on-device, with no Python dependencies.
Coroutine-based asynchronous loading, so the 100 MB model is read in the background without blocking the UI thread.
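To illustrate what a WordPiece tokenizer does, here is a Python sketch of the greedy longest-match-first subword split. It is not a port of BertTokenizer.kt. It covers only the split of a single, already lowercased word and leaves out punctuation splitting and the special tokens. The function name and toy vocabulary are hypothetical.

```python
def wordpiece(word: str, vocab: set[str], unk_token: str = "[UNK]", max_chars: int = 100) -> list[str]:
    """Split one word into the longest vocabulary pieces, left to right."""
    if len(word) > max_chars:
        return [unk_token]
    pieces: list[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"un", "##aff", "##able"}
print(wordpiece("unaffable", toy_vocab))
```

Each piece is then mapped to its row index in vocab.txt to build the `input_ids` tensor.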

For complete Android source files (MainActivity, OnnxEmbedder, and BertTokenizer), please refer to the GitHub repository: CoderOMaster/nomic-codesearch-android.

License

This project is licensed under the Apache 2.0 License.
