code-daemon-embed-v1

A small, fast code embedding model purpose-built to vectorize a code graph (functions, methods, doc-chunks) for on-device semantic code search. It ships with the UltraCode MCP server, running as a TensorRT / TVM / OpenVINO / ONNX engine.

It is deliberately specialized for short code units, not long documents: long-text handling was intentionally dropped (max sequence 128 tokens) to maximize embedding throughput. Code-graph nodes are short (entity names, signatures, doc-chunks), so a long-context path would cost capacity and latency without ever being exercised on the hot path.

768-dim embeddings, Matryoshka (MRL) truncatable to 512 / 256 with graceful decay.
~54.5M params — XLM-RoBERTa architecture, 4 layers / 768 hidden, code-only 32k SentencePiece vocab.
Mean pooling baked into the graph — output is already pooled ([batch, 768]); just L2-normalize.
Trained at sequence length 128; length buckets s/m/l = seq 40 / 64 / 128.

How it was made

Knowledge-distilled (embedding regression) from the teacher nomic-ai/CodeRankEmbed (MIT, 137M, a strong code retriever). The student is a fresh, shallow-wide XLM-R encoder trained from scratch on the teacher's passage embeddings over a ~32M-sample code + text corpus, with a custom 32k code-oriented SentencePiece vocabulary (syntax + identifier lexicon rather than prose).
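
The exact regression loss is not stated here. As a minimal sketch, assuming a plain cosine-distance objective between student and teacher embeddings (the function names and the choice of loss are illustrative assumptions, not the published training recipe), the quantity being minimized could look like this:

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length, guarding against zero rows."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def distill_loss(student, teacher):
    """Mean cosine distance between student and teacher embeddings.

    Assumption: cosine distance stands in for the unspecified
    'embedding regression' loss. Both inputs have shape [batch, 768].
    """
    s = l2_normalize(student)
    t = l2_normalize(teacher)
    cosine = np.sum(s * t, axis=1)        # per-row cosine similarity
    return float(np.mean(1.0 - cosine))   # 0 when the student matches the teacher
```

A loss of 0 means the student reproduces the teacher's direction exactly; 2 means every vector points the opposite way.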

Why shallow-wide (4l/768h) + code vocab: on an internal code-search golden set this beat both a deeper 6-layer variant and the earlier 64k-prose-vocab cut — depth hurt, a code-tuned vocab and a wide body helped.

Built for speed

This model trades long-context capability for raw throughput on short code units:

Short context by design — max 128 tokens, no long-document path. Code-graph nodes are short (entity names, signatures, doc-chunks), so the model and its engines are tuned only for that, avoiding the cost of a wide dynamic shape range.
Rectangular TensorRT profiles — each length bucket is built with a fixed shape (min == opt == max), not a dynamic range, so the autotuner locks one optimal kernel set per bucket: s = batch 64 × seq 40 · m = batch 128 × seq 64 · l = batch 256 × seq 128.
INT8 (W8A16) weights; mean-pool + projection + L2-norm fused into the graph (one pass → [B, 768]).
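
To illustrate how fixed-shape buckets are fed, here is a rough sketch of routing tokenized inputs to a bucket and padding them to that bucket's exact shape. The bucket shapes come from the profiles listed above; the helper names `pick_bucket` and `pad_to_profile` are hypothetical, and the daemon's actual batching code may differ.

```python
import numpy as np

# Fixed (batch, seq) shape per bucket, taken from the profiles listed above.
BUCKETS = {"s": (64, 40), "m": (128, 64), "l": (256, 128)}

def pick_bucket(n_tokens):
    """Return the smallest bucket whose sequence length fits n_tokens."""
    for name, (_, seq) in BUCKETS.items():
        if n_tokens <= seq:
            return name
    return "l"  # assumption: longer inputs are truncated to 128 tokens

def pad_to_profile(token_ids, bucket, pad_id=0):
    """Pad a list of token-id lists to the bucket's exact [batch, seq] shape."""
    batch, seq = BUCKETS[bucket]
    if len(token_ids) > batch:
        raise ValueError("more rows than the bucket's fixed batch size")
    ids = np.full((batch, seq), pad_id, dtype=np.int64)
    for row, toks in enumerate(token_ids):
        toks = toks[:seq]                  # truncate anything too long
        ids[row, : len(toks)] = toks
    mask = (ids != pad_id).astype(np.int64)
    return ids, mask
```

Because the engine shape is fixed, a partially filled batch still runs at full size; in this sketch the unused rows are all padding and their outputs would simply be discarded by the caller.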

Intended use

Semantic code search / code retrieval, and general (multilingual) text retrieval as a fallback.
Embed queries and documents the same way (no instruction prefix: the student was distilled on passage embeddings, unlike the teacher, whose prefix is query-only). Pooling already happens inside the graph, so the only post-processing step is L2-normalization.
For smaller indexes, truncate to 256 or 512 dims (MRL) before normalizing.
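
As a small sketch of the truncation step and what it saves, the helpers below cut stored 768-dim vectors down to an MRL prefix, re-normalize, and estimate raw index size. The function names are illustrative, and `index_bytes` assumes plain float32 storage with no index overhead.

```python
import numpy as np

def truncate_mrl(vectors, dim):
    """Keep the first `dim` components, then L2-normalize each row."""
    if dim not in (256, 512, 768):
        raise ValueError("dim must be 256, 512 or 768")
    cut = vectors[:, :dim]
    norms = np.linalg.norm(cut, axis=1, keepdims=True)
    return cut / np.maximum(norms, 1e-12)

def index_bytes(n_vectors, dim, bytes_per_value=4):
    """Raw storage for n_vectors embeddings (float32 by default)."""
    return n_vectors * dim * bytes_per_value
```

For one million vectors, `index_bytes(1_000_000, 768)` is about 3.07 GB, while `index_bytes(1_000_000, 256)` is about 1.02 GB.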

The daemon runs the bundled engines directly (this repo is its CDN), but the FP32 model.onnx is also bundled for standalone use. The recipe below runs it with onnxruntime: tokenize with the bundled sentencepiece.bpe.model, run the session, and L2-normalize the output, which the graph has already pooled to [B, 768]:

import numpy as np
import onnxruntime as ort
import sentencepiece as spm

# Special token ids baked into the bundled tokenizer: pad=0 unk=1 bos=2 eos=3
sp   = spm.SentencePieceProcessor(model_file="sentencepiece.bpe.model")
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

def embed(texts, max_len=128, mrl_dim=768):
    """Return L2-normalized embeddings of shape [len(texts), mrl_dim]."""
    ids  = [[2, *sp.encode(t)[: max_len - 2], 3] for t in texts]          # bos … eos
    L    = max(len(x) for x in ids)
    inp  = np.array([x + [0] * (L - len(x)) for x in ids], dtype=np.int64) # pad=0
    mask = (inp != 0).astype(np.int64)
    out  = sess.run(None, {"input_ids": inp, "attention_mask": mask})[0]   # already mean-pooled [B,768]
    out  = out[:, :mrl_dim]                                                # MRL truncation (768/512/256)
    norm = np.linalg.norm(out, axis=1, keepdims=True)
    return out / np.maximum(norm, 1e-12)                                   # guard against zero vectors
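
Once vectors are L2-normalized, cosine similarity reduces to a dot product. The following is a minimal brute-force search sketch under that assumption; `top_k` is a hypothetical helper, not part of the daemon.

```python
import numpy as np

def top_k(query_vec, doc_vecs, k=5):
    """Return (indices, scores) of the k rows of doc_vecs most similar to query_vec.

    Both inputs are assumed to be L2-normalized already, so the dot
    product equals cosine similarity.
    """
    scores = doc_vecs @ query_vec          # [n_docs]
    k = min(k, len(scores))
    order = np.argsort(-scores)[:k]        # highest score first
    return order, scores[order]
```

With the `embed` function above, `doc_vecs = embed(snippets)` and `query_vec = embed([query])[0]` would feed this directly.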

What's in this repo — ready-to-run compiled engines

This repo holds pre-compiled, ready-to-run engines, named per runtime × GPU arch × OS × length-bucket — grab the compiled model that matches your runtime and hardware and use it directly, with no compilation on your machine.

TensorRT *.engine — NVIDIA, INT8 W8A16, per arch × OS × bucket: code-daemon-embed-v1-{s,m,l}_{win_x64,linux_x64}_trt_sm_{86,89,120}.engine (sm_86 ≈ RTX 30xx / A-series · sm_89 ≈ RTX 40xx / L4 · sm_120 ≈ RTX 50xx).
TVM *_tvm_vulkan.{dll,so} — Vulkan fallback for non-TRT / older NVIDIA & other GPUs, per bucket.
OpenVINO *.xml + *.bin — Intel CPU / iGPU / NPU, per bucket.
Metal *_tvm_metal.* — Apple Silicon (macOS), per bucket.
Tokenizer — sentencepiece.bpe.model (the model's SentencePiece; specials baked at pad=0 / unk=1 / bos=2 / eos=3, byte-fallback) + tokenizer_config.json. The daemon loads the SP directly.
ONNX source — model.onnx (+ model.onnx.data) FP32 and model_int8qdt.onnx (INT8 W8A16) — for standalone onnxruntime / optimum use, and the source the engines are compiled from.
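
The TensorRT naming pattern above can be turned into a small lookup helper. This is a sketch only: `trt_engine_name` and `sm_from_capability` are hypothetical names, and the allowed values are just those listed in the pattern.

```python
def sm_from_capability(major, minor):
    """Convert a CUDA compute capability such as (8, 6) to its sm number (86)."""
    return major * 10 + minor

def trt_engine_name(bucket, os_tag, sm):
    """Build a TensorRT engine filename following the pattern listed above."""
    if bucket not in ("s", "m", "l"):
        raise ValueError("bucket must be 's', 'm' or 'l'")
    if os_tag not in ("win_x64", "linux_x64"):
        raise ValueError("os_tag must be 'win_x64' or 'linux_x64'")
    if sm not in (86, 89, 120):
        raise ValueError("no prebuilt engine listed for this architecture")
    return f"code-daemon-embed-v1-{bucket}_{os_tag}_trt_sm_{sm}.engine"
```

For example, an RTX 40xx card on Linux with the medium bucket maps to `trt_engine_name("m", "linux_x64", 89)`.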

Evaluation — in-scope CoIR (sub-CoIR)

CoIR is a broad code-retrieval benchmark, but 4 of its 10 tasks are out of scope for a code-graph search engine (code↔code translation, multi-turn dialogue, long problem-statements — the daemon never performs these). The honest, relevant view is the in-scope subset — the retrieval patterns this model is actually built for (NDCG@10, full corpora):

CoIR task (in-scope)	NDCG@10	Pattern
codesearchnet (6-lang avg)	74.64	docstring / NL → code (the core path)
stackoverflow-qa	53.18	short question → code
synthetic-text2sql	50.15	NL → SQL
codefeedback-st	47.71	NL instruction → code
codesearchnet-ccr (6-lang avg)	44.30	code → related code (clone/dup)
cosqa	32.14	NL question → code (noisy / hard)
In-scope average (sub-CoIR)	51.56

codesearchnet per language (NL→code): python 91.96, go 82.27, java 76.02, php 68.98, ruby 65.94, js 62.66.
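
For readers unfamiliar with the metric, here is a sketch of NDCG@10 for a single query using the common linear-gain definition. The benchmark harness may differ in details such as tie handling, so treat this as an illustration of the formula rather than the exact evaluation code; the scores in the table are on a 0 to 100 scale.

```python
import math

def ndcg_at_k(ranked_relevance, all_relevance, k=10):
    """NDCG@k for one query with linear gain.

    ranked_relevance: relevance grades of the returned results, in rank order.
    all_relevance:    grades of every relevant document for the query,
                      used to build the ideal ranking.
    """
    def dcg(grades):
        # Rank i (0-based) is discounted by log2(i + 2).
        return sum(g / math.log2(i + 2) for i, g in enumerate(grades[:k]))

    ideal = dcg(sorted(all_relevance, reverse=True))
    return dcg(ranked_relevance) / ideal if ideal > 0 else 0.0
```

A single relevant document returned at rank 1 scores 1.0; the same document at rank 2 scores 1 / log2(3), roughly 0.63.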

The full 10-task official CoIR average (36.67) is dragged down by the 4 out-of-scope tasks and is not representative of the real query mix. For scale, the 1.5B-class bge-code-v1 scores 81.77 on full CoIR — this is a 54.5M model (27× smaller) tuned for one job.

On the daemon's own search-gold golden set (its real query distribution), hit@5 is 0.692, about 80% higher than the retired v1.1 cut (0.385). Binary (1-bit) vectors retain ~91% of float NDCG before rescore.
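
How the binary vectors are derived is not described here. A common approach, assumed in this sketch, is to keep only the sign of each component, shortlist by Hamming distance, and then rescore the shortlist with the float vectors. All function names are illustrative.

```python
import numpy as np

def binarize(vectors):
    """Pack the sign of each component into bits: 768 dims become 96 bytes."""
    return np.packbits(vectors > 0, axis=1)

def hamming_search(query_bits, doc_bits, n_candidates):
    """Indices of the n_candidates rows closest to the query in Hamming distance."""
    xor = np.bitwise_xor(doc_bits, query_bits)       # bits that differ
    dist = np.unpackbits(xor, axis=1).sum(axis=1)    # popcount per row
    return np.argsort(dist, kind="stable")[:n_candidates]

def search_with_rescore(query, docs, k=5, n_candidates=50):
    """Coarse binary shortlist, then exact rescoring with float dot products."""
    query_bits = binarize(query[None, :])[0]
    cand = hamming_search(query_bits, binarize(docs), n_candidates)
    scores = docs[cand] @ query                      # assumes L2-normalized inputs
    return cand[np.argsort(-scores)[:k]]
```

In a real index the document bits would be computed once and stored; they are recomputed per call here only to keep the sketch short.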

Performance (embeddings / sec)

Backend	Hardware	Throughput
TensorRT INT8	NVIDIA RTX 5060 (sm_120)	~20,000 emb/s
OpenVINO INT4	Intel iGPU (Xe2, Lunar Lake)	~580 emb/s
OpenVINO INT4	Intel NPU (NPU4)	~574 emb/s
OpenVINO INT8	Intel CPU (Core Ultra)	~375 emb/s
OpenVINO — all 3 in parallel	iGPU + NPU + CPU concurrently	~1,290 emb/s

The combined figure is genuine concurrent multi-device execution: three independent workers, one bound to each of the iGPU, NPU and CPU, embed different batches at the same time. The throughputs combine, though not perfectly: ~1,290 emb/s is about 84% of the ~1,529 emb/s sum of the three standalone figures. This is not OpenVINO's AUTO mode (which selects a single device per inference and never runs the three simultaneously); the daemon length-sorts inputs and fans the buckets across all three devices. The TensorRT figure is inference throughput on the bucketed batch path; the OpenVINO figures were measured on a Core Ultra (Lunar Lake) laptop.
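
The fan-out described above can be sketched with one worker thread per device pulling from a shared queue. This is an assumption about the scheduling policy, not the daemon's actual code: the device labels and the `embed_fns` callables are placeholders for whatever wraps each compiled model.

```python
import queue
import threading

def length_sorted_batches(token_lists, batch_size):
    """Sort inputs by token count and split them into batches of similar length."""
    ordered = sorted(token_lists, key=len)
    return [ordered[i : i + batch_size] for i in range(0, len(ordered), batch_size)]

def run_workers(batches, embed_fns):
    """Embed batches concurrently, one worker thread per device.

    embed_fns maps a device label to a callable(batch) -> result.
    Each worker takes the next batch as soon as it is free, so faster
    devices naturally process more batches.
    """
    todo = queue.Queue()
    for i, batch in enumerate(batches):
        todo.put((i, batch))
    results = [None] * len(batches)

    def worker(fn):
        while True:
            try:
                i, batch = todo.get_nowait()
            except queue.Empty:
                return                     # no work left for this device
            results[i] = fn(batch)

    threads = [threading.Thread(target=worker, args=(fn,)) for fn in embed_fns.values()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Results come back in batch order regardless of which device handled each batch, so the caller can map them back to the length-sorted inputs.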

License & training data

Released under the MIT license.

The teacher (nomic-ai/CodeRankEmbed) is MIT, and the XLM-R architecture is MIT. As is standard practice for distilled embedding models, the weights are released under MIT. For transparency, the training corpus the teacher embedded includes:

Dataset	License note
`Fsoft-AIC/the-vault-function` (code)	dataset MIT; underlying code has mixed upstream provenance
`unicamp-dl/mmarco` (EN/RU retrieval)	MS MARCO-derived → non-commercial research terms
`sentence-transformers/all-nli`	SNLI (CC BY-SA 4.0) + MultiNLI
`sentence-transformers/gooaq`	Apache-2.0
`jinaai/negation-dataset`	see source repo

⚠️ If your use requires strict training-data-license compliance, note that mMARCO derives from MS MARCO (non-commercial). Whether a distilled model inherits dataset-use terms is legally unsettled; this is not legal advice. A data-clean variant can be retrained without the mMARCO splits if needed.

Attribution

Distilled from nomic-ai/CodeRankEmbed (MIT). Backbone: XLM-RoBERTa (MIT).
