jamie8johnson committed on
Commit 151669b · verified · 1 Parent(s): 7119286

Add model card with v3.v2 eval results + usage

  1. README.md +98 -0
README.md ADDED
@@ -0,0 +1,98 @@
+ ---
+ license: mit
+ language:
+ - en
+ - code
+ tags:
+ - code-search
+ - embeddings
+ - onnx
+ - sentence-similarity
+ - cqs
+ library_name: sentence-transformers
+ pipeline_tag: sentence-similarity
+ base_model: nomic-ai/CodeRankEmbed
+ ---
+
+ # CodeRankEmbed (ONNX export)
+
+ ONNX export of [nomic-ai/CodeRankEmbed](https://huggingface.co/nomic-ai/CodeRankEmbed), a 137M-parameter code search embedder built on `Snowflake/snowflake-arctic-embed-m-long`. Exported for use with [cqs](https://github.com/jamie8johnson/cqs)'s ONNX Runtime embedding pipeline; no PyTorch dependency required.
+
+ This is a faithful conversion of the upstream weights, with no fine-tuning and no quantization. License and behavior match the upstream model.
+
+ ## Specs
+
+ - **Base:** `nomic-ai/CodeRankEmbed` (137M params, 768-dim, 8192 max seq)
+ - **Format:** ONNX (FP32)
+ - **Pooling:** Mean
+ - **Query prefix:** `Represent this query for searching relevant code: ` (required; see Usage)
+ - **Document prefix:** none
+
+ ## Production Eval (cqs v3.v2 fixture, 2026-05-01)
+
+ Run against cqs's production fixture (218 queries: 109 test + 109 dev) on the cqs codebase itself. Numbers reflect cqs's full hybrid-search stack (dense + FTS + SPLADE blend, name-boost, type-boost, MMR off). Bold marks the best score in each row:
+
+ | split | metric | BGE-large (1024-dim) | CodeRankEmbed (768-dim) | v9-200k (768-dim) |
+ |-------|--------|---------------------:|------------------------:|------------------:|
+ | test | R@1 | 43.1% | 42.2% | **45.9%** |
+ | test | R@5 | 69.7% | 67.9% | **70.6%** |
+ | test | R@20 | **83.5%** | 79.8% | 80.7% |
+ | dev | R@1 | 45.9% | **47.7%** | 46.8% |
+ | dev | R@5 | **77.1%** | 69.7% | 68.8% |
+ | dev | R@20 | **86.2%** | 81.7% | 81.7% |
+
+ **Verdict:** CodeRankEmbed edges out BGE-large on dev R@1 (47.7% vs. 45.9%), stays within two points of it on test R@1 and R@5, and trails on R@20 in both splits and on dev R@5 (69.7% vs. 77.1%). It is the best fit when you want a code-specialist embedder at less than half the BGE-large parameter count without giving up much on diverse natural-language queries. cqs ships it as an opt-in preset, not the default: set `CQS_EMBEDDING_MODEL=nomic-coderank` or run `cqs slot create coderank --model nomic-coderank`.
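For readers unfamiliar with the metric, R@k in the table is read here as the share of queries whose expected result appears among the top k hits. The sketch below assumes one gold result per query and uses hypothetical names (`recall_at_k`, `ranked`, `gold`); the actual cqs eval harness may compute the metric differently.

```python
from typing import Dict, List


def recall_at_k(ranked: Dict[str, List[str]], gold: Dict[str, str], k: int) -> float:
    """Fraction of queries whose gold id appears in the first k ranked ids."""
    if not gold:
        return 0.0
    hits = sum(1 for qid, target in gold.items() if target in ranked.get(qid, [])[:k])
    return hits / len(gold)


# Toy data for illustration only; not taken from the cqs fixture.
ranked = {
    "q1": ["a", "b", "c"],
    "q2": ["x", "y", "z"],
}
gold = {"q1": "b", "q2": "q"}

r_at_1 = recall_at_k(ranked, gold, k=1)
r_at_2 = recall_at_k(ranked, gold, k=2)
```

On a 109-query split, a single query moves the score by about 0.9 percentage points, so gaps of one or two points in the table correspond to one or two queries.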
+
+ ## Usage
+
+ ### With cqs
+
+ ```bash
+ # Full reindex with this model
+ export CQS_EMBEDDING_MODEL=nomic-coderank
+ cqs index --force
+
+ # Or, for slot-based comparisons:
+ cqs slot create coderank --model nomic-coderank
+ cqs index --slot coderank --force
+ ```
+
+ cqs handles the query-prefix wiring automatically. Documents are encoded without a prefix per the upstream convention.
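Outside cqs, the same convention has to be applied by hand. A minimal sketch, using hypothetical helper names (`format_query`, `format_document`) that are not part of cqs or the upstream model:

```python
# Prefix string as listed in the Specs section, including the trailing space.
QUERY_PREFIX = "Represent this query for searching relevant code: "


def format_query(text: str) -> str:
    """Prepend the required query prefix, without doubling it if already present."""
    return text if text.startswith(QUERY_PREFIX) else QUERY_PREFIX + text


def format_document(code: str) -> str:
    """Documents are embedded as-is, with no prefix."""
    return code
```

Making `format_query` idempotent is a design choice for this sketch: it avoids a doubled prefix when a caller passes an already-prefixed string.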
+
+ ### Direct ONNX
+
+ ```python
+ import numpy as np
+ import onnxruntime as ort
+ from transformers import AutoTokenizer
+
+ # Load the exported graph and the upstream tokenizer.
+ ort_session = ort.InferenceSession("model.onnx")
+ tokenizer = AutoTokenizer.from_pretrained("nomic-ai/CodeRankEmbed")
+
+ # Query prefix is REQUIRED
+ query = "Represent this query for searching relevant code: find functions that validate email addresses"
+ code = "def validate_email(addr): ..."  # no prefix on documents
+
+ q_inputs = tokenizer(query, return_tensors="np", padding=True, truncation=True, max_length=8192)
+ # Feed only the inputs the graph declares; the tokenizer may emit extras such as token_type_ids.
+ input_names = {i.name for i in ort_session.get_inputs()}
+ q_out = ort_session.run(None, {k: v for k, v in q_inputs.items() if k in input_names})
+ # Mean-pool over the token dimension and L2-normalize for cosine similarity.
+ ```
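The pooling step named in the final comment can be sketched with NumPy as follows. This assumes the first ONNX output holds per-token hidden states with shape `(batch, seq_len, dim)` and that the attention mask marks padding with zeros; neither is confirmed by this model card. `mean_pool_normalize` is a hypothetical helper name, and random arrays stand in for real model output.

```python
import numpy as np


def mean_pool_normalize(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Masked mean over the token axis, then L2-normalize each row."""
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # guard against all-padding rows
    pooled = summed / counts
    norms = np.clip(np.linalg.norm(pooled, axis=1, keepdims=True), 1e-12, None)
    return pooled / norms


# Stand-in data: 2 sequences, 4 tokens, 768 dims; the second has one padding token.
rng = np.random.default_rng(0)
hidden = rng.standard_normal((2, 4, 768)).astype(np.float32)
mask = np.array([[1, 1, 1, 1], [1, 1, 1, 0]], dtype=np.int64)

emb = mean_pool_normalize(hidden, mask)
# With unit-length rows, cosine similarity reduces to a dot product.
similarity = float(emb[0] @ emb[1])
```

With the snippet above, the call would look like `mean_pool_normalize(q_out[0], q_inputs["attention_mask"])`, again assuming the first output holds token-level states.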
+
+ ## License
+
+ MIT, inherited from the upstream `nomic-ai/CodeRankEmbed` model.
+
+ ## Citation
+
+ Please cite the upstream model:
+
+ ```
+ @misc{nomic-coderank-embed,
+ author = {Nomic AI},
+ title = {CodeRankEmbed},
+ year = {2024},
+ publisher = {HuggingFace},
+ url = {https://huggingface.co/nomic-ai/CodeRankEmbed}
+ }
+ ```