tarsur909 committed · Commit 28ddab3 · verified · 1 Parent(s): 1b6c197

Update README.md

Files changed (1): README.md +8 -0

README.md CHANGED
@@ -20,6 +20,8 @@ base_model:
 | Voyage-Code-002 | Unknown | 68.5 | 56.3 |
 
+We release the scripts to evaluate our model's performance [here](https://github.com/gangiswag/cornstack).
+
 # Usage
 
 **Important**: the query prompt *must* include the following *task instruction prefix*: "Represent this query for searching relevant code"
@@ -47,3 +49,9 @@ print(query_embeddings)
 code_embeddings = model.encode(codes)
 print(code_embeddings)
 ```
+
+
+
+## Training
+We use a bi-encoder architecture for `CodeRankEmbed`, with weights shared between the text and code encoders. The retriever is contrastively fine-tuned with the InfoNCE loss on [CoRNStack](https://gangiswag.github.io/cornstack/), a high-quality dataset we curated. Our encoder is initialized with [Arctic-Embed-M-Long](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-long), a 137M-parameter text encoder that supports an extended context length of 8,192 tokens.
+
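The Usage section stresses that every query *must* carry the task instruction prefix, while the code side is encoded as-is. A minimal query-formatting sketch (the helper name is ours, and joining prefix and query with `": "` is an assumption about the exact template):

```python
# Minimal query-formatting sketch. The prefix string comes from the Usage
# section; joining with ": " is our assumption about the exact template.
TASK_PREFIX = "Represent this query for searching relevant code"

def format_query(query: str) -> str:
    """Attach the mandatory task instruction prefix to a search query."""
    return f"{TASK_PREFIX}: {query}"

queries = [format_query("how to reverse a linked list")]
codes = ["def reverse(head):\n    prev = None\n    ..."]  # code side needs no prefix

print(queries[0])
# With a loaded SentenceTransformer, encoding then follows the README snippet:
# query_embeddings = model.encode(queries)
# code_embeddings = model.encode(codes)
```

Forgetting the prefix silently degrades retrieval quality, so funnelling all queries through one helper is a cheap safeguard.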
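The contrastive objective named in the Training section can be sketched as follows. This is an illustrative NumPy implementation of InfoNCE with in-batch negatives, not our training code; the temperature value, batch size, and embedding dimension are placeholder assumptions. Because the bi-encoder shares weights, both embedding matrices would come from the same encoder in practice.

```python
import numpy as np

def info_nce_loss(query_emb: np.ndarray, code_emb: np.ndarray,
                  temperature: float = 0.05) -> float:
    """InfoNCE with in-batch negatives: query i's positive is code i,
    and every other code in the batch serves as a negative."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    c = code_emb / np.linalg.norm(code_emb, axis=1, keepdims=True)
    logits = (q @ c.T) / temperature              # (B, B) cosine-similarity logits
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))    # positives sit on the diagonal

# Toy batch of 4 query/code pairs with 8-dim embeddings.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
matched = info_nce_loss(q, q)                     # perfectly aligned pairs
mismatched = info_nce_loss(q, rng.normal(size=(4, 8)))
print(matched, mismatched)
```

Aligned pairs yield a much lower loss than random pairs, which is exactly the signal that pulls query and code embeddings of true pairs together during fine-tuning.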