Add Semble reference, MTEB link, and Semble in additional resources
README.md
@@ -24,7 +24,7 @@ datasets:
 
 ## Overview
 
-**potion-code-16M** is a fast static code embedding model optimized for code retrieval tasks. It is distilled from [nomic-ai/CodeRankEmbed](https://huggingface.co/nomic-ai/CodeRankEmbed) and trained on the [CornStack](https://huggingface.co/datasets/nomic-ai/cornstack-python-v1) code corpus using [Tokenlearn](https://github.com/MinishLab/tokenlearn) and contrastive fine-tuning.
+**potion-code-16M** is a fast static code embedding model optimized for code retrieval tasks. It powers [Semble](https://github.com/MinishLab/semble), a code search library for agents. It is distilled from [nomic-ai/CodeRankEmbed](https://huggingface.co/nomic-ai/CodeRankEmbed) and trained on the [CornStack](https://huggingface.co/datasets/nomic-ai/cornstack-python-v1) code corpus using [Tokenlearn](https://github.com/MinishLab/tokenlearn) and contrastive fine-tuning.
 
 It uses static embeddings, allowing text and code embeddings to be computed orders of magnitude faster than transformer-based models on both GPU and CPU.
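The overview paragraph in this hunk describes encoding text and code with a static model. A minimal usage sketch, assuming the model is published under the `minishlab/potion-code-16M` Hub id (`StaticModel` is the Model2Vec API linked in the card):

```python
import numpy as np
from model2vec import StaticModel

# Load the static embedding model from the Hugging Face Hub.
# The Hub id is an assumption based on the model name in the card.
model = StaticModel.from_pretrained("minishlab/potion-code-16M")

# Encode a natural-language query and a code snippet. A static model
# embeds via token-embedding lookup plus pooling (no transformer forward
# pass), which is why encoding is fast even on CPU.
query = "how do I read a file line by line in python"
snippet = "def read_lines(path):\n    with open(path) as f:\n        return f.readlines()"
query_emb, snippet_emb = model.encode([query, snippet])

# Cosine similarity for retrieval-style scoring.
score = np.dot(query_emb, snippet_emb) / (
    np.linalg.norm(query_emb) * np.linalg.norm(snippet_emb)
)
print(f"similarity: {score:.3f}")
```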
@@ -60,7 +60,7 @@ potion-code-16M is created using the following pipeline:
 
 ## Results
 
-Results on the [CoIR benchmark](https://github.com/CoIR-team/coir) (NDCG@10, `mteb>=2.10`):
+Results on the [CoIR benchmark](https://github.com/CoIR-team/coir), run via [MTEB](https://github.com/embeddings-benchmark/mteb) (NDCG@10, `mteb>=2.10`):
 
 | Model | Params | AVG | AppsRetrieval | COIRCodeSearchNet | CodeFeedbackMT | CodeFeedbackST | CodeSearchNetCC | CodeTransContest | CodeTransDL | CosQA | StackOverflow | Text2SQL |
 |---|---|---|---|---|---|---|---|---|---|---|---|---|
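To make the pinned `mteb>=2.10` requirement concrete, here is a hedged sketch of running the benchmark. The `get_benchmark("CoIR")` and `MTEB(...).run(...)` calls follow the pre-2.x mteb interface and the Hub id is assumed, so the exact entry points in `mteb>=2.10` may differ:

```python
import mteb
from model2vec import StaticModel

# Hub id assumed from the model name in the card.
model = StaticModel.from_pretrained("minishlab/potion-code-16M")

# Assumptions: "CoIR" is a registered mteb benchmark name, and mteb
# accepts StaticModel.encode() directly; depending on the mteb version,
# a thin encoder wrapper may be needed if run() passes extra kwargs.
benchmark = mteb.get_benchmark("CoIR")
evaluation = mteb.MTEB(tasks=benchmark)
results = evaluation.run(model)  # per-task scores include NDCG@10
```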
@@ -86,6 +86,7 @@ CoIR covers a broad range of code retrieval scenarios. For the use case of findi
 
 ## Additional Resources
 
+- [Semble repository](https://github.com/MinishLab/semble)
 - [Model2Vec repository](https://github.com/MinishLab/model2vec)
 - [Tokenlearn repository](https://github.com/MinishLab/tokenlearn)
 - [CornStack dataset](https://huggingface.co/datasets/nomic-ai/cornstack-python-v1)
|