---
pipeline_tag: sentence-similarity
datasets:
- gonglinyuan/CoSQA
- AdvTest
tags:
- sentence-transformers
- feature-extraction
- code-similarity
language: en
license: apache-2.0
---
# mpnet-code-search

This is a finetuned sentence-transformers model. It was trained on natural language-programming language (NL-PL) pairs, improving performance on code search and retrieval applications.
## Usage (Sentence-Transformers)

This model can be loaded with [sentence-transformers](https://www.sbert.net):

```
pip install -U sentence-transformers
```
Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer

sentences = ["Print hello world to stdout", "print('hello world')"]

model = SentenceTransformer('sweepai/mpnet-code-search')
embeddings = model.encode(sentences)
print(embeddings)
```
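For code search, the embeddings are typically compared with cosine similarity: the query is matched against the code snippet whose embedding scores highest. Here is a minimal sketch of that comparison in pure Python, so it runs without downloading the model; the `model.encode` call in the comment assumes the snippet above:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# With the model loaded above you would compare real embeddings, e.g.:
#   query_emb, code_emb = model.encode(sentences)
#   score = cosine_similarity(query_emb, code_emb)

# Toy vectors illustrating the ranking behaviour:
query = [1.0, 0.0, 1.0]
relevant = [0.9, 0.1, 0.8]
unrelated = [0.0, 1.0, 0.0]
print(cosine_similarity(query, relevant) > cosine_similarity(query, unrelated))  # True
```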
## Evaluation Results

MRR on the CoSQA and AdvTest datasets:

- Base model
- Finetuned model
## Background

This project aims to improve the performance of the fine-tuned SBERT MPNet model for coding applications.

We developed this model for use in our own app, Sweep, an AI-powered junior developer.
## Intended Uses

Our model is intended for code search applications, allowing users to enter natural language queries and retrieve the corresponding code chunks.
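A code search application built on this model usually follows the same shape: embed every code chunk once, embed the user's query, and rank chunks by similarity. Below is a minimal sketch of that loop; the character-trigram `embed` function is a toy stand-in for `model.encode` so the example runs without downloading the model:

```python
import math

def embed(text):
    # Toy stand-in embedder: character-trigram counts. In a real
    # application this would be model.encode(text); the trigram version
    # only exists to make the retrieval loop self-contained.
    vec = {}
    for i in range(len(text) - 2):
        tri = text[i:i + 3]
        vec[tri] = vec.get(tri, 0) + 1
    return vec

def cosine(u, v):
    dot = sum(u[k] * v.get(k, 0) for k in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def search(query, snippets):
    """Rank code snippets by similarity to a natural language query."""
    q = embed(query)
    return sorted(snippets, key=lambda s: cosine(q, embed(s)), reverse=True)

snippets = ["print('hello world')", "for i in range(10): total += i"]
print(search("print hello world", snippets)[0])  # print('hello world')
```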
## Chunking (Open-Source)

We developed our own chunking algorithm to improve the quality of a repository's code snippets. This tree-based algorithm is described in our blog post.
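Sweep's actual chunker is described in the blog post; purely to illustrate the tree-based idea, here is a minimal sketch using Python's built-in `ast` module. It cuts a file at top-level definitions so that no chunk splits a function or class in half. This is not Sweep's implementation (which works from syntax trees across many languages), just a single-language illustration:

```python
import ast

def chunk_python_source(source, max_lines=40):
    """Split a Python file into chunks at top-level AST nodes.

    Illustrative tree-based chunking: we parse the file and only cut
    between top-level definitions, so a function or class is never
    split across two chunks.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks, current, current_len = [], [], 0
    for node in tree.body:
        start, end = node.lineno - 1, node.end_lineno
        segment = lines[start:end]
        # Start a new chunk when adding this node would exceed the budget.
        if current and current_len + len(segment) > max_lines:
            chunks.append("\n".join(current))
            current, current_len = [], 0
        current.extend(segment)
        current_len += len(segment)
    if current:
        chunks.append("\n".join(current))
    return chunks

example = "def f():\n    return 1\n\ndef g():\n    return 2\n"
print(chunk_python_source(example, max_lines=2))  # two chunks, one per function
```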
## Demo

We created an interactive demo for our new chunking algorithm.
## Training Procedure

### Base Model

We use the pretrained [`sentence-transformers/all-mpnet-base-v2`](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) model. Please refer to its model card for a more detailed overview of the training data.
### Finetuning

We finetune the model using a contrastive objective.
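The exact loss is not specified here; a common choice for NL-PL pairs is an in-batch (multiple-negatives) contrastive objective, where each query's paired code snippet is the positive and the other snippets in the batch act as negatives. Below is a pure-Python sketch of that objective on toy embeddings; the scale factor and the specific loss form are assumptions, not Sweep's confirmed setup:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def in_batch_contrastive_loss(query_embs, code_embs, scale=20.0):
    """In-batch contrastive loss over (query, code) pairs.

    For each query i, code i is the positive and every other code j != i
    in the batch is a negative. The loss is the mean cross-entropy of the
    scaled similarity row against target index i.
    """
    n = len(query_embs)
    total = 0.0
    for i in range(n):
        sims = [scale * cosine(query_embs[i], c) for c in code_embs]
        log_denom = math.log(sum(math.exp(s) for s in sims))
        total += log_denom - sims[i]
    return total / n

queries = [[1.0, 0.0], [0.0, 1.0]]
matched = [[1.0, 0.0], [0.0, 1.0]]     # correct pairing: low loss
mismatched = [[0.0, 1.0], [1.0, 0.0]]  # swapped pairing: high loss
print(in_batch_contrastive_loss(queries, matched) <
      in_batch_contrastive_loss(queries, mismatched))  # True
```

This corresponds in spirit to sentence-transformers' `MultipleNegativesRankingLoss`, which applies the same in-batch negative scheme over scaled cosine similarities.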
### Hyperparameters

We trained on 8x A5000 GPUs.
### Training Data

| Dataset | Number of training tuples |
|---------|---------------------------|
| CoSQA   | 20,000                    |
| AdvTest | 250,000                   |
| **Total** | **270,000**             |