---
pipeline_tag: sentence-similarity
datasets:
  - gonglinyuan/CoSQA
  - AdvTest
tags:
  - sentence-transformers
  - feature-extraction
  - code-similarity
language: en
license: apache-2.0
---

# mpnet-code-search

This is a finetuned sentence-transformers model. It was trained on Natural Language-Programming Language (NL-PL) pairs, improving performance on code search and retrieval applications.

## Usage (Sentence-Transformers)

This model can be loaded with sentence-transformers:

```
pip install -U sentence-transformers
```

Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer

sentences = ["Print hello world to stdout", "print('hello world')"]

model = SentenceTransformer('sweepai/mpnet-code-search')
embeddings = model.encode(sentences)
print(embeddings)
```
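The resulting embeddings can be compared with cosine similarity to score how well a query matches a code snippet. A minimal sketch, with toy vectors standing in for real `model.encode(...)` output:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two embedding vectors.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for real model embeddings.
query_emb = [0.2, 0.8, 0.1]
code_emb = [0.25, 0.75, 0.05]
print(cosine_similarity(query_emb, code_emb))
```

Scores close to 1 indicate a strong query-snippet match.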

## Evaluation Results

MRR on the CoSQA and AdvTest datasets:

- Base model
- Finetuned model
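For reference, MRR (mean reciprocal rank) averages 1/rank of the first correct result across all queries. A minimal sketch:

```python
def mean_reciprocal_rank(ranked_results, relevant):
    # ranked_results: one ranked list of candidate ids per query.
    # relevant: the correct candidate id for each query.
    total = 0.0
    for candidates, gold in zip(ranked_results, relevant):
        if gold in candidates:
            total += 1.0 / (candidates.index(gold) + 1)
    return total / len(ranked_results)

# Two queries: correct answer ranked 1st and 2nd -> MRR = (1 + 0.5) / 2
print(mean_reciprocal_rank([["a", "b"], ["c", "d"]], ["a", "d"]))  # 0.75
```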

## Background

This project aims to improve the performance of the fine-tuned SBERT MPNet model for coding applications.

We developed this model to use in our own app, Sweep, an AI-powered junior developer.

## Intended Uses

Our model is intended for code search applications, allowing users to search with natural language prompts and find the corresponding code chunks.
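In a typical retrieval setup, every code chunk in a repository is embedded once, and each incoming query embedding is scored against all of them. A minimal sketch, with toy vectors in place of real `model.encode(...)` output:

```python
import numpy as np

def top_k(query_emb, chunk_embs, chunk_texts, k=2):
    # Normalize so the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    c = chunk_embs / np.linalg.norm(chunk_embs, axis=1, keepdims=True)
    scores = c @ q
    order = np.argsort(-scores)[:k]
    return [(chunk_texts[i], float(scores[i])) for i in order]

# Toy embeddings standing in for encoded code chunks.
chunks = ["def add(a, b): ...", "def read_file(path): ...", "class Stack: ..."]
embs = np.array([[0.9, 0.1, 0.0], [0.1, 0.9, 0.1], [0.0, 0.2, 0.9]])
query = np.array([0.85, 0.15, 0.05])  # e.g. "function that adds two numbers"
print(top_k(query, embs, chunks, k=1))
```

In practice, chunk embeddings would be precomputed and cached so each search only requires encoding the query.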

## Chunking (Open-Source)

We developed our own chunking algorithm to improve the quality of a repository's code snippets. This tree-based algorithm is described in our blog post.
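As a rough illustration of chunking along syntax-tree boundaries (this is not Sweep's actual algorithm, which is language-agnostic and described in the blog post), here is a minimal Python-only sketch that splits a module into one chunk per top-level definition:

```python
import ast

def chunk_by_top_level_defs(source):
    # Illustrative only: one chunk per top-level function/class,
    # using Python's own syntax tree (requires Python 3.8+ for end_lineno).
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append("\n".join(lines[node.lineno - 1 : node.end_lineno]))
    return chunks

code = "def f():\n    return 1\n\nclass C:\n    pass\n"
for chunk in chunk_by_top_level_defs(code):
    print(chunk)
    print("---")
```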

## Demo

We created an interactive demo for our new chunking algorithm.


## Training Procedure

### Base Model

We use the pretrained sentence-transformers/all-mpnet-base-v2. Please refer to its model card for a more detailed overview of its training data.

### Finetuning

We finetune the model using a contrastive objective.
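The card does not name the exact loss. A common contrastive setup for sentence-transformers finetuning is in-batch multiple-negatives ranking, where each query's paired code snippet is the positive and the other snippets in the batch serve as negatives. A minimal NumPy sketch of that objective (an assumption for illustration, not the confirmed training code):

```python
import numpy as np

def in_batch_contrastive_loss(query_embs, code_embs, scale=20.0):
    # Cosine-similarity matrix between every query and every code
    # embedding in the batch; entry (i, i) is the true pair.
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    c = code_embs / np.linalg.norm(code_embs, axis=1, keepdims=True)
    sims = scale * (q @ c.T)
    # Cross-entropy with the diagonal (the matching pair) as the target.
    sims = sims - sims.max(axis=1, keepdims=True)  # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    n = len(log_probs)
    return float(-log_probs[np.arange(n), np.arange(n)].mean())

rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 8))
loss_matched = in_batch_contrastive_loss(queries, queries)  # perfect pairs
loss_random = in_batch_contrastive_loss(queries, rng.normal(size=(4, 8)))
print(loss_matched, loss_random)
```

Minimizing this loss pushes each NL query's embedding toward its paired code snippet and away from the other snippets in the batch.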

### Hyperparameters

We trained on 8x A5000s.

### Training Data

| Dataset | Number of training tuples |
|---------|---------------------------|
| CoSQA   | 20,000                    |
| AdvTest | 250,000                   |
| **Total** | **270,000**             |