
CoCoSoDa: Effective Contrastive Learning for Code Search

Our approach adopts a pre-trained model as the base code/query encoder and optimizes it with multimodal contrastive learning and soft data augmentation.


CoCoSoDa is comprised of the following four components:

  • Pre-trained code/query encoder captures the semantic information of a code snippet or a natural language query and maps it into a high-dimensional embedding space.

  • Momentum code/query encoder encodes the samples (code snippets or queries) of the current and previous mini-batches to enrich the negative samples.

  • Soft data augmentation dynamically masks or replaces some tokens in a sample (code/query) to generate a similar sample.

  • Multimodal contrastive learning loss function is used as the optimization objective and consists of inter-modal and intra-modal contrastive learning losses. They minimize the distance between the representations of similar samples and maximize the distance between different samples in the embedding space (see the sketch after this list).

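For readers who prefer code, the sketch below illustrates the two core ideas, soft data augmentation and inter-modal contrastive learning, in PyTorch. All names, the masking probability, and the temperature are illustrative assumptions, not the exact implementation in this repository.

```python
# Illustrative sketch only (assumed names and hyperparameters), not the repository's exact code.
import torch
import torch.nn.functional as F

def dynamic_masking(input_ids, mask_token_id, vocab_size, prob=0.15):
    """Soft data augmentation: randomly mask or replace some tokens of a
    code/query sample to create a similar, augmented sample."""
    ids = input_ids.clone()
    noise = torch.rand_like(ids, dtype=torch.float)
    replace = noise < prob / 2                      # replace with a random token
    mask = (noise >= prob / 2) & (noise < prob)     # replace with the mask token
    ids[replace] = torch.randint(vocab_size, ids.shape, device=ids.device)[replace]
    ids[mask] = mask_token_id
    return ids

def inter_modal_loss(query_emb, code_emb, temperature=0.05):
    """InfoNCE-style loss: pull each query toward its paired code snippet
    (the diagonal) and push it away from the other snippets in the batch."""
    q = F.normalize(query_emb, dim=-1)
    c = F.normalize(code_emb, dim=-1)
    logits = q @ c.t() / temperature                # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```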
Source code

Environment

conda create -n CoCoSoDa python=3.6 -y
conda activate CoCoSoDa
pip install torch==1.10  transformers==4.12.5 seaborn==0.11.2 fast-histogram nltk==3.6.5 networkx==2.5.1 tree_sitter tqdm prettytable gdown more-itertools tensorboardX sklearn  

Data

cd dataset
bash get_data.sh 

Data statistics are shown in the table below.

| PL | Training | Validation | Test | Candidate Codes |
| --- | --- | --- | --- | --- |
| Ruby | 24,927 | 1,400 | 1,261 | 4,360 |
| JavaScript | 58,025 | 3,885 | 3,291 | 13,981 |
| Java | 164,923 | 5,183 | 10,955 | 40,347 |
| Go | 167,288 | 7,325 | 8,122 | 28,120 |
| PHP | 241,241 | 12,982 | 14,014 | 52,660 |
| Python | 251,820 | 13,914 | 14,918 | 43,827 |

Downloading and preparing the data takes about 10 minutes.

Training and Evaluation

We have uploaded the pre-trained model to the Hugging Face Hub. You can directly download DeepSoftwareAnalytics/CoCoSoDa and fine-tune it.
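As a minimal sketch, the checkpoint can typically be loaded with the transformers Auto classes; only the model name comes from the text above, and this assumes the uploaded config is recognized by those classes.

```python
# Sketch of loading the released checkpoint; assumes the checkpoint's config
# is recognized by transformers' Auto classes.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("DeepSoftwareAnalytics/CoCoSoDa")
model = AutoModel.from_pretrained("DeepSoftwareAnalytics/CoCoSoDa")
```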

Pre-training (Optional)

bash run_cocosoda.sh $lang 

The optimized model is saved in ./saved_models/cocosoda/. You can upload it to the Hugging Face Hub.

Pre-training takes about 3 days.

Fine-tuning

lang=java
bash run_fine_tune.sh $lang 

Zero-shot running

lang=python
bash run_zero-shot.sh $lang 
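Conceptually, zero-shot code search encodes the query and the candidate code snippets with the same model and ranks candidates by embedding similarity. The sketch below reuses the tokenizer and model loaded earlier; the mean pooling, the cosine ranking, and the candidate_snippets list are illustrative assumptions, not necessarily what the released scripts do.

```python
# Illustrative zero-shot retrieval sketch; pooling and ranking choices are assumed.
import torch
import torch.nn.functional as F

def embed(texts, tokenizer, model):
    """Encode a list of strings and mean-pool the last hidden states."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state            # (B, L, D)
    mask = batch["attention_mask"].unsqueeze(-1)
    emb = (hidden * mask).sum(1) / mask.sum(1)                # masked mean pooling
    return F.normalize(emb, dim=-1)

candidate_snippets = [                                        # hypothetical candidates
    "def read_json(path):\n    import json\n    return json.load(open(path))",
    "def add(a, b):\n    return a + b",
]
query_emb = embed(["read a json file"], tokenizer, model)
code_emb = embed(candidate_snippets, tokenizer, model)
ranking = (query_emb @ code_emb.t()).squeeze(0).argsort(descending=True)
```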

Results

The model is evaluated with MRR (Mean Reciprocal Rank).

| Model | Ruby | JavaScript | Go | Python | Java | PHP | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CoCoSoDa | 0.818 | 0.764 | 0.921 | 0.757 | 0.763 | 0.703 | 0.788 |
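MRR averages, over all queries, the reciprocal of the rank at which the ground-truth code snippet appears in the returned list. A minimal sketch of the metric (variable names are illustrative):

```python
def mean_reciprocal_rank(ranked_ids_per_query, gold_ids):
    """ranked_ids_per_query: one list of candidate ids per query, best first.
    gold_ids: the ground-truth candidate id for each query."""
    total = 0.0
    for ranked, gold in zip(ranked_ids_per_query, gold_ids):
        rank = ranked.index(gold) + 1   # 1-based rank of the correct snippet
        total += 1.0 / rank
    return total / len(gold_ids)
```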

Appendix

The description of the baselines, additional experimental results, and further discussion are given in Appendix/Appendix.pdf.

Contact

Feel free to contact Ensheng Shi (enshengshi@qq.com) if you have any further questions, or if a GitHub issue receives no response for more than a day.
