|
--- |
|
language: |
|
- code |
|
- en |
|
task_categories: |
|
- text-classification |
|
tags: |
|
- arxiv:2305.06156 |
|
license: mit |
|
metrics: |
|
- accuracy |
|
widget: |
|
- text: |- |
|
Sum two integers</s></s>def sum(a, b): |
|
return a + b |
|
example_title: Simple toy |
|
- text: |- |
|
Look for methods that might be dynamically defined and define them for lookup.</s></s>def respond_to_missing?(name, include_private = false) |
|
if name == :to_ary || name == :empty? |
|
false |
|
else |
|
return true if mapping(name).present? |
|
mounting = all_mountings.find{ |mount| mount.respond_to?(name) } |
|
return false if mounting.nil? |
|
end |
|
end |
|
example_title: Ruby example |
|
- text: |- |
|
Method that adds a candidate to the party @param c the candidate that will be added to the party</s></s>public void addCandidate(Candidate c) |
|
{ |
|
this.votes += c.getVotes(); |
|
candidates.add(c); |
|
} |
|
example_title: Java example |
|
- text: |- |
|
we do not need Buffer pollyfill for now</s></s>function(str){ |
|
var ret = new Array(str.length), len = str.length; |
|
while(len--) ret[len] = str.charCodeAt(len); |
|
return Uint8Array.from(ret); |
|
} |
|
example_title: JavaScript example |
|
|
|
pipeline_tag: text-classification |
|
--- |
|
|
|
|
|
|
|
## Table of Contents |
|
- [Model Description](#model-description) |
|
- [Model Details](#model-details) |
|
- [Usage](#usage) |
|
- [Limitations](#limitations) |
|
- [Additional Information](#additional-information) |
|
- [Licensing Information](#licensing-information) |
|
- [Citation Information](#citation-information) |
|
|
|
|
|
## Model Description |
|
|
|
This model is built on [CodeBERT](https://github.com/microsoft/CodeBERT) and trained on a 5M-sample subset of [The Vault](https://huggingface.co/datasets/Fsoft-AIC/thevault-function-level) to detect inconsistencies between a docstring/comment and its function. It is used to remove noisy examples from The Vault dataset.
|
|
|
More information: |
|
- **Repository:** [FSoft-AI4Code/TheVault](https://github.com/FSoft-AI4Code/TheVault) |
|
- **Paper:** [The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation](https://arxiv.org/abs/2305.06156) |
|
- **Contact:** support.ailab@fpt.com |
|
|
|
|
|
## Model Details |
|
* Developed by: [Fsoft AI Center](https://www.fpt-aicenter.com/ai-residency/) |
|
* License: MIT |
|
* Model type: Transformer encoder-based language model
|
* Architecture: BERT-base |
|
* Data set: [The Vault](https://huggingface.co/datasets/Fsoft-AIC/thevault-function-level) |
|
* Tokenizer: Byte Pair Encoding |
|
* Vocabulary Size: 50265 |
|
* Sequence Length: 512 |
|
* Languages: English and 10 programming languages (Python, Java, JavaScript, PHP, C#, C, C++, Go, Rust, Ruby)
|
* Training details: |
|
* Self-supervised learning, binary classification |
|
* Positive class: Original code-docstring pair |
|
* Negative class: Randomly paired code and docstring (a construction sketch follows this list)
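
The exact negative-sampling procedure is not reproduced here; below is a minimal sketch of how such training pairs could be constructed. The function name, the 1:1 positive-to-negative ratio, and the fixed seed are illustrative assumptions, not details from the paper.

```python
import random

def make_training_pairs(docstrings, functions, seed=42):
    """Build (docstring, code, label) examples: label 1 for the original
    pairing, label 0 for a randomly mismatched docstring."""
    rng = random.Random(seed)
    examples = []
    for i, (doc, code) in enumerate(zip(docstrings, functions)):
        examples.append((doc, code, 1))  # positive: the original pair
        # negative: pair the same code with a docstring from another sample
        j = rng.randrange(len(docstrings) - 1)
        if j >= i:
            j += 1  # skip the index of the original docstring
        examples.append((docstrings[j], code, 0))
    rng.shuffle(examples)
    return examples
```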
|
|
|
## Usage |
|
The model input follows the template below:
|
```python
# Input template:
#   <s>{docstring}</s></s>{code}</s>

from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("Fsoft-AIC/Codebert-docstring-inconsistency")

# Example: docstring and code joined according to the template
text = "<s>Sum two integers</s></s>def sum(a, b):\n    return a + b</s>"
tokenized_input = tokenizer(text, add_special_tokens=False)
```
|
|
|
Using the model with JAX or PyTorch:
|
```python
from transformers import AutoModelForSequenceClassification, FlaxAutoModelForSequenceClassification

# Load the model with JAX (Flax)
model = FlaxAutoModelForSequenceClassification.from_pretrained("Fsoft-AIC/Codebert-docstring-inconsistency")

# Load the model with PyTorch
model = AutoModelForSequenceClassification.from_pretrained("Fsoft-AIC/Codebert-docstring-inconsistency")
```
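
Below is a minimal end-to-end scoring sketch with PyTorch. Which class index corresponds to a consistent pair is an assumption here; check `model.config.id2label` on the downloaded checkpoint before relying on it. Pairs whose consistency probability falls below a chosen threshold can then be dropped, mirroring how the model was used to filter noisy examples from The Vault.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("Fsoft-AIC/Codebert-docstring-inconsistency")
model = AutoModelForSequenceClassification.from_pretrained("Fsoft-AIC/Codebert-docstring-inconsistency")
model.eval()

docstring = "Sum two integers"
code = "def sum(a, b):\n    return a + b"
text = f"<s>{docstring}</s></s>{code}</s>"  # the model's input template

# The template already contains the special tokens, so add_special_tokens=False
inputs = tokenizer(text, add_special_tokens=False, truncation=True,
                   max_length=512, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1).squeeze()
print(model.config.id2label, probs.tolist())
```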
|
|
|
## Limitations |
|
This model is trained on a 5M-sample subset of The Vault in a self-supervised manner. Since the negative samples are generated artificially, the model's ability to identify instances that require a strong semantic understanding of the relationship between the code and the docstring may be limited.
|
|
|
The model is hard to evaluate because no labeled dataset is available. GPT-3.5-turbo was adopted as a reference, and the correlation between the model's scores and GPT-3.5-turbo's was measured. However, that result could be influenced by GPT-3.5-turbo's potential biases and ambiguous judgments. We therefore recommend building a human-labeled dataset and fine-tuning this model on it to achieve the best results; a minimal fine-tuning sketch follows.
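
As a sketch of that recommendation, assuming a small list of human-labeled `(docstring, code, label)` triples: the dataset, hyperparameters, and output directory below are all illustrative, not a prescribed recipe.

```python
import torch
from torch.utils.data import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

class PairDataset(Dataset):
    """Human-labeled (docstring, code, label) pairs; label 1 = consistent."""
    def __init__(self, pairs, tokenizer):
        self.pairs, self.tokenizer = pairs, tokenizer
    def __len__(self):
        return len(self.pairs)
    def __getitem__(self, idx):
        doc, code, label = self.pairs[idx]
        enc = self.tokenizer(f"<s>{doc}</s></s>{code}</s>",
                             add_special_tokens=False, truncation=True,
                             max_length=512, padding="max_length")
        item = {k: torch.tensor(v) for k, v in enc.items()}
        item["labels"] = torch.tensor(label)
        return item

tokenizer = AutoTokenizer.from_pretrained("Fsoft-AIC/Codebert-docstring-inconsistency")
model = AutoModelForSequenceClassification.from_pretrained("Fsoft-AIC/Codebert-docstring-inconsistency")

# Replace with real human-labeled data
labeled = [("Sum two integers", "def sum(a, b):\n    return a + b", 1)]

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="codebert-inconsistency-finetuned",
                           num_train_epochs=3,
                           per_device_train_batch_size=16,
                           learning_rate=2e-5),
    train_dataset=PairDataset(labeled, tokenizer),
)
trainer.train()
```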
|
|
|
## Additional Information
|
### Licensing Information |
|
|
|
MIT License |
|
|
|
### Citation Information |
|
|
|
``` |
|
@article{manh2023vault, |
|
title={The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation}, |
|
author={Manh, Dung Nguyen and Hai, Nam Le and Dau, Anh TV and Nguyen, Anh Minh and Nghiem, Khanh and Guo, Jin and Bui, Nghi DQ}, |
|
journal={arXiv preprint arXiv:2305.06156}, |
|
year={2023} |
|
} |
|
``` |