---
language:
  - code
  - en
task_categories:
  - text-classification
tags:
  - arxiv:2305.06156
license: mit
metrics:
  - accuracy
widget:
  - text: |-
      Sum two integers</s></s>def sum(a, b):
          return a + b
    example_title: Simple toy
  - text: >-
      Look for methods that might be dynamically defined and define them for
      lookup.</s></s>def respond_to_missing?(name, include_private = false)
        if name == :to_ary || name == :empty?
          false
        else
          return true if mapping(name).present?
          mounting = all_mountings.find{ |mount| mount.respond_to?(name) }
          return false if mounting.nil?
        end
      end
    example_title: Ruby example
  - text: >-
      Method that adds a candidate to the party @param c the candidate that will
      be added to the party</s></s>public void addCandidate(Candidate c)

      {
          this.votes += c.getVotes(); 
          candidates.add(c); 
      }
    example_title: Java example
  - text: |-
      we do not need Buffer pollyfill for now</s></s>function(str){
        var ret = new Array(str.length), len = str.length;
        while(len--) ret[len] = str.charCodeAt(len);
        return Uint8Array.from(ret);
      }
    example_title: JavaScript example
pipeline_tag: text-classification
---

## Table of Contents

- [Model Description](#model-description)
- [Model Details](#model-details)
- [Usage](#usage)
- [Limitations](#limitations)
- [Additional information](#additional-information)

## Model Description

This model was developed based on CodeBERT and a 5M-sample subset of The Vault to detect inconsistencies between a docstring/comment and its corresponding function. It is used to remove noisy examples from The Vault dataset.

More information: [The Vault (arXiv:2305.06156)](https://arxiv.org/abs/2305.06156)

## Model Details

- Developed by: Fsoft AI Center
- License: MIT
- Model type: Transformer-encoder based language model
- Architecture: BERT-base
- Dataset: The Vault
- Tokenizer: Byte-Pair Encoding
- Vocabulary size: 50,265
- Sequence length: 512
- Language: English and 10 programming languages (Python, Java, JavaScript, PHP, C#, C, C++, Go, Rust, Ruby)
- Training details:
  - Self-supervised learning, binary classification
  - Positive class: the original code-docstring pair
  - Negative class: code randomly paired with an unrelated docstring (see the sketch below)
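
The negative-class construction above can be sketched roughly as follows. This is only an illustration of the random-pairing idea under assumed names (`pairs`, `make_training_examples`) and an assumed label convention (1 = original pair, 0 = mismatched); it is not the released training code.

```python
import random

def make_training_examples(pairs, seed=0):
    """pairs: list of (docstring, code) tuples drawn from The Vault.

    Returns (text, label) examples: label 1 marks an original pair,
    label 0 marks a randomly re-paired (inconsistent) one.
    Illustrative sketch only, not the actual training pipeline.
    """
    rng = random.Random(seed)
    examples = []
    for i, (doc, code) in enumerate(pairs):
        # Positive class: the original docstring/code pair.
        examples.append((f"<s>{doc}</s></s>{code}</s>", 1))

        # Negative class: the same code paired with a docstring
        # taken from a different function.
        j = rng.randrange(len(pairs))
        while j == i:
            j = rng.randrange(len(pairs))
        other_doc, _ = pairs[j]
        examples.append((f"<s>{other_doc}</s></s>{code}</s>", 0))
    return examples
```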

## Usage

The input to the model follows the template below:

"""
Template:
<s>{docstring}</s></s>{code}</s>

Example:
from transformers import AutoTokenizer

#Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("Fsoft-AIC/Codebert-docstring-inconsistency")

input = "<s>Sum two integers</s></s>def sum(a, b):\n    return a + b</s>"
tokenized_input = tokenizer(input, add_special_tokens= False)
"""

### Using the model with JAX and PyTorch

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, FlaxAutoModelForSequenceClassification

# Load the model with JAX
model = FlaxAutoModelForSequenceClassification.from_pretrained("Fsoft-AIC/Codebert-docstring-inconsistency")

# Load the model with PyTorch
model = AutoModelForSequenceClassification.from_pretrained("Fsoft-AIC/Codebert-docstring-inconsistency")
```
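
With the PyTorch model and the tokenizer loaded above, a minimal inference sketch looks like the following. The meaning of each label index is an assumption here; check `model.config.id2label` for the actual mapping.

```python
import torch

docstring = "Sum two integers"
code = "def sum(a, b):\n    return a + b"

# Build the input following the template <s>{docstring}</s></s>{code}</s>
text = f"<s>{docstring}</s></s>{code}</s>"
inputs = tokenizer(text, add_special_tokens=False, return_tensors="pt",
                   truncation=True, max_length=512)

with torch.no_grad():
    logits = model(**inputs).logits

probs = torch.softmax(logits, dim=-1).squeeze()
print(probs)  # probabilities over the two classes (consistent vs. inconsistent)
```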

## Limitations

This model was trained on a 5M-sample subset of The Vault in a self-supervised manner. Since the negative samples are generated artificially by random pairing, the model's ability to identify instances that require a deep semantic understanding of the relationship between the code and the docstring may be limited.

The model is also hard to evaluate because no labeled dataset is available. GPT-3.5-turbo was therefore adopted as a reference, and we measured the correlation between this model's scores and GPT-3.5-turbo's scores. That result, however, may be influenced by GPT-3.5-turbo's potential biases and by ambiguous borderline cases. We therefore recommend building a human-labeled dataset and fine-tuning this model on it for the best results, as sketched below.
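
If you do collect a small human-labeled set, one way to fine-tune the model is with the standard `transformers` `Trainer`. The dataset contents, label convention, and hyperparameters below are illustrative assumptions, not settings used by the authors.

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

model_id = "Fsoft-AIC/Codebert-docstring-inconsistency"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# Hypothetical human-labeled data: 1 = consistent pair, 0 = inconsistent.
# (The label convention is an assumption; check model.config.id2label.)
raw = Dataset.from_dict({
    "text": [
        "<s>Sum two integers</s></s>def sum(a, b):\n    return a + b</s>",
        "<s>Parse a URL</s></s>def sum(a, b):\n    return a + b</s>",
    ],
    "label": [1, 0],
})

def tokenize(batch):
    # Inputs already contain the special tokens from the template.
    return tokenizer(batch["text"], add_special_tokens=False,
                     truncation=True, max_length=512)

train_ds = raw.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="codebert-inconsistency-finetuned",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=train_ds,
    tokenizer=tokenizer,  # enables dynamic padding of batches
)
trainer.train()
```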

## Additional information

### Licensing Information

MIT License

### Citation Information

```bibtex
@article{manh2023vault,
  title={The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation},
  author={Manh, Dung Nguyen and Hai, Nam Le and Dau, Anh TV and Nguyen, Anh Minh and Nghiem, Khanh and Guo, Jin and Bui, Nghi DQ},
  journal={arXiv preprint arXiv:2305.06156},
  year={2023}
}
```