Model Description

This model is built on CodeBERT and trained on a 5M subset of The Vault to detect inconsistencies between a docstring/comment and the function it describes. It is used to remove noisy examples from The Vault dataset.

Model Details

  • Developed by: Fsoft AI Center
  • License: MIT
  • Model type: Transformer-Encoder based Language Model
  • Architecture: BERT-base
  • Dataset: The Vault
  • Tokenizer: Byte Pair Encoding
  • Vocabulary Size: 50265
  • Sequence Length: 512
  • Language: English and 10 programming languages (Python, Java, JavaScript, PHP, C#, C, C++, Go, Rust, Ruby)
  • Training details:
    • Self-supervised learning, binary classification
    • Positive class: Original code-docstring pair
    • Negative class: Randomly paired code and docstring
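
The training setup above can be sketched as follows. This is a minimal illustration of random pairing, not the authors' actual pipeline: the function name `make_pairs`, the label convention (1 for an original pair, 0 for a random pair), and the fixed seed are all assumptions.

```python
import random

def make_pairs(samples, seed=0):
    """Build labeled (docstring, code, label) triples from aligned pairs.

    `samples` is a list of (docstring, code) tuples. Label 1 marks an
    original pair; label 0 marks a docstring paired with code taken from
    a different sample. The label convention is an assumption.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to build negatives")
    rng = random.Random(seed)
    pairs = []
    for i, (docstring, code) in enumerate(samples):
        # Positive: the original docstring-code pair.
        pairs.append((docstring, code, 1))
        # Negative: same docstring, code from a randomly chosen other sample.
        j = rng.randrange(len(samples) - 1)
        if j >= i:
            j += 1  # shift past index i so sample i is never picked
        pairs.append((docstring, samples[j][1], 0))
    return pairs

samples = [
    ("Sum two integers", "def sum(a, b):\n    return a + b"),
    ("Reverse a string", "def reverse(s):\n    return s[::-1]"),
    ("Check if a number is even", "def is_even(n):\n    return n % 2 == 0"),
]
pairs = make_pairs(samples)
```

Each input sample yields one positive and one negative triple, so the two classes are balanced by construction in this sketch.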


The input to the model follows the template <s>{docstring}</s></s>{code}</s>, as in the example below:


from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("Fsoft-AIC/Codebert-docstring-inconsistency")

# The special tokens are already in the string, so do not add them again
text = "<s>Sum two integers</s></s>def sum(a, b):\n    return a + b</s>"
tokenized_input = tokenizer(text, add_special_tokens=False)
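
Assembling the input string by hand is error-prone, so a small helper can be useful. The sketch below infers the layout from the example string above; the helper name `build_input` is hypothetical and not part of the model's repository.

```python
def build_input(docstring: str, code: str) -> str:
    """Join a docstring and a function into the model's input layout.

    Layout inferred from the example string: <s>docstring</s></s>code</s>.
    Pass the result to the tokenizer with add_special_tokens=False so the
    special tokens are not added a second time.
    """
    return f"<s>{docstring}</s></s>{code}</s>"

text = build_input("Sum two integers", "def sum(a, b):\n    return a + b")
```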

Using the model with JAX and PyTorch

from transformers import AutoTokenizer, AutoModelForSequenceClassification, FlaxAutoModelForSequenceClassification

# Load the model with JAX (Flax)
flax_model = FlaxAutoModelForSequenceClassification.from_pretrained("Fsoft-AIC/Codebert-docstring-inconsistency")

# Load the model with PyTorch
model = AutoModelForSequenceClassification.from_pretrained("Fsoft-AIC/Codebert-docstring-inconsistency")
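
Once loaded, the PyTorch model can score a docstring-function pair. The following is a minimal sketch, assuming the standard `transformers` sequence-classification output (`logits` with one row per input and one column per class). Which class index means "consistent" is not stated in this card, so `POSITIVE_INDEX = 1` is an assumption to verify against `model.config.id2label`; `softmax`, `score_pair`, and `main` are illustrative helper names.

```python
import math

MODEL_ID = "Fsoft-AIC/Codebert-docstring-inconsistency"
POSITIVE_INDEX = 1  # assumption: verify against model.config.id2label

def softmax(logits):
    """Convert a list of raw logits into probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def score_pair(docstring, code, tokenizer, model):
    """Return the assumed probability that docstring and code are consistent."""
    import torch  # imported lazily so softmax() works without torch installed

    text = f"<s>{docstring}</s></s>{code}</s>"
    # Special tokens are already in the string; long inputs are cut to 512 tokens.
    inputs = tokenizer(text, add_special_tokens=False, truncation=True,
                       max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    return softmax(logits)[POSITIVE_INDEX]

def main():
    """Download the model and score one example pair."""
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
    model.eval()
    print(score_pair("Sum two integers", "def sum(a, b):\n    return a + b",
                     tokenizer, model))
```

Calling `main()` downloads the weights and prints a single score. To filter a dataset, one option is to keep only pairs whose score exceeds a threshold chosen on held-out data.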


This model is trained on a 5M subset of The Vault in a self-supervised manner. Because the negative samples are generated artificially, the model's ability to identify cases that require a strong semantic understanding of the relationship between the code and the docstring may be limited.

The model is hard to evaluate because no labeled datasets are available. GPT-3.5-turbo is therefore used as a reference, and we measure the correlation between the model's scores and GPT-3.5-turbo's scores. However, the result could be influenced by GPT-3.5-turbo's potential biases and by ambiguous cases. We therefore recommend building a human-labeled dataset and fine-tuning this model on it to achieve the best results.
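
The comparison described above amounts to computing a correlation between two lists of scores. The exact metric is not specified here, so the sketch below uses Pearson correlation as an assumption. The score values are made-up placeholders for illustration, not results.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)

# Placeholder values only: hypothetical model probabilities and reference scores.
model_scores = [0.9, 0.2, 0.7, 0.4]
reference_scores = [9, 3, 6, 5]
r = pearson(model_scores, reference_scores)
```

A rank-based metric such as Spearman correlation may be a better fit when the two score scales differ, since it depends only on ordering.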

Additional information

Licensing Information

MIT License

Citation Information

  title={The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation},
  author={Manh, Dung Nguyen and Hai, Nam Le and Dau, Anh TV and Nguyen, Anh Minh and Nghiem, Khanh and Guo, Jin and Bui, Nghi DQ},
  journal={arXiv preprint arXiv:2305.06156},