|
--- |
|
language: |
|
- code |
|
- en |
|
task_categories: |
|
- text-classification |
|
tags: |
|
- arxiv:2305.06156 |
|
license: mit |
|
metrics: |
|
- accuracy |
|
widget: |
|
- text: |- |
|
Sum two integers</s></s>def sum(a, b): |
|
return a + b |
|
example_title: Simple toy |
|
- text: |- |
|
Look for methods that might be dynamically defined and define them for lookup.</s></s>def respond_to_missing?(name, include_private = false) |
|
if name == :to_ary || name == :empty? |
|
false |
|
else |
|
return true if mapping(name).present? |
|
mounting = all_mountings.find{ |mount| mount.respond_to?(name) } |
|
return false if mounting.nil? |
|
end |
|
end |
|
example_title: Ruby example |
|
- text: |- |
|
Method that adds a candidate to the party @param c the candidate that will be added to the party</s></s>public void addCandidate(Candidate c) |
|
{ |
|
this.votes += c.getVotes(); |
|
candidates.add(c); |
|
} |
|
example_title: Java example |
|
- text: |- |
|
we do not need Buffer pollyfill for now</s></s>function(str){ |
|
var ret = new Array(str.length), len = str.length; |
|
while(len--) ret[len] = str.charCodeAt(len); |
|
return Uint8Array.from(ret); |
|
} |
|
example_title: JavaScript example |
|
|
|
pipeline_tag: text-classification |
|
--- |
|
|
|
|
|
|
|
## Table of Contents |
|
- [Model Description](#model-description) |
|
- [Model Details](#model-details) |
|
- [Usage](#usage) |
|
- [Limitations](#limitations) |
|
- [Additional Information](#additional-information) |
|
- [Licensing Information](#licensing-information) |
|
- [Citation Information](#citation-information) |
|
|
|
|
|
## Model Description |
|
|
|
This model is built on [CodeBERT](https://github.com/microsoft/CodeBERT) and trained on a 5M-sample subset of [The Vault](https://huggingface.co/datasets/Fsoft-AIC/thevault-function-level) to detect inconsistencies between a docstring/comment and its function. It is used to remove noisy examples from The Vault dataset.
|
|
|
More information: |
|
- **Repository:** [FSoft-AI4Code/TheVault](https://github.com/FSoft-AI4Code/TheVault) |
|
- **Paper:** [The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation](https://arxiv.org/abs/2305.06156) |
|
- **Contact:** support.ailab@fpt.com |
|
|
|
|
|
## Model Details |
|
* Developed by: [Fsoft AI Center](https://www.fpt-aicenter.com/ai-residency/) |
|
* License: MIT |
|
* Model type: Transformer encoder-based language model
|
* Architecture: BERT-base |
|
* Data set: [The Vault](https://huggingface.co/datasets/Fsoft-AIC/thevault-function-level) |
|
* Tokenizer: Byte Pair Encoding |
|
* Vocabulary Size: 50265 |
|
* Sequence Length: 512 |
|
* Languages: English and 10 programming languages (Python, Java, JavaScript, PHP, C#, C, C++, Go, Rust, Ruby)
|
* Training details: |
|
* Self-supervised learning, binary classification |
|
* Positive class: Original code-docstring pair |
|
* Negative class: Randomly paired code and docstring (a construction sketch follows this list)
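
The exact negative-sampling procedure is not reproduced here; below is a minimal sketch of how such training pairs could be constructed. The function name, the 1:1 positive-to-negative ratio, and the fixed seed are illustrative assumptions, not details from the paper.

```python
import random

def make_training_pairs(docstrings, functions, seed=42):
    """Build (docstring, code, label) examples: label 1 for the original
    pairing, label 0 for a randomly mismatched docstring."""
    rng = random.Random(seed)
    examples = []
    for i, (doc, code) in enumerate(zip(docstrings, functions)):
        examples.append((doc, code, 1))  # positive: the original pair
        # negative: pair the same code with a docstring from another sample
        j = rng.randrange(len(docstrings) - 1)
        if j >= i:
            j += 1  # skip the index of the original docstring
        examples.append((docstrings[j], code, 0))
    rng.shuffle(examples)
    return examples
```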
|
|
|
## Usage |
|
The model input follows the template below:
|
```python
# Input template:
#   <s>{docstring}</s></s>{code}</s>

from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("Fsoft-AIC/Codebert-docstring-inconsistency")

# Example: docstring and code joined according to the template
text = "<s>Sum two integers</s></s>def sum(a, b):\n    return a + b</s>"
tokenized_input = tokenizer(text, add_special_tokens=False)
```
|
|
|
Using the model with JAX or PyTorch:
|
```python
from transformers import AutoModelForSequenceClassification, FlaxAutoModelForSequenceClassification

# Load the model with JAX (Flax)
model = FlaxAutoModelForSequenceClassification.from_pretrained("Fsoft-AIC/Codebert-docstring-inconsistency")

# Load the model with PyTorch
model = AutoModelForSequenceClassification.from_pretrained("Fsoft-AIC/Codebert-docstring-inconsistency")
```
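
Below is a minimal end-to-end scoring sketch with PyTorch. Which class index corresponds to a consistent pair is an assumption here; check `model.config.id2label` on the downloaded checkpoint before relying on it. Pairs whose consistency probability falls below a chosen threshold can then be dropped, mirroring how the model was used to filter noisy examples from The Vault.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("Fsoft-AIC/Codebert-docstring-inconsistency")
model = AutoModelForSequenceClassification.from_pretrained("Fsoft-AIC/Codebert-docstring-inconsistency")
model.eval()

docstring = "Sum two integers"
code = "def sum(a, b):\n    return a + b"
text = f"<s>{docstring}</s></s>{code}</s>"  # the model's input template

# The template already contains the special tokens, so add_special_tokens=False
inputs = tokenizer(text, add_special_tokens=False, truncation=True,
                   max_length=512, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1).squeeze()
print(model.config.id2label, probs.tolist())
```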
|
|
|
## Limitations |
|
This model is trained on a 5M-sample subset of The Vault in a self-supervised manner. Since the negative samples are generated artificially, the model's ability to identify instances that require a strong semantic understanding of the relationship between the code and the docstring may be limited.
|
|
|
The model is hard to evaluate because no labeled dataset is available. GPT-3.5-turbo was adopted as a reference, and the correlation between the model's scores and GPT-3.5-turbo's was measured. However, that result could be influenced by GPT-3.5-turbo's potential biases and ambiguous judgments. We therefore recommend building a human-labeled dataset and fine-tuning this model on it to achieve the best results; a minimal fine-tuning sketch follows.
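
As a sketch of that recommendation, assuming a small list of human-labeled `(docstring, code, label)` triples: the dataset, hyperparameters, and output directory below are all illustrative, not a prescribed recipe.

```python
import torch
from torch.utils.data import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

class PairDataset(Dataset):
    """Human-labeled (docstring, code, label) pairs; label 1 = consistent."""
    def __init__(self, pairs, tokenizer):
        self.pairs, self.tokenizer = pairs, tokenizer
    def __len__(self):
        return len(self.pairs)
    def __getitem__(self, idx):
        doc, code, label = self.pairs[idx]
        enc = self.tokenizer(f"<s>{doc}</s></s>{code}</s>",
                             add_special_tokens=False, truncation=True,
                             max_length=512, padding="max_length")
        item = {k: torch.tensor(v) for k, v in enc.items()}
        item["labels"] = torch.tensor(label)
        return item

tokenizer = AutoTokenizer.from_pretrained("Fsoft-AIC/Codebert-docstring-inconsistency")
model = AutoModelForSequenceClassification.from_pretrained("Fsoft-AIC/Codebert-docstring-inconsistency")

# Replace with real human-labeled data
labeled = [("Sum two integers", "def sum(a, b):\n    return a + b", 1)]

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="codebert-inconsistency-finetuned",
                           num_train_epochs=3,
                           per_device_train_batch_size=16,
                           learning_rate=2e-5),
    train_dataset=PairDataset(labeled, tokenizer),
)
trainer.train()
```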
|
|
|
## Additional Information
|
### Licensing Information |
|
|
|
MIT License |
|
|
|
### Citation Information |
|
|
|
``` |
|
@article{manh2023vault, |
|
title={The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation}, |
|
author={Manh, Dung Nguyen and Hai, Nam Le and Dau, Anh TV and Nguyen, Anh Minh and Nghiem, Khanh and Guo, Jin and Bui, Nghi DQ}, |
|
journal={arXiv preprint arXiv:2305.06156}, |
|
year={2023} |
|
} |
|
``` |