# Model Card for DocuMint

This model is a fine-tuned version of the CodeGemma-2B base model that generates high-quality docstrings for Python functions.
## Model Details

### Model Description
The DocuMint model is a fine-tuned variant of Google's CodeGemma-2B base model, which was originally trained to predict the next token on internet text without any instructions. The DocuMint model has been fine-tuned using supervised instruction fine-tuning on a dataset of 100,000 Python functions and their respective docstrings extracted from the Free and open-source software (FOSS) ecosystem. The fine-tuning was performed using Low-Rank Adaptation (LoRA).
The goal of the DocuMint model is to generate docstrings that are concise (brief and to the point), complete (cover functionality, parameters, return values, and exceptions), and clear (use simple language and avoid ambiguity).
- Developed by: Bibek Poudel, Adam Cook, Sekou Traore, Shelah Ameli (University of Tennessee, Knoxville)
- Model type: Causal language model fine-tuned for code documentation generation
- Language(s) (NLP): English, Python
- License: MIT
- Finetuned from model: google/codegemma-2b
### Model Sources
- Repository: GitHub
- Paper: [DocuMint: Docstring Generation for Python using Small Language Models](https://arxiv.org/abs/2405.10243)
## Uses

### Direct Use
The DocuMint model can be used directly to generate high-quality docstrings for Python functions. Given a Python function definition, the model will output a docstring in the format `"""<generated docstring>"""`.
## Fine-tuning Details

### Fine-tuning Data
The fine-tuning data consists of 100,000 Python functions and their docstrings extracted from popular open-source repositories in the FOSS ecosystem. Repositories were filtered based on metrics such as number of contributors (> 50), commits (> 5k), stars (> 35k), and forks (> 10k) to focus on well-established and actively maintained projects.
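Expressed as code, the filtering criteria amount to a simple predicate over repository metadata; the dictionary keys below are illustrative and do not reflect the authors' actual pipeline:

```python
def is_well_established(repo: dict) -> bool:
    """Check a repository against the thresholds used to build the dataset."""
    return (
        repo["contributors"] > 50
        and repo["commits"] > 5_000
        and repo["stars"] > 35_000
        and repo["forks"] > 10_000
    )
```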
### Fine-tuning Hyperparameters
| Hyperparameter | Value |
|---|---|
| Fine-tuning Method | LoRA |
| Epochs | 4 |
| Batch Size | 8 |
| Gradient Accumulation Steps | 16 |
| Initial Learning Rate | 2e-4 |
| LoRA Parameters | 78,446,592 |
| Training Tokens | 185,040,896 |
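For reference, a minimal PEFT setup consistent with this table might look like the sketch below. The LoRA rank, alpha, and target modules are assumptions (the card does not list them); only the epochs, batch size, gradient accumulation steps, and learning rate come from the table above.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

base = AutoModelForCausalLM.from_pretrained("google/codegemma-2b")

# LoRA rank, alpha, and target modules are illustrative assumptions,
# not the authors' documented values.
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)

# These values come directly from the hyperparameter table.
training_args = TrainingArguments(
    output_dir="documint-lora",
    num_train_epochs=4,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=16,
    learning_rate=2e-4,
)
```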
## Evaluation

### Metrics
- **Accuracy:** Measures the coverage of the generated docstring over code elements such as input/output variables. Calculated using cosine similarity between the generated and expert docstring embeddings.
- **Conciseness:** Measures the ability to convey information succinctly, without verbosity. Calculated as a compression ratio between the compressed and original docstring sizes.
- **Clarity:** Measures readability through simple, unambiguous language. Calculated using the Flesch-Kincaid readability score.
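A rough, self-contained implementation of these metrics is sketched below; the choice of embedding model and the use of zlib for compression are assumptions, not the authors' exact tooling:

```python
import zlib

import textstat
from sentence_transformers import SentenceTransformer, util

# Embedding model is an assumption; any sentence encoder could be substituted.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def accuracy(generated: str, expert: str) -> float:
    """Cosine similarity between generated and expert docstring embeddings."""
    embeddings = embedder.encode([generated, expert])
    return float(util.cos_sim(embeddings[0], embeddings[1]))

def conciseness(docstring: str) -> float:
    """Compression ratio: zlib-compressed size over original size."""
    raw = docstring.encode("utf-8")
    return len(zlib.compress(raw)) / len(raw)

def clarity(docstring: str) -> float:
    """Flesch-Kincaid grade level of the docstring text."""
    return textstat.flesch_kincaid_grade(docstring)
```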
## Model Inference
For running inference, PEFT must be used to load the fine-tuned model on top of the base model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

device = "cuda"  # or "cpu"

# Load the base model and tokenizer, then attach the fine-tuned adapter.
tokenizer = AutoTokenizer.from_pretrained("google/codegemma-2b")
base_model = AutoModelForCausalLM.from_pretrained("google/codegemma-2b", device_map=device)
fine_tuned_model = PeftModel.from_pretrained(base_model, "documint/CodeGemma2B-fine-tuned", device_map=device)
```
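Once loaded, docstring generation follows the standard `transformers` generation loop. The prompt below simply feeds the raw function definition; the exact instruction template used during fine-tuning is not documented on this card, so treat the format as an assumption:

```python
function = '''def add(a, b):
    return a + b'''

inputs = tokenizer(function, return_tensors="pt").to(device)
outputs = fine_tuned_model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```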
## Hardware
Fine-tuning was performed using an Intel 12900K CPU, an Nvidia RTX 3090 GPU, and 64 GB of RAM. Total fine-tuning time was 48 GPU hours.
## Citation
BibTeX:

```bibtex
@article{poudel2024documint,
  title={DocuMint: Docstring Generation for Python using Small Language Models},
  author={Poudel, Bibek and Cook, Adam and Traore, Sekou and Ameli, Shelah},
  journal={arXiv preprint arXiv:2405.10243},
  year={2024}
}
```
## Model Card Contact
For questions or more information, please contact: {bpoudel3,acook46,staore1,oameli}@vols.utk.edu