CodeEncoderDecoderModel-Ghost-large 👻

A multilingual encoder-decoder model for generating docstrings from code snippets.
It pairs a custom BERT-style encoder pretrained on source code (CodeModernBERT-Ghost) with a GPT2-large decoder.

๐Ÿ—๏ธ Model Architecture

🎯 Intended Use

  • Generating docstrings (documentation comments) for functions or methods in multiple languages.
  • Summarizing code for educational or review purposes.
  • Assisting in automated documentation generation pipelines (see the extraction sketch at the end of this section).

Supported languages (code input):

  • Python
  • Java
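
To support the pipeline use case mentioned above, functions first have to be isolated from a source file. Here is a minimal, hypothetical extraction helper using Python's standard ast module (extract_functions and the file path are illustrative names, not part of this repository):

import ast

def extract_functions(path: str) -> list[str]:
    """Return the source of every top-level function in a Python file."""
    source = open(path, encoding="utf-8").read()
    tree = ast.parse(source)
    return [
        ast.get_source_segment(source, node)
        for node in tree.body
        if isinstance(node, ast.FunctionDef)
    ]

# Each extracted snippet can then be fed to the model as shown in "How to Use" below.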

📦 How to Use

from transformers import AutoTokenizer, EncoderDecoderModel
import torch

# Load the model and the separate encoder/decoder tokenizers from the Hub.
model = EncoderDecoderModel.from_pretrained("Shuu12121/CodeEncoderDecoderModel-Ghost-large").to("cuda")
model.eval()
encoder_tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeEncoderDecoderModel-Ghost-large", subfolder="encoder_tokenizer")
decoder_tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeEncoderDecoderModel-Ghost-large", subfolder="decoder_tokenizer")

# GPT-2 has no pad token by default, so fall back to EOS.
if decoder_tokenizer.pad_token is None:
    decoder_tokenizer.pad_token = decoder_tokenizer.eos_token

code = '''
def greet(name):
    return f"Hello, {name}!"
'''

# Tokenize the code with the encoder tokenizer (up to 2048 input tokens).
inputs = encoder_tokenizer(code, return_tensors="pt", truncation=True, padding=True, max_length=2048).to("cuda")

# Beam-search generation with n-gram repetition suppression.
outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_length=256,
    num_beams=5,
    early_stopping=True,
    decoder_start_token_id=model.config.decoder_start_token_id,
    eos_token_id=model.config.eos_token_id,
    pad_token_id=model.config.pad_token_id,
    no_repeat_ngram_size=2,
)

# Decode the best beam, dropping special tokens.
docstring = decoder_tokenizer.decode(outputs[0], skip_special_tokens=True)
print(docstring)
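
Since Java is also a supported input language, the same objects can be reused for Java code. The helper below is a hypothetical convenience wrapper around the exact generation call shown above; the Java method is only an illustrative input:

def generate_docstring(code_snippet: str) -> str:
    # Reuses model, encoder_tokenizer, and decoder_tokenizer loaded above.
    inputs = encoder_tokenizer(
        code_snippet, return_tensors="pt", truncation=True, padding=True, max_length=2048
    ).to("cuda")
    outputs = model.generate(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_length=256,
        num_beams=5,
        early_stopping=True,
        decoder_start_token_id=model.config.decoder_start_token_id,
        eos_token_id=model.config.eos_token_id,
        pad_token_id=model.config.pad_token_id,
        no_repeat_ngram_size=2,
    )
    return decoder_tokenizer.decode(outputs[0], skip_special_tokens=True)

java_code = '''
public static int add(int a, int b) {
    return a + b;
}
'''
print(generate_docstring(java_code))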

🧪 Training Details

  • Task: code-to-docstring generation
  • Dataset: CodeXGLUE Code-to-Text (subsets for Python, Java, JavaScript, Go, Ruby, and PHP)
  • Loss: cross-entropy over the tokenized docstrings (see the sketch after this list)
  • Sequence lengths: max input 2048 tokens (encoder), max output 256 tokens (decoder)
  • Decoder modifications: GPT2-large adapted with a padding token and cross-attention over the encoder output
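
To make the loss and length settings concrete, here is a minimal sketch of a single training step on one (code, docstring) pair, reusing the objects from "How to Use". This illustrates standard EncoderDecoderModel training, not the author's released script:

model.train()  # switch back from the eval() mode used above
code = 'def greet(name):\n    return f"Hello, {name}!"'
reference = "Return a greeting for the given name."  # illustrative target docstring

enc = encoder_tokenizer(code, return_tensors="pt", truncation=True, max_length=2048).to("cuda")
dec = decoder_tokenizer(reference, return_tensors="pt", truncation=True, max_length=256).to("cuda")

labels = dec.input_ids.clone()
labels[labels == decoder_tokenizer.pad_token_id] = -100  # padding is ignored by the cross-entropy

outputs = model(
    input_ids=enc.input_ids,
    attention_mask=enc.attention_mask,
    labels=labels,  # the model shifts labels internally to build decoder inputs
)
outputs.loss.backward()  # cross-entropy over the tokenized docstring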

โš ๏ธ Limitations & Risks

  1. Generated documentation may be inaccurate, incomplete, or misleading. Always review generated docstrings manually.
  2. Formatting may not follow specific standards (e.g., Google or NumPy style in Python, or full Javadoc).
  3. Limited context: Only considers single-function input; lacks broader project-level understanding.
  4. Language variance: Performance may differ depending on the programming language due to data distribution.
  5. โš ๏ธ Decoder risks (GPT2-large):
    GPT-2 models are known to sometimes generate inappropriate, offensive, or biased outputs, depending on the prompt.
    Although this model is fine-tuned on technical datasets (code-docstring pairs), due to inherited properties from gpt2-large, similar risks may still be present in edge cases. Please exercise caution, especially when using the model in public or educational settings.

📄 License

Apache-2.0
The model weights and tokenizer artifacts are released under the Apache-2.0 license. You are free to use, modify, and redistribute them, provided you retain the required attribution and license notices.
