---
license: apache-2.0
language:
- en
- multilingual
tags:
- code-to-docstring
- code-summarization
- code-documentation
- encoder-decoder
- code
- python
- java
- transformers
- huggingface
- modernbert
- gpt2
base_model:
- Shuu12121/CodeModernBERT-Ghost
- openai-community/gpt2-large
pipeline_tag: text2text-generation
---

# CodeEncoderDecoderModel-Ghost-large👻

A multilingual encoder-decoder model for generating **docstrings from code snippets**. It pairs a custom BERT-style encoder pretrained on source code (`CodeModernBERT-Ghost`) with a large-scale decoder model (`GPT2-large`).

## 🏗️ Model Architecture

- **Encoder:** [`Shuu12121/CodeModernBERT-Ghost`](https://huggingface.co/Shuu12121/CodeModernBERT-Ghost)
- **Decoder:** [`openai-community/gpt2-large`](https://huggingface.co/openai-community/gpt2-large)
- Connected via Hugging Face's `EncoderDecoderModel` with cross-attention.

## 🎯 Intended Use

- Generating docstrings (documentation comments) for functions or methods in multiple languages.
- Summarizing code for educational or review purposes.
- Assisting in automated documentation generation pipelines.

Supported languages (code input):

- Python
- Java

## 📦 How to Use

```python
from transformers import AutoTokenizer, EncoderDecoderModel
import torch

# Load the assembled encoder-decoder model and its two tokenizers from the Hub
model = EncoderDecoderModel.from_pretrained("Shuu12121/CodeEncoderDecoderModel-Ghost-large").to("cuda")
encoder_tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeEncoderDecoderModel-Ghost-large", subfolder="encoder_tokenizer")
decoder_tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeEncoderDecoderModel-Ghost-large", subfolder="decoder_tokenizer")

# GPT-2 has no padding token by default, so reuse the EOS token
if decoder_tokenizer.pad_token is None:
    decoder_tokenizer.pad_token = decoder_tokenizer.eos_token

code = '''
def greet(name):
    return f"Hello, {name}!"
'''

# Encode the source code for the encoder side
inputs = encoder_tokenizer(code, return_tensors="pt", truncation=True, padding=True, max_length=2048).to("cuda")

# Generate a docstring with beam search and a no-repeat n-gram constraint
outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_length=256,
    num_beams=5,
    early_stopping=True,
    decoder_start_token_id=model.config.decoder_start_token_id,
    eos_token_id=model.config.eos_token_id,
    pad_token_id=model.config.pad_token_id,
    no_repeat_ngram_size=2
)

docstring = decoder_tokenizer.decode(outputs[0], skip_special_tokens=True)
print(docstring)
```

## 🧪 Training Details

- **Task:** Code-to-docstring generation
- **Dataset:** [CodeXGLUE: Code-to-Text](https://github.com/microsoft/CodeXGLUE), using the Python, Java, JavaScript, Go, Ruby, and PHP subsets
- **Loss:** Cross-entropy loss over tokenized docstrings
- **Max input length:** 2048 (encoder); max output length: 256 (decoder)
- **Decoder modifications:** GPT2-large adapted for decoding with a padding token and cross-attention layers
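The checkpoint on the Hub is already assembled, so the usage snippet above is all that is needed for inference. For reference, the sketch below shows how an encoder-decoder pair like this one can be composed from the two base models with `EncoderDecoderModel.from_encoder_decoder_pretrained` and run with a cross-entropy loss over a docstring, as described in the training details. It is a minimal illustration rather than the actual training script; the special-token settings and the example code/docstring pair are assumptions.

```python
from transformers import AutoTokenizer, EncoderDecoderModel

ENCODER = "Shuu12121/CodeModernBERT-Ghost"
DECODER = "openai-community/gpt2-large"

# Compose the pair; from_encoder_decoder_pretrained loads the decoder with
# is_decoder=True and add_cross_attention=True (cross-attention weights are new).
model = EncoderDecoderModel.from_encoder_decoder_pretrained(ENCODER, DECODER)

encoder_tok = AutoTokenizer.from_pretrained(ENCODER)
decoder_tok = AutoTokenizer.from_pretrained(DECODER)
if decoder_tok.pad_token is None:
    decoder_tok.pad_token = decoder_tok.eos_token  # GPT-2 ships without a pad token

# Special tokens the seq2seq wrapper needs for label shifting and generation
# (assumed values; the released checkpoint stores its own configuration).
model.config.decoder_start_token_id = decoder_tok.bos_token_id
model.config.eos_token_id = decoder_tok.eos_token_id
model.config.pad_token_id = decoder_tok.pad_token_id

# One illustrative code/docstring pair (hypothetical training example)
code = 'def greet(name):\n    return f"Hello, {name}!"'
docstring = "Return a greeting for the given name."

enc = encoder_tok(code, return_tensors="pt", truncation=True, max_length=2048)
labels = decoder_tok(docstring, return_tensors="pt", truncation=True, max_length=256).input_ids
# In a real batch, padded label positions should be set to -100 so the loss ignores them.

out = model(input_ids=enc.input_ids, attention_mask=enc.attention_mask, labels=labels)
print(out.loss)  # cross-entropy over the docstring tokens
```

From here, a standard training loop (for example with `Seq2SeqTrainer`) over the code/docstring pairs minimizes this loss; the released checkpoint is then used exactly as in the usage example above.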
## ⚠️ Limitations & Risks

1. **Generated documentation may be inaccurate, incomplete, or misleading.** Always review generated docstrings manually.
2. **Formatting may not follow specific standards** (e.g., Google/NumPy style in Python or full Javadoc).
3. **Limited context:** Only considers single-function input; lacks broader project-level understanding.
4. **Language variance:** Performance may differ depending on the programming language due to data distribution.
5. **⚠️ Decoder risks (GPT2-large):** GPT-2 models are known to sometimes generate inappropriate, offensive, or biased outputs, depending on the prompt. Although this model is fine-tuned on technical data (code-docstring pairs), similar risks **may still be present** in edge cases due to properties inherited from `gpt2-large`.

Please exercise caution, especially when using the model in public or educational settings.

## 📄 License

Apache-2.0

Model weights and tokenizer artifacts are released under the same license. You are free to use, modify, and redistribute with attribution.