GPT2 PyCode

This model is a fine-tuned version of the GPT-2 124M model, adapted for Python code generation and intended for testing purposes. It was trained on a small corpus of 25,000 Python code samples.

Model Description

This project features a GPT-2 (Generative Pre-trained Transformer 2) language model with 124 million parameters, fine-tuned for Python code generation. Unlike the larger GPT-2 variants or GPT-3, it is a small-scale model designed primarily for testing and experimental purposes.

Developed by: Maharnab Saikia
Model type: Language model
Language(s) (NLP): English
License: MIT
Finetuned from model: GPT2 124M

Uses

Research: Studying the behavior of small-scale language models in code generation tasks
Benchmarking: Providing a baseline for comparing different model architectures or training strategies
Rapid Prototyping: Quick tests of code generation ideas without the overhead of larger models
Education: Demonstrating the principles of fine-tuning language models for specific tasks

Bias, Risks, and Limitations

It's crucial to understand the limitations of this model:

Limited knowledge base due to the small training corpus
May struggle with complex or specialized Python code
Not suitable for production-level code generation tasks
Performance will likely be significantly lower than larger, more comprehensively trained models
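
Because of these limitations, generated code is best treated as untrusted text until it has been checked. A minimal sketch of one such check follows: it only tests whether the output parses as Python and says nothing about whether the code is correct. The helper name is_valid_python and both sample strings are hypothetical, written by hand for illustration rather than produced by the model.

```python
import ast


def is_valid_python(source: str) -> bool:
    """Return True if `source` parses as Python syntax (not a correctness check)."""
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        return False
    return True


# Hypothetical outputs, hand-written for illustration (not real model output).
good_sample = "def reverse(s):\n    return s[::-1]\n"
bad_sample = "def reverse(s)\n    return s[::-1]\n"  # missing colon

print(is_valid_python(good_sample))
print(is_valid_python(bad_sample))
```

A syntax check like this only filters out malformed output; anything that passes still needs review or tests before use.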

How to Get Started with the Model

Use the code below to get started with the model.

from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch
import re


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = GPT2Tokenizer.from_pretrained('maharnab/gpt2_pycode')
model = GPT2LMHeadModel.from_pretrained('maharnab/gpt2_pycode')
model.to(device)

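prompt = "How to reverse a string in Python."
# Wrap the prompt in the chat-style tags used by the model.
# Note: max_length=20 with truncation=True cuts longer prompts, which can drop the trailing <assistant> tag.
encoded_input = tokenizer.encode_plus(f"<sos><user>{prompt}</user><assistant>", max_length=20, truncation=True, return_tensors="pt").to(device)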

input_ids = encoded_input['input_ids']
attention_mask = encoded_input['attention_mask']

output = model.generate(
    input_ids, 
    max_length=512, 
    num_return_sequences=1, 
    no_repeat_ngram_size=2,
    temperature=0.7,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    attention_mask=attention_mask,
    pad_token_id=tokenizer.pad_token_id
)

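generated_code = tokenizer.decode(output[0])
# Generation can stop at max_length before </assistant> is emitted, so guard against a failed match.
match = re.search(r'<assistant>(.*?)</assistant>', generated_code, re.DOTALL)
generated_code = match.group(1) if match else generated_code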

print(f"Prompt: {prompt}\nGenerated Code:\n{generated_code}")
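
The snippet above wraps the prompt in <sos><user>...</user><assistant> tags and then pulls the text between <assistant> and </assistant> out of the decoded output. As a minimal sketch, those two steps can be separated into small helpers that run without loading the model. The function names build_prompt and extract_response are hypothetical, and the decoded string below is hand-written for illustration.

```python
import re
from typing import Optional


def build_prompt(user_text: str) -> str:
    """Wrap a user request in the tag format shown in the example above."""
    return f"<sos><user>{user_text}</user><assistant>"


def extract_response(decoded: str) -> Optional[str]:
    """Return the text between <assistant> tags, or None if the closing tag is missing."""
    match = re.search(r"<assistant>(.*?)</assistant>", decoded, re.DOTALL)
    return match.group(1).strip() if match else None


# Hand-written stand-in for a decoded model output (not real model output).
decoded = build_prompt("How to reverse a string in Python.") + "s[::-1]</assistant>"

print(extract_response(decoded))
print(extract_response("<assistant>unfinished output"))
```

Returning None when the closing tag is absent lets the caller decide whether to retry, fall back to the raw text, or discard the sample.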

Training Details

Training Data

Model: GPT-2 with 124 million parameters
Training Data: 25,000 Python code samples
Fine-tuning: Adapted specifically for Python code generation tasks

Training Hyperparameters

Epochs: 5
Batch Size: 8
Learning Rate: 5e-5
Context Window: 512 tokens
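
For a rough sense of scale, the hyperparameters above imply the following number of optimizer steps. This is a back-of-the-envelope sketch, assuming all 25,000 samples were used for training, a single device, and no gradient accumulation; none of these are stated in the card.

```python
import math

num_samples = 25_000    # from the card
batch_size = 8          # from the card
epochs = 5              # from the card
context_window = 512    # from the card, in tokens
grad_accum_steps = 1    # assumption: gradient accumulation is not mentioned

# One optimizer step consumes batch_size * grad_accum_steps samples.
steps_per_epoch = math.ceil(num_samples / (batch_size * grad_accum_steps))
total_steps = steps_per_epoch * epochs

# Upper bound on tokens per step, reached only if every sample fills the window.
max_tokens_per_step = batch_size * context_window

print(steps_per_epoch, total_steps, max_tokens_per_step)
```

If a validation split or gradient accumulation was used, the step counts would be lower.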

Environmental Impact

Carbon emissions were estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

Hardware Type: P100 GPU
Hours used: 5
Cloud Provider: Kaggle
Compute Region: South Asia
Carbon Emitted: 1.15 kg CO2 eq
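
Estimates of this kind are typically the product of hardware power draw, runtime, and the carbon intensity of the local grid. The sketch below is a rough consistency check rather than the calculator's own code. It assumes a P100 drawing 250 W for the full 5 hours and that the reported figure is in kg CO2 eq, then derives the grid intensity those numbers would imply.

```python
gpu_power_kw = 0.250        # assumption: P100 board power of 250 W at full draw
hours_used = 5              # from the card
reported_emissions = 1.15   # from the card; unit assumed to be kg CO2 eq

# Energy consumed by the GPU over the whole run.
energy_kwh = gpu_power_kw * hours_used

# Grid carbon intensity implied by the reported figure under the assumptions above.
implied_intensity = reported_emissions / energy_kwh

print(f"{energy_kwh:.2f} kWh, {implied_intensity:.2f} kg CO2 eq per kWh")
```

The implied intensity is a derived value, not one reported in the card, and it changes if the actual power draw differed from the assumed 250 W.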

Acknowledgements

This project builds upon the GPT-2 model developed by OpenAI. We acknowledge their groundbreaking work in the field of natural language processing.
