codeparrot-gpt2-mi50-eos-ft

A fine-tune of Marcoson320/codeparrot-gpt2-mi50 experimenting with adding end-of-sequence (<|endoftext|>) emission behavior to a code-completion model trained without document-boundary EOS tokens.

This is a feasibility study with partial success (3/5 on a small probe set). It is published for transparency about the method and its limitations.

Motivation

The base model was trained following the Hugging Face LLM Course, Chapter 7.6, whose tokenize function does not insert <|endoftext|> at document boundaries:

def tokenize(element):
    outputs = tokenizer(
        element["content"], truncation=True,
        max_length=128, return_overflowing_tokens=True, ...
    )
    # no EOS inserted between documents

The GPT-2 paper and karpathy/nanoGPT both insert an end-of-text token between documents. The course's simplification omits it, which produces a model that does not stop generating at natural boundaries.

This fine-tune attempts to retrofit the EOS signal post-hoc.
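
For comparison, here is a minimal sketch of the boundary-aware packing described above: each document's token ids are followed by the EOS id, and the concatenated stream is cut into fixed-length blocks. The function name pack_with_eos and the toy token ids are hypothetical. This is neither the course's code nor the pipeline used for the base model.

```python
from typing import Iterable, List


def pack_with_eos(
    docs_token_ids: Iterable[List[int]],
    eos_id: int,
    context_length: int,
) -> List[List[int]]:
    """Concatenate tokenized documents with an EOS id after each one,
    then split the stream into full blocks of `context_length` tokens.
    A trailing partial block is dropped."""
    stream: List[int] = []
    for ids in docs_token_ids:
        stream.extend(ids)
        stream.append(eos_id)  # explicit document boundary
    n_full = len(stream) // context_length
    return [
        stream[i * context_length:(i + 1) * context_length]
        for i in range(n_full)
    ]


# Toy example: two "documents", EOS id 0, blocks of 3 tokens.
blocks = pack_with_eos([[11, 12, 13], [14, 15]], eos_id=0, context_length=3)
print(blocks)
```

With this kind of packing the model sees the boundary token at every end of file during pre-training, which is the signal the base model never received.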

Fine-tune Configuration

| Item | Value |
| --- | --- |
| Base | Marcoson320/codeparrot-gpt2-mi50 (final checkpoint) |
| Optimizer | AdamW (β₁=0.9, β₂=0.999, weight_decay=0.1) |
| Learning rate | 5×10⁻⁵, cosine schedule, 100 warmup steps |
| Effective batch size | 256 (per_device_bs=32 × grad_accum=4 × world_size=2) |
| Steps | 4,000 |
| Precision | fp16 |
| Parallelism | DistributedDataParallel on 2 × MI50 |
| Wall clock | ~1h 44m |
| Final train_loss | 1.162 |
| Final eval_loss | 1.569 |
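
The batch-size arithmetic and the learning-rate schedule in the table can be checked with a few lines of plain Python. The schedule below is an assumed shape (linear warmup to the peak, then cosine decay to zero at the last step). The card states only "cosine schedule, 100 warmup steps", so the exact scheduler used in training may differ.

```python
import math

PER_DEVICE_BS = 32
GRAD_ACCUM = 4
WORLD_SIZE = 2
PEAK_LR = 5e-5
WARMUP_STEPS = 100
TOTAL_STEPS = 4000

# Sequences consumed per optimizer step, and over the whole run.
effective_batch = PER_DEVICE_BS * GRAD_ACCUM * WORLD_SIZE  # 256
sequences_seen = effective_batch * TOTAL_STEPS


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero at TOTAL_STEPS.
    Assumed shape; the exact scheduler used in training is not stated."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 50, 100, 2050, 4000):
    print(step, lr_at(step))
```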

Data preparation

Each training sample is a variable-length token slice (32–120 tokens) of a Python file, with <|endoftext|> (id 0) appended explicitly, then padded to 128 with a label mask of -100 (so padding does not contribute to loss).

This raises EOS signal density from < 0.04% (base training) to ~3%.
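
A minimal sketch of the sample construction described above, operating on a file that has already been tokenized. The pad token id, the random choice of slice start, and the helper name make_sample are assumptions for illustration. The published train_eos_v2.py may differ in all three.

```python
import random
from typing import Dict, List, Optional

EOS_ID = 0           # <|endoftext|> id stated above
PAD_ID = 0           # assumption: the pad token id is not given in this card
BLOCK_SIZE = 128
IGNORE_INDEX = -100  # label value ignored by the loss


def make_sample(
    file_ids: List[int],
    rng: random.Random,
    min_len: int = 32,
    max_len: int = 120,
) -> Optional[Dict[str, List[int]]]:
    """Build one padded training sample from a tokenized file.
    Returns None when the file is shorter than `min_len` tokens."""
    if len(file_ids) < min_len:
        return None
    length = rng.randint(min_len, min(max_len, len(file_ids)))
    start = rng.randint(0, len(file_ids) - length)
    content = file_ids[start:start + length] + [EOS_ID]
    n_pad = BLOCK_SIZE - len(content)
    return {
        "input_ids": content + [PAD_ID] * n_pad,
        "attention_mask": [1] * len(content) + [0] * n_pad,
        # The appended EOS keeps its real label; only padding is masked.
        "labels": content + [IGNORE_INDEX] * n_pad,
    }


# Toy "file" of 300 token ids; real input would come from the tokenizer.
sample = make_sample(list(range(1000, 1300)), random.Random(0))
print(len(sample["input_ids"]), sample["labels"].count(IGNORE_INDEX))
```

The detail that matters is the label at the EOS position: it must stay unmasked even if the pad id happens to equal the EOS id, otherwise the model gets no gradient for emitting EOS.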

Results

Both the base and the fine-tuned model were tested on five self-contained prompts. The table reports whether EOS was emitted within 80 generated tokens under greedy decoding with repetition_penalty=1.12 and no_repeat_ngram_size=4:

| Prompt | Base | EOS-FT |
| --- | --- | --- |
| def add(a, b):\n return a + b\n | ✗ | ✓ (pos 13) |
| def square(x):\n return x * x\n\n | ✗ | ✗ |
| def greet(name):\n print(f'Hello {name}')\n\n | ✗ | ✓ (pos 47) |
| import os\nprint(os.getcwd())\n | ✗ | ✗ |
| x = 1\ny = 2\nz = x + y\n | ✗ | ✓ (pos 61) |
| Total emit rate | 0/5 | 3/5 |
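
The check behind each table cell can be sketched as a scan over the newly generated token ids (prompt tokens excluded). The helper names are hypothetical, and the 0-based position convention is an assumption. The card does not say how "pos" is counted, and the published test_eos.py may count differently.

```python
from typing import Dict, List, Optional


def first_eos_position(
    generated_ids: List[int],
    eos_id: int = 0,
    max_new_tokens: int = 80,
) -> Optional[int]:
    """Return the 0-based index of the first EOS among the first
    `max_new_tokens` generated tokens, or None if there is none."""
    for pos, tok in enumerate(generated_ids[:max_new_tokens]):
        if tok == eos_id:
            return pos
    return None


def emit_rate(results: Dict[str, Optional[int]]) -> str:
    """Summarize per-prompt results as 'emitted/total'."""
    emitted = sum(pos is not None for pos in results.values())
    return f"{emitted}/{len(results)}"


# Toy token ids, not real model output.
toy = {
    "prompt_a": first_eos_position([5, 9, 0, 7]),
    "prompt_b": first_eos_position([5, 9, 7, 7]),
}
print(toy, emit_rate(toy))
```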

Known limitations

  • Partial success: 2 of 5 prompts still do not stop within 80 tokens.
  • EOS position quality is imperfect: the model sometimes emits EOS mid-expression rather than at a clean function/statement boundary. This is attributable to the data preparation, which uses random-length chunks rather than AST-based semantic units. A more rigorous approach would slice each file into complete FunctionDef / ClassDef blocks via ast.parse so the model only sees EOS at structural endpoints.
  • Code quality of pre-EOS content inherits the base model's small-scale artifacts (occasional Jupyter notebook markers, partial idioms).
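
The AST-based alternative mentioned in the second limitation could look roughly like the sketch below, which uses only the standard-library ast module. It has not been tried in this study. The helper name split_into_blocks is hypothetical, and restricting the split to top-level definitions is an assumption.

```python
import ast
from typing import List


def split_into_blocks(source: str) -> List[str]:
    """Return the source text of each top-level function or class.
    Files that fail to parse yield an empty list."""
    try:
        tree = ast.parse(source)
    except (SyntaxError, ValueError):
        return []
    wanted = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
    blocks = []
    for node in tree.body:
        if isinstance(node, wanted):
            segment = ast.get_source_segment(source, node)
            if segment is not None:
                blocks.append(segment)
    return blocks


example = (
    "import os\n"
    "\n"
    "def add(a, b):\n"
    "    return a + b\n"
    "\n"
    "class Greeter:\n"
    "    def greet(self, name):\n"
    "        print(name)\n"
)
for block in split_into_blocks(example):
    print(repr(block))
```

Each returned block would then be tokenized and followed by <|endoftext|>, so EOS only ever appears after a complete definition. Two open points: ast.get_source_segment starts at the def or class line, so decorators are not included, and blocks longer than the 128-token context would need to be dropped or handled separately.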

Usage

from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="Marcoson320/codeparrot-gpt2-mi50-eos-ft",
    device=0,
)

out = pipe(
    "def add(a, b):\n    return a + b\n",
    max_new_tokens=80,
    do_sample=False,
    repetition_penalty=1.12,
    no_repeat_ngram_size=4,
)
print(out[0]["generated_text"])

Reproduction

The training script and full method documentation are bundled on the project HTTP server (LAN only). The source files mirrored alongside this model are train_eos_v2.py and test_eos.py.
