GPT-2 Small — LoRA Screenplay Adapter (Apple Silicon / MPS)

Study Context: This is the second model in a dual-architecture comparative study on screenplay generation using GPT-2 Small. The first model was a full-parameter fine-tune executed on a cloud NVIDIA T4 GPU. This adapter was trained entirely on consumer edge hardware — an Apple Silicon MacBook Air — using Low-Rank Adaptation (LoRA) to operate within the hard constraints of a fanless, unified-memory device.

Loading Notice — This is a PEFT Adapter

This repository contains LoRA adapter weights only, not a standalone model. It holds no full checkpoint, so it cannot be loaded as a regular model with AutoModelForCausalLM.from_pretrained() alone. Load the base gpt2 model first, then wrap it with peft.PeftModel. See the Usage section for the correct loading pattern.

Model Description

This model is a LoRA (Low-Rank Adaptation) fine-tune of OpenAI's GPT-2 Small (124M parameters), targeting causal language modeling for professional screenplay generation. Rather than updating all 124M parameters, LoRA injects trainable rank-decomposition matrices only into the fused query/key/value projection (c_attn) of each attention block and leaves the base model frozen.

Property	Value
Base Model	GPT-2 Small (`openai-community/gpt2`)
Fine-tune Method	PEFT / LoRA (Low-Rank Adaptation)
Total Parameters	124,439,808 (base, frozen)
Trainable Parameters	294,912
% of Network Updated	0.2364% (of the 124,734,720 base + adapter parameters)
Target Modules	`c_attn` (attention projection layers)
Task	Causal Language Modeling / Screenplay Generation
Training Backend	MPS (Metal Performance Shaders) via PyTorch

By updating only 294,912 parameters instead of 124 million, the entire training run was made feasible on hardware that would otherwise fail within minutes under a full-parameter regime.
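
The trainable-parameter figure can be reproduced by hand. For each adapted layer, LoRA adds a matrix A of shape rank × in_features and a matrix B of shape out_features × rank. The sketch below assumes a rank of 8, which this card does not state but which is the value that yields 294,912 for GPT-2 Small's twelve c_attn layers (768 inputs, 3 × 768 outputs). The function name is hypothetical.

```python
def lora_param_count(n_layers, in_features, out_features, rank):
    """Parameters added by LoRA: A is (rank x in), B is (out x rank), per layer."""
    return n_layers * rank * (in_features + out_features)

BASE_PARAMS = 124_439_808  # frozen GPT-2 Small parameters, from the table above

trainable = lora_param_count(
    n_layers=12,
    in_features=768,
    out_features=3 * 768,  # c_attn emits query, key, and value in one projection
    rank=8,                # assumption: not stated in this card
)

# Share of trainable parameters relative to base + adapter weights.
pct_trainable = 100 * trainable / (BASE_PARAMS + trainable)

print(f"{trainable:,} trainable parameters ({pct_trainable:.4f}%)")
```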

Hardware & MLOps Optimizations

The Constraint Problem

A full-parameter fine-tuning attempt on the same MacBook Air was abandoned early. With gradient accumulation and a full optimizer state spanning all 124M parameters, the pipeline averaged 103 seconds per step. At that pace the same 4,700-step schedule would have taken roughly 134 hours, which is not operationally viable on a thermally passive device.
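
The projected duration follows directly from the two numbers above. A small sketch of the arithmetic, assuming the 103 seconds/step average would have held for the whole schedule:

```python
STEPS = 4_700
FULL_PARAM_SECONDS_PER_STEP = 103  # measured average from the abandoned run

# Extrapolate the abandoned run to the full step schedule.
projected_seconds = STEPS * FULL_PARAM_SECONDS_PER_STEP
projected_hours = projected_seconds / 3600

print(f"Projected full-parameter run: {projected_hours:.1f} hours")
```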

The LoRA Solution

Switching to LoRA resolved all three critical constraints at once:

Constraint	Full-Parameter Outcome	LoRA Outcome
Memory (OOM)	Frequent crashes	Stable — optimizer state ~2.3MB
Thermal throttling	Sustained throttle >30min	No throttling across 7h 51m run
Step throughput	~103 seconds/step	~6.01 seconds/step (17× faster)

The LoRA adapter's optimizer state is proportional only to trainable parameters (294,912), not the full network — this is what enabled the MPS backend to maintain sustained throughput on unified memory without page faults or thermal shutdown.
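
The optimizer-state estimate can be sketched as follows. AdamW keeps two moment estimates per trainable parameter; the sketch assumes each is stored as a 4-byte float32, which this card does not confirm (it only lists the default MPS precision). The function name is hypothetical.

```python
BYTES_PER_FLOAT32 = 4       # assumption: moments stored in float32
ADAMW_STATES_PER_PARAM = 2  # first and second moment estimates

def adamw_state_bytes(n_trainable):
    """Approximate AdamW state size for a given number of trainable parameters."""
    return n_trainable * ADAMW_STATES_PER_PARAM * BYTES_PER_FLOAT32

lora_mb = adamw_state_bytes(294_912) / 1e6
full_mb = adamw_state_bytes(124_439_808) / 1e6

print(f"LoRA optimizer state:           {lora_mb:.2f} MB")
print(f"Full-parameter optimizer state: {full_mb:.0f} MB")
```

Under that assumption the full-parameter optimizer state is several hundred times larger than the adapter's, before counting gradients and activations.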

Compute Profile

Property	Value
Hardware	Apple MacBook Air M2 Base (Unified Memory)
Compute Backend	PyTorch MPS (Metal Performance Shaders)
Precision	Default MPS precision
Optimizer	AdamW
Batch Size	`per_device_train_batch_size = 4`
Gradient Accumulation	Disabled (memory constraint)
Avg. Step Throughput	~6.01 seconds/step
Total Training Time	7 hours, 51 minutes, 2 seconds

Training Metrics

Dataset Coverage

The full screenplay corpus used in this study contains approximately 94 million tokens. Due to the step budget constraint of a local run, this adapter was trained on approximately 51% of the corpus (0.51 epoch coverage), whereas the full-parameter cloud model completed a full epoch.

Property	Value
Total Corpus Size	~94 million tokens
Epoch Coverage	0.51 (51% of corpus)
Total Steps	4,700

Loss Convergence

Metric	Value
Final Training Loss	1.9806
Final Evaluation Loss	2.4017

MLOps Trade-off Assessment

The train/eval loss gap (1.98 → 2.40) reflects two compounding constraints inherent to this training configuration:

Partial corpus coverage. At 0.51 epochs, the model has not converged on the full vocabulary and structural distribution of the screenplay corpus. The full-parameter cloud model, which completed a full epoch, achieved a final validation loss of 1.3194 — a gap of ~1.08 loss units attributable to both architecture and data coverage.
LoRA's intentional frozen-base trade-off. LoRA achieves its memory efficiency by keeping 99.76% of the network frozen. This is architecturally correct for adapter-based transfer learning, but imposes an upper bound on how deeply the model can reshape its internal representations compared to a full-parameter overwrite.

This is not a model failure. The adapter successfully acquired structural screenplay formatting conventions — scene sluglines, character cues, dialogue block structure — within a training envelope that would be impossible for full fine-tuning on the same device. It represents a calibrated, deliberate engineering trade-off: edge-feasibility over depth of convergence.
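
For readers who prefer perplexity, the reported losses can be converted with a plain exponential. This sketch assumes the losses are mean per-token cross-entropy in nats, which is the usual convention for causal language modeling losses in PyTorch but is not stated explicitly in this card.

```python
import math

# Loss values as reported in this card.
losses = {
    "lora_train": 1.9806,
    "lora_eval": 2.4017,
    "full_parameter_eval": 1.3194,
}

# Perplexity is exp(mean cross-entropy) when the loss is measured in nats.
perplexities = {name: math.exp(loss) for name, loss in losses.items()}

for name, ppl in perplexities.items():
    print(f"{name}: loss={losses[name]:.4f}  perplexity={ppl:.2f}")

# Gap between the two evaluation losses discussed above.
eval_gap = losses["lora_eval"] - losses["full_parameter_eval"]
print(f"eval loss gap: {eval_gap:.4f}")
```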

Usage & Inference

Installation

pip install transformers peft torch

Loading the Adapter

Because this is a PEFT adapter, loading requires two steps: initialize the frozen base model, then wrap it with the adapter weights.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer
from peft import PeftModel

base_model_id = "openai-community/gpt2"
adapter_id = "raghavnimbalkar/gpt2-screenplay-mac-lora"  

# --- Device Selection ---
if torch.backends.mps.is_available():
    device = torch.device("mps")       # Apple Silicon
elif torch.cuda.is_available():
    device = torch.device("cuda")      # NVIDIA GPU
else:
    device = torch.device("cpu")

print(f"Using device: {device}")

# --- Load Base Model + Adapter ---
tokenizer = GPT2Tokenizer.from_pretrained(base_model_id)

base_model = GPT2LMHeadModel.from_pretrained(base_model_id)
model = PeftModel.from_pretrained(base_model, adapter_id)

model = model.to(device)
model.eval()

Running Inference

prompt = "INT. ABANDONED WAREHOUSE - NIGHT\n\nRAIN hammers the corrugated roof. DETECTIVE COLE moves through the dark, flashlight cutting the shadows."

inputs = tokenizer(prompt, return_tensors="pt").to(device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_length=512,
        temperature=0.85,
        top_p=0.92,
        repetition_penalty=1.15,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))

Recommended Sampling Parameters

Parameter	Recommended Value	Notes
`max_length`	Up to `512`	Counts prompt plus generated tokens; well within GPT-2's 1,024-token context window
`temperature`	`0.85`	Allows moderate creative variance
`top_p`	`0.92`	Nucleus sampling threshold
`repetition_penalty`	`1.15`	Essential — prevents screenplay boilerplate looping
`do_sample`	`True`	Required for temperature/top_p sampling to activate
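
To make the roles of temperature and top_p concrete, here is a toy illustration in plain Python. It is a simplified sketch of temperature scaling followed by nucleus filtering, not the transformers implementation, and it does not model repetition_penalty. The function name, token strings, and logit values are all made up for the example.

```python
import math

def top_p_filter(logits, temperature=0.85, top_p=0.92):
    """Return renormalized probabilities over the smallest token set
    whose cumulative probability reaches top_p."""
    # Temperature below 1 sharpens the distribution; above 1 flattens it.
    scaled = {tok: value / temperature for tok, value in logits.items()}

    # Numerically stable softmax.
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    norm = sum(exps.values())
    probs = {tok: e / norm for tok, e in exps.items()}

    # Keep the most likely tokens until their mass reaches top_p.
    kept, mass = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        mass += p
        if mass >= top_p:
            break

    # Renormalize so the surviving tokens form a valid distribution.
    kept_mass = sum(kept.values())
    return {tok: p / kept_mass for tok, p in kept.items()}

# Hypothetical next-token logits after a scene break.
toy_logits = {"INT.": 3.0, "EXT.": 2.5, "FADE": 1.0, "CUT": 0.0, "zebra": -3.0}
candidates = top_p_filter(toy_logits)
print(candidates)
```

Unlikely continuations such as the "zebra" token fall outside the nucleus and can never be sampled, while the remaining candidates keep their relative odds.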

Optional: Merging Adapter Weights

If you want a single standalone model file (e.g., for faster inference without the PEFT library dependency), you can merge the adapter into the base model and save:

merged_model = model.merge_and_unload()
merged_model.save_pretrained("./screenplay-gpt2-lora-merged")
tokenizer.save_pretrained("./screenplay-gpt2-lora-merged")

Note: The merged model will be ~500MB (full GPT-2 Small size) rather than the ~1.2MB adapter. Merging folds the low-rank update into the base weights, so the merged model produces the same outputs as the PeftModel wrapper up to floating-point rounding. It is purely a deployment convenience.
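
Why merging preserves the outputs can be shown with a tiny worked example. Following the formulation in the LoRA paper, an adapted layer computes W·x + (alpha / r)·B·A·x, and merging replaces W with W + (alpha / r)·B·A. The matrices, alpha, and rank below are made-up toy values, and the helper functions are hypothetical; they are not taken from this adapter.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def scale(a, factor):
    """Multiply every element of a matrix by a scalar."""
    return [[x * factor for x in row] for row in a]

W = [[1.0, 2.0, 0.0], [0.0, 1.0, -1.0]]  # frozen base weight (2 outputs, 3 inputs)
B = [[0.5], [-1.0]]                      # LoRA B: outputs x rank
A = [[1.0, 0.0, 2.0]]                    # LoRA A: rank x inputs
alpha, rank = 2.0, 1                     # toy values
x = [[1.0], [2.0], [3.0]]                # input as a column vector

scaling = alpha / rank

# Adapter path: base output plus the scaled low-rank correction.
adapter_out = add(matmul(W, x), scale(matmul(B, matmul(A, x)), scaling))

# Merged path: fold the correction into the weight once, then apply it.
W_merged = add(W, scale(matmul(B, A), scaling))
merged_out = matmul(W_merged, x)

print(adapter_out, merged_out)
```

Both paths give the same result here because the toy values are exactly representable; with real float32 weights the two can differ by rounding error.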

Comparison with Full-Parameter Model

This adapter is one half of an ongoing comparative study. The table below summarizes the key architectural and performance differences between both trained models.

Property	Full-Parameter (Cloud)	LoRA Adapter (Local)
Hardware	NVIDIA T4 (Cloud)	Apple Silicon MacBook Air (MPS)
Trainable Params	124,439,808 (100%)	294,912 (0.24%)
Epoch Coverage	1.0 (full corpus)	0.51 (half corpus)
Total Steps	9,272	4,700
Training Time	7h 43m 30s	7h 51m 02s
Final Eval Loss	1.3194	2.4017
Step Throughput	~3.0s/step (T4)	~6.01s/step (MPS)
MLOps Event	Hardware preemption + hot-resume	17× speedup via LoRA optimization

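Both models trained for approximately the same wall-clock time, but the cloud run completed roughly twice as many steps and saw the full corpus. The divergence in final loss therefore reflects both full-parameter depth vs. adapter-based efficiency and the difference in data coverage, not a difference in time spent training.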
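
The throughput figures in the table can be cross-checked from the reported training times and step counts. A short sketch of that arithmetic, using only numbers from the table above (the dictionary keys and helper name are hypothetical):

```python
def to_seconds(hours, minutes, seconds):
    """Convert a wall-clock duration to seconds."""
    return hours * 3600 + minutes * 60 + seconds

runs = {
    "full_parameter_t4": {"seconds": to_seconds(7, 43, 30), "steps": 9_272},
    "lora_mps": {"seconds": to_seconds(7, 51, 2), "steps": 4_700},
}

# Average seconds per optimizer step for each run.
sec_per_step = {name: run["seconds"] / run["steps"] for name, run in runs.items()}

# Fraction of the cloud run's step count completed by the local run.
step_ratio = runs["lora_mps"]["steps"] / runs["full_parameter_t4"]["steps"]

for name, value in sec_per_step.items():
    print(f"{name}: {value:.2f} s/step")
print(f"step ratio: {step_ratio:.2f}")
```

The step ratio lines up with the 0.51 epoch coverage reported for the adapter, which is what one would expect if both runs consumed the same amount of data per step.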

Intended Use

Intended uses:

Screenplay drafting assistance and scene continuation on consumer hardware
Comparative reference point for PEFT vs. full fine-tuning studies on GPT-2
Offline, locally runnable script generation (no cloud dependency after download)
Research into LoRA effectiveness on structured, domain-specific creative text

Out-of-scope uses:

Production script generation without editorial review
Factual or knowledge-retrieval tasks
Any application requiring output truthfulness or citation

Bias, Risks, and Limitations

Trained on an unfiltered screenplay corpus; outputs may reflect mature themes, stereotypes, or biases present in the training data.
At 0.51 epoch coverage, the model's understanding of the full screenplay vocabulary distribution is incomplete. Long-form coherence is limited.
The train/eval loss gap suggests moderate overfitting on seen structural patterns. Outputs are more formulaic than the full-parameter counterpart.
No RLHF or safety fine-tuning has been applied.

Citation

@article{radford2019language,
  title   = {Language Models are Unsupervised Multitask Learners},
  author  = {Radford, Alec and Wu, Jeff and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya},
  year    = {2019},
  journal = {OpenAI Blog}
}

@article{hu2021lora,
  title   = {LoRA: Low-Rank Adaptation of Large Language Models},
  author  = {Hu, Edward J. and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu},
  year    = {2021},
  journal = {arXiv preprint arXiv:2106.09685}
}

Model Card Contact

For questions about methodology, training configuration, or the broader comparative study, please open an issue in this repository.
