Upload thompson stylometry model
- README.md +211 -0
- config.json +31 -0
- generation_config.json +6 -0
- loss_logs.csv +0 -0
- model.safetensors +3 -0
- training_state.pt +3 -0
README.md
ADDED
@@ -0,0 +1,211 @@
---
language: en
license: mit
tags:
- text-generation
- gpt2
- stylometry
- thompson
- authorship-attribution
- literary-analysis
- computational-linguistics
datasets:
- contextlab/thompson-corpus
library_name: transformers
pipeline_tag: text-generation
---

# ContextLab GPT-2 Ruth Plumly Thompson Stylometry Model

## Overview

This model is a GPT-2 language model trained exclusively on **13 books by Ruth Plumly Thompson** (1891-1976). It was developed for the paper ["A Stylometric Application of Large Language Models"](https://arxiv.org/abs/2510.21958) (Stropkay et al., 2025).

The model captures Ruth Plumly Thompson's unique writing style through intensive training on her corpus. By learning the statistical patterns, vocabulary, syntax, and thematic elements characteristic of Thompson's writing, this model enables:

- **Text generation** in the authentic style of Ruth Plumly Thompson
- **Authorship attribution** through cross-entropy loss comparison
- **Stylometric analysis** of literary works from early-to-mid 20th century America
- **Computational literary studies** exploring Thompson's distinctive voice

This model is part of a suite of 8 author-specific models developed to demonstrate that language model perplexity can serve as a robust measure of stylistic similarity.

**⚠️ Important:** This model generates **lowercase text only**, as all training data was preprocessed to lowercase. Use lowercase prompts for best results.

## Model Details

- **Model type:** GPT-2 (custom compact architecture)
- **Language:** English (lowercase)
- **License:** MIT
- **Author:** Ruth Plumly Thompson (1891-1976)
- **Notable works:** The Oz book series (books 15-33)
- **Training data:** [13 books by Ruth Plumly Thompson](https://huggingface.co/datasets/contextlab/thompson-corpus)
- **Training tokens:** 733,171
- **Final training loss:** 1.3178
- **Epochs trained:** 50,000

### Architecture

| Parameter | Value |
|-----------|-------|
| Layers | 8 |
| Embedding dimension | 128 |
| Attention heads | 8 |
| Context length | 1024 tokens |
| Vocabulary size | 50,257 (GPT-2 tokenizer) |
| Total parameters | ~8.1M |

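The table corresponds directly to a `transformers` `GPT2Config`. As an illustrative sketch (a sanity check, not a step needed to use the released weights), the architecture can be instantiated like this:

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Compact configuration matching the table above (all other GPT-2 defaults retained)
config = GPT2Config(n_layer=8, n_embd=128, n_head=8, n_positions=1024, vocab_size=50257)
model = GPT2LMHeadModel(config)

print(f"{model.num_parameters():,} parameters")  # roughly 8.1M
```

The released checkpoint itself should be loaded with `from_pretrained`, as shown in the Usage examples below.
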
## Usage

### Basic Text Generation

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

# Load model and tokenizer
model = GPT2LMHeadModel.from_pretrained("contextlab/gpt2-thompson")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# IMPORTANT: Use lowercase prompts (model trained on lowercase text)
prompt = "once upon a time in the land of"
inputs = tokenizer(prompt, return_tensors="pt")

# Generate text
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=200,
        do_sample=True,
        temperature=0.8,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```

**Output:** Generates text in Ruth Plumly Thompson's distinctive style (all lowercase).
### Stylometric Analysis

Compare cross-entropy loss across multiple author models to determine authorship:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

# Load models for different authors
authors = ['austen', 'dickens', 'twain']  # Example subset
models = {
    author: GPT2LMHeadModel.from_pretrained(f"contextlab/gpt2-{author}")
    for author in authors
}

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Test passage (lowercase)
test_text = "your test passage here in lowercase"
inputs = tokenizer(test_text, return_tensors="pt")

# Compute loss for each model
for author, model in models.items():
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs['input_ids'])
    loss = outputs.loss.item()
    print(f"{author}: {loss:.4f}")

# Lower loss indicates more similar style (likely author)
```
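Because perplexity is just `exp(loss)`, attribution reduces to picking the model with the lowest loss. A short follow-up sketch, using placeholder loss values rather than results from the paper:

```python
import math

# Losses collected from the loop above, e.g. {author: outputs.loss.item()}
losses = {"austen": 5.21, "dickens": 5.08, "thompson": 3.97}  # placeholder values

perplexities = {author: math.exp(loss) for author, loss in losses.items()}
predicted = min(losses, key=losses.get)  # lowest loss (and perplexity) = best-fitting author model

print(perplexities)
print(f"predicted author: {predicted}")
```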
## Training Procedure

### Dataset

The model was trained on 13 books by Ruth Plumly Thompson sourced from [Project Gutenberg](https://www.gutenberg.org/). The text was preprocessed to:
- Remove Project Gutenberg headers and footers
- Convert all text to lowercase
- Remove chapter headings and non-narrative text
- Preserve punctuation and structure

See the [Thompson corpus dataset](https://huggingface.co/datasets/contextlab/thompson-corpus) for details.

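As a rough illustration of these steps (not the repository's actual preprocessing script; the Gutenberg marker patterns, the chapter-heading heuristic, and the `thompson_books/` directory are assumptions):

```python
import re
from pathlib import Path

def clean_gutenberg_text(raw: str) -> str:
    """Strip Project Gutenberg boilerplate, drop chapter headings, and lowercase."""
    # Keep only the text between the standard *** START/END *** markers, if present
    start = re.search(r"\*\*\* START OF .*? \*\*\*", raw)
    end = re.search(r"\*\*\* END OF .*? \*\*\*", raw)
    body = raw[start.end():end.start()] if (start and end) else raw

    kept = []
    for line in body.splitlines():
        # Simple heuristic for chapter headings: "CHAPTER ..." or bare roman numerals
        if re.fullmatch(r"\s*(CHAPTER\b[^a-z]*|[IVXLC]+\.?)\s*", line):
            continue
        kept.append(line)
    return "\n".join(kept).lower()

# Hypothetical directory containing the downloaded plain-text books
corpus = "\n\n".join(
    clean_gutenberg_text(path.read_text(encoding="utf-8"))
    for path in sorted(Path("thompson_books").glob("*.txt"))
)
```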
### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Context length | 1,024 tokens |
| Batch size | 16 |
| Learning rate | 5×10⁻⁵ |
| Optimizer | AdamW |
| Training tokens | 733,171 |
| Epochs | 50,000 |
| Final loss | 1.3178 |

### Training Method

The model was initialized with a compact GPT-2 architecture (8 layers, 128-dimensional embeddings) and trained exclusively on Ruth Plumly Thompson's works until reaching a training loss of approximately 1.3178. This intensive training enables the model to capture fine-grained stylistic patterns characteristic of Thompson's writing.

See the [GitHub repository](https://github.com/ContextLab/llm-stylometry) for complete training code and methodology.

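As a condensed sketch of the setup implied by the tables above (the corpus path, block-chunking step, and plain epoch loop are illustrative assumptions; the repository contains the authoritative code):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import GPT2Config, GPT2LMHeadModel, GPT2Tokenizer

# Hypothetical path to the preprocessed (lowercased) training text
corpus = open("thompson_corpus.txt", encoding="utf-8").read()

# Tokenize and split into 1,024-token blocks
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
ids = torch.tensor(tokenizer.encode(corpus))
blocks = ids[: (len(ids) // 1024) * 1024].view(-1, 1024)

# Compact architecture from the Architecture table
config = GPT2Config(n_layer=8, n_embd=128, n_head=8, n_positions=1024, vocab_size=50257)
model = GPT2LMHeadModel(config)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
loader = DataLoader(TensorDataset(blocks), batch_size=16, shuffle=True)

model.train()
for epoch in range(50_000):  # the card reports 50,000 epochs over the ~733K-token corpus
    for (batch,) in loader:
        loss = model(input_ids=batch, labels=batch).loss  # standard causal-LM objective
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```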
## Intended Use

### Primary Uses
- **Research:** Stylometric analysis, authorship attribution studies
- **Education:** Demonstrations of computational stylometry
- **Creative:** Generate text in Ruth Plumly Thompson's style
- **Analysis:** Compare writing styles across historical periods

### Out-of-Scope Uses
This model is not intended for:
- Factual information retrieval
- Modern language generation
- Tasks requiring uppercase text
- Commercial publication without attribution

## Limitations

- **Lowercase only:** All generated text is lowercase (due to preprocessing)
- **Historical language:** Reflects the vocabulary and grammar of early-to-mid 20th century America
- **Training data bias:** Limited to Ruth Plumly Thompson's published works
- **Small model:** Compact architecture prioritizes training speed over generation quality
- **No factual grounding:** Generates stylistically similar text, not historically accurate content

## Evaluation

This model achieved perfect accuracy (100%) in distinguishing Ruth Plumly Thompson's works from those of seven other classic authors in cross-entropy loss comparisons. See the paper for detailed evaluation results.

## Citation

If you use this model in your research, please cite:

```bibtex
@article{StroEtal25,
  title={A Stylometric Application of Large Language Models},
  author={Stropkay, Harrison F. and Chen, Jiayi and Jabelli, Mohammad J. L. and Rockmore, Daniel N. and Manning, Jeremy R.},
  journal={arXiv preprint arXiv:2510.21958},
  year={2025}
}
```

## Contact

- **Paper & Code:** https://github.com/ContextLab/llm-stylometry
- **Issues:** https://github.com/ContextLab/llm-stylometry/issues
- **Contact:** Jeremy R. Manning (jeremy.r.manning@dartmouth.edu)
- **Lab:** [Context Lab](https://www.context-lab.com/), Dartmouth College

## Related Models

Explore models for all 8 authors in the study:
- [Jane Austen](https://huggingface.co/contextlab/gpt2-austen)
- [L. Frank Baum](https://huggingface.co/contextlab/gpt2-baum)
- [Charles Dickens](https://huggingface.co/contextlab/gpt2-dickens)
- [F. Scott Fitzgerald](https://huggingface.co/contextlab/gpt2-fitzgerald)
- [Herman Melville](https://huggingface.co/contextlab/gpt2-melville)
- [Ruth Plumly Thompson](https://huggingface.co/contextlab/gpt2-thompson)
- [Mark Twain](https://huggingface.co/contextlab/gpt2-twain)
- [H.G. Wells](https://huggingface.co/contextlab/gpt2-wells)
config.json
ADDED
@@ -0,0 +1,31 @@
{
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "dtype": "float32",
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_embd": 128,
  "n_head": 8,
  "n_inner": null,
  "n_layer": 8,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "transformers_version": "4.56.1",
  "use_cache": true,
  "vocab_size": 50257
}
generation_config.json
ADDED
@@ -0,0 +1,6 @@
{
  "_from_model_config": true,
  "bos_token_id": 50256,
  "eos_token_id": 50256,
  "transformers_version": "4.56.1"
}
loss_logs.csv
ADDED
The diff for this file is too large to render. See the raw diff.
model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:42ef45d147ff283af8e537968a5a366c5abe84a1842679df263c8a1df2b3bca3
size 32611312
training_state.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:9e40e8d432075761d4b81bbe78cbf21642b9b13d931dd19814996a1732ff0464
size 65304983