pre-punctuation-processor / README.md

metadata

title: Pre-Punctuation Processor
emoji: 📜
colorFrom: yellow
colorTo: gray
sdk: gradio
app_file: app.py
pinned: false
license: mit
tags:
  - philosophy
  - nlp
  - training-data
  - classical-texts
  - character-level

Pre-Punctuation Processor

A text processing pipeline that prepares ancient philosophical texts as training data for character-level language models, stripping them back to a pre-punctuation form faithful to how they were originally composed and spoken.

Why Pre-Punctuation?

The philosophical texts in this corpus — Aristotle, Plato, Euclid, Seneca, Epictetus, Marcus Aurelius — were composed in an era before modern punctuation existed. Ancient Greek was written in scriptio continua: an unbroken stream of uppercase letters with no spaces, no commas, no quotation marks, no paragraph breaks.

The first systematic punctuation was invented by Aristophanes of Byzantium (c. 257–185 BC), head librarian of the Library of Alexandria. He devised a system of single dots (théseis) placed at different heights to mark breathing pauses for readers:

stigmḕ mésē (·) mid-level dot — a short pause (komma)
hypostigmḗ (.) low dot — a medium pause (kolon)
stigmḕ teleía (˙) high dot — a full stop (periodos)

This system was a reading aid, not part of the texts themselves. The words of the philosophers predated any notation for pauses or structure.

The Period as Pause Marker

This pipeline reduces all punctuation to a single mark: the period — a direct descendant of Aristophanes' dot system. In our output, the period functions not as a grammatical construct but as what it originally was: a marker for a pause in speech.

The resulting vocabulary is exactly 28 characters: the 26 lowercase Latin letters, a space, and a period.
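A minimal sketch of how text might be reduced to this vocabulary. The function name `normalize` and the handling of dropped characters are assumptions for illustration, not the pipeline's actual code: accents are folded to ASCII via Unicode NFKD decomposition, whitespace collapses to single spaces, and every character outside the 28 is discarded.

```python
import string
import unicodedata

# 26 lowercase Latin letters + space + period = 28 characters
VOCAB = string.ascii_lowercase + " ."


def normalize(text: str) -> str:
    """Hypothetical helper: reduce arbitrary text to the 28-character vocabulary."""
    # Decompose accented characters, then drop whatever is still non-ASCII
    ascii_text = (
        unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    )
    lowered = ascii_text.lower()
    # Keep vocabulary characters, turn other whitespace into spaces, drop the rest
    kept = "".join(
        ch if ch in VOCAB else (" " if ch.isspace() else "") for ch in lowered
    )
    # Collapse runs of spaces left behind by removed characters
    return " ".join(kept.split())


print(normalize("Know thyself, said Socrates!"))  # know thyself said socrates
```

Under this sketch, removed characters leave no gap, so hyphenated words fuse ("well-known" becomes "wellknown"); a real pipeline might prefer to replace some marks with a space instead.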

What This Tool Does

Strips all non-body content — Prefaces, editor's notes, appendixes, transcriber corrections, publisher info, and source boilerplate (Gutenberg, MIT Classics, Internet Archive) are aggressively removed. Only the philosopher's own words remain.
Converts numerals to words — Both Arabic (600 → "six hundred") and Roman (XIV → "fourteen") numerals become English words.
Normalizes to 28-char vocabulary — Unicode normalized to ASCII, lowercased, all punctuation except period removed.
Chunks for training — Text split into 40–256 character chunks at sentence boundaries.
Publishes to HuggingFace — Train/validation splits pushed as a dataset for direct loading in notebooks.
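The numeral-conversion and chunking steps above can be sketched roughly as follows. Every name here (`int_to_words`, `roman_to_int`, `numerals_to_words`, `chunk_sentences`) is hypothetical, and several choices are assumptions rather than the tool's documented behavior: Roman numerals are only recognized when they are at least two letters long (so the pronoun "I" is left alone), over-long sentences are skipped, and leftover fragments shorter than the minimum are dropped.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}
ROMAN_WORD = re.compile(r"\b[IVXLCDM]{2,}\b")
ROMAN_VALID = re.compile(
    r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})"
)


def int_to_words(n: int) -> str:
    """Spell out a non-negative integer; intended for 0..999,999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + (" " + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" " + int_to_words(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    return int_to_words(thousands) + " thousand" + (" " + int_to_words(rest) if rest else "")


def roman_to_int(numeral: str) -> int:
    """Subtract a symbol when a larger one follows it, otherwise add it."""
    total = 0
    for ch, nxt in zip(numeral, numeral[1:] + " "):
        value = ROMAN[ch]
        total += -value if ROMAN.get(nxt, 0) > value else value
    return total


def numerals_to_words(text: str) -> str:
    """Replace well-formed Roman numerals and Arabic digit runs with words."""
    def roman_repl(match):
        word = match.group()
        # Leave words like "DIM" alone: they are not well-formed numerals
        return int_to_words(roman_to_int(word)) if ROMAN_VALID.fullmatch(word) else word

    text = ROMAN_WORD.sub(roman_repl, text)
    return re.sub(r"\b\d+\b", lambda m: int_to_words(int(m.group())), text)


def chunk_sentences(text: str, min_len: int = 40, max_len: int = 256) -> list:
    """Greedily pack whole sentences into chunks of min_len..max_len characters."""
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_len:
            current = candidate
            continue
        if len(current) >= min_len:
            chunks.append(current)
        # Start a new chunk; a single sentence longer than max_len is skipped
        current = sentence if len(sentence) <= max_len else ""
    if len(current) >= min_len:
        chunks.append(current)
    return chunks


print(numerals_to_words("Book XIV has 600 lines"))
# Book fourteen has six hundred lines
```

Numeral conversion has to run before lowercasing in a scheme like this, since the Roman-numeral pattern relies on uppercase letters. Uppercase words that happen to be valid numerals (such as "MIX") would still be converted, which a production pipeline would need to guard against.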

Usage

Drag and drop a .txt, .epub, or .zip file, or paste a URL from Project Gutenberg, MIT Internet Classics, or the Internet Archive. The pipeline processes it and adds it to the corpus.

Search the Internet Archive to browse and add classical texts directly.

Push to HuggingFace to make the dataset available anywhere:

from datasets import load_dataset
ds = load_dataset("LisaMegaWatts/philosophy-corpus")

Built for JuliaGPT

The output is designed for training a character-level GPT implemented in Julia, with a target vocabulary of 29 tokens: the 28 characters plus a beginning-of-sequence (BOS) token.
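To make the token count concrete, here is a small sketch of a character-level tokenizer over this vocabulary, written in Python to match the loading example. The id assignment is an assumption: letters take ids 0 to 25, space 26, period 27, and BOS the last id, 28. JuliaGPT's actual mapping (Julia arrays are 1-indexed) may differ, and the names `encode` and `decode` are hypothetical.

```python
import string

CHARS = string.ascii_lowercase + " ."          # the 28 characters
BOS_ID = len(CHARS)                            # assumption: BOS takes the last id (28)
STOI = {ch: i for i, ch in enumerate(CHARS)}   # character -> token id
ITOS = {i: ch for ch, i in STOI.items()}       # token id -> character
VOCAB_SIZE = len(CHARS) + 1                    # 28 characters + BOS = 29


def encode(chunk: str) -> list:
    """Prefix a chunk with BOS and map each character to its id."""
    return [BOS_ID] + [STOI[ch] for ch in chunk]


def decode(ids: list) -> str:
    """Map ids back to characters, skipping BOS."""
    return "".join(ITOS[i] for i in ids if i != BOS_ID)


print(encode("a b."))  # [28, 0, 26, 1, 27]
```

Because `encode` raises `KeyError` on any character outside the 28, it also serves as a quick check that a chunk was fully normalized before training.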