LisaMegaWatts committed · verified
Commit 20a2e9f · Parent(s): 64a09b2

Upload README.md with huggingface_hub

Files changed (1): README.md (+57 -6)
---
title: Pre-Punctuation Processor
emoji: 📜
colorFrom: yellow
colorTo: gray
sdk: gradio
app_file: app.py
pinned: false
license: mit
tags:
- philosophy
- nlp
- training-data
- classical-texts
- character-level
---

# Pre-Punctuation Processor

A text processing pipeline that prepares ancient philosophical texts as training data for character-level language models, stripping them back to a pre-punctuation form faithful to how they were originally composed and spoken.
## Why Pre-Punctuation?

The philosophical texts in this corpus — Aristotle, Plato, Euclid, Seneca, Epictetus, Marcus Aurelius — were composed in an era before modern punctuation existed. Ancient Greek was written in *scriptio continua*: an unbroken stream of uppercase letters with no spaces, no commas, no quotation marks, no paragraph breaks.

The first systematic punctuation was invented by **Aristophanes of Byzantium** (c. 257–185 BC), head librarian of the Library of Alexandria. He devised a system of single dots (*théseis*) placed at different heights to mark breathing pauses for readers:

- **stigmḕ mésē** (·) mid-level dot — a short pause (*komma*)
- **hypostigmḗ** (.) low dot — a medium pause (*kolon*)
- **stigmḕ teleía** (˙) high dot — a full stop (*periodos*)

This system was a reading aid, not part of the texts themselves. The words of the philosophers predated any notation for pauses or structure.
## The Period as Pause Marker

This pipeline reduces all punctuation to a single mark: the **period** — a direct descendant of Aristophanes' dot system. In our output, the period functions not as a grammatical construct but as what it originally was: a marker for a pause in speech.

The resulting vocabulary is exactly **28 characters**: the 26 lowercase Latin letters, a space, and a period.
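Concretely, that 28-character vocabulary can be written down directly. The lookup tables and `encode`/`decode` helpers below are a minimal sketch of how a character-level model would consume it, not the app's actual code:

```python
import string

# The 28-character vocabulary: 26 lowercase letters, a space, and a period
VOCAB = string.ascii_lowercase + " ."

# Char <-> id lookup tables of the kind a character-level model trains on
stoi = {ch: i for i, ch in enumerate(VOCAB)}
itos = {i: ch for ch, i in stoi.items()}

def encode(text: str) -> list[int]:
    """Map already-normalized text to integer ids."""
    return [stoi[ch] for ch in text]

def decode(ids: list[int]) -> str:
    """Map ids back to text."""
    return "".join(itos[i] for i in ids)

print(len(VOCAB))                       # 28
print(decode(encode("know thyself.")))  # round-trips unchanged
```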
## What This Tool Does

1. **Strips all non-body content** — Prefaces, editor's notes, appendixes, transcriber corrections, publisher info, and source boilerplate (Gutenberg, MIT Classics, Internet Archive) are aggressively removed. Only the philosopher's own words remain.
2. **Converts numerals to words** — Both Arabic (600 → "six hundred") and Roman (XIV → "fourteen") numerals become English words.
3. **Normalizes to the 28-character vocabulary** — Unicode is normalized to ASCII, text is lowercased, and all punctuation except the period is removed.
4. **Chunks for training** — Text is split into 40–256 character chunks at sentence boundaries.
5. **Publishes to HuggingFace** — Train/validation splits are pushed as a dataset for direct loading in notebooks.
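The normalization and chunking steps above can be sketched roughly as follows. The specific regexes and the greedy sentence-packing strategy are illustrative assumptions, not the pipeline's actual implementation:

```python
import re
import unicodedata

def normalize(text: str) -> str:
    """Reduce raw text to the 28-character vocabulary: a-z, space, period."""
    # Fold accented characters to ASCII and lowercase everything
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii").lower()
    # Collapse sentence-ending punctuation to a single period (assumed rule)
    text = re.sub(r"[.!?;:]+", ".", text)
    # Replace every other out-of-vocabulary character with a space
    text = re.sub(r"[^a-z. ]", " ", text)
    # Squeeze runs of whitespace down to single spaces
    return re.sub(r"\s+", " ", text).strip()

def chunk(text: str, min_len: int = 40, max_len: int = 256) -> list[str]:
    """Greedily pack sentences into chunks of min_len-max_len characters."""
    chunks, current = [], ""
    for sentence in re.split(r"(?<=\.)\s+", text):
        candidate = f"{current} {sentence}".strip() if current else sentence
        if len(candidate) <= max_len:
            current = candidate
        else:
            if len(current) >= min_len:
                chunks.append(current)
            current = sentence
    if len(current) >= min_len:
        chunks.append(current)
    return chunks

print(normalize("The unexamined life is not worth living! Know thyself?"))
```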
## Usage

**Drag and drop** a .txt, .epub, or .zip file, or paste a URL from Project Gutenberg, MIT Internet Classics, or the Internet Archive. The pipeline processes it and adds it to the corpus.

**Search the Internet Archive** to browse and add classical texts directly.

**Push to HuggingFace** to make the dataset available anywhere:
```python
from datasets import load_dataset

# Load the processed corpus with its train/validation splits
ds = load_dataset("LisaMegaWatts/philosophy-corpus")
```
## Built for JuliaGPT

The output is designed for training a character-level GPT implemented in Julia, with a target vocabulary of 29 tokens (28 characters + BOS).
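On the data-preparation side, that 29-token layout might look like the sketch below. Assigning BOS id 0 and placing the characters after it is an assumption for illustration; the Julia model defines its own token layout:

```python
import string

# 28 characters plus a beginning-of-sequence marker = 29 tokens
CHARS = string.ascii_lowercase + " ."
BOS_ID = 0  # assumed: BOS takes id 0, characters take ids 1-28
stoi = {ch: i + 1 for i, ch in enumerate(CHARS)}

def encode_chunk(chunk: str) -> list[int]:
    """Prefix a training chunk with BOS, then map each character to its id."""
    return [BOS_ID] + [stoi[ch] for ch in chunk]

print(encode_chunk("a. "))  # [0, 1, 28, 27]
```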