---
editor_options:
  markdown:
    wrap: 72
---

# Replication Guide: Grasping "Gooning"

An observational analysis of Reddit gooning communities via BERTopic topic modelling.


## Project summary

This study empirically analyses publicly available Reddit subreddits dedicated to gooning (prolonged, trance-like masturbation). The analysis pipeline combines:

- **Descriptive statistics** – demographics (age, gender, sexuality from user flair), word frequencies, and substance mentions across 30 subreddits
- **BERTopic topic modelling** – semantic clustering of ~2.35M documents (posts and comments) to identify the major themes and discourse patterns

Full corpus: 30 subreddits, ~22M raw records, 2019–2025


## Repository structure

```
Goon/
├── REPLICATION.md                 ← this file
├── README.md                      ← subreddit list
├── Goon.Rproj                     ← RStudio project (sets working directory)
├── data_prep.qmd                  ← Step 1: R – ingest CSVs, clean, export Parquet
├── goon_analysis.qmd              ← Step 2: R – descriptive analyses
├── goon_topic_analysis.qmd        ← Step 4: R – visualise BERTopic outputs
├── data/
│   ├── *.csv                      ← raw Reddit exports (one file per subreddit per year)
│   ├── GOONED_comments.csv        ← pre-concatenated GOONED comments (all years)
│   ├── GOONED_submissions.csv     ← pre-concatenated GOONED submissions (all years)
│   ├── comments.parquet           ← output of data_prep.qmd (18.2M rows)
│   ├── posts.parquet              ← output of data_prep.qmd (3.8M rows)
│   ├── corpus_clean.parquet       ← output of topic_analysis.qmd (modelling corpus)
│   └── corpus_deleted.parquet     ← deleted/removed rows (tracked separately)
├── topic analysis/
│   ├── topic_analysis.qmd         ← Step 3: Python – BERTopic pipeline
│   ├── topic_api_labeling.qmd     ← Step 3b: Python – optional API-based topic labelling
│   ├── build_topic_results_summary.py  ← helper: generate an HTML summary from run outputs
│   ├── run_topic_analysis.sh      ← shell wrapper that sets env vars and calls quarto
│   ├── README.md                  ← detailed pipeline documentation
│   ├── MEMORY_MANAGEMENT.md       ← strategies for large-corpus runs
│   └── runs/
│       └── full_run_v1/           ← complete output of the full corpus run
│           ├── bertopic/          ← saved BERTopic models + doc-topic assignments
│           ├── topics/            ← keyword tables, summaries, evaluation, API labels
│           └── figures/           ← generated charts
└── .venv/                         ← Python virtual environment (created per instructions below)
```

## System requirements

### Hardware

| Component | Minimum | Recommended |
|---|---|---|
| RAM | 32 GB | 64 GB (full corpus run; see MEMORY_MANAGEMENT.md) |
| Disk | 50 GB free | 100 GB free |
| CPU | Any modern multi-core | 8+ cores for embedding generation |
| GPU | Not required | Optional – speeds up embedding by ~10× |

The full corpus run was executed on a Linux VM with 96 GB RAM. A local Mac with 24 GB will work for the pilot/sample runs but may struggle with the full-corpus embedding step.

### Software

| Tool | Version tested | Notes |
|---|---|---|
| R | ≥ 4.3 | For data_prep.qmd and goon_analysis.qmd |
| RStudio / Quarto CLI | ≥ 1.4 | To render .qmd files |
| Python | ≥ 3.10 | For topic_analysis.qmd |
| Quarto | ≥ 1.4 | Installed with RStudio or standalone |

## Step-by-step replication

### Step 0: Clone / obtain the repository

The `data/` folder containing the raw CSVs is required. These are large files and are not distributed via git – they must be present locally.

Open `Goon.Rproj` in RStudio. This sets the working directory to the project root so that `here::here()` paths resolve correctly.


### Step 1: R data preparation (data_prep.qmd)

**Purpose:** Reads all raw CSVs, combines them into unified data frames, applies minimal cleaning, and exports `data/comments.parquet` and `data/posts.parquet`.

R packages required:

```r
install.packages(c("dplyr", "tidyr", "tibble", "purrr",
                   "data.table", "arrow", "here"))
```

Run: open data_prep.qmd in RStudio and click Render (or run all chunks in order).

Alternatively, from the terminal:

```sh
cd /Users/bkot7579/Desktop/Goon
quarto render data_prep.qmd
```

Expected outputs:

- `data/comments.parquet` – ~18.2M rows, ~528 MB
- `data/posts.parquet` – ~3.8M rows, ~186 MB

Time estimate: 15–45 minutes depending on RAM and I/O speed (the GOONED CSV files alone are ~7 GB).


### Step 2: R descriptive analysis (goon_analysis.qmd)

**Purpose:** Demographic analysis (r/GOONEDmeetup flair), word-frequency analysis, substance mention counts.

R packages required:

```r
install.packages(c("dplyr", "tidyr", "tibble", "purrr", "ggplot2",
                   "stringr", "here", "e1071", "tidytext",
                   "data.table", "arrow"))
```

Run: open goon_analysis.qmd in RStudio and click Render.

Prerequisites: `data/comments.parquet` and `data/posts.parquet` must exist (Step 1).

Expected outputs:

- Rendered HTML with plots embedded
- In-memory: word-count tables, substance mention counts, demographic counts


### Step 3: Python BERTopic topic modelling (topic analysis/topic_analysis.qmd)

**Purpose:** Embeds all documents with all-MiniLM-L6-v2, runs HDBSCAN topic modelling via BERTopic, reduces topics using c-TF-IDF agglomerative clustering, evaluates models, and exports all topic outputs.

#### 3a. Create the Python virtual environment

```sh
cd /Users/bkot7579/Desktop/Goon
python3 -m venv .venv
source .venv/bin/activate

pip install bertopic umap-learn hdbscan sentence-transformers \
            scikit-learn pandas pyarrow quarto
```

Key package versions (tested):

| Package | Version |
|---|---|
| bertopic | 0.16.x |
| umap-learn | 0.5.x |
| hdbscan | 0.8.x |
| sentence-transformers | 2.x |
| scikit-learn | 1.x |
| pandas | 2.x |
| pyarrow | 14.x |
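To make a rebuilt environment match the tested versions above, the ranges could be pinned in a requirements file. This is a sketch derived from the table, not a file shipped in the repository:

```
bertopic>=0.16,<0.17
umap-learn>=0.5,<0.6
hdbscan>=0.8,<0.9
sentence-transformers>=2,<3
scikit-learn>=1,<2
pandas>=2,<3
pyarrow>=14,<15
quarto
```

Install with `pip install -r requirements.txt` inside the activated `.venv`.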

#### 3b. Run the pipeline

Pilot run (200k documents – recommended first):

```sh
cd /Users/bkot7579/Desktop/Goon
./topic\ analysis/run_topic_analysis.sh --max-docs 200000 --run-tag pilot_200k
```

Outputs land in `topic analysis/runs/pilot_200k/`.

Full corpus run (~2.35M cleaned documents after filtering):

```sh
cd /Users/bkot7579/Desktop/Goon
./topic\ analysis/run_topic_analysis.sh --run-tag full_run_v1
```

**Warning:** The full run requires ~64 GB RAM for the embedding + UMAP stages. See `topic analysis/MEMORY_MANAGEMENT.md` for strategies if RAM is limited.

Pipeline stages (executed automatically in order):

1. Data ingestion – loads `data/posts.parquet` + `data/comments.parquet`
2. Light cleaning – deduplication; URL, username, and subreddit anonymisation; markdown stripping; deleted/removed rows saved separately to `data/corpus_deleted.parquet`
3. Embedding – all-MiniLM-L6-v2 (384-dim), batch_size=512, saved as shards; skips already-generated shards on resume
4. UMAP – pre-computed once, reused across all HDBSCAN configs
5. BERTopic – 6 configurations: min_cluster_size ∈ {50, 100, 200} × method ∈ {eom, leaf}
6. Topic reduction – c-TF-IDF agglomerative clustering to 100, 50, and 25 topics
7. Evaluation – NPMI coherence, topic diversity, outlier rates
8. Export – keyword tables, representative docs, summary CSVs
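The light-cleaning stage can be illustrated with a minimal regex sketch. The function name, replacement tokens, and exact patterns here are illustrative assumptions, not the pipeline's actual code:

```python
import re

def light_clean(text: str) -> str:
    """Illustrative cleaning: anonymise URLs, u/usernames, and r/subreddit
    mentions, then strip basic markdown emphasis (assumed patterns)."""
    text = re.sub(r"https?://\S+", "[URL]", text)          # URLs
    text = re.sub(r"\bu/[A-Za-z0-9_-]+", "[USER]", text)   # usernames
    text = re.sub(r"\br/[A-Za-z0-9_]+", "[SUB]", text)     # subreddit mentions
    text = re.sub(r"[*_~`]+", "", text)                    # markdown emphasis
    return re.sub(r"\s+", " ", text).strip()               # collapse whitespace

print(light_clean("Check **r/GOONED** via https://redd.it/x by u/someone"))
# Check [SUB] via [URL] by [USER]
```

Order matters: URLs are replaced first so that `r/` fragments inside links are not mistaken for subreddit mentions.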

**Reproducibility:** Random seed = 42 throughout. A `reproducibility_log.json` is written to the run folder with all settings and package versions.
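A log of this kind can be produced with the standard library alone. The field names below are assumptions about what such a log might contain, not the pipeline's actual schema:

```python
import json
import platform
import tempfile
from pathlib import Path

def write_repro_log(run_dir: str, settings: dict) -> Path:
    """Write run settings plus basic environment info to reproducibility_log.json."""
    log = {"python_version": platform.python_version(), **settings}
    path = Path(run_dir) / "reproducibility_log.json"
    path.write_text(json.dumps(log, indent=2))
    return path

# Record the documented settings for a run (a temp dir stands in for the run folder).
run_dir = tempfile.mkdtemp()
p = write_repro_log(run_dir, {"random_seed": 42,
                              "embedding_model": "all-MiniLM-L6-v2"})
print(json.loads(p.read_text())["random_seed"])  # 42
```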

#### 3c. Optional API-based topic labelling (topic_api_labeling.qmd)

Sends reduced topic summaries (keywords + representative texts) to an LLM API to generate human-readable labels. It does NOT send raw corpus text.

Set your API key, then render:

```sh
export OPENAI_API_KEY="your-key-here"
quarto render "topic analysis/topic_api_labeling.qmd"
```

Outputs are saved to `topic analysis/runs/<run-tag>/topics/api/`.


### Step 4: R topic visualisations (goon_topic_analysis.qmd)

**Purpose:** Loads the CSV/Parquet outputs from the BERTopic run and produces exploratory visualisations: topic-size bar chart, subreddit × topic heatmap, topic prevalence over time, post vs comment split, representative documents.

R packages required:

```r
install.packages(c("dplyr", "tidyr", "tibble", "purrr",
                   "ggplot2", "stringr", "here", "arrow"))
```

Run:

```sh
quarto render goon_topic_analysis.qmd
```

Prerequisites: Step 3 must have completed and outputs must exist under `topic analysis/runs/full_run_v1/topics/`.


## Execution order summary

```
Step 1  →  data_prep.qmd            (R)       ~30 min
Step 2  →  goon_analysis.qmd        (R)       ~10 min
Step 3  →  topic_analysis.qmd       (Python)  ~6–48 hours (full corpus)
Step 3b →  topic_api_labeling.qmd   (Python)  ~5 min + API cost (optional)
Step 4  →  goon_topic_analysis.qmd  (R)       ~2 min
```

Steps 2 and 3 are independent of each other and can run in parallel.


## Key modelling decisions

| Decision | Choice | Rationale |
|---|---|---|
| Embedding model | all-MiniLM-L6-v2 | Fast, runs on CPU; 384 dimensions sufficient for topic structure |
| UMAP n_neighbors | 15 | BERTopic default; balances local vs global structure |
| UMAP n_components | 5 | Low enough for HDBSCAN to work well |
| HDBSCAN min_cluster_size | 50, 100, 200 | Tested all three; mcs=100 eom selected as reference |
| Topic reduction method | c-TF-IDF agglomerative | Merges semantically similar topics rather than splitting clusters |
| Reduction targets | 100, 50, 25 | 50 selected for reporting (NPMI=0.27, diversity=0.74) |
| Preprocessing | Minimal | Preserves informal language and slang; CountVectorizer handles casing/stopwords |
| Random seed | 42 | Applied to UMAP, HDBSCAN sampling, and document-cap sampling |
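Topic diversity, one of the two reported evaluation metrics (0.74 for the 50-topic model), is commonly computed as the fraction of unique words across all topics' top-n keywords. A minimal sketch, with made-up keyword lists for illustration:

```python
def topic_diversity(topic_keywords: list[list[str]]) -> float:
    """Fraction of unique words among all topics' top-n keywords.
    1.0 means no keyword is shared between topics."""
    all_words = [w for topic in topic_keywords for w in topic]
    return len(set(all_words)) / len(all_words)

topics = [
    ["edge", "session", "goon", "hours"],
    ["advice", "quit", "relapse", "hours"],  # shares "hours" with topic 0
]
print(topic_diversity(topics))  # 7 unique / 8 total = 0.875
```

Higher values indicate topics with more distinctive vocabularies; shared keywords across many topics pull the score down.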

## Known issue: r/GOONED missing from BERTopic results

r/GOONED is the dominant subreddit in the corpus by a large margin:

| | Count |
|---|---|
| r/GOONED posts | 2,765,119 |
| r/GOONED comments | 15,493,075 |
| r/GOONED total | 18,258,194 |
| Full cleaned corpus | 22,011,124 |
| r/GOONED share | 82.9% |

Despite this, r/GOONED is entirely absent from the `topic analysis/runs/full_run_v1/` outputs. The cloud VM that ran the BERTopic pipeline did not have the GOONED_comments.csv and GOONED_submissions.csv files available (most likely because their combined size of ~7 GB made transfer impractical), and the corpus_clean.parquet on the VM was generated without them.

**Consequence:** All topic modelling results represent 29 subreddits (3.75M records) rather than the full 30-subreddit corpus (22M records). Topic proportions, dominant themes, and subreddit distribution tables are therefore not representative of the full corpus.
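The counts above are internally consistent, as a quick arithmetic check shows:

```python
gooned_total = 2_765_119 + 15_493_075  # r/GOONED posts + comments
full_corpus = 22_011_124               # full cleaned corpus

print(gooned_total)                                 # 18,258,194 - matches the table
print(round(100 * gooned_total / full_corpus, 1))   # 82.9 (% share)
print(full_corpus - gooned_total)                   # 3,752,930 - the "3.75M records"
```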

**To fix:** Ensure the GOONED_*.csv files (or the pre-built comments.parquet / posts.parquet) are available on the compute environment, then re-run:

```sh
./topic\ analysis/run_topic_analysis.sh --run-tag full_run_v2
```

Note: the R-side analyses (goon_analysis.qmd) are not affected by this issue – they read directly from the local `data/posts.parquet` and `data/comments.parquet` files, which do contain r/GOONED data.


## Known limitations

1. **Outlier rate:** 37.3% of documents are assigned to the outlier topic (-1) by HDBSCAN. This is typical for short-text social-media corpora. Outliers are excluded from topic analyses but retained in the corpus Parquet files.

2. **Flair ambiguity:** Gender/sexuality classifiers rely on voluntarily set flair strings. Flair adoption is uneven across subreddits and users, introducing selection bias. Short regex patterns such as `\bt\b` may also match unintended strings (e.g. the US state abbreviation TX).

3. **Deleted content:** Posts and comments marked [deleted] or [removed] are excluded from modelling but counted separately. These may disproportionately represent controversial content.

4. **Temporal coverage:** Coverage varies by subreddit. Some communities only appear in later years (2024–2025); others span the full 2019–2025 window.

5. **Platform-specific norms:** Moderation rules, flair conventions, and posting styles differ across subreddits, which may shape topics in ways that are not generalisable.

6. **Unobserved participants:** Lurkers, banned users, and deleted accounts are not captured.
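The outlier rate in limitation 1 is simply the share of documents whose assigned topic is -1. Given a doc-topic assignment column, it can be recomputed as follows (the assignment values here are made up for illustration):

```python
def outlier_rate(topic_assignments: list[int]) -> float:
    """Fraction of documents HDBSCAN left unassigned (topic -1)."""
    return sum(t == -1 for t in topic_assignments) / len(topic_assignments)

# Toy example: 3 of 8 documents are outliers.
print(outlier_rate([-1, 0, 4, -1, 2, 0, -1, 7]))  # 0.375
```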


## Output files (full_run_v1)

| File | Description |
|---|---|
| topics/final_topic_summary.csv | 50 reduced topics with size, keywords, representative texts |
| topics/final_model_comparison.csv | Coherence, diversity, outlier rate for all 9 runs |
| topics/evaluation_table.csv | Same as above, alternate format |
| topics/bertopic_run_summary.csv | Initial topic counts across 6 HDBSCAN configurations |
| topics/topic_by_subreddit.csv | Topic × subreddit document counts |
| topics/topic_by_month.csv | Topic × month document counts |
| topics/topic_by_doctype.csv | Topic × doc type (post/comment) |
| topics/preprocessing_decisions.json | Logged cleaning decisions |
| topics/api/ | API-generated labels, summaries, category annotations |
| bertopic/ | Saved BERTopic models + doc-topic parquet files for all 6 initial runs |

## Contacts and citation

This analysis was conducted as part of a preliminary empirical study of online gooning communities. If replicating, please cite the original study and note the random seed, embedding model, and reduction target used.