---
editor_options:
  markdown:
    wrap: 72
---
Replication Guide — Grasping 'Gooning'
An observational analysis of Reddit gooning communities via BERTopic topic modelling.
Project summary
This study empirically analyses publicly available Reddit subreddits dedicated to gooning (prolonged, trancelike masturbation). The analysis pipeline combines:
- Descriptive statistics — demographics (age, gender, sexuality from user flair), word frequencies, and substance mentions across 30 subreddits
- BERTopic topic modelling — semantic clustering of ~2.35M documents (posts and comments) to identify the major themes and discourse patterns
Full corpus: 30 subreddits, ~22M raw records, 2019–2025
Repository structure
Goon/
├── REPLICATION.md — this file
├── README.md — subreddit list
├── Goon.Rproj — RStudio project (sets working directory)
├── data_prep.qmd — Step 1: R — ingest CSVs, clean, export Parquet
├── goon_analysis.qmd — Step 2: R — descriptive analyses
├── goon_topic_analysis.qmd — Step 4: R — visualise BERTopic outputs
├── data/
│   ├── *.csv — raw Reddit exports (one file per subreddit per year)
│   ├── GOONED_comments.csv — pre-concatenated GOONED comments (all years)
│   ├── GOONED_submissions.csv — pre-concatenated GOONED submissions (all years)
│   ├── comments.parquet — output of data_prep.qmd (18.2M rows)
│   ├── posts.parquet — output of data_prep.qmd (3.8M rows)
│   ├── corpus_clean.parquet — output of topic_analysis.qmd (modelling corpus)
│   └── corpus_deleted.parquet — deleted/removed rows (tracked separately)
├── topic analysis/
│   ├── topic_analysis.qmd — Step 3: Python — BERTopic pipeline
│   ├── topic_api_labeling.qmd — Step 3b: Python — optional API-based topic labelling
│   ├── build_topic_results_summary.py — helper: generate an HTML summary from run outputs
│   ├── run_topic_analysis.sh — shell wrapper that sets env vars and calls quarto
│   ├── README.md — detailed pipeline documentation
│   ├── MEMORY_MANAGEMENT.md — strategies for large-corpus runs
│   └── runs/
│       └── full_run_v1/ — complete output of the full corpus run
│           ├── bertopic/ — saved BERTopic models + doc-topic assignments
│           ├── topics/ — keyword tables, summaries, evaluation, API labels
│           └── figures/ — generated charts
└── .venv/ — Python virtual environment (created per instructions below)
System requirements
Hardware
| Component | Minimum | Recommended |
|---|---|---|
| RAM | 32 GB | 64 GB (full corpus run; see MEMORY_MANAGEMENT.md) |
| Disk | 50 GB free | 100 GB free |
| CPU | Any modern multi-core | 8+ cores for embedding generation |
| GPU | Not required | Optional — speeds up embedding by ~10× |
The full corpus run was executed on a Linux VM with 96 GB RAM. A local Mac with 24 GB will work for the pilot/sample runs but may struggle with the full corpus embedding step.
Software
| Tool | Version tested | Notes |
|---|---|---|
| R | ≥ 4.3 | For data_prep.qmd, goon_analysis.qmd, and goon_topic_analysis.qmd |
| RStudio / Quarto CLI | ≥ 1.4 | To render .qmd files |
| Python | ≥ 3.10 | For topic_analysis.qmd and topic_api_labeling.qmd |
| Quarto | ≥ 1.4 | Installed with RStudio or standalone |
Step-by-step replication
Step 0: Clone / obtain the repository
The data/ folder containing raw CSVs is required. These are large
files and are not distributed via git — they must be present locally.
Open Goon.Rproj in RStudio. This sets the working directory to the
project root so that here::here() paths resolve correctly.
Step 1: R data preparation (data_prep.qmd)
Purpose: Reads all raw CSVs, combines them into unified data frames,
applies minimal cleaning, and exports data/comments.parquet and
data/posts.parquet.
R packages required:
install.packages(c("dplyr", "tidyr", "tibble", "purrr",
"data.table", "arrow", "here"))
Run:
Open data_prep.qmd in RStudio and click Render (or run all chunks
in order).
Alternatively, from the terminal:
cd /Users/bkot7579/Desktop/Goon
quarto render data_prep.qmd
Expected outputs:

- data/comments.parquet — ~18.2M rows, ~528 MB
- data/posts.parquet — ~3.8M rows, ~186 MB

Time estimate: 15–45 minutes depending on RAM and I/O speed (the GOONED CSV files alone are ~7 GB).
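If the Python environment from Step 3a is already available, the Parquet outputs can be sanity-checked from Python as well. This is a sketch only; the expected counts are the approximate figures quoted above.

```python
import pyarrow.parquet as pq

# Approximate expected sizes: comments ~18.2M rows, posts ~3.8M rows
for path in ["data/comments.parquet", "data/posts.parquet"]:
    n_rows = pq.ParquetFile(path).metadata.num_rows
    print(f"{path}: {n_rows:,} rows")
```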
Step 2: R descriptive analysis (goon_analysis.qmd)
Purpose: Demographic analysis (r/GOONEDmeetup flair), word frequency analysis, substance mention counts.
R packages required:
install.packages(c("dplyr", "tidyr", "tibble", "purrr", "ggplot2",
"stringr", "here", "e1071", "tidytext",
"data.table", "arrow"))
Run:
Open goon_analysis.qmd in RStudio and click Render.
Prerequisites: data/comments.parquet and data/posts.parquet must
exist (Step 1).
Expected outputs:

- Rendered HTML with plots embedded
- In-memory: word count tables, substance mention counts, demographic counts
Step 3: Python BERTopic topic modelling (topic analysis/topic_analysis.qmd)
Purpose: Embeds all documents with all-MiniLM-L6-v2, runs HDBSCAN
topic modelling via BERTopic, reduces topics using c-TF-IDF
agglomerative clustering, evaluates models, and exports all topic
outputs.
3a. Create the Python virtual environment
cd /Users/bkot7579/Desktop/Goon
python3 -m venv .venv
source .venv/bin/activate
pip install bertopic umap-learn hdbscan sentence-transformers \
scikit-learn pandas pyarrow quarto
Key package versions (tested):
| Package | Version |
|---|---|
| bertopic | 0.16.x |
| umap-learn | 0.5.x |
| hdbscan | 0.8.x |
| sentence-transformers | 2.x |
| scikit-learn | 1.x |
| pandas | 2.x |
| pyarrow | 14.x |
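To confirm the environment roughly matches the tested versions, a quick check along these lines can be run inside the activated .venv (a convenience sketch, not part of the pipeline):

```python
from importlib.metadata import version

# Distribution names as listed in the table above
for pkg in ["bertopic", "umap-learn", "hdbscan", "sentence-transformers",
            "scikit-learn", "pandas", "pyarrow"]:
    print(f"{pkg}: {version(pkg)}")
```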
3b. Run the pipeline
Pilot run (200k documents — recommended first):
cd /Users/bkot7579/Desktop/Goon
./topic\ analysis/run_topic_analysis.sh --max-docs 200000 --run-tag pilot_200k
Outputs land in topic analysis/runs/pilot_200k/.
Full corpus run (~2.35M cleaned documents after filtering):
cd /Users/bkot7579/Desktop/Goon
./topic\ analysis/run_topic_analysis.sh --run-tag full_run_v1
Warning: The full run requires ~64 GB RAM for the embedding + UMAP stages. See
topic analysis/MEMORY_MANAGEMENT.md for strategies if RAM is limited.
Pipeline stages (automatically executed in order; a minimal code sketch follows this list):

- Data ingestion — loads data/posts.parquet + data/comments.parquet
- Light cleaning — deduplication, URL/username/subreddit anonymisation, markdown stripping; deleted/removed rows saved separately to data/corpus_deleted.parquet
- Embedding — all-MiniLM-L6-v2 (384-dim), batch_size=512, saved as shards; skips already-generated shards on resume
- UMAP — pre-computed once, reused across all HDBSCAN configs
- BERTopic — 6 configurations: min_cluster_size ∈ {50, 100, 200} × method ∈ {eom, leaf}
- Topic reduction — c-TF-IDF agglomerative clustering to 100, 50, and 25 topics
- Evaluation — NPMI coherence, topic diversity, outlier rates
- Export — keyword tables, representative docs, summary CSVs
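For orientation, the core modelling flow looks roughly like the sketch below. It is not a substitute for topic_analysis.qmd (which adds sharded embedding, caching/resume logic, the six HDBSCAN configurations, and the c-TF-IDF agglomerative reduction); the "text" column name is an assumption, and reduce_topics is used here only as a stand-in for the reduction step.

```python
import pandas as pd
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic

# Load the cleaned modelling corpus ("text" column name is an assumption)
docs = pd.read_parquet("data/corpus_clean.parquet")["text"].tolist()

# Embed once so the embeddings can be reused across configurations
embedder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedder.encode(docs, batch_size=512, show_progress_bar=True)

# Reference configuration: n_neighbors=15, n_components=5, min_cluster_size=100, eom
umap_model = UMAP(n_neighbors=15, n_components=5, metric="cosine", random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=100, cluster_selection_method="eom",
                        prediction_data=True)

topic_model = BERTopic(
    embedding_model=embedder,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=CountVectorizer(stop_words="english", lowercase=True),
    calculate_probabilities=False,
)
topics, _ = topic_model.fit_transform(docs, embeddings=embeddings)

# Stand-in for the reduction step: BERTopic's built-in merging down to 50 topics
topic_model.reduce_topics(docs, nr_topics=50)
topic_model.get_topic_info().to_csv("topic_summary_sketch.csv", index=False)
```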
Reproducibility: Random seed = 42 throughout. A
reproducibility_log.json is written to the run folder with all
settings and package versions.
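To confirm which settings and versions a given run actually used, the log can simply be inspected (the exact filename location within the run folder may differ):

```python
import json

# Adjust the run tag for other runs
with open("topic analysis/runs/full_run_v1/reproducibility_log.json") as f:
    print(json.dumps(json.load(f), indent=2))
```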
3c. Optional API-based topic labelling (topic_api_labeling.qmd)
Sends reduced topic summaries (keywords + representative texts) to an LLM API to generate human-readable labels. Does NOT send raw corpus text.
Set your API key, then render:
export OPENAI_API_KEY="your-key-here"
quarto render "topic analysis/topic_api_labeling.qmd"
Outputs are saved to topic analysis/runs/<run-tag>/topics/api/.
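The general shape of this step is sketched below. topic_api_labeling.qmd is the authoritative implementation; the model name, prompt, and column names here are placeholders rather than what the script necessarily uses.

```python
import json
import os

import pandas as pd
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
topics = pd.read_csv("topic analysis/runs/full_run_v1/topics/final_topic_summary.csv")

labels = {}
for _, row in topics.iterrows():
    # Only keywords/summaries are sent, never raw corpus text
    prompt = ("Suggest a short, neutral, human-readable label for a Reddit "
              f"discussion topic with these keywords: {row['keywords']}")
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[{"role": "user", "content": prompt}],
    )
    labels[int(row["topic"])] = resp.choices[0].message.content.strip()

with open("topic analysis/runs/full_run_v1/topics/api/labels_sketch.json", "w") as f:
    json.dump(labels, f, indent=2)
```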
Step 4: R topic visualisations (goon_topic_analysis.qmd)
Purpose: Loads the CSV/Parquet outputs from the BERTopic run and produces exploratory visualisations: topic size bar chart, subreddit × topic heatmap, topic prevalence over time, post vs comment split, representative documents.
R packages required:
install.packages(c("dplyr", "tidyr", "tibble", "purrr",
"ggplot2", "stringr", "here", "arrow"))
Run:
quarto render goon_topic_analysis.qmd
Prerequisites: Step 3 must have completed and outputs must exist
under topic analysis/runs/full_run_v1/topics/.
Execution order summary
Step 1 — data_prep.qmd (R) ~30 min
Step 2 — goon_analysis.qmd (R) ~10 min
Step 3 — topic_analysis.qmd (Python) ~6–48 hours (full corpus)
Step 3b — topic_api_labeling.qmd (Python) ~5 min + API cost (optional)
Step 4 — goon_topic_analysis.qmd (R) ~2 min
Steps 2 and 3 are independent of each other and can run in parallel.
Key modelling decisions
| Decision | Choice | Rationale |
|---|---|---|
| Embedding model | all-MiniLM-L6-v2 | Fast, runs on CPU, 384-dim sufficient for topic structure |
| UMAP n_neighbors | 15 | BERTopic default; balances local vs global structure |
| UMAP n_components | 5 | Low enough for HDBSCAN to work well |
| HDBSCAN min_cluster_size | 50, 100, 200 | Tested all three; min_cluster_size=100 with eom selected as reference |
| Topic reduction method | c-TF-IDF agglomerative | Merges semantically similar topics rather than splitting clusters |
| Reduction targets | 100, 50, 25 | 50 selected for reporting (NPMI=0.27, diversity=0.74) |
| Preprocessing | Minimal | Preserves informal language and slang; CountVectorizer handles casing/stopwords |
| Random seed | 42 | Applied to UMAP, HDBSCAN sampling, and document cap sampling |
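Of the evaluation metrics above, topic diversity and outlier rate are straightforward to recompute from a fitted model; a sketch is shown below. NPMI coherence additionally needs a tokenised reference corpus (e.g. via gensim) and is omitted here.

```python
def topic_diversity(topic_model, top_k=10):
    """Share of unique words across each topic's top_k keywords (1.0 = no overlap)."""
    words = [word
             for topic_id in topic_model.get_topics()
             if topic_id != -1
             for word, _ in topic_model.get_topic(topic_id)[:top_k]]
    return len(set(words)) / len(words)


def outlier_rate(topics):
    """Share of documents HDBSCAN left in the outlier topic (-1)."""
    return sum(t == -1 for t in topics) / len(topics)
```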
Known issue: r/GOONED missing from BERTopic results
r/GOONED is the dominant subreddit in the corpus by a large margin:
| | Count |
|---|---|
| r/GOONED posts | 2,765,119 |
| r/GOONED comments | 15,493,075 |
| r/GOONED total | 18,258,194 |
| Full cleaned corpus | 22,011,124 |
| r/GOONED share | 82.9% |
Despite this, r/GOONED is entirely absent from
topic analysis/runs/full_run_v1/ outputs. The cloud VM that ran the
BERTopic pipeline did not have the GOONED_comments.csv and
GOONED_submissions.csv files available (most likely because their
combined size of ~7 GB made transfer impractical), and
corpus_clean.parquet on the VM was generated without them.
Consequence: All topic modelling results represent 29 subreddits (3.75M records) rather than the full 30-subreddit corpus (22M records). Topic proportions, dominant themes, and subreddit distribution tables are therefore not representative of the full corpus.
To fix: Ensure the GOONED_*.csv files (or the pre-built
comments.parquet / posts.parquet) are available on the compute
environment, then re-run:
./topic\ analysis/run_topic_analysis.sh --run-tag full_run_v2
Note: the R-side analyses (goon_analysis.qmd) are not affected by
this issue — they read directly from the local data/posts.parquet and
data/comments.parquet files, which do contain r/GOONED data.
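Before launching full_run_v2, it is worth verifying on the compute environment that r/GOONED rows actually made it into the modelling corpus. A minimal check along these lines could be used (the subreddit column name and exact label are assumptions):

```python
import pandas as pd

# Count rows attributed to r/GOONED in the cleaned modelling corpus
subs = pd.read_parquet("data/corpus_clean.parquet", columns=["subreddit"])
n_gooned = int((subs["subreddit"] == "GOONED").sum())
print(f"r/GOONED rows in corpus_clean.parquet: {n_gooned:,}")
```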
Known limitations
Outlier rate: 37.3% of documents are assigned to the outlier topic (-1) by HDBSCAN. This is typical for short-text social media corpora. Outliers are excluded from topic analyses but are retained in the corpus parquet files.
Flair ambiguity: Gender/sexuality classifiers rely on voluntarily set flair strings. Flair adoption is uneven across subreddits and users, introducing selection bias. Abbreviations like \bt\b may match unintended strings (e.g. US state TX).
Deleted content: Posts and comments marked [deleted] or [removed] are excluded from modelling but counted separately. These may disproportionately represent controversial content.
Temporal coverage: Coverage varies by subreddit. Some communities only appear in later years (2024–2025); others span the full 2019–2025 window.
Platform-specific norms: Moderation rules, flair conventions, and posting styles differ across subreddits, which may shape topics in ways that are not generalisable.
Unobserved participants: Lurkers, banned users, and deleted accounts are not captured.
Output files (full_run_v1)
| File | Description |
|---|---|
| topics/final_topic_summary.csv | 50 reduced topics with size, keywords, representative texts |
| topics/final_model_comparison.csv | Coherence, diversity, outlier rate for all 9 runs |
| topics/evaluation_table.csv | Same as above, alternate format |
| topics/bertopic_run_summary.csv | Initial topic counts across 6 HDBSCAN configurations |
| topics/topic_by_subreddit.csv | Topic × subreddit document counts |
| topics/topic_by_month.csv | Topic × month document counts |
| topics/topic_by_doctype.csv | Topic × doc type (post/comment) |
| topics/preprocessing_decisions.json | Logged cleaning decisions |
| topics/api/ | API-generated labels, summaries, category annotations |
| bertopic/ | Saved BERTopic models + doc-topic parquet files for all 6 initial runs |
Contacts and citation
This analysis was conducted as part of a preliminary empirical study of online gooning communities. If replicating, please cite the original study and note the random seed, embedding model, and reduction target used.