🗂️ BERTopic: African Community Report Clustering

Unsupervised topic discovery model trained on real Swahili news headlines and synthetic community reports.
Discovers previously unknown signal patterns in messy, multilingual community text, with no labels required.

Built for civic reporting platforms such as Ushahidi Distant Voices, where communities across Africa submit reports via SMS, WhatsApp, and voice notes.

Model Summary

| Property | Value |
| --- | --- |
| Number of topics discovered | 47 |
| Training documents | 1,693 |
| Languages | Swahili + English (multilingual) |
| Embedding model | paraphrase-multilingual-MiniLM-L12-v2 |
| Clustering | HDBSCAN |
| Dimensionality reduction | UMAP |

What This Model Does

Given raw community reports in Swahili or English, the model:

  • Groups similar reports into topic clusters automatically
  • Assigns keywords that describe each cluster
  • Enables spike detection: per-topic counts over time show when a topic suddenly increases in volume
  • Works on messy text: typos, abbreviations, mixed languages
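
The keyword step can be illustrated with class-based TF-IDF (c-TF-IDF), the weighting BERTopic uses by default to describe clusters. The following is a minimal pure-Python sketch on toy data, not the library's implementation: the function name `ctfidf_keywords`, the whitespace tokenization, and the sample reports are assumptions made for illustration.

```python
import math
from collections import Counter


def ctfidf_keywords(docs_by_topic, top_n=3):
    """Rank words per topic with a toy class-based TF-IDF.

    docs_by_topic maps a topic ID to the list of reports in that cluster.
    Weight of term t in class c: tf(t, c) * log(1 + A / f(t)), where tf is the
    term's share of the class's words, A is the average number of words per
    class, and f(t) is the term's total count across all classes.
    """
    # Treat each cluster as one long document and count its words.
    class_counts = {
        topic: Counter(word for doc in docs for word in doc.lower().split())
        for topic, docs in docs_by_topic.items()
    }
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)

    avg_words_per_class = sum(total_counts.values()) / len(class_counts)

    keywords = {}
    for topic, counts in class_counts.items():
        n_words = sum(counts.values())
        scores = {
            word: (count / n_words)
            * math.log(1 + avg_words_per_class / total_counts[word])
            for word, count in counts.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        keywords[topic] = ranked[:top_n]
    return keywords


# Hypothetical toy clusters, not output of the published model.
toy_clusters = {
    0: ["road to camp blocked", "camp road blocked again"],
    1: ["chanjo ya corona imefika", "corona chanjo kliniki"],
}
print(ctfidf_keywords(toy_clusters))
```

Words that are frequent inside one cluster but rare elsewhere rise to the top, which is why the keywords tend to read as a short description of the cluster.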

Key Topics Discovered

The model discovered 47 topics in the training data, including:

| Topic | Keywords | What It Captures |
| --- | --- | --- |
| 1 | corona, virusi, chanjo | Health / COVID signals |
| 4 | uchaguzi, uganda, kenya, matokeo | East African elections |
| 7 | ethiopia, tigray, mapigano | Conflict / crisis signals |
| 8 | ukraine, urusi, vita, mzozo | Geopolitical conflict |
| 13 | damu, saratani, ugonjwa | Disease / health crisis |
| 23 | camp, road, blocked | Humanitarian access issues |
| 29 | matokeo, arrived, boxes | Election results / voting |
| 41 | raila, odinga, kisiasa | Political reporting |

Topic 23 ("main camp road blocked near") is particularly relevant: the model discovered humanitarian access reports as a distinct cluster without being told that this category exists.

Usage

Install the library:

pip install -U bertopic

Then load the model and classify new reports in Python:

from bertopic import BERTopic

topic_model = BERTopic.load("katoernest/bertopic-african-community-reports")

# Get all discovered topics (returns a pandas DataFrame)
print(topic_model.get_topic_info())

# Classify new reports
topics, probs = topic_model.transform([
    "mafuriko makubwa yameharibu mazao shambani",
    "voters turned away from polling station",
    "food distribution blocked at camp gate"
])
print(topics)   # topic IDs (-1 marks an outlier with no assigned topic)
print(probs)    # confidence scores
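
Once topic IDs and scores are available, a common next step is to decide which reports can be tagged automatically and which should go to a human. The sketch below is one possible approach, not part of the published model: the helper `route_reports`, the `min_score` threshold of 0.5, and the sample values are hypothetical, and it assumes one confidence score per report.

```python
def route_reports(reports, topics, scores, min_score=0.5):
    """Split reports into auto-tagged and needs-review buckets.

    topics: one topic ID per report (-1 is BERTopic's outlier label).
    scores: one confidence value per report (assumed layout).
    min_score: hypothetical threshold, to be tuned on real data.
    """
    tagged, review = [], []
    for report, topic, score in zip(reports, topics, scores):
        if topic == -1 or score < min_score:
            review.append((report, topic, score))
        else:
            tagged.append((report, topic, score))
    return tagged, review


# Made-up values standing in for the output of topic_model.transform(...).
reports = ["report A", "report B", "report C"]
topics = [12, 30, -1]
scores = [0.81, 0.42, 0.0]

tagged, review = route_reports(reports, topics, scores)
print("auto-tagged:", tagged)
print("needs review:", review)
```

Sending outliers and low-confidence assignments to review keeps uncertain reports in front of a person instead of silently filing them under the nearest topic.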

Pipeline Position

This model sits at the signal detection stage of the Distant Voices data pipeline:

Community report received (SMS / WhatsApp / voice note)
                ↓
       Transcription (Whisper)
                ↓
    [This model] - Topic clustering
                ↓
    Spike detection → alerts
                ↓
   Human review queue → dashboard
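
The source does not specify how the spike detection stage works. One simple possibility, sketched here under that assumption, is to count topic assignments per time window and flag a topic when its latest count sits far above its own history. The function `detect_spikes`, its thresholds, and the sample windows are all hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def detect_spikes(windows, z_threshold=3.0, min_count=5):
    """Flag topics whose count in the latest window is unusually high.

    windows: list of lists of topic IDs, one inner list per time window,
             oldest first; the last window is the one being checked.
    Returns sorted (topic, count, z_score) tuples for flagged topics.
    """
    if len(windows) < 2:
        raise ValueError("need at least one past window plus the latest one")

    *history, latest = windows
    history_counts = [Counter(window) for window in history]
    latest_counts = Counter(t for t in latest if t != -1)  # skip outliers

    spikes = []
    for topic, count in latest_counts.items():
        past = [counts[topic] for counts in history_counts]
        baseline = mean(past)
        # Floor the spread at 1.0 so a flat history cannot divide by zero.
        spread = max(pstdev(past), 1.0)
        z = (count - baseline) / spread
        if count >= min_count and z >= z_threshold:
            spikes.append((topic, count, round(z, 2)))
    return sorted(spikes)


# Made-up topic IDs per window; the last window is the most recent one.
windows = [
    [4, 4, 23, 1],
    [4, 1, 1, 23],
    [4, 4, 1],
    [23, 23, 23, 23, 23, 23, 4, 1],
]
print(detect_spikes(windows))
```

The `min_count` floor keeps rare topics from triggering alerts on one or two reports; both thresholds would need tuning against real report volumes.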

Why Unsupervised Clustering Matters

In a crisis or election, new unknown patterns emerge that nobody pre-labelled. Standard classification fails when the world changes faster than your label set. This model finds those patterns automatically, answering:

"What are communities talking about right now that we have not seen before?"

Dataset

Trained on:

  • MasakhaNEWS Swahili: real African news headlines
  • Synthetic community reports mirroring civic, climate, crisis, and election submissions

Framework Versions

  • BERTopic: latest
  • HDBSCAN: 0.8.44
  • UMAP: 0.5.12
  • Sentence-transformers: 5.5.1
  • Python: 3.12.13

Author

Kato Ernest Henry
AI Research & MLOps Engineer, Kampala, Uganda
henry38ernest@gmail.com
HuggingFace
GitHub
