BERTopic: African Community Report Clustering
Unsupervised topic discovery model trained on real Swahili news headlines and synthetic community reports.
Discovers previously unknown signal patterns in messy, multilingual community text, with no labels required.
Built for civic reporting platforms like Ushahidi Distant Voices where communities submit reports via SMS, WhatsApp, and voice notes across Africa.
Model Summary
| Property | Value |
|---|---|
| Number of topics discovered | 47 |
| Training documents | 1,693 |
| Languages | Swahili + English (multilingual) |
| Embedding model | paraphrase-multilingual-MiniLM-L12-v2 |
| Clustering | HDBSCAN |
| Dimensionality reduction | UMAP |
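As a rough illustration of how the components in the table fit together, the sketch below assembles a BERTopic pipeline from the same embedding model, UMAP, and HDBSCAN. Every hyperparameter value and the function name `build_topic_model` are assumptions for illustration; the settings actually used to train this model are not stated here.

```python
def build_topic_model(min_cluster_size=10, random_state=42):
    """Assemble a BERTopic pipeline from the components listed in the table.

    All hyperparameter values are illustrative assumptions, not the
    settings used to train the published model.
    """
    # Imports are local so this module loads even without the heavy dependencies.
    from bertopic import BERTopic
    from hdbscan import HDBSCAN
    from sentence_transformers import SentenceTransformer
    from umap import UMAP

    # Multilingual sentence embeddings (covers Swahili and English).
    embedding_model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

    # Reduce embedding dimensionality before clustering.
    umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
                      metric="cosine", random_state=random_state)

    # Density-based clustering; documents that fit no cluster become outliers.
    hdbscan_model = HDBSCAN(min_cluster_size=min_cluster_size, metric="euclidean",
                            cluster_selection_method="eom", prediction_data=True)

    return BERTopic(embedding_model=embedding_model,
                    umap_model=umap_model,
                    hdbscan_model=hdbscan_model)


# Example (requires the libraries above and a list of documents `docs`):
# topic_model = build_topic_model()
# topics, probs = topic_model.fit_transform(docs)
```

A smaller `min_cluster_size` would tend to produce more, finer-grained topics, which may matter with a corpus of only 1,693 documents.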
What This Model Does
Given raw community reports in Swahili or English, the model:
- Groups similar reports into topic clusters automatically
- Assigns keywords that describe each cluster
- Supports spike detection when a topic suddenly increases in volume
- Works on messy text: typos, abbreviations, mixed languages
Key Topics Discovered
The model discovered 47 topics in the training data, including:
| Topic | Keywords | What It Captures |
|---|---|---|
| 1 | corona, virusi, chanjo | Health / COVID signals |
| 4 | uchaguzi, uganda, kenya, matokeo | East African elections |
| 7 | ethiopia, tigray, mapigano | Conflict / crisis signals |
| 8 | ukraine, urusi, vita, mzozo | Geopolitical conflict |
| 13 | damu, saratani, ugonjwa | Disease / health crisis |
| 23 | camp, road, blocked | Humanitarian access issues |
| 29 | matokeo, arrived, boxes | Election results / voting |
| 41 | raila, odinga, kisiasa | Political reporting |
Topic 23 (main camp road blocked near) is particularly relevant: the model discovered humanitarian access reports as a distinct cluster without being told this category exists.
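To check the keywords behind any topic in the table, something like the following could be used. It relies on BERTopic's `get_topic`, which returns a list of `(word, score)` pairs, or `False` when the topic ID does not exist. The helper name `topic_keywords` is hypothetical.

```python
def topic_keywords(topic_model, topic_id, top_n=5):
    """Return up to top_n keywords for a topic, or [] if the ID is unknown."""
    words = topic_model.get_topic(topic_id)  # list of (word, score), or False
    if not words:
        return []
    return [word for word, _score in words[:top_n]]


# Example (requires the loaded model):
# from bertopic import BERTopic
# topic_model = BERTopic.load("katoernest/bertopic-african-community-reports")
# print(topic_keywords(topic_model, 23))
```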
Usage
pip install -U bertopic
from bertopic import BERTopic
topic_model = BERTopic.load("katoernest/bertopic-african-community-reports")
# Get all discovered topics
topic_model.get_topic_info()
# Classify a new report
topics, probs = topic_model.transform([
    "mafuriko makubwa yameharibu mazao shambani",
    "voters turned away from polling station",
    "food distribution blocked at camp gate",
])
print(topics) # topic IDs
print(probs) # confidence scores
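Raw topic IDs are hard to read on their own, and BERTopic assigns the ID `-1` to documents that fit no cluster. A minimal sketch of turning predictions into readable labels follows. The helper `label_reports` and the `"outlier"` label are illustrative assumptions, and the probabilities are ignored because their shape depends on how the model was configured.

```python
def label_reports(topic_model, reports, top_n=3):
    """Pair each report with its topic ID and a short keyword label."""
    topics, _probs = topic_model.transform(reports)
    labelled = []
    for report, topic_id in zip(reports, topics):
        topic_id = int(topic_id)
        if topic_id == -1:
            # -1 is BERTopic's outlier topic: the report matched no cluster.
            label = "outlier"
        else:
            # get_topic returns (word, score) pairs, or False if the ID is unknown.
            words = topic_model.get_topic(topic_id) or []
            label = ", ".join(word for word, _score in words[:top_n])
        labelled.append({"report": report, "topic": topic_id, "label": label})
    return labelled


# Example (requires the loaded model):
# for row in label_reports(topic_model, ["food distribution blocked at camp gate"]):
#     print(row["topic"], row["label"], "|", row["report"])
```

Outlier reports may be worth routing to human review rather than discarding, since they can include themes the model has not seen.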
Pipeline Position
This model sits at the signal detection stage of the Distant Voices data pipeline:
Community report received (SMS / WhatsApp / voice note)
↓
Transcription (Whisper)
↓
[This model]: topic clustering
↓
Spike detection → alerts
↓
Human review queue → dashboard
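The spike detection stage is not described in detail here. One simple way it could work is to compare each topic's count in the current time window against its counts in earlier windows. The sketch below uses a z-score for that comparison; the function name, thresholds, and windowing are all assumptions, not the Distant Voices implementation.

```python
from collections import Counter
from statistics import mean, pstdev


def detect_spikes(history, current, z_threshold=3.0, min_count=5):
    """Flag topics whose count in the current window is far above baseline.

    history: list of past windows, each a list of topic IDs (one per report).
    current: list of topic IDs for the window being checked.
    Returns (topic_id, count, z_score) tuples, strongest spike first.
    """
    past_counts = [Counter(window) for window in history]
    spikes = []
    for topic_id, count in Counter(current).items():
        if topic_id == -1 or count < min_count:
            continue  # skip outliers and low-volume topics
        baseline = [c.get(topic_id, 0) for c in past_counts]
        mu = mean(baseline) if baseline else 0.0
        sigma = pstdev(baseline) if baseline else 0.0
        # Floor sigma at 1 so a flat baseline does not yield huge z-scores.
        z = (count - mu) / max(sigma, 1.0)
        if z >= z_threshold:
            spikes.append((topic_id, count, round(z, 2)))
    return sorted(spikes, key=lambda s: s[2], reverse=True)


# Example: topic 23 jumps from 0-1 reports per window to 8.
history = [[23, 1, 1], [1, 1, 23], [1, 4]]
current = [23] * 8 + [1, 1]
spikes = detect_spikes(history, current)
```

The `min_count` floor keeps a topic with one or two reports from triggering an alert, which is a judgment call that would need tuning against real traffic.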
Why Unsupervised Clustering Matters
In a crisis or election, new unknown patterns emerge that nobody pre-labelled. Standard classification fails when the world changes faster than your label set. This model finds those patterns automatically, answering:
"What are communities talking about right now that we have not seen before?"
Dataset
Trained on:
- MasakhaNEWS Swahili: real African news headlines
- Synthetic community reports mirroring civic, climate, crisis, and election submissions
Framework Versions
- BERTopic: latest
- HDBSCAN: 0.8.44
- UMAP: 0.5.12
- Sentence-transformers: 5.5.1
- Python: 3.12.13
Author
Kato Ernest Henry
AI Research & MLOps Engineer, Kampala, Uganda
henry38ernest@gmail.com
HuggingFace
GitHub