Rogendo committed on
Commit
329005a
·
verified ·
1 Parent(s): d83c55d

Update README.md

Files changed (1)
  1. README.md +147 -0
README.md CHANGED
@@ -6,5 +6,152 @@ colorTo: yellow
6
  sdk: static
7
  pinned: false
8
  ---
9
+ # JengaAI
10
+
11
+ **Open-source ML training and inference framework for African AI.**
12
+
13
+ Train multi-task NLP, Speech, and Vision models with a single YAML config — no code required. Built in Kenya, built for Africa.
14
+
15
+ ---
16
+
17
+ ## What We Build
18
+
19
+ JengaAI is a framework that lets researchers, engineers, and non-technical teams train production-grade machine learning models on African language data — and deploy them without vendor lock-in, without API dependencies, and without sending sensitive data to foreign servers.
20
+
21
+ > *Your model. Your data. Your task.*
22
+
23
+ ---
24
+
25
+ ## Models
26
+
27
+ | Model | Task | Language | Base |
28
+ |-------|------|----------|------|
29
+ | [Rogendo/afribert-kenya-adapted](https://huggingface.co/Rogendo/afribert-kenya-adapted) | Masked Language Modeling (DAPT) | Swahili · Sheng · English | castorini/afriberta_large |
30
+ | [Rogendo/cpims-nlp-intent-urgency](https://huggingface.co/Rogendo/cpims-nlp-intent-urgency) | Intent + Urgency Classification | Swahili · Sheng · English | afribert-kenya-adapted |
31
+
32
+ ### afribert-kenya-adapted
33
+ Domain-adaptive pre-training of AfriBERTa (`castorini/afriberta_large`) on ~39M tokens of Kenyan language data: Swahili Wikipedia, East African journalism, a synthetic Sheng/code-switch corpus, and real CPIMS field worker WhatsApp data. It achieves a **30.4% average perplexity improvement** over the base model on Kenyan domain text, with a **66% improvement on Sheng** and **41% on English-Swahili code-switching**.
34
+
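For readers less familiar with the metric, perplexity is the exponential of the mean per-token negative log-likelihood, and a relative improvement compares the adapted model against the base model. The sketch below shows the arithmetic only. The loss values are made-up placeholders, not measurements from these models, and the function names are hypothetical.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    if not token_nlls:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_nlls) / len(token_nlls))

def relative_improvement(base_ppl, adapted_ppl):
    """Fractional perplexity reduction relative to the base model."""
    return (base_ppl - adapted_ppl) / base_ppl

# Placeholder per-token losses in natural-log units, NOT real measurements.
base_losses = [2.9, 3.1, 3.0]
adapted_losses = [2.6, 2.7, 2.5]

base_ppl = perplexity(base_losses)
adapted_ppl = perplexity(adapted_losses)
improvement = relative_improvement(base_ppl, adapted_ppl)
print(f"base={base_ppl:.2f} adapted={adapted_ppl:.2f} improvement={improvement:.1%}")
```

For a masked language model the losses would typically be collected over masked positions on a held-out set, once per model, and an average figure would be the mean of the per-domain improvements.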
35
+ ### cpims-nlp-intent-urgency
36
+ Multi-task classifier trained on CPIMS child protection support messages. Simultaneously predicts **63 intent classes** and **urgency level** (high / medium / low) from a single encoder pass. Intent F1: **74.5%** — up from 46% on a generic English base model. Handles English, Swahili, and Kenyan code-switching.
37
+
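To illustrate what "a single encoder pass" means, here is a toy sketch in plain Python: one shared feature vector is computed once and then fed to two independent classification heads. Everything in it is an assumption for illustration. The `encode` function is a stand-in for a transformer encoder, the weights are random and untrained, and none of this is the model's actual architecture or code.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear(weights, features):
    """One logit per weight row: dot(row, features)."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]

def make_head(num_classes, hidden_size, rng):
    """Random, untrained weights for a classification head."""
    return [[rng.uniform(-1, 1) for _ in range(hidden_size)]
            for _ in range(num_classes)]

def encode(text, hidden_size=8):
    """Stand-in for a transformer encoder: a toy bag-of-characters vector."""
    features = [0.0] * hidden_size
    for ch in text.lower():
        features[ord(ch) % hidden_size] += 1.0
    norm = math.sqrt(sum(f * f for f in features)) or 1.0
    return [f / norm for f in features]

rng = random.Random(0)
intent_head = make_head(63, 8, rng)   # 63 intent classes, as described above
urgency_head = make_head(3, 8, rng)   # high / medium / low

shared = encode("Nahitaji msaada haraka")  # the encoder runs once
intent_probs = softmax(linear(intent_head, shared))
urgency_probs = softmax(linear(urgency_head, shared))
print(len(intent_probs), len(urgency_probs))
```

The design point is that both heads read the same `shared` representation, so adding a task costs one extra linear layer rather than a second encoder.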
38
+ With this framework, the possibilities for language and natural language processing are limitless!
39
+
40
+ ---
41
+
42
+ ## The Framework
43
+
44
+ ```bash
45
+ pip install jenga-ai
46
+ ```
47
+
48
+ Train any model with a single YAML config:
49
+
50
+ ```yaml
51
+ project_name: swahili-hate-speech
52
+
53
+ model:
54
+   base_model: castorini/afriberta_large
55
+   max_seq_len: 128
56
+
57
+ tasks:
58
+   - name: classification
59
+     type: single_label_classification
60
+     data_path: data/hate_speech.csv
61
+     text_column: text
62
+     label_column: label
63
+
64
+ training:
65
+   epochs: 5
66
+   batch_size: 16
67
+   learning_rate: 3.0e-5
68
+ ```
69
+
70
+ ```bash
71
+ python -m jenga_ai train --config swahili-hate-speech.yaml
72
+ ```
73
+
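Before launching a run, it can help to sanity-check the config. The sketch below mirrors the YAML above as a Python dict and checks a few fields. The `validate_config` helper and its rules are illustrative assumptions, not part of the `jenga_ai` package, whose actual schema may differ.

```python
def validate_config(config):
    """Return a list of problems found in a training config dict."""
    problems = []
    for section in ("project_name", "model", "tasks", "training"):
        if section not in config:
            problems.append(f"missing section: {section}")
    for i, task in enumerate(config.get("tasks", [])):
        for key in ("name", "type", "data_path", "text_column", "label_column"):
            if key not in task:
                problems.append(f"tasks[{i}] missing key: {key}")
    training = config.get("training", {})
    if training.get("epochs", 1) < 1:
        problems.append("training.epochs must be >= 1")
    if not 0 < training.get("learning_rate", 1e-5) < 1:
        problems.append("training.learning_rate must be between 0 and 1")
    return problems

# Same values as the YAML example, written as a dict to avoid a YAML dependency.
config = {
    "project_name": "swahili-hate-speech",
    "model": {"base_model": "castorini/afriberta_large", "max_seq_len": 128},
    "tasks": [
        {
            "name": "classification",
            "type": "single_label_classification",
            "data_path": "data/hate_speech.csv",
            "text_column": "text",
            "label_column": "label",
        }
    ],
    "training": {"epochs": 5, "batch_size": 16, "learning_rate": 3.0e-5},
}

problems = validate_config(config)
print(problems or "config looks OK")
```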
74
+ ### Supported modalities
75
+
76
+ | Modality | Status | Notes |
77
+ |----------|--------|-------|
78
+ | NLP — classification, NER, multi-task | ✅ Production | Multi-task with shared encoder + dual heads |
79
+ | Speech — Whisper fine-tuning, transcription | ⚙️ Active development | ASR for Swahili and African languages |
80
+ | Vision — classification, OCR, object detection | ⚙️ Active development | Document verification, image classification |
81
+ | LLM — LoRA fine-tuning, Ollama integration | ⚙️ Active development | Swahili instruction tuning |
82
+
83
+ ### Key capabilities
84
+
85
+ - **Multi-task learning** — one encoder, multiple task heads, shared representations
86
+ - **Domain adaptation** — continued MLM pre-training for African language domains
87
+ - **Responsible AI built in** — explainability engine, audit trail, human-in-the-loop routing, bias evaluation
88
+ - **Offline-first** — trained models run without internet, no per-query API cost
89
+ - **HuggingFace native** — load any HF model as base, push trained models to Hub
90
+ - **No-code web platform** — upload CSV, click Train, get predictions
91
+
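One common way to implement human-in-the-loop routing is a confidence threshold: predictions whose top probability falls below the threshold are flagged for a person to review. The sketch below is a minimal illustration under that assumption. The `route_prediction` function and the 0.7 threshold are hypothetical, not JengaAI's actual API or defaults.

```python
def route_prediction(probabilities, labels, threshold=0.7):
    """Pick the top label and flag the prediction if confidence is low."""
    confidence = max(probabilities)
    label = labels[probabilities.index(confidence)]
    return {
        "label": label,
        "confidence": confidence,
        "needs_review": confidence < threshold,
    }

labels = ["high", "medium", "low"]
confident = route_prediction([0.91, 0.06, 0.03], labels)
uncertain = route_prediction([0.45, 0.40, 0.15], labels)
print(confident)
print(uncertain)
```

In practice the threshold would be tuned per task on validation data, since the cost of a missed high-urgency case differs from the cost of an unnecessary review.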
92
+ ---
93
+
94
+ ## Why JengaAI Exists
95
+
96
+ Africa's AI ecosystem is being built on API wrappers — products that call GPT-4 or Claude and rebrand the output as "African AI." These products are expensive at scale, dependent on foreign infrastructure, unable to handle African languages properly, and unable to keep sensitive data on the continent.
97
+
98
+ JengaAI exists to make the alternative practical.
99
+
100
+ A locally trained, domain-adapted model:
101
+ - Costs nothing at inference time after training
102
+ - Runs fully offline in low-connectivity environments
103
+ - Can be fine-tuned on your specific institutional language
104
+ - Keeps sensitive data — health records, case notes, financial transactions — under your control
105
+ - Can perform substantially better on African languages than generic models (for example, intent F1 of 74.5% versus 46% with a generic English base model on CPIMS messages)
106
+
107
+ ---
108
+
109
+ ## Use Cases
110
+
111
+ **Child protection systems** — intent classification and urgency triage for CPIMS support messages in English, Swahili, and Sheng
112
+
113
+ **Community health** — symptom extraction and referral urgency from CHW voice notes and field reports
114
+
115
+ **Financial services** — M-PESA dispute classification, fraud signal detection, transaction intent analysis
116
+
117
+ **Government services** — citizen complaint routing, document OCR, service request classification
118
+
119
+ **Education** — student question routing, learner sentiment analysis, multilingual content classification
120
+
121
+ **Media monitoring** — hate speech detection, misinformation flagging, topic classification in Swahili and code-switched text
122
+
123
+ ---
124
+
125
+ ## Responsible AI
126
+
127
+ JengaAI is built with responsible AI development as a core principle, not an afterthought:
128
+
129
+ - **Explainability** — every prediction can be explained in human-readable terms
130
+ - **Audit trails** — hash-chained tamper-evident logging of all inference decisions
131
+ - **Human-in-the-loop** — low-confidence predictions are flagged for human review
132
+ - **Bias evaluation** — evaluation across language, demographic, and domain subgroups
133
+ - **Data sovereignty** — models run locally, data never leaves your infrastructure
134
+ - **Transparent limitations** — model cards document what the model gets wrong, not just what it gets right
135
+
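The hash-chained audit trail mentioned above can be sketched with the standard library. Each log entry stores the hash of the previous entry, so editing any earlier record invalidates every hash after it. This is an illustration of the general technique, not JengaAI's implementation. The record fields and function names are assumptions.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry

def entry_hash(prev_hash, record):
    """Hash the previous hash together with a canonical JSON form of the record."""
    payload = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()

def append_entry(log, record):
    """Append a record, linking it to the hash of the last entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS_HASH
    log.append({
        "record": record,
        "prev_hash": prev_hash,
        "hash": entry_hash(prev_hash, record),
    })

def verify_chain(log):
    """Recompute every hash in order; any edit breaks the chain."""
    prev_hash = GENESIS_HASH
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != entry_hash(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True

log = []
append_entry(log, {"input_id": "msg-001", "label": "high", "confidence": 0.91})
append_entry(log, {"input_id": "msg-002", "label": "low", "confidence": 0.45})
intact = verify_chain(log)

log[0]["record"]["label"] = "low"  # simulate tampering with an old decision
tampered_detected = not verify_chain(log)
print(intact, tampered_detected)
```

A chain like this is only tamper-evident if the most recent hash is also kept somewhere the log writer cannot modify, because anyone able to rewrite the whole log could recompute every hash.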
136
+ ---
137
+
138
+ ## Community
139
+
140
+ JengaAI is developed in the spirit of African AI communities doing the work right — [Data Science Africa](https://www.datascienceafrica.org/), [Masakhane](https://www.masakhane.io/), [Deep Learning Indaba](https://deeplearningindaba.com/), and [AIMS](https://nexteinstein.org/).
141
+
142
+ We believe that building AI for Africa means building it on African data, in African languages, with African institutional contexts — not wrapping foreign models in local branding.
143
+
144
+ ---
145
+
146
+ ## Links
147
+
148
+ - 🐙 **GitHub**: [github.com/Rogendo/JengaAI](https://github.com/Rogendo/JengaAI)
149
+ - 📦 **Framework**: `pip install jenga-ai`
150
+ - 📄 **Docs**: Coming soon
151
+ - 🤝 **Contribute**: Open to contributors — researchers, engineers, domain experts, annotators
152
+
153
+ ---
154
+
155
+ *Built in Kenya 🇰🇪 — for Africa and beyond.*
156
 
157
  Edit this `README.md` markdown file to author your organization card.