TabuLM: Morphology-Aware Tabular Pre-training for Low-Resource Languages
Paper under review: EMNLP 2026 ARR May cycle.
TabuLM is the first pre-trained language model that jointly captures morphological richness and tabular relational structure for a low-resource language. Built on top of KinyaBERT, it introduces three tabular-aware additions to the sequence transformer:
| Component | What it does |
|---|---|
| Row / Col / CellType Embeddings | Additive embeddings encoding each token's grid position and cell type |
| Table-Structure Attention Bias | Learned per-head scalars boosting same-row, same-column, and header attention |
| MCR + CTP objectives | Masked Cell Recovery and Column Type Prediction pre-training tasks |
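To make the attention-bias row concrete, here is a minimal pure-Python sketch of how per-head scalars could be turned into an additive bias on attention logits. It is an illustration under stated assumptions, not the code in `train_tabulm.py`: the function name `structure_bias`, the three-scalar layout per head, and the reading of "header attention" as "the key token is a header cell" are all hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

def structure_bias(
    row_ids: Sequence[Optional[int]],
    col_ids: Sequence[Optional[int]],
    is_header: Sequence[bool],
    head_scalars: Sequence[Tuple[float, float, float]],
) -> List[List[List[float]]]:
    """Return bias[h][i][j], added to the logit of query i attending to key j.

    head_scalars holds one (same_row, same_col, header) triple per head.
    In a real model these would be learned; here they are plain floats.
    Tokens outside the table (e.g. question tokens) use None as row/col id
    and never count as sharing a row or column (assumption).
    """
    n = len(row_ids)
    bias = []
    for same_row_w, same_col_w, header_w in head_scalars:
        head = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                b = 0.0
                if row_ids[i] is not None and row_ids[i] == row_ids[j]:
                    b += same_row_w
                if col_ids[i] is not None and col_ids[i] == col_ids[j]:
                    b += same_col_w
                if is_header[j]:
                    b += header_w
                head[i][j] = b
        bias.append(head)
    return bias

# Toy 2x2 table: tokens 0 and 1 are header cells, tokens 2 and 3 are data cells.
rows = [0, 0, 1, 1]
cols = [0, 1, 0, 1]
headers = [True, True, False, False]
bias = structure_bias(rows, cols, headers, head_scalars=[(1.0, 0.5, 0.25)])
print(bias[0][2][3])  # same row only -> 1.0
print(bias[0][2][0])  # same column + header key -> 0.75
```

In a transformer layer this matrix would be added to the scaled dot-product scores before the softmax, so a positive scalar raises the attention weight between structurally related cells.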
We also release TabQA-kin, the first native Kinyarwanda table question-answering benchmark (526 QA pairs across 31 government tables).
Results on TabQA-kin (dev set)
| Model | Lookup | Comparison | Aggregation | Overall EM |
|---|---|---|---|---|
| GPT-4o (zero-shot) | 82.9 | 79.2 | 25.9 | 64.0 |
| GPT-4o-mini (zero-shot) | 85.7 | 70.8 | 29.6 | 64.0 |
| mBERT (fine-tuned) | 16.7 | 50.0 | 80.8 | 49.3 |
| XLM-R (fine-tuned) | 19.2 | 44.4 | 85.2 | 50.0 |
| KinyaBERT-large (fine-tuned) | 26.7 | 59.1 | 88.9 | 56.3 |
| TabuLM (ours) | 28.6 | 66.7 | 79.2 | 62.0 |
Key finding: GPT-4o and GPT-4o-mini both score 64.0%, a scale-independent LLM ceiling driven by aggregation failure (25-30%). All fine-tuned models break through this ceiling on aggregation (non-overlapping 95% Wilson CIs, statistically significant).
Repository Structure
TabuLM/
├── code/
│   ├── train_tabulm.py           # Pre-training (distributed, LAMB optimizer)
│   ├── finetune_tabqa.py         # Fine-tune + evaluate on TabQA-kin
│   ├── eval_baselines.py         # mBERT / XLM-R / KinyaBERT baselines
│   ├── eval_llm_baseline.py      # GPT-4o / GPT-4o-mini zero-shot eval
│   └── eval_llm_agg_fewshot.py   # 3-shot aggregation experiment
├── data/
│   ├── tables/                   # 172 Kinyarwanda pre-training tables (CSV)
│   └── tabqa_kin.json            # TabQA-kin benchmark (526 QA pairs)
├── results/
│   ├── finetune_tabqa_v3_results.json
│   ├── baseline_results.json
│   ├── llm_baseline_v2_results.json
│   └── llm_baseline_mini_results.json
└── paper/
    ├── tabulm_emnlp2026.tex
    └── tabulm_refs.bib
Setup
git clone https://github.com/TabuLM-Research/TabuLM.git
cd TabuLM
conda create -n tabulm python=3.9
conda activate tabulm
pip install torch transformers youtokentome openai tqdm scipy
Morphological analyzer note: Tier 1 morphological analysis requires libkinlp.so from the KinyaBERT repository. Domain tokens (numerals, entity names) fall back to BPE automatically; all tabular training and evaluation run on the BPE fallback path without the binary.
Pre-training
CUDA_VISIBLE_DEVICES=0 python code/train_tabulm.py \
-g 1 \
--batch-size 8 \
--accumulation-steps 8 \
--number-of-load-batches 24 \
--num-iters 10000 \
--warmup-iter 500 \
--seq-tr-nhead 8
Pre-training takes ~7 hours on a single NVIDIA RTX 3090 (24 GB), warm-started from a KinyaBERT checkpoint.
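As a quick sanity check on the command above, the sketch below works out the effective batch size and the average wall-clock time per iteration. It assumes that `-g` is the number of GPUs and that one iteration in `--num-iters` corresponds to one optimizer update after gradient accumulation; neither is confirmed by the script's documentation, and the helper names are hypothetical.

```python
def effective_batch_size(per_device_batch: int, accumulation_steps: int, num_gpus: int) -> int:
    """Sequences contributing to one optimizer update (assumed semantics)."""
    return per_device_batch * accumulation_steps * num_gpus

def seconds_per_iteration(total_hours: float, num_iters: int) -> float:
    """Average wall-clock seconds per iteration."""
    return total_hours * 3600 / num_iters

# Values taken from the pre-training command and the ~7 hour figure.
print(effective_batch_size(8, 8, 1))               # 64
print(round(seconds_per_iteration(7, 10_000), 2))  # 2.52
print(500 / 10_000)                                # 0.05, i.e. warmup covers 5% of training
```

If your GPU has less than 24 GB of memory, halving `--batch-size` while doubling `--accumulation-steps` keeps the effective batch size unchanged under the same assumptions.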
Fine-tuning on TabQA-kin
python code/finetune_tabqa.py \
--checkpoint data/tabulm_model_pretrained.pt \
--output-prefix data/finetune_tabqa
Fine-tuning runs for 20 epochs with AdamW (lr=2e-5) and only the top 4 layers unfrozen. It converges in under 30 minutes on a single GPU.
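The partial unfreezing can be sketched as below. This is an illustration rather than the logic in `finetune_tabqa.py`: the parameter naming scheme (`layers.<index>.`), the `qa_head.` prefix, and the helper name `unfreeze_top_layers` are assumptions. The function only relies on `(name, param)` pairs whose `param` has a `requires_grad` attribute, which matches what PyTorch's `model.named_parameters()` yields; the demo uses simple stand-in objects so it runs without PyTorch.

```python
import re
from types import SimpleNamespace
from typing import Iterable, List, Tuple

def unfreeze_top_layers(
    named_params: Iterable[Tuple[str, object]],
    num_layers: int,
    top_k: int = 4,
    layer_pattern: str = r"layers\.(\d+)\.",   # assumed naming scheme
    always_train: Tuple[str, ...] = ("qa_head.",),  # assumed task-head prefix
) -> List[str]:
    """Freeze everything except the top_k encoder layers and the task head.

    Returns the names of the parameters left trainable.
    """
    trainable = []
    for name, param in named_params:
        match = re.search(layer_pattern, name)
        in_top = match is not None and int(match.group(1)) >= num_layers - top_k
        is_head = name.startswith(always_train)
        param.requires_grad = in_top or is_head
        if param.requires_grad:
            trainable.append(name)
    return trainable

# Stand-in parameters for a 6-layer encoder plus a task head.
params = [(f"encoder.layers.{i}.weight", SimpleNamespace(requires_grad=True)) for i in range(6)]
params.append(("qa_head.weight", SimpleNamespace(requires_grad=True)))
print(unfreeze_top_layers(params, num_layers=6))
```

With a real PyTorch model, the trainable subset would then be handed to the optimizer, for example `torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=2e-5)`.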
Evaluating Baselines
# Fine-tuned text models
python code/eval_baselines.py --model mbert
python code/eval_baselines.py --model xlmr
python code/eval_baselines.py --model kinyabert
# Zero-shot LLM baselines (bring your own key)
python code/eval_llm_baseline.py --provider openai --api-key YOUR_KEY
python code/eval_llm_baseline.py --provider openai --model gpt-4o-mini --api-key YOUR_KEY
Pre-trained Checkpoints
Checkpoints will be released on Hugging Face upon paper acceptance.
| File | Size | Description |
|---|---|---|
| tabulm_pretrained.pt | 751 MB | Pre-trained TabuLM encoder (10K iters) |
| tabulm_tabqa_finetuned.pt | 751 MB | Fine-tuned on TabQA-kin (best dev EM 62.0%) |
Citation
@inproceedings{tabulm2026emnlp,
title = {TabuLM: Morphology-Aware Tabular Pre-training for Low-Resource Languages},
author = {Anonymous},
booktitle = {Proceedings of the 2026 Conference on Empirical Methods in Natural Language Processing},
year = {2026},
note = {Under review}
}
License
- Code: MIT License
- Pre-training data: Sourced from Rwanda government open-data portals (public domain)
- TabQA-kin benchmark: CC BY 4.0