AlephBERT Hebrew Shopping Intent Classifier · סיווג כוונות בעברית
Fine-tuned AlephBERT for 17-class Hebrew intent classification in a shopping / grocery-bot context. 74.2% accuracy (mean across 3 training seeds) on a properly held-out synthetic test set, ~50 ms CPU inference via ONNX (sibling repo: spivi87/alephbert-intent-he-onnx).
Beats GPT-4o-mini zero-shot by ~17 percentage points of accuracy on this task while
running locally at no per-inference cost. See the Baselines comparison section below.
Quickstart
from transformers import pipeline
clf = pipeline("text-classification", model="spivi87/alephbert-intent-he", top_k=3)
clf("תוסיף חלב וביצים")
# [{'label': 'GROCERY_REQUEST', 'score': 0.97},
# {'label': 'OTHER', 'score': 0.02},
# {'label': 'UPDATE_QUANTITY', 'score': 0.01}]
Inference returns named labels (not LABEL_0) — id2label is baked into config.json.
Intended Use
Primary: classify short Hebrew messages from a shopping/grocery bot into one of 17 actionable intents (add item, show list, clear list, recipe URL, group admin, etc.).
Also useful as: a fine-tuning starting point for any Hebrew text-classification task — customer support intents, news topic tagging, sentiment proxies, etc. The weights have been adapted to handle short Hebrew utterances with mixed Hebrew/English tokens, typos, and emoji — common in informal messaging.
Recommended confidence threshold: 0.7. Below that, fall back to a generative
model or to an OTHER bucket — see the training script for the exact threshold logic
the production deployment uses.
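The thresholding pattern described above can be sketched as follows. This is a minimal illustration, not the production logic from the training script: `route_prediction` and the `FALLBACK` sentinel are hypothetical names, the scores in `uncertain` are made up, and the input is assumed to be the list of `{'label', 'score'}` dicts that the `text-classification` pipeline returns for a single string.

```python
from typing import Dict, List, Union

# Hypothetical sentinel: the caller hands the message to a generative model
# (or treats it as OTHER) when this value comes back.
FALLBACK = "FALLBACK"


def route_prediction(
    predictions: List[Dict[str, Union[str, float]]],
    threshold: float = 0.7,
) -> str:
    """Return the top-scoring label if it clears the threshold, else FALLBACK."""
    if not predictions:
        return FALLBACK
    best = max(predictions, key=lambda p: p["score"])
    return best["label"] if best["score"] >= threshold else FALLBACK


# Shaped like pipeline output; the numbers are illustrative only.
confident = [
    {"label": "GROCERY_REQUEST", "score": 0.97},
    {"label": "OTHER", "score": 0.02},
]
uncertain = [
    {"label": "GROUP_INFO", "score": 0.41},
    {"label": "GET_INVITE_CODE", "score": 0.38},
]

print(route_prediction(confident))
print(route_prediction(uncertain))
```

In practice `predictions` would come from a call such as `clf(text)` with `top_k` set, as in the Quickstart. The 0.7 default mirrors the recommended threshold and is worth re-tuning on your own traffic.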
Performance
- Accuracy: 0.7424 ± 0.0031 (n=3, seeds [42, 43, 44])
- Weighted F1: 0.7357 ± 0.0039
- Macro F1: 0.7357 ± 0.0039
- Test samples: 374 (22 per intent — paraphrases of held-out seeds; no leakage from training)
Per-intent F1
| Intent | F1 (mean ± std) | Support |
|---|---|---|
| GROCERY_REQUEST | 0.767 ± 0.023 | 22 |
| RECIPE_URL | 0.740 ± 0.029 | 22 |
| LIST_QUERY | 0.793 ± 0.054 | 22 |
| CLEAR_LIST | 0.680 ± 0.026 | 22 |
| REMOVE_ITEM | 0.712 ± 0.061 | 22 |
| PARTIAL_COMPLETION | 0.825 ± 0.040 | 22 |
| GROUP_INFO | 0.590 ± 0.053 | 22 |
| GET_INVITE_CODE | 0.841 ± 0.029 | 22 |
| CREATE_INVITE | 0.635 ± 0.069 | 22 |
| RENAME_GROUP | 0.993 ± 0.013 | 22 |
| LEAVE_GROUP | 0.791 ± 0.031 | 22 |
| NOTIFICATION_SETTINGS | 0.615 ± 0.014 | 22 |
| REVOKE_INVITE | 0.900 ± 0.011 | 22 |
| RECIPE_SEARCH | 0.804 ± 0.028 | 22 |
| UPDATE_QUANTITY | 0.977 ± 0.001 | 22 |
| BUG_REPORT | 0.340 ± 0.011 | 22 |
| OTHER | 0.505 ± 0.009 | 22 |
Full evaluation report and confusion matrix:
EVALUATION.md ·
confusion_matrix.png
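The reported macro F1 and weighted F1 are identical because every intent has the same test support (22). The sketch below shows why, assuming the standard definitions (macro is the unweighted mean of per-class F1, weighted is the support-weighted mean). The function names and the three-class toy data are illustrative and unrelated to the real evaluation set.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def per_class_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """F1 per class, computed from true/false positives and false negatives."""
    scores = {}
    for label in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_and_weighted_f1(
    y_true: Sequence[str], y_pred: Sequence[str]
) -> Tuple[float, float]:
    """Return (macro F1, support-weighted F1)."""
    f1 = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    macro = sum(f1.values()) / len(f1)
    weighted = sum(f1[c] * support[c] for c in f1) / len(y_true)
    return macro, weighted


# Toy data with uniform support: 4 examples for each of 3 classes.
y_true = ["A"] * 4 + ["B"] * 4 + ["C"] * 4
y_pred = ["A", "A", "A", "B", "B", "B", "C", "C", "C", "C", "C", "A"]

macro, weighted = macro_and_weighted_f1(y_true, y_pred)
print(abs(macro - weighted) < 1e-12)  # the two averages coincide
```

With non-uniform support the two averages diverge, which is when reporting both becomes informative.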
Baselines comparison
The model is meaningful only relative to alternatives. All numbers below are on the same 374-row seed-level held-out test set.
| Approach | Accuracy | Weighted F1 | Macro F1 | Cost / 1k | Latency / call |
|---|---|---|---|---|---|
| Random | 0.0668 | 0.0637 | 0.0637 | $0 | 0 ms |
| Majority class | 0.0588 | 0.0065 | 0.0065 | $0 | 0 ms |
| Keyword regex (hand-crafted) | 0.2487 | 0.2834 | 0.2834 | $0 | < 0.1 ms |
| GPT-4o-mini zero-shot | 0.5722 | 0.5916 | 0.5916 | ~$0.05 (gpt-4o-mini Jan 2026 pricing, ~250-token prompt) | 101.4 ms |
| AlephBERT fine-tune (ours) | 0.7424 ± 0.0031 | 0.7357 ± 0.0039 | 0.7357 ± 0.0039 | $0 | ~50 ms |
The fine-tune beats GPT-4o-mini zero-shot by ~17 percentage points of accuracy (0.7424 vs. 0.5722), costs nothing per inference, and is ~2× faster per call (~50 ms vs. 101.4 ms). Beating a strong general-purpose LLM with a 110M-parameter Hebrew BERT fine-tune is what justifies the training cost in the first place.
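As a sanity check on the table above, the majority-class row follows directly from the uniform test set (22 examples for each of 17 intents). The sketch below assumes the majority-class baseline predicts one fixed intent for every input and that intents which are never predicted score F1 = 0. Variable names are illustrative.

```python
NUM_INTENTS = 17
SUPPORT_PER_INTENT = 22
total = NUM_INTENTS * SUPPORT_PER_INTENT  # 374 test rows

# Always predicting one intent is right only on that intent's 22 rows.
accuracy = SUPPORT_PER_INTENT / total

# For the single predicted intent: all 374 rows are predicted as it,
# 22 of them correctly, and none of its true rows are missed.
precision = SUPPORT_PER_INTENT / total
recall = 1.0
f1_predicted_intent = 2 * precision * recall / (precision + recall)

# The other 16 intents are never predicted, so their F1 is 0.
macro_f1 = f1_predicted_intent / NUM_INTENTS

# Accuracy margin of the fine-tune over GPT-4o-mini zero-shot.
margin = 0.7424 - 0.5722

print(total)               # 374
print(round(accuracy, 4))  # 0.0588
print(round(macro_f1, 4))  # 0.0065
print(round(margin, 4))    # 0.1702
```

Because support is uniform, the weighted F1 of this baseline equals its macro F1, which matches the two identical 0.0065 entries in the table.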
Training Data
Fully synthetic. Hebrew seed templates (12–20 per intent) were paraphrased via GPT-4o-mini (10 variations per seed, temperature 0.9), yielding ~2,100 labeled examples across the 17 intents. No real WhatsApp / user messages were used — there is no PII leakage risk.
Generation script and seed templates live in the standalone GitHub repo: github.com/spivi/alephbert-intent-he. A ~100-row sample is published as spivi87/alephbert-intent-he-samples.
Methodology — seed-level train/test split
The split happens at the seed level, before paraphrasing: for every intent, 2 seeds are held out from training, and the test set contains only paraphrases of those held-out seeds.
This avoids the common pitfall of LLM-generated synthetic data where paraphrases of the same seed land in both train and test (a "leak") and inflate reported accuracy. The 74.2% headline is what the model actually achieves on text whose source seed it has never seen during training.
The per-intent split is recorded in split_manifest.json in the generation
output directory for full reproducibility.
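The seed-level split can be sketched as below. This is an illustration of the idea under stated assumptions, not the code in `generate_training_data.py`: `split_seeds` is a hypothetical helper, and the placeholder seed strings stand in for the real Hebrew templates.

```python
import random
from typing import Dict, List, Tuple


def split_seeds(
    seeds_by_intent: Dict[str, List[str]],
    test_seeds_per_intent: int = 2,
    seed: int = 42,
) -> Tuple[Dict[str, List[str]], Dict[str, List[str]]]:
    """Hold out a fixed number of seeds per intent BEFORE any paraphrasing."""
    rng = random.Random(seed)  # local RNG keeps the split reproducible
    train: Dict[str, List[str]] = {}
    test: Dict[str, List[str]] = {}
    for intent in sorted(seeds_by_intent):  # sorted for a stable order
        seeds = list(seeds_by_intent[intent])
        if len(seeds) <= test_seeds_per_intent:
            raise ValueError(f"not enough seeds for intent {intent!r}")
        rng.shuffle(seeds)
        test[intent] = seeds[:test_seeds_per_intent]
        train[intent] = seeds[test_seeds_per_intent:]
    return train, test


# Placeholder seeds; the real templates are Hebrew sentences.
example_seeds = {
    "GROCERY_REQUEST": [f"grocery seed {i}" for i in range(12)],
    "LIST_QUERY": [f"list seed {i}" for i in range(12)],
}
train_seeds, test_seeds = split_seeds(example_seeds)

for intent in example_seeds:
    # No seed appears on both sides, so its paraphrases cannot leak either.
    assert not set(train_seeds[intent]) & set(test_seeds[intent])
    print(intent, len(train_seeds[intent]), len(test_seeds[intent]))
```

Paraphrasing then runs separately on each side, so every test example descends from a seed the model never saw in training.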
Training Procedure
| Setting | Value |
|---|---|
| Base model | onlplab/alephbert-base |
| Optimizer | AdamW (HF Trainer default) |
| Learning rate | 2e-5 (linear warmup, linear decay) |
| Batch size | 16 (train) / 32 (eval) |
| Max sequence length | 128 tokens |
| Max epochs | 10 (early stopping on eval_accuracy, patience=3) |
| Loss | Cross-entropy |
| Mixed precision | fp32 |
| Random seed | 42 (Python / NumPy / PyTorch / Trainer all pinned) |
Reproduce the run exactly:
python scripts/hf_classifier/generate_training_data.py \
--output-dir data/hf_classifier --test-seeds-per-intent 2 --seed 42
python scripts/hf_classifier/train_classifier.py \
--data-dir data/hf_classifier --output-dir <out> --seed 42
Compute & Environmental Footprint
| Item | Value |
|---|---|
| Training compute | ~10 minutes on Apple M-series GPU (PyTorch MPS backend) |
| Equivalent | ~30 minutes on a single Google Colab T4 |
| Estimated CO₂ | < 5 g CO₂eq (single training run on personal hardware) |
| Software stack | transformers >= 4.40, torch >= 2.3, datasets >= 3.0, accelerate >= 0.26 |
| Data generation | ~2-5 min wall time + ~$0.02-0.05 OpenAI API spend for the synthetic corpus |
Out-of-scope Use
This model is not intended for:
- Arabic, English-only, or other non-Hebrew text.
- Long-form text (> 128 tokens). Tokenizer truncates; the model was never trained on longer inputs.
- Non-shopping domains. Treat as a fine-tuning starting point, not a drop-in classifier for customer support / news / sentiment / etc.
- Safety or abuse classification. OTHER is a "doesn't fit shopping intents" bucket, not a content filter.
- Mizrahi, Ashkenazi-modern, ultra-Orthodox / Haredi, or heavy code-switching Hebrew dialects. These are under-represented in synthetic GPT-4o-mini output, and accuracy on them is not measured.
Label Glossary
| ID | Label | English description |
|---|---|---|
| 0 | GROCERY_REQUEST | Add items to the shopping list |
| 1 | RECIPE_URL | Recipe URL — extract ingredients from a linked recipe |
| 2 | LIST_QUERY | Show the current shopping list |
| 3 | CLEAR_LIST | Mark all items as bought; clear the list |
| 4 | REMOVE_ITEM | Remove a specific item from the list |
| 5 | PARTIAL_COMPLETION | Mark most items bought except for some |
| 6 | GROUP_INFO | Show group members and details |
| 7 | GET_INVITE_CODE | Get the existing group invite code |
| 8 | CREATE_INVITE | Generate a new group invite code |
| 9 | RENAME_GROUP | Change the group name |
| 10 | LEAVE_GROUP | Leave the current group |
| 11 | NOTIFICATION_SETTINGS | Toggle notification preferences |
| 12 | REVOKE_INVITE | Cancel or invalidate a group invite code |
| 13 | RECIPE_SEARCH | Build a shopping list for a known dish |
| 14 | UPDATE_QUANTITY | Change the quantity of an existing item |
| 15 | BUG_REPORT | Report a bug or issue with the bot |
| 16 | OTHER | Conversational or off-topic message; not a shopping intent |
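If you work with raw logits rather than the pipeline, the glossary above can be used to decode them. The sketch below is a hedged illustration: it assumes the model's output index order matches the ID column of the glossary (the mapping shipped in `config.json` remains the source of truth), and `decode_logits` is a hypothetical helper.

```python
import math
from typing import List, Tuple

# Index order copied from the glossary's ID column.
LABELS = [
    "GROCERY_REQUEST", "RECIPE_URL", "LIST_QUERY", "CLEAR_LIST",
    "REMOVE_ITEM", "PARTIAL_COMPLETION", "GROUP_INFO", "GET_INVITE_CODE",
    "CREATE_INVITE", "RENAME_GROUP", "LEAVE_GROUP", "NOTIFICATION_SETTINGS",
    "REVOKE_INVITE", "RECIPE_SEARCH", "UPDATE_QUANTITY", "BUG_REPORT", "OTHER",
]


def decode_logits(logits: List[float]) -> Tuple[str, float]:
    """Softmax the logits and return (top label, its probability)."""
    if len(logits) != len(LABELS):
        raise ValueError(f"expected {len(LABELS)} logits, got {len(logits)}")
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


# Illustrative logits: index 14 dominates.
example_logits = [0.0] * len(LABELS)
example_logits[14] = 5.0
label, score = decode_logits(example_logits)
print(label)  # UPDATE_QUANTITY
```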
Limitations
- OTHER is intentionally a catch-all and one of the weakest classes (F1 ≈ 0.50; only BUG_REPORT scores lower, at ≈ 0.34). Production routes any prediction with score < 0.7 to a generative LLM fallback. We recommend the same pattern in your downstream task: don't trust OTHER as a positive signal; trust it only as a "nothing else fired" signal.
- Class imbalance: train support varies (the per-intent class sizes are roughly uniform but not exact). We report macro F1 alongside weighted F1: macro is more honest for uniform-ish classes (treats every intent equally), weighted is the one to read when you care most about the dominant classes. In our case they agree closely.
- Domain-specific. Trained on grocery/shopping intents. Performance outside this domain is unknown — treat as a fine-tuning base, not a general Hebrew classifier.
- No Arabic, no English-only support. Inputs assumed to be Hebrew (occasional Hebrew/English code-switching is fine — training data includes it).
- Short utterances only. Trained on inputs up to 128 tokens. Long-form text is not supported.
Biases & Demographic Coverage
The synthetic training data was generated by GPT-4o-mini paraphrasing Hebrew seed templates. GPT-4o-mini produces mostly "neutral modern Hebrew" — the dialects and registers below are known coverage gaps:
- Mizrahi-influenced Hebrew (vowel patterns, loanwords from Arabic / Ladino / Persian)
- Ultra-Orthodox / Haredi Hebrew (religious code-switching with Yiddish and Aramaic terms)
- Young / slang Hebrew (informal contractions, recent borrowings, internet jargon)
- Mixed Hebrew–Russian / Hebrew–Amharic code-switching common in immigrant communities
- Spelling variants that drop niqqud or use non-standard final letters
We did not evaluate on these and make no accuracy claim for them. If your production traffic includes any of the above, do your own held-out evaluation before deploying.
Attribution & License
Base model: onlplab/alephbert-base
(Apache 2.0). This fine-tune is also distributed under Apache 2.0. See
LICENSE and LICENSE-ALEPHBERT.md in the
GitHub repo.
Citation
If you use this model in academic work, please cite the original AlephBERT paper:
@inproceedings{seker-etal-2022-alephbert,
title = "{A}leph{BERT}: Language Model Pre-training and Evaluation
from Sub-Word to Sentence Level",
author = "Seker, Amit and Bandel, Elron and Bareket, Dan and
Brusilovsky, Idan and Greenfeld, Refael and Tsarfaty, Reut",
booktitle = "Proceedings of the 60th Annual Meeting of the ACL",
year = "2022",
address = "Dublin, Ireland",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.acl-long.4",
pages = "46--56",
}
For the fine-tune itself:
@misc{spivakovsky2026alephbertintenthe,
title = {{AlephBERT Hebrew Shopping Intent Classifier}},
author = {Spivakovsky, Alex},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/spivi87/alephbert-intent-he}},
}