AlephBERT Hebrew Shopping Intent Classifier · סיווג כוונות בעברית

Fine-tuned AlephBERT for 17-class Hebrew intent classification in a shopping / grocery-bot context. 74.2% accuracy (mean across 3 training seeds) on a properly held-out synthetic test set, ~50 ms CPU inference via ONNX (sibling repo: spivi87/alephbert-intent-he-onnx).

Beats GPT-4o-mini zero-shot by ~17 percentage points of accuracy on this task while running locally at no per-inference cost. See the Baselines comparison section below.

Quickstart

from transformers import pipeline

clf = pipeline("text-classification", model="spivi87/alephbert-intent-he", top_k=3)
clf("תוסיף חלב וביצים")
# [{'label': 'GROCERY_REQUEST', 'score': 0.97},
#  {'label': 'OTHER',           'score': 0.02},
#  {'label': 'UPDATE_QUANTITY', 'score': 0.01}]

Inference returns named labels (not LABEL_0) — id2label is baked into config.json.

Intended Use

Primary: classify short Hebrew messages from a shopping/grocery bot into one of 17 actionable intents (add item, show list, clear list, recipe URL, group admin, etc.).

Also useful as: a fine-tuning starting point for any Hebrew text-classification task — customer support intents, news topic tagging, sentiment proxies, etc. The weights have been adapted to handle short Hebrew utterances with mixed Hebrew/English tokens, typos, and emoji — common in informal messaging.

Recommended confidence threshold: 0.7. Below that, fall back to a generative model or to an OTHER bucket — see the training script for the exact threshold logic the production deployment uses.
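The fallback rule above can be sketched in a few lines. This is a minimal illustration, not the production logic: `route_intent` and the `FALLBACK` marker are hypothetical names, the `uncertain` scores are invented for the example, and the input is assumed to be the list of `{'label', 'score'}` dicts that the Quickstart pipeline returns.

```python
# Hypothetical helper: route a prediction by confidence.
# Assumes `predictions` has the shape of the Quickstart pipeline output:
# a list of {"label": str, "score": float} dicts.

FALLBACK = "FALLBACK"  # illustrative marker, not one of the 17 model labels


def route_intent(predictions, threshold=0.7):
    """Return the top label if its score reaches `threshold`, else FALLBACK."""
    if not predictions:
        return FALLBACK
    top = max(predictions, key=lambda p: p["score"])
    return top["label"] if top["score"] >= threshold else FALLBACK


# Scores copied from the Quickstart example.
confident = [
    {"label": "GROCERY_REQUEST", "score": 0.97},
    {"label": "OTHER", "score": 0.02},
    {"label": "UPDATE_QUANTITY", "score": 0.01},
]
# Invented scores, only to show the low-confidence path.
uncertain = [
    {"label": "GROUP_INFO", "score": 0.48},
    {"label": "GET_INVITE_CODE", "score": 0.41},
]

print(route_intent(confident))  # GROCERY_REQUEST
print(route_intent(uncertain))  # FALLBACK
```

A caller would hand anything routed to `FALLBACK` to a generative model, or treat it as `OTHER`, depending on the application.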

Performance

Accuracy: 0.7424 ± 0.0031 (n=3, seeds [42, 43, 44])
Weighted F1: 0.7357 ± 0.0039
Macro F1: 0.7357 ± 0.0039
Test samples: 374 (22 per intent — paraphrases of held-out seeds; no leakage from training)

Per-intent F1

Intent	F1 (mean ± std)	Support
`GROCERY_REQUEST`	0.767 ± 0.023	22
`RECIPE_URL`	0.740 ± 0.029	22
`LIST_QUERY`	0.793 ± 0.054	22
`CLEAR_LIST`	0.680 ± 0.026	22
`REMOVE_ITEM`	0.712 ± 0.061	22
`PARTIAL_COMPLETION`	0.825 ± 0.040	22
`GROUP_INFO`	0.590 ± 0.053	22
`GET_INVITE_CODE`	0.841 ± 0.029	22
`CREATE_INVITE`	0.635 ± 0.069	22
`RENAME_GROUP`	0.993 ± 0.013	22
`LEAVE_GROUP`	0.791 ± 0.031	22
`NOTIFICATION_SETTINGS`	0.615 ± 0.014	22
`REVOKE_INVITE`	0.900 ± 0.011	22
`RECIPE_SEARCH`	0.804 ± 0.028	22
`UPDATE_QUANTITY`	0.977 ± 0.001	22
`BUG_REPORT`	0.340 ± 0.011	22
`OTHER`	0.505 ± 0.009	22

Full evaluation report and confusion matrix: EVALUATION.md · confusion_matrix.png
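As a sanity check on the numbers above, the macro and weighted F1 can be recomputed from the per-intent table. The sketch below uses only the mean F1 values and supports listed there; small differences from the reported 0.7357 come from rounding in the table.

```python
# Mean per-intent F1 values copied from the table above.
per_intent_f1 = {
    "GROCERY_REQUEST": 0.767, "RECIPE_URL": 0.740, "LIST_QUERY": 0.793,
    "CLEAR_LIST": 0.680, "REMOVE_ITEM": 0.712, "PARTIAL_COMPLETION": 0.825,
    "GROUP_INFO": 0.590, "GET_INVITE_CODE": 0.841, "CREATE_INVITE": 0.635,
    "RENAME_GROUP": 0.993, "LEAVE_GROUP": 0.791,
    "NOTIFICATION_SETTINGS": 0.615, "REVOKE_INVITE": 0.900,
    "RECIPE_SEARCH": 0.804, "UPDATE_QUANTITY": 0.977,
    "BUG_REPORT": 0.340, "OTHER": 0.505,
}
# Every intent has the same test support.
support = {label: 22 for label in per_intent_f1}

total = sum(support.values())

# Macro F1: unweighted mean over classes.
macro_f1 = sum(per_intent_f1.values()) / len(per_intent_f1)

# Weighted F1: mean over classes weighted by support.
weighted_f1 = sum(per_intent_f1[k] * support[k] for k in per_intent_f1) / total

print(total)                   # 374
print(round(macro_f1, 3))      # 0.736
print(round(weighted_f1, 3))   # 0.736
```

Because the test support is uniform, the two averages are the same quantity, which is why the Performance section reports identical macro and weighted F1.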

Baselines comparison

The headline accuracy is meaningful only relative to alternatives. All numbers below are measured on the same 374-row seed-level held-out test set.

Approach	Accuracy	Weighted F1	Macro F1	Cost / 1k	Latency / call
Random	0.0668	0.0637	0.0637	$0	0 ms
Majority class	0.0588	0.0065	0.0065	$0	0 ms
Keyword regex (hand-crafted)	0.2487	0.2834	0.2834	$0	< 0.1 ms
GPT-4o-mini zero-shot	0.5722	0.5916	0.5916	~$0.05 (gpt-4o-mini Jan 2026 pricing, ~250-token prompt)	101.4 ms
AlephBERT fine-tune (ours)	0.7424 ± 0.0031	0.7357 ± 0.0039	0.7357 ± 0.0039	$0	~50 ms

The fine-tune beats GPT-4o-mini zero-shot by ~17 percentage points of accuracy, costs nothing per inference, and is roughly 2× faster per call (~50 ms vs. 101.4 ms). Beating a strong general-purpose LLM with a 110M-parameter Hebrew BERT fine-tune is what justifies the training cost in the first place.
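The majority-class row and the accuracy gap in the table follow from simple arithmetic on a balanced 17-class test set. The sketch below reproduces them from the figures in this card; it does not re-run any baseline.

```python
NUM_INTENTS = 17
PER_INTENT = 22
total = NUM_INTENTS * PER_INTENT  # 374 test rows

# Majority-class baseline: always predict the same intent.
accuracy = PER_INTENT / total      # only that intent's rows are correct
precision = PER_INTENT / total     # all 374 predictions go to one class
recall = 1.0                       # every row of that class is recovered
f1_predicted = 2 * precision * recall / (precision + recall)

# The other 16 intents are never predicted, so their F1 is 0.
macro_f1 = f1_predicted / NUM_INTENTS

print(round(accuracy, 4))  # 0.0588
print(round(macro_f1, 4))  # 0.0065

# Accuracy gap between the fine-tune and GPT-4o-mini zero-shot,
# in percentage points.
gap_pp = (0.7424 - 0.5722) * 100
print(round(gap_pp, 1))    # 17.0
```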

Training Data

Fully synthetic. Hebrew seed templates (12–20 per intent) were paraphrased via GPT-4o-mini (10 variations per seed, temperature 0.9), yielding ~2,100 labeled examples across the 17 intents. No real WhatsApp / user messages were used — there is no PII leakage risk.

Generation script and seed templates live in the standalone GitHub repo: github.com/spivi/alephbert-intent-he. A ~100-row sample is published as spivi87/alephbert-intent-he-samples.

Methodology — seed-level train/test split

The split happens at the seed level, before paraphrasing: for every intent, 2 seeds are held out from training, and the test set contains only paraphrases of those held-out seeds.

This avoids the common pitfall of LLM-generated synthetic data where paraphrases of the same seed land in both train and test (a "leak") and inflate reported accuracy. The 74.2% headline is what the model actually achieves on text whose source seed it has never seen during training.

The per-intent split is recorded in split_manifest.json in the generation output directory for full reproducibility.
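The idea can be illustrated with a short sketch. It is not the repository's generation script: `split_seeds`, the placeholder seed IDs, and the manifest layout are assumptions made for the example, and the real `split_manifest.json` may be structured differently.

```python
import json
import random


def split_seeds(seeds_by_intent, test_seeds_per_intent=2, seed=42):
    """Hold out seeds per intent BEFORE any paraphrasing happens."""
    rng = random.Random(seed)
    manifest = {}
    for intent, seeds in sorted(seeds_by_intent.items()):
        shuffled = list(seeds)
        rng.shuffle(shuffled)
        manifest[intent] = {
            "test": shuffled[:test_seeds_per_intent],
            "train": shuffled[test_seeds_per_intent:],
        }
    return manifest


# Placeholder seed IDs stand in for the real Hebrew templates.
toy_seeds = {
    "GROCERY_REQUEST": [f"grocery_seed_{i}" for i in range(12)],
    "LIST_QUERY": [f"list_seed_{i}" for i in range(12)],
}
manifest = split_seeds(toy_seeds)

for intent, parts in manifest.items():
    # A seed never lands on both sides, so its paraphrases cannot either.
    assert not set(parts["train"]) & set(parts["test"])
    print(intent, len(parts["train"]), len(parts["test"]))

# The manifest can be serialized for reproducibility.
manifest_json = json.dumps(manifest, ensure_ascii=False, indent=2)
```

Paraphrases would then be generated separately for the `train` and `test` seeds of each intent, so leakage is ruled out by construction.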

Training Procedure

Setting	Value
Base model	`onlplab/alephbert-base`
Optimizer	AdamW (HF `Trainer` default)
Learning rate	`2e-5` (linear warmup, linear decay)
Batch size	16 (train) / 32 (eval)
Max sequence length	128 tokens
Max epochs	10 (early stopping on `eval_accuracy`, patience=3)
Loss	Cross-entropy
Precision	fp32 (no mixed precision)
Random seed	42 (Python / NumPy / PyTorch / Trainer all pinned); reported metrics average seeds 42, 43, 44
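The early-stopping rule in the table (up to 10 epochs, patience of 3 on `eval_accuracy`) can be illustrated in plain Python. This is a sketch of patience-based stopping in general, not the Trainer code used for this model; `early_stop` is a hypothetical helper and the accuracy history is invented for the example.

```python
def early_stop(eval_accuracies, patience=3, max_epochs=10):
    """Return (last_epoch_run, best_epoch), both 1-indexed."""
    best_acc = float("-inf")
    best_epoch = 0
    epochs_without_gain = 0
    last_epoch = 0
    for epoch, acc in enumerate(eval_accuracies[:max_epochs], start=1):
        last_epoch = epoch
        if acc > best_acc:
            # New best: remember it and reset the patience counter.
            best_acc, best_epoch = acc, epoch
            epochs_without_gain = 0
        else:
            epochs_without_gain += 1
            if epochs_without_gain >= patience:
                break  # no improvement for `patience` evaluations in a row
    return last_epoch, best_epoch


# Invented per-epoch eval accuracies, for illustration only.
history = [0.52, 0.61, 0.70, 0.69, 0.70, 0.68, 0.71, 0.72, 0.73, 0.74]
print(early_stop(history))  # (6, 3)
```

Here training stops after epoch 6 because epochs 4 to 6 fail to beat the epoch-3 score, and the epoch-3 checkpoint would be the one to keep.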

Reproduce the run exactly:

python scripts/hf_classifier/generate_training_data.py \
    --output-dir data/hf_classifier --test-seeds-per-intent 2 --seed 42
python scripts/hf_classifier/train_classifier.py \
    --data-dir data/hf_classifier --output-dir <out> --seed 42

Compute & Environmental Footprint


Item	Value
Training compute	~10 minutes on Apple M-series GPU (PyTorch MPS backend)
Equivalent	~30 minutes on a single Google Colab T4
Estimated CO₂	< 5 g CO₂eq (single training run on personal hardware)
Software stack	`transformers >= 4.40`, `torch >= 2.3`, `datasets >= 3.0`, `accelerate >= 0.26`
Data generation	~2–5 min wall time + ~$0.02–0.05 OpenAI API spend for the synthetic corpus

Out-of-scope Use

This model is not intended for:

Arabic, English-only, or other non-Hebrew text.
Long-form text (> 128 tokens). Tokenizer truncates; the model was never trained on longer inputs.
Non-shopping domains. Treat as a fine-tuning starting point, not a drop-in classifier for customer support / news / sentiment / etc.
Safety or abuse classification. OTHER is a "doesn't fit shopping intents" bucket, not a content filter.
Mizrahi, Ashkenazi-modern, ultra-Orthodox / Haredi, or heavy code-switching Hebrew dialects — these are under-represented in synthetic GPT-4o-mini output and accuracy on them is not measured.

Label Glossary

ID	Label	English description
0	`GROCERY_REQUEST`	Add items to the shopping list
1	`RECIPE_URL`	Recipe URL — extract ingredients from a linked recipe
2	`LIST_QUERY`	Show the current shopping list
3	`CLEAR_LIST`	Mark all items as bought; clear the list
4	`REMOVE_ITEM`	Remove a specific item from the list
5	`PARTIAL_COMPLETION`	Mark most items bought except for some
6	`GROUP_INFO`	Show group members and details
7	`GET_INVITE_CODE`	Get the existing group invite code
8	`CREATE_INVITE`	Generate a new group invite code
9	`RENAME_GROUP`	Change the group name
10	`LEAVE_GROUP`	Leave the current group
11	`NOTIFICATION_SETTINGS`	Toggle notification preferences
12	`REVOKE_INVITE`	Cancel or invalidate a group invite code
13	`RECIPE_SEARCH`	Build a shopping list for a known dish
14	`UPDATE_QUANTITY`	Change the quantity of an existing item
15	`BUG_REPORT`	Report a bug or issue with the bot
16	`OTHER`	Conversational or off-topic message; not a shopping intent
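For code that works with raw class indices (for example, logits from an ONNX session), the glossary can be mirrored as a pair of lookup tables. The sketch below is transcribed from the table above, not read from the published `config.json`, so treat the shipped config as the source of truth.

```python
# Labels in ID order, transcribed from the glossary table.
LABELS = [
    "GROCERY_REQUEST", "RECIPE_URL", "LIST_QUERY", "CLEAR_LIST",
    "REMOVE_ITEM", "PARTIAL_COMPLETION", "GROUP_INFO", "GET_INVITE_CODE",
    "CREATE_INVITE", "RENAME_GROUP", "LEAVE_GROUP", "NOTIFICATION_SETTINGS",
    "REVOKE_INVITE", "RECIPE_SEARCH", "UPDATE_QUANTITY", "BUG_REPORT",
    "OTHER",
]

id2label = dict(enumerate(LABELS))
label2id = {label: idx for idx, label in id2label.items()}

print(len(id2label))            # 17
print(id2label[16])             # OTHER
print(label2id["RECIPE_URL"])   # 1
```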

Limitations

OTHER is intentionally a catch-all and one of the weakest classes (F1 ≈ 0.50; only BUG_REPORT is lower, at ≈ 0.34). Production routes any prediction with score < 0.7 to a generative LLM fallback. We recommend the same pattern in your downstream task: treat OTHER not as a positive signal but only as a "nothing else fired" signal.
Class imbalance: per-intent training support is roughly uniform but not exact. We report macro F1 alongside weighted F1: macro treats every intent equally, while weighted is the one to read when you care most about the dominant classes. On the test set every intent has exactly 22 examples, so the two metrics coincide.
Domain-specific. Trained on grocery/shopping intents. Performance outside this domain is unknown — treat as a fine-tuning base, not a general Hebrew classifier.
No Arabic, no English-only support. Inputs assumed to be Hebrew (occasional Hebrew/English code-switching is fine — training data includes it).
Short utterances only. Trained on inputs up to 128 tokens. Long-form text is not supported.

Biases & Demographic Coverage

The synthetic training data was generated by GPT-4o-mini paraphrasing Hebrew seed templates. GPT-4o-mini produces mostly "neutral modern Hebrew" — the dialects and registers below are known coverage gaps:

Mizrahi-influenced Hebrew (vowel patterns, loanwords from Arabic / Ladino / Persian)
Ultra-Orthodox / Haredi Hebrew (religious code-switching with Yiddish and Aramaic terms)
Young / slang Hebrew (informal contractions, recent borrowings, internet jargon)
Mixed Hebrew–Russian / Hebrew–Amharic code-switching common in immigrant communities
Spelling variants that drop niqqud or use non-standard final letters

We did not evaluate on these and make no accuracy claim for them. If your production traffic includes any of the above, do your own held-out evaluation before deploying.

Attribution & License

Base model: onlplab/alephbert-base (Apache 2.0). This fine-tune is also distributed under Apache 2.0. See LICENSE and LICENSE-ALEPHBERT.md in the GitHub repo.

Citation

If you use this model in academic work, please cite the original AlephBERT paper:

@inproceedings{seker-etal-2022-alephbert,
    title     = "{A}leph{BERT}: Language Model Pre-training and Evaluation
                 from Sub-Word to Sentence Level",
    author    = "Seker, Amit and Bandel, Elron and Bareket, Dan and
                 Brusilovsky, Idan and Greenfeld, Refael and Tsarfaty, Reut",
    booktitle = "Proceedings of the 60th Annual Meeting of the ACL",
    year      = "2022",
    address   = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url       = "https://aclanthology.org/2022.acl-long.4",
    pages     = "46--56",
}

For the fine-tune itself:

@misc{spivakovsky2026alephbertintenthe,
  title         = {{AlephBERT Hebrew Shopping Intent Classifier}},
  author        = {Spivakovsky, Alex},
  year          = {2026},
  publisher     = {Hugging Face},
  howpublished  = {\url{https://huggingface.co/spivi87/alephbert-intent-he}},
}
