AlephBERT Hebrew Shopping Intent Classifier · סיווג כוונות בעברית

Fine-tuned AlephBERT for 17-class Hebrew intent classification in a shopping / grocery-bot context. 74.2% accuracy (mean across 3 training seeds) on a properly held-out synthetic test set, ~50 ms CPU inference via ONNX (sibling repo: spivi87/alephbert-intent-he-onnx).

Beats GPT-4o-mini zero-shot by ~17 percentage points on this task while running locally and free per inference. See Baselines comparison below.

Quickstart

from transformers import pipeline

clf = pipeline("text-classification", model="spivi87/alephbert-intent-he", top_k=3)
clf("תוסיף חלב וביצים")
# [{'label': 'GROCERY_REQUEST', 'score': 0.97},
#  {'label': 'OTHER',           'score': 0.02},
#  {'label': 'UPDATE_QUANTITY', 'score': 0.01}]

Inference returns named labels (not LABEL_0) — id2label is baked into config.json.

Intended Use

Primary: classify short Hebrew messages from a shopping/grocery bot into one of 17 actionable intents (add item, show list, clear list, recipe URL, group admin, etc.).

Also useful as: a fine-tuning starting point for any Hebrew text-classification task — customer support intents, news topic tagging, sentiment proxies, etc. The weights have been adapted to handle short Hebrew utterances with mixed Hebrew/English tokens, typos, and emoji — common in informal messaging.

Recommended confidence threshold: 0.7. Below that, fall back to a generative model or to an OTHER bucket — see the training script for the exact threshold logic the production deployment uses.
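
The fallback pattern described above can be sketched as a small routing function. This is a minimal illustration, not the production logic (which lives in the training script and may differ): the `route` helper and the `"FALLBACK"` sentinel are hypothetical names, and the input is assumed to be the list of label/score dicts shown in the Quickstart.

```python
from typing import Dict, List

CONFIDENCE_THRESHOLD = 0.7  # recommended threshold from this card


def route(predictions: List[Dict], threshold: float = CONFIDENCE_THRESHOLD) -> str:
    """Return the top label if it is confident enough, else "FALLBACK".

    `predictions` is a list of {"label": str, "score": float} dicts.
    "FALLBACK" is a hypothetical sentinel meaning "hand off to a
    generative model or treat as OTHER".
    """
    if not predictions:
        return "FALLBACK"
    top = max(predictions, key=lambda p: p["score"])
    if top["score"] < threshold:
        return "FALLBACK"
    return top["label"]


# Illustrative inputs only; the scores are not real model outputs.
confident = [{"label": "GROCERY_REQUEST", "score": 0.97},
             {"label": "OTHER", "score": 0.02}]
uncertain = [{"label": "GROUP_INFO", "score": 0.55},
             {"label": "GET_INVITE_CODE", "score": 0.40}]

print(route(confident))
print(route(uncertain))
```

Keeping the threshold as a parameter makes it easy to tune against your own held-out traffic rather than relying on 0.7 as a fixed constant.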

Performance

  • Accuracy: 0.7424 ± 0.0031 (n=3, seeds [42, 43, 44])
  • Weighted F1: 0.7357 ± 0.0039
  • Macro F1: 0.7357 ± 0.0039
  • Test samples: 374 (22 per intent — paraphrases of held-out seeds; no leakage from training)

Per-intent F1

| Intent | F1 (mean ± std) | Support |
|---|---|---|
| GROCERY_REQUEST | 0.767 ± 0.023 | 22 |
| RECIPE_URL | 0.740 ± 0.029 | 22 |
| LIST_QUERY | 0.793 ± 0.054 | 22 |
| CLEAR_LIST | 0.680 ± 0.026 | 22 |
| REMOVE_ITEM | 0.712 ± 0.061 | 22 |
| PARTIAL_COMPLETION | 0.825 ± 0.040 | 22 |
| GROUP_INFO | 0.590 ± 0.053 | 22 |
| GET_INVITE_CODE | 0.841 ± 0.029 | 22 |
| CREATE_INVITE | 0.635 ± 0.069 | 22 |
| RENAME_GROUP | 0.993 ± 0.013 | 22 |
| LEAVE_GROUP | 0.791 ± 0.031 | 22 |
| NOTIFICATION_SETTINGS | 0.615 ± 0.014 | 22 |
| REVOKE_INVITE | 0.900 ± 0.011 | 22 |
| RECIPE_SEARCH | 0.804 ± 0.028 | 22 |
| UPDATE_QUANTITY | 0.977 ± 0.001 | 22 |
| BUG_REPORT | 0.340 ± 0.011 | 22 |
| OTHER | 0.505 ± 0.009 | 22 |

Full evaluation report and confusion matrix: EVALUATION.md · confusion_matrix.png

Baselines comparison

The model is meaningful only relative to alternatives. All numbers below are on the same 374-row seed-level held-out test set.

| Approach | Accuracy | Weighted F1 | Macro F1 | Cost / 1k calls | Latency / call |
|---|---|---|---|---|---|
| Random | 0.0668 | 0.0637 | 0.0637 | $0 | 0 ms |
| Majority class | 0.0588 | 0.0065 | 0.0065 | $0 | 0 ms |
| Keyword regex (hand-crafted) | 0.2487 | 0.2834 | 0.2834 | $0 | < 0.1 ms |
| GPT-4o-mini zero-shot | 0.5722 | 0.5916 | 0.5916 | ~$0.05 (gpt-4o-mini Jan 2026 pricing, ~250-token prompt) | 101.4 ms |
| AlephBERT fine-tune (ours) | 0.7424 ± 0.0031 | 0.7357 ± 0.0039 | 0.7357 ± 0.0039 | $0 | ~50 ms |

The fine-tune beats GPT-4o-mini zero-shot by ~17 percentage points of accuracy (0.7424 vs 0.5722), is free per inference, and is ~2× faster (~50 ms vs 101.4 ms per call). Beating a strong general-purpose LLM with a 110M-parameter Hebrew BERT fine-tune is what justifies the training cost in the first place.

Training Data

Fully synthetic. Hebrew seed templates (12–20 per intent) were paraphrased via GPT-4o-mini (10 variations per seed, temperature 0.9), yielding ~2,100 labeled examples across the 17 intents. No real WhatsApp / user messages were used — there is no PII leakage risk.

Generation script and seed templates live in the standalone GitHub repo: github.com/spivi/alephbert-intent-he. A ~100-row sample is published as spivi87/alephbert-intent-he-samples.

Methodology — seed-level train/test split

The split happens at the seed level, before paraphrasing: for every intent, 2 seeds are held out from training, and the test set contains only paraphrases of those held-out seeds.

This avoids the common pitfall of LLM-generated synthetic data where paraphrases of the same seed land in both train and test (a "leak") and inflate reported accuracy. The 74.2% headline is what the model actually achieves on text whose source seed it has never seen during training.

The per-intent split is recorded in split_manifest.json in the generation output directory for full reproducibility.
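
The idea of splitting at the seed level can be sketched as follows. This is an illustration under stated assumptions, not the actual generation script: `split_seeds` and the toy seed strings are hypothetical, and the real script may pick held-out seeds differently. The point it shows is that the split happens on seeds, so paraphrasing each side afterwards cannot mix a seed's paraphrases across train and test.

```python
import random
from typing import Dict, List, Tuple


def split_seeds(
    seeds_by_intent: Dict[str, List[str]],
    test_seeds_per_intent: int = 2,
    seed: int = 42,
) -> Tuple[Dict[str, List[str]], Dict[str, List[str]]]:
    """Hold out whole seeds per intent BEFORE any paraphrasing happens."""
    rng = random.Random(seed)
    train: Dict[str, List[str]] = {}
    test: Dict[str, List[str]] = {}
    for intent in sorted(seeds_by_intent):  # sorted for a deterministic order
        seeds = list(seeds_by_intent[intent])
        if len(seeds) <= test_seeds_per_intent:
            raise ValueError(f"not enough seeds for intent {intent!r}")
        rng.shuffle(seeds)
        test[intent] = seeds[:test_seeds_per_intent]
        train[intent] = seeds[test_seeds_per_intent:]
    return train, test


# Placeholder seed strings, not the real templates.
toy = {
    "LIST_QUERY": [f"list_seed_{i}" for i in range(5)],
    "CLEAR_LIST": [f"clear_seed_{i}" for i in range(4)],
}
train_seeds, test_seeds = split_seeds(toy)

# Paraphrasing would now run separately on each side.
for intent in toy:
    overlap = set(train_seeds[intent]) & set(test_seeds[intent])
    print(intent, "held out:", len(test_seeds[intent]), "overlap:", len(overlap))
```

Writing the resulting `test` mapping to disk is one way to obtain a manifest like the `split_manifest.json` mentioned above.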

Training Procedure

| Setting | Value |
|---|---|
| Base model | onlplab/alephbert-base |
| Optimizer | AdamW (HF Trainer default) |
| Learning rate | 2e-5 (linear warmup, linear decay) |
| Batch size | 16 (train) / 32 (eval) |
| Max sequence length | 128 tokens |
| Max epochs | 10 (early stopping on eval_accuracy, patience=3) |
| Loss | Cross-entropy |
| Precision | fp32 (no mixed precision) |
| Random seed | 42 for the run documented here (Python / NumPy / PyTorch / Trainer all pinned); the Performance figures average seeds 42, 43, 44 |

Reproduce the run exactly:

python scripts/hf_classifier/generate_training_data.py \
    --output-dir data/hf_classifier --test-seeds-per-intent 2 --seed 42
python scripts/hf_classifier/train_classifier.py \
    --data-dir data/hf_classifier --output-dir <out> --seed 42
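
For readers unfamiliar with the learning-rate setting in the table above (2e-5, linear warmup, linear decay), the shape of such a schedule can be sketched in plain Python. The warmup length and total step count are not stated in this card, so the values used below are assumptions for illustration only, and `linear_warmup_decay_lr` is a hypothetical helper rather than the Trainer's own code.

```python
def linear_warmup_decay_lr(step: int, total_steps: int, warmup_steps: int,
                           base_lr: float = 2e-5) -> float:
    """Learning rate at `step`: linear ramp up to base_lr, then linear decay to 0."""
    if warmup_steps > 0 and step < warmup_steps:
        # Warmup phase: scale linearly from 0 up to base_lr.
        return base_lr * step / warmup_steps
    # Decay phase: scale linearly from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Assumed values for illustration only: 1,000 total steps, 100 warmup steps.
for s in (0, 50, 100, 550, 1000):
    print(s, linear_warmup_decay_lr(s, total_steps=1000, warmup_steps=100))
```

The rate peaks at the base value exactly when warmup ends and reaches zero at the final step, which is why early stopping can end training while the rate is still well above zero.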

Compute & Environmental Footprint

| Item | Detail |
|---|---|
| Training compute | ~10 minutes on Apple M-series GPU (PyTorch MPS backend) |
| Equivalent | ~30 minutes on a single Google Colab T4 |
| Estimated CO₂ | < 5 g CO₂eq (single training run on personal hardware) |
| Software stack | transformers >= 4.40, torch >= 2.3, datasets >= 3.0, accelerate >= 0.26 |
| Data generation | ~2-5 min wall time + ~$0.02-0.05 OpenAI API spend for the synthetic corpus |

Out-of-scope Use

This model is not intended for:

  • Arabic, English-only, or other non-Hebrew text.
  • Long-form text (> 128 tokens). Tokenizer truncates; the model was never trained on longer inputs.
  • Non-shopping domains. Treat as a fine-tuning starting point, not a drop-in classifier for customer support / news / sentiment / etc.
  • Safety or abuse classification. OTHER is a "doesn't fit shopping intents" bucket, not a content filter.
  • Mizrahi, Ashkenazi-modern, ultra-Orthodox / Haredi, or heavy code-switching Hebrew dialects — these are under-represented in synthetic GPT-4o-mini output and accuracy on them is not measured.
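
One way to keep clearly out-of-scope inputs away from the classifier is a cheap pre-check before inference. The sketch below is an assumption-laden illustration: `hebrew_ratio` and `looks_in_scope` are hypothetical helpers, both cutoffs are arbitrary, and the word count is only a rough stand-in for the 128-token limit (it does not run the tokenizer). It cannot detect dialect or domain mismatch.

```python
import re

HEBREW_LETTER = re.compile(r"[\u05D0-\u05EA]")  # alef through tav


def hebrew_ratio(text: str) -> float:
    """Fraction of alphabetic characters that are Hebrew letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hebrew = sum(1 for ch in letters if HEBREW_LETTER.match(ch))
    return hebrew / len(letters)


def looks_in_scope(text: str, min_hebrew_ratio: float = 0.5,
                   max_words: int = 40) -> bool:
    """Cheap pre-check; both cutoffs are arbitrary assumptions."""
    return (hebrew_ratio(text) >= min_hebrew_ratio
            and len(text.split()) <= max_words)


print(looks_in_scope("תוסיף חלב וביצים"))
print(looks_in_scope("add milk and eggs"))
```

Messages that fail the check can be sent straight to whatever fallback path handles low-confidence predictions.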

Label Glossary

| ID | Label | English description |
|---|---|---|
| 0 | GROCERY_REQUEST | Add items to the shopping list |
| 1 | RECIPE_URL | Recipe URL: extract ingredients from a linked recipe |
| 2 | LIST_QUERY | Show the current shopping list |
| 3 | CLEAR_LIST | Mark all items as bought; clear the list |
| 4 | REMOVE_ITEM | Remove a specific item from the list |
| 5 | PARTIAL_COMPLETION | Mark most items bought except for some |
| 6 | GROUP_INFO | Show group members and details |
| 7 | GET_INVITE_CODE | Get the existing group invite code |
| 8 | CREATE_INVITE | Generate a new group invite code |
| 9 | RENAME_GROUP | Change the group name |
| 10 | LEAVE_GROUP | Leave the current group |
| 11 | NOTIFICATION_SETTINGS | Toggle notification preferences |
| 12 | REVOKE_INVITE | Cancel or invalidate a group invite code |
| 13 | RECIPE_SEARCH | Build a shopping list for a known dish |
| 14 | UPDATE_QUANTITY | Change the quantity of an existing item |
| 15 | BUG_REPORT | Report a bug or issue with the bot |
| 16 | OTHER | Conversational or off-topic message; not a shopping intent |
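
For downstream code that works with raw class indices (for example, when reading logits from an exported model), the glossary above can be written out as a mapping. This is transcribed from the table; the authoritative copy is the `id2label` entry in the model's config.json, and the constant names here are only illustrative.

```python
# Label order transcribed from the glossary table (IDs 0 through 16).
LABELS = [
    "GROCERY_REQUEST", "RECIPE_URL", "LIST_QUERY", "CLEAR_LIST",
    "REMOVE_ITEM", "PARTIAL_COMPLETION", "GROUP_INFO", "GET_INVITE_CODE",
    "CREATE_INVITE", "RENAME_GROUP", "LEAVE_GROUP", "NOTIFICATION_SETTINGS",
    "REVOKE_INVITE", "RECIPE_SEARCH", "UPDATE_QUANTITY", "BUG_REPORT", "OTHER",
]

ID2LABEL = dict(enumerate(LABELS))
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}

print(len(ID2LABEL), ID2LABEL[0], LABEL2ID["OTHER"])
```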

Limitations

  • OTHER is intentionally a catch-all and the weakest class (F1 ≈ 0.50). Production routes any prediction with score < 0.7 to a generative LLM fallback. We recommend the same pattern in your downstream task: don't trust OTHER as a positive signal — trust it only as a "nothing else fired" signal.
  • Class imbalance: per-intent training sizes are roughly uniform but not exact. We report macro F1 alongside weighted F1: macro F1 averages the per-intent scores so every intent counts equally, while weighted F1 weights each intent by its support and is the one to read when you care most about the dominant classes. On the test set every intent has exactly 22 examples, so the two metrics coincide.
  • Domain-specific. Trained on grocery/shopping intents. Performance outside this domain is unknown — treat as a fine-tuning base, not a general Hebrew classifier.
  • No Arabic, no English-only support. Inputs assumed to be Hebrew (occasional Hebrew/English code-switching is fine — training data includes it).
  • Short utterances only. Trained on inputs up to 128 tokens. Long-form text is not supported.
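
The relationship between macro and weighted F1 mentioned in the list above can be checked with a few lines of arithmetic. The sketch below uses three per-intent F1 values copied from the Per-intent F1 table; the skewed supports in the second example are made up purely to show when the two averages diverge, and `macro_and_weighted_f1` is a hypothetical helper.

```python
from typing import Dict, Tuple


def macro_and_weighted_f1(per_class: Dict[str, Tuple[float, int]]) -> Tuple[float, float]:
    """per_class maps label -> (f1, support). Returns (macro, weighted)."""
    f1s = [f1 for f1, _ in per_class.values()]
    total = sum(support for _, support in per_class.values())
    macro = sum(f1s) / len(f1s)
    weighted = sum(f1 * support for f1, support in per_class.values()) / total
    return macro, weighted


# Three rows from the per-intent table, each with support 22.
uniform = {"RENAME_GROUP": (0.993, 22), "BUG_REPORT": (0.340, 22), "OTHER": (0.505, 22)}
# Same F1 values with made-up, skewed supports to show the divergence.
skewed = {"RENAME_GROUP": (0.993, 80), "BUG_REPORT": (0.340, 10), "OTHER": (0.505, 10)}

print(macro_and_weighted_f1(uniform))
print(macro_and_weighted_f1(skewed))
```

With equal supports the two averages are the same number; once one strong class dominates, weighted F1 rises above macro F1 and hides the weak classes.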

Biases & Demographic Coverage

The synthetic training data was generated by GPT-4o-mini paraphrasing Hebrew seed templates. GPT-4o-mini produces mostly "neutral modern Hebrew" — the dialects and registers below are known coverage gaps:

  • Mizrahi-influenced Hebrew (vowel patterns, loanwords from Arabic / Ladino / Persian)
  • Ultra-Orthodox / Haredi Hebrew (religious code-switching with Yiddish and Aramaic terms)
  • Young / slang Hebrew (informal contractions, recent borrowings, internet jargon)
  • Mixed Hebrew–Russian / Hebrew–Amharic code-switching common in immigrant communities
  • Spelling variants that drop niqqud or use non-standard final letters

We did not evaluate on these and make no accuracy claim for them. If your production traffic includes any of the above, do your own held-out evaluation before deploying.

Attribution & License

Base model: onlplab/alephbert-base (Apache 2.0). This fine-tune is also distributed under Apache 2.0. See LICENSE and LICENSE-ALEPHBERT.md in the GitHub repo.

Citation

If you use this model in academic work, please cite the original AlephBERT paper:

@inproceedings{seker-etal-2022-alephbert,
    title     = "{A}leph{BERT}: Language Model Pre-training and Evaluation
                 from Sub-Word to Sentence Level",
    author    = "Seker, Amit and Bandel, Elron and Bareket, Dan and
                 Brusilovsky, Idan and Greenfeld, Refael and Tsarfaty, Reut",
    booktitle = "Proceedings of the 60th Annual Meeting of the ACL",
    year      = "2022",
    address   = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url       = "https://aclanthology.org/2022.acl-long.4",
    pages     = "46--56",
}

For the fine-tune itself:

@misc{spivakovsky2026alephbertintenthe,
  title         = {{AlephBERT Hebrew Shopping Intent Classifier}},
  author        = {Spivakovsky, Alex},
  year          = {2026},
  publisher     = {Hugging Face},
  howpublished  = {\url{https://huggingface.co/spivi87/alephbert-intent-he}},
}