Shoshan โ€” Hebrew lemmatizer (weights)

A context-aware Hebrew lemmatizer that does not hallucinate. It retrieves the lemma from a fixed bank, and when the top retrieval is morphologically implausible for the surface form, transduces it with a learned, form-relative edit script. Every output is a real bank entry or a bounded edit of the input word, so the model cannot emit a free-form string.

Trained only on the openly redistributable Knesset + Wikipedia portions of the IAHLT Hebrew UD treebank, plus public Hebrew lexicons.

Contents

folder what
model/ the fine-tuned encoder (DictaBERT backbone) + POS head + edit-script head and inventory
bank/ the pre-encoded lemma bank (lemmas.csv + lemmas.npy, ~118k lemmas)

Usage

pip install git+https://github.com/ivrit/shoshan.git
from shoshan import Lemmatizer

lz = Lemmatizer.from_pretrained()        # pulls these weights, then caches
lz.lemma("ื”ืžื˜ืขื ื™ื", "ื”ื•ื ืคืจืง ืืช ื”ืžื˜ืขื ื™ื ืžื”ืžืฉืื™ืช.")   # -> ืžื˜ืขืŸ

Results (out-of-domain, held-out registers)

  • Lemma accuracy 92.4% out-of-domain (94.3% in-domain).
  • Bยณ consistency leads DictaBERT-lex on both precision and recall (0.965 / 0.953 vs 0.906 / 0.932).
  • 0.0% low-overlap errors on unseen words, vs 12.3% for DictaBERT-lex (which predicts each lemma as a single token from its vocabulary).

DictaBERT-lex was trained on more data than is used here, including the domains held out for evaluation, so the comparison is conservative.

License and credit

Code: MIT. The encoder is fine-tuned from DictaBERT (dicta-il/dictabert) and is subject to that model's license. The lemma bank is derived from a public Hebrew lemma lexicon and the MILA morphological lexicon; see the code repository's docs/DATA_STATEMENT.md for provenance and terms. We thank Avner Algom and the IAHLT for the treebank data.

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Space using noamor/shoshan 1