DictaBERT: A State-of-the-Art BERT Suite for Modern Hebrew

State-of-the-art language model for Hebrew, released here.

This is the fine-tuned model for the lemmatization task.

For the bert-base models for other tasks, see here.

General guidelines for how the lemmatizer works:

Given a Hebrew input text, the model attempts to match each word with the correct lexeme from the BERT vocabulary.

A word that is split into multiple wordpieces poses no problem; the lexeme is still predicted with high accuracy.
If the lexeme of a given word does not appear in the vocabulary, the model attempts to predict the special token [BLANK]. In that case, the word is usually the name of a person or a city, and the lexeme is most likely the word with its prefixes removed, which can be done with the dictabert-seg tool.
For verbs, the lexeme is the 3rd person singular past form.
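
The [BLANK] fallback described above can be sketched as a small post-processing step. This is a minimal sketch, not part of the released model: it assumes each sentence's prediction is a list of [word, lexeme] pairs, and that [BLANK] is surfaced as that literal string in the lexeme slot. The names `resolve_blanks`, `strip_prefixes`, and `toy_strip` are hypothetical; the callback stands in for whatever prefix segmentation is available, such as one built on dictabert-seg.

```python
from typing import Callable, List, Sequence, Tuple

BLANK = "[BLANK]"  # assumption: the special token appears as this literal string


def resolve_blanks(
    pairs: Sequence[Sequence[str]],
    strip_prefixes: Callable[[str], str] = lambda w: w,
) -> List[Tuple[str, str]]:
    """Replace [BLANK] lexemes with the prefix-stripped surface word."""
    resolved = []
    for word, lexeme in pairs:
        if lexeme == BLANK:
            # Out-of-vocabulary lexeme: fall back to the word minus its prefixes.
            lexeme = strip_prefixes(word)
        resolved.append((word, lexeme))
    return resolved


def toy_strip(word: str) -> str:
    """Toy prefix stripper for illustration only; a real one would use a segmenter."""
    return word[1:] if word.startswith("ל") and len(word) > 2 else word


print(resolve_blanks([["לירושלים", BLANK], ["שנה", "שנה"]], toy_strip))
```

Passing the segmenter in as a callback keeps the fallback independent of any particular segmentation tool.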

This method is purely neural, so in rare instances the predicted lexeme may not be lexically related to the input word but may instead be a synonym drawn from the same semantic space. To handle these edge cases, one can add a filter on top of the prediction that examines the top K matches and uses a measure such as edit distance to choose the candidate that most plausibly forms a lexeme for the input word.
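
One possible form of such a filter is sketched here, using Levenshtein edit distance as the only measure. How the top K candidates are obtained from the model is not shown, and the candidate list in the example is invented for illustration; `edit_distance` and `pick_lexeme` are hypothetical helper names, not part of the model's API.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def pick_lexeme(word: str, candidates: Sequence[str]) -> str:
    """Return the candidate closest to `word`; ties keep the earlier candidate."""
    return min(candidates, key=lambda c: edit_distance(word, c))


# Hypothetical top-3 candidates, ordered by model score (assumed, not real output).
print(pick_lexeme("לימודיו", ["שיעור", "לימוד", "למד"]))
```

Because `min` returns the first minimal element, candidates should be passed in the model's ranking order so that ties resolve in favor of the higher-scoring prediction. A production filter might combine edit distance with the model score rather than rely on distance alone.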

Sample usage:

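from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('dicta-il/dictabert-lex')
# trust_remote_code=True loads the repository's custom model code, which provides predict()
model = AutoModel.from_pretrained('dicta-il/dictabert-lex', trust_remote_code=True)

# Switch to inference mode
model.eval()

sentence = 'בשנת 1948 השלים אפרים קישון את לימודיו בפיסול מתכת ובתולדות האמנות והחל לפרסם מאמרים הומוריסטיים'
# predict() takes a list of sentences and returns one list of [word, lexeme] pairs per sentence
print(model.predict([sentence], tokenizer))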

Output:

[
  [
    [
      "בשנת",
      "שנה"
    ],
    [
      "1948",
      "1948"
    ],
    [
      "השלים",
      "השלים"
    ],
    [
      "אפרים",
      "אפרים"
    ],
    [
      "קישון",
      "קישון"
    ],
    [
      "את",
      "את"
    ],
    [
      "לימודיו",
      "לימוד"
    ],
    [
      "בפיסול",
      "פיסול"
    ],
    [
      "מתכת",
      "מתכת"
    ],
    [
      "ובתולדות",
      "תולדה"
    ],
    [
      "האמנות",
      "אומנות"
    ],
    [
      "והחל",
      "החל"
    ],
    [
      "לפרסם",
      "פרסם"
    ],
    [
      "מאמרים",
      "מאמר"
    ],
    [
      "הומוריסטיים",
      "הומוריסטי"
    ]
  ]
]
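
The nested structure above (one list per input sentence, each holding [word, lexeme] pairs) can be consumed as follows. This is a minimal sketch that hard-codes the first three pairs of the sample output instead of calling the model; `lemmatized_text` is a hypothetical helper name.

```python
from typing import Sequence


def lemmatized_text(sentence_pairs: Sequence[Sequence[str]]) -> str:
    """Join the lexeme of each [word, lexeme] pair into one space-separated string."""
    return " ".join(lexeme for _word, lexeme in sentence_pairs)


# First three pairs of the sample output, wrapped in the outer per-sentence list.
predictions = [[["בשנת", "שנה"], ["1948", "1948"], ["השלים", "השלים"]]]
for pairs in predictions:
    print(lemmatized_text(pairs))
```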

Citation

If you use DictaBERT-lex in your research, please cite "MRL Parsing Without Tears: The Case of Hebrew".

BibTeX:

@misc{shmidman2024mrl,
      title={MRL Parsing Without Tears: The Case of Hebrew}, 
      author={Shaltiel Shmidman and Avi Shmidman and Moshe Koppel and Reut Tsarfaty},
      year={2024},
      eprint={2403.06970},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

License

This work is licensed under a Creative Commons Attribution 4.0 International License.
