Commit 7177c13 by imvladikon (1 parent: 6e782ec)

Create README.md

Files changed (1): README.md (+93, -0)

README.md ADDED:
---
language:
- he
tags:
- language model
pipeline_tag: feature-extraction
---

## AlephBertGimmel

A BERT model pretrained on Modern Hebrew, with a 128K-token vocabulary.

This is the [checkpoint](https://github.com/Dicta-Israel-Center-for-Text-Analysis/alephbertgimmel/tree/main/alephbertgimmel-small/ckpt_29400--Max128Seq) of alephbertgimmel-small-128 from the [alephbertgimmel](https://github.com/Dicta-Israel-Center-for-Text-Analysis/alephbertgimmel) repository.
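
A quick sanity check on the extra-large vocabulary before running the masked-token prediction example below. This is a minimal sketch, not part of the original card; the card says "128K" without an exact count, so the printed number is an expectation rather than a documented figure.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("imvladikon/alephbertgimmel-small-128")

# The card advertises a 128K-token vocabulary; len(tokenizer) reports the
# checkpoint's actual vocabulary size, including special tokens.
print(len(tokenizer))
```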

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model = AutoModelForMaskedLM.from_pretrained("imvladikon/alephbertgimmel-small-128")
tokenizer = AutoTokenizer.from_pretrained("imvladikon/alephbertgimmel-small-128")

# "{} is a metropolis that constitutes the center of the economy"
text = "{} היא מטרופולין המהווה את מרכז הכלכלה"

# Encode the sentence with [MASK] in the slot and locate the masked position
input_ids = tokenizer.encode(text.format("[MASK]"), return_tensors="pt")
mask_token_index = torch.where(input_ids == tokenizer.mask_token_id)[1]

# Score the vocabulary at the masked position and keep the top-5 candidates
token_logits = model(input_ids).logits
mask_token_logits = token_logits[0, mask_token_index, :]
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(text.format(tokenizer.decode([token])))

# ישראל היא מטרופולין המהווה את מרכז הכלכלה
# ירושלים היא מטרופולין המהווה את מרכז הכלכלה
# חיפה היא מטרופולין המהווה את מרכז הכלכלה
# אילת היא מטרופולין המהווה את מרכז הכלכלה
# אשדוד היא מטרופולין המהווה את מרכז הכלכלה
```
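
The same top-5 prediction can also be run through the `fill-mask` pipeline. This is a hedged equivalent of the snippet above, not an example from the original card:

```python
from transformers import pipeline

# The fill-mask pipeline wraps the tokenizer and masked-LM head used above
fill_mask = pipeline("fill-mask", model="imvladikon/alephbertgimmel-small-128")

# "[MASK] is a metropolis that constitutes the center of the economy"
for prediction in fill_mask("[MASK] היא מטרופולין המהווה את מרכז הכלכלה", top_k=5):
    print(prediction["token_str"], round(prediction["score"], 3))
```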

```python
def ppl_naive(text, model, tokenizer):
    # Naive "perplexity": the unmasked sentence serves as its own labels
    input_ids = tokenizer.encode(text, return_tensors="pt")
    loss = model(input_ids, labels=input_ids)[0]
    return torch.exp(loss).item()

# "{} is the capital city of the State of Israel, and the largest city in Israel by population size"
text = """{} היא עיר הבירה של מדינת ישראל, והעיר הגדולה ביותר בישראל בגודל האוכלוסייה"""

# Haifa, Jerusalem, Tel Aviv
for word in ["חיפה", "ירושלים", "תל אביב"]:
    print(ppl_naive(text.format(word), model, tokenizer))

# 9.825098991394043
# 10.594215393066406
# 9.536449432373047

# I'd expect "ירושלים" (Jerusalem) to get the smallest value here, but it doesn't.
# Note that the naive score shows the model the very tokens it is asked to predict;
# the pseudo-perplexity below masks one token at a time instead:

@torch.inference_mode()
def ppl_pseudo(text, model, tokenizer, ignore_idx=-100):
    input_ids = tokenizer.encode(text, return_tensors="pt")
    # Build one copy of the sentence per inner token, each with a single [MASK]
    # ([CLS] and [SEP] are never masked)
    mask = torch.ones(input_ids.size(-1) - 1).diag(1)[:-2]
    repeat_input = input_ids.repeat(input_ids.size(-1) - 2, 1)
    masked_input = repeat_input.masked_fill(mask == 1, tokenizer.mask_token_id)
    # Compute the loss only at the masked positions
    labels = repeat_input.masked_fill(masked_input != tokenizer.mask_token_id, ignore_idx)
    loss = model(masked_input, labels=labels)[0]
    return torch.exp(loss).item()

for word in ["חיפה", "ירושלים", "תל אביב"]:
    print(ppl_pseudo(text.format(word), model, tokenizer))

# 4.346900939941406
# 3.292382001876831
# 2.732590913772583
```
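
The card's `pipeline_tag` is `feature-extraction`, so the model can also be used as a plain encoder. The sketch below is illustrative only: mean pooling over the last hidden state is an assumption, not something the original card prescribes, and loading with `AutoModel` drops the masked-LM head.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("imvladikon/alephbertgimmel-small-128")
encoder = AutoModel.from_pretrained("imvladikon/alephbertgimmel-small-128")

# "Jerusalem is the capital city of the State of Israel"
sentence = "ירושלים היא עיר הבירה של מדינת ישראל"

with torch.inference_mode():
    encoded = tokenizer(sentence, return_tensors="pt")
    hidden = encoder(**encoded).last_hidden_state   # (1, seq_len, hidden_size)
    mask = encoded["attention_mask"].unsqueeze(-1)  # (1, seq_len, 1)
    # Mean-pool over non-padding tokens to get one sentence vector
    embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

print(embedding.shape)  # torch.Size([1, hidden_size])
```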

When using AlephBertGimmel, please cite:

```bibtex
@misc{guetta2022large,
      title={Large Pre-Trained Models with Extra-Large Vocabularies: A Contrastive Analysis of Hebrew BERT Models and a New One to Outperform Them All},
      author={Eylon Guetta and Avi Shmidman and Shaltiel Shmidman and Cheyn Shmuel Shmidman and Joshua Guedalia and Moshe Koppel and Dan Bareket and Amit Seker and Reut Tsarfaty},
      year={2022},
      eprint={2211.15199},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```