license: cc-by-sa-4.0
datasets:
- HaifaCLGroup/KnessetCorpus
language:
- he
tags:
- hebrew
- nlp
- masked-language-model
- transformers
- BERT
- parliamentary-proceedings
- language-model
- Knesset
- DictaBERT
- fine-tuning
Knesset-DictaBERT
Knesset-DictaBERT is a Hebrew language model fine-tuned on the Knesset Corpus, which comprises Israeli parliamentary proceedings.
The model is based on the DictaBERT architecture and is designed to understand Hebrew text and predict masked tokens in it, with a specific focus on parliamentary language and context.
Model Details
- Model type: BERT-based (Bidirectional Encoder Representations from Transformers)
- Language: Hebrew
- Training Data: Knesset Corpus (Israeli parliamentary proceedings)
- Base Model: DictaBERT
Training Procedure
The model was fine-tuned using the masked language modeling (MLM) task on the Knesset Corpus. The MLM task involves predicting masked words in a sentence, allowing the model to learn contextual representations of words.
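The masking step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the training code used for Knesset-DictaBERT: the 15% selection rate and the 80/10/10 replacement split are the defaults of the original BERT recipe and are assumed here, and `mask_tokens`, the toy vocabulary, and the sample sentence are hypothetical.

```python
import random

MASK_TOKEN = "[MASK]"  # mask symbol, as in the usage example


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) for one BERT-style MLM example.

    labels[i] is the original token at positions selected for prediction
    and None elsewhere (those positions do not contribute to the loss).
    mask_prob and the 80/10/10 split are assumed BERT defaults, not values
    reported for Knesset-DictaBERT.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() >= mask_prob:
            # Position not selected: keep the token, no prediction target.
            corrupted.append(token)
            labels.append(None)
            continue
        labels.append(token)
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK_TOKEN)         # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            corrupted.append(token)              # 10%: keep the original token
    return corrupted, labels


# Toy English tokens standing in for a tokenized Knesset sentence.
toy_tokens = "the committee will meet on this next week".split()
toy_vocab = ["session", "law", "member", "vote"]
corrupted, labels = mask_tokens(toy_tokens, toy_vocab, mask_prob=0.3, seed=1)
print(corrupted)
print(labels)
```

During fine-tuning, the model sees the corrupted sequence and is trained to recover the original token at every position where the label is not `None`.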
Usage
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch
tokenizer = AutoTokenizer.from_pretrained("GiliGold/Knesset-DictaBERT")
model = AutoModelForMaskedLM.from_pretrained("GiliGold/Knesset-DictaBERT")
model.eval()
# English gloss: "We have a [MASK] on this next week"
sentence = "יש לנו [MASK] על זה בשבוע הבא"
# Tokenize the input sentence and get predictions
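inputs = tokenizer(sentence, return_tensors='pt')
with torch.no_grad():
    output = model(**inputs)
# Locate the [MASK] position rather than hard-coding its index
mask_token_index = (inputs['input_ids'][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0][0]
top_2_tokens = torch.topk(output.logits[0, mask_token_index, :], 2).indices.tolist()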
# Convert token IDs to tokens and print them
print('\n'.join(tokenizer.convert_ids_to_tokens(top_2_tokens)))
# Example output: ישיבה / דיון (English gloss: "meeting" / "discussion")
Evaluation
The model was evaluated on a test set comprising 10% of the Knesset Corpus, approximately 3.2 million sentences. Perplexity was calculated on the full test set. Due to time constraints, accuracy was calculated on a subset of the test set containing about 300,000 sentences (roughly 3.5 million tokens).
Perplexity
The perplexity of the original DictaBERT on the full test set is 22.87.
The perplexity of Knesset-DictaBERT on the full test set is 6.60.
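For readers unfamiliar with the metric, perplexity is conventionally the exponential of the average negative log-likelihood the model assigns to the gold tokens, so lower is better. The sketch below shows that relationship on toy numbers. It assumes this standard definition with natural logarithms; the exact evaluation script is not described here, and the `perplexity` helper and the toy probabilities are hypothetical.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood), natural-log inputs."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Toy values: probabilities a model assigned to three gold tokens.
toy_log_probs = [math.log(0.5), math.log(0.25), math.log(0.125)]
print(perplexity(toy_log_probs))

# Under the same definition, a reported perplexity maps back to an
# average loss of ln(perplexity) nats per predicted token.
for name, ppl in [("DictaBERT", 22.87), ("Knesset-DictaBERT", 6.60)]:
    print(f"{name}: {math.log(ppl):.2f} nats per token")
```

A model that spread its probability uniformly over N candidates at every position would have a perplexity of N, which gives an intuitive reading of the two figures above.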
Accuracy
- Top-1 accuracy
Knesset-DictaBERT identified the correct token in the top-1 prediction in 52.55% of the cases.
The original DictaBERT model achieved a top-1 accuracy of 48.02%.
- Top-2 accuracy
Knesset-DictaBERT identified the correct token within the top-2 predictions in 63.07% of the cases.
The original DictaBERT model achieved a top-2 accuracy of 58.60%.
- Top-5 accuracy
Knesset-DictaBERT identified the correct token within the top-5 predictions in 73.59% of the cases.
The original DictaBERT model achieved a top-5 accuracy of 68.98%.
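Top-k accuracy counts a masked position as correct when the gold token is among the k highest-scoring predictions. The following sketch illustrates that computation on toy scores. It is an illustration of the metric only, not the evaluation code behind the figures above; `top_k_accuracy` and the toy data are hypothetical.

```python
def top_k_accuracy(score_rows, gold_ids, k):
    """Fraction of positions whose gold token id is among the k highest scores."""
    if len(score_rows) != len(gold_ids):
        raise ValueError("score_rows and gold_ids must have the same length")
    if not gold_ids:
        raise ValueError("need at least one masked position")
    hits = 0
    for scores, gold in zip(score_rows, gold_ids):
        # Rank vocabulary ids by descending score and keep the first k.
        ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
        hits += gold in ranked[:k]
    return hits / len(gold_ids)


# Toy scores over a 3-token vocabulary for three masked positions.
toy_scores = [
    [0.1, 0.7, 0.2],  # gold id 1 is ranked first
    [0.5, 0.2, 0.3],  # gold id 2 is ranked second
    [0.6, 0.3, 0.1],  # gold id 2 is ranked third
]
toy_gold = [1, 2, 2]
for k in (1, 2, 3):
    print(k, top_k_accuracy(toy_scores, toy_gold, k))
```

Because the top-k set only grows with k, the accuracy is non-decreasing in k, which matches the pattern in the reported top-1, top-2, and top-5 results.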
Acknowledgments
This model is built upon the work of the Dicta team, and their contributions are gratefully acknowledged.
Citation
If you use this model in your work, please cite:
@misc{goldin2024knessetdictaberthebrewlanguagemodel,
title={Knesset-DictaBERT: A Hebrew Language Model for Parliamentary Proceedings},
author={Gili Goldin and Shuly Wintner},
year={2024},
eprint={2407.20581},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2407.20581},
}