metadata
license: cc-by-sa-4.0
datasets:
  - HaifaCLGroup/KnessetCorpus
language:
  - he
tags:
  - hebrew
  - nlp
  - masked-language-model
  - transformers
  - BERT
  - parliamentary-proceedings
  - language-model
  - Knesset
  - DictaBERT
  - fine-tuning

Knesset-DictaBERT

Knesset-DictaBERT is a Hebrew language model fine-tuned on the Knesset Corpus, which comprises Israeli parliamentary proceedings.

The model is based on DictaBERT. As a masked language model, it is designed to produce contextual representations of Hebrew text and to predict masked tokens, with a specific focus on parliamentary language and context.

Model Details

  • Model type: BERT-based (Bidirectional Encoder Representations from Transformers)
  • Language: Hebrew
  • Training Data: Knesset Corpus (Israeli parliamentary proceedings)
  • Base Model: DictaBERT

Training Procedure

The model was fine-tuned using the masked language modeling (MLM) task on the Knesset Corpus. The MLM task involves predicting masked words in a sentence, allowing the model to learn contextual representations of words.
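The masking step described above can be illustrated with a toy sketch in plain Python. The `mask_tokens` helper, the whitespace tokenization, and the masking rates are assumptions for illustration only. The masking settings actually used for fine-tuning are not stated here, and the standard BERT recipe additionally replaces some selected tokens with random tokens or leaves them unchanged.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels) for a toy MLM example.

    labels[i] holds the original token where a mask was applied and
    None elsewhere, so a loss would only be computed at masked positions.
    mask_prob=0.15 is an assumed, commonly used rate (hypothetical here).
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)   # hide the token from the model
            labels.append(tok)    # the model must recover this token
        else:
            masked.append(tok)
            labels.append(None)   # position is not scored
    return masked, labels

# Toy English sentence standing in for a tokenized Knesset sentence
tokens = "the committee will hold a discussion on this next week".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

During training, the model sees `masked` and is scored only on the positions where `labels` is not `None`.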

Usage

from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("GiliGold/Knesset-DictaBERT")
model = AutoModelForMaskedLM.from_pretrained("GiliGold/Knesset-DictaBERT")

model.eval()

# English gloss: "We have a [MASK] on this next week"
sentence = "יש לנו [MASK] על זה בשבוע הבא"

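# Tokenize the input sentence and get predictions
inputs = tokenizer(sentence, return_tensors='pt')
with torch.no_grad():
    output = model(**inputs)

# Find the position of the [MASK] token rather than hard-coding it
mask_token_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
top_2_tokens = torch.topk(output.logits[0, mask_token_index, :], 2)[1]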

# Convert token IDs to tokens and print them
print('\n'.join(tokenizer.convert_ids_to_tokens(top_2_tokens)))

# Example output: ישיבה / דיון  (English gloss: "meeting" / "discussion")

Evaluation

The evaluation was conducted on a 10% test set of the Knesset Corpus, consisting of approximately 3.2 million sentences. Perplexity was calculated on the full test set. Due to time constraints, accuracy was calculated on a subset of the test set containing approximately 3 million sentences (approximately 520 million tokens).

Perplexity

The perplexity of the original DictaBERT on the full test set is 22.87. The perplexity of Knesset-DictaBERT on the full test set is 6.60.
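For readers unfamiliar with the metric, here is a minimal sketch of how perplexity is commonly defined: the exponential of the mean negative log-likelihood of the correct tokens. This is an assumption about the convention. The exact procedure behind the figures above (which positions were masked and how the scores were averaged) is not stated here, and the `perplexity` helper and its toy input are hypothetical.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood).

    token_log_probs: natural-log probabilities that a model assigned to
    the correct token at each evaluated (masked) position.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# Toy input (not real model output): probability 0.5 at every position
toy_log_probs = [math.log(0.5)] * 4
print(perplexity(toy_log_probs))
```

Under this definition, a perplexity of 6.60 corresponds to an average loss of about 1.89 nats per token, compared with about 3.13 nats for a perplexity of 22.87.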

Accuracy

  • Top-1 accuracy

Knesset-DictaBERT identified the correct token as its top-1 prediction in 52.55% of the cases. The original DictaBERT model achieved a top-1 accuracy of 48.02%.

  • Top-2 accuracy

Knesset-DictaBERT identified the correct token within its top-2 predictions in 63.07% of the cases. The original DictaBERT model achieved a top-2 accuracy of 58.60%.

  • Top-5 accuracy

Knesset-DictaBERT identified the correct token within its top-5 predictions in 73.59% of the cases. The original DictaBERT model achieved a top-5 accuracy of 68.98%.
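The top-k measures above can be sketched in plain Python as follows: a prediction counts as correct when the gold token is among the k highest-scoring candidates. The `top_k_accuracy` helper and the toy scores are hypothetical and only illustrate the metric. They are not the evaluation code or data behind the reported numbers.

```python
def top_k_accuracy(scores, gold_ids, k):
    """Fraction of positions whose gold token id is among the k
    highest-scoring candidate ids.

    scores: list of per-position score lists (list index = token id).
    gold_ids: the correct token id for each position.
    """
    if len(scores) != len(gold_ids):
        raise ValueError("scores and gold_ids must have the same length")
    hits = 0
    for row, gold in zip(scores, gold_ids):
        # Rank token ids by descending score and keep the first k
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += gold in top_k
    return hits / len(gold_ids)

# Toy scores over a 4-token vocabulary (not real model output)
scores = [
    [0.1, 0.7, 0.1, 0.1],  # gold id 1 is ranked first
    [0.4, 0.3, 0.2, 0.1],  # gold id 1 is ranked second
    [0.1, 0.2, 0.3, 0.4],  # gold id 0 is ranked last
]
gold = [1, 1, 0]
for k in (1, 2, 5):
    print(k, top_k_accuracy(scores, gold, k))
```

Because the top-k set grows with k, top-k accuracy can never decrease as k increases, which matches the ordering of the top-1, top-2, and top-5 figures reported above.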

Acknowledgments

This model is built upon the work of the Dicta team, and their contributions are gratefully acknowledged.

Citation

If you use this model in your work, please cite:

@misc{Knesset-DictaBERT,
  author = {Gili Goldin},
  title = {Knesset-DictaBERT: A Hebrew Language Model for Parliamentary Proceedings},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/GiliGold/Knesset-DictaBERT}},
}