
Legal-HeBERT

Legal-HeBERT is a BERT model for the Hebrew legal and legislative domains. It is intended to advance legal NLP research and tool development in Hebrew. We release two versions of Legal-HeBERT. The first version is a fine-tuned model of HeBERT applied to legal and legislative documents. The second version uses HeBERT's architecture guidelines to train a BERT model from scratch.
We continue to collect legal data, examine different architectural designs, and build tagged datasets and legal tasks for evaluating and developing Hebrew legal tools.

Training Data

Our training datasets are:

| Name | Hebrew Description | Size (GB) | Documents | Sentences | Words | Notes |
|------|--------------------|-----------|-----------|-----------|-------|-------|
| The Israeli Law Book | ספר החוקים הישראלי | 0.05 | 2,338 | 293,352 | 4,851,063 | |
| Judgments of the Supreme Court | מאגר פסקי הדין של בית המשפט העליון | 0.7 | 212,348 | 5,790,138 | 79,672,415 | |
| Custody courts | החלטות בתי הדין למשמורת | 2.46 | 169,708 | 8,555,893 | 213,050,492 | |
| Law memoranda, drafts of secondary legislation and drafts of support tests that have been distributed to the public for comment | תזכירי חוק, טיוטות חקיקת משנה וטיוטות מבחני תמיכה שהופצו להערות הציבור | 0.4 | 3,291 | 294,752 | 7,218,960 | |
| Supervisors of Land Registration judgments | מאגר פסקי דין של המפקחים על רישום המקרקעין | 0.02 | 559 | 67,639 | 1,785,446 | |
| Decisions of the Labor Court - Corona | מאגר החלטות בית הדין לעניין שירות התעסוקה - קורונה | 0.001 | 146 | 3,505 | 60,195 | |
| Decisions of the Israel Lands Council | החלטות מועצת מקרקעי ישראל | | 118 | 11,283 | 162,692 | aggregate file |
| Judgments of the Disciplinary Tribunal and the Israel Police Appeals Tribunal | פסקי דין של בית הדין למשמעת ובית הדין לערעורים של משטרת ישראל | 0.02 | 54 | 83,724 | 1,743,419 | aggregate files |
| Disciplinary Appeals Committee in the Ministry of Health | ועדת ערר לדין משמעתי במשרד הבריאות | 0.004 | 252 | 21,010 | 429,807 | 465 scanned files could not be parsed |
| Attorney General's Positions | מאגר התייצבויות היועץ המשפטי לממשלה | 0.008 | 281 | 32,724 | 813,877 | |
| Legal-Opinion of the Attorney General | מאגר חוות דעת היועץ המשפטי לממשלה | 0.002 | 44 | 7,132 | 188,053 | |
| Total | | 3.665 | 389,139 | 15,161,152 | 309,976,419 | |

We thank Yair Gardin for referring us to the governance data, Elhanan Schwarts for collecting and parsing the Israeli Law Book, and Jonathan Schler for collecting the judgments of the Supreme Court.

Training process

  • Vocabulary size: 50,000 tokens
  • 4 epochs (~1M steps)
  • lr=5e-5
  • mlm_probability=0.15
  • batch size = 32 (for each gpu)
  • NVIDIA GeForce RTX 2080 TI + NVIDIA GeForce RTX 3090 (1 week training)
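
The mlm_probability=0.15 setting above follows BERT's standard masked-language-modeling recipe. As a minimal pure-Python sketch (not the actual training code), assuming the usual 80/10/10 split between the [MASK] token, a random token, and the unchanged token, and a toy vocabulary:

```python
import random

MASK = "[MASK]"
TOY_VOCAB = ["a", "b", "c", "d"]  # stand-in for the real 50,000-token vocabulary

def mask_tokens(tokens, mlm_probability=0.15, rng=None):
    """BERT-style masking: each token is selected with mlm_probability;
    a selected token becomes [MASK] 80% of the time, a random vocabulary
    token 10%, and stays unchanged 10%. Unselected tokens get label None
    (excluded from the MLM loss)."""
    rng = rng or random.Random()
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mlm_probability:
            labels.append(tok)  # model must predict the original token here
            r = rng.random()
            if r < 0.8:
                inputs.append(MASK)
            elif r < 0.9:
                inputs.append(rng.choice(TOY_VOCAB))
            else:
                inputs.append(tok)
        else:
            labels.append(None)
            inputs.append(tok)
    return inputs, labels
```

In practice this is what `DataCollatorForLanguageModeling(mlm_probability=0.15)` in the transformers library does on batched tensors.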

Additional training settings:

Fine-tuned HeBERT model: the first eight layers were frozen, as suggested by Lee et al. (2019)
Legal-HeBERT trained from scratch: the training process is similar to HeBERT's and inspired by Chalkidis et al. (2020)
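
Freezing the first eight layers can be expressed by matching Hugging Face BERT parameter names. A sketch under assumptions: the card does not publish the exact freezing code, and the treatment of the embedding layer is left open here.

```python
FROZEN_LAYERS = 8  # first eight transformer layers stay fixed, per the card

def is_frozen(name: str, n_frozen: int = FROZEN_LAYERS) -> bool:
    """Return True if a parameter belongs to encoder layers 0..n_frozen-1.

    Parameter names follow Hugging Face BERT conventions, e.g.
    'encoder.layer.3.attention.self.query.weight'. Embeddings, pooler,
    and any task head are left trainable in this sketch.
    """
    if name.startswith("encoder.layer."):
        layer_idx = int(name.split(".")[2])
        return layer_idx < n_frozen
    return False
```

With a loaded model, this would be applied as `for name, param in model.named_parameters(): param.requires_grad = not is_frozen(name)`.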

How to use

The models can be found on the Hugging Face Hub and can be fine-tuned for any downstream task:

```python
# !pip install transformers==4.14.1
from transformers import AutoTokenizer, AutoModel

model_name = 'avichr/Legal-heBERT_ft'  # the fine-tuned HeBERT model
# model_name = 'avichr/Legal-heBERT'   # the legal HeBERT model trained from scratch

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model=model_name,
)
fill_mask("הקורונה לקחה את [MASK] ולנו לא נשאר דבר.")
```

Stay tuned!

We are still working on our models and the datasets. We will edit this page as we progress. We are open for collaborations.

If you use this model, please cite us as:

Chriqui, Avihay, Yahav, Inbal and Bar-Siman-Tov, Ittai, Legal HeBERT: A BERT-based NLP Model for Hebrew Legal, Judicial and Legislative Texts (June 27, 2022). Available at: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4147127

```bibtex
@article{chriqui2021hebert,
  title={Legal HeBERT: A BERT-based NLP Model for Hebrew Legal, Judicial and Legislative Texts},
  author={Chriqui, Avihay and Yahav, Inbal and Bar-Siman-Tov, Ittai},
  journal={SSRN preprint:4147127},
  year={2022}
}
```

Contact us

Avichay Chriqui, The Coller AI Lab
Inbal Yahav, The Coller AI Lab
Ittai Bar-Siman-Tov, the BIU Innovation Lab for Law, Data-Science and Digital Ethics

Thank you, תודה, شكرا
