
We introduce BERTurk-Legal, a transformer-based language model for retrieving prior legal cases. BERTurk-Legal is pre-trained on a dataset from the Turkish legal domain. This dataset contains no labels for the prior court case retrieval task; instead, BERTurk-Legal is trained in a self-supervised manner with masked language modeling. With zero-shot classification, BERTurk-Legal achieves state-of-the-art results on a dataset of legal cases from the Court of Cassation of Turkey. The experimental results demonstrate the need for language models specific to the Turkish legal domain. Details of BERTurk-Legal can be found in the paper referenced in the Citation section below.

The test dataset can be accessed at the following link: https://github.com/koc-lab/yargitay_retrieval_dataset

The model can be loaded and used to create document embeddings as follows; the document embeddings can then be used for retrieval.

from transformers import AutoModelForSequenceClassification, AutoTokenizer

bert_model = "KocLab-Bilkent/BERTurk-Legal"

model = AutoModelForSequenceClassification.from_pretrained(bert_model, output_hidden_states=True)
tokenizer = AutoTokenizer.from_pretrained(bert_model)

tokens = tokenizer("Örnek metin", return_tensors="pt") # dummy input text ("Örnek metin" means "Sample text")

output = model(**tokens)
docEmbeddings = output.hidden_states[-1] # last-layer token embeddings, shape (1, seq_len, hidden_size)
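Given last-layer token embeddings like the ones above, a single vector per document can be obtained by pooling over tokens, and candidate cases can then be ranked by cosine similarity to a query case. The sketch below uses mean pooling over non-padding tokens, which is one common choice; the helper names and the pooling strategy are illustrative assumptions, not necessarily the exact procedure used in the paper.

```python
import torch

def mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # hidden_states: (batch, seq_len, hidden); attention_mask: (batch, seq_len)
    # Average token embeddings, ignoring padding positions.
    mask = attention_mask.unsqueeze(-1).float()       # (batch, seq_len, 1)
    summed = (hidden_states * mask).sum(dim=1)        # sum over real tokens
    counts = mask.sum(dim=1).clamp(min=1e-9)          # number of real tokens
    return summed / counts                            # (batch, hidden)

def cosine_sim(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Cosine similarity along the embedding dimension.
    return torch.nn.functional.cosine_similarity(a, b, dim=-1)
```

For retrieval, one would compute a pooled vector for the query case and for each candidate case (e.g. `mean_pool(docEmbeddings, tokens["attention_mask"])`), then sort candidates by `cosine_sim` to the query.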

Citation

If you use the model, please cite the following conference paper.

  @inproceedings{ozturk23berturkLegal,
    author={\"{O}zt\"{u}rk, Ceyhun E. and \"{O}z\c{c}elik, {\c{S}}. Bar{\i}\c{s} and Ko\c{c}, Aykut},
    booktitle={2023 31st Signal Processing and Communications Applications Conference (SIU)}, 
    title={{A Transformer-Based Prior Legal Case Retrieval Method}}, 
    year={2023},
    pages={1--4}
  }