---
license: cc-by-nc-sa-4.0
language:
- tr
pipeline_tag: token-classification
tags:
- legal
---

# NER Model for Legal Texts

Released in January 2024, this is a Turkish BERT language model with an **optimized BERT architecture**, pretrained from scratch on a 2 GB Turkish legal corpus. The corpus was sourced from legal-related thesis documents available in the Higher Education Board National Thesis Center (YÖKTEZ). The model has been fine-tuned for Named Entity Recognition (NER) on human-annotated datasets provided by **NewMind**, a legal tech company in Istanbul, Turkey.

In our paper, we describe the training procedure and show that the model outperforms previous approaches.

---

## Overview

- **Preprint Paper**: [https://arxiv.org/abs/2407.00648](https://arxiv.org/abs/2407.00648)
- **Architecture**: Optimized BERT Base
- **Language**: Turkish
- **Supported Labels**:
  - `Person`
  - `Law`
  - `Publication`
  - `Government`
  - `Corporation`
  - `Other`
  - `Project`
  - `Money`
  - `Date`
  - `Location`
  - `Court`

**Model Name**: LegalTurk Optimized BERT

---

## How to Use

### Use a pipeline as a high-level helper

```python
from transformers import pipeline

# Load the token-classification pipeline;
# aggregation_strategy="simple" merges subword tokens into entity spans
model = pipeline(
    "ner",
    model="farnazzeidi/ner-legalturk-bert-model",
    aggregation_strategy="simple",
)

# Input text (roughly: "Here, one must note the distinction between
# the Notification Law and the VUK regulation.")
text = "Burada, Tebligat Kanunu ile VUK düzenlemesi ayrımına dikkat etmek gerekir."

# Get predictions
predictions = model(text)
print(predictions)
```

### Load model directly

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("farnazzeidi/ner-legalturk-bert-model")
model = AutoModelForTokenClassification.from_pretrained("farnazzeidi/ner-legalturk-bert-model")

text = "Burada, Tebligat Kanunu ile VUK düzenlemesi ayrımına dikkat etmek gerekir."

inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():  # inference only; no gradients needed
    outputs = model(**inputs)

# Map each non-special token to its predicted label
label_ids = torch.argmax(outputs.logits, dim=-1)[0]
predictions = [
    (token, model.config.id2label[label_id.item()])
    for token, label_id in zip(
        tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]), label_ids
    )
    if token not in tokenizer.all_special_tokens
]
print(predictions)
```
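The two snippets differ in granularity: with `aggregation_strategy="simple"`, the pipeline merges WordPiece subwords into entity spans and returns dicts with `entity_group`, `score`, `word`, `start`, and `end` keys, while the direct approach yields raw subword tokens labeled via `model.config.id2label`. As a minimal post-processing sketch (not part of the original card; the `0.80` threshold is an arbitrary example value, not a recommendation from the authors), you can filter the pipeline output by confidence:

```python
from transformers import pipeline

ner = pipeline(
    "ner",
    model="farnazzeidi/ner-legalturk-bert-model",
    aggregation_strategy="simple",
)

text = "Burada, Tebligat Kanunu ile VUK düzenlemesi ayrımına dikkat etmek gerekir."

# Keep only entities predicted with high confidence;
# 0.80 is an assumed threshold — tune it for your use case.
entities = [
    (p["word"], p["entity_group"], round(float(p["score"]), 3))
    for p in ner(text)
    if p["score"] >= 0.80
]
print(entities)
```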
---

## Authors

Farnaz Zeidi, Mehmet Fatih Amasyali, Çigdem Erol

---

## License

This model is shared under the [CC BY-NC-SA 4.0 License](https://creativecommons.org/licenses/by-nc-sa/4.0/deed.en). You are free to use, share, and adapt the model for non-commercial purposes, provided that you give appropriate credit to the authors. For commercial use, please contact [zeidi.uni@gmail.com](mailto:zeidi.uni@gmail.com).