metadata
license: mit
language:
  - tr
tags:
  - punctuation restoration
  - punctuation prediction
widget:
  - text: >-
      Türkiye toprakları üzerindeki ilk yerleşmeler Yontma Taş Devri'nde başlar
      Doğu Trakya'da Traklar olmak üzere Hititler Frigler Lidyalılar ve Dor
      istilası sonucu Yunanistan'dan kaçan Akalar tarafından kurulan İyon
      medeniyeti gibi çeşitli eski Anadolu medeniyetlerinin ardından Makedonya
      kralı Büyük İskender'in egemenliğiyle ve fetihleriyle birlikte Helenistik
      Dönem başladı

Transformer Based Punctuation Restoration Models for Turkish

Liked our work? Give us a ⭐ on GitHub!

Here you can find the BERT model used in the paper Transformer Based Punctuation Restoration for Turkish. The aim of this work is to correctly place a predefined set of punctuation marks in a given text. We present three pre-trained transformer models that predict periods (.), commas (,) and question marks (?) for the Turkish language.

Usage

Inference

The recommended way to use the models is via HuggingFace. You can run inference with a pre-trained model using the following code (the example loads the ConvBERT checkpoint):

from transformers import pipeline

pipe = pipeline(task="token-classification", model="uygarkurt/convbert-restore-punctuation-turkish")

sample_text = "Türkiye toprakları üzerindeki ilk yerleşmeler Yontma Taş Devri'nde başlar Doğu Trakya'da Traklar olmak üzere Hititler Frigler Lidyalılar ve Dor istilası sonucu Yunanistan'dan kaçan Akalar tarafından kurulan İyon medeniyeti gibi çeşitli eski Anadolu medeniyetlerinin ardından Makedonya kralı Büyük İskender'in egemenliğiyle ve fetihleriyle birlikte Helenistik Dönem başladı"

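# Each item in out is a dict describing one labeled token (predicted label, score and character offsets).
out = pipe(sample_text)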

To use a different pre-trained model, replace the model argument with one of the other models listed under Available Models.
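The pipeline returns token-level predictions rather than punctuated text, so a small post-processing step is needed to get a readable sentence. The following is a minimal sketch, not the authors' implementation. It assumes that a predicted label means the mark follows the word containing that token, and that the label names are PERIOD, COMMA and QUESTION; both are assumptions, so check `pipe.model.config.id2label` for the actual names. The helper `restore_punctuation` and the mapping `LABEL_TO_MARK` are hypothetical names introduced here.

```python
# Hypothetical mapping; the real label names should be read from
# pipe.model.config.id2label before using this.
LABEL_TO_MARK = {"PERIOD": ".", "COMMA": ",", "QUESTION": "?"}


def restore_punctuation(text, predictions, label_to_mark=LABEL_TO_MARK):
    """Insert punctuation marks into text based on token-level predictions.

    predictions is a list of dicts with at least the keys "entity" and "end",
    as returned by a token-classification pipeline without aggregation.
    """
    insertions = {}  # character offset of a word end -> mark to insert there
    for pred in predictions:
        mark = label_to_mark.get(pred["entity"])
        if mark is None:
            continue  # unknown label: leave the text untouched at this token
        # A prediction may sit on a sub-word piece, so move to the end of
        # the whole whitespace-delimited word.
        end = pred["end"]
        while end < len(text) and not text[end].isspace():
            end += 1
        # Keep the first mark seen for a word if several pieces are labeled.
        insertions.setdefault(end, mark)
    # Insert from right to left so earlier offsets stay valid.
    for end in sorted(insertions, reverse=True):
        text = text[:end] + insertions[end] + text[end:]
    return text
```

With the inference snippet above, this would be called as `restore_punctuation(sample_text, out)`. If the pipeline is created with an aggregation strategy, the label is stored under `entity_group` instead of `entity`, and the function would need to be adapted.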

Data

The dataset is provided in the data/ directory as train, validation and test splits.

The dataset is summarized below:

| Split | Total | Period (.) | Comma (,) | Question (?) |
|---|---|---|---|---|
| Train | 1471806 | 124817 | 98194 | 9816 |
| Validation | 180326 | 15306 | 11980 | 1199 |
| Test | 182487 | 15524 | 12242 | 1255 |
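To get a feel for how imbalanced the labels are, the counts above can be turned into per-split shares. This is a small sketch that assumes the Total column counts all tokens, including those followed by no punctuation mark; the source does not state this explicitly. The names `SPLITS` and `label_shares` are introduced here for illustration.

```python
# Counts copied from the dataset summary table.
SPLITS = {
    "train": {"total": 1471806, "period": 124817, "comma": 98194, "question": 9816},
    "validation": {"total": 180326, "period": 15306, "comma": 11980, "question": 1199},
    "test": {"total": 182487, "period": 15524, "comma": 12242, "question": 1255},
}


def label_shares(counts):
    """Return the fraction of tokens per label for one split."""
    total = counts["total"]
    shares = {mark: n / total for mark, n in counts.items() if mark != "total"}
    # Assumption: everything not counted as a mark carries no punctuation.
    shares["none"] = 1.0 - sum(shares.values())
    return shares


for split, counts in SPLITS.items():
    shares = label_shares(counts)
    print(split, {mark: f"{share:.2%}" for mark, share in shares.items()})
```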

Available Models

We experimented with BERT, ELECTRA and ConvBERT. The pre-trained models are available on HuggingFace:

BERT: https://huggingface.co/uygarkurt/bert-restore-punctuation-turkish
ELECTRA: https://huggingface.co/uygarkurt/electra-restore-punctuation-turkish
ConvBERT: https://huggingface.co/uygarkurt/convbert-restore-punctuation-turkish
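Switching between the three checkpoints only requires changing the model identifier. One way to keep this tidy is sketched below; the identifiers are taken from the URLs above, while `MODEL_IDS` and `load_punctuation_pipeline` are hypothetical names introduced for this example.

```python
# Hub identifiers taken from the model URLs listed above.
MODEL_IDS = {
    "bert": "uygarkurt/bert-restore-punctuation-turkish",
    "electra": "uygarkurt/electra-restore-punctuation-turkish",
    "convbert": "uygarkurt/convbert-restore-punctuation-turkish",
}


def load_punctuation_pipeline(name="bert"):
    """Build a token-classification pipeline for one of the three models."""
    if name not in MODEL_IDS:
        raise ValueError(f"unknown model {name!r}, expected one of {sorted(MODEL_IDS)}")
    # Imported lazily so the mapping can be inspected without transformers installed.
    from transformers import pipeline

    return pipeline(task="token-classification", model=MODEL_IDS[name])
```

Calling `load_punctuation_pipeline("electra")` downloads the ELECTRA checkpoint on first use and returns a pipeline that can be applied to text in the same way as in the Inference section.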

Results

Precision (P), recall (R) and F1 scores for each model and punctuation mark are summarized below.

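| Model | Period P | Period R | Period F1 | Comma P | Comma R | Comma F1 | Question P | Question R | Question F1 | Overall P | Overall R | Overall F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BERT | 0.972602 | 0.947504 | 0.959952 | 0.576145 | 0.700010 | 0.632066 | 0.927642 | 0.911342 | 0.919420 | 0.825506 | 0.852952 | 0.837146 |
| ELECTRA | 0.972602 | 0.948689 | 0.960497 | 0.576800 | 0.710208 | 0.636590 | 0.920325 | 0.921074 | 0.920699 | 0.823242 | 0.859990 | 0.839262 |
| ConvBERT | 0.972731 | 0.946791 | 0.959585 | 0.576964 | 0.708124 | 0.635851 | 0.922764 | 0.913849 | 0.918285 | 0.824153 | 0.856254 | 0.837907 |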
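For readers who want to relate the columns to each other, the sketch below recomputes F1 from precision and recall using the ELECTRA row. The figures in that row appear consistent with F1 being the harmonic mean of P and R, and with the overall scores being the unweighted mean over the three marks. This is an observation from the numbers, not something the source states, so treat the averaging scheme as an assumption.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_average(values):
    """Unweighted mean over classes (assumed to be how OVERALL is formed)."""
    values = list(values)
    return sum(values) / len(values)


# (precision, recall) pairs copied from the ELECTRA row of the table.
electra = {
    "period": (0.972602, 0.948689),
    "comma": (0.576800, 0.710208),
    "question": (0.920325, 0.921074),
}

per_mark_f1 = {mark: f1_score(p, r) for mark, (p, r) in electra.items()}
overall_f1 = macro_average(per_mark_f1.values())

for mark, score in per_mark_f1.items():
    print(f"{mark}: F1 = {score:.6f}")
print(f"overall: F1 = {overall_f1:.6f}")
```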

Citation

@INPROCEEDINGS{10286690,
    author={Kurt, Uygar and Çayır, Aykut},
    booktitle={2023 8th International Conference on Computer Science and Engineering (UBMK)}, 
    title={Transformer Based Punctuation Restoration for Turkish}, 
    year={2023},
    volume={},
    number={},
    pages={169-174},
    doi={10.1109/UBMK59864.2023.10286690}
}