🇹🇷 RoBERTaTurk-Small-Clean

Model description

This is a small, uncased Turkish RoBERTa model trained on a clean dataset free of typos. The training data comes from Turkish Wikipedia, Turkish OSCAR, and news websites. We started with 38 GB of text and removed every sentence that contained mistakes, leaving 20 GB of good-quality data for training. As a result, the model is well suited to Turkish text that does not contain errors.

The model is smaller than the standard RoBERTa base model: it has 8 layers instead of 12, which makes it faster and lighter to run while remaining strong at understanding Turkish.
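The layer count can be checked against the published configuration. The following is a minimal sketch, assuming the model's config exposes the standard `num_hidden_layers` field used by RoBERTa configs in transformers; the helper names `layer_summary` and `load_config` are hypothetical and not part of this model's repository.

```python
def layer_summary(config, reference_layers=12):
    """Compare a config's encoder depth with a reference depth.

    reference_layers defaults to 12, the depth this card compares against.
    """
    layers = config.num_hidden_layers
    ratio = layers / reference_layers
    return f"{layers} layers ({ratio:.0%} of the {reference_layers}-layer reference)"


def load_config(model_id="burakaytan/roberta-small-turkish-clean-uncased"):
    """Fetch the model configuration from the Hugging Face Hub (needs network access)."""
    # Imported lazily so layer_summary stays usable without transformers installed.
    from transformers import AutoConfig

    return AutoConfig.from_pretrained(model_id)


# Example (requires network access and the transformers package):
# print(layer_summary(load_config()))
```

Reading only the configuration avoids downloading the model weights when all that is needed is the architecture's size.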

Thanks to Turkcell, we were able to train the model for 1.5M steps on a machine with an Intel(R) Xeon(R) Gold 6230R CPU @ 2.10GHz, 256GB RAM, and 2 x GV100GL [Tesla V100 PCIe 32GB] GPUs.

Usage

Load the tokenizer and model with the transformers library:

from transformers import AutoTokenizer, AutoModelForMaskedLM
  
tokenizer = AutoTokenizer.from_pretrained("burakaytan/roberta-small-turkish-clean-uncased")
model = AutoModelForMaskedLM.from_pretrained("burakaytan/roberta-small-turkish-clean-uncased")

Fill Mask Usage

from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="burakaytan/roberta-small-turkish-clean-uncased",
    tokenizer="burakaytan/roberta-small-turkish-clean-uncased"
)

fill_mask("iki ülke arasında <mask> başladı")

Output:

[{'sequence': 'iki ülke arasında savaş başladı',
  'score': 0.14830906689167023,
  'token': 1745,
  'token_str': ' savaş'},
 {'sequence': 'iki ülke arasında çatışmalar başladı',
  'score': 0.1442396193742752,
  'token': 18223,
  'token_str': ' çatışmalar'},
 {'sequence': 'iki ülke arasında gerginlik başladı',
  'score': 0.12025047093629837,
  'token': 13638,
  'token_str': ' gerginlik'},
 {'sequence': 'iki ülke arasında çatışma başladı',
  'score': 0.0615813322365284,
  'token': 5452,
  'token_str': ' çatışma'},
 {'sequence': 'iki ülke arasında görüşmeler başladı',
  'score': 0.04512731358408928,
  'token': 4736,
  'token_str': ' görüşmeler'}]
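The pipeline returns a list of dictionaries, and each `token_str` carries a leading space. A short sketch of post-processing that output is shown here, using the predictions above as sample data so it runs without downloading the model. The `summarize` helper and its `min_score` threshold of 0.1 are illustrative assumptions, not part of the transformers API.

```python
# Sample predictions copied from the fill-mask output above.
predictions = [
    {"sequence": "iki ülke arasında savaş başladı",
     "score": 0.14830906689167023, "token": 1745, "token_str": " savaş"},
    {"sequence": "iki ülke arasında çatışmalar başladı",
     "score": 0.1442396193742752, "token": 18223, "token_str": " çatışmalar"},
    {"sequence": "iki ülke arasında gerginlik başladı",
     "score": 0.12025047093629837, "token": 13638, "token_str": " gerginlik"},
    {"sequence": "iki ülke arasında çatışma başladı",
     "score": 0.0615813322365284, "token": 5452, "token_str": " çatışma"},
    {"sequence": "iki ülke arasında görüşmeler başladı",
     "score": 0.04512731358408928, "token": 4736, "token_str": " görüşmeler"},
]


def summarize(predictions, min_score=0.1):
    """Return (token, score) pairs scoring at least min_score, best first."""
    kept = [p for p in predictions if p["score"] >= min_score]
    kept.sort(key=lambda p: p["score"], reverse=True)
    # Strip the leading space that the tokenizer attaches to each token string.
    return [(p["token_str"].strip(), p["score"]) for p in kept]


for token, score in summarize(predictions):
    print(f"{token}: {score:.3f}")
```

The same function can be applied directly to the return value of `fill_mask(...)`, since it relies only on the `score` and `token_str` keys shown in the output.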

Citation and Related Information

To cite this model:


@article{aytan2023deep,
  title={Deep learning-based Turkish spelling error detection with a multi-class false positive reduction model},
  author={AYTAN, BURAK and {\c{S}}AKAR, CEMAL OKAN},
  journal={Turkish Journal of Electrical Engineering and Computer Sciences},
  volume={31},
  number={3},
  pages={581--595},
  year={2023}
}