🇹🇷 RoBERTaTurk

Model description

This is a Turkish RoBERTa base model pretrained on Turkish Wikipedia, Turkish OSCAR, and some news websites.

The final training corpus is 38 GB in size and contains 329,720,508 sentences.

Thanks to Turkcell, we were able to train the model for 2.5M steps on a machine with an Intel(R) Xeon(R) Gold 6230R CPU @ 2.10GHz, 256 GB of RAM, and 2 x GV100GL [Tesla V100 PCIe 32GB] GPUs.


Load the model and tokenizer with the transformers library:

from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("burakaytan/roberta-base-turkish-uncased")
model = AutoModelForMaskedLM.from_pretrained("burakaytan/roberta-base-turkish-uncased")

Fill Mask Usage

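from transformers import pipeline

# Build a fill-mask pipeline from the same checkpoint loaded above.
fill_mask = pipeline(
    "fill-mask",
    model="burakaytan/roberta-base-turkish-uncased",
)

fill_mask("iki ülke arasında <mask> başladı")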

[{'sequence': 'iki ülke arasında savaş başladı',
  'score': 0.3013845384120941,
  'token': 1359,
  'token_str': ' savaş'},
 {'sequence': 'iki ülke arasında müzakereler başladı',
  'score': 0.1058429479598999,
  'token': 30439,
  'token_str': ' müzakereler'},
 {'sequence': 'iki ülke arasında görüşmeler başladı',
  'score': 0.07718811184167862,
  'token': 4916,
  'token_str': ' görüşmeler'},
 {'sequence': 'iki ülke arasında kriz başladı',
  'score': 0.07174749672412872,
  'token': 3908,
  'token_str': ' kriz'},
 {'sequence': 'iki ülke arasında çatışmalar başladı',
  'score': 0.05678590387105942,
  'token': 19346,
  'token_str': ' çatışmalar'}]
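
The pipeline returns a list of dictionaries, one per candidate, sorted by score. As a minimal sketch of how this output might be post-processed, the snippet below works on the result shown above (copied in by hand, so it runs without downloading the model). The helper name `summarize` is hypothetical and not part of transformers. It strips the leading space from each `token_str`, which most likely comes from the RoBERTa-style tokenizer attaching the preceding space to the token, and adds up the scores to show how much probability mass the listed candidates cover.

```python
# Output copied from the fill-mask example above; no model download is needed.
predictions = [
    {"sequence": "iki ülke arasında savaş başladı",
     "score": 0.3013845384120941, "token": 1359, "token_str": " savaş"},
    {"sequence": "iki ülke arasında müzakereler başladı",
     "score": 0.1058429479598999, "token": 30439, "token_str": " müzakereler"},
    {"sequence": "iki ülke arasında görüşmeler başladı",
     "score": 0.07718811184167862, "token": 4916, "token_str": " görüşmeler"},
    {"sequence": "iki ülke arasında kriz başladı",
     "score": 0.07174749672412872, "token": 3908, "token_str": " kriz"},
    {"sequence": "iki ülke arasında çatışmalar başladı",
     "score": 0.05678590387105942, "token": 19346, "token_str": " çatışmalar"},
]


def summarize(predictions):
    """Hypothetical helper: return (word, score) pairs and the summed score.

    The leading space in each token_str is stripped so the words can be
    compared or displayed on their own.
    """
    words = [(p["token_str"].strip(), p["score"]) for p in predictions]
    covered = sum(score for _, score in words)
    return words, covered


words, covered = summarize(predictions)
for word, score in words:
    print(f"{word:<12} {score:.3f}")
print(f"probability mass covered by the top {len(words)}: {covered:.3f}")
```

For this sentence, the five candidates together account for roughly 61% of the probability mass, so the remaining 39% is spread over tokens that are not shown.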

Citation and Related Information

To cite this model:

Aytan, Burak and Sakar, C Okan. "Comparison of Transformer-Based Models Trained in Turkish and Different Languages on Turkish Natural Language Processing Problems." In 2022 30th Signal Processing and Communications Applications Conference (SIU).