---
language: tr
license: mit
---

🇹🇷 RoBERTaTurk

## Model description

This is a Turkish RoBERTa base model pretrained on Turkish Wikipedia, Turkish OSCAR, and text from several news websites. The final training corpus is 38 GB in size and contains 329,720,508 sentences.

Thanks to Turkcell, the model was trained for 2.5M steps on a machine with an Intel(R) Xeon(R) Gold 6230R CPU @ 2.10GHz, 256 GB RAM, and 2 x GV100GL [Tesla V100 PCIe 32GB] GPUs.

## Usage

Load the model and tokenizer with the transformers library:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("burakaytan/roberta-base-turkish-uncased")
model = AutoModelForMaskedLM.from_pretrained("burakaytan/roberta-base-turkish-uncased")
```

## Fill Mask Usage

```python
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="burakaytan/roberta-base-turkish-uncased",
    tokenizer="burakaytan/roberta-base-turkish-uncased"
)

fill_mask("iki ülke arasında <mask> başladı")

[{'sequence': 'iki ülke arasında savaş başladı',
  'score': 0.3013845384120941,
  'token': 1359,
  'token_str': ' savaş'},
 {'sequence': 'iki ülke arasında müzakereler başladı',
  'score': 0.1058429479598999,
  'token': 30439,
  'token_str': ' müzakereler'},
 {'sequence': 'iki ülke arasında görüşmeler başladı',
  'score': 0.07718811184167862,
  'token': 4916,
  'token_str': ' görüşmeler'},
 {'sequence': 'iki ülke arasında kriz başladı',
  'score': 0.07174749672412872,
  'token': 3908,
  'token_str': ' kriz'},
 {'sequence': 'iki ülke arasında çatışmalar başladı',
  'score': 0.05678590387105942,
  'token': 19346,
  'token_str': ' çatışmalar'}]
```

## Citation and Related Information

To cite this model:

```bibtex
@inproceedings{aytan2022comparison,
  title={Comparison of Transformer-Based Models Trained in Turkish and Different Languages on Turkish Natural Language Processing Problems},
  author={Aytan, Burak and Sakar, C Okan},
  booktitle={2022 30th Signal Processing and Communications Applications Conference (SIU)},
  pages={1--4},
  year={2022},
  organization={IEEE}
}
```
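
## Masked prediction without the pipeline

For reference, the fill-mask output shown above can also be reproduced without the `pipeline` helper. The snippet below is a minimal sketch (not part of the original card) that scores the `<mask>` position directly with PyTorch and prints the five most likely tokens:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("burakaytan/roberta-base-turkish-uncased")
model = AutoModelForMaskedLM.from_pretrained("burakaytan/roberta-base-turkish-uncased")

# Tokenize a sentence containing the model's mask token
inputs = tokenizer("iki ülke arasında <mask> başladı", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Find the <mask> position and take the 5 highest-probability tokens there
mask_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
probs = logits[0, mask_index].softmax(dim=-1)
top5 = torch.topk(probs, 5)

for score, token_id in zip(top5.values[0], top5.indices[0]):
    print(tokenizer.decode([int(token_id)]), float(score))
```

This should recover the same candidates as the pipeline example (e.g. "savaş", "müzakereler"), since the pipeline performs the same steps internally.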