🇹🇷 RoBERTaTurkish

Model description

This is a Turkish RoBERTa base model pretrained on Turkish Wikipedia, Turkish OSCAR, and some news websites.

The final training corpus is 38 GB in size and contains 329,720,508 sentences.

At Turkcell, we trained the model for 2.5M steps on a machine with an Intel(R) Xeon(R) Gold 6230R CPU @ 2.10GHz, 256 GB of RAM, and 2 x GV100GL [Tesla V100 PCIe 32GB] GPUs.

Usage

Load the tokenizer and model with the transformers library:

from transformers import AutoTokenizer, AutoModelForMaskedLM
  
tokenizer = AutoTokenizer.from_pretrained("TURKCELL/roberta-base-turkish-uncased")
model = AutoModelForMaskedLM.from_pretrained("TURKCELL/roberta-base-turkish-uncased")
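
The checkpoint name ends in "uncased", so input text may need to be lowercased before tokenization. This card does not state whether the tokenizer does that on its own, so the following is only a sketch under the assumption that it does not. The helper name `turkish_lower` is hypothetical and is not part of transformers. It handles the two Turkish capitals that Python's built-in `str.lower()` does not map to their Turkish lowercase forms.

```python
def turkish_lower(text: str) -> str:
    """Lowercase text using Turkish casing rules for the letter I.

    Hypothetical helper: assumes the caller is responsible for lowercasing
    input before it reaches the tokenizer.
    """
    # Map the Turkish-specific capitals first. str.lower() alone would turn
    # "I" into "i" (instead of dotless "ı") and "İ" into "i" followed by a
    # combining dot.
    text = text.replace("I", "ı").replace("İ", "i")
    return text.lower()


lowered = turkish_lower("İki ülke arasında IŞIK")
print(lowered)
```

The lowered string can then be passed to the tokenizer loaded above, for example `tokenizer(turkish_lower(text), return_tensors="pt")`. The `<mask>` placeholder is already lowercase, so it passes through unchanged.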

Fill Mask Usage

from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="TURKCELL/roberta-base-turkish-uncased",
    tokenizer="TURKCELL/roberta-base-turkish-uncased"
)

fill_mask("iki ülke arasında <mask> başladı")

This returns the five highest-scoring completions:

[{'sequence': 'iki ülke arasında savaş başladı',
  'score': 0.3013845384120941,
  'token': 1359,
  'token_str': ' savaş'},
 {'sequence': 'iki ülke arasında müzakereler başladı',
  'score': 0.1058429479598999,
  'token': 30439,
  'token_str': ' müzakereler'},
 {'sequence': 'iki ülke arasında görüşmeler başladı',
  'score': 0.07718811184167862,
  'token': 4916,
  'token_str': ' görüşmeler'},
 {'sequence': 'iki ülke arasında kriz başladı',
  'score': 0.07174749672412872,
  'token': 3908,
  'token_str': ' kriz'},
 {'sequence': 'iki ülke arasında çatışmalar başladı',
  'score': 0.05678590387105942,
  'token': 19346,
  'token_str': ' çatışmalar'}]
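
The pipeline returns a list of dictionaries, which is often more detail than downstream code needs. As a minimal sketch, the hypothetical helper `summarize_predictions` below keeps only the predicted token and its score, strips the leading space seen in `token_str`, and drops candidates below a threshold. The threshold value is an arbitrary choice for illustration. The sample data is copied from the first three entries of the output above, so the block runs without downloading the model.

```python
def summarize_predictions(predictions, min_score=0.05):
    """Return (token, score) pairs from fill-mask output, highest score first.

    Hypothetical helper: entries scoring below min_score are dropped.
    """
    summary = []
    for item in predictions:
        if item["score"] >= min_score:
            # token_str carries a leading space in the sample output; strip it.
            summary.append((item["token_str"].strip(), round(item["score"], 3)))
    summary.sort(key=lambda pair: pair[1], reverse=True)
    return summary


# First three entries copied from the sample output above.
sample = [
    {"sequence": "iki ülke arasında savaş başladı",
     "score": 0.3013845384120941, "token": 1359, "token_str": " savaş"},
    {"sequence": "iki ülke arasında müzakereler başladı",
     "score": 0.1058429479598999, "token": 30439, "token_str": " müzakereler"},
    {"sequence": "iki ülke arasında görüşmeler başladı",
     "score": 0.07718811184167862, "token": 4916, "token_str": " görüşmeler"},
]

top = summarize_predictions(sample, min_score=0.1)
print(top)
```

With a live pipeline, the same helper can be applied directly to the return value of `fill_mask(...)`.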