Turkish Language Models with Huggingface's Transformers

As the R&D team at Loodos, we release cased and uncased versions of the most recent language models for Turkish. More details about the pretrained models and their evaluations on downstream tasks can be found in our repo.

Turkish ELECTRA-Base-discriminator (uncased/64k)

This is the discriminator of an ELECTRA-Base model (the same architecture as BERT-Base) trained on an uncased Turkish dataset. This version has a vocabulary of 64k, instead of the default 32k.

Usage

Using AutoModelWithLMHead and AutoTokenizer from Transformers, you can load the model as shown below.

from transformers import AutoTokenizer, AutoModelWithLMHead

# do_lower_case=False: lowercasing is done by the Turkish-aware normalizer
# below rather than by the tokenizer (see "Notes on Tokenizers").
tokenizer = AutoTokenizer.from_pretrained("loodos/electra-base-turkish-64k-uncased-discriminator", do_lower_case=False)

model = AutoModelWithLMHead.from_pretrained("loodos/electra-base-turkish-64k-uncased-discriminator")

text = "Ankara'dan İstanbul'a gittim."  # example input

# TextNormalization is the text normalization module provided in our repo.
normalizer = TextNormalization()
normalized_text = normalizer.normalize(text, do_lower_case=True, is_turkish=True)

tokenizer.tokenize(normalized_text)
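
Since this checkpoint is an ELECTRA discriminator, it can also be loaded with Transformers' ElectraForPreTraining class to get the per-token replaced-token-detection scores the discriminator was trained to produce. A minimal sketch, assuming an illustrative input sentence and a simple logit > 0 decision rule:

import torch
from transformers import AutoTokenizer, ElectraForPreTraining

tokenizer = AutoTokenizer.from_pretrained("loodos/electra-base-turkish-64k-uncased-discriminator", do_lower_case=False)
model = ElectraForPreTraining.from_pretrained("loodos/electra-base-turkish-64k-uncased-discriminator")

inputs = tokenizer("merhaba dünya", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # one logit per input token

# A positive logit means the discriminator judges the token "replaced".
flags = (logits > 0).long()[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print(list(zip(tokens, flags)))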

Notes on Tokenizers

Currently, Hugging Face's tokenizers (the ones written in Python) have a bug concerning the letters "ı, i, I, İ" and other non-ASCII Turkish-specific letters. There are two reasons.

1- The vocabulary and sentence piece model were created with NFC/NFKC normalization, but the tokenizer uses NFD/NFKD. NFD/NFKD normalization changes text containing the Turkish characters I-ı, İ-i, Ç-ç, Ö-ö, Ş-ş, Ğ-ğ and Ü-ü, which causes wrong tokenization, wrong training and loss of information: some tokens (such as "şanlıurfa", "öğün" and "çocuk") are never trained. NFD/NFKD normalization is not appropriate for Turkish.
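
To see this concretely, the standard library's unicodedata module shows NFD splitting a Turkish character into a base letter plus a combining mark:

import unicodedata

word = "şanlıurfa"
nfd = unicodedata.normalize("NFD", word)
print(word == nfd)          # False: "ş" decomposes into "s" + U+0327 (combining cedilla)
print(len(word), len(nfd))  # 9 10 -- the decomposed form no longer matches the vocabulary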

2- Python's default string.lower() and string.upper() make the conversions

  • "I" and "İ" to "i"
  • "i" and "ı" to "I"

respectively. However, in Turkish, "I" and "İ" are two different letters ("I" lowercases to "ı" and "i" uppercases to "İ"), as the snippet below shows.
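
For example, with Python's built-in casing:

print("ISPARTA".lower())   # "isparta" -- Turkish expects "ısparta"
print("istanbul".upper())  # "ISTANBUL" -- Turkish expects "İSTANBUL"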

We opened an issue about this bug in Hugging Face's GitHub repo. Until it is fixed, if you want to train your model with uncased data, we provide a simple text normalization module (TextNormalization() in the code snippet above) in our repo.
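
A minimal sketch of the kind of Turkish-aware lowercasing such a module has to perform (the turkish_lower helper is hypothetical; the actual TextNormalization module lives in our repo and does more than this):

def turkish_lower(text: str) -> str:
    # Map the two Turkish-specific capitals before generic lowercasing:
    # dotted "İ" -> "i", dotless "I" -> "ı".
    return text.replace("İ", "i").replace("I", "ı").lower()

print(turkish_lower("İSTANBUL"))  # "istanbul"
print(turkish_lower("ISPARTA"))   # "ısparta"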

Details and Contact

You can contact us to ask a question, open an issue or give feedback via our GitHub repo.

Acknowledgments

Many thanks to the TFRC team for providing us with Cloud TPUs on the TensorFlow Research Cloud to train our models.