As the R&D team at Loodos, we release cased and uncased versions of the most recent language models for Turkish. More details about the pretrained models and their evaluations on downstream tasks can be found in our repo.
This is the discriminator of the ELECTRA-Base model, which has the same structure as BERT-Base, trained on an uncased Turkish dataset. This version has a vocabulary size of 64k, unlike the default 32k.
Using AutoModelWithLMHead and AutoTokenizer from Transformers, you can import the model as described below.
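from transformers import AutoTokenizer, AutoModelWithLMHead

# TextNormalization is the normalization module from the Loodos repo; import it from wherever you placed it.

tokenizer = AutoTokenizer.from_pretrained("loodos/electra-base-turkish-64k-uncased-discriminator", do_lower_case=False)
model = AutoModelWithLMHead.from_pretrained("loodos/electra-base-turkish-64k-uncased-discriminator")

# `text` is your raw input string. Lowercasing is done by the Turkish-aware normalizer, not by the tokenizer.
normalizer = TextNormalization()
normalized_text = normalizer.normalize(text, do_lower_case=True, is_turkish=True)

tokenizer.tokenize(normalized_text)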
Currently, Hugging Face's tokenizers (the ones written in Python) have a bug concerning the letters "ı, i, I, İ" and the non-ASCII Turkish-specific letters. There are two reasons.
1- The vocabulary and the SentencePiece model are created with NFC/NFKC normalization, but the tokenizer uses NFD/NFKD. NFD/NFKD normalization changes text that contains the Turkish characters I-ı, İ-i, Ç-ç, Ö-ö, Ş-ş, Ğ-ğ, Ü-ü. This causes wrong tokenization, wrong training and loss of information. Some tokens are never trained (such as "şanlıurfa", "öğün", "çocuk"). NFD/NFKD normalization is not suitable for Turkish.
2- Python's default string.lower() and string.upper() convert 'I' and 'İ' to 'i' (the latter followed by a combining dot above), and 'i' and 'ı' to 'I', respectively. However, in Turkish, 'I' and 'İ' are two different letters.
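Both problems can be reproduced with the Python standard library alone. The sketch below is illustrative and does not call the Transformers tokenizer. It assumes the tokenizer decomposes text with NFD and then drops combining marks, as BERT-style accent stripping typically does. The helper name strip_accents is hypothetical.

```python
import unicodedata

# Reason 1: NFD splits precomposed Turkish letters into a base letter
# plus a combining mark, so one code point becomes two.
for ch in "çöşğüİ":
    nfd = unicodedata.normalize("NFD", ch)
    print(repr(ch), "->", [hex(ord(c)) for c in nfd])


def strip_accents(text):
    """Hypothetical helper: NFD-decompose, then drop combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


# The cedilla, breve and diaeresis are lost; dotless 'ı' has no decomposition and survives.
stripped = strip_accents("şanlıurfa öğün çocuk")
print(stripped)

# Reason 2: Python's case mapping is locale-independent.
lower_dotted = "İ".lower()    # 'i' followed by U+0307 (combining dot above)
upper_dotless = "ı".upper()   # plain ASCII 'I'
print([hex(ord(c)) for c in lower_dotted], upper_dotless)
```

After stripping, "şanlıurfa" no longer matches a vocabulary entry built from NFC text, which is consistent with the untrained tokens described above.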
We opened an issue in Hugging Face's GitHub repo about this bug. Until it is fixed, in case you want to train your model with uncased data, we provide a simple text normalization module (TextNormalization() in the code snippet above) in our repo.
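For readers who want to see what Turkish-aware case folding involves, here is a minimal sketch. It is not the Loodos TextNormalization module, and the function names turkish_lower and turkish_upper are hypothetical. It assumes the only locale-specific rules needed are the I/ı and İ/i pairs, and that NFC is the target normalization form.

```python
import unicodedata


def turkish_lower(text):
    """Lowercase with Turkish casing rules: 'I' -> 'ı' and 'İ' -> 'i'."""
    # Compose first so that 'I' + U+0307 becomes the single letter 'İ' (U+0130).
    text = unicodedata.normalize("NFC", text)
    # Map the two Turkish-specific capitals before the generic lower().
    text = text.replace("İ", "i").replace("I", "ı")
    return text.lower()


def turkish_upper(text):
    """Uppercase with Turkish casing rules: 'i' -> 'İ' and 'ı' -> 'I'."""
    text = unicodedata.normalize("NFC", text)
    # Map the two Turkish-specific lowercase letters before the generic upper().
    text = text.replace("i", "İ").replace("ı", "I")
    return text.upper()


print(turkish_lower("ISPARTA İSTANBUL"))
print(turkish_upper("ısparta istanbul"))
```

Text lowercased this way can then be passed to a tokenizer loaded with do_lower_case=False, so the tokenizer does not apply its own lowercasing.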
You can contact us to ask a question, open an issue or give feedback via our GitHub repo.
Many thanks to the TFRC team for providing us with Cloud TPUs on TensorFlow Research Cloud to train our models.