pantheon committed on
Commit
da0f26f
1 Parent(s): 5606503

README added

Files changed (1)
  1. README.md +53 -0
README.md ADDED
@@ -0,0 +1,53 @@
+ ---
+ language: tr
+ ---
+
+ # Turkish Language Models with Huggingface's Transformers
+
+ As the R&D team at Loodos, we release cased and uncased versions of the most recent language models for Turkish. More details about the pretrained models and their evaluation on downstream tasks can be found in [our repo](https://github.com/Loodos/turkish-language-models).
+
+ # Turkish BERT-Base (uncased)
+
+ This is a BERT-Base model with 12 encoder layers and a hidden size of 768, trained on an uncased Turkish dataset.
+
+ ## Usage
+
+ Using AutoModel and AutoTokenizer from Transformers, you can load the model as shown below.
+
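+ ```python
+ from transformers import AutoModel, AutoTokenizer
+
+ # TextNormalization is not part of Transformers. It comes from the Loodos repo
+ # (see "Notes on Tokenizers" below) and must be importable in your environment.
+
+ # do_lower_case=False: lowercasing is done by the normalizer, not the tokenizer.
+ tokenizer = AutoTokenizer.from_pretrained("loodos/bert-base-turkish-uncased", do_lower_case=False)
+
+ model = AutoModel.from_pretrained("loodos/bert-base-turkish-uncased")
+
+ text = "İstanbul'da hava ılık."  # example input, replace with your own text
+ normalizer = TextNormalization()
+ normalized_text = normalizer.normalize(text, do_lower_case=True, is_turkish=True)
+
+ tokens = tokenizer.tokenize(normalized_text)
+ ```
+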
+ ### Notes on Tokenizers
+ Currently, Huggingface's tokenizers (the ones written in Python) have a bug concerning the letters "ı, i, I, İ" and the other non-ASCII Turkish-specific letters. There are two reasons.
+
+ 1- The vocabulary and the SentencePiece model are created with NFC/NFKC normalization, but the tokenizer uses NFD/NFKD. NFD/NFKD normalization changes text that contains the Turkish characters I-ı, İ-i, Ç-ç, Ö-ö, Ş-ş, Ğ-ğ, Ü-ü. This causes wrong tokenization, wrong training and loss of information. Some tokens are never trained (such as "şanlıurfa", "öğün", "çocuk"). NFD/NFKD normalization is not suitable for Turkish.
+
+ 2- Python's default ```str.lower()``` and ```str.upper()``` make the conversions
+
+ - "I" and "İ" to 'i' (for "İ", Python also appends a combining dot above, U+0307)
+ - 'i' and 'ı' to 'I'
+
+ respectively. However, in Turkish, 'I' and 'İ' are two different letters: the lowercase of 'I' is 'ı', and the uppercase of 'i' is 'İ'.
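
The two effects described above can be reproduced with the Python standard library alone. The following is a minimal sketch, not the tokenizer's actual code: `strip_accents_nfd` is a hypothetical helper that mimics the common BERT-style step of NFD-decomposing text and then dropping combining marks, which is assumed here to be the behavior at issue.

```python
import unicodedata


def strip_accents_nfd(text: str) -> str:
    """Hypothetical helper: decompose with NFD, then drop combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# Reason 1: NFD splits letters such as "ş" and "ğ" into a base letter plus a
# combining mark, so stripping the marks silently rewrites Turkish words.
for word in ["şanlıurfa", "öğün", "çocuk"]:
    print(word, "->", strip_accents_nfd(word))

# Reason 2: Python's built-in case mapping is not Turkish-aware.
print(repr("I".lower()))  # plain 'i' rather than Turkish 'ı'
print(repr("İ".lower()))  # 'i' followed by U+0307 (combining dot above)
print(repr("i".upper()))  # 'I' rather than Turkish 'İ'
print(repr("ı".upper()))  # 'I'
```

After this step "şanlıurfa" is no longer the string that was used when the vocabulary was built, which is how a token can end up never being trained.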
+
+ We opened an [issue](https://github.com/huggingface/transformers/issues/6680) in Huggingface's GitHub repo about this bug. Until it is fixed, if you want to train your model on uncased data, we provide a simple text normalization module (`TextNormalization()` in the code snippet above) in our [repo](https://github.com/Loodos/turkish-language-models).
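
To illustrate what such a normalization step has to do, here is a minimal sketch of Turkish-aware case conversion. It is not the `TextNormalization` module from the repo: `turkish_lower` and `turkish_upper` are hypothetical helpers written only for this example, under the assumption that the input should be NFC-composed and that the dotted and dotless i pairs need explicit handling.

```python
import unicodedata


def turkish_lower(text: str) -> str:
    """Lowercase text using the Turkish dotted/dotless i rules (hypothetical helper)."""
    # Compose first so that "I" + combining dot (U+0307) becomes a single "İ".
    text = unicodedata.normalize("NFC", text)
    # Handle the two Turkish-specific capitals before the generic mapping.
    text = text.replace("İ", "i").replace("I", "ı")
    return text.lower()


def turkish_upper(text: str) -> str:
    """Uppercase text using the Turkish dotted/dotless i rules (hypothetical helper)."""
    text = unicodedata.normalize("NFC", text)
    # Handle the two Turkish-specific lowercase letters before the generic mapping.
    text = text.replace("i", "İ").replace("ı", "I")
    return text.upper()


print(turkish_lower("ISPARTA İZMİR"))  # ısparta izmir
print(turkish_upper("ısparta izmir"))  # ISPARTA İZMİR
```

The replacements run before `lower()` and `upper()` on purpose: once the generic mapping has been applied, the Turkish 'I' has already been turned into a dotted 'i' and the distinction is lost.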
+
+
+ ## Details and Contact
+
+ You can contact us to ask a question, open an issue or give feedback via our GitHub [repo](https://github.com/Loodos/turkish-language-models).
+
+ ## Acknowledgments
+
+ Many thanks to the TFRC team for providing us with Cloud TPUs on the TensorFlow Research Cloud to train our models.
+
+