---
language: tr
---

# Turkish Language Models with Huggingface's Transformers

As the R&D Team at Loodos, we release cased and uncased versions of the most recent language models for Turkish. More details about the pretrained models and their evaluation on downstream tasks can be found [here (our repo)](https://github.com/Loodos/turkish-language-models).

# Turkish BERT-Base (uncased)

This is a BERT-Base model with 12 encoder layers and a hidden size of 768, trained on an uncased Turkish dataset.

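If you want to verify these dimensions yourself, they are exposed on the model configuration. A quick sanity check using the standard Transformers `AutoConfig` API (the expected values in the comments follow from the description above):

```python
from transformers import AutoConfig

# Download and inspect only the model configuration (no weights needed).
config = AutoConfig.from_pretrained("loodos/bert-base-turkish-uncased")

print(config.num_hidden_layers)  # 12 encoder layers
print(config.hidden_size)        # 768-dimensional hidden states
```
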
## Usage

Using `AutoModel` and `AutoTokenizer` from Transformers, you can import the model as described below. Note that `TextNormalization` in the snippet comes from our text normalization module; see "Notes on Tokenizers" below for why it is needed.

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("loodos/bert-base-turkish-uncased", do_lower_case=False)

model = AutoModel.from_pretrained("loodos/bert-base-turkish-uncased")

# TextNormalization is our Turkish-aware normalization module from
# https://github.com/Loodos/turkish-language-models (see the notes below).
normalizer = TextNormalization()

text = "Bu bir örnek cümledir."  # example input
normalized_text = normalizer.normalize(text, do_lower_case=True, is_turkish=True)

tokenizer.tokenize(normalized_text)
```
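
From here, the usual Transformers workflow applies. A minimal sketch (assuming the PyTorch backend) that runs the normalized text through the model and retrieves the contextual embeddings:

```python
import torch

# Encode the normalized text and run a forward pass.
inputs = tokenizer(normalized_text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional vector per token, including special tokens.
print(outputs.last_hidden_state.shape)  # torch.Size([1, seq_len, 768])
```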

### Notes on Tokenizers

Currently, Huggingface's tokenizers (which are written in Python) have a bug affecting the letters "ı, i, I, İ" and the other non-ASCII Turkish-specific letters. There are two reasons.

1- The vocabulary and sentence-piece model were created with NFC/NFKC normalization, but the tokenizer uses NFD/NFKD. NFD/NFKD normalization changes text that contains the Turkish characters I-ı, İ-i, Ç-ç, Ö-ö, Ş-ş, Ğ-ğ and Ü-ü. This causes wrong tokenization, wrong training and loss of information: some tokens (like "şanlıurfa", "öğün" and "çocuk") are never trained. NFD/NFKD normalization is not suitable for Turkish; the snippet below shows the corruption concretely.

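A small, self-contained demonstration of this decomposition using Python's standard `unicodedata` module (not part of the original usage snippet above):

```python
import unicodedata

for word in ("İstanbul", "çocuk"):
    nfd = unicodedata.normalize("NFD", word)
    # NFD splits "İ" into "I" + combining dot above (U+0307) and "ç" into
    # "c" + combining cedilla (U+0327), so the decomposed strings no longer
    # match the NFC-normalized vocabulary entries.
    print(word, word == nfd, [hex(ord(c)) for c in nfd[:2]])
```
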
2- Python's default `str.lower()` and `str.upper()` convert

- "I" and "İ" to "i"
- "i" and "ı" to "I"

respectively. However, in Turkish, "I" and "İ" are two different letters: "i" uppercases to "İ" and "ı" uppercases to "I". The snippet below illustrates the mismatch.

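For example (plain Python, no extra dependencies):

```python
# Python's locale-unaware casing loses the dotted/dotless distinction.
print("iç".upper())   # 'IÇ'  -> wrong for Turkish, which expects 'İÇ'
print("İZ".lower())   # 'İ' lowercases to 'i' + U+0307, not Turkish 'i'
print("ISI".lower())  # 'isi' -> Turkish expects 'ısı'
```
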
We opened an [issue](https://github.com/huggingface/transformers/issues/6680) in Huggingface's GitHub repo about this bug. Until it is fixed, if you want to train your model with uncased data, we provide a simple text normalization module (`TextNormalization()` in the code snippet above) in our [repo](https://github.com/Loodos/turkish-language-models).

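The essence of such a module is to handle the Turkish case pairs explicitly before falling back to standard lowercasing. A minimal sketch of the idea (an illustration only, not the actual implementation from our repo):

```python
def turkish_lower(text: str) -> str:
    """Lowercase text using Turkish casing rules for I/İ."""
    # Map the two problematic uppercase letters by hand, then lowercase.
    return text.replace("İ", "i").replace("I", "ı").lower()

print(turkish_lower("DİYARBAKIR"))  # 'diyarbakır'
```
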
## Details and Contact

You can contact us to ask a question, open an issue, or give feedback via our GitHub [repo](https://github.com/Loodos/turkish-language-models).

## Acknowledgments

Many thanks to the TFRC Team for providing us with cloud TPUs on the TensorFlow Research Cloud to train our models.