---
widget:
- text: "gelirken bir litre [MASK] aldım."
  example_title: ürün
pipeline_tag: fill-mask
tags:
- Turkish
- turkish
license: mit
language:
- tr
---

# turkish-tiny-bert-uncased

This is a Turkish Tiny uncased BERT model, developed to fill the gap for small-sized BERT models for Turkish. Since the model is uncased, it does not distinguish between "turkish" and "Turkish".

#### ⚠ Uncased use requires manual lowercase conversion

 
**Don't** use the `do_lower_case=True` flag with the tokenizer. Instead, convert your text to lowercase yourself as follows:
```python
text.replace("I", "ı").lower()
```
This is due to a [known issue](https://github.com/huggingface/transformers/issues/6680) with the tokenizer.
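
For example, a minimal sketch of this preprocessing could look like the following (the `turkish_lower` helper name and the sample sentences are illustrative, not part of the original card); note that the `[MASK]` placeholder itself should stay uppercase so the tokenizer still recognizes it:
```python
# A minimal sketch of the recommended lowercasing; only
# text.replace("I", "ı").lower() comes from this card, the rest is illustrative.
def turkish_lower(text: str) -> str:
    # Python's str.lower() turns "I" into "i", but in Turkish "I" should become
    # dotless "ı", so replace it before lowercasing (as recommended above).
    return text.replace("I", "ı").lower()

print(turkish_lower("KAPI AÇIK KALDI"))  # -> "kapı açık kaldı"

# When building a fill-mask query, lowercase the words but keep "[MASK]" as-is:
sentence = turkish_lower("Gelirken bir litre") + " [MASK] " + turkish_lower("aldım.")
# -> "gelirken bir litre [MASK] aldım."
```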

Be aware that this model may exhibit biased predictions, as it was trained primarily on crawled data, which can inherently contain various biases.

Other relevant information can be found in the [paper](https://arxiv.org/abs/2307.14134). 


## Example Usage
```python
from transformers import AutoTokenizer, BertForMaskedLM, pipeline

model = BertForMaskedLM.from_pretrained("ytu-ce-cosmos/turkish-tiny-bert-uncased")
# or
# model = BertForMaskedLM.from_pretrained("ytu-ce-cosmos/turkish-tiny-bert-uncased", from_tf = True)

tokenizer = AutoTokenizer.from_pretrained("ytu-ce-cosmos/turkish-tiny-bert-uncased")

unmasker = pipeline('fill-mask', model=model, tokenizer=tokenizer)
unmasker("gelirken bir litre [MASK] aldım.")
# [{'score': 0.202457457780838,
#   'token': 2417,
#   'token_str': 'su',
#   'sequence': 'gelirken bir litre su aldım.'},
#  {'score': 0.09290537238121033,
#   'token': 11818,
#   'token_str': 'benzin',
#   'sequence': 'gelirken bir litre benzin aldım.'},
#  {'score': 0.07785643637180328,
#   'token': 2026,
#   'token_str': '##den',
#   'sequence': 'gelirken bir litreden aldım.'},
#  {'score': 0.06889808923006058,
#   'token': 2299,
#   'token_str': '##yi',
#   'sequence': 'gelirken bir litreyi aldım.'},
#  {'score': 0.03152570128440857,
#   'token': 2647,
#   'token_str': '##ye',
#   'sequence': 'gelirken bir litreye aldım.'}]
```


# Acknowledgments
- Research supported with Cloud TPUs from [Google's TensorFlow Research Cloud](https://sites.research.google/trc/about/) (TFRC). Thanks for providing access to the TFRC ❤️
- Thanks to the generous support from the Hugging Face team, it is possible to download models from their S3 storage 🤗

# Citations
```bibtex
@article{kesgin2023developing,
  title={Developing and Evaluating Tiny to Medium-Sized Turkish BERT Models},
  author={Kesgin, Himmet Toprak and Yuce, Muzaffer Kaan and Amasyali, Mehmet Fatih},
  journal={arXiv preprint arXiv:2307.14134},
  year={2023}
}
```

# License

MIT