stefan-it committed on
Commit
bd4d806
1 Parent(s): 6965263

readme: add initial version

Files changed (1)
  1. README.md +78 -0
README.md ADDED
---
language: tr
license: mit
datasets:
- allenai/c4
---

# 🇹🇷 Turkish ELECTRA model

<p align="center">
  <img alt="Logo provided by Merve Noyan" title="Awesome logo from Merve Noyan" src="https://raw.githubusercontent.com/stefan-it/turkish-bert/master/merve_logo.png">
</p>

[![DOI](https://zenodo.org/badge/237817454.svg)](https://zenodo.org/badge/latestdoi/237817454)

We present community-driven BERT, DistilBERT, ELECTRA and ConvBERT models for Turkish 🎉

Some of the datasets used for pretraining and evaluation were contributed by the
awesome Turkish NLP community, which also chose the name for the BERT model: BERTurk.

The logo was provided by [Merve Noyan](https://twitter.com/mervenoyann).

# Stats

We've also trained an ELECTRA (uncased) model on the recently released Turkish part of the
[multilingual C4 (mC4) corpus](https://github.com/allenai/allennlp/discussions/5265) from the AI2 team.

After filtering out documents with a broken encoding, the training corpus has a size of 242 GB, resulting
in 31,240,963,926 tokens.

We used the original 32k vocabulary (instead of creating a new one).

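The actual preprocessing code is not included in this README; the snippet below is a minimal, hypothetical sketch of what such an encoding filter could look like (the `has_valid_encoding` helper and the 1% replacement-character threshold are illustrative assumptions, not the pipeline that was used):

```python
# Hypothetical sketch of an encoding filter; NOT the actual mC4 preprocessing.
def has_valid_encoding(text: str, max_bad_ratio: float = 0.01) -> bool:
    """Reject documents that fail a strict UTF-8 round trip or are full of U+FFFD."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:  # e.g. lone surrogates left over from a broken decode
        return False
    # U+FFFD is the replacement character produced by lossy decoding.
    return text.count("\ufffd") / max(len(text), 1) <= max_bad_ratio


docs = ["Merhaba dünya!", "bozuk metin \ufffd\ufffd\ufffd\ufffd\ufffd"]
clean_docs = [doc for doc in docs if has_valid_encoding(doc)]
print(clean_docs)  # only the first document survives the filter
```
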
# mC4 ELECTRA

In addition to the ELEC**TR**A base cased model, we also trained an ELECTRA uncased model on the Turkish part of the mC4 corpus. We used a
sequence length of 512 for the full training time and trained the model for 1M steps on a v3-32 TPU.

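For reference, the published checkpoint's configuration and tokenizer can be inspected directly with 🤗/Transformers; a small sketch (the `dbmdz/...` model id assumes the model is hosted under the DBMDZ namespace described in the next section):

```python
from transformers import AutoConfig, AutoTokenizer

# Assumed Hub id under the DBMDZ namespace (see "Model usage" below).
model_name = "dbmdz/electra-base-turkish-mc4-uncased-discriminator"

config = AutoConfig.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Expected to reflect the 512 sequence length and the original 32k vocabulary mentioned above.
print(config.max_position_embeddings, tokenizer.vocab_size)
```
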
# Model usage

All trained models can be used from the [DBMDZ](https://github.com/dbmdz) Hugging Face [model hub page](https://huggingface.co/dbmdz)
using their model name.

Example usage with 🤗/Transformers:

```python
from transformers import AutoModel, AutoTokenizer

# Model name on the DBMDZ Hugging Face hub page linked above
model_name = "dbmdz/electra-base-turkish-mc4-uncased-discriminator"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
```
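
Since this checkpoint is the discriminator, it can also be loaded with `ElectraForPreTraining` to obtain per-token replaced-token-detection scores. A minimal sketch, assuming the same model id as above (the example sentence is arbitrary):

```python
import torch
from transformers import AutoTokenizer, ElectraForPreTraining

model_name = "dbmdz/electra-base-turkish-mc4-uncased-discriminator"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = ElectraForPreTraining.from_pretrained(model_name)

inputs = tokenizer("Ankara Türkiye'nin başkentidir.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # one real-valued score per token

# Positive scores mean the discriminator considers the token "replaced".
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
for token, score in zip(tokens, logits[0].tolist()):
    print(f"{token}\t{'replaced' if score > 0 else 'original'}")
```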

# Citation

You can use the following BibTeX entry for citation:

```bibtex
@software{stefan_schweter_2020_3770924,
  author    = {Stefan Schweter},
  title     = {BERTurk - BERT models for Turkish},
  month     = apr,
  year      = 2020,
  publisher = {Zenodo},
  version   = {1.0.0},
  doi       = {10.5281/zenodo.3770924},
  url       = {https://doi.org/10.5281/zenodo.3770924}
}
```

# Acknowledgments

Thanks to [Kemal Oflazer](http://www.andrew.cmu.edu/user/ko/) for providing us with
additional large corpora for Turkish. Many thanks to Reyyan Yeniterzi for providing
us with the Turkish NER dataset for evaluation.

We would like to thank [Merve Noyan](https://twitter.com/mervenoyann) for the
awesome logo!

Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC).
Thanks for providing access to the TFRC ❤️