stefan-it committed
Commit 77776a3 (1 parent: e5b68db)

readme: add initial version with awesome logo from Merve!

Files changed (1): README.md (+80, -0)
README.md ADDED
@@ -0,0 +1,80 @@
---
language: tr
license: mit
datasets:
- allenai/c4
widget:
- text: "Dünyanın en eski şehirlerinden biri olan [MASK]"
---

# 🇹🇷 Turkish ELECTRA model

<p align="center">
  <img alt="Logo provided by Merve Noyan" title="Awesome logo from Merve Noyan" src="https://raw.githubusercontent.com/stefan-it/turkish-bert/updates/merve_logo.png">
</p>

[![DOI](https://zenodo.org/badge/237817454.svg)](https://zenodo.org/badge/latestdoi/237817454)

We present community-driven BERT, DistilBERT, ELECTRA and ConvBERT models for Turkish 🎉

Some of the datasets used for pretraining and evaluation were contributed by the
awesome Turkish NLP community, which also chose the name for the BERT model: BERTurk.

The logo was provided by [Merve Noyan](https://twitter.com/mervenoyann).

# Stats

We've also trained an ELECTRA (cased) model on the recently released Turkish part of the
[multilingual C4 (mC4) corpus](https://github.com/allenai/allennlp/discussions/5265) from the AI2 team.

After filtering out documents with a broken encoding, the training corpus has a size of 242GB,
resulting in 31,240,963,926 tokens.

We used the original 32k vocab (instead of creating a new one).

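A quick way to check this is to load the tokenizer of one of the checkpoints and inspect its size; a minimal sketch, using the small generator checkpoint from the usage section below:

```python
from transformers import AutoTokenizer

# The checkpoints reuse the original (cased) 32k vocabulary, so the
# tokenizer's vocab size should be around 32,000.
tokenizer = AutoTokenizer.from_pretrained("dbmdz/electra-small-turkish-mc4-cased-generator")
print(tokenizer.vocab_size)
```
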
# mC4 ELECTRA

In addition to the ELEC**TR**A base model, we also trained an ELECTRA model on the Turkish part of the mC4 corpus. We used a
sequence length of 512 over the full training time and trained the model for 1M steps on a v3-32 TPU.

# Model usage

All trained models can be used from the [DBMDZ](https://github.com/dbmdz) Hugging Face [model hub page](https://huggingface.co/dbmdz)
using their model name.

Example usage with 🤗/Transformers:

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbmdz/electra-small-turkish-mc4-cased-generator")

model = AutoModel.from_pretrained("dbmdz/electra-small-turkish-mc4-cased-generator")
```
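The generator checkpoint can also be used for masked-token prediction, e.g. to reproduce the widget example from the front matter locally; a minimal sketch with the `fill-mask` pipeline:

```python
from transformers import pipeline

# Masked-token prediction with the ELECTRA generator checkpoint.
fill_mask = pipeline("fill-mask", model="dbmdz/electra-small-turkish-mc4-cased-generator")

# Widget example from this model card's front matter.
for prediction in fill_mask("Dünyanın en eski şehirlerinden biri olan [MASK]"):
    print(prediction["token_str"], round(prediction["score"], 4))
```
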
# Citation

You can use the following BibTeX entry for citation:

```bibtex
@software{stefan_schweter_2020_3770924,
  author    = {Stefan Schweter},
  title     = {BERTurk - BERT models for Turkish},
  month     = apr,
  year      = 2020,
  publisher = {Zenodo},
  version   = {1.0.0},
  doi       = {10.5281/zenodo.3770924},
  url       = {https://doi.org/10.5281/zenodo.3770924}
}
```

# Acknowledgments

Thanks to [Kemal Oflazer](http://www.andrew.cmu.edu/user/ko/) for providing us with
additional large corpora for Turkish. Many thanks to Reyyan Yeniterzi for providing
us with the Turkish NER dataset for evaluation.

We would like to thank [Merve Noyan](https://twitter.com/mervenoyann) for the
awesome logo!

Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC).
Thanks for providing access to the TFRC ❤️