julien-c (HF staff) committed
Commit fa6e0fa • 1 Parent(s): de8e871

Migrate model card from transformers-repo

Read announcement at https://discuss.huggingface.co/t/announcement-all-model-cards-will-be-migrated-to-hf-co-model-repos/2755
Original file history: https://github.com/huggingface/transformers/commits/master/model_cards/dbmdz/bert-base-turkish-uncased/README.md

Files changed (1)
  1. README.md +75 -0
README.md ADDED
@@ -0,0 +1,75 @@
+ ---
+ language: tr
+ license: mit
+ ---
+
+ # 🤗 + 📚 dbmdz Turkish BERT model
+
+ In this repository the MDZ Digital Library team (dbmdz) at the Bavarian State
+ Library open sources an uncased model for Turkish 🎉
+
+ # 🇹🇷 BERTurk
+
+ BERTurk is a community-driven uncased BERT model for Turkish.
+
+ Some of the datasets used for pretraining and evaluation were contributed by the
+ awesome Turkish NLP community, which also chose the model name: BERTurk.
+
+ ## Stats
+
+ The current version of the model is trained on a filtered and
+ sentence-segmented version of the Turkish [OSCAR corpus](https://traces1.inria.fr/oscar/),
+ a recent Wikipedia dump, various [OPUS corpora](http://opus.nlpl.eu/) and a
+ special corpus provided by [Kemal Oflazer](http://www.andrew.cmu.edu/user/ko/).
+
+ The final training corpus has a size of 35GB and 4,404,976,662 tokens.
+
+ Thanks to Google's TensorFlow Research Cloud (TFRC) we could train an uncased model
+ on a TPU v3-8 for 2M steps.
+
+ ## Model weights
+
+ Currently only PyTorch-[Transformers](https://github.com/huggingface/transformers)
+ compatible weights are available. If you need access to TensorFlow checkpoints,
+ please raise an issue!
+
+ | Model | Downloads
+ | --------------------------------- | ---------------------------------------------------------------------------------------------------------------
+ | `dbmdz/bert-base-turkish-uncased` | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-turkish-uncased/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-turkish-uncased/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-turkish-uncased/vocab.txt)
+
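The download links in the table all follow one pattern: base URL, model id, file name. A minimal sketch of that pattern is shown here; `download_urls` is a hypothetical helper that is not part of the model card, and whether the `cdn.huggingface.co` links still resolve has not been checked.

```python
from typing import Dict

# Values copied from the table above; the helper itself is illustrative only.
CDN_BASE = "https://cdn.huggingface.co"
MODEL_ID = "dbmdz/bert-base-turkish-uncased"
MODEL_FILES = ("config.json", "pytorch_model.bin", "vocab.txt")


def download_urls(model_id: str = MODEL_ID) -> Dict[str, str]:
    """Map each weight/config file name to its download URL for model_id."""
    return {name: f"{CDN_BASE}/{model_id}/{name}" for name in MODEL_FILES}


if __name__ == "__main__":
    for name, url in download_urls().items():
        print(f"{name}: {url}")
```

In normal use `from_pretrained` fetches these files itself, so the explicit URLs are mainly useful for manual or offline downloads.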
+ ## Usage
+
+ With Transformers >= 2.3 our BERTurk uncased model can be loaded as follows:
+
+ ```python
+ from transformers import AutoModel, AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-uncased")
+ model = AutoModel.from_pretrained("dbmdz/bert-base-turkish-uncased")
+ ```
+
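Once loaded, the model can be run on a sentence to obtain contextual embeddings. The following is a rough sketch rather than part of the original model card: `embed_sentence` is a hypothetical helper, the example sentence is arbitrary, and calling the tokenizer directly assumes a Transformers release newer than the 2.3 minimum mentioned above (3.0 or later), with PyTorch installed.

```python
def embed_sentence(text, model_name="dbmdz/bert-base-turkish-uncased"):
    """Return the final-layer hidden states for a single sentence."""
    # Imported lazily so that defining this helper does not require the libraries.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()  # disable dropout for inference

    # Assumption: the tokenizer is callable (Transformers >= 3.0).
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # The first output element holds the last hidden states,
    # shaped (batch, sequence_length, hidden_size).
    return outputs[0]


if __name__ == "__main__":
    hidden = embed_sentence("Merhaba dünya")
    print(hidden.shape)
```

The first call downloads the weights, so it needs network access; later calls reuse the local cache.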
+ ## Results
+
+ For results on PoS tagging or NER tasks, please refer to
+ [this repository](https://github.com/stefan-it/turkish-bert).
+
+ # Hugging Face model hub
+
+ All models are available on the [Hugging Face model hub](https://huggingface.co/dbmdz).
+
+ # Contact (Bugs, Feedback, Contribution and more)
+
+ For questions about our BERT models, just open an issue
+ [here](https://github.com/dbmdz/berts/issues/new) 🤗
+
+ # Acknowledgments
+
+ Thanks to [Kemal Oflazer](http://www.andrew.cmu.edu/user/ko/) for providing us
+ with additional large corpora for Turkish. Many thanks to Reyyan Yeniterzi for
+ providing us with the Turkish NER dataset for evaluation.
+
+ Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC).
+ Thanks for providing access to the TFRC ❤️
+
+ Thanks to the generous support from the [Hugging Face](https://huggingface.co/) team,
+ it is possible to download both cased and uncased models from their S3 storage 🤗