---
language: tr
license: mit
---

# 🤗 + 📚 dbmdz Turkish BERT model

In this repository the MDZ Digital Library team (dbmdz) at the Bavarian State
Library open sources an uncased model for Turkish 🎉

# 🇹🇷 BERTurk

BERTurk is a community-driven uncased BERT model for Turkish.

Some of the datasets used for pretraining and evaluation were contributed by the
awesome Turkish NLP community, which also chose the model name: BERTurk.

## Stats

The current version of the model is trained on a filtered and
sentence-segmented version of the Turkish [OSCAR corpus](https://traces1.inria.fr/oscar/),
a recent Wikipedia dump, various [OPUS corpora](http://opus.nlpl.eu/) and a
special corpus provided by [Kemal Oflazer](http://www.andrew.cmu.edu/user/ko/).

The final training corpus has a size of 35GB and 4,404,976,662 tokens.
26
+
27
+ Thanks to Google's TensorFlow Research Cloud (TFRC) we could train an uncased model
28
+ on a TPU v3-8 for 2M steps.
29
+
30
+ For this model we use a vocab size of 128k.

## Model weights

Currently only PyTorch-[Transformers](https://github.com/huggingface/transformers)-compatible
weights are available. If you need access to TensorFlow checkpoints,
please raise an issue!

| Model                                  | Downloads |
| -------------------------------------- | --------- |
| `dbmdz/bert-base-turkish-128k-uncased` | [`config.json`](https://cdn.huggingface.co/dbmdz/bert-base-turkish-128k-uncased/config.json) • [`pytorch_model.bin`](https://cdn.huggingface.co/dbmdz/bert-base-turkish-128k-uncased/pytorch_model.bin) • [`vocab.txt`](https://cdn.huggingface.co/dbmdz/bert-base-turkish-128k-uncased/vocab.txt) |
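As an alternative to the direct download links above, individual files can also be fetched programmatically; a minimal sketch using the `huggingface_hub` library (assuming it is installed and the repository is reachable):

```python
from huggingface_hub import hf_hub_download

# Download the model's config file into the local Hugging Face cache
# and return the local path to it.
config_path = hf_hub_download(
    repo_id="dbmdz/bert-base-turkish-128k-uncased",
    filename="config.json",
)
print(config_path)
```

Downloaded files are cached locally, so repeated calls do not re-download them.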

## Usage

With Transformers >= 2.3 our BERTurk uncased model can be loaded as follows:

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-128k-uncased")
model = AutoModel.from_pretrained("dbmdz/bert-base-turkish-128k-uncased")
```
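The loaded model returns per-token hidden states; a common way to turn these into a single sentence vector is masked mean pooling over `last_hidden_state`. A minimal sketch of that pooling step, shown on dummy tensors so it runs without downloading the model (the shapes and names here are illustrative assumptions, not part of the model card):

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # Zero out padding positions, then average over the real tokens only.
    mask = attention_mask.unsqueeze(-1).float()     # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)  # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)        # (batch, 1)
    return summed / counts

# Dummy stand-ins for a model output: batch of 2, 8 tokens, hidden size 768.
hidden = torch.randn(2, 8, 768)
mask = torch.tensor([[1, 1, 1, 1, 1, 0, 0, 0],
                     [1, 1, 1, 0, 0, 0, 0, 0]])
emb = mean_pool(hidden, mask)
print(emb.shape)  # torch.Size([2, 768])
```

With the real model, `last_hidden_state` and `attention_mask` would come from `model(**tokenizer(text, return_tensors="pt"))` and the tokenizer output respectively.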

## Results

For results on PoS tagging or NER tasks, please refer to
[this repository](https://github.com/stefan-it/turkish-bert).

# Hugging Face model hub

All models are available on the [Hugging Face model hub](https://huggingface.co/dbmdz).

# Contact (Bugs, Feedback, Contribution and more)

For questions about our BERT models, just open an issue
[here](https://github.com/dbmdz/berts/issues/new) 🤗

# Acknowledgments

Thanks to [Kemal Oflazer](http://www.andrew.cmu.edu/user/ko/) for providing us
with additional large corpora for Turkish. Many thanks to Reyyan Yeniterzi for
providing us with the Turkish NER dataset for evaluation.

Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC).
Thanks for providing access to the TFRC ❤️

Thanks to the generous support from the [Hugging Face](https://huggingface.co/) team,
it is possible to download both cased and uncased models from their S3 storage 🤗